0%

SQL GROUP BY

GROUP BY 在日常的 SQL 使用中佔了很大的比率,也很重要。但也常常有效能的問題。這裡就從 SELECT Clause 與 GROUP BY 的關聯來看看,與你理解的有何不同 ?

SQL
1
2
3
4
5
6
7
8
9
select 'Hello Tainan' as msg,
1 as num,
deptno,
get_dept_name(deptno) dname,
(select count(*) from emp) as total,
count(*) as cnt
from emp
group by deptno
/
Result
MSG            NUM  DEPTNO DNAME             TOTAL        CNT
------------ ----- ------- ------------ ---------- ----------
Hello Tainan 1 30 SALES 19 5
Hello Tainan 1 20 RESEARCH 19 5
Hello Tainan 1 70 資訊部 19 3
Hello Tainan 1 40 OPERATIONS 19 2
Hello Tainan 1 10 會計部 19 4

要記得,Oracle Database 10g 以後 GROUP BY 不會自動排序。

上面這段 SQL 你理解了多少? 或與你理解的有何不同 ?

除了 Aggregate function COUNT( ) 外,並不是所有 SELECT clause 上的項目都必須出現在 GROUP BY 裡。這是因為 SELECT clause 是在執行完 GROUP BY 後,才決定其值。 除了 deptno 外,其他欄位與在執行 GROUP BY 時都不相干。 也就是說 msg、num、dname、total 與資料集 emp 毫無關係,再如何變動 GROUP BY,這些值都不會變。

以下條列的都有這些特性。

1. Constants (msg 與 num)
2. Scalar values returned by user-defined functions (dname)
3. Analytic Functions (標準的 ISO SQL 稱為 Window Functions)
4. Non-correlated scalar subqueries (total)

了解了這些特性,來看看這段 SQL。

SQL
1
2
3
4
5
6
7
8
9
select deptno, count(*) cnt,
case when count(*) not between 4 and 8
then '<<<==='
else null
end as mark
from emp
group by deptno
order by deptno
/
Result
 DEPTNO        CNT MARK
------- ---------- ------
10 4
20 5
30 5
40 2 <<<===
70 3 <<<===

CASE 幾乎可出現在任何地方,也很有彈性,這裡與 GROUP BY 欄位無關,所以可出現在 SELECT 中。注意 CASE 的執行時間點是在 GROUP BY 之後

出現在 GROUP BY 的項目也不一定要出現在 SELECT Clause。

SQL
1
2
3
4
select count(*)
from emp
group by deptno
/
Result
  COUNT(*)
----------
5
5
3
...
SQL
1
2
3
4
select count(*)
from emp
group by deptno, job
/
Result
  COUNT(*)
----------
1
2
3
...

來看看常見的 GROUP BY 的應用,依前面所提的規則,還會有何不同寫法的可能性 ?

SQL
1
2
3
4
5
6
select e.deptno, d.dname, sum(sal) total
from emp e,
dept d
where e.deptno = d.deptno
group by e.deptno, d.dname
/
Query Execution Plan
----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 19 | 380 | 7 (15)| 00:00:01 |
| 1 | HASH GROUP BY | | 19 | 380 | 7 (15)| 00:00:01 |
|* 2 | HASH JOIN | | 19 | 380 | 6 (0)| 00:00:01 |
| 3 | TABLE ACCESS FULL| DEPT | 6 | 78 | 3 (0)| 00:00:01 |
| 4 | TABLE ACCESS FULL| EMP | 19 | 133 | 3 (0)| 00:00:01 |
----------------------------------------------------------------------------

這是非常一般的應用,因為需要部門名稱,所以 Join 了 DEPT。Query Execution Plan 顯示 Rows 19,Cost 7。

這可以用 Scalar subquery 來改寫 dname 欄位。

SQL
1
2
3
4
5
6
select e.deptno,
(select dname from dept d where e.deptno = d.deptno) dname,
sum(sal) total
from emp e
group by e.deptno
/
Query Execution Plan
--------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 5 | 35 | 4 (25)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID| DEPT | 1 | 13 | 1 (0)| 00:00:01 |
|* 2 | INDEX UNIQUE SCAN | SYS_C0036617 | 1 | | 0 (0)| 00:00:01 |
| 3 | HASH GROUP BY | | 5 | 35 | 4 (25)| 00:00:01 |
| 4 | TABLE ACCESS FULL | EMP | 19 | 133 | 3 (0)| 00:00:01 |
--------------------------------------------------------------------------------------------

將 dname 改用 Scalar Subquery,這樣寫,如果 emp 有百萬筆,會不會有 Loop 的問題 ?

從 Query Execution Plan 中注意它的執行時間點(Id 1 與 2) 在 GROUP BY 之後(Id 3 與 4),就會比較清楚。也注意 Rows 與 Cost。

效能似乎變得較好,但要注意的是,上面兩段 SQL 並不完全相同,第二段有 OUTER JOIN 的結果。

使用 GROUP BY 一個常見的錯誤就是 HAVING 的誤用:

SQL
1
2
3
4
5
6
7
8
select deptno,
trunc(hiredate, 'Y') year,
sum(sal)
from emp
group by deptno, trunc(hiredate, 'Y')
having trunc(hiredate, 'Y') > to_date('1981', 'yyyy')
order by 1, 2
/
Query Execution Plan
-----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 15 | 5 (40)| 00:00:01 |
| 1 | SORT ORDER BY | | 1 | 15 | 5 (40)| 00:00:01 |
|* 2 | FILTER | | | | | |
| 3 | HASH GROUP BY | | 1 | 15 | 5 (40)| 00:00:01 |
| 4 | TABLE ACCESS FULL| EMP | 19 | 285 | 3 (0)| 00:00:01 |
-----------------------------------------------------------------------------

從 Query Execution Plan 中可以比較清楚的看出來,看它 FILTER(Id 2) 的位置。Filter 不應出現在 GROUP BY 後面。這裡可以將 having 的條件直接移到 where 語句中。不是 Aggregate Function 不應該出現在 HAVING 的地方。

SQL
1
2
3
4
5
6
7
8
select deptno,
trunc(hiredate, 'Y') year,
sum(sal)
from emp
where trunc(hiredate, 'Y') > to_date('1981', 'yyyy')
group by deptno, trunc(hiredate, 'Y')
order by 1, 2
/
Query Execution Plan
----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 15 | 5 (40)| 00:00:01 |
| 1 | SORT ORDER BY | | 1 | 15 | 5 (40)| 00:00:01 |
| 2 | HASH GROUP BY | | 1 | 15 | 5 (40)| 00:00:01 |
|* 3 | TABLE ACCESS FULL| EMP | 1 | 15 | 3 (0)| 00:00:01 |
----------------------------------------------------------------------------

Filter(Id 3) 應該在 GROUP BY 之前,這種誤用幾乎很難發覺,但當資料越來越大時效能就有很大的不同。

再看最後一個類似的範例。

SQL
1
2
3
4
5
6
select deptno
from emp
where trunc(hiredate, 'Y') >= to_date('1981', 'yyyy')
group by deptno
having sum(coalesce(comm, 0)) > 0
/
Query Execution Plan
----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 13 | 4 (25)| 00:00:01 |
|* 1 | FILTER | | | | | |
| 2 | HASH GROUP BY | | 1 | 13 | 4 (25)| 00:00:01 |
|* 3 | TABLE ACCESS FULL| EMP | 1 | 13 | 3 (0)| 00:00:01 |
----------------------------------------------------------------------------

乍看之下,這似乎很合理,但仔細看一下,sum( ) 只在判斷 comm 是否大於 0,直接就可用在 where。

SQL
1
2
3
4
5
6
select deptno
from emp
where trunc(hiredate, 'Y') >= to_date('1981', 'yyyy')
and coalesce(comm, 0) > 0
group by deptno
/
Query Execution Plan
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 13 | 4 (25)| 00:00:01 |
| 1 | HASH GROUP BY | | 1 | 13 | 4 (25)| 00:00:01 |
|* 2 | TABLE ACCESS FULL| EMP | 1 | 13 | 3 (0)| 00:00:01 |
---------------------------------------------------------------------------

comm 大於 0,直接就可用在 where。這也是 HAVING 常被誤用的地方。

上面的 SQL 與使用 DISTINCT 沒甚麼不同,但 Query Execution Plan (HASH UNIQUE)卻不太相同。

SQL
1
2
3
4
5
select distinct deptno
from emp
where trunc(hiredate, 'Y') >= to_date('1981', 'yyyy')
and coalesce(comm, 0) > 0
/
Query Execution Plan
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 13 | 4 (25)| 00:00:01 |
| 1 | HASH UNIQUE | | 1 | 13 | 4 (25)| 00:00:01 |
|* 2 | TABLE ACCESS FULL| EMP | 1 | 13 | 3 (0)| 00:00:01 |
---------------------------------------------------------------------------

遇到有 GROUP BY 的地方要謹慎地思考一下,效能會有很大的不同,最好養成查看 Query Execution Plan 的習慣。