GROUP BY 在日常的 SQL 使用中佔了很大的比率,也很重要。但也常常有效能的問題。這裡就從 SELECT Clause 與 GROUP BY 的關聯來看看,與你理解的有何不同 ?
SQL 1 2 3 4 5 6 7 8 9 select 'Hello Tainan' as msg, 1 as num , deptno, get_dept_name(deptno) dname, (select count (*) from emp) as total, count (*) as cnt from emp group by deptno /
Result MSG NUM DEPTNO DNAME TOTAL CNT ------------ ----- ------- ------------ ---------- ---------- Hello Tainan 1 30 SALES 19 5 Hello Tainan 1 20 RESEARCH 19 5 Hello Tainan 1 70 資訊部 19 3 Hello Tainan 1 40 OPERATIONS 19 2 Hello Tainan 1 10 會計部 19 4
要記得,Oracle Database 10g 以後 GROUP BY 不會自動排序。
上面這段 SQL 你理解了多少? 或與你理解的有何不同 ?
除了 Aggregate function COUNT( ) 外,並不是所有 SELECT clause 上的項目都必須出現在 GROUP BY 裡。這是因為 SELECT clause 是在執行完 GROUP BY 後,才決定其值 。 除了 deptno 外,其他欄位與在執行 GROUP BY 時都不相干。 也就是說 msg、num、dname、total 與資料集 emp 毫無關係 ,再如何變動 GROUP BY,這些值都不會變。
以下條列的都有這些特性。
1. Constants (msg 與 num)2. Scalar values returned by user-defined functions (dname)3. Analytic Functions (標準的 ISO SQL 稱為 Window Functions) 4. Non-correlated scalar subqueries (total)
了解了這些特性,來看看這段 SQL。
SQL 1 2 3 4 5 6 7 8 9 select deptno, count (*) cnt, case when count (*) not between 4 and 8 then '<<<===' else null end as mark from emp group by deptno order by deptno /
Result DEPTNO CNT MARK ------- ---------- ------ 10 4 20 5 30 5 40 2 <<<=== 70 3 <<<===
CASE 幾乎可出現在任何地方,也很有彈性,這裡與 GROUP BY 欄位無關,所以可出現在 SELECT 中。注意 CASE 的執行時間點是在 GROUP BY 之後 。
出現在 GROUP BY 的項目也不一定要出現在 SELECT Clause。
SQL 1 2 3 4 select count (*) from emp group by deptno /
Result COUNT(*) ---------- 5 5 3 ...
SQL 1 2 3 4 select count (*) from emp group by deptno, job /
Result COUNT(*) ---------- 1 2 3 ...
來看看常見的 GROUP BY 的應用,依前面所提的規則,還會有何不同寫法的可能性 ?
SQL 1 2 3 4 5 6 select e.deptno, d.dname, sum (sal) total from emp e, dept d where e.deptno = d.deptno group by e.deptno, d.dname /
Query Execution Plan ---------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ---------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 19 | 380 | 7 (15)| 00:00:01 | | 1 | HASH GROUP BY | | 19 | 380 | 7 (15)| 00:00:01 | |* 2 | HASH JOIN | | 19 | 380 | 6 (0)| 00:00:01 | | 3 | TABLE ACCESS FULL| DEPT | 6 | 78 | 3 (0)| 00:00:01 | | 4 | TABLE ACCESS FULL| EMP | 19 | 133 | 3 (0)| 00:00:01 | ----------------------------------------------------------------------------
這是非常一般的應用,因為需要部門名稱,所以 Join 了 DEPT。Query Execution Plan 顯示 Rows 19,Cost 7。
這可以用 Scalar subquery 來改寫 dname 欄位。
SQL 1 2 3 4 5 6 select e.deptno, (select dname from dept d where e.deptno = d.deptno) dname, sum (sal) total from emp e group by e.deptno /
Query Execution Plan -------------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | -------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 5 | 35 | 4 (25)| 00:00:01 | | 1 | TABLE ACCESS BY INDEX ROWID| DEPT | 1 | 13 | 1 (0)| 00:00:01 | |* 2 | INDEX UNIQUE SCAN | SYS_C0036617 | 1 | | 0 (0)| 00:00:01 | | 3 | HASH GROUP BY | | 5 | 35 | 4 (25)| 00:00:01 | | 4 | TABLE ACCESS FULL | EMP | 19 | 133 | 3 (0)| 00:00:01 | --------------------------------------------------------------------------------------------
將 dname 改用 Scalar Subquery,這樣寫,如果 emp 有百萬筆,會不會有 Loop 的問題 ?
從 Query Execution Plan 中注意它的執行時間點(Id 1 與 2) 在 GROUP BY 之後(Id 3 與 4),就會比較清楚。也注意 Rows 與 Cost。
效能似乎變得較好,但要注意的是,上面兩段 SQL 並不完全相同,第二段有 OUTER JOIN 的結果。
使用 GROUP BY 一個常見的錯誤就是 HAVING 的誤用:
SQL 1 2 3 4 5 6 7 8 select deptno, trunc(hiredate, 'Y' ) year , sum (sal) from emp group by deptno, trunc(hiredate, 'Y' ) having trunc(hiredate, 'Y' ) > to_date ('1981' , 'yyyy' ) order by 1 , 2 /
Query Execution Plan ----------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ----------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | 15 | 5 (40)| 00:00:01 | | 1 | SORT ORDER BY | | 1 | 15 | 5 (40)| 00:00:01 | |* 2 | FILTER | | | | | | | 3 | HASH GROUP BY | | 1 | 15 | 5 (40)| 00:00:01 | | 4 | TABLE ACCESS FULL| EMP | 19 | 285 | 3 (0)| 00:00:01 | -----------------------------------------------------------------------------
從 Query Execution Plan 中可以比較清楚的看出來,看它 FILTER(Id 2) 的位置。Filter 不應出現在 GROUP BY 後面。這裡可以將 having 的條件直接移到 where 語句中。不是 Aggregate Function 不應該出現在 HAVING 的地方。
SQL 1 2 3 4 5 6 7 8 select deptno, trunc(hiredate, 'Y' ) year , sum (sal) from emp where trunc(hiredate, 'Y' ) > to_date ('1981' , 'yyyy' ) group by deptno, trunc(hiredate, 'Y' ) order by 1 , 2 /
Query Execution Plan ---------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ---------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | 15 | 5 (40)| 00:00:01 | | 1 | SORT ORDER BY | | 1 | 15 | 5 (40)| 00:00:01 | | 2 | HASH GROUP BY | | 1 | 15 | 5 (40)| 00:00:01 | |* 3 | TABLE ACCESS FULL| EMP | 1 | 15 | 3 (0)| 00:00:01 | ----------------------------------------------------------------------------
Filter(Id 3) 應該在 GROUP BY 之前,這種誤用幾乎很難發覺,但當資料越來越大時效能就有很大的不同。
再看最後一個類似的範例。
SQL 1 2 3 4 5 6 select deptno from emp where trunc(hiredate, 'Y' ) >= to_date ('1981' , 'yyyy' ) group by deptno having sum (coalesce (comm, 0 )) > 0 /
Query Execution Plan ---------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ---------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | 13 | 4 (25)| 00:00:01 | |* 1 | FILTER | | | | | | | 2 | HASH GROUP BY | | 1 | 13 | 4 (25)| 00:00:01 | |* 3 | TABLE ACCESS FULL| EMP | 1 | 13 | 3 (0)| 00:00:01 | ----------------------------------------------------------------------------
乍看之下,這似乎很合理,但仔細看一下,sum( ) 只在判斷 comm 是否大於 0,直接就可用在 where。
SQL 1 2 3 4 5 6 select deptno from emp where trunc(hiredate, 'Y' ) >= to_date ('1981' , 'yyyy' ) and coalesce (comm, 0 ) > 0 group by deptno /
Query Execution Plan --------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | --------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | 13 | 4 (25)| 00:00:01 | | 1 | HASH GROUP BY | | 1 | 13 | 4 (25)| 00:00:01 | |* 2 | TABLE ACCESS FULL| EMP | 1 | 13 | 3 (0)| 00:00:01 | ---------------------------------------------------------------------------
comm 大於 0,直接就可用在 where。這也是 HAVING 常被誤用的地方。
上面的 SQL 與使用 DISTINCT 沒甚麼不同,但 Query Execution Plan (HASH UNIQUE)卻不太相同。
SQL 1 2 3 4 5 select distinct deptno from emp where trunc(hiredate, 'Y' ) >= to_date ('1981' , 'yyyy' ) and coalesce (comm, 0 ) > 0 /
Query Execution Plan --------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | --------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | 13 | 4 (25)| 00:00:01 | | 1 | HASH UNIQUE | | 1 | 13 | 4 (25)| 00:00:01 | |* 2 | TABLE ACCESS FULL| EMP | 1 | 13 | 3 (0)| 00:00:01 | ---------------------------------------------------------------------------
遇到有 GROUP BY 的地方要謹慎地思考一下,效能會有很大的不同,最好養成查看 Query Execution Plan 的習慣。