Optimizing count(distinct) in PostgreSQL
阿新 • Published: 2019-01-04
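Basic information
- Overview
The table holds 8 million rows. Counting the distinct case numbers (about 1.3 million) in a result set of about 2.6 million rows takes more than 20 seconds.
- Original SQL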
select count(distinct c_bh_aj) as ajcount
from db_znspgl.t_zlglpt_wt
where d_cjrq between '20160913' and '20170909';
- Table definition and row count
znspgl=# \d+ db_znspgl.t_zlglpt_wt
                          Table "db_znspgl.t_zlglpt_wt"
 Column  |          Type          | Modifiers | Storage  | Stats target | Description
---------+------------------------+-----------+----------+--------------+-------------
 c_bh    | character(32)          | not null  | extended |              | ID
 c_bh_aj | character(32)          |           | extended |              | Case number
 n_ajbs  | numeric(15,0)          |           | main     |              | Case identifier
 c_zjgz  | character varying(600) |           | extended |              | Quality inspection rule
 c_zjxm  | character varying(300) |           | extended |              | Quality inspection item
 d_cjrq  | date                   |           | plain    |              | Creation date
Indexes:
    "pk_zlglpt_wt" PRIMARY KEY, btree (c_bh)
    "i_t_zlglpt_wt_ajbs" btree (n_ajbs)
    "i_t_zlglpt_wt_bh_aj" btree (c_bh_aj)
    "i_t_zlglpt_wt_cjrq" btree (d_cjrq)

znspgl=# select count(*) from db_znspgl.t_zlglpt_wt
znspgl-# ;
  count
---------
 8000000
(1 row)
- Database version
znspgl=# select version();
                                          version
--------------------------------------------------------------------------------------------
 PostgreSQL 9.5.5 (ArteryBase 3.5.3, Thunisoft). on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-17), 64-bit
(1 row)
- Execution plan
znspgl=# explain analyze select count(distinct c_bh_aj) as ajcount from db_znspgl.t_zlglpt_wt where d_cjrq between '20160913' and '20170909';
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=313357.40..313357.41 rows=1 width=33) (actual time=23478.562..23478.563 rows=1 loops=1)
   ->  Bitmap Heap Scan on t_zlglpt_wt  (cost=55811.21..306782.09 rows=2630125 width=33) (actual time=366.909..3946.452 rows=2644330 loops=1)
         Recheck Cond: ((d_cjrq >= '2016-09-13'::date) AND (d_cjrq <= '2017-09-09'::date))
         Rows Removed by Index Recheck: 2670504
         Heap Blocks: exact=105741 lossy=105694
         ->  Bitmap Index Scan on i_t_zlglpt_wt_cjrq  (cost=0.00..55153.68 rows=2630125 width=0) (actual time=341.468..341.468 rows=2644330 loops=1)
               Index Cond: ((d_cjrq >= '2016-09-13'::date) AND (d_cjrq <= '2017-09-09'::date))
 Planning time: 0.143 ms
 Execution time: 23478.624 ms
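The plan above reports "Heap Blocks: exact=105741 lossy=105694" together with about 2.67 million rows removed by the index recheck. This usually means the bitmap did not fit in work_mem and fell back to page-level entries. A minimal sketch of how one might test this, assuming a session-level change is acceptable; the 64MB value is an illustrative guess, not a setting from the original environment:

```sql
-- Inspect the current per-operation memory budget.
show work_mem;

-- Assumption: 64MB is only an illustrative value; tune it for the real server.
set work_mem = '64MB';

-- Re-run the original query and check whether the "lossy" heap blocks
-- disappear from the Bitmap Heap Scan node.
explain (analyze, buffers)
select count(distinct c_bh_aj) as ajcount
from db_znspgl.t_zlglpt_wt
where d_cjrq between '20160913' and '20170909';

-- Restore the configured default for the rest of the session.
reset work_mem;
```

This only targets the scan, which accounts for roughly 4 of the 23 seconds; the Aggregate node remains the dominant cost.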
Trying a covering index
- Add the index
create index i_zlglpt_wt_zh01 on db_znspgl.t_zlglpt_wt (d_cjrq,c_bh_aj);
- Check the execution plan again
znspgl=# explain analyze select count(distinct c_bh_aj) as ajcount from db_znspgl.t_zlglpt_wt where d_cjrq between '20160913' and '20170909';
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=134006.11..134006.12 rows=1 width=33) (actual time=21696.556..21696.557 rows=1 loops=1)
   ->  Index Only Scan using i_zlglpt_wt_zh01 on t_zlglpt_wt  (cost=0.56..127480.16 rows=2610380 width=33) (actual time=0.055..2684.807 rows=2644330 loops=1)
         Index Cond: ((d_cjrq >= '2016-09-13'::date) AND (d_cjrq <= '2017-09-09'::date))
         Heap Fetches: 0
 Planning time: 0.318 ms
 Execution time: 21696.604 ms
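- Observations
1. The query is barely faster.
2. Almost all of the time is spent in the Aggregate node: the index-only scan finishes at about 2,685 ms, but the Aggregate does not finish until 21,696 ms.
3. In theory, a count(distinct) over roughly 2.6 million rows should not take 19 seconds, especially since c_bh_aj is indexed and therefore sorted (although with d_cjrq as the leading index column, c_bh_aj is sorted only within each date).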
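In this PostgreSQL version, count(distinct ...) removes duplicates by sorting its input inside the Aggregate node, regardless of any ordering the scan below it provides. Comparing character(32) values under a locale-aware collation can make that sort expensive. The following is a hedged experiment that the original post did not run. It assumes the case numbers need no linguistic ordering, so the comparison can use the "C" collation; if the database already sorts with "C", no gain should be expected.

```sql
-- Check which collation the database uses for text comparisons by default.
show lc_collate;

-- Assumption: byte-wise ordering is acceptable for c_bh_aj, so the sort
-- performed inside count(distinct) can compare with the "C" collation.
explain analyze
select count(distinct c_bh_aj collate "C") as ajcount
from db_znspgl.t_zlglpt_wt
where d_cjrq between '20160913' and '20170909';
```

The count itself should not change, because only the comparison order differs, not which values are considered equal.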
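Pseudo loose index scan
I came across an article online, 《分析MySQL中優化distinct的技巧》 (on techniques for optimizing distinct in MySQL). It explains that count(distinct) is slow because scanning the column visits many duplicate values, and that a loose index scan avoids these repeated visits (provided the distinct column is sorted). Neither abase nor the PostgreSQL it is based on has a native loose index scan (MySQL applies one only to some GROUP BY/DISTINCT queries, and Oracle has a comparable index skip scan), but rewriting the SQL can achieve a similar effect.
- Recreate the index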
drop index db_znspgl.i_zlglpt_wt_zh01;
create index i_zlglpt_wt_zh01 on db_znspgl.t_zlglpt_wt (c_bh_aj,d_cjrq);
- Rewrite the SQL
select count(*) from (
select distinct(c_bh_aj)
from db_znspgl.t_zlglpt_wt
where d_cjrq between '20160913' and '20170909'
) t;
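One difference to keep in mind: count(distinct c_bh_aj) ignores NULLs, whereas select distinct keeps NULL as one group. Because c_bh_aj is nullable in the table definition, the rewrite above can return a count that is one higher if any row in the date range has a NULL case number. A variant that preserves the original semantics might look like this (a sketch, not part of the original post):

```sql
-- Same rewrite, but NULL case numbers are filtered out so the result
-- matches count(distinct c_bh_aj) exactly.
select count(*) as ajcount
from (
    select distinct c_bh_aj
    from db_znspgl.t_zlglpt_wt
    where d_cjrq between '20160913' and '20170909'
      and c_bh_aj is not null
) t;
```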
- Check the execution plan
znspgl=# explain analyze select count(*) from (select distinct(c_bh_aj) from db_znspgl.t_zlglpt_wt where d_cjrq between '20160913' and '20170909' ) t;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=347567.23..347567.24 rows=1 width=0) (actual time=6954.845..6954.846 rows=1 loops=1)
   ->  Unique  (cost=0.56..343310.31 rows=340554 width=33) (actual time=0.034..5969.209 rows=1322165 loops=1)
         ->  Index Only Scan using i_zlglpt_wt_zh01 on t_zlglpt_wt  (cost=0.56..336784.36 rows=2610380 width=33) (actual time=0.031..2840.502 rows=2644330 loops=1)
               Index Cond: ((d_cjrq >= '2016-09-13'::date) AND (d_cjrq <= '2017-09-09'::date))
               Heap Fetches: 0
 Planning time: 0.172 ms
 Execution time: 6954.890 ms
(7 rows)
- Measure the SQL execution time with \timing
znspgl=# \timing on
Timing is on.
znspgl=# select count(*) from (select distinct(c_bh_aj) from db_znspgl.t_zlglpt_wt where d_cjrq between '20160913' and '20170909' ) t;
count
---------
1322165
(1 row)
Time: 1322.715 ms
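The rewritten query still reads every index entry in the date range; it saves time by avoiding the sort, not by skipping duplicates. For comparison, a genuine loose index scan is commonly emulated in PostgreSQL with a recursive CTE that jumps from one distinct value to the next. The sketch below adapts that pattern to this table as an assumption and was not run by the original post. It tends to pay off only when distinct values are few relative to rows, which is not the case here (about 1.3 million distinct values in 2.6 million rows), so it may well be slower on this data.

```sql
-- Emulated loose index scan: each step fetches the next larger c_bh_aj
-- through the (c_bh_aj, d_cjrq) index instead of reading every entry.
with recursive t as (
    (
        -- Anchor: the smallest case number in the date range.
        select c_bh_aj
        from db_znspgl.t_zlglpt_wt
        where d_cjrq between '20160913' and '20170909'
        order by c_bh_aj
        limit 1
    )
    union all
    -- Step: the next case number greater than the previous one;
    -- the subquery yields NULL when none is left, which ends the recursion.
    select (
        select w.c_bh_aj
        from db_znspgl.t_zlglpt_wt w
        where w.c_bh_aj > t.c_bh_aj
          and w.d_cjrq between '20160913' and '20170909'
        order by w.c_bh_aj
        limit 1
    )
    from t
    where t.c_bh_aj is not null
)
-- count(c_bh_aj) skips the terminating NULL row.
select count(c_bh_aj) as ajcount
from t;
```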
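Summary
Rewriting the SQL as a pseudo loose index scan effectively speeds up count(distinct): under explain analyze, execution time dropped from about 21.7 seconds to about 7.0 seconds. With c_bh_aj as the leading index column, the index-only scan returns the values already sorted, so the Unique node can drop duplicates as they stream by.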