Optimizing count(distinct) in PostgreSQL

阿新 • Published: 2019-01-04

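Basic information

Background
The table holds 8 million rows. Counting the distinct case IDs (about 1.3 million) in a result set of about 2.6 million rows takes more than 20 seconds.
Original SQL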

select count(distinct  c_bh_aj) as ajcount 
    from db_znspgl.t_zlglpt_wt 
    where d_cjrq between '20160913' and '20170909';

Table definition and row count

znspgl=# \d+ db_znspgl.t_zlglpt_wt
                            Table "db_znspgl.t_zlglpt_wt"
 Column  |          Type          | Modifiers | Storage  | Stats target | Description 
---------+------------------------+-----------+----------+--------------+-------------
 c_bh    | character(32)          | not null  | extended |              | ID
 c_bh_aj | character(32)          |           | extended |              | Case ID
 n_ajbs  | numeric(15,0)          |           | main     |              | Case flag
 c_zjgz  | character varying(600) |           | extended |              | Quality-check rule
 c_zjxm  | character varying(300) |           | extended |              | Quality-check item
 d_cjrq  | date                   |           | plain    |              | Creation date
Indexes:
    "pk_zlglpt_wt" PRIMARY KEY, btree (c_bh)
    "i_t_zlglpt_wt_ajbs" btree (n_ajbs)
    "i_t_zlglpt_wt_bh_aj" btree (c_bh_aj)
    "i_t_zlglpt_wt_cjrq" btree (d_cjrq)


znspgl=# select count(*) from db_znspgl.t_zlglpt_wt
znspgl-# ;
  count  
---------
 8000000
(1 row)

Database version

znspgl=# select version();
                                                                 version
--------------------------------------------------------------------------------------------
 PostgreSQL 9.5.5 (ArteryBase 3.5.3, Thunisoft). on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-17), 64-bit
(1 row)

Execution plan

znspgl=# explain analyze select count(distinct  c_bh_aj) as ajcount from db_znspgl.t_zlglpt_wt where d_cjrq between '20160913' and '20170909';
                                                                     QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=313357.40..313357.41 rows=1 width=33) (actual time=23478.562..23478.563 rows=1 loops=1)
   ->  Bitmap Heap Scan on t_zlglpt_wt  (cost=55811.21..306782.09 rows=2630125 width=33) (actual time=366.909..3946.452 rows=2644330 loops=1)
         Recheck Cond: ((d_cjrq >= '2016-09-13'::date) AND (d_cjrq <= '2017-09-09'::date))
         Rows Removed by Index Recheck: 2670504
         Heap Blocks: exact=105741 lossy=105694
         ->  Bitmap Index Scan on i_t_zlglpt_wt_cjrq  (cost=0.00..55153.68 rows=2630125 width=0) (actual time=341.468..341.468 rows=2644330 loops=1)
               Index Cond: ((d_cjrq >= '2016-09-13'::date) AND (d_cjrq <= '2017-09-09'::date))
 Planning time: 0.143 ms
 Execution time: 23478.624 ms
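
The `Heap Blocks: exact=105741 lossy=105694` line shows that the bitmap did not fit in `work_mem` and fell back to page-level (lossy) entries, which is why 2,670,504 rows had to be discarded by the recheck. A minimal sketch of how one might test a larger setting for the session. The value `256MB` is only an illustrative assumption, not a figure from the original test, and this addresses the scan cost only, not the time spent in the Aggregate node:

```sql
-- Assumption: 256MB is an arbitrary illustrative value; size it to the server's memory.
SET work_mem = '256MB';

-- Re-run the plan and check whether "lossy" heap blocks and
-- "Rows Removed by Index Recheck" disappear.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(DISTINCT c_bh_aj) AS ajcount
FROM db_znspgl.t_zlglpt_wt
WHERE d_cjrq BETWEEN '20160913' AND '20170909';

-- Restore the session default afterwards.
RESET work_mem;
```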

Trying a covering index

Add the index

create index i_zlglpt_wt_zh01 on db_znspgl.t_zlglpt_wt (d_cjrq,c_bh_aj);

Check the execution plan again

znspgl=# explain analyze select count(distinct  c_bh_aj) as ajcount from db_znspgl.t_zlglpt_wt where d_cjrq between '20160913' and '20170909';
                                                                          QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=134006.11..134006.12 rows=1 width=33) (actual time=21696.556..21696.557 rows=1 loops=1)
   ->  Index Only Scan using i_zlglpt_wt_zh01 on t_zlglpt_wt  (cost=0.56..127480.16 rows=2610380 width=33) (actual time=0.055..2684.807 rows=2644330 loops=1)
         Index Cond: ((d_cjrq >= '2016-09-13'::date) AND (d_cjrq <= '2017-09-09'::date))
         Heap Fetches: 0
 Planning time: 0.318 ms
 Execution time: 21696.604 ms
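
`Heap Fetches: 0` means every page the scan touched was marked all-visible in the visibility map, so the index-only scan never had to visit the heap. On a table that still receives writes, that only holds after a recent vacuum. A short sketch, assuming there is a suitable maintenance window for it:

```sql
-- Refresh the visibility map (and planner statistics) so that
-- index-only scans can keep skipping heap fetches.
VACUUM (ANALYZE) db_znspgl.t_zlglpt_wt;
```

If `Heap Fetches` climbs well above zero in later plans, a stale visibility map is one likely cause.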

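Observations
1. The query is barely any faster.
2. Most of the time is spent in the Aggregate node: the index-only scan finishes at about 2,685 ms, but the Aggregate does not finish until 21,696 ms.
3. In theory, a count(distinct) over roughly 2.6 million rows should not take 19 seconds, especially since c_bh_aj is indexed and should therefore be available in sorted order.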

Pseudo loose index scan

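I came across a post online, 《分析MySQL中優化distinct的技巧》 (Techniques for optimizing DISTINCT in MySQL). It explains that count(distinct) is slow because scanning the values visits many duplicate entries, and that a loose index scan can skip those duplicates (provided the distinct column is ordered). MySQL and abase do not support loose index scans natively (Oracle does), but the same effect can be achieved by rewriting the SQL.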
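
For reference, a loose index scan in the strict sense jumps from one distinct key to the next instead of reading every index entry. In PostgreSQL this is commonly emulated with a recursive CTE. The sketch below is not the rewrite used in this article; it assumes an index whose leading column is `c_bh_aj`, such as a composite index on `(c_bh_aj, d_cjrq)`:

```sql
-- Emulated loose index scan: each recursion step seeks to the next
-- larger c_bh_aj that has at least one row in the date range.
WITH RECURSIVE t AS (
    (SELECT c_bh_aj
       FROM db_znspgl.t_zlglpt_wt
      WHERE d_cjrq BETWEEN '20160913' AND '20170909'
      ORDER BY c_bh_aj
      LIMIT 1)                         -- parentheses are required here
    UNION ALL
    SELECT (SELECT w.c_bh_aj
              FROM db_znspgl.t_zlglpt_wt w
             WHERE w.c_bh_aj > t.c_bh_aj
               AND w.d_cjrq BETWEEN '20160913' AND '20170909'
             ORDER BY w.c_bh_aj
             LIMIT 1)
      FROM t
     WHERE t.c_bh_aj IS NOT NULL       -- stop once no larger key exists
)
SELECT count(c_bh_aj) AS ajcount       -- count(col) skips the terminating NULL row
  FROM t;
```

This form pays off when each distinct value has many rows. With about 1.3 million distinct values among 2.6 million rows, as here, it performs over a million separate index probes and would probably be slower than a plain ordered scan, so treat it as an illustration of the idea rather than a recommendation for this table.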

Recreate the index

drop index db_znspgl.i_zlglpt_wt_zh01;
create index i_zlglpt_wt_zh01 on db_znspgl.t_zlglpt_wt (c_bh_aj,d_cjrq);

Rewrite the SQL

select count(*) from  (
   select distinct(c_bh_aj)  
       from db_znspgl.t_zlglpt_wt 
       where d_cjrq between '20160913' and '20170909' 
   ) t;
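
One caveat with this rewrite: `count(distinct c_bh_aj)` ignores NULLs, whereas `select distinct` returns NULL as a row of its own, so `count(*)` over the subquery comes out one higher whenever the date range contains a row with a NULL `c_bh_aj` (the column is nullable according to the table definition). A sketch of a variant that keeps the original semantics, assuming NULL case IDs should not be counted:

```sql
SELECT count(*) AS ajcount
FROM (
    SELECT DISTINCT c_bh_aj
    FROM db_znspgl.t_zlglpt_wt
    WHERE d_cjrq BETWEEN '20160913' AND '20170909'
      AND c_bh_aj IS NOT NULL   -- match count(distinct ...), which skips NULLs
) t;
```

Replacing the outer `count(*)` with `count(c_bh_aj)` would have the same effect without the extra filter.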

Check the execution plan

znspgl=# explain analyze select count(*) from  (select distinct(c_bh_aj)  from db_znspgl.t_zlglpt_wt where d_cjrq between '20160913' and '20170909' ) t;
                                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=347567.23..347567.24 rows=1 width=0) (actual time=6954.845..6954.846 rows=1 loops=1)
   ->  Unique  (cost=0.56..343310.31 rows=340554 width=33) (actual time=0.034..5969.209 rows=1322165 loops=1)
         ->  Index Only Scan using i_zlglpt_wt_zh01 on t_zlglpt_wt  (cost=0.56..336784.36 rows=2610380 width=33) (actual time=0.031..2840.502 rows=2644330 loops=1)
               Index Cond: ((d_cjrq >= '2016-09-13'::date) AND (d_cjrq <= '2017-09-09'::date))
               Heap Fetches: 0
 Planning time: 0.172 ms
 Execution time: 6954.890 ms
(7 rows)

Measure the SQL execution time with \timing

znspgl=# \timing on
Timing is on.
znspgl=#  select count(*) from  (select distinct(c_bh_aj)  from db_znspgl.t_zlglpt_wt where d_cjrq between '20160913' and '20170909' ) t;
  count  
---------
 1322165
(1 row)

Time: 1322.715 ms
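
The gap between the 6,954 ms reported by explain analyze and the 1,322 ms measured here is probably due in part to instrumentation: explain analyze reads the clock for every row passing through every plan node, and caching between runs may also play a role. Note too that the 20-second baseline was itself measured under explain analyze. A sketch of a more even comparison, assuming nothing beyond the objects already shown above:

```sql
-- Row counts per node without per-row clock reads
-- (the TIMING option is available from PostgreSQL 9.2).
EXPLAIN (ANALYZE, TIMING OFF)
SELECT count(*) FROM (
    SELECT DISTINCT c_bh_aj
    FROM db_znspgl.t_zlglpt_wt
    WHERE d_cjrq BETWEEN '20160913' AND '20170909'
) t;

-- With \timing still on, run the original query as well
-- to get a like-for-like baseline.
SELECT count(DISTINCT c_bh_aj) AS ajcount
FROM db_znspgl.t_zlglpt_wt
WHERE d_cjrq BETWEEN '20160913' AND '20170909';
```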

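Summary

Rewriting the SQL as a pseudo loose index scan effectively speeds up count(distinct): in this case the execution time dropped from more than 20 seconds to about 7 seconds under explain analyze, and to about 1.3 seconds when measured with \timing.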
