PostgreSQL CREATE STATISTICS optimization
Creating statistics
Estimated row counts
PostgreSQL estimates single-column selectivity fairly accurately, but multi-column estimates are often wrong, because by default PG assumes attributes are independent and simply multiplies the selectivities of the individual columns. Extended statistics (CREATE STATISTICS) were introduced in PostgreSQL 10. When optimizing SQL, one of the most important questions is whether the statistics are accurate: inaccurate statistics cause the optimizer to misestimate row counts, which in turn affects the choice of scan method and join method.
--Create the table
postgres=# create table test2(n_id int,id1 int,id2 int);
CREATE TABLE
postgres=# insert into test2 select i,i/1000,i/10000 from generate_series(1,1000000) s(i);
INSERT 0 1000000
postgres=# analyze test2;
ANALYZE
postgres=# \x
Expanded display is on.
postgres=# select * from pg_stats where tablename = 'test2' and attname = 'id1';
-[ RECORD 1 ]----------+-------------------------------------------------------------
schemaname             | public
tablename              | test2
attname                | id1
inherited              | f
null_frac              | 0
avg_width              | 4
n_distinct             | 1000
most_common_vals       | {381,649,852,142,269,415,496,537,714,80,177,303,526,870,924}
most_common_freqs      | {0.0016,0.0016,0.0015666666,0.0015333333,0.0015333333...}
histogram_bounds       | {0,10,19,29,39,49,59,69,78,89,99,109,119,128,139,...}
correlation            | 1
most_common_elems      |
most_common_elem_freqs |
elem_count_histogram   |
--View the execution plan
postgres=# explain (analyse,buffers) select * from test2 where id1 = 1;
                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------
 Seq Scan on test2  (cost=0.00..17906.00 rows=992 width=12) (actual time=0.144..109.573 rows=1000 loops=1)
   Filter: (id1 = 1)
   Rows Removed by Filter: 999000
   Buffers: shared hit=5406
 Planning Time: 0.118 ms
 Execution Time: 109.697 ms
(6 rows)
The estimated row count is 992 against an actual 1000 rows returned: the estimate is very close to reality.
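The arithmetic behind that single-column estimate can be sketched in a few lines. This is a simplified model, not the planner's actual code; the real planner also consults the MCV list and histogram, which is why it arrives at 992 rather than exactly 1000:

```python
# Simplified single-column equality estimate: for "id1 = const" the
# planner falls back to roughly 1/n_distinct when the value is not
# among the most common values (a simplification of the real logic).
reltuples = 1_000_000   # rows in test2
n_distinct = 1000       # n_distinct for id1 from pg_stats

selectivity = 1.0 / n_distinct
estimated_rows = round(reltuples * selectivity)
print(estimated_rows)  # 1000, close to the planner's 992
```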
- What happens if we filter on both id1 and id2?
postgres=# explain (analyse,buffers) select * from test2 where id1 = 1 and id2= 0;
                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Seq Scan on test2  (cost=0.00..20406.00 rows=10 width=12) (actual time=0.153..138.057 rows=1000 loops=1)
   Filter: ((id1 = 1) AND (id2 = 0))
   Rows Removed by Filter: 999000
   Buffers: shared hit=5406
 Planning Time: 0.267 ms
 Execution Time: 138.184 ms
(6 rows)
The estimate is 10 but the actual count is 1000, a difference of 100x. Why does this happen?
The selectivity of the first column is about 0.001 (1/1000), and that of the second is 0.01 (1/100). To compute the number of rows passing these two "independent" conditions, the planner multiplies their selectivities. So we get:

selectivity = 0.001 * 0.01 = 0.00001

Multiplying this by the 1,000,000 rows in the table gives 10. That is where the planner's estimate of 10 comes from. But these columns are not independent, so how do we tell the planner?
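The independence assumption described above can be reproduced directly:

```python
# Multiplying per-column selectivities under the attribute-independence
# assumption, as the planner does without extended statistics.
reltuples = 1_000_000
sel_id1 = 1 / 1000   # selectivity of id1 = 1
sel_id2 = 1 / 100    # selectivity of id2 = 0

combined = sel_id1 * sel_id2        # 0.00001
print(round(reltuples * combined))  # 10, versus 1000 actual rows
```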
Functional dependencies
Starting with PostgreSQL 10, user-defined extended statistics are supported, so we can create multi-column statistics for exactly this situation. PostgreSQL 10 supports two kinds, multi-column correlation (functional dependencies) and multi-column distinct counts (ndistinct); PostgreSQL 12 added multi-column MCV lists, used later in this article.
Creating a functional dependency
Back to our estimation problem: the issue is that id2 is id1 / 10. In database terms, id2 is functionally dependent on id1, meaning the value of id1 is sufficient to determine the value of id2: no two rows have the same id1 but different id2. The second filter on id2 therefore does not actually remove any rows! By default, however, the planner does not capture statistics that would let it know this; extended statistics do.
--Create the statistics object
postgres=# create statistics s1(dependencies) on id1,id2 from test2;
CREATE STATISTICS
--analyze
postgres=# analyze test2;
ANALYZE
--View the execution plan
postgres=# explain (analyse,buffers) select * from test2 where id1 = 1 and id2 = 0;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Seq Scan on test2  (cost=0.00..20406.00 rows=997 width=12) (actual time=0.159..124.450 rows=1000 loops=1)
Filter: ((id1 = 1) AND (id2 = 0))
Rows Removed by Filter: 999000
Buffers: shared hit=5406
Planning Time: 0.364 ms
Execution Time: 124.592 ms
(6 rows)
--Inspect the statistics
postgres=# SELECT stxname,stxkeys,extdat.stxddependencies FROM pg_statistic_ext ext join pg_statistic_ext_data extdat on ext.oid = extdat.stxoid;
stxname | stxkeys | stxddependencies
---------+---------+----------------------
s1 | 2 3 | {"2 => 3": 1.000000}
(1 row)
--The "2 3" in stxkeys refers to the table's second and third columns (id1 and id2)
--Here we can see that Postgres has realized id1 completely determines id2, and records a coefficient of 1.000000 to capture that. From now on, all queries filtering on these two columns will get much better estimates.
postgres=# select statistics_name,attnames,dependencies from pg_stats_ext;
statistics_name | attnames | dependencies
-----------------+-----------+----------------------
s1 | {id1,id2} | {"2 => 3": 1.000000}
(1 row)
Put simply: if two rows have the same id1, their id2 must be the same as well. (For example, of the data inserted earlier, 10000 rows have id2 = 0; across those rows id1 takes the values 0-9 with roughly 1000 rows each, and any two rows with equal id1 also have equal id2.)
Without functional-dependency statistics, the planner assumes the two WHERE conditions are independent and multiplies their selectivities together, yielding a row estimate that is far too small. With the statistics in place, the planner recognizes that the second WHERE condition is redundant and no longer underestimates the row count.
Currently, functional dependencies are applied only for simple equality conditions that compare a column to a constant. They are not used to improve estimates for equality conditions between two columns or between a column and an expression, nor for range clauses, LIKE, or any other kind of condition.
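How the recorded dependency degree feeds into the estimate can be sketched as follows. The formula mirrors the one PostgreSQL uses for a single functional dependency, s = s1 * (f + (1 - f) * s2); the real implementation also handles chains of dependencies, so treat this as an illustration:

```python
def dependent_selectivity(s_implying, s_implied, degree):
    """Combined selectivity of (a = x AND b = y) given a => b with the
    recorded dependency degree. degree=0 reduces to the independence
    assumption; degree=1 ignores the implied clause entirely."""
    return s_implying * (degree + (1.0 - degree) * s_implied)

reltuples = 1_000_000
s_id1, s_id2 = 1 / 1000, 1 / 100

# degree 1.000000, as recorded in stxddependencies for "2 => 3"
print(round(reltuples * dependent_selectivity(s_id1, s_id2, 1.0)))  # 1000
# without the statistics (degree 0) we are back to the old estimate
print(round(reltuples * dependent_selectivity(s_id1, s_id2, 0.0)))  # 10
```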
ndistinct statistics
Single-column statistics store the number of distinct values in each column. When several columns are combined (for example GROUP BY a, b) and the planner has only single-column statistics, its estimate of the number of distinct combinations is often badly wrong, leading to poor plan choices.
--GROUP BY id1, id2 on the test2 table
postgres=# explain (analyse,buffers) select id1,id2,count(*) from test2 group by id1,id2;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=22906.00..23906.00 rows=100000 width=16) (actual time=473.444..474.544 rows=1001 loops=1)
Group Key: id1, id2
Buffers: shared hit=5406
-> Seq Scan on test2 (cost=0.00..15406.00 rows=1000000 width=8) (actual time=0.022..178.253 rows=1000000 loops=1)
Buffers: shared hit=5406
Planning Time: 1.202 ms
Execution Time: 479.178 ms
(7 rows)
The planner estimates 100000 rows, while the actual number of groups is only 1001.
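The 100000 comes from multiplying the per-column distinct counts, the only information available without extended statistics (a sketch of the reasoning, with the planner's clamping to the table size included for completeness):

```python
# Group-count estimate for GROUP BY id1, id2 with single-column stats:
# the planner can only multiply the per-column distinct counts.
reltuples = 1_000_000
nd_id1, nd_id2 = 1000, 100   # n_distinct for id1 and id2

independent_estimate = min(nd_id1 * nd_id2, reltuples)
print(independent_estimate)  # 100000, versus 1001 actual groups
```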
--Create the ndistinct statistics
postgres=# create statistics s2(ndistinct) on id1,id2 from test2;
CREATE STATISTICS
postgres=# analyze test2;
ANALYZE
--The planner's estimate is now far more accurate: estimated 1000, actual 1001
postgres=# explain (analyse,buffers) select id1,id2,count(*) from test2 group by id1,id2;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=22906.00..22916.00 rows=1000 width=16) (actual time=442.839..443.160 rows=1001 loops=1)
Group Key: id1, id2
Buffers: shared hit=5406
-> Seq Scan on test2 (cost=0.00..15406.00 rows=1000000 width=8) (actual time=0.029..147.498 rows=1000000 loops=1)
Buffers: shared hit=5406
Planning Time: 0.364 ms
Execution Time: 444.362 ms
(7 rows)
--The statistics record an ndistinct value of 1000
postgres=# SELECT stxname,stxkeys,extdat.stxdndistinct FROM pg_statistic_ext ext join pg_statistic_ext_data extdat on ext.oid = extdat.stxoid where stxname = 's2';
stxname | stxkeys | stxdndistinct
---------+---------+----------------
s2 | 2 3 | {"2, 3": 1000}
(1 row)
Typical cases are month/quarter/year columns, or province/city/district columns that are grouped together in GROUP BY. It is advisable to create ndistinct statistics objects only for combinations of columns that are actually used for grouping, and for which misestimating the number of groups results in bad plans; otherwise the ANALYZE cycles are simply wasted.
Most common values (MCV)
A multi-column MCV list stores the most common combinations of values of the chosen columns together with their observed frequencies. As with functional dependencies, this lets the planner estimate conditions on correlated columns from the actual joint frequencies instead of multiplying per-column selectivities.
Create a table t2 with two perfectly correlated columns (containing identical data), and build an MCV list on those columns:
--Create the table
CREATE TABLE t2 (a int,b int);
--Insert data
INSERT INTO t2 SELECT mod(i,100), mod(i,100) FROM generate_series(1,1000000) s(i);
analyze t2;
--Check the execution plan: estimated 97, actual 10000
postgres=# EXPLAIN ANALYZE SELECT * FROM t2 WHERE (a = 70) AND (b = 70);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Seq Scan on t2 (cost=0.00..19425.00 rows=97 width=8) (actual time=0.038..123.438 rows=10000 loops=1)
Filter: ((a = 70) AND (b = 70))
Rows Removed by Filter: 990000
Planning Time: 0.150 ms
Execution Time: 124.647 ms
(5 rows)
--Create the MCV statistics
CREATE STATISTICS s3 (mcv) ON a, b FROM t2;
ANALYZE t2;
-- valid combination (found in MCV): a=70 and b=70, estimated 11267 vs actual 10000, quite close
postgres=# EXPLAIN ANALYZE SELECT * FROM t2 WHERE (a = 70) AND (b = 70);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Seq Scan on t2 (cost=0.00..19425.00 rows=11267 width=8) (actual time=0.069..181.738 rows=10000 loops=1)
Filter: ((a = 70) AND (b = 70))
Rows Removed by Filter: 990000
Planning Time: 1.120 ms
Execution Time: 182.452 ms
(5 rows)
-- invalid combination (not found in MCV)
postgres=# EXPLAIN ANALYZE SELECT * FROM t2 WHERE (a = 70) AND (b = 80);
QUERY PLAN
--------------------------------------------------------------------------------------------------
Seq Scan on t2 (cost=0.00..19425.00 rows=1 width=8) (actual time=125.878..125.879 rows=0 loops=1)
Filter: ((a = 70) AND (b = 80))
Rows Removed by Filter: 1000000
Planning Time: 0.207 ms
Execution Time: 125.945 ms
(5 rows)
--{70,70} appears in the MCV list; {70,80} does not
postgres=# SELECT m.* FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid)
postgres-# , pg_mcv_list_items(stxdmcv) m WHERE stxname = 's3';
index | values | nulls | frequency | base_frequency
-------+---------+-------+----------------------+------------------------
0 | {70,70} | {f,f} | 0.011266666666666666 | 0.00012693777777777776
1 | {78,78} | {f,f} | 0.0111 | 0.00012321
2 | {32,32} | {f,f} | 0.011066666666666667 | 0.00012247111111111112
3 | {13,13} | {f,f} | 0.011033333333333332 | 0.00012173444444444442
4 | {82,82} | {f,f} | 0.011 | 0.00012099999999999999
....
--How is rows=11267 computed for WHERE (a = 70) AND (b = 70)?
rows = 1000000 * 0.011266666666666666
     = 11267
The actual frequency (in the sample) of this combination of a and b is about 1%. The base frequency of the combination (computed from the simple per-column frequencies) is only about 0.01%, which would produce an underestimate of two orders of magnitude.
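Both numbers can be read straight off the pg_mcv_list_items output:

```python
# MCV arithmetic for WHERE a = 70 AND b = 70, using the values shown
# by pg_mcv_list_items above.
reltuples = 1_000_000
frequency = 0.011266666666666666         # sampled joint frequency
base_frequency = 0.00012693777777777776  # product of per-column frequencies

print(round(reltuples * frequency))       # 11267, the plan's estimate
print(round(reltuples * base_frequency))  # 127, what independence predicts
# base_frequency is just frequency(a=70) * frequency(b=70):
assert abs(base_frequency - 0.011266666666666666 ** 2) < 1e-9
```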
--Compute the estimate for WHERE (a = 1) AND (b = 2)
--To make the arithmetic easy to follow, set the statistics target for a and b to 10
alter table t2 alter column a SET STATISTICS 10;
alter table t2 alter column b SET STATISTICS 10;
analyze t2;
postgres=# SELECT null_frac, n_distinct, most_common_vals, most_common_freqs FROM pg_stats
postgres-# WHERE tablename='t2' AND attname in('a','b');
null_frac | n_distinct | most_common_vals | most_common_freqs
-----------+------------+------------------+-------------------
0 | 100 | {7,85} | {0.015,0.015}
0 | 100 | {7,85} | {0.015,0.015}
(2 rows)
postgres=# EXPLAIN ANALYZE SELECT * FROM t2 WHERE (a = 1) AND (b = 2);
QUERY PLAN
--------------------------------------------------------------------------------------------------
Seq Scan on t2 (cost=0.00..19425.75 rows=98 width=8) (actual time=137.876..137.876 rows=0 loops=1)
Filter: ((a = 1) AND (b = 2))
Rows Removed by Filter: 1000000
Planning Time: 0.452 ms
Execution Time: 137.924 ms
(5 rows)
--a and b have identical selectivities
selectivity = (1 - sum(mcv_freqs)) / (num_distinct - num_mcv)
postgres=# select (1-(0.014999999664723873+0.014999999664723873))/(100-2);
?column?
------------------------
0.00989795919051583933
(1 row)
--rows: an estimated 98 rows returned
rows = reltuples * selectivity(a=1) * selectivity(b=2)
postgres=# select 1000000*0.00989795919051583933*0.00989795919051583933;
?column?
---------------------------------------------
97.9695961371169693741399756143748489000000
(1 row)
--Any other combination absent from the MCV list, such as a=70 and b=80, is also estimated at 98
postgres=# EXPLAIN ANALYZE SELECT * FROM t2 WHERE (a =70) AND (b =80);
QUERY PLAN
---------------------------------------------------------------------------------------------------
Seq Scan on t2 (cost=0.00..19425.75 rows=98 width=8) (actual time=121.889..121.889 rows=0 loops=1)
Filter: ((a = 70) AND (b = 80))
Rows Removed by Filter: 1000000
Planning Time: 0.160 ms
Execution Time: 121.952 ms
(5 rows)
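The formula above can be checked numerically (values taken from the pg_stats output with a statistics target of 10):

```python
# Per-column estimate for a combination not found in any MCV list.
reltuples = 1_000_000
num_distinct, num_mcv = 100, 2
mcv_freqs = [0.015, 0.015]   # most_common_freqs for the column

# selectivity of one non-MCV value (the same for a and b here)
selectivity = (1 - sum(mcv_freqs)) / (num_distinct - num_mcv)
rows = reltuples * selectivity * selectivity
print(round(rows))  # 98, matching the plans for both (1,2) and (70,80)
```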
It is advisable to create MCV statistics objects only for combinations of columns that are actually used together in conditions, and for which misestimation produces bad plans; otherwise ANALYZE and planning time are simply wasted.
default_statistics_target
Raising this limit may let the planner make more accurate estimates (particularly for columns with irregular data distributions), at the price of more space consumed in pg_statistic and slightly more time spent computing the estimates.
Note: for an individual column the target can be set from 0 to 10000; the lower the value, the smaller the sample, the faster ANALYZE runs, and the less accurate the resulting statistics. Setting it to -1 reverts the column to the system-wide default target, which can be inspected with the following command:
postgres=# show default_statistics_target;
default_statistics_target
---------------------------
100
(1 row)
As the result above shows, this database's default statistics target is 100.
- The target can be changed on a table column or on an index expression
ALTER TABLE [ IF EXISTS ] [ ONLY ] name [ * ]
action [, ... ]
ALTER [ COLUMN ] column_name SET STATISTICS integer
--Create the test table and insert 1,000,000 rows
postgres=# CREATE TABLE test AS (SELECT random() x, random() y FROM generate_series(1,1000000));
SELECT 1000000
postgres=# ANALYZE test;
ANALYZE
--Create an expression index
postgres=# create index i_test_idx on test((x+y));
CREATE INDEX
postgres=# analyze test;
ANALYZE
--Check the execution plan for a filter on x+y
postgres=# explain analyze select * from test where x+y <0.01;
QUERY PLAN
---------------------------------------------------------------------------------------------------
Bitmap Heap Scan on test (cost=7.68..673.21 rows=652 width=16) (actual time=0.036..0.283 rows=60 loops=1)
Recheck Cond: ((x + y) < '0.01'::double precision)
Heap Blocks: exact=60
-> Bitmap Index Scan on i_test_idx (cost=0.00..7.51 rows=652 width=0) (actual time=0.017..0.017 rows=60 loops=1)
Index Cond: ((x + y) < '0.01'::double precision)
Planning Time: 0.569 ms
Execution Time: 0.342 ms
(7 rows)
--Raise the statistics target for the index expression
postgres=# ALTER INDEX i_test_idx ALTER COLUMN expr SET STATISTICS 3000;
ALTER INDEX
postgres=# analyze test;
ANALYZE
--Checking the plan again, the estimate is now much closer to the actual value
postgres=# EXPLAIN ANALYZE SELECT * FROM test WHERE x + y < 0.01;
QUERY PLAN
---------------------------------------------------------------------------------------------------
Index Scan using i_test_idx on test (cost=0.42..135.64 rows=121 width=16) (actual time=0.011..0.277 rows=60 loops=1)
Index Cond: ((x + y) < '0.01'::double precision)
Planning Time: 0.515 ms
Execution Time: 0.342 ms
(4 rows)
--Raise it to 10000
postgres=# ALTER INDEX i_test_idx ALTER COLUMN expr SET STATISTICS 10000;
ALTER INDEX
postgres=# analyze test;
ANALYZE
postgres=# EXPLAIN ANALYZE SELECT * FROM test WHERE x + y < 0.01;
QUERY PLAN
---------------------------------------------------------------------------------------------------
 Index Scan using i_test_idx on test  (cost=0.42..80.87 rows=71 width=16) (actual time=0.010..0.217 rows=60 loops=1)
Index Cond: ((x + y) < '0.01'::double precision)
Planning Time: 0.784 ms
Execution Time: 0.283 ms
(4 rows)
Using ALTER ... SET STATISTICS 3000 sets how many buckets the histogram uses and how many most-common values are stored, so the sampling becomes correspondingly finer-grained; histogram_bounds in pg_stats now records many more values.
- Check the modified value
postgres=# select cla.relname,att.attname,att.attstattarget from pg_attribute att join pg_class cla on att.attrelid=cla.oid where cla.relname = 'i_test_idx';
relname | attname | attstattarget
------------+---------+---------------
i_test_idx | expr | 3000
(1 row)
--or
postgres=# \d+ i_test_idx
Index "public.i_test_idx"
Column | Type | Key? | Definition | Storage | Stats target
--------+------------------+------+------------+---------+--------------
expr | double precision | yes | (x + y) | plain | 3000
btree, for table "public.test"
Cost anomalies
Someone asked me today why an index scan node can show an estimated cost larger than the plan's final total cost. It is in fact normal for a child node's estimated cost to exceed the total cost of the plan. There are two common cases:
- 1. With LIMIT 1, the actual row count is only 1 and the execution time is far lower than the child's cost estimate suggests. This is not a misestimate: the planner expects the child node to stop early after producing its first row.
postgres=# explain analyze select * from test3 where n_id < 1000 limit 1;
QUERY PLAN
---------------------------------------------------------------------------------------------------
Limit (cost=0.29..0.32 rows=1 width=4) (actual time=0.022..0.022 rows=1 loops=1)
   ->  Index Only Scan using i_test3_id on test3  (cost=0.29..25.13 rows=939 width=4) (actual time=0.019..0.019 rows=1 loops=1)
Index Cond: (n_id < 1000)
Heap Fetches: 1
Planning Time: 0.297 ms
Execution Time: 0.092 ms
(6 rows)
postgres=# explain analyze select * from test3 where n_id < 1000 ;
QUERY PLAN
---------------------------------------------------------------------------------------------------
Index Only Scan using i_test3_id on test3 (cost=0.29..25.13 rows=939 width=4) (actual time=0.020..19.543 rows=999 loops=1)
Index Cond: (n_id < 1000)
Heap Fetches: 999
Planning Time: 0.718 ms
Execution Time: 19.707 ms
(5 rows)
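The LIMIT node's cost range can be reproduced by scaling the child's cost by the fraction of its rows that will actually be fetched (a sketch of the planner's linear interpolation between startup and total cost):

```python
# Child node: Index Only Scan, cost=0.29..25.13, rows=939.
# LIMIT 1 needs 1 of the 939 estimated rows, so the planner charges the
# startup cost plus that fraction of the remaining run cost.
child_startup, child_total, child_rows = 0.29, 25.13, 939
limit_rows = 1

limit_total = child_startup + (child_total - child_startup) * (limit_rows / child_rows)
print(round(limit_total, 2))  # 0.32, the "cost=0.29..0.32" in the plan
```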
- 2. Merge joins show the same thing. If a merge join exhausts one input and the last key value from that input is less than the next key value in the other input, it stops reading the other input: there can be no further matches, so the remainder of the second input is not needed. As a result, the other child node is not read to completion.
- Below, the Index Scan using i_aj_all_bh_ysaj has an estimated cost of about 470,000, while the total estimated cost of the plan is only about 390,000:
GroupAggregate (cost=391236.41..391243.61 rows=188 width=45) (actual time=5839.527..5861.034 rows=184 loops=1)
Group Key: test_1.c_jbfy
-> Sort (cost=391236.41..391238.18 rows=710 width=38) (actual time=5839.324..5847.166 rows=105340 loops=1)
Sort Key: test_1.c_jbfy
Sort Method: quicksort Memory: 11302kB
-> Merge Join (cost=342460.08..391202.78 rows=710 width=38) (actual time=4280.410..5688.354 rows=105340 loops=1)
Merge Cond: ((test.c_bh_ysaj)::text = (test_1.c_bh)::text)
-> Index Scan using i_aj_all_bh_ysaj on test (cost=0.43..470402.81 rows=127589 width=32) (actual time=0.012..1054.034 rows=162022 loops=1)
Index Cond: (c_bh_ysaj IS NOT NULL)
                    Filter: ((c_ah IS NOT NULL) AND (d_jarq >= to_date('20190101'::text, 'yyyymmdd'::text)) AND (d_jarq <= to_date('20200101'::text, 'yyyymmdd'::text)))
Rows Removed by Filter: 254334
-> Sort (cost=342459.60..343014.47 rows=221949 width=38) (actual time=4279.757..4342.135 rows=357118 loops=1)
Sort Key: test_1.c_bh
Sort Method: quicksort Memory: 40188kB
.......
--Also note that, due to an implementation limitation, BitmapAnd and BitmapOr nodes always report an actual row count of 0
Summary
1. When columns in PG are correlated, statistics are likely to be inaccurate; functional-dependency statistics can fix the resulting misestimates for equality filters.
2. For GROUP BY a, b patterns, ndistinct statistics can be created to improve the plan.
3. Creating MCV statistics makes estimates for frequent value combinations more accurate, and raising default_statistics_target increases the number of histogram buckets and MCV entries, further improving accuracy and thus query efficiency.
4. With LIMIT and with merge joins, a child node's estimated cost can legitimately differ from, and even exceed, the plan's total cost.