A Window Function Query Optimization Case

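Window functions are commonly used for per-group ranking and make many grouping requirements easy to express. However, a window function usually needs a full table scan followed by sorting and aggregation, which consumes a lot of CPU, so such queries tend to run slowly. The following is one case of optimizing a window function query.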

1. Preparing the example

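Consider the following requirement. The system has a news module used to publish business-related activity updates. Each news item has a type (technology, entertainment, military, ...) and a view-count column. The website needs to show a scrolling list of popular news items (the higher the view count, the more popular), with at most 3 items per category. In other words: "group the news by category and take the top 3 news items of each group." The table structure and initial data are as follows: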

Create table info(
   id numeric not null primary key ,
   title varchar(100) ,
   Viewnum numeric ,
   info_type_id numeric ,
   Code text 
);

create index info_infotypeid on info (info_type_id);

Create table info_type(
   Id numeric  not null primary key,
   Name varchar(100) 
);

-- Insert 100 news categories
Insert into info_type select id, 'TYPE' || lpad(id::text, 5, '0' ) from generate_series(1, 100) id;

-- Insert 1,000,000 news items (md5() takes text, so id is cast explicitly)
Insert into info select id, 'TTL' || lpad(id::text, 20, '0' ) title, ceil(random()*1000000) Viewnum, ceil(random()*100) info_type_id , md5(id::text) code from generate_series(1, 1000000) id;

vacuum analyse info_type,info;
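
As a quick sanity check on the generated data (not part of the original write-up), the queries below count the categories and the news items per category. With 1,000,000 rows spread at random over 100 types, each type should hold roughly 10,000 rows, which is the per-type row count that the index scans in the later plans report. This is only a sketch and assumes the tables were loaded exactly as shown above; the column aliases type_count and news_count are made up for illustration.

```sql
-- Expect 100 categories.
select count(*) as type_count from info_type;

-- Expect roughly 10,000 news items per category
-- (exact values vary because the data comes from random()).
select info_type_id, count(*) as news_count
from info
group by info_type_id
order by info_type_id
limit 5;
```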

2. Method 1: using a window function

explain (analyse ,buffers )
with i as ( select i.*, row_number() over (partition by i.info_type_id order by i.viewnum desc) sn from info i)
select * from info_type t left join i on i.sn <= 3 and i.info_type_id = t.id;

                                                                 QUERY PLAN                                                                  
---------------------------------------------------------------------------------------------------------------------------------------------
 Hash Right Join  (cost=211867.09..245279.17 rows=333333 width=97) (actual time=4223.126..6169.377 rows=300 loops=1)
   Hash Cond: (i.info_type_id = t.id)
   Buffers: shared hit=11582 read=1753, temp read=17860 written=17901
   ->  Subquery Scan on i  (cost=211863.84..244363.84 rows=333333 width=82) (actual time=4223.080..6168.742 rows=300 loops=1)
         Filter: (i.sn <= 3)
         Rows Removed by Filter: 999700
         Buffers: shared hit=11582 read=1752, temp read=17860 written=17901
         ->  WindowAgg  (cost=211863.84..231863.84 rows=1000000 width=82) (actual time=4223.079..6080.518 rows=1000000 loops=1)
               Buffers: shared hit=11582 read=1752, temp read=17860 written=17901
               ->  Sort  (cost=211863.84..214363.84 rows=1000000 width=74) (actual time=4223.065..5224.438 rows=1000000 loops=1)
                     Sort Key: i_1.info_type_id, i_1.viewnum DESC
                     Sort Method: external merge  Disk: 84128kB
                     Buffers: shared hit=11582 read=1752, temp read=17860 written=17901
                     ->  Seq Scan on info i_1  (cost=0.00..23334.00 rows=1000000 width=74) (actual time=0.006..249.981 rows=1000000 loops=1)
                           Buffers: shared hit=11582 read=1752
   ->  Hash  (cost=2.00..2.00 rows=100 width=15) (actual time=0.037..0.037 rows=100 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 13kB
         Buffers: shared read=1
         ->  Seq Scan on info_type t  (cost=0.00..2.00 rows=100 width=15) (actual time=0.015..0.021 rows=100 loops=1)
               Buffers: shared read=1
 Planning Time: 0.328 ms
 Execution Time: 6182.496 ms
(22 rows)

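The plan shows that the Sort node is the most expensive step: it needs about 4.2 s before it returns its first row, out of about 6.2 s in total, and it spills 84128 kB to disk (Sort Method: external merge). Can the sort be avoided? Yes: an index that already returns the rows in the required order makes the sort unnecessary.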

3. Method 2: fetching only the third-ranked record

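Method 1 reads a large number of data blocks and therefore takes too long. As a first step, assume that only 1 record per group has to be returned, and try to avoid reading redundant blocks. In the new SQL, a subquery is run for each type and goes through the index on the info_type_id column of info, so it can avoid reading unneeded data. A subquery in the select list acts as a computed column and can return only a single value, so row (i.*)::info first packs the whole row into one composite value, and (inf).* then unpacks it into columns. offset 2 limit 1 picks the single third-ranked row.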

explain (analyse ,buffers )
select id, name, (inf).*
from (select t.*,
             (select row (i.*)::info
              from info i
              where i.info_type_id = t.id
              order by i.viewnum desc
                  offset 2
              limit 1) inf
      from info_type t
     ) t;

                                                                       QUERY PLAN                                                                       
--------------------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on info_type t  (cost=0.00..6708942.94 rows=100 width=361) (actual time=127.552..10513.868 rows=100 loops=1)
   Buffers: shared hit=3544406 read=3255
   SubPlan 1
     ->  Limit  (cost=13417.88..13417.88 rows=1 width=38) (actual time=21.744..21.745 rows=1 loops=100)
           Buffers: shared hit=706280 read=3252
           ->  Sort  (cost=13417.87..13442.87 rows=10000 width=38) (actual time=21.740..21.740 rows=3 loops=100)
                 Sort Key: i.viewnum DESC
                 Sort Method: top-N heapsort  Memory: 25kB
                 Buffers: shared hit=706280 read=3252
                 ->  Bitmap Heap Scan on info i  (cost=185.93..13288.63 rows=10000 width=38) (actual time=3.985..18.371 rows=10000 loops=100)
                       Recheck Cond: (info_type_id = t.id)
                       Heap Blocks: exact=706728
                       Buffers: shared hit=706280 read=3252
                       ->  Bitmap Index Scan on info_infotypeid  (cost=0.00..183.43 rows=10000 width=0) (actual time=2.615..2.615 rows=10000 loops=100)
                             Index Cond: (info_type_id = t.id)
                             Buffers: shared hit=1272 read=1532
   SubPlan 2
     ->  Limit  (cost=13417.88..13417.88 rows=1 width=38) (actual time=20.599..20.600 rows=1 loops=100)
           Buffers: shared hit=709529 read=3
           ->  Sort  (cost=13417.87..13442.87 rows=10000 width=38) (actual time=20.595..20.595 rows=3 loops=100)
                 Sort Key: i_1.viewnum DESC
                 Sort Method: top-N heapsort  Memory: 25kB
                 Buffers: shared hit=709529 read=3
                 ->  Bitmap Heap Scan on info i_1  (cost=185.93..13288.63 rows=10000 width=38) (actual time=3.640..17.373 rows=10000 loops=100)
                       Recheck Cond: (info_type_id = t.id)
                       Heap Blocks: exact=706728
                       Buffers: shared hit=709529 read=3
                       ->  Bitmap Index Scan on info_infotypeid  (cost=0.00..183.43 rows=10000 width=0) (actual time=2.291..2.291 rows=10000 loops=100)
                             Index Cond: (info_type_id = t.id)
                             Buffers: shared hit=2801 read=3
   SubPlan 3
     ->  Limit  (cost=13417.88..13417.88 rows=1 width=38) (actual time=21.284..21.285 rows=1 loops=100)
           Buffers: shared hit=709532
           ->  Sort  (cost=13417.87..13442.87 rows=10000 width=38) (actual time=21.279..21.279 rows=3 loops=100)
                 Sort Key: i_2.viewnum DESC
                 Sort Method: top-N heapsort  Memory: 25kB
                 Buffers: shared hit=709532
                 ->  Bitmap Heap Scan on info i_2  (cost=185.93..13288.63 rows=10000 width=38) (actual time=3.609..17.868 rows=10000 loops=100)
                       Recheck Cond: (info_type_id = t.id)
                       Heap Blocks: exact=706728
                       Buffers: shared hit=709532
                       ->  Bitmap Index Scan on info_infotypeid  (cost=0.00..183.43 rows=10000 width=0) (actual time=2.267..2.267 rows=10000 loops=100)
                             Index Cond: (info_type_id = t.id)
                             Buffers: shared hit=2804
   SubPlan 4
     ->  Limit  (cost=13417.88..13417.88 rows=1 width=38) (actual time=20.763..20.763 rows=1 loops=100)
           Buffers: shared hit=709532
           ->  Sort  (cost=13417.87..13442.87 rows=10000 width=38) (actual time=20.759..20.759 rows=3 loops=100)
                 Sort Key: i_3.viewnum DESC
                 Sort Method: top-N heapsort  Memory: 25kB
                 Buffers: shared hit=709532
                 ->  Bitmap Heap Scan on info i_3  (cost=185.93..13288.63 rows=10000 width=38) (actual time=3.769..17.505 rows=10000 loops=100)
                       Recheck Cond: (info_type_id = t.id)
                       Heap Blocks: exact=706728
                       Buffers: shared hit=709532
                       ->  Bitmap Index Scan on info_infotypeid  (cost=0.00..183.43 rows=10000 width=0) (actual time=2.390..2.390 rows=10000 loops=100)
                             Index Cond: (info_type_id = t.id)
                             Buffers: shared hit=2804
   SubPlan 5
     ->  Limit  (cost=13417.88..13417.88 rows=1 width=38) (actual time=20.713..20.713 rows=1 loops=100)
           Buffers: shared hit=709532
           ->  Sort  (cost=13417.87..13442.87 rows=10000 width=38) (actual time=20.709..20.709 rows=3 loops=100)
                 Sort Key: i_4.viewnum DESC
                 Sort Method: top-N heapsort  Memory: 25kB
                 Buffers: shared hit=709532
                 ->  Bitmap Heap Scan on info i_4  (cost=185.93..13288.63 rows=10000 width=38) (actual time=3.689..17.432 rows=10000 loops=100)
                       Recheck Cond: (info_type_id = t.id)
                       Heap Blocks: exact=706728
                       Buffers: shared hit=709532
                       ->  Bitmap Index Scan on info_infotypeid  (cost=0.00..183.43 rows=10000 width=0) (actual time=2.288..2.288 rows=10000 loops=100)
                             Index Cond: (info_type_id = t.id)
                             Buffers: shared hit=2804
 Planning Time: 0.729 ms
 Execution Time: 10514.326 ms
(74 rows)

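With Method 2, for every row of info_type the info table is accessed through the info_type_id index 5 times (once for each of its 5 columns). Total time: 100 (rows) * 5 (columns) * 20 ms (approximate time per access), about 10000 ms.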

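Plan analysis: going through the info_type_id index alone still visits too many rows (about 10,000 per type), and a sort is still required. Based on this, we can create a composite index on info_type_id + viewnum, which reduces the cost of each access and avoids the sort. The plan below comes from re-running the same query after the index is created.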

create index info_typeview on info(info_type_id,viewnum);

                                                                        QUERY PLAN                                                                        
----------------------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on info_type t  (cost=0.00..4627.72 rows=100 width=361) (actual time=0.255..13.391 rows=100 loops=1)
   Buffers: shared hit=2881 read=120
   SubPlan 1
     ->  Limit  (cost=6.31..9.25 rows=1 width=38) (actual time=0.041..0.041 rows=1 loops=100)
           Buffers: shared hit=480 read=120
           ->  Index Scan Backward using info_typeview on info i  (cost=0.42..29421.91 rows=10000 width=38) (actual time=0.034..0.040 rows=3 loops=100)
                 Index Cond: (info_type_id = t.id)
                 Buffers: shared hit=480 read=120
   SubPlan 2
     ->  Limit  (cost=6.31..9.25 rows=1 width=38) (actual time=0.022..0.022 rows=1 loops=100)
           Buffers: shared hit=600
           ->  Index Scan Backward using info_typeview on info i_1  (cost=0.42..29421.91 rows=10000 width=38) (actual time=0.018..0.021 rows=3 loops=100)
                 Index Cond: (info_type_id = t.id)
                 Buffers: shared hit=600
   SubPlan 3
     ->  Limit  (cost=6.31..9.25 rows=1 width=38) (actual time=0.021..0.021 rows=1 loops=100)
           Buffers: shared hit=600
           ->  Index Scan Backward using info_typeview on info i_2  (cost=0.42..29421.91 rows=10000 width=38) (actual time=0.018..0.020 rows=3 loops=100)
                 Index Cond: (info_type_id = t.id)
                 Buffers: shared hit=600
   SubPlan 4
     ->  Limit  (cost=6.31..9.25 rows=1 width=38) (actual time=0.021..0.021 rows=1 loops=100)
           Buffers: shared hit=600
           ->  Index Scan Backward using info_typeview on info i_3  (cost=0.42..29421.91 rows=10000 width=38) (actual time=0.018..0.020 rows=3 loops=100)
                 Index Cond: (info_type_id = t.id)
                 Buffers: shared hit=600
   SubPlan 5
     ->  Limit  (cost=6.31..9.25 rows=1 width=38) (actual time=0.023..0.023 rows=1 loops=100)
           Buffers: shared hit=600
           ->  Index Scan Backward using info_typeview on info i_4  (cost=0.42..29421.91 rows=10000 width=38) (actual time=0.020..0.022 rows=3 loops=100)
                 Index Cond: (info_type_id = t.id)
                 Buffers: shared hit=600
 Planning Time: 0.730 ms
 Execution Time: 13.552 ms
(34 rows)

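After the new index is created, a single access drops from about 20 ms to about 0.023 ms, a reduction of nearly 1000 times, and total execution time falls from 10514 ms to about 13.6 ms.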

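Remaining problems: the query returns only one row per type, and because the info table has 5 columns, (inf).* expands into 5 SubPlans, 4 of which are redundant.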

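The SQL is revised once more. A subquery in the select list acts as a computed column and can return only a single value, so array() first collects the subquery's rows into one array value, and unnest() then expands that array back into multiple rows. limit 3 picks the top three rows of each type.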

explain (analyse ,buffers )
select id, name, (inf).*
from (select t.id, t.name, unnest(inf) inf
      from (select t.*,
                   array(select row (i.*)::info
                         from info i
                         where i.info_type_id = t.id
                         order by i.viewnum desc
                         limit 3) inf
            from info_type t
           ) t) t;

                                                                          QUERY PLAN                                                                          
--------------------------------------------------------------------------------------------------------------------------------------------------------------
 Subquery Scan on t  (cost=0.00..942.89 rows=1000 width=361) (actual time=0.092..2.526 rows=300 loops=1)
   Buffers: shared hit=601
   ->  ProjectSet  (cost=0.00..932.89 rows=1000 width=47) (actual time=0.089..2.406 rows=300 loops=1)
         Buffers: shared hit=601
         ->  Seq Scan on info_type t_1  (cost=0.00..2.00 rows=100 width=15) (actual time=0.008..0.020 rows=100 loops=1)
               Buffers: shared hit=1
         SubPlan 1
           ->  Limit  (cost=0.42..9.25 rows=3 width=38) (actual time=0.018..0.021 rows=3 loops=100)
                 Buffers: shared hit=600
                 ->  Index Scan Backward using info_typeview on info i  (cost=0.42..29421.91 rows=10000 width=38) (actual time=0.017..0.020 rows=3 loops=100)
                       Index Cond: (info_type_id = t_1.id)
                       Buffers: shared hit=600
 Planning Time: 0.295 ms
 Execution Time: 2.639 ms
(14 rows)
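
The same per-type index lookup can also be written with a LATERAL join, which lets a subquery in the FROM clause reference t.id and return several rows directly, without packing them into an array first. This is an alternative sketch, not the method used above. It assumes an engine with PostgreSQL-style LATERAL support (PostgreSQL 9.3 or later) and the info_typeview index created earlier; the alias top3 is introduced here only for illustration.

```sql
explain (analyse, buffers)
select t.id, t.name, top3.*
from info_type t
left join lateral (
    -- Evaluated once per info_type row. With the composite index on
    -- (info_type_id, viewnum) the rows can be read already ordered,
    -- so no separate sort step is expected.
    select i.*
    from info i
    where i.info_type_id = t.id
    order by i.viewnum desc
    limit 3
) top3 on true;
```

Replacing limit 3 with offset 2 limit 1 in this shape returns only the third-ranked row per type while still evaluating one subquery per type rather than one per column. The left join keeps types that have no news items, matching the left join used in Method 1.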

4. Conclusions

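1. The key to the whole optimization is the composite index on info_type_id + viewnum, that is, a composite index over the partition by and order by columns of the window query.

2. The use of array() is the other key point: it solves the problem of returning multiple rows from a select-list subquery.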

KINGBASE Research Institute