PG Optimization in Practice Series: 32 CPU Threads at 100%?
Background
A while ago I ran into a problem at a customer site. Arriving in the morning, I was told by technical support that the database had become unqueryable and all 32 CPU threads on the host were pegged at 100%.
To restore service as quickly as possible, and with the business side's approval, we restarted PostgreSQL directly. After the restart, CPU usage returned to normal and did not climb again.
Problem Diagnosis
The pg_stat_statements extension was already installed at the site, so I queried pg_stat_statements directly and found one SQL statement that had been running continuously from the small hours until 9 a.m., over 16,800 executions in total, averaging 4.5 s each. It turned out to be a newly deployed scheduled job, and after the restart this SQL did not run again. I pulled the statement out of the logs: it was over 150 lines long, full of left joins and subqueries, and untangling the logic of a statement like that by hand would have been an enormous amount of work. Instead, I ran explain (analyze, buffers) on it directly. The plan, predictably, was also very long, but reading it revealed the bottleneck:
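A query along these lines is how I typically surface the heaviest statements in pg_stat_statements (a sketch; the column names assume PostgreSQL 13+, where total_exec_time/mean_exec_time replaced total_time/mean_time, so adjust for your version):

```sql
-- Top statements by cumulative execution time.
-- Assumes PostgreSQL 13+ column names; on older versions
-- use total_time / mean_time instead.
SELECT queryid,
       calls,
       round(total_exec_time::numeric, 2) AS total_ms,
       round(mean_exec_time::numeric, 2)  AS mean_ms,
       left(query, 80)                    AS query_head
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```

Sorting by total time rather than mean time is deliberate here: a 4.5 s statement executed 16,800+ times dominates the total even if individually faster statements exist.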
->  Nested Loop Left Join  (cost=1000.14..628415.70 rows=8047 width=196) (actual time=141.912..1794.075 rows=1968 loops=1)
      Buffers: shared hit=360497
      ->  Gather  (cost=1000.00..627036.71 rows=8047 width=75) (actual time=141.886..1792.694 rows=1968 loops=1)
            Workers Planned: 6
            Workers Launched: 6
            Buffers: shared hit=358529
            ->  Parallel Seq Scan on pt  (cost=0.00..625232.01 rows=1341 width=75) (actual time=106.850..1782.563 rows=281 loops=7)
                  Filter: ((status = 1) AND (entity_id = ANY ('{xxx*n}'::bigint[])))
                  Rows Removed by Filter: 2096053
                  Buffers: shared hit=358529
      ->  Index Scan using "idxxxxx ASC NULLS LAST" on plab  (cost=0.14..0.16 rows=1 width=300) (actual time=0.000..0.000 rows=0 loops=1968)
            Index Cond: ((pt.property_id)::text = (property_id)::text)
            Buffers: shared hit=1968
From this plan excerpt, the six-worker parallel full-table scan of the pt table took nearly 1.8 s (actual time 1782 ms), accounting for almost all of the node's cost.
Plan fragments from other places where the pt table is used:
->  Materialize  (cost=1000.00..632096.42 rows=6 width=34) (actual time=5.555..8.164 rows=27 loops=81)
      Buffers: shared hit=358526
      ->  Gather  (cost=1000.00..632096.39 rows=6 width=34) (actual time=449.888..661.183 rows=27 loops=1)
            Workers Planned: 6
            Workers Launched: 6
            Buffers: shared hit=358526
            ->  Parallel Seq Scan on pt t3  (cost=0.00..631095.79 rows=1 width=34) (actual time=464.883..649.423 rows=4 loops=7)
                  Filter: (((property_type)::text = 'xx2'::text) AND (status = 1) AND (entity_id = ANY ('{xxx*n}'::bigint[])))
                  Rows Removed by Filter: 2096331
                  Buffers: shared hit=358526
->  Materialize  (cost=1000.00..632096.52 rows=7 width=34) (actual time=5.551..8.138 rows=28 loops=81)
      Buffers: shared hit=358526
      ->  Gather  (cost=1000.00..632096.49 rows=7 width=34) (actual time=449.649..659.117 rows=28 loops=1)
            Workers Planned: 6
            Workers Launched: 6
            Buffers: shared hit=358526
            ->  Parallel Seq Scan on pt t3_1  (cost=0.00..631095.79 rows=1 width=34) (actual time=432.199..643.978 rows=4 loops=7)
                  Filter: (((property_type)::text = 'xx1'::text) AND (status = 1) AND (entity_id = ANY ('{xxx*n}'::bigint[])))
                  Rows Removed by Filter: 2096330
                  Buffers: shared hit=358526
These plans show that every access to the pt table falls back to a parallel sequential scan, each one filtering away roughly 2 million rows, so the performance outlook is grim.
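As a cross-check (a sketch, assuming the table really is named pt as shown in the plans), the cumulative statistics view tells you how often a table has been scanned sequentially versus through an index since the last stats reset:

```sql
-- Sequential vs. index scan counts for the pt table.
-- A large seq_scan count with a huge seq_tup_read confirms
-- the full-table-scan pattern seen in the plans.
SELECT relname, seq_scan, seq_tup_read, idx_scan, idx_tup_fetch
FROM pg_stat_user_tables
WHERE relname = 'pt';
```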
Would an index help?
create index idx_xxx on pt (entity_id,status);
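On a live production system it is usually safer to build the index without blocking writes. A sketch of the non-blocking variant (note that CONCURRENTLY cannot run inside a transaction block, and the build takes longer):

```sql
-- Non-blocking variant for production: avoids the lock that
-- blocks concurrent INSERT/UPDATE/DELETE during the build.
create index concurrently idx_xxx on pt (entity_id, status);
```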
The new execution plan:
-- The Nested Loop that previously took ~1.8 s now runs as a hash left join in 1.5 ms
->  Hash Left Join  (cost=13.26..6795.78 rows=8391 width=196) (actual time=0.027..1.547 rows=1968 loops=1)
      Hash Cond: ((pt.property_id)::text = (plab.property_id)::text)
      Buffers: shared hit=635
      ->  Index Scan using idx_xxx on pt  (cost=0.56..6751.44 rows=8391 width=75) (actual time=0.020..1.100 rows=1968 loops=1)
            Index Cond: ((entity_id = ANY ('{xxx*n}'::bigint[])) AND (status = 1))
            Buffers: shared hit=635
      ->  Hash  (cost=11.20..11.20 rows=120 width=300) (actual time=0.003..0.003 rows=0 loops=1)
            Buckets: 1024  Batches: 1  Memory Usage: 8kB
            ->  Seq Scan on plab  (cost=0.00..11.20 rows=120 width=300) (actual time=0.002..0.002 rows=0 loops=1)
Overall, the SQL went from 4.5 s down to 44 ms.
Takeaways from this incident:
1. If you cannot pinpoint the problem quickly, request a restart first to restore service. After a restart, some scheduled jobs will be disconnected and will not reconnect, so service recovers fastest;
2. If CPU usage climbs right back up after the restart, call the pg_stat_statements_reset() function to clear the pg_stat_statements view, then monitor it for a while to see which long-running SQL is responsible;
3. CPU at 100% is usually caused by slow SQL, where "slow SQL" covers full-table scans, scans that read too much data, in-memory sorts, on-disk sorts, lock contention, and so on. In this incident, follow-up analysis showed that the full-table scans saturated the disk, the CPUs sat waiting on I/O, and all 32 threads ended up pegged at 100%.
4. On locating the bottleneck in a plan, the key words to watch are "actual" and "Filter": the former pinpoints where the time goes, and the latter shows which columns are being filtered without the benefit of an index.
5. If everything was fine until the problem suddenly appeared one day, a newly deployed business SQL is a likely cause. Ask the business side exactly what changed; for production systems in particular, changes must be tested before being promoted to the live environment;
6. If the CPU bottleneck recurs periodically, a scheduled job is very likely the culprit. To confirm, compare across periods: the same time each day, the same time each week, the same time each month, and so on;
7. On optimizing very long SQL: read the execution plan directly, find the bottleneck, and act on it, e.g. create an index, check whether existing indexes are sensible (is the column order of a multi-column index correct? are indexes duplicated?), and check whether the server parameters suit the current workload. In this incident, the root cause was simply that the new table had no suitable index.
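The reset-and-monitor workflow from point 2 can be sketched as follows (a sketch: pg_stat_statements_reset() requires superuser or an explicit GRANT, and the column names assume PostgreSQL 13+):

```sql
-- Step 1: clear the accumulated statistics.
SELECT pg_stat_statements_reset();

-- Step 2: let the workload run for a few minutes, then look
-- at the new top offenders; whatever rises to the top now is
-- what is burning CPU right now, not historical noise.
SELECT calls,
       round(mean_exec_time::numeric, 2) AS mean_ms,
       left(query, 80)                   AS query_head
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 5;
```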