PG Optimization in Practice Series: 32 CPU Threads at 100%?
Background
A while ago I ran into a problem at a customer site. Arriving in the morning, I was told by technical support that the database had become unqueryable and all 32 CPU threads on the host were pegged at 100%.
To restore service as quickly as possible, and with the business side's approval, we restarted PostgreSQL directly. After the restart, CPU usage returned to normal and did not climb again.
Problem Diagnosis
The pg_stat_statements extension was already installed at the site, so I queried pg_stat_statements directly and found one SQL statement that had been running continuously from the small hours until 9 a.m., over 16,800 executions in total, averaging 4.5 s each. It turned out to be a newly deployed scheduled job, and after the restart this SQL did not run again. I pulled the statement out of the logs: it was over 150 lines long, full of left joins and subqueries, and untangling the logic of a statement like that by hand would have been an enormous amount of work. Instead, I ran explain (analyze, buffers) on it directly. The plan, predictably, was also very long, but reading it revealed the bottleneck:
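A query along these lines is how I typically surface the heaviest statements in pg_stat_statements (a sketch; the column names assume PostgreSQL 13+, where total_exec_time/mean_exec_time replaced total_time/mean_time, so adjust for your version):

```sql
-- Top statements by cumulative execution time.
-- Assumes PostgreSQL 13+ column names; on older versions
-- use total_time / mean_time instead.
SELECT queryid,
       calls,
       round(total_exec_time::numeric, 2) AS total_ms,
       round(mean_exec_time::numeric, 2)  AS mean_ms,
       left(query, 80)                    AS query_head
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```

Sorting by total time rather than mean time is deliberate here: a 4.5 s statement executed 16,800+ times dominates the total even if individually faster statements exist.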
->  Nested Loop Left Join  (cost=1000.14..628415.70 rows=8047 width=196) (actual time=141.912..1794.075 rows=1968 loops=1)
      Buffers: shared hit=360497
      ->  Gather  (cost=1000.00..627036.71 rows=8047 width=75) (actual time=141.886..1792.694 rows=1968 loops=1)
            Workers Planned: 6
            Workers Launched: 6
            Buffers: shared hit=358529
            ->  Parallel Seq Scan on pt  (cost=0.00..625232.01 rows=1341 width=75) (actual time=106.850..1782.563 rows=281 loops=7)
                  Filter: ((status = 1) AND (entity_id = ANY ('{xxx*n}'::bigint[])))
                  Rows Removed by Filter: 2096053
                  Buffers: shared hit=358529
      ->  Index Scan using "idxxxxx ASC NULLS LAST" on plab  (cost=0.14..0.16 rows=1 width=300) (actual time=0.000..0.000 rows=0 loops=1968)
            Index Cond: ((pt.property_id)::text = (property_id)::text)
            Buffers: shared hit=1968
From this plan excerpt, the six-worker parallel full-table scan of the pt table took nearly 1.8 s (actual time 1782 ms), accounting for almost all of the node's cost.
Plan fragments from other places where the pt table is used:
->  Materialize  (cost=1000.00..632096.42 rows=6 width=34) (actual time=5.555..8.164 rows=27 loops=81)
      Buffers: shared hit=358526
      ->  Gather  (cost=1000.00..632096.39 rows=6 width=34) (actual time=449.888..661.183 rows=27 loops=1)
            Workers Planned: 6
            Workers Launched: 6
            Buffers: shared hit=358526
            ->  Parallel Seq Scan on pt t3  (cost=0.00..631095.79 rows=1 width=34) (actual time=464.883..649.423 rows=4 loops=7)
                  Filter: (((property_type)::text = 'xx2'::text) AND (status = 1) AND (entity_id = ANY ('{xxx*n}'::bigint[])))
                  Rows Removed by Filter: 2096331
                  Buffers: shared hit=358526
->  Materialize  (cost=1000.00..632096.52 rows=7 width=34) (actual time=5.551..8.138 rows=28 loops=81)
      Buffers: shared hit=358526
      ->  Gather  (cost=1000.00..632096.49 rows=7 width=34) (actual time=449.649..659.117 rows=28 loops=1)
            Workers Planned: 6
            Workers Launched: 6
            Buffers: shared hit=358526
            ->  Parallel Seq Scan on pt t3_1  (cost=0.00..631095.79 rows=1 width=34) (actual time=432.199..643.978 rows=4 loops=7)
                  Filter: (((property_type)::text = 'xx1'::text) AND (status = 1) AND (entity_id = ANY ('{xxx*n}'::bigint[])))
                  Rows Removed by Filter: 2096330
                  Buffers: shared hit=358526
These plans show that every access to the pt table falls back to a parallel sequential scan, each one filtering away roughly 2 million rows, so the performance outlook is grim.
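As a cross-check (a sketch, assuming the table really is named pt as shown in the plans), the cumulative statistics view tells you how often a table has been scanned sequentially versus through an index since the last stats reset:

```sql
-- Sequential vs. index scan counts for the pt table.
-- A large seq_scan count with a huge seq_tup_read confirms
-- the full-table-scan pattern seen in the plans.
SELECT relname, seq_scan, seq_tup_read, idx_scan, idx_tup_fetch
FROM pg_stat_user_tables
WHERE relname = 'pt';
```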
Would an index help?
create index idx_xxx on pt (entity_id,status);
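On a live production system it is usually safer to build the index without blocking writes. A sketch of the non-blocking variant (note that CONCURRENTLY cannot run inside a transaction block, and the build takes longer):

```sql
-- Non-blocking variant for production: avoids the lock that
-- blocks concurrent INSERT/UPDATE/DELETE during the build.
create index concurrently idx_xxx on pt (entity_id, status);
```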
The new execution plan:
-- The Nested Loop that previously took ~1.8 s now runs as a hash left join in 1.5 ms
->  Hash Left Join  (cost=13.26..6795.78 rows=8391 width=196) (actual time=0.027..1.547 rows=1968 loops=1)
      Hash Cond: ((pt.property_id)::text = (plab.property_id)::text)
      Buffers: shared hit=635
      ->  Index Scan using idx_xxx on pt  (cost=0.56..6751.44 rows=8391 width=75) (actual time=0.020..1.100 rows=1968 loops=1)
            Index Cond: ((entity_id = ANY ('{xxx*n}'::bigint[])) AND (status = 1))
            Buffers: shared hit=635
      ->  Hash  (cost=11.20..11.20 rows=120 width=300) (actual time=0.003..0.003 rows=0 loops=1)
            Buckets: 1024  Batches: 1  Memory Usage: 8kB
            ->  Seq Scan on plab  (cost=0.00..11.20 rows=120 width=300) (actual time=0.002..0.002 rows=0 loops=1)
Overall, the SQL went from 4.5 s down to 44 ms.
Takeaways from this incident:
1. If you cannot pinpoint the problem quickly, request a restart first to restore service. After a restart, some scheduled jobs will be disconnected and will not reconnect, so service recovers fastest;
2. If CPU usage climbs right back up after the restart, call the pg_stat_statements_reset() function to clear the pg_stat_statements view, then monitor it for a while to see which long-running SQL is responsible;
3. CPU at 100% is usually caused by slow SQL, where "slow SQL" covers full-table scans, scans that read too much data, in-memory sorts, on-disk sorts, lock contention, and so on. In this incident, follow-up analysis showed that the full-table scans saturated the disk, the CPUs sat waiting on I/O, and all 32 threads ended up pegged at 100%.
4. On locating the bottleneck in a plan, the key words to watch are "actual" and "Filter": the former pinpoints where the time goes, and the latter shows which columns are being filtered without the benefit of an index.
5. If everything was fine until the problem suddenly appeared one day, a newly deployed business SQL is a likely cause. Ask the business side exactly what changed; for production systems in particular, changes must be tested before being promoted to the live environment;
6. If the CPU bottleneck recurs periodically, a scheduled job is very likely the culprit. To confirm, compare across periods: the same time each day, the same time each week, the same time each month, and so on;
7. On optimizing very long SQL: read the execution plan directly, find the bottleneck, and act on it, e.g. create an index, check whether existing indexes are sensible (is the column order of a multi-column index correct? are indexes duplicated?), and check whether the server parameters suit the current workload. In this incident, the root cause was simply that the new table had no suitable index.
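The reset-and-monitor workflow from point 2 can be sketched as follows (a sketch: pg_stat_statements_reset() requires superuser or an explicit GRANT, and the column names assume PostgreSQL 13+):

```sql
-- Step 1: clear the accumulated statistics.
SELECT pg_stat_statements_reset();

-- Step 2: let the workload run for a few minutes, then look
-- at the new top offenders; whatever rises to the top now is
-- what is burning CPU right now, not historical noise.
SELECT calls,
       round(mean_exec_time::numeric, 2) AS mean_ms,
       left(query, 80)                   AS query_head
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 5;
```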