Production sqlldr Load Performance Problems and Analysis, Part 2 (r2, Day 20)

阿新 • Published: 2022-05-04

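The previous part described how, during a data migration, the load speed suddenly dropped far below what we had seen in the test environment. One cause was that additional sessions were querying the database while the load was running, which drove undo usage up sharply and added to the database load. The AWR and ASH reports reveal more.

Database load in the test environment: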

Load Profile

	Per Second	Per Transaction	Per Exec	Per Call
DB Time(s):	98.2	2.2	0.73	0.67
DB CPU(s):	6.4	0.1	0.05	0.04
Redo size:	57,163,094.1	1,277,840.0
Logical reads:	395,710.8	8,845.8
Block changes:	279,863.1	6,256.1
Physical reads:	1,931.7	43.2
Physical writes:	8,592.7	192.1
User calls:	145.8	3.3
Parses:	67.4	1.5
Hard parses:	0.8	0.0
W/A MB processed:	0.1	0.0
Logons:	0.8	0.0
Executes:	133.9	3.0
Rollbacks:	0.0	0.0
Transactions:	44.7

Production environment:

Load Profile

	Per Second	Per Transaction	Per Exec	Per Call
DB Time(s):	149.1	5.8	2.22	1.44
DB CPU(s):	2.0	0.1	0.03	0.02
Redo size:	22,033,012.0	863,923.6
Logical reads:	143,043.3	5,608.8
Block changes:	89,599.3	3,513.2
Physical reads:	355.4	13.9
Physical writes:	3,529.4	138.4
User calls:	103.3	4.1
Parses:	38.6	1.5
Hard parses:	0.1	0.0
W/A MB processed:	0.8	0.0
Logons:	0.8	0.0
Executes:	67.1	2.6
Rollbacks:	0.0	0.0
Transactions:	25.5
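
As a quick cross-check of the two Load Profile tables above, the per-second figures can be compared directly. The sketch below is only illustrative arithmetic on numbers copied from those tables; the dictionary keys and the helper names `cpu_share` and `ratio` are made up for this example and are not part of AWR.

```python
# Per-second Load Profile figures copied from the two AWR tables above.
test = {"db_time": 98.2, "db_cpu": 6.4, "redo_bytes": 57_163_094.1, "transactions": 44.7}
prod = {"db_time": 149.1, "db_cpu": 2.0, "redo_bytes": 22_033_012.0, "transactions": 25.5}


def cpu_share(profile):
    """Fraction of DB time spent on CPU; the remainder is wait time."""
    return profile["db_cpu"] / profile["db_time"]


def ratio(metric):
    """Production value expressed as a fraction of the test value."""
    return prod[metric] / test[metric]


for name, profile in (("test", test), ("production", prod)):
    print(f"{name}: {cpu_share(profile):.1%} of DB time on CPU")

print(f"redo rate, production vs test: {ratio('redo_bytes'):.1%}")
print(f"transactions, production vs test: {ratio('transactions'):.1%}")
```

Read this way, production spends an even smaller share of its DB time on CPU than the test system does while producing well under half the redo, which points at waiting rather than computing.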

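Both environments ran the same load (150 concurrent sqlldr sessions with parallel=true), yet the test environment generated about 57 MB of redo per second while production managed only about 22 MB per second, less than half. Next, the instance efficiency figures.

Test environment: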

Buffer Nowait %:	99.76	Redo NoWait %:	99.98
Buffer Hit %:	99.51	In-memory Sort %:	100.00
Library Hit %:	98.31	Soft Parse %:	98.80
Execute to Parse %:	49.66	Latch Hit %:	97.52
Parse CPU to Parse Elapsd %:	70.81	% Non-Parse CPU:	99.92

Production environment:

Instance Efficiency Percentages (Target 100%)

Buffer Nowait %:	99.60	Redo NoWait %:	99.64
Buffer Hit %:	99.76	In-memory Sort %:	99.99
Library Hit %:	99.76	Soft Parse %:	99.84
Execute to Parse %:	42.46	Latch Hit %:	96.21
Parse CPU to Parse Elapsd %:	14.27	% Non-Parse CPU:	99.92

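These figures show that parsing accounts for very little CPU time. Monitoring during the migration also showed that CPU utilization never climbed. For some of the larger partitioned tables, cache consumption was startling: of the server's 180 GB of memory, almost everything left over after the SGA and operating system usage was consumed, leaving only about 400 MB free. When a smaller table was being loaded, free memory quickly recovered to 100 GB.

The excessive undo usage is confirmed by an entry in the alert log. This SQL statement ran for more than an hour and a half (5616 seconds), and undo was exhausted:

Fri Jun 27 06:04:08 2014
ORA-01555 caused by SQL statement below (SQL ID: 7wx3cgjqsmnn4, Query Duration=5616 sec, SCN: 0x0a08.4205da20):
SELECT "A3"."SUBSCRIBER_NO","A3"."CUSTOMER_BAN",SUBSTR("A2"."TITLE"||' '||"A2"."FIRST_NAME",1,1500),......... ....... "SYS_CREATION_DATE">=SYSDATE@!-5) ....

If CPU utilization is low, the sessions must be waiting on something, so the wait events are the next thing to check.

Test environment: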

Top 5 Timed Foreground Events

Event	Waits	Time(s)	Avg wait (ms)	% DB time	Wait Class
log buffer space	3,074,452	177,382	58	50.41	Configuration
db file sequential read	6,754,182	49,729	7	14.13	User I/O
log file sync	206,372	26,221	127	7.45	Commit
DB CPU		22,823		6.49
buffer busy waits	2,041,520	21,531	11	6.12	Concurrency

Production environment:

Top 5 Timed Foreground Events

Event	Waits	Time(s)	Avg wait (ms)	% DB time	Wait Class
free buffer waits	71,094	121,336	1707	22.82	Configuration
buffer busy waits	1,349,891	83,772	62	15.76	Concurrency
log buffer space	738,424	76,090	103	14.31	Configuration
log file switch (checkpoint incomplete)	6,231	57,456	9221	10.81	Configuration
enq: TX - index contention	109,479	57,413	524	10.80	Concurrency
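
The production table above can be sanity-checked and summarized with a little arithmetic: the Avg wait column is Time(s) divided by Waits, and the % DB time column can be summed per wait class. The sketch below only reuses the numbers from that table; the tuple layout and the function names are illustrative choices, not an Oracle interface.

```python
from collections import defaultdict

# (event, waits, total wait seconds, % DB time, wait class) from the production Top 5 table.
prod_events = [
    ("free buffer waits", 71_094, 121_336, 22.82, "Configuration"),
    ("buffer busy waits", 1_349_891, 83_772, 15.76, "Concurrency"),
    ("log buffer space", 738_424, 76_090, 14.31, "Configuration"),
    ("log file switch (checkpoint incomplete)", 6_231, 57_456, 10.81, "Configuration"),
    ("enq: TX - index contention", 109_479, 57_413, 10.80, "Concurrency"),
]


def avg_wait_ms(waits, seconds):
    """Average duration of a single wait, in milliseconds."""
    return seconds / waits * 1000


def pct_by_class(events):
    """Sum the % DB time column per wait class."""
    totals = defaultdict(float)
    for _event, _waits, _seconds, pct, wait_class in events:
        totals[wait_class] += pct
    return dict(totals)


for event, waits, seconds, _pct, _cls in prod_events:
    print(f"{event}: {avg_wait_ms(waits, seconds):.0f} ms per wait")

for wait_class, pct in pct_by_class(prod_events).items():
    print(f"{wait_class}: {pct:.2f}% of DB time")
```

Grouping by class shows that the three Configuration events alone account for nearly half of production DB time, which is why the instance settings deserve a closer look.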

The wait events are clearly very different as well. If it is not obvious what the top event, "free buffer waits", represents, ADDM provides more detail:

Finding 3: Free Buffer Waits
Impact is 44.79 active sessions, 30.05% of total activity.
----------------------------------------------------------
Database writers (DBWR) were unable to keep up with the demand for free
buffers.


   Recommendation 1: Database Configuration
   Estimated benefit is 44.79 active sessions, 30.05% of total activity.
   ---------------------------------------------------------------------
   Action
      Consider increasing the number of database writers (DBWR) by setting the
      parameter "db_writer_processes". Also consider if asynchronous I/O is
      appropriate for your architecture.
   Rationale
      The value of parameter "db_writer_processes" was "4" during the analysis
      period.
   Rationale
      The value of parameter "disk_asynch_io" was "TRUE" during the analysis
      period.

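At this point it all became clear. One parameter differs between the test and production environments. Changing it in production to enable asynchronous I/O requires an instance restart, and because time was already tight before the production migration, the change had been deferred to phase two. I did not expect it to have this much impact. The parameter in question is filesystemio_options, which needs to be set to SETALL; the other parameter, disk_asynch_io, is TRUE by default.

With this in mind, the remaining factors either turn out to be related to it or account for only a small share. In summary, two main causes explain the performance drop in this migration. First, external users were accessing the database and continuously running queries, some of them large, which consumed a great deal of undo; by my estimate this accounts for more than 40% of the slowdown. Second, the database's asynchronous I/O setting, which accounts for more than 50%.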
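
A mismatch like this is easy to catch before a migration by diffing parameter snapshots from the two environments. The following is a minimal sketch of that idea. The helper `diff_parameters` is hypothetical, and the sample values are assumptions for illustration: only the production values of db_writer_processes and disk_asynch_io come from the ADDM output above, while the filesystemio_options values and all test-side values are guesses. In practice the snapshots would be exported from each instance.

```python
def normalize(value):
    """Compare values case-insensitively, keeping 'missing' distinct from any string."""
    return None if value is None else str(value).strip().lower()


def diff_parameters(reference, candidate):
    """Return {name: (reference_value, candidate_value)} for every setting that differs."""
    names = sorted(set(reference) | set(candidate))
    return {
        name: (reference.get(name), candidate.get(name))
        for name in names
        if normalize(reference.get(name)) != normalize(candidate.get(name))
    }


# Illustrative snapshots only (assumed values, see the note above).
test_params = {
    "db_writer_processes": "4",
    "disk_asynch_io": "TRUE",
    "filesystemio_options": "SETALL",
}
prod_params = {
    "db_writer_processes": "4",
    "disk_asynch_io": "TRUE",
    "filesystemio_options": "none",
}

for name, (test_value, prod_value) in diff_parameters(test_params, prod_params).items():
    print(f"{name}: test={test_value} production={prod_value}")
```

Running a comparison like this as part of the migration checklist would surface the difference while there is still time to schedule the restart.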
