生產環境sqlldr載入效能問題及分析之二(r2第20天)
上一節討論了在資料遷移中發現數據載入的速度一下子慢了很多,和之前在測試環境相比有很大的差距。一個原因就是由於在資料載入的過程中有一些額外的session也在操作訪問資料庫,造成了undo的使用率急劇上升,資料庫負載從某種程度上也加劇了。通過檢視awr,ash報告可以發現更多的內容。 測試環境的資料庫負載情況
Load Profile
Per Second |
Per Transaction |
Per Exec |
Per Call |
|
---|---|---|---|---|
DB Time(s): |
98.2 |
2.2 |
0.73 |
0.67 |
DB CPU(s): |
6.4 |
0.1 |
0.05 |
0.04 |
Redo size: |
57,163,094.1 |
1,277,840.0 |
||
Logical reads: |
395,710.8 |
8,845.8 |
||
Block changes: |
279,863.1 |
6,256.1 |
||
Physical reads: |
1,931.7 |
43.2 |
||
Physical writes: |
8,592.7 |
192.1 |
||
User calls: |
145.8 |
3.3 |
||
Parses: |
67.4 |
1.5 |
||
Hard parses: |
0.8 |
0.0 |
||
W/A MB processed: |
0.1 |
0.0 |
||
Logons: |
0.8 |
0.0 |
||
Executes: |
133.9 |
3.0 |
||
Rollbacks: |
0.0 |
0.0 |
||
Transactions: |
44.7 |
生產環境:
Load Profile
Per Second |
Per Transaction |
Per Exec |
Per Call |
|
---|---|---|---|---|
DB Time(s): |
149.1 |
5.8 |
2.22 |
1.44 |
DB CPU(s): |
2.0 |
0.1 |
0.03 |
0.02 |
Redo size: |
22,033,012.0 |
863,923.6 |
||
Logical reads: |
143,043.3 |
5,608.8 |
||
Block changes: |
89,599.3 |
3,513.2 |
||
Physical reads: |
355.4 |
13.9 |
||
Physical writes: |
3,529.4 |
138.4 |
||
User calls: |
103.3 |
4.1 |
||
Parses: |
38.6 |
1.5 |
||
Hard parses: |
0.1 |
0.0 |
||
W/A MB processed: |
0.8 |
0.0 |
||
Logons: |
0.8 |
0.0 |
||
Executes: |
67.1 |
2.6 |
||
Rollbacks: |
0.0 |
0.0 |
||
Transactions: |
25.5 |
同樣的資料載入執行緒(150個並行,採用sqlldr parllel=true),在測試環境中有57M每秒的速度,但是在生產環境中卻少了一倍多。 來看看例項的工作情況: 測試環境:
Buffer Nowait %: |
99.76 |
Redo NoWait %: |
99.98 |
---|---|---|---|
Buffer Hit %: |
99.51 |
In-memory Sort %: |
100.00 |
Library Hit %: |
98.31 |
Soft Parse %: |
98.80 |
Execute to Parse %: |
49.66 |
Latch Hit %: |
97.52 |
Parse CPU to Parse Elapsd %: |
70.81 |
% Non-Parse CPU: |
99.92 |
生產環境:
Instance Efficiency Percentages (Target 100%)
Buffer Nowait %: |
99.60 |
Redo NoWait %: |
99.64 |
---|---|---|---|
Buffer Hit %: |
99.76 |
In-memory Sort %: |
99.99 |
Library Hit %: |
99.76 |
Soft Parse %: |
99.84 |
Execute to Parse %: |
42.46 |
Latch Hit %: |
96.21 |
Parse CPU to Parse Elapsd %: |
14.27 |
% Non-Parse CPU: |
99.92 |
通過上面的指標可以看到,cpu的解析佔用的時間不多,
根據那天資料遷移時的監控,發現cpu的使用率怎麼也沒上去,對於個別比較大的分割槽表,耗費的快取也大的驚人,180G的記憶體,除去sga,系統使用的,剩下的基本都被耗光了。只剩下400M左右,如果表小一些,空餘記憶體馬上會回覆到100G。
undo的使用情況過高,可以從alert中一條資訊加以驗證。這條sql語句執行了近2個多小時,Undo都被耗光了。
Fri Jun 27 06:04:08 2014
ORA-01555 caused by SQL statement below (SQL ID: 7wx3cgjqsmnn4, Query Duration=5616 sec, SCN: 0x0a08.4205da20):
SELECT "A3"."SUBSCRIBER_NO","A3"."CUSTOMER_BAN",SUBSTR("A2"."TITLE"||' '||"A2"."FIRST_NAME",1,1500),.........
.......
"SYS_CREATION_DATE">=SYSDATE@!-5) ....
如果cpu的使用率不高,肯定是有什麼等待事件。來看看等待事件。
測試環境:
Top 5 Timed Foreground Events
Event |
Waits |
Time(s) |
Avg wait (ms) |
% DB time |
Wait Class |
---|---|---|---|---|---|
log buffer space |
3,074,452 |
177,382 |
58 |
50.41 |
Configuration |
db file sequential read |
6,754,182 |
49,729 |
7 |
14.13 |
User I/O |
log file sync |
206,372 |
26,221 |
127 |
7.45 |
Commit |
DB CPU |
22,823 |
6.49 |
|||
buffer busy waits |
2,041,520 |
21,531 |
11 |
6.12 |
Concurrency |
生產環境:
Top 5 Timed Foreground Events
Event |
Waits |
Time(s) |
Avg wait (ms) |
% DB time |
Wait Class |
---|---|---|---|---|---|
free buffer waits |
71,094 |
121,336 |
1707 |
22.82 |
Configuration |
buffer busy waits |
1,349,891 |
83,772 |
62 |
15.76 |
Concurrency |
log buffer space |
738,424 |
76,090 |
103 |
14.31 |
Configuration |
log file switch (checkpoint incomplete) |
6,231 |
57,456 |
9221 |
10.81 |
Configuration |
enq: TX - index contention |
109,479 |
57,413 |
524 |
10.80 |
Concurrency |
可以看到等待事件也有很大的不同。 如果不太確定第一個等待事件“free buffer waits"主要代表什麼,可以從addm裡面得到更多的資訊。
Finding 3: Free Buffer Waits
Impact is 44.79 active sessions, 30.05% of total activity.
----------------------------------------------------------
Database writers (DBWR) were unable to keep up with the demand for free
buffers.
Recommendation 1: Database Configuration
Estimated benefit is 44.79 active sessions, 30.05% of total activity.
---------------------------------------------------------------------
Action
Consider increasing the number of database writers (DBWR) by setting the
parameter "db_writer_processes". Also consider if asynchronous I/O is
appropriate for your architecture.
Rationale
The value of parameter "db_writer_processes" was "4" during the analysis
period.
Rationale
The value of parameter "disk_asynch_io" was "TRUE" during the analysis
period.
看到這,我就恍然大悟了,測試環境和生產環境有一個引數是不一樣的,在生產環境中修改,啟用非同步IO需要重啟例項,但是在生產資料遷移之前,時間已經很緊張了,所以計劃把這個引數變更推遲到第二階段,沒想到有這麼大的影響。 對應的引數有file_system_io需要設定為setall. disk_asynch_io=TRUE 第二個選項預設就是TRUE. 有了這個思路,再去看其他的影響因素,有些是關聯的,有些影響比例很小。 所以總結這次資料遷移的問題,主要有兩個主要原因導致效能下降, 一個是有外部的使用者在訪問資料庫,他們在不間斷的做一些查詢,有個別查詢還比較大,耗費了大量的undo資源。大概佔40%以上的因素 一個原因是因為資料庫的非同步IO設定的問題。大概佔50%以上的因素。