Analysis and simulation of the "wait for a undo record" wait event
RDBMS 11.2.0.4 RAC
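Yesterday a deadlock occurred on the database. The cause was a job that calls a procedure, which in turn calls a package. The package contains many paired INSERT and DELETE statements, a dozen or so pairs, and no COMMIT; the only COMMIT is at the end of the procedure. While a developer was debugging the job, it aborted because of some column problems. At the same moment another developer was deploying a program that happened to use the same tables the job inserts into and deletes from. As a result the database stalled badly, and a quick check showed deadlocks. It was a lively scene, although the problem was dealt with quickly.
Looking at the alert log today, I found that the job had been debugged at least 10 times and had failed a dozen or so times. That is a lot of work to roll back..... Fortunately the database is not yet in production use.
This is the alert log from that time. Note the parallel query server entries in it.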
Wed Dec 26 05:46:36 2018
Archived Log entry 1190 added for thread 2 sequence 622 ID 0x589404d6 dest 1:
Wed Dec 26 06:01:10 2018
Errors in file /u01/app/oracle/diag/rdbms/XXXX/XXXX2/trace/XXXX2_p012_13163.trc (incident=48793):
ORA-00600: internal error code, arguments: [kcbzwfcro_2], [90329], [1], [32768], [0], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/XXXX/XXXX2/incident/incdir_48793/XXXX2_p012_13163_i48793.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /u01/app/oracle/diag/rdbms/XXXX/XXXX2/trace/XXXX2_p012_13163.trc:
ORA-10388: parallel query server interrupt (failure)
ORA-00600: internal error code, arguments: [kcbzwfcro_2], [90329], [1], [32768], [0], [], [], [], [], [], [], []
Errors in file /u01/app/oracle/diag/rdbms/XXXX/XXXX2/trace/XXXX2_p012_13163.trc:
ORA-10388: parallel query server interrupt (failure)
ORA-00600: internal error code, arguments: [kcbzwfcro_2], [90329], [1], [32768], [0], [], [], [], [], [], [], []
Wed Dec 26 06:01:13 2018
Dumping diagnostic data in directory=[cdmp_20181226060113], requested by (instance=2, osid=13163 (P012)), summary=[incident=48793].
Wed Dec 26 06:01:15 2018
Sweep [inc][48793]: completed
Sweep [inc2][48793]: completed
Wed Dec 26 06:06:26 2018
Errors in file /u01/app/oracle/diag/rdbms/XXXX/XXXX2/trace/XXXX2_p012_13163.trc (incident=48794):
ORA-00600: internal error code, arguments: [kcbzwfcro_2], [90329], [1], [32768], [0], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/XXXX/XXXX2/incident/incdir_48794/XXXX2_p012_13163_i48794.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /u01/app/oracle/diag/rdbms/XXXX/XXXX2/trace/XXXX2_p012_13163.trc:
ORA-10388: parallel query server interrupt (failure)
ORA-00600: internal error code, arguments: [kcbzwfcro_2], [90329], [1], [32768], [0], [], [], [], [], [], [], []
Errors in file /u01/app/oracle/diag/rdbms/XXXX/XXXX2/trace/XXXX2_p012_13163.trc:
ORA-10388: parallel query server interrupt (failure)
ORA-00600: internal error code, arguments: [kcbzwfcro_2], [90329], [1], [32768], [0], [], [], [], [], [], [], []
Wed Dec 26 06:06:27 2018
Dumping diagnostic data in directory=[cdmp_20181226060627], requested by (instance=2, osid=13163 (P012)), summary=[incident=48794].
Wed Dec 26 06:06:28 2018
Sweep [inc][48794]: completed
Sweep [inc2][48794]: completed
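Today I looked at the AWR report for that period and found a wait event: wait for a undo record.
I looked this wait event up on MOS. The note "IF: Undo Related Wait Event - Wait for an Undo Record" (Doc ID 1951704.1) gives some explanation.
The official recommendation is to set fast_start_parallel_rollback = false. The note also gives some caveats about changing this parameter, so check MOS before doing so.
The parameter itself is described in the official documentation.
Official documentation: https://docs.oracle.com/cd/E11882_01/server.112/e40402/initparams091.htm#REFRN10059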
FAST_START_PARALLEL_ROLLBACK
specifies the degree of parallelism used when recovering terminated transactions. Terminated transactions are transactions that are active before a system failure. If a system fails when there are uncommitted parallel DML or DDL transactions, then you can speed up transaction recovery during startup by using this parameter.
Values:
- FALSE: Parallel rollback is disabled
- LOW: Limits the maximum degree of parallelism to 2 * CPU_COUNT
- HIGH: Limits the maximum degree of parallelism to 4 * CPU_COUNT
If you change the value of this parameter, then transaction recovery will be stopped and restarted with the new implied degree of parallelism.
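The default value of this parameter is LOW, i.e. a maximum degree of parallelism of 2 * CPU_COUNT. So it is not surprising that system performance drops while a large transaction is being rolled back.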
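The change recommended by the MOS note could be applied roughly as follows. This is only a sketch: it assumes the instance uses an spfile (needed for SCOPE = BOTH) and that the setting should apply to every RAC instance (SID = '*'); neither detail comes from the original incident. As the documentation excerpt says, changing the value stops transaction recovery and restarts it with the new degree of parallelism.

```sql
-- Show the current value (SQL*Plus command).
SHOW PARAMETER fast_start_parallel_rollback

-- Disable parallel transaction recovery.
-- Assumption: spfile in use, change wanted on all RAC instances.
ALTER SYSTEM SET fast_start_parallel_rollback = FALSE SCOPE = BOTH SID = '*';

-- To return to the default later:
-- ALTER SYSTEM SET fast_start_parallel_rollback = LOW SCOPE = BOTH SID = '*';
```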
Below I reproduce this wait event.
RDBMS 12.2.0.1
First, create a table, then insert a large amount of data without committing.
create table rollback as select * from dba_objects;
insert into rollback select * from rollback;
[email protected]>insert into rollback select * from rollback;
80801 rows created.
[email protected]>/
161602 rows created.
[email protected]>/
323204 rows created.
[email protected]>/
646408 rows created.
[email protected]>/
1292816 rows created.
[email protected]>
Find the OS process id (SPID) of the current session, then kill that process at the OS level.
[email protected]>select spid from v$process where addr in (select paddr from v$session where sid in (select sid from v$mystat where rownum=1));
SPID
------------------------
17514
[email protected]>
kill -9 17514
Now query v$fast_start_transactions.
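The original output was a screenshot; a query along these lines would show the same information. It is a sketch only: the pct_done column is computed here for convenience and is not part of the view.

```sql
-- Progress of transactions being recovered by SMON and its recovery servers.
SELECT usn,
       state,                       -- TO BE RECOVERED / RECOVERING / RECOVERED
       undoblocksdone,
       undoblockstotal,
       ROUND(undoblocksdone / NULLIF(undoblockstotal, 0) * 100, 1) AS pct_done,
       rcvservers,                  -- number of servers used for this recovery
       xid
FROM   v$fast_start_transactions;
```

Running it a few times shows undoblocksdone climbing toward undoblockstotal while the state is RECOVERING.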
Check the session wait events.
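One possible way to see which sessions are waiting on the event while the rollback is running (a sketch; on RAC, gv$session could be used instead to cover all instances):

```sql
-- Sessions currently waiting on the undo-record event.
SELECT sid,
       program,
       event,
       state,
       seconds_in_wait
FROM   v$session
WHERE  event = 'wait for a undo record';
```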
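After the rollback completes, the session wait event disappears. The v$fast_start_transactions view at that point:
Using the XID from v$fast_start_transactions, find which SQL statement was responsible.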
[email protected]>select distinct sql_id from V$ACTIVE_SESSION_HISTORY where xid=hextoraw('01000C0006510200');
SQL_ID
-------------
2ux4jwjr3g52b
[email protected]>select sql_id,sql_text from v$sql where sql_id='2ux4jwjr3g52b';
SQL_ID
-------------
SQL_TEXT
--------------------------------------------------------------------------------
2ux4jwjr3g52b
insert into rollback select * from rollback
[email protected]>
Check the AWR report: the wait events include wait for a undo record.
With that, the problem is understood.
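Without generating an AWR report, the cumulative figures for the event can also be read from v$system_event. This is a rough substitute rather than the author's method: the numbers are totals since instance startup, not per-snapshot deltas as in AWR.

```sql
-- Cumulative waits on the event since instance startup.
SELECT event,
       total_waits,
       ROUND(time_waited_micro / 1e6, 2) AS time_waited_sec
FROM   v$system_event
WHERE  event = 'wait for a undo record';
```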
END