
Troubleshooting a cursor: pin S wait on X Incident

Ran into a problem today: a customer reported that starting at 11:45 one day, the system saw a large number of cursor: pin S wait on X waits, which cleared up by themselves after about an hour. We needed to find the cause.

This wait event generally means a session is requesting a mutex in S (shared) mode while another session holds it in X (exclusive) mode. Finding the holder is easy, too: before 11g, take the first 8 hex digits of p2raw and convert them to decimal to get the holder's SID; from 11g onward, just look at blocking_session.
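As a quick illustration, here is a minimal sketch of both approaches against v$session (the p2raw trick assumes a 64-bit platform, where p2raw is 8 bytes and its top half encodes the holder's SID):

-- Pre-11g: decode the holder SID from the first 8 hex digits of p2raw.
select sid,
       to_number(substr(rawtohex(p2raw), 1, 8), 'xxxxxxxx') as holder_sid
  from v$session
 where event = 'cursor: pin S wait on X';

-- 11g onward: the holder is exposed directly.
select sid, blocking_session
  from v$session
 where event = 'cursor: pin S wait on X';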

And in 11g we have ASH, which makes it very easy to see everything that happened in the past.

(1) According to the customer, the worst window for cursor: pin S wait on X was 12:25~12:35, so we take that window as the example. It turns out all the sessions waiting on cursor: pin S wait on X were blocked by the session with SID 2.

SQL> select event,blocking_session,sql_id,count(*) from dba_hist_active_sess_history ash
  2  where sample_time>=to_timestamp('2013-06-24 12:25:00','yyyy-mm-dd hh24:mi:ss')
  3  and sample_time<=to_timestamp('2013-06-24 12:35:00','yyyy-mm-dd hh24:mi:ss')
  4  and event='cursor: pin S wait on X'
  5  group by event,blocking_session,sql_id
  6  /

EVENT                           BLOCKING_SESSION SQL_ID          COUNT(*)
------------------------------- ---------------- ------------- ----------
cursor: pin S wait on X                        2 8duehra3kdx62       1050

-- The SQL waiting on cursor: pin S wait on X is:
SQL> select sql_text from v$sql where sql_id='8duehra3kdx62';

SQL_TEXT
--------------------------------------------------------------------------------
insert into mytable_a1_p1_h1 select * from mytable_a2_p2_h2

(2) The session with SID 2 was in turn blocked by the session with SID 1129:

SQL> select event,blocking_session,sql_id,count(*) from dba_hist_active_sess_history ash
  2  where sample_time>=to_timestamp('2013-06-24 12:25:00','yyyy-mm-dd hh24:mi:ss')
  3  and sample_time<=to_timestamp('2013-06-24 12:35:00','yyyy-mm-dd hh24:mi:ss')
  4  and ash.session_id=2
  5  group by event,blocking_session,sql_id
  6  /

EVENT                           BLOCKING_SESSION SQL_ID          COUNT(*)
------------------------------- ---------------- ------------- ----------
library cache lock                          1129 8duehra3kdx62         60

SQL>

(3) The session with SID 1129 was blocked by SID 951:

SQL> select event,blocking_session,sql_id,count(*) from dba_hist_active_sess_history ash
  2  where sample_time>=to_timestamp('2013-06-24 12:25:00','yyyy-mm-dd hh24:mi:ss')
  3  and sample_time<=to_timestamp('2013-06-24 12:35:00','yyyy-mm-dd hh24:mi:ss')
  4  and ash.session_id=1129
  5  group by event,blocking_session,sql_id
  6  /

EVENT                           BLOCKING_SESSION SQL_ID          COUNT(*)
------------------------------- ---------------- ------------- ----------
row cache lock                               951 5nkcsttxny4wz         60

SQL>

-- And SID 1129's SQL is:
SQL> select sql_text from v$sql where sql_id='5nkcsttxny4wz';

SQL_TEXT
--------------------------------------------------------------------------------
truncate table mytable_a2_p2_h2

SQL>

(4) The session with SID 951 has no blocker of its own; its program column shows it is one of the Wnnn background processes (W004):

SQL> select event,blocking_session,sql_id,program,count(*) from dba_hist_active_sess_history ash
  2  where sample_time>=to_timestamp('2013-06-24 12:25:00','yyyy-mm-dd hh24:mi:ss')
  3  and sample_time<=to_timestamp('2013-06-24 12:35:00','yyyy-mm-dd hh24:mi:ss')
  4  and ash.session_id=951
  5  group by event,blocking_session,sql_id,program
  6  /

EVENT                        BLOCKING_SESSION SQL_ID        PROGRAM                  COUNT(*)
---------------------------- ---------------- ------------- ----------------------- ----------
                                                            oracle@g4as1010 (W004)           8
log file sequential read                                    oracle@g4as1010 (W004)          52

(5) So the ultimate blocker is a Wnnn process.
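Incidentally, steps (1) through (4) can be condensed into a single hierarchical query. This is a sketch, not part of the original analysis: it restricts to one ASH sample (:sample_id is a placeholder) so that waiter and blocker rows line up, and NOCYCLE guards against apparent loops in the sampled data:

select lpad(' ', 2 * (level - 1)) || session_id as sid_chain,
       event,
       blocking_session
  from (select session_id, blocking_session, event
          from dba_hist_active_sess_history
         where sample_id = :sample_id)
 start with event = 'cursor: pin S wait on X'
connect by nocycle prior blocking_session = session_id;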

Wnnn is a new class of background process in 11g, spawned by the SMCO process and used for proactive space management (it performs proactive space allocation and space reclamation).
Given the SQL seen above, with an insert into one table and a truncate of another, SMCO and its Wnnn slaves get involved in the space cleanup, and while they work they can block the operations of other sessions.
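To spot these processes on a live system, a minimal sketch (the program format matches what ASH showed above, e.g. oracle@g4as1010 (W004)):

-- List the SMCO master and its Wnnn slave sessions.
select sid, serial#, program
  from v$session
 where program like '%(SMCO)%'
    or program like '%(W0%';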

At this point we already have a preliminary workaround:

Since it is a Wnnn process that blocks everyone else, disabling this 11g feature, SMCO's "tablespace-level space (extent) pre-allocation", avoids the problem:

alter system set "_ENABLE_SPACE_PREALLOCATION"=0

The reverse operation, to restore the default, is:

alter system set "_ENABLE_SPACE_PREALLOCATION"=3

Note the meaning of the values:

* 0 to turn off the tbs pre-extension feature.
* 1 to enable tablespace extension.
* 2 to enable segment growth.
* 3 to enable extent growth.
* 4 to enable chunk allocation.
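Before and after changing it, the current value of the hidden parameter can be verified with the usual X$ query. A sketch (X$ views require connecting as SYS, and hidden parameters should only be changed under Oracle Support's guidance):

-- Show name, value, and description of _enable_space_preallocation.
select i.ksppinm  as name,
       v.ksppstvl as value,
       i.ksppdesc as description
  from x$ksppi i, x$ksppsv v
 where i.indx = v.indx
   and i.ksppinm = '_enable_space_preallocation';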

But this raises another question: the Wnnn process has no blocking_session of its own, so why did its operation take so long? Here an uncommon wait event catches the eye: log file sequential read.

log file sequential read is normally seen only during recovery or LogMiner, when the contents of the redo log files are read sequentially into memory. Why would log file sequential read be needed here?
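For this event, P1/P2/P3 record the log#, block#, and number of blocks read, so ASH can confirm that SID 951 really was reading the redo logs during that window. A sketch along the same lines as the queries above:

select event, p1 as log#, count(*),
       min(sample_time) as first_seen,
       max(sample_time) as last_seen
  from dba_hist_active_sess_history
 where session_id = 951
   and event = 'log file sequential read'
   and sample_time >= to_timestamp('2013-06-24 11:40:00','yyyy-mm-dd hh24:mi:ss')
   and sample_time <= to_timestamp('2013-06-24 12:43:00','yyyy-mm-dd hh24:mi:ss')
 group by event, p1;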

Cross-referencing those timestamps with the alert log, we found that at the corresponding times there were ORA-600 errors from the Wnnn processes.

Sun Jun 23 01:45:01 2013
Errors in file /aabb/sx/kgg/data/app/oracle/diag/rdbms/kgg/kgg/trace/kgg_w007_8283.trc (incident=104227):
ORA-00600: internal error code, arguments: [ktecgsc:kcbz_objdchk], [0], [0], [1], [], [], [], [], [], [], [], []
Incident details in: /aabb/sx/kgg/data/app/oracle/diag/rdbms/kgg/kgg/incident/incdir_104227/kgg_w007_8283_i104227.trc
……

Searching MOS for ORA-600 [ktecgsc:kcbz_objdchk] leads to bug 10237773 in 11.2.0.2, whose Rediscovery Notes describe exactly what we are seeing: "The problem is likely to occur in a Wnnn background process. ktssupd_segment_extblks_bkg is likely to appear on the stack."

Bug 10237773  ORA-600 [kcbz_check_objd_typ] / ORA-600 [ktecgsc:kcbz_objdchk]

This note gives a brief overview of bug 10237773.
The content was last updated on: 20-OCT-2011

Affects:
    Product (Component)                        Oracle Server (Rdbms)
    Range of versions believed to be affected  Versions >= 11.2 but BELOW 12.1
    Versions confirmed as being affected       11.2.0.2
    Platforms affected                         Generic (all / most platforms affected)

Fixed:
    This issue is fixed in
      12.1 (Future Release)
      11.2.0.3
      11.2.0.2.4 Patch Set Update
      11.2.0.2 Bundle Patch 12 for Exadata Database
      11.2.0.2 Patch 5 on Windows Platforms

Symptoms:
    Internal Error May Occur (ORA-600)
    ORA-600 [ktecgsc:kcbz_objdchk]
    ORA-600 [kcbz_check_objd_typ]
    Stack is likely to include ktssupd_segment_extblks_bkg
Related To:
    (None Specified)

Description
    ORA-600 [kcbz_check_objd_typ] / ORA-600 [ktecgsc:kcbz_objdchk] can occur as
    a background process tries to read a segment header into the cache from
    disk for an object that has undergone a TRUNCATE or similar operation
    changing the block's data object id.

    Rediscovery Notes:
      The problem is likely to occur in a Wnnn background process.
      ktssupd_segment_extblks_bkg is likely to appear on the stack.

    HOOKS OERI:ktecgsc:kcbz_objdchk OERI:kcbz_check_objd_typ
    STACKHAS:ktssupd_segment_extblks_bkg LIKELYAFFECTS
    XAFFECTS_11.2.0.1 XAFFECTS_V11020001 AFFECTS=11.2.0.1
    XAFFECTS_11.2.0.2 XAFFECTS_V11020002 AFFECTS=11.2.0.2
    XPRODID_5 PRODUCT_ID=5 PRODID-5 RDBMS XCOMP_RDBMS COMPONENT=RDBMS
    TAG_OERI OERI FIXED_11.2.0.2.4 FIXED_11.2.0.2.BP12 FIXED_11.2.0.3
    FIXED_12.1.0.0 FIXED_WIN:B202P05

Please note: The above is a summary description only. Actual symptoms can
vary. Matching to any symptoms here does not confirm that you are
encountering this problem. For questions about this bug please consult
Oracle Support.

References
    Bug:10237773 (This link will only work for PUBLISHED bugs)
    Note:245840.1 Information on the sections in this article

Analyzing the trace file further, we can see it was dumping redo records from 11:40 all the way to 12:43. That is why we saw the Wnnn process waiting on log file sequential read during that period: the Wnnn process was reading the redo log files to produce the dump.

The redo record dump (11:40 to 12:43):

*** 2013-06-24 11:40:14.021                <<<<<<<< start time of the redo record dump
REDO RECORD - Thread:1 RBA: 0x011002.00080f4e.0024 LEN: 0x003c VLD: 0x01
SCN: 0x0830.ffe6a968 SUBSCN: 3 06/23/2013 11:40:03
(LWN RBA: 0x011002.00080f27.0010 LEN: 0133 NST: 04d SCN: 0x0830.ffe6a967)
CHANGE #1 TYP:0 CLS:4 AFN:4 DBA:0x01003c6a OBJ:341379 SCN:0x0830.ffe6a8a7 SEQ:1 OP:14.2 ENC:0 RBL:0
ktelk redo: xid: 0x0004.015.00137dd7

REDO RECORD - Thread:1 RBA: 0x011002.00080f69.019c LEN: 0x00f8 VLD: 0x01
SCN: 0x0830.ffe6a96a SUBSCN: 38 06/23/2013 11:40:03
CHANGE #1 TYP:0 CLS:24 AFN:3 DBA:0x00cd58a4 OBJ:4294967295 SCN:0x0830.ffe6a96a SEQ:3 OP:5.1 ENC:0 RBL:0
ktudb redo: siz: 96 spc: 7696 flg: 0x0022 seq: 0x8a5b rec: 0x06
            xid:  0x0004.015.00137dd7
ktubu redo: slt: 21 rci: 5 opc: 14.5 objn: 1 objd: 341379 tsn: 4
.....
REDO RECORD - Thread:1 RBA: 0x011002.00080fad.0178 LEN: 0x0034 VLD: 0x01
SCN: 0x0830.ffe6a96f SUBSCN: 1 06/23/2013 11:40:03
(LWN RBA: 0x011002.00080fac.0010 LEN: 0026 NST: 04d SCN: 0x0830.ffe6a96d)
CHANGE #1 TYP:0 CLS:4 AFN:4 DBA:0x01003c6a OBJ:341379 SCN:0x0830.ffe6a96a SEQ:1 OP:14.1 ENC:0 RBL:0
ktecush redo: clear extent control lock

REDO RECORD - Thread:1 RBA: 0x011002.00080fae.0010 LEN: 0x00c4 VLD: 0x01
SCN: 0x0830.ffe6a96f SUBSCN: 3 06/23/2013 11:40:03
CHANGE #1 TYP:0 CLS:4 AFN:4 DBA:0x01003c6a OBJ:341379 SCN:0x0830.ffe6a96f SEQ:1 OP:13.28 ENC:0 RBL:0
High HWM
  Highwater::  0x033ffb58  ext#: 1  blk#: 8  ext size: 8
  #blocks in seg. hdr's freelists: 0
  #blocks below: 13
  mapblk  0x00000000  offset: 1
  lfdba:  0x01003c68
CHANGE #2 TYP:0 CLS:8 AFN:4 DBA:0x01003c68 OBJ:341379 SCN:0x0830.ffe6a96f SEQ:1 OP:13.22 ENC:0 RBL:0
Redo on Level1 Bitmap Block
Redo to set hwm
  Opcode: 32  Highwater::  0x033ffb58  ext#: 1  blk#: 8  ext size: 8
  #blocks in seg. hdr's freelists: 0
  #blocks below: 13
  mapblk  0x00000000  offset: 1

REDO RECORD - Thread:1 RBA: 0x011002.00080fb0.0050 LEN: 0x0084 VLD: 0x01
SCN: 0x0830.ffe6a970 SUBSCN: 10 06/23/2013 11:40:03
CHANGE #1 TYP:0 CLS:4 AFN:4 DBA:0x01003c6a OBJ:341379 SCN:0x0830.ffe6a96f SEQ:2 OP:13.28 ENC:0 RBL:0
Low HWM
  Highwater::  0x033ffb58  ext#: 1  blk#: 8  ext size: 8
  #blocks in seg. hdr's freelists: 0
  #blocks below: 13
....
*** 2013-06-24 11:40:20.839
*** 2013-06-24 12:43:25.874
descrip:"Thread 0001, Seq# 0000069858, SCN 0x08312e298f39-0xffffffffffff"
*** 2013-06-24 12:43:27.003                <<<<<<<<<< end time of the redo record dump
END OF DUMP REDO

In addition, the trace file also contains the call to ktssupd_segment_extblks_bkg mentioned in the bug note.

----- Call Stack Trace -----
calling              call     entry                argument values in hex
location             type     point                (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedst1()+96         CALL     skdstdst()           FFFFFFFF7FFF1420 ? 1006B0CA0 ?
                                                   000000000 ? 00000000A ?
                                                   000000001 ? 10BD98EB8 ?
……
                                                   FFFFFFFF7FFFF8F0 ?
                                                   FFFFFFFF7BD00200 ?
                                                   FFFFFFFF7FFFB4A1 ?
ktecgshx()+28        CALL     ktecgetsh()          FFFFFFFF7FFFBA50 ? 000000002 ?
                                                   000000002 ? FFFFFFFF7FFFB558 ?
                                                   00010C222 ? 733B9EE60 ?
ktssupd_segment_ext  CALL     ktecgshx()           FFFFFFFF7FFFBA50 ?   <<<<< here: ktssupd_segment_extblks_bkg on the stack
blks_bkg()+636                                     000000002 ? 000000002 ?
                                                   000000155 ? 733B9EE60 ?
                                                   00010C212 ?
ktsj_execute_task()  PTR_CALL ktssupd_segment_ext  71F1F2380 ? 000000024 ?
+676                          blks_bkg()           000000004 ? 10A6E6000 ?
                                                   10A6E6000 ? 00010C20D ?
ktsj_slave_main()+1  CALL     ktsj_execute_task()  FFFFFFFF7FFFC570 ?
                                                   0247417E8D78 ? 10C20E240 ?
                                                   1023C8300 ? 000380025 ?
                                                   10C20E090 ?
ksvrdp()+2304        PTR_CALL ktsj_slave_main()    000000000 ? 000380000 ?
                                                   000380025 ? 7448C78F8 ?
                                                   000380025 ? 380025000 ?
opirip()+912         CALL     ksvrdp()             000002000 ? 000000062 ?
                                                   00038000B ? 74ABE9CD0 ?
                                                   000001400 ? 000001648 ?
opidrv()+780         CALL     opirip()             10A6E7000 ? 000000000 ?
                                                   000380000 ? 00038000B ?
                                                   38002A000 ? 000000001 ?
sou2o()+92           CALL     opidrv()             000000032 ? 000000004 ?
                                                   FFFFFFFF7FFFF530 ? 0001EB250 ?
                                                   FFFFFFFF7AB45110 ?
                                                   FFFFFFFF7FFFF9B0 ?
opimai_real()+516    CALL     sou2o()              FFFFFFFF7FFFF508 ? 000000032 ?
                                                   000000004 ? FFFFFFFF7FFFF530 ?
                                                   00010C000 ? 00010B800 ?
ssthrdmain()+320     PTR_CALL opimai_real()        000000003 ? FFFFFFFF7FFFF7D8 ?
                                                   FFFFFFFF7FFFFB2D ?
                                                   FFFFFFFF7FFFFB8A ?
                                                   000000000 ? 000000000 ?
main()+308           CALL     ssthrdmain()         00010C000 ? 000000003 ?
                                                   00044D000 ? 100644B60 ?
                                                   10C233000 ? 00010C233 ?
_start()+380         CALL     main()               000000003 ? FFFFFFFF7FFFF8E8 ?
                                                   000000000 ? FFFFFFFF7FFFF7E0 ?
                                                   FFFFFFFF7FFFF8F0 ?
                                                   FFFFFFFF7BD00200 ?

--------------------- Binary Stack Dump ---------------------

So the cause of the problem appears to be the following chain:

1. While doing space management, a Wnnn process reads a segment header just as another session runs a TRUNCATE, hitting bug 10237773 and raising ORA-600 [ktecgsc:kcbz_objdchk].
2. After hitting the bug, the process dumps redo: from 11:40 to 12:43 the W004 process is busy dumping redo records.
3. While dumping, the W004 process (SID 951) waits on log file sequential read.
4. It blocks SID 1129, which waits on row cache lock.
5. SID 1129 blocks SID 2, which waits on library cache lock.
6. SID 2 blocks all the other sessions, which wait on cursor: pin S wait on X.

With the root cause identified, the remedies are clear. The workaround is to disable the 11g SMCO space-management feature as shown above; the fix is to apply the patch for bug 10237773, or to upgrade to 11.2.0.3.
