關於shared pool的深入探討(六)-高Latch競爭案例
研究了幾天shared pool,沒想到忽然就撞到問題上來了.
作為一個案例寫出來給大家參考一下吧.
問題起因是公司做短信群發,就是那個18萬買的4000字的短信小說(噓,小聲點,我也沒看過...).
群發的時候每隔一段時間就會發生一次消息隊列擁堵的情況
在數據庫內部實際上是向一個數據表中記錄發送日誌.
我們介入來檢查數據庫的問題,在一個擁堵時段我開始診斷:
SQL> select sid,event,p1,p1raw from v$session_wait; SID EVENT P1 P1RAW ---------- ---------------------------------------------------------------- ---------- -------- 76 latch free 2147535824 8000CBD0 83 latch free 2147535824 8000CBD0 148 latch free 3415346832 CB920E90 288 latch free 2147535824 8000CBD0 285 latch free 2147535824 8000CBD0 196 latch free 2147535824 8000CBD0 317 latch free 2147535824 8000CBD0 2 pmon timer 300 0000012C 1 rdbms ipc message 300 0000012C 4 rdbms ipc message 300 0000012C 6 rdbms ipc message 180000 0002BF20 SID EVENT P1 P1RAW ---------- ---------------------------------------------------------------- ---------- -------- 18 rdbms ipc message 6000 00001770 102 rdbms ipc message 6000 00001770 311 rdbms ipc message 6000 00001770 194 rdbms ipc message 6000 00001770 178 rdbms ipc message 6000 00001770 3 log file parallel write 1 00000001 13 log file sync 2705 00000A91 16 log file sync 2699 00000A8B 104 log file sync 2699 00000A8B 308 log file sync 2694 00000A86 262 log file sync 2705 00000A91 SID EVENT P1 P1RAW ---------- ---------------------------------------------------------------- ---------- -------- 172 log file sync 2689 00000A81 169 log file sync 2705 00000A91 108 log file sync 2694 00000A86 38 log file sync 2707 00000A93 34 db file scattered read 63 0000003F 5 smon timer 300 0000012C 27 SQL*Net message to client 1413697536 54435000 60 SQL*Net message to client 1413697536 54435000 239 SQL*Net message to client 1413697536 54435000 ...ignore some idle waiting here... 11 SQL*Net message from client 675562835 28444553 12 SQL*Net message from client 1413697536 54435000 170 rows selected. |
在這次查詢中,我發現大量的latch free等待,再次查詢時這些等待消失,應用也恢復了正常.
SQL> select sid,event,p1,p1raw from v$session_wait where event not like ‘SQL*Net%‘; SID EVENT P1 P1RAW ---------- ---------------------------------------------------------------- ---------- -------- 2 pmon timer 300 0000012C 1 rdbms ipc message 300 0000012C 4 rdbms ipc message 300 0000012C 6 rdbms ipc message 180000 0002BF20 18 rdbms ipc message 6000 00001770 102 rdbms ipc message 6000 00001770 178 rdbms ipc message 6000 00001770 194 rdbms ipc message 6000 00001770 311 rdbms ipc message 6000 00001770 3 log file parallel write 1 00000001 148 log file sync 2547 000009F3 SID EVENT P1 P1RAW ---------- ---------------------------------------------------------------- ---------- -------- 273 log file sync 2544 000009F0 190 log file sync 2545 000009F1 5 smon timer 300 0000012C 14 rows selected. |
接下來我們來看這些latch free等待的是哪些latch
SQL> select addr,latch#,name,gets,spin_gets from v$latch order by spin_gets; ADDR LATCH# NAME GETS SPIN_GETS -------- ---------- ---------------------------------------------------------------- ---------- ---------- 80001398 3 session switching 111937 0 80002010 6 longop free list 37214 0 800023A0 7 cached attr list 0 0 80002628 10 event group latch 2391668 0 ..... 80003F3C 28 message pool operations parent latch 3 0 ..... 80006030 60 mostly latch-free SCN 19 0 80005F8C 59 file number translation table 68 0 80005F14 58 dlm cr bast queue latch 0 0 80005E8C 57 name-service request 0 0 80005E14 56 name-service memory objects 0 0 80005DA0 55 name-service namespace bucket 0 0 ADDR LATCH# NAME GETS SPIN_GETS -------- ---------- ---------------------------------------------------------------- ---------- ---------- 80005D2C 54 name-service pending queue 0 0 80005CB4 53 name-service request queue 0 0 80004E08 52 name-service entry 0 0 80008AB0 76 KCL lock element parent latch 0 0 80008A48 75 KCL instance latch 0 0 80007F18 73 redo copy 816 0 80007BBC 71 archive process latch 0 0 80007B54 70 archive control 1 0 80006A10 68 Active checkpoint queue latch 2003308 0 800064B0 66 large memory latch 0 0 80006448 65 cache protection latch 0 0 ADDR LATCH# NAME GETS SPIN_GETS -------- ---------- ---------------------------------------------------------------- ---------- ---------- 800060EC 61 batching SCNs 0 0 8000CAB0 96 global transaction 6833807 0 8000CA48 95 global tx free list 58258 0 8000C238 93 cost function 0 0 80009FCC 91 temp lob duration state obj allocation 0 0 8000995C 87 ktm global data 8118 0 80009228 84 transaction branch allocation 282388 0 80008EC4 80 begin backup scn array 6968 0 80008D54 79 loader state object freelist 42712 0 80008B80 78 KCL freelist latch 0 0 80008B18 77 KCL name table latch 0 0 ADDR LATCH# NAME GETS SPIN_GETS -------- ---------- ---------------------------------------------------------------- ---------- ---------- 8000D484 118 presentation list 0 0 8000D41C 117 session timer 855944 0 ..... 8000E9D0 129 process queue 44 0 8000E900 127 query server freelists 66 0 8000FC84 140 AQ Propagation Scheduling System Load 0 0 8000E898 126 query server process 10 0 8000E27C 125 job_queue_processes parameter latch 111937 0 8000DA1C 124 NLS data objects 2 0 ADDR LATCH# NAME GETS SPIN_GETS -------- ---------- ---------------------------------------------------------------- ---------- ---------- 8000D95C 123 ncodef allocation latch 111937 0 8000D674 122 virtual circuits 0 0 8000D60C 121 virtual circuit queues 159877 0 8000D5A4 120 virtual circuit buffers 0 0 8000D4EC 119 address list 2 0 ..... 8000CD70 102 Direct I/O Adaptor 2 0 ..... 80002408 8 GDS latch 30 0 800092E4 85 sort extent pool 69834 1 8000EC38 132 parallel query alloc buffer 80 1 8000E968 128 error message lists 22 1 80001400 4 process group creation 2615542 2 8000EAA0 131 parallel query stats 14 2 ADDR LATCH# NAME GETS SPIN_GETS -------- ---------- ---------------------------------------------------------------- ---------- ---------- 8000CD08 101 Token Manager 1151107 2 8000CB18 97 global tx hash mapping 507846 2 80006378 63 cache buffer handles 315924 4 8000EA38 130 process queue reference 190993 5 80003E3C 26 channel handle pool latch 2391680 18 80003EAC 27 channel operations parent latch 4783425 24 80009B90 89 intra txn parallel recovery 32654 33 8000FCF8 141 fixed table rows for x$hs_session 161368 41 800012C8 1 process allocation 2391688 154 80009B28 88 parallel txn reco latch 174519 271 8000CCA0 100 library cache load lock 14947545 5958 ADDR LATCH# NAME GETS SPIN_GETS -------- ---------- ---------------------------------------------------------------- ---------- ---------- 8000C8D0 94 user lock 13086412 6078 8000914C 82 list of block allocation 120650357 12024 80006A78 69 Checkpoint queue latch 154361751 17686 80009D34 90 sequence cache 64611720 32027 80009090 81 dml lock allocation 234465024 45351 800091C0 83 transaction allocation 214227648 48345 800096AC 86 undo global data 188271244 49641 800028A0 13 enqueue hash chains 373244264 131322 80007E04 72 redo allocation 439389808 201498 80001468 5 session idle bit 2039097976 204969 80002838 12 enqueues 471338482 273695 ADDR LATCH# NAME GETS SPIN_GETS -------- ---------- ---------------------------------------------------------------- ---------- ---------- 80001330 2 session allocation 261826230 428312 800063E0 64 multiblock read objects 1380614923 1366278 800026B8 11 messages 207935758 1372606 80001218 0 latch wait list 203479569 1445342 80006310 62 cache buffers chains 3.8472E+10 2521699 8000A17C 92 row cache objects 1257586714 2555872 80007F80 74 redo writing 264722932 4458044 80006700 67 cache buffers lru chain 5664313769 30046921 8000CBD0 98 shared pool 122433688 59070585 8000CC38 99 library cache 4414533796 1037032730 142 rows selected. SQL> select startup_time from v$instance; STARTUP_T --------- 13-AUG-04 檢查數據庫啟動時間 |
我們註意到,在當前數據庫中競爭最嚴重的兩個latch是shared pool和library cache.
顯然這極有可能是SQL的過度解析造成的.
進一步我們檢查v$sqlarea發現:
SQL> select sql_text,VERSION_COUNT,INVALIDATIONS,PARSE_CALLS,OPTIMIZER_MODE, PARSING_USER_ID,PARSING_SCHEMA_ID,ADDRESS,HASH_VALUE 2 from v$sqlarea where version_count >1000; SQL_TEXT ------------------------------------------------------------------------------------------------------------------------ VERSION_COUNT INVALIDATIONS PARSE_CALLS OPTIMIZER_MODE PARSING_USER_ID PARSING_SCHEMA_ID ADDRESS HASH_VALUE ------------- ------------- ----------- ------------------------- --------------- ----------------- -------- ---------- insert into sms_log (MSGDATE,MSGTIME,MSGID,MSGKIND,MSGTYPE,MSGTYPE_MOMT,MSGLEN,MSGSTATUS,AREAID,IFIDDEST,IFIDSRC,ADDRSRC ,ADDRDEST,ADDRFEE,ADDRUSER,SERVICECODE,PLANID,FEETYPE,FEEVALUE,DATACODING,FLAGS,SMLEN,SMCONT) values (:b0,:b1,:b2,:b3,:b 4,:b5,:b6,:b7,:b8,:b9,:b10,:b11,:b12,:b13,:b14,:b15,:b16,:b17,:b18,:b19,:b20,:b21,:b22) 7023 0 1596 MULTIPLE CHILDREN PRESENT 36 36 C82AF1C8 3974744754 |
這就是寫日誌記錄的代碼,這段代碼使用了綁定變量,但是version_count卻有7023個.
也就是這個sql有7023個子指針.這是不可想象的.
通過前面幾節的研究我們知道,如果這個sql有7023個子指針
那麽意味著這些子指針都將存在於同一個Bucket的鏈表上
那麽這也就意味著,如果同樣SQL再次執行,Oracle將不得不搜索這個鏈表以尋找可以共享的SQL.
這將導致大量的library cache latch的競爭.
這時候我開始猜測原因:
1.可能代碼存在問題,在每次執行之前程序修改某些session參數,導致sql不能共性
2.可能是8.1.5的v$sqlarea記錄存在問題,我們看到的結果是假象:)
3.Bug
Ok,我們的診斷不能停.
最直接的我dump內存來看:
SQL> ALTER SESSION SET EVENTS ‘immediate trace name LIBRARY_CACHE level 4‘;
察看trace文件得到如下結果(摘錄包含該段代碼的片斷):
BUCKET 21049: LIBRARY OBJECT HANDLE: handle=c82af1c8 name= insert into sms_log (MSGDATE,MSGTIME,MSGID,MSGKIND,MSGTYPE,MSGTYPE_MOMT,MSGLEN,MSGSTATUS,AREAID,IFIDDEST,IFIDSRC, ADDRSRC,ADDRDEST,ADDRFEE,ADDRUSER,SERVICECODE,PLANID,FEETYPE,FEEVALUE,DATACODING,FLAGS,SMLEN,SMCONT) values (:b0,:b1,:b2,:b3,:b4,:b5,:b6,:b7,:b8,:b9,:b10,:b11,:b12,:b13,:b14,:b15,:b16,:b17,:b18,:b19,:b20,:b21,:b22) hash=ece9cab2 timestamp=09-09-2004 12:51:29 namespace=CRSR flags=RON/TIM/PN0/LRG/[10010001] kkkk-dddd-llll=0000-0001-0001 lock=N pin=S latch=5 lwt=c82af1e0[c82af1e0,c82af1e0] ltm=c82af1e8[c82af1e8,c82af1e8] pwt=c82af1f8[c82af1f8,c82af1f8] ptm=c82af250[c82af250,c82af250] ref=c82af1d0[c82af1d0,c82af1d0] LIBRARY OBJECT: object=c1588e84 type=CRSR flags=EXS[0001] pflags= [00] status=VALD load=0 CHILDREN: size=7024 child# table reference handle ------ -------- --------- -------- 0 c1589040 c1589008 c668c2bc 1 c1589040 bfd179c4 c6ec9ee8 2 c1589040 bfd179e0 c2dd9b3c 3 c1589040 bfd179fc c5a46614 4 c1589040 bfd17a18 c35f1388 5 c1589040 bfd17a34 c77401bc 6 c1589040 bfd17a50 c4092838 7 c1589040 bfddb310 c6cd5258 8 c1589040 bfddb32c c63c6650 9 c1589040 bfddb348 c7e4e3d0 10 c1589040 bfddb364 c4c4b110 11 c1589040 bfddb380 c5950348 12 c1589040 bfddb39c c6c33aa4 13 c1589040 bfddb3b8 c672b0bc ........................................... .....ignore losts of child cursor here..... ........................................... 7001 bf595bc8 c641fba0 c6467890 7002 bf595bc8 c641fbbc c3417168 7003 bf595bc8 c641fbd8 c3417bb0 7004 bf595bc8 c641fbf4 c2fdccbc 7005 bf595bc8 c641fc10 c7f7ca50 7006 bf595bc8 c641fc2c c7f508ec 7007 bf595bc8 c641fc48 c268d8d8 7008 c641fcb8 c641fc64 bec61ed8 7009 c641fcb8 c641fc80 c4a6cc5c 7010 c641fcb8 c641fc9c c1a8aa34 7011 c641fcb8 c0ae4ea0 c0ae4ddc 7012 c641fcb8 c0ae4ebc bd55fe60 7013 c641fcb8 c0ae4ed8 c226914c 7014 c641fcb8 c0ae4ef4 c51dd2e0 7015 c641fcb8 c0ae4f10 c480c468 7016 c641fcb8 c0ae4f2c c60196d0 7017 c641fcb8 c0ae4f48 c4675d2c 7018 c641fcb8 c0ae4f64 bd5e2750 7019 c641fcb8 c0ae4f80 c09b1bb0 7020 c641fcb8 c0ae4f9c bf2d6044 7021 c641fcb8 c0ae4fb8 c332c1c4 7022 c641fcb8 c0ae4fd4 cbdde0f8 DATA BLOCKS: data# heap pointer status pins change ----- -------- -------- ------ ---- ------ 0 c3ef2c50 c1588f08 I/P/A 0 NONE |
這裏確實存在7023個子指針
查詢v$sql得到相同的結果:
SQL> select CHILD_NUMBER,EXECUTIONS,OPTIMIZER_MODE,OPTIMIZER_COST,PARSING_USER_ID,PARSING_SCHEMA_ID,ADDRESS,HASH_VALUE 2 from v$sql where HASH_VALUE=‘3974744754‘; CHILD_NUMBER EXECUTIONS OPTIMIZER_ OPTIMIZER_COST PARSING_USER_ID PARSING_SCHEMA_ID ADDRESS HASH_VALUE ------------ ---------- ---------- -------------- --------------- ----------------- -------- ---------- 0 12966 CHOOSE 238150 36 36 C82AF1C8 3974744754 1 7111 CHOOSE 238150 36 36 C82AF1C8 3974744754 2 9160 CHOOSE 238150 36 36 C82AF1C8 3974744754 3 9127 CHOOSE 238150 36 36 C82AF1C8 3974744754 4 8109 CHOOSE 238150 36 36 C82AF1C8 3974744754 5 4386 CHOOSE 238150 36 36 C82AF1C8 3974744754 6 4913 CHOOSE 238150 36 36 C82AF1C8 3974744754 7 3764 CHOOSE 238150 36 36 C82AF1C8 3974744754 8 3287 CHOOSE 238150 36 36 C82AF1C8 3974744754 9 3156 CHOOSE 238150 36 36 C82AF1C8 3974744754 ..... 7015 1 CHOOSE 238150 36 36 C82AF1C8 3974744754 7016 1 CHOOSE 238150 36 36 C82AF1C8 3974744754 7017 0 CHOOSE 238150 36 36 C82AF1C8 3974744754 CHILD_NUMBER EXECUTIONS OPTIMIZER_ OPTIMIZER_COST PARSING_USER_ID PARSING_SCHEMA_ID ADDRESS HASH_VALUE ------------ ---------- ---------- -------------- --------------- ----------------- -------- ---------- 7018 9396 NONE 0 0 C82AF1C8 3974744754 7019 5008 CHOOSE 237913 36 36 C82AF1C8 3974744754 7020 625 CHOOSE 237913 36 36 C82AF1C8 3974744754 7021 10101 CHOOSE 237913 36 36 C82AF1C8 3974744754 7022 7859 CHOOSE 237913 36 36 C82AF1C8 3974744754 7023 rows selected. |
這裏確實存在7023個子指針,第二種猜測被否定了,同時研發發過來的代碼也不存在第一種情況.
那麽只能是第三種情況了,Oracle的Bug,Ok,那我們需要找到解決辦法.
搜索Metalink,發現Bug:1210242
該Bug描述為:
On certain SQL statements cursors are not shared when TIMED_STATISTICS is enabled.
碰巧我這個數據庫的TIMED_STATISTICS設置為True
修改TIMED_STATISTICS為False以後,觀察v$sql,發現有效子指針很快下降到2個.
SQL> select CHILD_NUMBER,OPTIMIZER_COST,OPTIMIZER_MODE,EXECUTIONS,ADDRESS from v$sql where hash_value=3974744754 and OPTIMIZER_MODE=‘CHOOSE‘; CHILD_NUMBER OPTIMIZER_COST OPTIMIZER_ EXECUTIONS ADDRESS ------------ -------------- ---------- ---------- -------- 0 238167 CHOOSE 63943 C82AF1C8 1 238300 CHOOSE 28915 C82AF1C8 |
第二天下降到只有一個.
SQL> select CHILD_NUMBER,OPTIMIZER_COST,OPTIMIZER_MODE,EXECUTIONS,ADDRESS from v$sql where hash_value=3974744754 and OPTIMIZER_MODE=‘CHOOSE‘; CHILD_NUMBER OPTIMIZER_COST OPTIMIZER_ EXECUTIONS ADDRESS ------------ -------------- ---------- ---------- -------- 0 238702 CHOOSE 578124 C82AF1C8 |
短信群發從此正常.
對於這個問題,另外一個可選的方法是設置一個隱含參數:
_sqlexec_progression_cost = 0
這個參數的具體含義為:
SQL execution progression monitoring cost threshold
即:SQL執行進度監控成本閥值
這個參數根據COST來決定需要監控的SQL.執行進度監控會引入額外的函數調用和Row Sources
這可能導致SQL的執行計劃或成本發生改變,從而產生不同的子指針.
_sqlexec_progression_cost 的缺省值為1000,成本大於1000的所有SQL都會被監控
如果該參數設置為0,那麽SQL的執行進度將不會被跟蹤.
執行進度監控信息會被記錄到V$SESSION_LONGOPS視圖中,如果Time_statistics參數設置為False,那麽這個信息就不會被記錄.
所以,Time_statistics參數和_sqlexec_progression_cost是解決問題的兩個途徑.
通過查詢我們也可以看到,在這個數據庫中,OPTIMIZER_COST >1000的SQL主要有以下五個:
SQL> select distinct(sql_text) from v$sql where OPTIMIZER_COST >1000; SQL_TEXT -------------------------------------------------------------------------------- insert into sms_detail_error (msgdate,addruser,msgid,areaid,reason,spnumber,msgt ime,ifiddest,msqkey,servicecode,planid,feetype,feevalue,smcont,submittimes,submi tdate,submittime,msgstate_resp,errorcode_resp,msgstate_rept,errorcode_rept) valu es (:b0,:b1,:b2,:b3,:b4,:b5,:b6,:b7,:b8,:b9,:b10,:b11,:b12,:b13,:b14,:b15,:b16,: b17,:b18,:b19,:b20) insert into sms_detail_success (msgdate,addruser,msgid,areaid,spnumber,msgtime,i fiddest,servicecode,planid,feetype,feevalue,smcont,submittimes,submitdate,submit time,respdate,resptime,reptdate,repttime,msqkey) values (:b0,:b1,:b2,:b3,:b4,:b5 ,:b6,:b7,:b8,:b9,:b10,:b11,:b12,:b13,:b14,:b15,:b16,:b17,:b18,:b19) insert into sms_log (MSGDATE,MSGTIME,MSGID,MSGKIND,MSGTYPE,MSGTYPE_MOMT,MSGLEN,M SGSTATUS,AREAID,IFIDDEST,IFIDSRC,ADDRSRC,ADDRDEST,ADDRFEE,ADDRUSER,SERVICECODE,P LANID,FEETYPE,FEEVALUE,DATACODING,FLAGS,SMLEN,SMCONT) values (:b0,:b1,:b2,:b3,:b 4,:b5,:b6,:b7,:b8,:b9,:b10,:b11,:b12,:b13,:b14,:b15,:b16,:b17,:b18,:b19,:b20,:b2 1,:b22) insert into sms_resprept_error (msgdate,areaid,addruser,msgid,submittimes,submit date,submittime,msgid_gw,msgstate_resp,errorcode_resp,msgstate_rept,errorcode_re pt,servicecode) values (:b0,:b1,:b2,:b3,:b4,:b5,:b6,:b7,:b8,:b9,:b10,:b11,:b12) insert into sms_statusrept (reptdate,addruser,msgid_gw,repttime,statustype,msgid _stus,msgstate,errorcode) values (:b0,:b1,:b2,:b3,:b4,:b5,:b6,:b7) |
而這五個SQL中,在v$sqlarea中,有四個version_count都在10以上:
SQL> select sql_text,version_count from v$sqlarea where version_count>10; SQL_TEXT -------------------------------------------------------------------------------- VERSION_COUNT ------------- insert into sms_detail_error (msgdate,addruser,msgid,areaid,reason,spnumber,msgt ime,ifiddest,msqkey,servicecode,planid,feetype,feevalue,smcont,submittimes,submi tdate,submittime,msgstate_resp,errorcode_resp,msgstate_rept,errorcode_rept) valu es (:b0,:b1,:b2,:b3,:b4,:b5,:b6,:b7,:b8,:b9,:b10,:b11,:b12,:b13,:b14,:b15,:b16,: b17,:b18,:b19,:b20) 42 insert into sms_log (MSGDATE,MSGTIME,MSGID,MSGKIND,MSGTYPE,MSGTYPE_MOMT,MSGLEN,M SGSTATUS,AREAID,IFIDDEST,IFIDSRC,ADDRSRC,ADDRDEST,ADDRFEE,ADDRUSER,SERVICECODE,P LANID,FEETYPE,FEEVALUE,DATACODING,FLAGS,SMLEN,SMCONT) values (:b0,:b1,:b2,:b3,:b 4,:b5,:b6,:b7,:b8,:b9,:b10,:b11,:b12,:b13,:b14,:b15,:b16,:b17,:b18,:b19,:b20,:b2 1,:b22) 7026 insert into sms_resprept_error (msgdate,areaid,addruser,msgid,submittimes,submit date,submittime,msgid_gw,msgstate_resp,errorcode_resp,msgstate_rept,errorcode_re pt,servicecode) values (:b0,:b1,:b2,:b3,:b4,:b5,:b6,:b7,:b8,:b9,:b10,:b11,:b12) 301 insert into sms_statusrept (reptdate,addruser,msgid_gw,repttime,statustype,msgid _stus,msgstate,errorcode) values (:b0,:b1,:b2,:b3,:b4,:b5,:b6,:b7) 41 |
具體可以參考Metalink: Note 62143
至此這個關於shared pool的問題找到了原因,並得以及時解決.
關於shared pool的深入探討(六)-高Latch競爭案例