1. 程式人生 > 其它 >ASM無法啟動的問題分析(二)(r7筆記第88天)

ASM無法啟動的問題分析(二)(r7筆記第88天)

第一篇的內容可以參考。ASM無法啟動的問題分析(一),有不少的朋友給了一些建議,我也糅合了進來。一併感謝。

當然重啟服務發現CSSD服務是Online,但是ASM是無法啟動。

[grid@testbiadmin]$ crs_stat -t
Name           Type           Target    State    Host        
------------------------------------------------------------
ora.DATA01.dg  ora....up.type ONLINE    ONLINE   rolequery   
ora.FLASH01.dgora....up.type ONLINE    ONLINE    rolequery  
ora....ER.lsnrora....er.type ONLINE    ONLINE    rolequery  
ora.asm        ora.asm.type   ONLINE   OFFLINE               
ora.cssd       ora.cssd.type  ONLINE   ONLINE    rolequery   
ora.testbi.db  ora....se.type ONLINE    OFFLINE               
ora.testbi1.dbora....se.type OFFLINE   OFFLINE               
ora.diskmon    ora....on.type OFFLINE   OFFLINE               
ora.evmd       ora.evm.type   ONLINE   OFFLINE               
ora.ons       ora.ons.type   OFFLINE  OFFLINE               
ora....rydb.dbora....se.type ONLINE    OFFLINE        

手工啟動ASM的服務也是報錯

[grid@testbiadmin]$ srvctl start asm
PRCR-1079: Failed to start resource ora.asm
CRS-5017:The resource action "ora.asm start" encountered the following error: 
ORA-01078:failure in processing system parameters
ORA-29701:unable to connect to Cluster Synchronization Service
.For details refer to "(:CLSN00107:)" in"/home/U01/app/grid/11.2.3/log/testbi/agent/ohasd/oraagent_grid/oraagent_grid.log".
CRS-2674:Start of 'ora.asm' on 'rolequery' failed

對應的trace日誌也沒有發現特別的異常之處。

2016-01-2221:36:46.513: [ora.DATA01.dg][1444443904] {0:10:2} [check] abort command: check
2016-01-2221:36:46.513: [ora.FLASH01.dg][1442342656] {0:10:2} [check] abort command:check
2016-01-2221:36:46.513: [ora.DATA01.dg][1444443904] {0:10:2} [check] tryActionLock {
2016-01-2221:36:46.513: [ora.FLASH01.dg][1442342656] {0:10:2} [check] tryActionLock {
2016-01-2221:36:48.506: [ora.DATA01.dg][1457051392] {0:10:2} [check] DgpAgent::getConnxnabort
2016-01-2221:36:48.506: [   AGENT][1457051392]{0:10:2} UserErrorException: Locale is 
2016-01-2221:36:48.506: [ora.DATA01.dg][1457051392] {0:10:2} [check] clsnUtils::errorException type=2 string=
CRS-5017:The resource action "ora.DATA01.dg check" encountered the followingerror: 
DgpAgent::getConnxnaborted. For details refer to "(:CLSN00109:)" in"/home/U01/app/grid/11.2.3/log/testbi/agent/ohasd/oraagent_grid/oraagent_grid.log".
2016-01-2221:36:48.506: [ora.FLASH01.dg][1452848896] {0:10:2} [check] DgpAgent::getConnxnabort
2016-01-2221:36:48.506: [   AGENT][1452848896]{0:10:2} UserErrorException: Locale is 
2016-01-2221:36:48.507: [ora.FLASH01.dg][1452848896] {0:10:2} [check] clsnUtils::errorException type=2 string=
CRS-5017:The resource action "ora.FLASH01.dg check" encountered the followingerror: 
DgpAgent::getConnxnaborted. For details refer to "(:CLSN00109:)" in"/home/U01/app/grid/11.2.3/log/testbi/agent/ohasd/oraagent_grid/oraagent_grid.log".
2016-01-2221:36:48.507: [ora.LISTENER.lsnr][1452848896] {0:10:2} [check] LsnrAgent::check{
2016-01-2221:36:48.507: [ora.LISTENER.lsnr][1452848896] {0:10:2} [check] lsnrctl statusLISTENER
2016-01-2221:36:48.507: [ora.LISTENER.lsnr][1452848896] {0:10:2} [check]getOracleHomeAttrib: oracle_home = /home/U01/app/grid/11.2.3
2016-01-2221:36:48.507: [ora.LISTENER.lsnr][1452848896] {0:10:2} [check]getOracleHomeAttrib: oracle_home = /home/U01/app/grid/11.2.3
2016-01-2221:36:48.508: [ora.LISTENER.lsnr][1452848896] {0:10:2} [check]Utils::getCrsHome crsHome /home/U01/app/grid/11.2.3
2016-01-2221:36:48.508: [ora.LISTENER.lsnr][1452848896] {0:10:2} [check] Utils:execCmdaction = 3 flags = 38 ohome = (null) cmdname = lsnrctl.

不過看起來似乎是在LISTENER的部分出現了問題。

檢視grid中的listener配置,發現ASM例項的狀態是UNKNOWN,這個著實有些奇怪。

LSNRCTL>status
Connectingto (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=EXTPROC1521)))
ListenerParameter File  /home/U01/app/grid/11.2.3/network/admin/listener.ora
ListenerLog File        /home/U01/app/grid/diag/tnslsnr/rolequery/listener/alert/log.xml
ListeningEndpoints Summary...
 (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=EXTPROC1521)))
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=rolequery.test.com)(PORT=1521)))
ServicesSummary...
Service"+ASM" has 1 instance(s).
  Instance "+ASM", status UNKNOWN,has 1 handler(s) for this service...
Service"testbi" has 1 instance(s).
  Instance "testbi", status UNKNOWN,has 1 handler(s) for this service...
Thecommand completed successfully
LSNRCTL>service
Connectingto (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=EXTPROC1521)))
ServicesSummary...
Service"+ASM" has 1 instance(s).
  Instance "+ASM", status UNKNOWN,has 1 handler(s) for this service...
    Handler(s):
      "DEDICATED" established:494refused:0
         LOCAL SERVER
Service"testbi" has 1 instance(s).
  Instance "testbi", status UNKNOWN,has 1 handler(s) for this service...
    Handler(s):
      "DEDICATED" established:6069refused:748
         LOCAL SERVER
Thecommand completed successfully

這個時候開始檢視從什麼時候開始出現這個ASM的報錯資訊,看看系統層面有什麼特別的改動。

ThuJan 14 17:03:00 2016
LNS:Standby redo logfile selected for thread 1 sequence 544 for destinationLOG_ARCHIVE_DEST_3
ArchivedLog entry 1261 added for thread 1 sequence 543 ID 0xb380aad0 dest 1:
ThuJan 14 17:46:18 2016
ERROR:unrecoverable error ORA-29701 raised in ASM I/O path; terminating process 16626
ThuJan 14 17:48:18 2016
ERROR:unrecoverable error ORA-29701 raised in ASM I/O path; terminating process 16847
ThuJan 14 17:50:08 2016
ERROR:unrecoverable error ORA-29701 raised in ASM I/O path; terminating process 16888

可以看到是在1月14日 開始出現這個問題。

那麼1月14日還有什麼變更呢,系統級響應的配置也非常有限了。/etc/hosts是其中一個,目前發現最有可疑,因為時間點比較接近,當然也有一些疑點需要說服自己。

[oracle@testbi~]$ ll /etc/hosts
-rw-r--r--1 root root 921 Jan 14 14:28 /etc/hosts

那我們來看看/etc/hosts,到底有什麼改動。當然重點是下面的部分。

[grid@rolequery~]$ cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4localhost4.localdomain4
::1         localhost localhost.localdomain localhost6localhost6.localdomain6
10.11.64.174   dbbak_yq_10.11.64.174_tel      # db_ccs
10.11.1.26     testmon.test.com        
10.11.54.190    testmon.test.tc         
192.168.97.143  xxx.no.sohu.com
10.11.2.86     testmon.test.com        
10.11.64.28   testbi.test.com
10.11.64.28    rolequery.test.com
10.11.0.68     s2testbi.test.com

從配置裡面看給這個伺服器配置了多個主機名。看起來這個問題最可疑,可以簡單驗證一下是否是這樣。把多餘的那個主機名給註釋掉。#10.11.64.28 testbi.test.com

然後重啟cssd的服務,雖然報錯,但是似乎有了reload的功效。

[grid@rolequery~]$ crs_start ora.cssd 
CRS-5702:Resource 'ora.cssd' is already running on 'rolequery'
CRS-0223:Resource 'ora.cssd' has placement error.

再次啟動asm就成功了,看來還真是這個主機名的配置導致的問題。

[grid@rolequery~]$  crs_start ora.asm 
Attemptingto start `ora.asm` on member `rolequery`
Startof `ora.asm` on member `rolequery` succeeded.

所以通過上面的情況就可以基本判定就是這個主機名導致的問題了,在這個基礎上資料庫例項啟動就很順利。

使用kfod檢視asm中的檔案情況,這個時候都是正常的了。

[grid@rolequery~]$ kfod asm_diskstring='/dev/asm-sda9' disk=all
--------------------------------------------------------------------------------
 Disk         Size Path                                     User     Group  
================================================================================
   1:    153605 Mb /dev/asm-sda9                            grid     asmadmin
--------------------------------------------------------------------------------
ORACLE_SIDORACLE_HOME                                                         
================================================================================
      +ASM /home/U01/app/grid/11.2.3

這個時候有一個疑問是為什麼1月14號的問題,到了21號才開始出現發現真正有問題。

所以還是需要說服自己。

我們可以從alert日誌看出來,其實後臺是在強制終止了一些程序,那麼這些程序的情況是怎麼樣的呢。

得到了問題時間段的一些日誌情況,可以看到大多數都是j000和m000的程序。會有個別的ora程序會被終止,但是數量很少。對於統計的系統來說還是很難發現的了。

-rw-r----- 1 oracle asmadmin      3252 Jan 14 23:58 testbi_j000_12684.trc
-rw-r-----1 oracle asmadmin        73 Jan 15 00:00 testbi_m000_12855.trm
-rw-r-----1 oracle asmadmin      3163 Jan 15 00:00 testbi_m000_12855.trc
-rw-r-----1 oracle asmadmin        73 Jan 15 00:00 testbi_m001_12853.trm
-rw-r-----1 oracle asmadmin      3529 Jan 15 00:00 testbi_m001_12853.trc
-rw-r-----1 oracle asmadmin        84 Jan 15 00:00 testbi_j000_12874.trm
-rw-r-----1 oracle asmadmin      3254 Jan 15 00:00 testbi_j000_12874.trc
-rw-r-----1 oracle asmadmin        73 Jan 15 00:01 testbi_ora_12934.trm
-rw-r-----1 oracle asmadmin      1339 Jan 15 00:01 testbi_ora_12934.trc
-rw-r-----1 oracle asmadmin        73 Jan 15 00:01 testbi_ora_12937.trm
-rw-r-----1 oracle asmadmin      1325 Jan 15 00:01 testbi_ora_12937.trc
-rw-r-----1 oracle asmadmin        73 Jan 15 00:02 testbi_j000_13142.trm
-rw-r-----1 oracle asmadmin      3255 Jan 15 00:02 testbi_j000_13142.trc
-rw-r-----1 oracle asmadmin        73 Jan 15 00:04 testbi_j000_13201.trm

具體的日誌情況如下。對於j000的程序,對於的日誌如下:

testbi_j000_12874.trc
kgxgncin:CLSS init failed with status 3
kgxgncin:return status 3 (1311719766 SKGXN not av) from CLSS
NOTE:kfmsInit: ASM failed to initialize group services
ErrorORA-29701 signaled atksedsts()+461<-ksf_short_stack()+77<-kge_snap_callstack()+63<-kge_sigtrace_dump()+69<-kgepop()+750<-kgeselv()+290<-ksesecl0()+162<-kfmsInit()+175<-kfmsSlvReg()+1142<-kfmdSlvOpPriv()+2207<-kfmdWriteSubmitted()+414<-kfk_process_an_ioq()+210<-kfk_submit_io()+74<-kfk_io1()+1089<-kfkRequest()+14<-kfk_transitIO()+1696<-kfioSubmitIO()+4740<-kfioRequestPriv()+182<-kfioRequest()+706<-ksfd_kfioRequest()+649<-ksfd_osmgo()+256<-ksfdgo()+861<-ksfdaio()+2504<-kcflbi()+908<-kcbldio()+3036<-kcblio()+318<-kdblsync()+436<-kdblcmtt()+3216<-klclisync()+256<-qerltFetch()+799<-insdlexe()+409<-insExecStmtExecIniEngine()+82<-insexe()+1845<-qes3tExecSQL()+710<-qerleStart()+176<-rwsstd()+249<-qerjoStart()+577<-selexe0()+818<-opiexe()+5726<-opipls()+2135<-opiodr()+916<-rpidrus()+211<-skgmstack()+148<-rpiswu2()+638<-rpidrv()+1384<-psddr0()+473<-psdnal()+457<-pevm_EXIM()+308<-pfrinstr_EXIM()+53<-pfrrun_no_tool()+63<-pfrrun()+627<-plsql_run()+649<-peicnt()+301<-kkxexe()+525<-opiexe()+17677<-opiodr()+916<-rpidrus()+211<-skgmstack()+148<-
ERROR:unrecoverable error ORA-29701 raised in ASM I/O path; terminating process 12874
-----Abridged Call Stack Trace -----
ksedsts()+461<-kfk_io1()+2863<-kfkRequest()+14<-kfk_transitIO()+1696<-kfioSubmitIO()+4740<-kfioRequestPriv()+182<-kfioRequest()+706<-ksfd_kfioRequest()+649<-ksfd_osmgo()+256<-ksfdgo()+861<-ksfdaio()+2504<-kcflbi()+908<-kcbldio()+3036<-kcblio()+318<-kdblsync()+436
<-kdblcmtt()+3216<-klclisync()+256<-qerltFetch()+799<-insdlexe()+409<-insExecStmtExecIniEngine()+82<-insexe()+1845<-qes3tExecSQL()+710<-qerleStart()+176<-rwsstd()+249<-qerjoStart()+577<-selexe0()+818<-opiexe()+5726<-opipls()+2135<-opiodr()+916<-rpidrus()+211
<-skgmstack()+148<-rpiswu2()+638<-rpidrv()+1384<-psddr0()+473<-psdnal()+457<-pevm_EXIM()+308<-pfrinstr_EXIM()+53<-pfrrun_no_tool()+63<-pfrrun()+627<-plsql_run()+649<-peicnt()+301<-kkxexe()+525<-opiexe()+17677<-opiodr()+916<-rpidrus()+211<-skgmstack()+148<-rpiswu2()+638
<-rpidrv()+1384<-rpiexe()+74<-kkjex1e()+8650<-kkjsexe()+711<-kkjrdp()+694<-opirip()+958<-opidrv()+603<-sou2o()+103<-opimai_real()+266<-ssthrdmain()+252<-main()+201<-__libc_start_main()+253<-_start()+36
-----End of Abridged Call Stack Trace -----

對於應用連線程序,仔細檢視發現竟然是rman的程序,所以也可以排除。

$less testbi_ora_12934.trc
Tracefile /home/U01/app/oracle/diag/rdbms/testbi/testbi/trace/testbi_ora_12934.trc
OracleDatabase 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
Withthe Partitioning, Automatic Storage Management, OLAP, Data Mining
andReal Application Testing options
ORACLE_HOME= /home/U01/app/oracle/product/11.2.3/db_1
Systemname:    Linux
Nodename:      testbi.test.com
Release:        2.6.32-279.el6.x86_64
Version:        #1 SMP Wed Jun 13 18:24:36 EDT 2012
Machine:        x86_64
Instancename: testbi
Redothread mounted by this instance: 1
Oracleprocess number: 55
Unixprocess pid: 12934, image: [email protected] (TNS V1-V3)
***2016-01-15 00:01:02.534
***SESSION ID:(1988.8917) 2016-01-15 00:01:02.534
***CLIENT ID:() 2016-01-15 00:01:02.534
***SERVICE NAME:(SYS$USERS) 2016-01-15 00:01:02.534
***MODULE NAME:([email protected] (TNS V1-V3)) 2016-01-15 00:01:02.534
***ACTION NAME:(0000001 FINISHED70) 2016-01-15 00:01:02.534

最後好不容易找到一個JDBC連線的程序trace日誌,已經是幾天以後的一個日誌了,所以這個問題還是比較難發現的,當然到了21號開始,因為後期有一些大的推廣活動,大批量的應用要連進來,沒想到在這個時候就出現了問題。

[oracle@testbitrace]$ cat testbi_ora_28550.trc
Tracefile /home/U01/app/oracle/diag/rdbms/testbi/testbi/trace/testbi_ora_28550.trc
OracleDatabase 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
Withthe Partitioning, Automatic Storage Management, OLAP, Data Mining
andReal Application Testing options
ORACLE_HOME= /home/U01/app/oracle/product/11.2.3/db_1
Systemname:    Linux
Nodename:      testbi.test.com
Release:        2.6.32-279.el6.x86_64
Version:        #1 SMP Wed Jun 13 18:24:36 EDT 2012
Machine:        x86_64
Instancename: testbi
Redothread mounted by this instance: 1
Oracleprocess number: 45
Unixprocess pid: 28550, image: [email protected]
***2016-01-16 00:52:34.100
***SESSION ID:(3687.4155) 2016-01-16 00:52:34.100
***CLIENT ID:() 2016-01-16 00:52:34.100
***SERVICE NAME:(SYS$USERS) 2016-01-16 00:52:34.100
***MODULE NAME:(JDBC Thin Client) 2016-01-16 00:52:34.100
***ACTION NAME:() 2016-01-16 00:52:34.100
CursorDiagnosticsNodes:  
  ChildNode: ChildNumber=3 ID=34 reason=Rolling Invalidate Window Exceeded(2)size=0x0 details=already_processed 
  ChildNode: ChildNumber=4 ID=34 reason=Rolling Invalidate Window Exceeded(3)size=2x4 invalidation_window=1452700223 ksugctm=1452700354 
  ChildNode: ChildNumber=1 ID=34 reason=Rolling Invalidate Window Exceeded(2)size=0x0 details=already_processed 
  ChildNode: ChildNumber=3 ID=34 reason=Rolling Invalidate Window Exceeded(3)size=2x4 invalidation_window=1452620854 ksugctm=1452621139 
  ChildNode: ChildNumber=5 ID=34 reason=Rolling Invalidate Window Exceeded(2)size=0x0 details=already_processed 

當然對於這個問題,儘管是找到了原因是主機名,但是還是帶有一定的運氣。怎麼來驗證這個問題呢。可以簡單做個測試。我們建立一個數據檔案,當然這個時候肯定是沒有問題的。

在修復之後的$ sqlplus / assysdba

SQL>create tablespace testdata datafile size 10M;
Tablespacecreated.

我們修改主機名,這個時候/etc/hosts裡面就有了兩個主機名的配置資訊。然後修改主機名

[root@testbidev]# hostname testbi.test.com
[root@testbidev]# su - oracle
[ [oracle@testbi~]$ sqlplus / as sysdba
SQL>create tablespace testdata2 datafile size 10M;
createtablespace testdata2 datafile size 10M
                                     *
ERRORat line 1:
ORA-03113:end-of-file on communication channel
ProcessID: 21248
SessionID: 2266 Serial number: 13

然後檢視alert日誌發現了期望之中的ASM報錯日誌。

VKRMstarted with pid=37, OS id=20767 
SatJan 23 19:08:09 2016
createtablespace testdata datafile size 10M
Completed:create tablespace testdata datafile size 10M
SatJan 23 19:09:07 2016
createtablespace testdata2 datafile size 10M
ERROR:unrecoverable error ORA-29701 raised in ASM I/O path; terminating process 21248
SatJan 23 19:09:33 2016
Startingbackground process SMCO
SatJan 23 19:09:33 2016
SMCOstarted with pid=24, OS id=21280

當然這些思路也參考了一些材料,短短几個小時,其實花費的時間其實也不多。但是解決了這個問題,就很釋然了。至少可以很理性的規避這個問題了。

當然對於這個問題的追溯其實也是一個畫蛇添足的操作,比如原來有一臺伺服器A,主機名為rolequery,但是後面移作他用,但是主機名一直沒有改動,最後為了規範,想改動一下,就配置了兩個主機名,對於備庫來說還是能夠識別的,結果在主庫這麼做就會有一些奇怪的問題,當然在線修改主機名還是一個敏感的操作,儘量避免,而且需要綜合評估。

參考資料如下:

HighOCSSD CPU Usage/ ERROR: unrecoverable error ORA-29701 raised in ASM I/O path;terminating process {nnnnn} (Doc ID 1491852.1)

ASMInstance Cannot Start Due To ORA-29701 Standalone (non-Rac) (Doc ID1489552.1)

Oracle Restart: ORA-29701 raised in ASM I/Opath; terminating process, Permission denied (13) (Doc ID 1328629.1)

http://www.freelists.org/post/oracle-l/unrecoverable-error-ORA29701-raised-in-ASM-IO-path-terminating-process

http://database.com.mk/wordpress/2011/02/19/error-unrecoverable-error-ora-29701-raised-in-asm-io-path-when-using-goldengate-replicat-process/

http://blog.csdn.net/rgb_rgb/article/details/7937928