RAC node fails to start with ORA-29702: problem and analysis (day 70)

Today I started RAC on my virtual machines and found that one node simply would not come up, while the other node was fine.

SQL> startup nomount 
ORA-29702: error occurred in Cluster Group Service operation

I tried crs_stat to check the status of the CRS components, but that failed as well.

-bash-4.1$ crs_stat -t 
CRS-0184: Cannot communicate with the CRS daemon.

The alert log shows that the instance was ultimately terminated because of error 29702.

SMON started with pid=20, OS id=12344 
Sun May 11 04:10:28 2014 
RECO started with pid=21, OS id=12346 
Sun May 11 04:10:28 2014 
MMON started with pid=22, OS id=12348 
Sun May 11 04:10:28 2014 
MMNL started with pid=23, OS id=12350 
starting up 1 dispatcher(s) for network address '(ADDRESS=(PARTIAL=YES)(PROTOCOL=TCP))'... 
starting up 1 shared server(s) ... 
USER (ospid: 12242): terminating the instance due to error 29702 
Instance terminated by USER, pid = 12242
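
To pull the error number out of a long alert log without scrolling, something like the following sketch may help. The function name `terminating_error` is made up for illustration, and the pattern is based only on the message format in the excerpt above; other releases may word the line differently.

```shell
# Hypothetical helper (not an Oracle tool): read an alert log on stdin and
# print the error number from each "terminating the instance" line.
terminating_error() {
  sed -n 's/.*terminating the instance due to error \([0-9][0-9]*\).*/\1/p'
}
```

Feed it the alert log on stdin (for example `terminating_error < /path/to/alert.log`, where the path is a placeholder) and pass the printed number to `oerr ora`.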

Oracle's own explanation of this error is as follows.

-bash-4.1$ oerr ora 29702 
29702, 00000, "error occurred in Cluster Group Service operation" 
// *Cause: An unexpected error occurred while performing a CGS operation. 
// *Action: Verify that the LMON process is still active. 
//          Check the Oracle LMON trace files for errors. 
//          Also, check the related CSS trace file for errors.

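The LMON trace file reads as follows. The relevant lines are near the end: CLSS initialization fails with status 3, which suggests the instance could not reach the cluster layer.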

Trace file /u04/app/11.2.0/db/diag/rdbms/racdb/RACDB1/trace/RACDB1_lmon_12324.trc 
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production 
With the Partitioning, Real Application Clusters, Oracle Label Security, OLAP, 
Data Mining and Real Application Testing options 
ORACLE_HOME = /u04/app/11.2.0/db/product/11.2.0/dbhome_1 
System name:    Linux 
Node name:      rac1 
Release:        2.6.32-71.el6.x86_64 
Version:        #1 SMP Wed Sep 1 01:33:01 EDT 2010 
Machine:        x86_64 
VM name:        VMWare Version: 6 
Instance name: RACDB1 
Redo thread mounted by this instance: 0 <none> 
Oracle process number: 11 
Unix process pid: 12324, image: oracle@rac1 (LMON)
*** 2014-05-11 04:10:27.777 
*** SESSION ID:(130.1) 2014-05-11 04:10:27.777 
*** CLIENT ID:() 2014-05-11 04:10:27.777 
*** SERVICE NAME:() 2014-05-11 04:10:27.777 
*** MODULE NAME:() 2014-05-11 04:10:27.777 
*** ACTION NAME:() 2014-05-11 04:10:27.777 
GES resources 5720 pool 3 
GES enqueues 8361 
GES IPC: Receivers 2  Senders 2 
GES IPC: Buffers  Receive 1000  Send (i:1030 b:471) Reserve 301 
GES IPC: Msg Size  Regular 1176  Batch 8376 
Batching factor: enqueue replay 206, ack 229 
Batching factor: cache replay 128 size per lock 64
*** 2014-05-11 04:10:28.644 
kjxggin: CGS tickets = 1000 
kgxgncin: CLSS init failed with status 3 
kgxgncin: return status 3 (1311719766 SKGXN not av) from CLSS 
kjxgmin: kgxgncin fails - (2) 
kjxggin: generic group layer init fails
*** 2014-05-11 04:10:28.655 
Global Enqueue Service Shutdown
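
Most of an LMON trace is header and sizing information, so filtering for the failure lines can save time. The sketch below assumes the failure messages look like the ones in this trace (`fail`, `SKGXN not av`). `scan_lmon_trace` is a hypothetical name, and the patterns are not an exhaustive list.

```shell
# Hypothetical helper: read an LMON trace on stdin and print only the lines
# that look like initialization failures. Patterns come from the trace above.
scan_lmon_trace() {
  # "|| true" keeps the exit status at 0 when nothing matches
  grep -E 'fail|SKGXN not av' || true
}
```

Run against the trace above (`scan_lmon_trace < RACDB1_lmon_12324.trc`), it should keep the four `kgxgncin`/`kjxgmin`/`kjxggin` failure lines and drop the rest.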

On this node, neither crs_stat nor crsctl operations helped.

-bash-4.1$ crsctl check crs 
CRS-4638: Oracle High Availability Services is online 
CRS-4535: Cannot communicate with Cluster Ready Services 
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon 
CRS-4534: Cannot communicate with Event Manager
-bash-4.1$ crs_start -all 
CRS-0184: Cannot communicate with the CRS daemon.
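
The `crsctl check crs` output mixes healthy and failing components, and a small filter makes the failing ones stand out. This is only a sketch: it assumes every healthy line ends in "is online", as the CRS-4638 line above does, and `crs_failures` is a made-up name.

```shell
# Hypothetical helper: read `crsctl check crs` output on stdin and print the
# CRS-nnnn lines that do NOT report a component as online.
crs_failures() {
  grep -E '^CRS-[0-9]+:' | grep -vE 'is online[[:space:]]*$' || true
}
```

With the output above piped in (`crsctl check crs | crs_failures`), the three "Cannot communicate" / "Communications failure" lines would remain and the CRS-4638 line would be dropped.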

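The process list shows that the lower-stack daemons (ohasd, gipcd, mdnsd, gpnpd) were indeed running. Note, though, that no crsd.bin or evmd.bin appears in it, which matches the crsctl check output above.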

-bash-4.1$ ps -ef|grep d.bin 
root      2103     1  0 May10 ?        00:00:51 /u04/app/11.2.0/grid/bin/ohasd.bin reboot 
grid      2297     1  0 May10 ?        00:00:32 /u04/app/11.2.0/grid/bin/oraagent.bin 
grid      2309     1  0 May10 ?        00:00:01 /u04/app/11.2.0/grid/bin/mdnsd.bin 
grid      2320     1  0 May10 ?        00:00:36 /u04/app/11.2.0/grid/bin/gpnpd.bin 
root      2330     1  0 May10 ?        00:00:14 /u04/app/11.2.0/grid/bin/orarootagent.bin 
grid      2333     1  0 May10 ?        00:02:39 /u04/app/11.2.0/grid/bin/gipcd.bin 
root      2348     1  1 May10 ?        00:12:00 /u04/app/11.2.0/grid/bin/osysmond.bin 
root      2569     1  0 May10 ?        00:03:55 /u04/app/11.2.0/grid/bin/ologgerd -M -d /u04/app/11.2.0/grid/crf/db/rac1 
grid     12569  9580  0 04:25 pts/1    00:00:00 grep d.bin
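
Spotting which daemons are absent from a `ps` listing by eye is error-prone. The sketch below checks a listing for a set of expected names. `missing_daemons` is a hypothetical helper, and which daemons to expect is up to the caller; the names in the usage note are only those that appear in this post.

```shell
# Hypothetical helper: read a `ps -ef` listing on stdin and print each daemon
# name given as an argument that does not appear in it.
missing_daemons() {
  local listing name
  listing=$(cat)
  for name in "$@"; do
    # -F treats the name as a fixed string; the leading "/" avoids matching
    # the "grep ..." line itself, whose command is not a full path.
    if ! printf '%s\n' "$listing" | grep -qF "/$name"; then
      printf '%s\n' "$name"
    fi
  done
}
```

For example, `ps -ef | missing_daemons ohasd.bin gipcd.bin crsd.bin evmd.bin` would report crsd.bin and evmd.bin for the listing above.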

As root, I tried to stop CRS, but it reported errors.

[root@rac1 bin]# ./crsctl disable crs 
CRS-4621: Oracle High Availability Services autostart is disabled.
[root@rac1 bin]# ./crsctl stop crs 
CRS-2796: The command may not proceed when Cluster Ready Services is not running 
CRS-4687: Shutdown command has completed with errors. 
CRS-4000: Command Stop failed, or completed with errors.

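Re-enabling autostart and trying to start CRS again also failed, because Oracle High Availability Services was still active.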

[root@rac1 bin]# ./crsctl enable crs 
CRS-4622: Oracle High Availability Services autostart is enabled. 
[root@rac1 bin]# ./crsctl start crs 
CRS-4640: Oracle High Availability Services is already active 
CRS-4000: Command Start failed, or completed with errors.

In the end I found a workaround on MOS: manually kill the leftover CRS processes. In a production environment, of course, the PSU should still be applied.

[root@rac1 bin]# ps -fea | grep ohasd.bin | grep -v grep 
root      2103     1  0 May10 ?        00:00:52 /u04/app/11.2.0/grid/bin/ohasd.bin reboot 
[root@rac1 bin]# ps -fea | grep gipcd.bin | grep -v grep 
grid      2333     1  0 May10 ?        00:02:41 /u04/app/11.2.0/grid/bin/gipcd.bin 
[root@rac1 bin]# ps -fea | grep mdnsd.bin | grep -v grep 
grid      2309     1  0 May10 ?        00:00:01 /u04/app/11.2.0/grid/bin/mdnsd.bin 
[root@rac1 bin]# ps -fea | grep gpnpd.bin | grep -v grep 
grid      2320     1  0 May10 ?        00:00:37 /u04/app/11.2.0/grid/bin/gpnpd.bin 
[root@rac1 bin]# ps -fea | grep evmd.bin | grep -v grep 
[root@rac1 bin]# ps -fea | grep crsd.bin | grep -v grep 
[root@rac1 bin]# kill -9 2103 2333  2309 2320 
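
The six separate `ps | grep` calls can be collapsed into one pass that only prints the PIDs, leaving the actual `kill` as a deliberate manual step. This is a sketch, not part of the MOS note: `clusterware_pids` is a made-up name, the daemon list is simply the six names grepped above, and it assumes the `ps -ef` column layout shown in this post (command in the eighth field).

```shell
# Hypothetical helper: read a `ps -ef` listing on stdin and print the PID of
# every process whose command path ends in one of the daemon names used above.
clusterware_pids() {
  awk '$8 ~ /\/(ohasd|gipcd|mdnsd|gpnpd|evmd|crsd)\.bin$/ { print $2 }'
}
```

Running `ps -ef | clusterware_pids` as root and reviewing the list before passing it to `kill -9` keeps the destructive step explicit.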

Then I tried starting CRS again:

[root@rac1 bin]# ./crsctl start crs 
CRS-4123: Oracle High Availability Services has been started.
[root@rac1 bin]# ./crs_stat -t 
CRS-0184: Cannot communicate with the CRS daemon.
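
Since the stack takes a while to come up after `crsctl start crs`, polling can replace guessing. The sketch below is an assumption-laden illustration: `wait_until_online` is a hypothetical function, the attempt count and delay are arbitrary defaults, and "healthy" is taken to mean that every `CRS-` line of the check command ends in "is online".

```shell
# Hypothetical helper: wait_until_online CHECK_CMD [ATTEMPTS] [DELAY_SECONDS]
# Re-runs CHECK_CMD until all of its CRS- lines report "is online".
# Returns 0 on success, 1 if the attempts run out.
wait_until_online() {
  local check_cmd attempts delay i out
  check_cmd=$1
  attempts=${2:-30}
  delay=${3:-10}
  i=1
  while [ "$i" -le "$attempts" ]; do
    out=$($check_cmd 2>&1)
    # Need at least one CRS- line, and none that fails to end in "is online".
    if printf '%s\n' "$out" | grep -qE '^CRS-' &&
       ! printf '%s\n' "$out" | grep -E '^CRS-' | grep -qvE 'is online[[:space:]]*$'; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}
```

A call such as `wait_until_online "./crsctl check crs" 30 10 && echo ready` would poll every 10 seconds for up to 5 minutes.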

Startup is somewhat slow. After waiting a while, I started the database manually, and this time it came up without problems.

-bash-4.1$ sqlplus / as sysdba
SQL*Plus: Release 11.2.0.3.0 Production on Sun May 11 04:41:03 2014
Copyright (c) 1982, 2011, Oracle.  All rights reserved.
Connected to an idle instance.
SQL> startup nomount 
ORACLE instance started.
Total System Global Area  638853120 bytes 
Fixed Size                  2231072 bytes 
Variable Size             482346208 bytes 
Database Buffers          146800640 bytes 
Redo Buffers                7475200 bytes 
SQL> alter database mount;
Database altered.
SQL> alter database open;
Database altered.
SQL>

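Checking the CRS status, everything that should be up is up. I created a small test table and checked it from both nodes, with no problems. Details of the workaround are in MOS Doc ID 1233580.1.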

-bash-4.1$ crs_stat -t 
Name           Type           Target    State     Host        
------------------------------------------------------------ 
ora....ER.lsnr ora....er.type ONLINE    ONLINE    rac1        
ora....N1.lsnr ora....er.type ONLINE    ONLINE    rac2        
ora.asm        ora.asm.type   OFFLINE   OFFLINE               
ora.cvu        ora.cvu.type   OFFLINE   OFFLINE               
ora.gsd        ora.gsd.type   OFFLINE   OFFLINE               
ora....network ora....rk.type ONLINE    ONLINE    rac1        
ora.oc4j       ora.oc4j.type  OFFLINE   OFFLINE               
ora.ons        ora.ons.type   ONLINE    ONLINE    rac1        
ora....SM1.asm application    OFFLINE   OFFLINE               
ora....C1.lsnr application    ONLINE    ONLINE    rac1        
ora.rac1.gsd   application    OFFLINE   OFFLINE               
ora.rac1.ons   application    ONLINE    ONLINE    rac1        
ora.rac1.vip   ora....t1.type ONLINE    ONLINE    rac1        
ora....SM2.asm application    OFFLINE   OFFLINE               
ora....C2.lsnr application    ONLINE    ONLINE    rac2        
ora.rac2.gsd   application    OFFLINE   OFFLINE               
ora.rac2.ons   application    ONLINE    ONLINE    rac2        
ora.rac2.vip   ora....t1.type ONLINE    ONLINE    rac2        
ora.racdb.db   ora....se.type ONLINE    ONLINE    rac2        
ora.scan1.vip  ora....ip.type ONLINE    ONLINE    rac2
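
To confirm "everything that should be up is up" without reading each row, the Target and State columns can be compared mechanically. The sketch assumes the five-column `crs_stat -t` layout shown above; `not_at_target` is a hypothetical name.

```shell
# Hypothetical helper: read `crs_stat -t` output on stdin and print the name of
# every resource whose Target is ONLINE but whose State is something else.
not_at_target() {
  awk '$3 == "ONLINE" && $4 != "ONLINE" { print $1 }'
}
```

For the listing above, `crs_stat -t | not_at_target` should print nothing, since every resource with Target ONLINE is also in State ONLINE. Note that crs_stat is deprecated in 11.2 in favor of `crsctl status resource -t`, whose layout differs, so this column-based check would not carry over unchanged.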