
GRID Software Fails to Start After an Abnormal Shutdown of an ORACLE RAC Cluster: Diagnosis and Fix

Problem:

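On 2019/12/12, a storage maintenance vendor was expanding the ASM disks in a customer's LINUX + ORACLE 11.2.0.4 RAC environment. The plan was to add new LUNs online, without rebooting the LINUX database hosts.
During the operation (exactly what was done is unknown), the database's ASM disks began reporting IO errors, and a forced host reboot followed. After the reboot, the GRID software would not start, and we were called in for an emergency response.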

Analysis:

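First, I connected to the environment and ran crsctl stat res -t -init to check the startup state of the cluster's init resources. The ASM resource was failing to start.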
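To spot the failing resource quickly, the check can be scripted. The following is a minimal sketch, not the method used in this incident: it assumes that crsctl stat res -init (without -t) prints each resource as NAME=, TARGET= and STATE= lines, and the function name list_offline_init_resources is hypothetical.

```shell
# Hypothetical helper: read "crsctl stat res -init" output on stdin and
# print every resource whose TARGET is ONLINE but whose STATE is not.
# Assumption: the non-tabular output uses NAME=/TARGET=/STATE= lines.
list_offline_init_resources() {
    awk -F= '
        $1 == "NAME"   { name = $2 }
        $1 == "TARGET" { target = $2 }
        $1 == "STATE"  {
            if (target ~ /^ONLINE/ && $2 !~ /^ONLINE/)
                print name ": " $2
        }
    '
}

# Usage (as the grid owner or root, on the affected node):
#   crsctl stat res -init | list_offline_init_resources
```

In a case like the one described here, the ASM resource would be expected to appear in the output while CSSD would not.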

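The ASM alert log showed the instance hanging after reconfiguration. The TRACE file referenced by the error initially suggested a problem with the cluster's private interconnect heartbeat, but the cluster logs showed that the CSSD process had started and joined the cluster successfully.
The ASM instance, too, hung only after Reconfiguration complete.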
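A quick way to line up the relevant events is to filter the ASM alert log for the messages that matter. This is a rough sketch: the search patterns are taken from the log excerpts in this post, while the function name scan_asm_alert_log and the alert log file name in the usage line are assumptions.

```shell
# Hypothetical helper: print, with line numbers, the alert-log lines that
# mark reconfiguration, hang detection, eviction, or instance termination.
scan_asm_alert_log() {
    grep -nE 'Reconfiguration (started|complete)|detects hung instances|instance eviction|terminating the instance due to error' "$1"
}

# Usage (the directory comes from the node 2 log in this post; the file
# name alert_+ASM2.log is an assumption, adjust it to your environment):
#   scan_asm_alert_log /u01/app/grid/diag/asm/+asm/+ASM2/trace/alert_+ASM2.log
```

Running it against the alert log of each node and comparing timestamps makes it easier to see which instance evicted which.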

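I worked through each of the scenarios described in the MOS documents, but none of them resolved the problem, and for a while I was stuck.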

MOS documents consulted:

ASM on Non-First Node (Second or Others) Fails to Start: PMON (ospid: nnnn): terminating the instance due to error 481 (Doc ID 1383737.1)
Top 5 issues that cause node reboots, evictions, or unexpected CRS restarts (Doc ID 1524455.1)
Top 5 Grid Infrastructure startup issues (Doc ID 1526147.1)

Resolution:

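A colleague then joined the effort and found that the failing node could start once the surviving node was temporarily shut down; in other words, only one node could be up at a time. This again pointed to a cluster communication problem. After the socket files under /var/tmp/.oracle were cleared and the cluster software was restarted, everything returned to normal.
The likely explanation is that the forced shutdown during the IO errors triggered by the disk addition left the socket files under /var/tmp/.oracle in a bad state (normally, the cluster software recreates these socket files at startup).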
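The cleanup can be sketched as follows. This is a cautious variant rather than the exact steps used in this incident: it moves the files into a backup directory instead of deleting them. The function name move_socket_files and the backup location are hypothetical, and the crsctl stop/start sequence in the comments is an assumption about how the "restart" was performed.

```shell
# Hypothetical helper: move everything out of a socket directory into a
# timestamped backup directory, and print the backup directory path.
# Keep the backup on the same filesystem so mv is a plain rename.
move_socket_files() {
    src=$1
    backup_root=$2
    [ -d "$src" ] || { echo "not a directory: $src" >&2; return 1; }
    dest="$backup_root/oracle_sockets.$(date +%Y%m%d%H%M%S)"
    mkdir -p "$dest" || return 1
    # -mindepth/-maxdepth are GNU find options (available on Linux).
    find "$src" -mindepth 1 -maxdepth 1 -exec mv {} "$dest"/ \;
    echo "$dest"
}

# Usage (assumed sequence, as root, on the affected node only, and only
# while the clusterware stack on that node is down):
#   crsctl stop crs -f
#   move_socket_files /var/tmp/.oracle /var/tmp
#   crsctl start crs
```

Moving rather than deleting keeps the old files available for later inspection; once the cluster is healthy again, the backup directory can be removed.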

ASM instance logs


Node 1:

Reconfiguration complete
Thu Dec 12 15:50:06 2019
LMON (ospid: 8523) detects hung instances during IMR reconfiguration
LMON (ospid: 8523) tries to kill the instance 2 in 37 seconds.
Please check instance 2's alert log and LMON trace file for more details.
LMON (ospid: 8523) aborts 1 previously scheduled instance kills





Node 2:

MMNL started with pid=21, OS id=26248 
lmon registered with NM - instance number 2 (internal mem no 1)
Thu Dec 12 16:24:08 2019
LMON received an instance eviction notification from instance 1
The instance eviction reason is 0x20000000
The instance eviction map is 2 
Thu Dec 12 16:24:11 2019
PMON (ospid: 26206): terminating the instance due to error 481
Thu Dec 12 16:24:11 2019
ORA-1092 : opitsk aborting process
Thu Dec 12 16:24:13 2019
System state dump requested by (instance=2, osid=26206 (PMON)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_diag_26216_20191212162413.trc
Dumping diagnostic data in directory=[cdmp_20191212162411], requested by (instance=2, osid=26206 (PMON)), summary=[abnormal instance termination].
Instance terminated by PMON, pid = 26206
Thu Dec 12 16:24:33 2019
NOTE: No asm libraries found in the system
MEMORY_TARGET defaulting to 1128267776.
* instance_number obtained from CSS = 2, checking for the existence of node 0... 
* node 0 does not exist. instance_number = 2 
Starting ORACLE instance (normal)