Manually Repairing an Oracle Bug (r6 Notes, Day 59)
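Last Friday, during what should have been a routine inspection, I wanted to extend a few tablespaces and it backfired: a single drop datafile operation caused the MRP on both standbys of a one-primary, two-standby setup to throw ORA-600 errors.

After trying a few approaches and checking MOS, I had not found anything quicker than rebuilding the standby. The environment is 10.2.0.4.0, so to get the problem fixed first I took an RMAN backup on the primary and ran a duplicate on the standby to restore and recover it. That rebuilt one standby. I left the other one as it was for the time being to see whether another method would turn up; if nothing did, I would rebuild it as well.

While experimenting, I did find a glimmer of hope. Many thanks to the friends in the WeChat group who offered suggestions; together we arrived at a workable plan.

The original problem is described at http://blog.itpub.net/23718752/viewspace-1797653/

The idea behind the fix is this. The datafile configuration on the primary is fine, so generate a backup control file on the primary and restore it on the standby. Once the restore succeeds, starting MRP will certainly still fail, because one file is inconsistent. At that point we need to make the Data Guard side aware of the inconsistency: running alter database datafile ... offline drop updates the standby's metadata for the inconsistent file.

This is somewhat similar to alter tablespace xxx drop datafile. That statement can only be run while the database is open, which a mounted standby is not, so this approach is a way to reach the same effect.

The steps I tried are as follows. Start the standby in nomount and restore the controlfile.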
$ rman target /
Recovery Manager: Release 10.2.0.4.0 - Production on Mon Sep 14 17:43:03 2015
Copyright (c) 1982, 2007, Oracle. All rights reserved.
connected to target database (not started)
RMAN> startup nomount
RMAN> restore controlfile from '/U01/backup_stage/ctl_oaqgu616_1_1';
Starting restore at 14-SEP-15
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: sid=2984 devtype=DISK
channel ORA_DISK_1: restoring control file
channel ORA_DISK_1: restore complete, elapsed time: 00:00:02
output filename=/U01/app/oracle/oradata/test/control01.ctl
output filename=/U01/app/oracle/oradata/test/control02.ctl
output filename=/U01/app/oracle/oradata/test/control03.ctl
Finished restore at 14-SEP-15
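After the restore, bring the database to the mount stage.
RMAN> alter database mount;
database mounted
released channel: ORA_DISK_1
RMAN> exit
Now try to apply the logs, that is, wake up MRP and let it start working.
The alert log shows the following: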
ALTER DATABASE RECOVER managed standby database disconnect from session
Mon Sep 14 17:45:04 2015
Attempt to start background Managed Standby Recovery process (p)
MRP0 started with pid=16, OS id=27255
Mon Sep 14 17:45:04 2015
MRP0: Background Managed Standby Recovery process started (peak)
Managed Standby Recovery not using Real Time Apply
MRP0: Background Media Recovery terminated with error 1110
Mon Sep 14 17:45:09 2015
Errors in file /U01/app/oracle/admin/peak/bdump/test_mrp0_27255.trc:
ORA-01110: data file 21: '/U01/app/oracle/oradata/test/test_new_index04.dbf'
ORA-01122: database file 21 failed verification check
ORA-01110: data file 21: '/U01/app/oracle/oradata/test/test_new_index04.dbf'
ORA-01203: wrong incarnation of this file - wrong creation SCN
Mon Sep 14 17:45:09 2015
Errors in file /U01/app/oracle/admin/peak/bdump/test_mrp0_27255.trc:
ORA-01110: data file 21: '/U01/app/oracle/oradata/test/test_new_index04.dbf'
ORA-01122: database file 21 failed verification check
ORA-01110: data file 21: '/U01/app/oracle/oradata/test/test_new_index04.dbf'
ORA-01203: wrong incarnation of this file - wrong creation SCN
Mon Sep 14 17:45:09 2015
MRP0: Background Media Recovery process shutdown (test)
Mon Sep 14 17:45:10 2015
Completed: ALTER DATABASE RECOVER managed standby database disconnect from session
Mon Sep 14 17:46:21 2015
This is roughly what was expected: MRP still fails. The difference is that the error is no longer ORA-600.
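Before dropping anything, it can help to confirm where the mismatch is. ORA-01203 reports that the creation SCN in the file header does not match what the restored control file expects. The following is a minimal sketch, assuming the standby is mounted and that file 21 is the file named in the alert log; the two views are standard, but what they return depends on the environment.

```sql
-- What the restored control file records for file 21
SELECT file#, creation_change#, status, name
  FROM v$datafile
 WHERE file# = 21;

-- What the file header on disk reports for the same file
SELECT file#, creation_change#, status, error, name
  FROM v$datafile_header
 WHERE file# = 21;
```

If the two creation_change# values differ, or the header query returns a value in the error column, that is consistent with the ORA-01203 message above.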
Since this file is inconsistent, and we know for certain that it has to be removed by hand, we can simply drop the datafile.
idle> alter database datafile '/U01/app/oracle/oradata/peak/peak_new_index04.dbf' offline drop;
Database altered.
Try to cancel log apply:
idle> recover managed standby database cancel;
ORA-16136: Managed Standby Recovery not active
So the earlier attempt to start MRP did fail.
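ORA-16136 is an indirect signal. A more direct check, sketched here on the assumption that it is run on the mounted standby, is to look for an MRP row in v$managed_standby:

```sql
-- No rows returned means no managed recovery process is running
SELECT process, status, thread#, sequence#, block#
  FROM v$managed_standby
 WHERE process LIKE 'MRP%';
```

Running the same query again after MRP is restarted should show an MRP0 row, with the status column indicating whether it is applying a log or waiting for one.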
Start MRP again:
idle> ALTER DATABASE RECOVER managed standby database disconnect from session ;
Database altered.
Starting MRP again, the log shows a turn for the better. The standby is now basically consistent with the primary, but there is still an archive gap.
alter database datafile '/U01/app/oracle/oradata/test/test_new_index04.dbf' offline drop
Mon Sep 14 17:46:21 2015
Completed: alter database datafile '/U01/app/oracle/oradata/test/test_new_index04.dbf' offline drop
Mon Sep 14 17:46:48 2015
ALTER DATABASE RECOVER managed standby database cancel
Mon Sep 14 17:46:48 2015
ORA-16136 signalled during: ALTER DATABASE RECOVER managed standby database cancel ...
Mon Sep 14 17:47:01 2015
ALTER DATABASE RECOVER managed standby database disconnect from session
Mon Sep 14 17:47:01 2015
Attempt to start background Managed Standby Recovery process (test)
MRP0 started with pid=16, OS id=27547
Mon Sep 14 17:47:01 2015
MRP0: Background Managed Standby Recovery process started (test)
Managed Standby Recovery not using Real Time Apply
parallel recovery started with 15 processes
Mon Sep 14 17:47:06 2015
Waiting for all non-current ORLs to be archived...
Media Recovery Waiting for thread 1 sequence 7414
Fetching gap sequence in thread 1, gap sequence 7414-7416
Mon Sep 14 17:47:07 2015
Completed: ALTER DATABASE RECOVER managed standby database disconnect from session
Mon Sep 14 17:48:06 2015
FAL[client]: Failed to request gap sequence
GAP - thread 1 sequence 7414-7416
DBID 1731005384 branch 680697352
The gap has now been detected, but log apply has not yet resumed from the log where the ORA-600 error occurred.
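The gap reported in the alert log can also be read from the dynamic views. This is a sketch under the assumption that thread 1 and sequences 7414 to 7416 from the log above are the ones of interest; the first query runs on the standby, the second on the primary.

```sql
-- On the standby: the gap that MRP is waiting on
SELECT thread#, low_sequence#, high_sequence#
  FROM v$archive_gap;

-- On the primary: where the local copies of the missing logs live
SELECT thread#, sequence#, name
  FROM v$archived_log
 WHERE thread# = 1
   AND sequence# BETWEEN 7414 AND 7416
   AND standby_dest = 'NO';
```

If automatic gap resolution never recovers, the files listed by the second query could be copied to the standby and registered there with ALTER DATABASE REGISTER PHYSICAL LOGFILE. That was not needed in this case.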
Enabling the standby through the broker directly gets this done with far less effort.
DGMGRL>add database stest2 as
connect identifier is stest2
maintained as physical;
DGMGRL>enable database stest;
Now the log gets busy. The key point is that RFS starts receiving logs again, beginning with the archive where it failed last time.
Mon Sep 14 17:53:19 2015
RFS[1]: Archived Log: '/U01/app/oracle/flash_recovery_area/STEST2/archivelog/2015_09_14/o1_mf_1_7414_bzf68cq2_.arc'
Redo Shipping Client Connected as PUBLIC
-- Connected User is Valid
RFS[2]: Assigned to RFS process 28706
RFS[2]: Identified database type as 'physical standby'
RFS[2]: Archived Log: '/U01/app/oracle/flash_recovery_area/STEST2/archivelog/2015_09_14/o1_mf_1_7415_bzf68h9y_.arc'
RFS[2]: Archived Log: '/U01/app/oracle/flash_recovery_area/STEST2/archivelog/2015_09_14/o1_mf_1_7416_bzf68hgr_.arc'
RFS[2]: Archived Log: '/U01/app/oracle/flash_recovery_area/STEST2/archivelog/2015_09_14/o1_mf_1_7426_bzf68jt8_.arc'
.....
RFS[2]: Archived Log: '/U01/app/oracle/flash_recovery_area/STEST2/archivelog/2015_09_14/o1_mf_1_7420_bzf69g71_.arc'
Mon Sep 14 17:53:51 2015
Managed Standby Recovery not using Real Time Apply
parallel recovery started with 15 processes
Mon Sep 14 17:53:51 2015
Waiting for all non-current ORLs to be archived...
Media Recovery Log /U01/app/oracle/flash_recovery_area/STEST2/archivelog/2015_09_14/o1_mf_1_7414_bzf68cq2_.arc
Mon Sep 14 17:53:52 2015
Completed: ALTER DATABASE RECOVER MANAGED STANDBY DATABASE THROUGH ALL SWITCHOVER DISCONNECT NODELAY
Mon Sep 14 17:53:52 2015
MRP can now continue applying logs from the point where it failed before.
Next, a simple check with DG broker.
DGMGRL> show configuration;
Configuration
Name: test
Enabled: YES
Protection Mode: MaxPerformance
Fast-Start Failover: DISABLED
Databases:
test - Primary database
stest4 - Physical standby database
stest2 - Physical standby database
Current status for "peak":
SUCCESS
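The broker status can be cross-checked from SQL on the standby. A rough sketch, assuming a single incarnation so that sequence numbers are comparable within a thread:

```sql
-- On the standby: highest sequence received vs. highest sequence applied
SELECT thread#,
       MAX(sequence#) AS last_received,
       MAX(CASE WHEN applied = 'YES' THEN sequence# END) AS last_applied
  FROM v$archived_log
 GROUP BY thread#;
```

When last_applied keeps advancing toward last_received across repeated runs, MRP is catching up.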
With the problem fixed, take a look at the datafiles. They are fine now.
idle> select file#,df.name,df.ts#,ts.name,df.RFILE# from v$datafile df,v$tablespace ts where df.ts#=ts.ts#;
20 /U01/app/oracle/oradata/test/test_new_data04.dbf 9 TEST_NEW_DATA 20
21 /U01/app/oracle/oradata/test/test_new_index04.dbf 10 TEST_NEW_INDEX 21
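So this case shows that when you step on a landmine like this, it pays not to lose heart. As long as the overall system is not affected, you can hypothesize boldly from your own analysis and verify carefully, and you may well discover something.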