Lao A's Plot (老A的自留地). Feel free to get in touch on WeChat; WeChat ID: zhoul777

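This RAC VIP failure was caused mainly by a fault on ent3, the adapter hosting the VIP (an EtherChannel, i.e. a primary/backup adapter bond), which made node 1's VIP fail over to node 2.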
$ crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora....b1.inst application    ONLINE    ONLINE    crmdb01
ora....b2.inst application    ONLINE    ONLINE    crmdb02
ora....db2.srv application    ONLINE    ONLINE    crmdb02
ora....srv1.cs application    ONLINE    ONLINE    crmdb02
ora.crmdb.db   application    ONLINE    ONLINE    crmdb02
ora....01.lsnr application    ONLINE    OFFLINE
ora....b01.gsd application    ONLINE    ONLINE    crmdb01
ora....b01.ons application    ONLINE    ONLINE    crmdb01
ora....b01.vip application    ONLINE    ONLINE    crmdb02
ora....02.lsnr application    ONLINE    ONLINE    crmdb02
ora....b02.gsd application    ONLINE    ONLINE    crmdb02
ora....b02.ons application    ONLINE    ONLINE    crmdb02
ora....b02.vip application    ONLINE    ONLINE    crmdb02
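
A quick way to spot the offline listener in output like this is to compare the Target and State columns. The following is only a minimal sketch: `crs_mismatch` is a hypothetical helper name, and it assumes the five-column `crs_stat -t` layout shown above. It does not catch the relocated VIP, which still reports ONLINE; that one has to be read from the Host column.

```shell
#!/bin/sh
# Hypothetical helper: print resources whose State differs from their Target.
# Reads `crs_stat -t` output on stdin; assumed columns: Name Type Target State Host.
crs_mismatch() {
    # Only rows whose name starts with "ora." are resources; header rows are skipped.
    awk '$1 ~ /^ora\./ && $3 != $4 { print $1, $3, $4 }'
}

# Usage on a cluster node (not executed here):
#   crs_stat -t | crs_mismatch
```

For the output above it would report only the node 1 listener line.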
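The fix is relatively simple: replace the faulty adapter and restart nodeapps on node 1, and the VIP automatically moves from node 2 back to node 1.
But this failure is also a good occasion to dig into what happens behind a RAC VIP failover.
When the failure occurred on node 1, some errors were visible at the operating-system level: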
$ netstat -in
Name Mtu Network Address Ipkts Ierrs Opkts Oerrs Coll
en0 1500 link#2 0.11.25.be.50.e9 2364166277 0 1352130944 371 0
en0 1500 3.3.22 3.3.22.1 2364166277 0 1352130944 371 0
en3 1500 link#3 0.11.25.be.4d.41 3591277841 0 1817998840 5 0
en3 1500 130.36.23 130.36.23.8 3591277841 0 1817998840 5 0
lo0 16896 link#1 1335635349 0 1335747477 0 0
lo0 16896 127 127.0.0.1 1335635349 0 1335747477 0 0
lo0 16896 ::1 1335635349 0 1335747477 0 0

$ errpt
IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION
173C787F 0416124011 I S topsvcs Possible malfunction on local adapter
4FC185D1 0416124011 T H ent1 TRANSMIT FAILURE
173C787F 0416095911 I S topsvcs Possible malfunction on local adapter
4FC185D1 0416095811 T H ent1 TRANSMIT FAILURE
4FC185D1 0416065011 T H ent1 TRANSMIT FAILURE
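
When the error log is longer than this, it can help to count entries per identifier and resource before drilling in with `errpt -a -j`. This is a rough sketch, not part of the original troubleshooting: `errpt_summary` is a hypothetical name, and it assumes the default `errpt` column order shown above (IDENTIFIER, TIMESTAMP, T, C, RESOURCE_NAME, DESCRIPTION).

```shell
#!/bin/sh
# Hypothetical helper: count errpt entries per (identifier, resource) pair.
# Reads default `errpt` output on stdin and skips the header row.
errpt_summary() {
    awk '$1 != "IDENTIFIER" && NF >= 5 { n[$1 " " $5]++ }
         END { for (k in n) print n[k], k }' |
    sort -k1,1nr -k2    # most frequent first, then by identifier
}

# Usage on the AIX host (not executed here):
#   errpt | errpt_summary
```

For the five entries listed above this yields three TRANSMIT FAILURE entries on ent1 and two topsvcs entries.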

More detailed error information is shown below:
$ errpt -a -j 4FC185D1|more
---------------------------------------------------------------------------
LABEL: GOENT_TX_ERR
IDENTIFIER: 4FC185D1

Date/Time: Sat Apr 16 12:40:04 BEIST 2011
Sequence Number: 10413
Machine Id: 00CE37F34C00
Node Id: crmdb01
Class: H
Type: TEMP
Resource Name: ent1
Resource Class: adapter
Resource Type: 14106802
Location: U5791.001.99B18ND-P1-C06-T1
VPD:
Product Specific.( ).......Gigabit Ethernet-SX PCI-X Adapter
Part Number.................10N8586
FRU Number..................10N8586
EC Level....................D76267
Manufacture ID..............YL1021
Network Address.............001125BE4D41
ROM Level.(alterable).......GOL021

Description
TRANSMIT FAILURE

Recommended Actions
PERFORM PROBLEM DETERMINATION PROCEDURES

Detail Data
FILE NAME
line: 2187 file: goent_tx.c
PCI ETHERNET STATISTICS
0000 25C5 0063 081B 0000 0003 0000 0003 0000 0000 0000 0000 0000 0000 0000 00DA
0000 010C D192 B18E 0001 B2FA DD4E 1CFC 0000 0041 1C93 93A5 0000 0000 0031 20A1
0000 00EE 256D C53E 0002 3042 90A3 0EE5 0000 0000 0000 0000 0000 0001 0001 B321
0000 09DF 0000 0000 0000 0000 0000 01DF 0000 000F 0000 0205 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 BBA3 087C 0200 D400 4120 8000 01A0 0000 0000
0230 0156 0009 F007 0443 C808 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000
DEVICE DRIVER INTERNAL STATE
2222 2222 256D C53E 0000 00C8
SOURCE ADDRESS
0011 25BE 4D41
---------------------------------------------------------------------------
LABEL: GOENT_TX_ERR
IDENTIFIER: 4FC185D1
$ errpt -a -j 173C787F|more
---------------------------------------------------------------------------
LABEL: TS_LOC_DOWN_ST
IDENTIFIER: 173C787F

Date/Time: Sat Apr 16 12:40:21 BEIST 2011
Sequence Number: 10414
Machine Id: 00CE37F34C00
Node Id: crmdb01
Class: S
Type: INFO
Resource Name: topsvcs

Description
Possible malfunction on local adapter

Probable Causes
Local adapter mal-functioned
Local adapter lost connection to network
Local adapter mis-configured

Failure Causes
Local adapter mal-functioned
Local adapter lost connection to network
Local adapter mis-configured

Recommended Actions
Verify adapter configuration
Verify network connectivity

Detail Data
DETECTING MODULE
rsct,nim_control.C,1.39.1.21,4983
ERROR ID
6zV5DL.pqFeB/ThN//Ml.1....................
REFERENCE CODE

Adapter interface name
en3
Adapter offset
0
Adapter IP address
130.36.23.8
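Since this was a hardware fault, we will not read the OS logs in detail. What we care about is what Oracle did at the moment of the failure.
When the failure occurred, racg first detected the VIP fault and ran the VIP check again (racgvip check crmdb01), recording the result in ora.crmdb01.vip.log: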
2011-04-16 12:40:13.049: [ RACG][1] [4276526][1][ora.crmdb01.vip]: Invalid parameters, or failed to bring up VIP (host=crmdb01)

2011-04-16 12:40:13.054: [ RACG][1] [4276526][1][ora.crmdb01.vip]: clsrcexecut: env ORACLE_CONFIG_HOME=/opt/oracle/product/10.2.0.4/crs

2011-04-16 12:40:13.054: [ RACG][1] [4276526][1][ora.crmdb01.vip]: clsrcexecut: cmd = /opt/oracle/product/10.2.0.4/crs/bin/racgeut -e _USR_ORA_DEBUG=0 54 /opt/oracle/product/10.2.0.4/crs/bin/racgvip check crmdb01

2011-04-16 12:40:13.054: [ RACG][1] [4276526][1][ora.crmdb01.vip]: clsrcexecut: rc = 1, time = 4.405s

2011-04-16 12:40:13.054: [ RACG][1] [4276526][1][ora.crmdb01.vip]: end for resource = ora.crmdb01.vip, action = check, status = 1, time = 4.572s
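Once the check finished and confirmed the anomaly, the crs process carried out the VIP failover. The log shows that after CRS saw the VIP go offline unexpectedly ("went OFFLINE unexpectedly"),
it stopped the VIP and the dependent service, began stopping the listener, and relocated the service resource ora.crmdb.crmsrv1.crmdb2.srv to crmdb02, i.e. node 2.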
2011-04-16 12:40:13.058: [ CRSAPP][11051]32CheckResource error for ora.crmdb01.vip error code = 1
2011-04-16 12:40:13.071: [ CRSRES][11051]32In stateChanged, ora.crmdb01.vip target is ONLINE
2011-04-16 12:40:13.072: [ CRSRES][11051]32ora.crmdb01.vip on crmdb01 went OFFLINE unexpectedly
2011-04-16 12:40:13.072: [ CRSRES][11051]32StopResource: setting CLI values
2011-04-16 12:40:13.086: [ CRSRES][11051]32Attempting to stop `ora.crmdb01.vip` on member `crmdb01`
2011-04-16 12:40:13.487: [ CRSRES][11312]32In stateChanged, ora.crmdb.crmsrv1.crmdb2.srv target is ONLINE
2011-04-16 12:40:13.487: [ CRSRES][11312]32ora.crmdb.crmsrv1.crmdb2.srv on crmdb01 went OFFLINE unexpectedly
2011-04-16 12:40:13.488: [ CRSRES][11312]32StopResource: setting CLI values
2011-04-16 12:40:13.520: [ CRSRES][11312]32Attempting to stop `ora.crmdb.crmsrv1.crmdb2.srv` on member `crmdb01`
2011-04-16 12:40:13.636: [ CRSRES][11051]32Stop of `ora.crmdb01.vip` on member `crmdb01` succeeded.
2011-04-16 12:40:13.636: [ CRSRES][11051]32ora.crmdb01.vip RESTART_COUNT=0 RESTART_ATTEMPTS=0
2011-04-16 12:40:13.650: [ CRSRES][11051]32ora.crmdb01.vip failed on crmdb01 relocating.
2011-04-16 12:40:13.770: [ CRSRES][11051]32StopResource: setting CLI values
2011-04-16 12:40:13.786: [ CRSRES][11051]32Attempting to stop `ora.crmdb01.LISTENER_CRMDB01.lsnr` on member `crmdb01`
2011-04-16 12:40:14.093: [ CRSRES][11312]32Stop of `ora.crmdb.crmsrv1.crmdb2.srv` on member `crmdb01` succeeded.
2011-04-16 12:40:14.094: [ CRSRES][11312]32ora.crmdb.crmsrv1.crmdb2.srv RESTART_COUNT=0 RESTART_ATTEMPTS=0
2011-04-16 12:40:14.105: [ CRSRES][11312]32ora.crmdb.crmsrv1.crmdb2.srv failed on crmdb01 relocating.
2011-04-16 12:40:14.150: [ CRSRES][11312]32Attempting to start `ora.crmdb.crmsrv1.crmdb2.srv` on member `crmdb02`
2011-04-16 12:40:14.442: [ CRSRES][11312]32Start of `ora.crmdb.crmsrv1.crmdb2.srv` on member `crmdb02` succeeded.

At the same time, the CRS log on node 2 showed the following:
2011-04-16 12:40:14.148: [ CRSRES][11617]32startRunnable: setting CLI values
2011-04-16 12:40:24.488: [ CRSRES][12145]32CRS-1002: Resource 'ora.crmdb.crmsrv1.cs' is already running on member 'crmdb02'
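
To follow a failover like the one in the node 1 log without reading every line, the key events can be filtered out of the CRS log. The sketch below is an illustration only: `crs_failover_events` is a hypothetical name, and the patterns are taken from the message texts quoted in this post, so other CRS versions may word them differently.

```shell
#!/bin/sh
# Hypothetical helper: keep only the lines that mark a resource going offline,
# being relocated, or finishing a start/stop. Reads the files given as
# arguments, or stdin when none are given.
crs_failover_events() {
    grep -E 'went OFFLINE unexpectedly|failed on .* relocating|(Start|Stop) of .* succeeded' "$@"
}

# Usage (the log path depends on the installation, so it is left as a placeholder):
#   crs_failover_events /path/to/crsd.log
```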

Note that a VIP failure can even bring down every resource that depends on the VIP:
If the VIP fails for any reason and cannot be restarted, CRS will bring down all dependent resources, including the Listener, ASM instance and database instance. CRS will attempt to bring these resources down gracefully - hence, a shutdown immediate will be issued, and will be seen in the alert log of the ASM instance - no errors will be evident in the alert log for the ASM instance.
The following log comes from a Metalink case (ID 277274.1); this failure occurs frequently on 10.1:
`ora.rmsclnxclu1.vip` on `rmsclnxclu1` went OFFLINE unexpectedly
2004-06-21 21:21:05.562: Attempting to stop `ora.rmsclnxclu1.vip` on member `rmsclnxclu1`
RTD #0: Action Script /home/oracle/product/crs/bin/racgwrap(stop) timed out for ora.rmsclnxclu1.vip! (timeout=60)
2004-06-21 21:22:16.472: [RTI:884782] StopResource error for ora.rmsclnxclu1.vip error code = 1
2004-06-21 21:22:18.611: `ora.rmsclnxclu1.vip` on member `rmsclnxclu1` has experienced an unrecoverable failure.
2004-06-21 21:22:18.611: Human intervention required to resume its availability.
2004-06-21 21:22:18.790: [RUNNABLELISTENER:884782] Resource failed into UNKNOWN, killing dependents
`ora.rmsclnxclu1.vip` experienced a failure on `rmsclnxclu1`. Stopping dependent resources.
2004-06-21 21:22:20.525: Attempting to stop `ora.gofod.gofod1.inst` on member `rmsclnxclu1`
2004-06-21 21:25:38.531: Stop of `ora.gofod.gofod1.inst` on member `rmsclnxclu1` succeeded.
2004-06-21 21:25:38.611: Attempting to stop `ora.rmsclnxclu1.LISTENER_rmsclnxclu1.lsnr` on member `rmsclnxclu1`
2004-06-21 21:25:38.983: Stop of `ora.rmsclnxclu1.LISTENER_rmsclnxclu1.lsnr` on member `rmsclnxclu1` succeeded.
2004-06-21 21:25:39.041: Attempting to stop `ora.rmsclnxclu1.ASM1.asm` on member `rmsclnxclu1`
2004-06-21 21:25:46.669: Stop of `ora.rmsclnxclu1.ASM1.asm` on member `rmsclnxclu1` succeeded.
2004-06-21 21:25:46.728: Attempting to stop `ora.rmsclnxclu1.vip` on member `rmsclnxclu1`
2004-06-21 21:25:55.547: Stop of `ora.rmsclnxclu1.vip` on member `rmsclnxclu1` succeeded.

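If you hit the failure above, or the VIP frequently goes offline by itself, the following approaches can help:
(1) Enable VIP tracing, so that more detailed log information is available if the VIP fails again.
Enable VIP tracing: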
# crsctl debug log res ora.node1.vip:1
Set Resource Debug Module: ora.node1.vip Level: 1
Disable VIP tracing:
# crsctl debug log res ora.node1.vip:0
Set Resource Debug Module: ora.node1.vip Level: 0
In 11g R2 the syntax for enabling tracing becomes:
#crsctl set log res "ora.rmntops1.vip.com:1"

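(2) Change the VIP check interval and script timeout: raise the check interval from the default of 30 seconds to 120 seconds, and the script timeout from 60 seconds to 120 seconds.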
1. Create the .cap file for each vip resource (on each node):

./crs_stat -p ora.rmsclnxclu1.vip > /tmp/ora.rmsclnxclu1.vip.cap

2. Then, update the .cap file using the following syntax and values:

./crs_profile -update ora.rmsclnxclu1.vip -dir /tmp -o ci=120,st=120

(Where ci = the CHECK_INTERVAL and st = the SCRIPT_TIMEOUT value.)

3. Finally, re-register it using the '-u' option:

./crs_register ora.rmsclnxclu1.vip -dir /tmp -u
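
With several VIP resources, the three steps above have to be repeated for each one. As a cautious sketch, the helper below only prints the commands for a given resource so they can be reviewed before anything is run. `vip_timeout_cmds` is a hypothetical name; the command syntax is copied from the steps above, and the defaults (120, 120, /tmp) are the values used there.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the three commands for one resource.
# $1 = resource name, $2 = CHECK_INTERVAL, $3 = SCRIPT_TIMEOUT, $4 = .cap directory
vip_timeout_cmds() {
    res=$1
    ci=${2:-120}
    st=${3:-120}
    dir=${4:-/tmp}
    printf './crs_stat -p %s > %s/%s.cap\n' "$res" "$dir" "$res"
    printf './crs_profile -update %s -dir %s -o ci=%s,st=%s\n' "$res" "$dir" "$ci" "$st"
    printf './crs_register %s -dir %s -u\n' "$res" "$dir"
}

# Usage: review the printed commands, then run them from the CRS bin directory.
#   vip_timeout_cmds ora.rmsclnxclu1.vip
```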

(3) On 10.1, you can remove the VIP dependency from the ASM resource:
ASM resource name is in the form of ora.<nodename>.<ASM instance name>.asm.
VIP resource name is in the form of ora.<nodename>.vip
- crs_stat -p <ASM resource name> > /tmp/<ASM resource name>.cap
- Edit /tmp/<ASM resource name>.cap to remove VIP resource name from the REQUIRED_RESOURCES attribute.
- crs_register -u <ASM resource name> -dir /tmp
- Use "crs_stat -p <ASM resource name>" to verify that the REQUIRED_RESOURCES attribute is updated.
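
The manual edit of the .cap file could also be done with a small filter. This is a sketch under stated assumptions: `strip_required` is a hypothetical name, and it assumes the profile uses `NAME=value` lines with REQUIRED_RESOURCES holding a space-separated list of resource names. Check the actual file before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: remove one resource name from the REQUIRED_RESOURCES line.
# $1 = resource name to drop; reads a .cap profile on stdin, writes the result to stdout.
strip_required() {
    awk -v drop="$1" '
        /^REQUIRED_RESOURCES=/ {
            sub(/^REQUIRED_RESOURCES=/, "")      # keep only the value part
            n = split($0, r, " ")                # assumed: space-separated names
            out = ""
            for (i = 1; i <= n; i++)
                if (r[i] != drop)
                    out = out (out == "" ? "" : " ") r[i]
            print "REQUIRED_RESOURCES=" out
            next
        }
        { print }                                # all other lines pass through unchanged
    '
}

# Usage: write to a new file, compare it with the original, then replace the
# original before running crs_register -u.
#   strip_required ora.<nodename>.vip < old.cap > new.cap
```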