
[Troubleshooting] ASM diskgroup dismount with "Waited 15 secs for write IO to PST"



 SYMPTOMS

 A normal or high redundancy diskgroup is dismounted, with WARNING messages like the following.

Note: ASM alert.log

Sat Mar 07 05:03:10 2015
WARNING: Waited 15 secs for write IO to PST disk 1 in group 2.
WARNING: Waited 15 secs for write IO to PST disk 1 in group 2.
WARNING: Waited 18 secs for write IO to PST disk 1 in group 2.
WARNING: Waited 18 secs for write IO to PST disk 1 in group 2.
WARNING: Waited 21 secs for write IO to PST disk 1 in group 2.
WARNING: Waited 21 secs for write IO to PST disk 1 in group 2.
WARNING: Waited 24 secs for write IO to PST disk 1 in group 2.
WARNING: Waited 24 secs for write IO to PST disk 1 in group 2.
Sat Mar 07 05:03:22 2015
WARNING: Waited 27 secs for write IO to PST disk 1 in group 2.
WARNING: Waited 27 secs for write IO to PST disk 1 in group 2.
WARNING: Waited 30 secs for write IO to PST disk 1 in group 2.
WARNING: Waited 30 secs for write IO to PST disk 1 in group 2.
WARNING: Waited 33 secs for write IO to PST disk 1 in group 2.
WARNING: Waited 33 secs for write IO to PST disk 1 in group 2.
WARNING: Waited 36 secs for write IO to PST disk 1 in group 2.
WARNING: Waited 36 secs for write IO to PST disk 1 in group 2.
Sat Mar 07 05:03:34 2015

  • The WARNING messages shown above appear in the ASM alert.log: WARNING: Waited 15 secs for write IO to PST disk 1 in group 2. Roughly, this means the PST I/O path waited 15 seconds while accessing disk 1 in diskgroup 2, and the wait kept being triggered; on each repeated timeout the reported wait time is incremented. This condition is generally related to the operating system's I/O paths, the disks attached to the database host, or the timeout parameter settings. The state of the affected disks can be checked as sketched below; we then continue through the ASM alert.log for further analysis.
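As a quick first check when these warnings appear, a minimal sketch using the standard V$ASM_DISK and V$ASM_DISKGROUP views (run in the ASM instance, connected as SYSASM):

  -- list each disk with its I/O mode and state; watch for MODE_STATUS
  -- values other than ONLINE on the disk named in the warnings
  select group_number, disk_number, name, path, mode_status, state
    from v$asm_disk
   order by group_number, disk_number;

  -- confirm the diskgroup is still MOUNTED and has no offline disks
  select group_number, name, type, state, offline_disks
    from v$asm_diskgroup;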
Note: Diskgroup Dismounted

Mon Mar 09 16:32:11 2015
NOTE: process _b000_+asm1 (1051) initiating offline of disk 0.3915951733 (DATA_0000) with mask 0x7e in group 2
NOTE: process _b000_+asm1 (1051) initiating offline of disk 1.3915951732 (DATA_0001) with mask 0x7e in group 2
NOTE: checking PST: grp = 2
GMON checking disk modes for group 2 at 7 for pid 28, osid 1051
ERROR: no read quorum in group: required 2, found 1 disks
NOTE: checking PST for grp 2 done.
NOTE: initiating PST update: grp = 2, dsk = 0/0xe968ae75, mask = 0x6a, op = clear
NOTE: initiating PST update: grp = 2, dsk = 1/0xe968ae74, mask = 0x6a, op = clear
GMON updating disk modes for group 2 at 8 for pid 28, osid 1051
ERROR: no read quorum in group: required 2, found 1 disks
Mon Mar 09 16:32:11 2015
NOTE: cache dismounting (not clean) group 2/0xEF985E9D (DATA) 
NOTE: messaging CKPT to quiesce pins Unix process pid: 1056, image: oracle@rac1 (B001)
Mon Mar 09 16:32:11 2015
NOTE: halting all I/Os to diskgroup 2 (DATA)
Mon Mar 09 16:32:11 2015
NOTE: LGWR doing non-clean dismount of group 2 (DATA)
NOTE: LGWR sync ABA=30.108 last written ABA 30.108
WARNING: Offline for disk DATA_0000 in mode 0x7f failed.
WARNING: Offline for disk DATA_0001 in mode 0x7f failed

  • Disk 1 in diskgroup 2 became slow or hung for some reason, which triggered the waits at the ASM layer. By default, however, Oracle ASM waits only 15 seconds for a disk in an unresponsive state. Although in 11.2.0.4 the repeated-wait mechanism automatically increases the wait time, there is still an upper limit on the disk I/O wait. Once ASM is convinced that a disk in the diskgroup is unresponsive, it OFFLINEs the target failing disk. This 15-second default is the _asm_hbeatiowait timeout discussed under CAUSE below; its current setting can be inspected as sketched below.
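A minimal sketch for inspecting the current timeout value (run in the ASM instance as SYSASM; X$ views require a SYS connection, and hidden parameters should only be changed under Oracle Support guidance):

  -- show the effective value of _asm_hbeatiowait and whether it is
  -- still at its default (15 seconds on this version)
  select a.ksppinm  as name,
         b.ksppstvl as value,
         b.ksppstdf as is_default
    from x$ksppi a, x$ksppcv b
   where a.indx = b.indx
     and a.ksppinm = '_asm_hbeatiowait';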
Mon Mar 09 16:32:11 2015
kjbdomdet send to inst 2
detach from dom 2, sending detach message to inst 2
Mon Mar 09 16:32:11 2015
NOTE: No asm libraries found in the system
Mon Mar 09 16:32:11 2015
List of instances:
 1 2
Dirty detach reconfiguration started (new ddet inc 1, cluster inc 16)
ASM Health Checker found 1 new failures
 Global Resource Directory partially frozen for dirty detach
* dirty detach - domain 2 invalid = TRUE 
 128 GCS resources traversed, 0 cancelled
Dirty Detach Reconfiguration complete
Mon Mar 09 16:32:11 2015
  • At the same time, Oracle ASM also tries to reconfigure the I/O paths for the failed disk and saves the state of the clusterware and ASM communication links at that moment; in the log above this shows up as the DETACH RECONFIGURATION messages. After that, Oracle attempts to re-establish the path to the failed disk and MOUNT the target diskgroup, thereby restoring the original normal state. If the diskgroup stays dismounted after the path recovers, it can also be remounted manually, as sketched below.
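If the diskgroup is not remounted automatically once the I/O path recovers, a minimal manual remount sketch (run in the ASM instance as SYSASM; the diskgroup name DATA is taken from the logs above):

  -- remount the force-dismounted diskgroup on this node
  alter diskgroup DATA mount;

In a RAC setup the statement has to be run on every node whose ASM instance dismounted the group; alternatively the diskgroup resource (ora.DATA.dg in the log above) can be brought online through the clusterware.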
Mon Mar 09 16:32:27 2015
 Received dirty detach msg from inst 2 for dom 2
Mon Mar 09 16:32:27 2015
List of instances:
 1 2
Dirty detach reconfiguration started (new ddet inc 2, cluster inc 16)
 Global Resource Directory partially frozen for dirty detach
* dirty detach - domain 2 invalid = TRUE 
 128 GCS resources traversed, 0 cancelled
freeing rdom 2
Dirty Detach Reconfiguration complete
Mon Mar 09 16:32:41 2015
NOTE:Waiting for all pending writes to complete before de-registering: grpnum 2
Mon Mar 09 16:32:58 2015
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_14247.trc:
ORA-15079: ASM file is closed
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_14247.trc:
ORA-15079: ASM file is closed
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_14247.trc:
ORA-15079: ASM file is closed
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_14247.trc:
ORA-15079: ASM file is closed

Mon Mar 09 16:33:11 2015
SUCCESS: diskgroup DATA was dismounted
SUCCESS: alter diskgroup DATA dismount force /* ASM SERVER:4019740317 */
Mon Mar 09 16:33:11 2015
NOTE: diskgroup resource ora.DATA.dg is offline
SUCCESS: ASM-initiated MANDATORY DISMOUNT of group DATA
Mon Mar 09 16:33:11 2015
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_14247.trc:
ORA-15078: ASM diskgroup was forcibly dismounted
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_14247.trc:
ORA-15078: ASM diskgroup was forcibly dismounted
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_14247.trc:
ORA-15078: ASM diskgroup was forcibly dismounted
WARNING: requested mirror side 1 of virtual extent 5 logical extent 0 offset 724992 is not allocated; I/O request failed
WARNING: requested mirror side 2 of virtual extent 5 logical extent 1 offset 724992 is not allocated; I/O request failed
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_14247.trc:
ORA-15078: ASM diskgroup was forcibly dismounted
ORA-15078: ASM diskgroup was forcibly dismounted
Mon Mar 09 16:33:11 2015
SQL> alter diskgroup DATA check /* proxy */ 
ORA-15032: not all alterations performed
ORA-15001: diskgroup "DATA" does not exist or is not mounted
ERROR: alter diskgroup DATA check /* proxy */
NOTE: client exited [14233]
Mon Mar 09 16:33:16 2015
NOTE: [crsd.bin@rac1 (TNS V1-V3) 1581] opening OCR file

 CAUSE

  • Delayed ASM PST heartbeats on the ASM disks of a normal or high redundancy diskgroup cause the ASM instance to dismount the diskgroup. By default, the heartbeat wait threshold is 15 seconds.
  • For an external redundancy diskgroup, by contrast, the heartbeat delays are essentially tolerated: the ASM instance stops issuing further PST heartbeats until PST revalidation succeeds, and the delays do not dismount the diskgroup directly.
The above warnings are typically seen in scenarios such as:
  • Some of the physical paths of the multipath device are offline or lost
  • During path failover in a multipath setup
  • Server load, or any sort of storage/multipath/OS maintenance
Doc ID 10109915.8 describes Bug 10109915 (the fix for that bug introduced the underscore parameter). The underlying issue is the absence of any OS/storage tunable timeout mechanism in the case of a hung NFS server/filer, and _asm_hbeatiowait then helps in setting the timeout.
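Since only normal and high redundancy diskgroups are force-dismounted by these heartbeat delays, it is worth confirming the redundancy type of the affected diskgroup first; a minimal sketch using the standard V$ASM_DISKGROUP view:

  -- TYPE is EXTERN, NORMAL or HIGH; only NORMAL/HIGH diskgroups
  -- are dismounted when PST heartbeats are delayed
  select name, type, state
    from v$asm_diskgroup;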

 SOLUTION

  • 1]  Check with the OS and storage administrators whether there is disk unresponsiveness.
  • 2]  If possible, keep the disk responsiveness below 15 seconds.
This will depend on various factors, such as:
+ Operating system
+ Presence of multipathing (and the multipath type)
+ Kernel parameters
  • So you need to find out the maximum possible disk unresponsiveness for your setup. For example, on AIX the rw_timeout setting affects this and defaults to 30 seconds. Another example is Linux with native multipathing, where the number of physical paths and the polling_interval value in multipath.conf dictate the maximum disk unresponsiveness. For your setup (the combination of OS, multipath and storage), you need to determine this value.
  • 3]  If you cannot keep the disk unresponsiveness below 15 seconds, then the following parameter can be set in the ASM instance (on all nodes of the RAC):
   _asm_hbeatiowait
  • As per internal bug 17274537, based on internal testing the value should be increased to 120 seconds; the same change is fixed in 12.2.

    Run the following in the ASM instance to set the desired value for _asm_hbeatiowait:

  alter system set "_asm_hbeatiowait"=<value> scope=spfile sid='*';
  • Then restart the ASM instances / CRS for the new parameter value to take effect.
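Before restarting, it can be verified that the entry was actually written to the spfile; after the restart, the X$ query shown earlier should report the new value with IS_DEFAULT = FALSE. A minimal verification sketch:

  -- confirm the spfile entry exists for all instances (SID = '*')
  select sid, name, value
    from v$spparameter
   where name = '_asm_hbeatiowait';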