
[Troubleshooting] ASM diskgroup dismount with "Waited 15 secs for write IO to PST"



 SYMPTOMS

 A normal or high redundancy diskgroup is dismounted, with WARNING messages like the following.

Note: ASM alert.log

Sat Mar 07 05:03:10 2015
WARNING: Waited 15 secs for write IO to PST disk 1 in group 2.
WARNING: Waited 15 secs for write IO to PST disk 1 in group 2.
WARNING: Waited 18 secs for write IO to PST disk 1 in group 2.
WARNING: Waited 18 secs for write IO to PST disk 1 in group 2.
WARNING: Waited 21 secs for write IO to PST disk 1 in group 2.
WARNING: Waited 21 secs for write IO to PST disk 1 in group 2.
WARNING: Waited 24 secs for write IO to PST disk 1 in group 2.
WARNING: Waited 24 secs for write IO to PST disk 1 in group 2.
Sat Mar 07 05:03:22 2015
WARNING: Waited 27 secs for write IO to PST disk 1 in group 2.
WARNING: Waited 27 secs for write IO to PST disk 1 in group 2.
WARNING: Waited 30 secs for write IO to PST disk 1 in group 2.
WARNING: Waited 30 secs for write IO to PST disk 1 in group 2.
WARNING: Waited 33 secs for write IO to PST disk 1 in group 2.
WARNING: Waited 33 secs for write IO to PST disk 1 in group 2.
WARNING: Waited 36 secs for write IO to PST disk 1 in group 2.
WARNING: Waited 36 secs for write IO to PST disk 1 in group 2.
Sat Mar 07 05:03:34 2015

  • The WARNING messages shown above appear in the ASM alert.log: WARNING: Waited 15 secs for write IO to PST disk 1 in group 2. Roughly, this means the PST I/O path waited 15 seconds while accessing disk 1 in diskgroup 2, and the wait kept being triggered; on each repeated timeout the reported wait time is incremented. This condition is generally related to the operating system's I/O paths, the disks attached to the database host, or the timeout parameter settings. The state of the affected disks can be checked as sketched below; we then continue through the ASM alert.log for further analysis.
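As a quick first check when these warnings appear, a minimal sketch using the standard V$ASM_DISK and V$ASM_DISKGROUP views (run in the ASM instance, connected as SYSASM):

  -- list each disk with its I/O mode and state; watch for MODE_STATUS
  -- values other than ONLINE on the disk named in the warnings
  select group_number, disk_number, name, path, mode_status, state
    from v$asm_disk
   order by group_number, disk_number;

  -- confirm the diskgroup is still MOUNTED and has no offline disks
  select group_number, name, type, state, offline_disks
    from v$asm_diskgroup;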
Note: Diskgroup Dismounted

Mon Mar 09 16:32:11 2015
NOTE: process _b000_+asm1 (1051) initiating offline of disk 0.3915951733 (DATA_0000) with mask 0x7e in group 2
NOTE: process _b000_+asm1 (1051) initiating offline of disk 1.3915951732 (DATA_0001) with mask 0x7e in group 2
NOTE: checking PST: grp = 2
GMON checking disk modes for group 2 at 7 for pid 28, osid 1051
ERROR: no read quorum in group: required 2, found 1 disks
NOTE: checking PST for grp 2 done.
NOTE: initiating PST update: grp = 2, dsk = 0/0xe968ae75, mask = 0x6a, op = clear
NOTE: initiating PST update: grp = 2, dsk = 1/0xe968ae74, mask = 0x6a, op = clear
GMON updating disk modes for group 2 at 8 for pid 28, osid 1051
ERROR: no read quorum in group: required 2, found 1 disks
Mon Mar 09 16:32:11 2015
NOTE: cache dismounting (not clean) group 2/0xEF985E9D (DATA) 
NOTE: messaging CKPT to quiesce pins Unix process pid: 1056, image: oracle@rac1 (B001)
Mon Mar 09 16:32:11 2015
NOTE: halting all I/Os to diskgroup 2 (DATA)
Mon Mar 09 16:32:11 2015
NOTE: LGWR doing non-clean dismount of group 2 (DATA)
NOTE: LGWR sync ABA=30.108 last written ABA 30.108
WARNING: Offline for disk DATA_0000 in mode 0x7f failed.
WARNING: Offline for disk DATA_0001 in mode 0x7f failed

  • Disk 1 in diskgroup 2 became slow or hung for some reason, which triggered the waits at the ASM layer. By default, however, Oracle ASM waits only 15 seconds for a disk in an unresponsive state. Although in 11.2.0.4 the repeated-wait mechanism automatically increases the wait time, there is still an upper limit on the disk I/O wait. Once ASM is convinced that a disk in the diskgroup is unresponsive, it OFFLINEs the target failing disk. This 15-second default is the _asm_hbeatiowait timeout discussed under CAUSE below; its current setting can be inspected as sketched below.
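A minimal sketch for inspecting the current timeout value (run in the ASM instance as SYSASM; X$ views require a SYS connection, and hidden parameters should only be changed under Oracle Support guidance):

  -- show the effective value of _asm_hbeatiowait and whether it is
  -- still at its default (15 seconds on this version)
  select a.ksppinm  as name,
         b.ksppstvl as value,
         b.ksppstdf as is_default
    from x$ksppi a, x$ksppcv b
   where a.indx = b.indx
     and a.ksppinm = '_asm_hbeatiowait';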
Mon Mar 09 16:32:11 2015
kjbdomdet send to inst 2
detach from dom 2, sending detach message to inst 2
Mon Mar 09 16:32:11 2015
NOTE: No asm libraries found in the system
Mon Mar 09 16:32:11 2015
List of instances:
 1 2
Dirty detach reconfiguration started (new ddet inc 1, cluster inc 16)
ASM Health Checker found 1 new failures
 Global Resource Directory partially frozen for dirty detach
* dirty detach - domain 2 invalid = TRUE 
 128 GCS resources traversed, 0 cancelled
Dirty Detach Reconfiguration complete
Mon Mar 09 16:32:11 2015
  • At the same time, Oracle ASM also tries to reconfigure the I/O paths for the failed disk and saves the state of the clusterware and ASM communication links at that moment; in the log above this shows up as the DETACH RECONFIGURATION messages. After that, Oracle attempts to re-establish the path to the failed disk and MOUNT the target diskgroup, thereby restoring the original normal state. If the diskgroup stays dismounted after the path recovers, it can also be remounted manually, as sketched below.
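If the diskgroup is not remounted automatically once the I/O path recovers, a minimal manual remount sketch (run in the ASM instance as SYSASM; the diskgroup name DATA is taken from the logs above):

  -- remount the force-dismounted diskgroup on this node
  alter diskgroup DATA mount;

In a RAC setup the statement has to be run on every node whose ASM instance dismounted the group; alternatively the diskgroup resource (ora.DATA.dg in the log above) can be brought online through the clusterware.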
Mon Mar 09 16:32:27 2015
 Received dirty detach msg from inst 2 for dom 2
Mon Mar 09 16:32:27 2015
List of instances:
 1 2
Dirty detach reconfiguration started (new ddet inc 2, cluster inc 16)
 Global Resource Directory partially frozen for dirty detach
* dirty detach - domain 2 invalid = TRUE 
 128 GCS resources traversed, 0 cancelled
freeing rdom 2
Dirty Detach Reconfiguration complete
Mon Mar 09 16:32:41 2015
NOTE:Waiting for all pending writes to complete before de-registering: grpnum 2
Mon Mar 09 16:32:58 2015
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_14247.trc:
ORA-15079: ASM file is closed
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_14247.trc:
ORA-15079: ASM file is closed
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_14247.trc:
ORA-15079: ASM file is closed
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_14247.trc:
ORA-15079: ASM file is closed

Mon Mar 09 16:33:11 2015
SUCCESS: diskgroup DATA was dismounted
SUCCESS: alter diskgroup DATA dismount force /* ASM SERVER:4019740317 */
Mon Mar 09 16:33:11 2015
NOTE: diskgroup resource ora.DATA.dg is offline
SUCCESS: ASM-initiated MANDATORY DISMOUNT of group DATA
Mon Mar 09 16:33:11 2015
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_14247.trc:
ORA-15078: ASM diskgroup was forcibly dismounted
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_14247.trc:
ORA-15078: ASM diskgroup was forcibly dismounted
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_14247.trc:
ORA-15078: ASM diskgroup was forcibly dismounted
WARNING: requested mirror side 1 of virtual extent 5 logical extent 0 offset 724992 is not allocated; I/O request failed
WARNING: requested mirror side 2 of virtual extent 5 logical extent 1 offset 724992 is not allocated; I/O request failed
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_14247.trc:
ORA-15078: ASM diskgroup was forcibly dismounted
ORA-15078: ASM diskgroup was forcibly dismounted
Mon Mar 09 16:33:11 2015
SQL> alter diskgroup DATA check /* proxy */ 
ORA-15032: not all alterations performed
ORA-15001: diskgroup "DATA" does not exist or is not mounted
ERROR: alter diskgroup DATA check /* proxy */
NOTE: client exited [14233]
Mon Mar 09 16:33:16 2015
NOTE: [crsd.bin@rac1 (TNS V1-V3) 1581] opening OCR file

 CAUSE

  • Delayed ASM PST heartbeats on the ASM disks of a normal or high redundancy diskgroup cause the ASM instance to dismount the diskgroup. By default, the heartbeat wait threshold is 15 seconds.
  • For an external redundancy diskgroup, by contrast, the heartbeat delays are essentially tolerated: the ASM instance stops issuing further PST heartbeats until PST revalidation succeeds, and the delays do not dismount the diskgroup directly.
The above warnings are typically seen in scenarios such as:
  • Some of the physical paths of the multipath device are offline or lost
  • During path failover in a multipath setup
  • Server load, or any sort of storage/multipath/OS maintenance
Doc ID 10109915.8 describes Bug 10109915 (the fix for that bug introduced the underscore parameter). The underlying issue is the absence of any OS/storage tunable timeout mechanism in the case of a hung NFS server/filer, and _asm_hbeatiowait then helps in setting the timeout.
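Since only normal and high redundancy diskgroups are force-dismounted by these heartbeat delays, it is worth confirming the redundancy type of the affected diskgroup first; a minimal sketch using the standard V$ASM_DISKGROUP view:

  -- TYPE is EXTERN, NORMAL or HIGH; only NORMAL/HIGH diskgroups
  -- are dismounted when PST heartbeats are delayed
  select name, type, state
    from v$asm_diskgroup;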

 SOLUTION

  • 1]  Check with the OS and storage administrators whether there is disk unresponsiveness.
  • 2]  If possible, keep the disk responsiveness below 15 seconds.
This will depend on various factors, such as:
+ Operating system
+ Presence of multipathing (and the multipath type)
+ Kernel parameters
  • So you need to find out the maximum possible disk unresponsiveness for your setup. For example, on AIX the rw_timeout setting affects this and defaults to 30 seconds. Another example is Linux with native multipathing, where the number of physical paths and the polling_interval value in multipath.conf dictate the maximum disk unresponsiveness. For your setup (the combination of OS, multipath and storage), you need to determine this value.
  • 3]  If you cannot keep the disk unresponsiveness below 15 seconds, then the following parameter can be set in the ASM instance (on all nodes of the RAC):
   _asm_hbeatiowait
  • As per internal bug 17274537, based on internal testing the value should be increased to 120 seconds; the same change is fixed in 12.2.

    Run the following in the ASM instance to set the desired value for _asm_hbeatiowait:

  alter system set "_asm_hbeatiowait"=<value> scope=spfile sid='*';
  • Then restart the ASM instances / CRS for the new parameter value to take effect.
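Before restarting, it can be verified that the entry was actually written to the spfile; after the restart, the X$ query shown earlier should report the new value with IS_DEFAULT = FALSE. A minimal verification sketch:

  -- confirm the spfile entry exists for all instances (SID = '*')
  select sid, name, value
    from v$spparameter
   where name = '_asm_hbeatiowait';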