'multipathd' and disk checker recognize failed disks later than Oracle ASM

Axin • Published: 2018-12-11

Environment

Red Hat Enterprise Linux 5, 6
Oracle ASM

Issue

We have noticed that the multipathd daemon and the SCSI path checker, which check LUNs on an EMC Symmetrix storage array, recognize failed or unresponsive LUNs later than the Oracle RAC ASM volume manager does.

Oracle RAC ASM seems to deactivate disks that are unresponsive for 15 seconds, but multipathd or the SCSI tur checker seems to recognize an unresponsive disk or SCSI path only after 60 seconds. As a result, Oracle ASM deactivates disks even though they still appear healthy from the OS. This results in the following error messages in the database logs:


WARNING: Waited 15 secs for write IO to PST disk 0 in group 3.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 3.
WARNING: Waited 15 secs for write IO to PST disk 1 in group 8.
WARNING: Waited 15 secs for write IO to PST disk 1 in group 8.
Fri Oct 17 21:40:56 2014
NOTE: process _b000_+asm1 (30427) initiating offline of disk 0.3915928412 (DATA1_0000) with mask 0x7e in group 3
NOTE: checking PST: grp = 3
Fri Oct 17 21:40:56 2014
NOTE: process _b001_+asm1 (30429) initiating offline of disk 1.3915928451 (DATA2_0001) with mask 0x7e in group 8
NOTE: checking PST: grp = 8

Is there any way to change the checking parameters for unresponsive LUNs and paths, so that the OS recognizes unresponsive disks earlier than the Oracle RAC ASM volume manager?

Resolution

Try the following tuning options to reduce the time that dm-multipath and the SCSI path checker need to detect failed paths and initiate recovery:

A. Reduce the polling_interval and checker_timeout for dm-multipath

Device mapper multipath uses the following two options to query the status of sub paths to the SAN devices:

polling_interval: This option defines how often a path's state is checked, in seconds. For paths that are usable, the time between checks will gradually increase to (4 * polling_interval). The default value of this option is 5.

The main functions of polling_interval are to check whether failed paths can be restored, to preemptively fail valid paths that are not currently receiving IO, and to react to configuration changes on the devices. Setting polling_interval below 4 seconds is generally not necessary for rapid failover, and it significantly increases system overhead and CPU load. See the article "High CPU usage of multipathd with low polling_interval" for details. It is therefore suggested not to lower polling_interval below 4.

checker_timeout: This option is available in Red Hat Enterprise Linux 5.5 and later. It sets the timeout, in seconds, for path checkers that issue SCSI commands with an explicit timeout. The default value is taken from /sys/block/sdx/device/timeout (60 sec).
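
To make the interaction of the two options concrete, the following is a rough back-of-the-envelope sketch, not a description of multipathd internals. It assumes that the path checker is the only thing that notices the failure, that a healthy path is checked at most every 4 * polling_interval seconds, and that an unresponsive path makes the checker wait the full checker_timeout. The function name `worst_case_detect` is made up for this illustration.

```shell
# Rough upper bound (seconds) on how long an idle, previously healthy path
# can stay unnoticed by the path checker under the assumptions above:
# longest gap between two checks plus the checker's own timeout.
worst_case_detect() {
    polling_interval=$1
    checker_timeout=$2
    echo $(( 4 * polling_interval + checker_timeout ))
}

worst_case_detect 5 60   # default values described above -> 80
worst_case_detect 4 10   # polling_interval 4, checker_timeout 10 -> 26
```

Under this simplified model, lowering both options shortens the window considerably, but it may still exceed the 15 seconds that Oracle ASM waits.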

The following steps lower these values so that dm-multipath detects IO failures and initiates recovery sooner.

Set a lower SCSI timeout for the underlying sdXX devices:


$ echo "20" > /sys/block/sdXX/device/timeout
$ cat /sys/block/sdXX/device/timeout
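
A multipath device usually has several sub paths, so the same value has to be written for each underlying sd device. The loop below is a minimal sketch of that idea. The function name `set_scsi_timeout`, the `SYSFS_ROOT` variable, and the device names `sdb` and `sdc` are assumptions for illustration; the example runs against a scratch directory that mimics the sysfs layout so it can be tried safely before touching a real system.

```shell
# Write a SCSI command timeout (seconds) for each listed sd device.
# SYSFS_ROOT defaults to /sys; it exists only so the loop can be tried
# against a scratch directory first. On a real system, run this as root.
set_scsi_timeout() {
    timeout_secs=$1
    shift
    for dev in "$@"; do
        f="${SYSFS_ROOT:-/sys}/block/$dev/device/timeout"
        if [ -w "$f" ]; then
            echo "$timeout_secs" > "$f"
            echo "$dev: $(cat "$f")"
        else
            echo "$dev: $f not writable, skipped" >&2
        fi
    done
}

# Dry run against a scratch tree; sdb and sdc are placeholder names.
SYSFS_ROOT=$(mktemp -d)
for dev in sdb sdc; do
    mkdir -p "$SYSFS_ROOT/block/$dev/device"
    echo 60 > "$SYSFS_ROOT/block/$dev/device/timeout"
done
set_scsi_timeout 20 sdb sdc
```

Values written to sysfs do not survive a reboot, so the change has to be reapplied at boot time, for example from a startup script or a udev rule.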

Set the following options in the defaults section of the /etc/multipath.conf file:


polling_interval          4          #### With this value, dm-multipath checks the status of sub paths every 4 seconds.
checker_timeout          10          #### Sets 10 seconds as the timeout value for the path checker.
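
The two options take effect only when they sit inside the `defaults { ... }` block. As a quick sanity check before reloading, something like the following sketch could print what a configuration file actually contains in that block. The function name `show_multipath_defaults` is hypothetical, and the example inspects a sample file written to a scratch location rather than the real /etc/multipath.conf.

```shell
# Print polling_interval and checker_timeout as found inside the
# defaults section of a multipath.conf-style file (path given as $1).
show_multipath_defaults() {
    awk '
        /^[[:space:]]*defaults[[:space:]]*[{]/ { in_defaults = 1; next }
        in_defaults && /[}]/                   { in_defaults = 0 }
        in_defaults && ($1 == "polling_interval" || $1 == "checker_timeout") {
            print $1 "=" $2
        }
    ' "$1"
}

# Example: write a sample defaults section to a scratch file and inspect it.
sample_conf=$(mktemp)
cat > "$sample_conf" <<'EOF'
defaults {
        polling_interval        4
        checker_timeout         10
}
EOF
show_multipath_defaults "$sample_conf"
```

On a real system the same function would be pointed at /etc/multipath.conf; it only reads the file and does not show what the running daemon has loaded.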

Reload the multipath configuration with the following command:

```
$ /etc/init.d/multipathd reload
```
NOTE: Before applying any of these settings on a production system, make sure the changes have been tried in a test environment and that no issues were observed during the tests.

B. Reduce the time required to complete SCSI error handling using the steps in the following article

The RHEL 5 kernel-2.6.18-371.6.1.el5, the RHEL 6 kernel-2.6.32-358.32.3.el6, and later kernels provide a couple of options that reduce the time spent in the SCSI layer on error handling when IO requests issued to a sub path fail. It is recommended to also apply the tuning options described in the following article: Limiting path failover time for SCSI devices

C. Increase the value of _asm_hbeatiowait parameter in Oracle configuration

The Oracle configuration has an _asm_hbeatiowait parameter, which is set to 15 seconds by default. If an IO does not complete within this timeout, the following messages are logged:

```
WARNING: Waited 15 secs for write IO to PST disk 0 in group 3.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 3.
```

The value of the _asm_hbeatiowait option was retrieved with the following command:


SQL> select name,value,describe from v$asm_hidden_paras;  
NAME                                    VALUE    DESCRIBE  
--------------------------------------- -------- ----------------------------------------------------------------------  
_asm_acd_chunks                         1        initial ACD chunks created  
[...]
_asm_global_dump_level                  267      System state dump level for ASM asserts  
_asm_hbeatiowait                        15       number of secs to wait for PST Async Hbeat IO return       <<----------
_asm_hbeatwaitquantum                   2        quantum used to compute time-to-wait for a PST Hbeat check

It is also suggested to increase the _asm_hbeatiowait parameter so that Oracle ASM waits longer before logging the above messages. As this option is specific to the Oracle configuration, please check with Oracle support for details on how to modify its value.
