
A RAC Node Reboot Caused by a Faulty Memory Module on the Host

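Today the business side reported that one host could not be logged in to. A colleague had been working on it for a while before I got involved, so the ocssd.log entries from the time of the failure were never captured. (ohasd.log and ocssd.log produce a large volume of output and, unlike the database alert log, ocssd.log gets overwritten: if the cluster goes down at 9:00 and you look at the log at 10:00, the 9:00 entries are already gone and only recent entries remain. That is why I have no ocssd.log entries from the moment of the failure.) My colleague did confirm that, around the time the host rebooted, ocssd.log contained no sign of a network problem causing a node eviction or split-brain.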
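
Because ocssd.log is overwritten so quickly, it helps to copy the logs aside before doing anything else. The following is a minimal sketch of that idea, not part of the original troubleshooting; the function name `snapshot_logs` and the `incident_...` directory naming are assumptions made for illustration, and the log paths have to be supplied by the caller.

```python
import shutil
import time
from pathlib import Path


def snapshot_logs(log_paths, dest_root):
    """Copy every existing log file into a new timestamped directory.

    Returns the directory and the list of copies that were made.
    Paths that do not exist are skipped silently.
    """
    dest = Path(dest_root) / time.strftime("incident_%Y%m%d_%H%M%S")
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in map(Path, log_paths):
        if path.is_file():
            target = dest / path.name
            # copy2 also preserves the modification time, which is useful
            # when the copies are later lined up against other logs.
            shutil.copy2(path, target)
            copied.append(target)
    return dest, copied
```

A call would pass the ocssd.log, ohasd.log and crsd.log paths of the affected node; where those files live depends on the Grid Infrastructure installation, so they are deliberately not hard-coded here.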

(1) First, log in to the host and check when it went down and rebooted.

[[email protected] cssd]# last reboot | head -1
reboot   system boot  2.6.32-431.el6.x Wed Sep 12 08:52 - 10:30  (01:38)

-- This command shows that the host came back up at 08:52. That is the boot time; the host actually went down a few minutes before 08:52.
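
If the boot time needs to be compared against log timestamps programmatically, the `last reboot` line can be parsed. This is a small sketch under assumptions: `parse_last_reboot` is a hypothetical helper, the regular expression is written only against the line format shown above, and the year must be passed in because that line does not contain one.

```python
import re
from datetime import datetime

# Matches e.g. "reboot   system boot  2.6.32-431.el6.x Wed Sep 12 08:52 - 10:30  (01:38)"
LAST_REBOOT_RE = re.compile(
    r"reboot\s+system boot\s+(\S+)\s+(\w{3}\s+\w{3}\s+\d{1,2}\s+\d{2}:\d{2})"
)


def parse_last_reboot(line, year):
    """Return (kernel, boot_time) from one line of `last reboot`, or None."""
    match = LAST_REBOOT_RE.search(line)
    if match is None:
        return None
    kernel, stamp = match.groups()
    # Collapse repeated spaces (single-digit days are padded) before parsing.
    stamp = " ".join(stamp.split())
    booted = datetime.strptime(f"{year} {stamp}", "%Y %a %b %d %H:%M")
    return kernel, booted
```

The result is the time the system came back up, so the actual crash has to be searched for in the minutes before it.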

(2) Since the host rebooted, check the operating system log.

Operating system log:

Sep 12 08:24:19 zjhzbjwgzhzg01 kernel: __ratelimit: 16 callbacks suppressed
Sep 12 08:24:19 zjhzbjwgzhzg01 kernel: [Hardware Error]: Machine check events logged
-- Here, some time before the host rebooted, the Red Hat Linux host is already reporting hardware errors.
Sep 12 08:24:23 zjhzbjwgzhzg01 kernel: [Hardware Error]: Machine check events logged
Sep 12 08:25:27 zjhzbjwgzhzg01 kernel: __ratelimit: 38 callbacks suppressed
Sep 12 08:25:27 zjhzbjwgzhzg01 kernel: [Hardware Error]: Machine check events logged
Sep 12 08:25:27 zjhzbjwgzhzg01 kernel: [Hardware Error]: Machine check events logged
Sep 12 08:26:00 zjhzbjwgzhzg01 mcelog: Corrected memory errors on page 595c01f000 exceed threshold 10 in 24h: 10 in 24h
Sep 12 08:26:00 zjhzbjwgzhzg01 mcelog: Location SOCKET:2 CHANNEL:0 DIMM:? []
Sep 12 08:26:00 zjhzbjwgzhzg01 mcelog: Offlining page 595c01f000
Sep 12 08:26:28 zjhzbjwgzhzg01 kernel: __ratelimit: 39 callbacks suppressed
Sep 12 08:26:28 zjhzbjwgzhzg01 kernel: [Hardware Error]: Machine check events logged
Sep 12 08:26:28 zjhzbjwgzhzg01 kernel: [Hardware Error]: Machine check events logged
Sep 12 08:27:31 zjhzbjwgzhzg01 kernel: __ratelimit: 88 callbacks suppressed
Sep 12 08:27:31 zjhzbjwgzhzg01 kernel: [Hardware Error]: Machine check events logged
Sep 12 08:27:31 zjhzbjwgzhzg01 kernel: [Hardware Error]: Machine check events logged
Sep 12 08:27:42 zjhzbjwgzhzg01 mcelog: Corrected memory errors on page 58dc5bf000 exceed threshold 10 in 24h: 10 in 24h
Sep 12 08:27:42 zjhzbjwgzhzg01 mcelog: Location SOCKET:2 CHANNEL:0 DIMM:? []
Sep 12 08:27:42 zjhzbjwgzhzg01 mcelog: Offlining page 58dc5bf000
Sep 12 08:27:42 zjhzbjwgzhzg01 mcelog: Corrected memory errors on page 591dd3f000 exceed threshold 10 in 24h: 10 in 24h
Sep 12 08:27:42 zjhzbjwgzhzg01 mcelog: Location SOCKET:2 CHANNEL:0 DIMM:? []
Sep 12 08:27:42 zjhzbjwgzhzg01 mcelog: Offlining page 591dd3f000
Sep 12 08:27:43 zjhzbjwgzhzg01 mcelog: Corrected memory errors on page 599c859000 exceed threshold 10 in 24h: 10 in 24h
Sep 12 08:27:43 zjhzbjwgzhzg01 mcelog: Location SOCKET:2 CHANNEL:0 DIMM:? []
Sep 12 08:27:43 zjhzbjwgzhzg01 mcelog: Offlining page 599c859000
Sep 12 08:27:43 zjhzbjwgzhzg01 mcelog: Corrected memory errors on page 599c858000 exceed threshold 10 in 24h: 10 in 24h
Sep 12 08:27:43 zjhzbjwgzhzg01 mcelog: Location SOCKET:2 CHANNEL:0 DIMM:? []
Sep 12 08:27:43 zjhzbjwgzhzg01 mcelog: Offlining page 599c858000
Sep 12 08:27:43 zjhzbjwgzhzg01 mcelog: Offlining page 599c858000 failed: Device or resource busy
Sep 12 08:27:43 zjhzbjwgzhzg01 kernel: soft offline: 0x599c858 page already poisoned
Sep 12 08:27:43 zjhzbjwgzhzg01 mcelog: Corrected memory errors on page 599c815000 exceed threshold 10 in 24h: 10 in 24h
Sep 12 08:27:43 zjhzbjwgzhzg01 mcelog: Location SOCKET:2 CHANNEL:0 DIMM:? []
Sep 12 08:27:43 zjhzbjwgzhzg01 mcelog: Offlining page 599c815000
Sep 12 08:27:43 zjhzbjwgzhzg01 mcelog: Offlining page 599c815000 failed: Device or resource busy
Sep 12 08:27:43 zjhzbjwgzhzg01 kernel: soft offline: 0x599c815 page already poisoned
Sep 12 08:27:44 zjhzbjwgzhzg01 mcelog: Corrected memory errors on page 599cc34000 exceed threshold 10 in 24h: 10 in 24h
Sep 12 08:27:44 zjhzbjwgzhzg01 mcelog: Location SOCKET:2 CHANNEL:0 DIMM:? []
Sep 12 08:27:44 zjhzbjwgzhzg01 mcelog: Offlining page 599cc34000
Sep 12 08:27:46 zjhzbjwgzhzg01 mcelog: Corrected memory errors on page 589d81d000 exceed threshold 10 in 24h: 10 in 24h
Sep 12 08:27:46 zjhzbjwgzhzg01 mcelog: Location SOCKET:2 CHANNEL:0 DIMM:? []
Sep 12 08:27:46 zjhzbjwgzhzg01 mcelog: Offlining page 589d81d000
Sep 12 08:27:53 zjhzbjwgzhzg01 mcelog: Corrected memory errors on page 599dd34000 exceed threshold 10 in 24h: 10 in 24h
Sep 12 08:27:53 zjhzbjwgzhzg01 mcelog: Location SOCKET:2 CHANNEL:0 DIMM:? []
-- (the pager output was skipped here; the entries between 08:27:53 and 08:35:06 are not shown)
Sep 12 08:35:06 zjhzbjwgzhzg01 mcelog: Location SOCKET:2 CHANNEL:0 DIMM:? []
Sep 12 08:35:06 zjhzbjwgzhzg01 mcelog: Offlining page 58ddd9c000
Sep 12 08:36:10 zjhzbjwgzhzg01 kernel: __ratelimit: 50 callbacks suppressed
Sep 12 08:36:10 zjhzbjwgzhzg01 kernel: [Hardware Error]: Machine check events logged
Sep 12 08:36:11 zjhzbjwgzhzg01 kernel: [Hardware Error]: Machine check events logged
Sep 12 08:36:45 zjhzbjwgzhzg01 mcelog: Corrected memory errors on page 591d45f000 exceed threshold 10 in 24h: 10 in 24h
Sep 12 08:36:45 zjhzbjwgzhzg01 mcelog: Location SOCKET:2 CHANNEL:0 DIMM:? []
Sep 12 08:36:45 zjhzbjwgzhzg01 mcelog: Offlining page 591d45f000
Sep 12 08:37:07 zjhzbjwgzhzg01 mcelog: Corrected memory errors on page 595c4df000 exceed threshold 10 in 24h: 10 in 24h
Sep 12 08:37:07 zjhzbjwgzhzg01 mcelog: Location SOCKET:2 CHANNEL:0 DIMM:? []
Sep 12 08:37:07 zjhzbjwgzhzg01 mcelog: Offlining page 595c4df000
Sep 12 08:37:14 zjhzbjwgzhzg01 kernel: __ratelimit: 61 callbacks suppressed
Sep 12 08:37:14 zjhzbjwgzhzg01 kernel: [Hardware Error]: Machine check events logged
Sep 12 08:37:14 zjhzbjwgzhzg01 kernel: [Hardware Error]: Machine check events logged
Sep 12 08:37:25 zjhzbjwgzhzg01 mcelog: Corrected memory errors on page 589d0ff000 exceed threshold 10 in 24h: 10 in 24h
-- The errors reported here are memory errors.
Sep 12 08:37:25 zjhzbjwgzhzg01 mcelog: Location SOCKET:2 CHANNEL:0 DIMM:? []
Sep 12 08:37:25 zjhzbjwgzhzg01 mcelog: Offlining page 589d0ff000
Sep 12 08:38:14 zjhzbjwgzhzg01 kernel: __ratelimit: 55 callbacks suppressed
Sep 12 08:38:14 zjhzbjwgzhzg01 kernel: [Hardware Error]: Machine check events logged
Sep 12 08:38:14 zjhzbjwgzhzg01 kernel: [Hardware Error]: Machine check events logged
Sep 12 08:39:15 zjhzbjwgzhzg01 kernel: __ratelimit: 72 callbacks suppressed
Sep 12 08:39:15 zjhzbjwgzhzg01 kernel: [Hardware Error]: Machine check events logged
Sep 12 08:39:16 zjhzbjwgzhzg01 kernel: [Hardware Error]: Machine check events logged
Sep 12 08:39:56 zjhzbjwgzhzg01 mcelog: Corrected memory errors on page 589c93f000 exceed threshold 10 in 24h: 10 in 24h
Sep 12 08:39:56 zjhzbjwgzhzg01 mcelog: Location SOCKET:2 CHANNEL:0 DIMM:? []
Sep 12 08:39:56 zjhzbjwgzhzg01 mcelog: Offlining page 589c93f000
Sep 12 08:40:19 zjhzbjwgzhzg01 kernel: __ratelimit: 56 callbacks suppressed
Sep 12 08:40:19 zjhzbjwgzhzg01 kernel: [Hardware Error]: Machine check events logged
Sep 12 08:40:19 zjhzbjwgzhzg01 kernel: [Hardware Error]: Machine check events logged
Sep 12 08:40:22 zjhzbjwgzhzg01 mcelog: Corrected memory errors on page 589ccff000 exceed threshold 10 in 24h: 10 in 24h
Sep 12 08:40:22 zjhzbjwgzhzg01 mcelog: Location SOCKET:2 CHANNEL:0 DIMM:? []
Sep 12 08:40:22 zjhzbjwgzhzg01 mcelog: Offlining page 589ccff000
Sep 12 08:40:25 zjhzbjwgzhzg01 mcelog: Corrected memory errors on page 595cd9f000 exceed threshold 10 in 24h: 10 in 24h
Sep 12 08:40:25 zjhzbjwgzhzg01 mcelog: Location SOCKET:2 CHANNEL:0 DIMM:? []
Sep 12 08:40:25 zjhzbjwgzhzg01 mcelog: Offlining page 595cd9f000
Sep 12 08:41:00 zjhzbjwgzhzg01 mcelog: Corrected memory errors on page 5c9cdb6000 exceed threshold 10 in 24h: 10 in 24h
Sep 12 08:41:00 zjhzbjwgzhzg01 mcelog: Location SOCKET:2 CHANNEL:0 DIMM:? []
Sep 12 08:41:00 zjhzbjwgzhzg01 mcelog: Offlining page 5c9cdb6000
-- The log breaks off here: there are no entries from 08:42 to 08:51, so the host was down during that window. The fault can therefore be placed at about 08:40. When going through ohasd.log, crsd.log and ocssd.log, focus on the period just before 08:40 instead of reading through most of the log blindly.
Sep 12 08:52:53 zjhzbjwgzhzg01 kernel: imklog 5.8.10, log source = /proc/kmsg started.
Sep 12 08:52:53 zjhzbjwgzhzg01 rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="7988" x-info="http://www.rsyslog.com"] start
Sep 12 08:52:53 zjhzbjwgzhzg01 kernel: Initializing cgroup subsys cpuset
Sep 12 08:52:53 zjhzbjwgzhzg01 kernel: Initializing cgroup subsys cpu
Sep 12 08:52:53 zjhzbjwgzhzg01 kernel: Linux version 2.6.32-431.el6.x86_64 (mock [email protected]) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-4) (GCC) ) #1 SMP Sun Nov 10 22:19:54 EST 2013
Sep 12 08:52:53 zjhzbjwgzhzg01 kernel: Command line: ro root=UUID=40b7f075-d922-4995-aebe-de61630e7037 rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD SYSFONT=latarcyrheb-sun16 crashkernel=auto  KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM rhgb quiet nohz=0ff intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll transparent_hugepage=never
Sep 12 08:52:53 zjhzbjwgzhzg01 kernel: KERNEL supported cpus:
Sep 12 08:52:53 zjhzbjwgzhzg01 kernel:  Intel GenuineIntel
Sep 12 08:52:53 zjhzbjwgzhzg01 kernel:  AMD AuthenticAMD
Sep 12 08:52:53 zjhzbjwgzhzg01 kernel:  Centaur CentaurHauls
Sep 12 08:52:53 zjhzbjwgzhzg01 kernel: BIOS-provided physical RAM map:
Sep 12 08:52:53 zjhzbjwgzhzg01 kernel: BIOS-e820: 0000000000000000 - 000000000009a000 (usable)
Sep 12 08:52:53 zjhzbjwgzhzg01 kernel: BIOS-e820: 000000000009a000 - 00000000000a0000 (reserved)
Sep 12 08:52:53 zjhzbjwgzhzg01 kernel: BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
Sep 12 08:52:53 zjhzbjwgzhzg01 kernel: BIOS-e820: 0000000000100000 - 0000000075507000 (usable)
Sep 12 08:52:53 zjhzbjwgzhzg01 kernel: BIOS-e820: 0000000075507000 - 0000000075cbb000 (reserved)
Sep 12 08:52:53 zjhzbjwgzhzg01 kernel: BIOS-e820: 0000000075cbb000 - 0000000075dbb000 (ACPI data)
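
The outage window was found above by spotting where the syslog timestamps stop and resume. A short sketch of the same check in code follows; the helper names `parse_syslog_time` and `largest_gap` are made up for this example, and the year is an explicit argument because the syslog format shown here does not record it.

```python
from datetime import datetime


def parse_syslog_time(line, year):
    """Parse the leading 'Mon DD HH:MM:SS' stamp; return None if it is missing."""
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def largest_gap(lines, year):
    """Return (last_seen, first_seen_again) around the longest silence in the log."""
    times = [t for t in (parse_syslog_time(line, year) for line in lines) if t]
    best = None
    for prev, cur in zip(times, times[1:]):
        if best is None or cur - prev > best[1] - best[0]:
            best = (prev, cur)
    return best
```

Run over the messages file, the first element of the result is the last entry written before the host went silent, which is the time to concentrate on in the cluster logs.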

(3) Check the database alert log on this node. (My colleague's review of ocssd.log indicated that this was not a network-induced split-brain reboot, so the network can be ruled out as the cause. The OS log above reports hardware and memory errors, which may be the reason for the reboot, but the database layer could also be responsible, so look at the DB log as well.)

Errors in file /opt/oracle/oracle/diag/rdbms/sdh/sdh1/trace/sdh1_j003_91488.trc:
ORA-12012: error on auto execute of job 90926
ORA-01403: no data found
ORA-06512: at "IRM.PKG_DATACOMPARE_TASKEXECUTE", line 11
ORA-06512: at line 1
Errors in file /opt/oracle/oracle/diag/rdbms/sdh/sdh1/trace/sdh1_j002_91432.trc:
ORA-12012: error on auto execute of job 90927
ORA-01403: no data found
ORA-06512: at "IRM.PKG_DATACOMPARE_TASKEXECUTE", line 11
ORA-06512: at line 1
Wed Sep 12 08:31:39 2018
Errors in file /opt/oracle/oracle/diag/rdbms/sdh/sdh1/trace/sdh1_j004_92891.trc:
ORA-12012: error on auto execute of job 90929
ORA-01403: no data found
ORA-06512: at "IRM.PKG_DATACOMPARE_TASKEXECUTE", line 11
ORA-06512: at line 1
Wed Sep 12 08:38:45 2018
Thread 1 advanced to log sequence 338086 (LGWR switch)
  Current log# 1 seq# 338086 mem# 0: +SDH_SYS_DG/sdh/onlinelog/group_1.329.879523497
Wed Sep 12 08:38:45 2018
LNS: Standby redo logfile selected for thread 1 sequence 338086 for destination LOG_ARCHIVE_DEST_2
Wed Sep 12 08:38:50 2018
Archived Log entry 1126324 added for thread 1 sequence 338085 ID 0xd26cc373 dest 1:
-- There are no entries between 08:38 and 09:14; the database was down during this period.
Wed Sep 12 09:14:56 2018
Starting ORACLE instance (normal)
-- My colleague started the database manually.
************************ Large Pages Information *******************
Per process system memlock (soft) limit = UNLIMITED
  Large page usage restricted to processor group "sys"
  Total Shared Global Region in Large Pages = 200 GB (100%)
WARNING:
-- The database prints a memory-related warning at startup, and the host had reported memory errors earlier. Note that this warning concerns the large-page configuration parameter rather than a hardware fault.
  The parameter _linux_prepage_large_pages is explicitly disabled.
  Oracle strongly recommends setting the _linux_prepage_large_pages
  parameter since the instance  is running in a Processor Group. If there is
  insufficient large page memory, instance may encounter SIGBUS error
  and may terminate abnormally.
Large Pages used by this instance: 102401 (200 GB)
Large Pages unused in Processor Group sys = 145081 (283 GB)
Large Pages configured in Processor Group sys = 153496 (300 GB)
Large Page size = 2048 KB
********************************************************************
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Initial number of CPU is 88
Number of processor cores in the system is 44
Number of processor sockets in the system is 4
Private Interface 'Bond1:1' configured from GPnP for use as a private interconnect.
  [name='Bond1:1', type=1, ip=169.254.185.193, mac=a0-00-01-00-fe-80-00-00-00-00-00-00-64-3e-00-00-00-00-00-00, net=169.254.0.0/16, mask=255.255.0.0, use=haip:cluster_interconnect/62]
Public Interface 'vlan304' configured from GPnP for use as a public interface.
  [name='vlan304', type=1, ip=10.212.252.84, mac=e0-97-96-06-e7-a5, net=10.212.252.80/28, mask=255.255.255.240, use=public/1]
Public Interface 'vlan304:1' configured from GPnP for use as a public interface.
  [name='vlan304:1', type=1, ip=10.212.252.87, mac=e0-97-96-06-e7-a5, net=10.212.252.80/28, mask=255.255.255.240, use=public/1]
Picked latch-free SCN scheme 3
Autotune of undo retention is turned off.
LICENSE_MAX_USERS = 0
SYS auditing is disabled
NUMA system with 4 nodes detected
Starting up:
  Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining and Real Application Testing options.
ORACLE_HOME = /opt/oracle/oracle/product/11.2.0/db_1
System name:    Linux
Node name:      zjhzbjwgzhzg01
Release:        2.6.32-431.el6.x86_64
Version:        #1 SMP Sun Nov 10 22:19:54 EST 2013
Machine:        x86_64
Using parameter settings in server-side pfile /opt/oracle/oracle/product/11.2.0/db_1/dbs/initsdh1.ora
System parameters with non-default values:
  processes                = 3000
  sessions                 = 4576
  event                    = "10949 trace name context forever, level 1"
  sga_max_size             = 200G
  shared_pool_size         = 20G
  large_pool_size          = 3584M
  java_pool_size           = 3584M
  streams_pool_size        = 512M
  spfile                   = "+SDH_SYS_DG/sdh/spfilesdh.ora

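(4) Check the ASM log. A problem with the ASM disks would not actually bring the cluster down and reboot the host; the log is checked here only for completeness.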

ASM log:

Wed Sep 12 08:56:59 2018
NOTE: No asm libraries found in the system
-- The ASM instance starts up here: after the host rebooted, the cluster auto-started and brought ASM up. The next place to look, as expected, is crsd.log.
MEMORY_TARGET defaulting to 1128267776.
* instance_number obtained from CSS = 1, checking for the existence of node 0...
* node 0 does not exist. instance_number = 1
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Initial number of CPU is 88
Number of processor cores in the system is 44
Number of processor sockets in the system is 4
Private Interface 'Bond1:1' configured from GPnP for use as a private interconnect.
  [name='Bond1:1', type=1, ip=169.254.185.193, mac=a0-00-01-00-fe-80-00-00-00-00-00-00-64-3e-00-00-00-00-00-00, net=169.254.0.0/16, mask=255.255.0.0, use=haip:cluster_interconnect/62]
Public Interface 'vlan304' configured from GPnP for use as a public interface.
  [name='vlan304', type=1, ip=10.212.252.84, mac=e0-97-96-06-e7-a5, net=10.212.252.80/28, mask=255.255.255.240, use=public/1]
Picked latch-free SCN scheme 3
Using LOG_ARCHIVE_DEST_1 parameter default value as /opt/oracle/grid/11.2.0/grid/dbs/arch
Autotune of undo retention is turned on.
LICENSE_MAX_USERS = 0
SYS auditing is disabled
NOTE: Volume support  enabled
NUMA system with 4 nodes detected
Starting up:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options.
ORACLE_HOME = /opt/oracle/grid/11.2.0/grid
System name:    Linux
Node name:      zjhzbjwgzhzg01
Release:        2.6.32-431.el6.x86_64
Version:        #1 SMP Sun Nov 10 22:19:54 EST 2013
Machine:        x86_64
Using parameter settings in server-side spfile +OCR_VOTE/zjhzbjw-cluster/asmparameterfile/registry.253.876056333
System parameters with non-default values:
  large_pool_size          = 12M
  instance_type            = "asm"
  remote_login_passwordfile= "EXCLUSIVE"
  asm_diskstring           = "/dev/asmdisk/*"
  asm_diskgroups           = "OCR_VOTE"
  asm_diskgroups           = "SDH_SYS_DG"
  asm_diskgroups           = "ARCHIVE_DG"
  asm_diskgroups           = "WPS_SYS_DG"
  asm_power_limit          = 8
  diagnostic_dest          = "/opt/oracle/grid/grid"
Cluster communication is configured to use the following interface(s) for this instance
  169.254.185.193
cluster interconnect IPC version:Oracle UDP/IP (generic)

(5) Check the CRS log.

crsd log: the log shows that the CRS component restarted at 08:57.

2018-09-12 08:37:37.898: [UiServer][1156998912]{1:47993:28042} Done for ctx=0x7fcfbc0088a0
2018-09-12 08:57:08.163: [ CRSMAIN][1714153248] First attempt: init CSS context succeeded.
[  clsdmt][1707702016]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=zjhzbjwgzhzg01DBG_CRSD))
2018-09-12 08:57:08.165: [  clsdmt][1707702016]PID for the Process [35434], connkey 1
2018-09-12 08:57:08.165: [  clsdmt][1707702016]Creating PID [35434] file for home /opt/oracle/grid/11.2.0/grid host zjhzbjwgzhzg01 bin crs to /opt/oracle/grid/11.2.0/grid/crs/init/
2018-09-12 08:57:08.165: [  clsdmt][1707702016]Writing PID [35434] to the file [/opt/oracle/grid/11.2.0/grid/crs/init/zjhzbjwgzhzg01.pid]
2018-09-12 08:57:08.607: [ CRSMAIN][1707702016] Policy Engine is not initialized yet!
2018-09-12 08:57:08.607: [ CRSMAIN][1714153248] CRS Daemon Starting
2018-09-12 08:57:08.608: [    CRSD][1714153248] Logging level for Module: allcomp  0
2018-09-12 08:57:08.608: [    CRSD][1714153248] Logging level for Module: default  0
2018-09-12 08:57:08.608: [    CRSD][1714153248] Logging level for Module: COMMCRS  0
2018-09-12 08:57:08.608: [    CRSD][1714153248] Logging level for Module: COMMNS  0
2018-09-12 08:57:08.608: [    CRSD][1714153248] Logging level for Module: CSSCLNT  0
2018-09-12 08:57:08.608: [    CRSD][1714153248] Logging level for Module: GIPCLIB  0
2018-09-12 08:57:08.608: [    CRSD][1714153248] Logging level for Module: GIPCXBAD  0
2018-09-12 08:57:08.608: [    CRSD][1714153248] Logging level for Module: GIPCLXPT  0
2018-09-12 08:57:08.608: [    CRSD][1714153248] Logging level for Module: GIPCUNDE  0
2018-09-12 08:57:08.608: [    CRSD][1714153248] Logging level for Module: GIPC  0
2018-09-12 08:57:08.608: [    CRSD][1714153248] Logging level for Module: GIPCGEN  0
2018-09-12 08:57:08.608: [    CRSD][1714153248] Logging level for Module: GIPCTRAC  0
2018-09-12 08:57:08.608: [    CRSD][1714153248] Logging level for Module: GIPCWAIT  0
2018-09-12 08:57:08.608: [    CRSD][1714153248] Logging level for Module: GIPCXCPT  0
2018-09-12 08:57:08.608: [    CRSD][1714153248] Logging level for Module: GIPCOSD  0
2018-09-12 08:57:08.608: [    CRSD][1714153248] Logging level for Module: GIPCBASE  0
2018-09-12 08:57:08.608: [    CRSD][1714153248] Logging level for Module: GIPCCLSA  0
2018-09-12 08:57:08.608: [    CRSD][1714153248] Logging level for Module: GIPCCLSC  0
2018-09-12 08:57:08.608: [    CRSD][1714153248] Logging level for Module: GIPCEXMP  0
2018-09-12 08:57:08.608: [    CRSD][1714153248] Logging level for Module: GIPCGMOD  0
2018-09-12 08:57:08.608: [    CRSD][1714153248] Logging level for Module: GIPCHEAD  0
2018-09-12 08:57:08.608: [    CRSD][1714153248] Logging level for Module: GIPCMUX  0
2018-09-12 08:57:08.608: [    CRSD][1714153248] Logging level for Module: GIPCNET  0
2018-09-12 08:57:08.608: [    CRSD][1714153248] Logging level for Module: GIPCNULL  0
2018-09-12 08:57:08.608: [    CRSD][1714153248] Logging level for Module: GIPCPKT  0
2018-09-12 08:57:08.608: [    CRSD][1714153248] Logging level for Module: GIPCSMEM  0
2018-09-12 08:57:08.608: [    CRSD][1714153248] Logging level for Module: GIPCHAUP  0
2018-09-12 08:57:08.608: [    CRSD][1714153248] Logging level for Module: GIPCHALO  0
2018-09-12 08:57:08.608: [    CRSD][1714153248] Logging level for Module: GIPCHTHR  0
2018-09-12 08:57:08.608: [    CRSD][1714153248] Logging level for Module: GIPCHGEN  0
2018-09-12 08:57:08.608: [    CRSD][1714153248] Logging level for Module: GIPCHLCK  0
2018-09-12 08:57:08.608: [    CRSD][1714153248] Logging level for Module: GIPCHDEM  0
2018-09-12 08:57:08.608: [    CRSD][1714153248] Logging level for Module: GIPCHWRK  0
2018-09-12 08:57:08.609: [    CRSD][1714153248] Logging level for Module: CRSMAIN  0
2018-09-12 08:57:08.609: [    CRSD][1714153248] Logging level for Module: clsdmt  0
2018-09-12 08:57:08.609: [    CRSD][1714153248] Logging level for Module: clsdms  0
2018-09-12 08:57:08.609: [    CRSD][1714153248] Logging level for Module: CRSUI  0
2018-09-12 08:57:08.609: [    CRSD][1714153248] Logging level for Module: CRSCOMM  0
2018-09-12 08:57:08.609: [    CRSD][1714153248] Logging level for Module: CRSRTI  0
2018-09-12 08:57:08.609: [    CRSD][1714153248] Logging level for Module: CRSPLACE  0
2018-09-12 08:57:08.609: [    CRSD][1714153248] Logging level for Module: CRSAPP  0
2018-09-12 08:57:08.609: [    CRSD][1714153248] Logging level for Module: CRSRES  0
2018-09-12 08:57:08.609: [    CRSD][1714153248] Logging level for Module: CRSTIMER  0
2018-09-12 08:57:08.609: [    CRSD][1714153248] Logging level for Module: CRSEVT  0
2018-09-12 08:57:08.609: [    CRSD][1714153248] Logging level for Module: CRSD  0
2018-09-12 08:57:08.609: [    CRSD][1714153248] Logging level for Module: CLUCLS  0
2018-09-12 08:57:08.609: [    CRSD][1714153248] Logging level for Module: CLSVER  0
2018-09-12 08:57:08.609: [    CRSD][1714153248] Logging level for Module: CLSFRAME  0
2018-09-12 08:57:08.609: [    CRSD][1714153248] Logging level for Module: CRSPE  0
2018-09-12 08:57:08.609: [    CRSD][1714153248] Logging level for Module: CRSSE  0
2018-09-12 08:57:08.609: [    CRSD][1714153248] Logging level for Module: CRSRPT  0
2018-09-12 08:57:08.609: [    CRSD][1714153248] Logging level for Module: CRSOCR  0
2018-09-12 08:57:08.609: [    CRSD][1714153248] Logging level for Module: UiServer  0
2018-09-12 08:57:08.609: [    CRSD][1714153248] Logging level for Module: AGFW  0
2018-09-12 08:57:08.609: [    CRSD][1714153248] Logging level for Module: SuiteTes  1
2018-09-12 08:57:08.609: [    CRSD][1714153248] Logging level for Module: CRSSHARE  1
2018-09-12 08:57:08.609: [    CRSD][1714153248] Logging level for Module: CRSSEC  0
2018-09-12 08:57:08.609: [    CRSD][1714153248] Logging level for Module: CRSCCL  0
2018-09-12 08:57:08.609: [ CRSMAIN][1707702016] Policy Engine is not initialized yet!
2018-09-12 08:57:08.609: [    CRSD][1714153248] Logging level for Module: CRSCEVT  0
2018-09-12 08:57:08.609: [    CRSD][1714153248] Logging level for Module: AGENT  1
2018-09-12 08:57:08.609: [    CRSD][1714153248] Logging level for Module: OCRAPI  1
2018-09-12 08:57:08.609: [    CRSD][1714153248] Logging level for Module: OCRCLI  1
2018-09-12 08:57:08.609: [    CRSD][1714153248] Logging level for Module: OCRSRV  1
2018-09-12 08:57:08.609: [    CRSD][1714153248] Logging level for Module: OCRMAS  1
2018-09-12 08:57:08.609: [    CRSD][1714153248] Logging level for Module: OCRMSG  1
2018-09-12 08:57:08.609: [    CRSD][1714153248] Logging level for Module: OCRCAC  1
2018-09-12 08:57:08.609: [    CRSD][1714153248] Logging level for Module: OCRRAW  1
2018-09-12 08:57:08.609: [    CRSD][1714153248] Logging level for Module: OCRUTL  1
2018-09-12 08:57:08.609: [    CRSD][1714153248] Logging level for Module: OCROSD  1
2018-09-12 08:57:08.609: [    CRSD][1714153248] Logging level for Module: OCRASM  1
2018-09-12 08:57:08.609: [ CRSMAIN][1714153248] Checking the OCR device
2018-09-12 08:57:08.610: [ CRSMAIN][1714153248] Sync-up with OCR
2018-09-12 08:57:08.610: [ CRSMAIN][1714153248] Connecting to the CSS Daemon
2018-09-12 08:57:08.610: [ CRSMAIN][1714153248] Getting local node number
2018-09-12 08:57:08.610: [ CRSMAIN][1714153248] Initializing OCR
[   CLWAL][1714153248]clsw_Initialize: OLR initlevel [70000]
2018-09-12 08:57:09.198: [  OCRRAW][1714153248]proprioo: for disk 0 (+OCR_VOTE), id match (1), total id sets, (2) need recover (0), my votes (2), total votes  (2), commit_lsn (1469), lsn (1469)
2018-09-12 08:57:09.198: [  OCRRAW][1714153248]proprioo: my id set: (760227868, 1028247821, 0, 0, 0)
2018-09-12 08:57:09.198: [  OCRRAW][1714153248]proprioo: 1st set: (340549372, 760227868, 0, 0, 0)
2018-09-12 08:57:09.198: [  OCRRAW][1714153248]proprioo: 2nd set: (760227868, 1028247821, 0, 0, 0)
2018-09-12 08:57:09.207: [  OCRSRV][1714153248]th_init: Successfully retrieved CSS misscount [31].
2018-09-12 08:57:09.207: [  OCRSRV][1714153248]th_init: Successfully query CLSS mode [3].
2018-09-12 08:57:09.208: [  OCRSRV][1714153248]th_init:1: FROM PUBDATA Node num [2]Remote Listening Port [0] Cache invalidation port [0]
2018-09-12 08:57:09.208: [  OCRSRV][1714153248]th_init:1.1: FROM PUBDATA Node num [2]CLSC Private IP or GIPC connect string [gipcha<zjhzbjwgzhzg02><467b-167f-bbd0-bd81><b6b7-5893-3f00-7370>]

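From the evidence above, the host reboot was most likely caused by memory and hardware problems. After contacting the host team, it turned out that a faulty memory module had indeed caused the whole host to reboot, which in turn restarted the cluster.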
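
When handing the case to the hardware team, it can be useful to condense the mcelog noise into a short summary. The sketch below tallies the distinct pages with corrected errors and the reported locations; the regular expressions are written only against the mcelog lines quoted in this article, and `summarize_mcelog` is a name invented for the example.

```python
import re
from collections import Counter

# Patterns follow the mcelog lines shown in the OS log of this article.
PAGE_RE = re.compile(r"mcelog: Corrected memory errors on page ([0-9a-f]+)")
LOC_RE = re.compile(r"mcelog: Location (SOCKET:\d+ CHANNEL:\d+ DIMM:\S+)")


def summarize_mcelog(lines):
    """Return (set of affected pages, Counter of reported locations)."""
    pages = set()
    locations = Counter()
    for line in lines:
        page = PAGE_RE.search(line)
        if page:
            pages.add(page.group(1))
        location = LOC_RE.search(line)
        if location:
            locations[location.group(1)] += 1
    return pages, locations
```

In this incident every location line reads SOCKET:2 CHANNEL:0 with `DIMM:?`, which suggests the exact slot was not resolved by mcelog, so the socket and channel are the details to pass on.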

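Summary: a cluster failure can be hardware-related; it can be wait events driving CPU usage so high that the cluster restarts; or it can be a dropped disk that takes the database down while the cluster itself stays up. Only the logs can tell these cases apart. The key is to pin down when the fault occurred and then use the log entries around that point in time to locate it.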
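
Once the fault time is known, the cluster logs can be cut down to the minutes leading up to it. The following sketch does this for lines that start with the `YYYY-MM-DD HH:MM:SS.mmm` stamp seen in the crsd log above; `crsd_time` and `lines_before` are illustrative names, and the 10-minute default is an arbitrary choice.

```python
from datetime import datetime, timedelta


def crsd_time(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS.mmm' stamp; None if absent."""
    try:
        return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S.%f")
    except ValueError:
        return None


def lines_before(lines, fault_time, minutes=10):
    """Keep timestamped lines from the `minutes` leading up to fault_time.

    Continuation lines without their own timestamp are dropped.
    """
    start = fault_time - timedelta(minutes=minutes)
    kept = []
    for line in lines:
        stamp = crsd_time(line)
        if stamp is not None and start <= stamp <= fault_time:
            kept.append(line)
    return kept
```

For this incident the call would use a fault time of about 08:40 on 2018-09-12, the point where the OS log went silent.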