Troubleshooting: crs_stat run as the grid user on Oracle RAC takes ~30 seconds
Symptoms:
Environment: two-node Oracle RAC 11.2.0.4 on physical servers.
Symptom 1:
On one Oracle RAC 11.2.0.4 cluster, some commands run as the grid user take abnormally long to return, close to 30 seconds:
[grid@rac1 ~]$ time crs_stat -t -v
Name Type R/RA F/FT Target State Host
----------------------------------------------------------------------
ora.DATA.dg ora....up.type 0/5 0/ ONLINE ONLINE rac1
ora.FRA.dg ora....up.type 0/5 0/ ONLINE ONLINE rac1
ora....ER.lsnr ora....er.type 0/5 0/ ONLINE ONLINE rac1
ora....N1.lsnr ora....er.type 0/5 0/0 ONLINE ONLINE rac1
ora.OCR.dg ora....up.type 0/5 0/ ONLINE ONLINE rac1
ora.asm ora.asm.type 0/5 0/ ONLINE ONLINE rac1
ora.*.db ora....se.type 0/2 0/1 OFFLINE OFFLINE
ora.*.db ora....se.type 0/2 0/1 ONLINE ONLINE rac1
ora.cvu ora.cvu.type 0/5 0/0 ONLINE ONLINE rac1
ora.gsd ora.gsd.type 0/5 0/ OFFLINE OFFLINE
ora....network ora....rk.type 0/5 0/ ONLINE ONLINE rac1
ora.oc4j ora.oc4j.type 0/1 0/2 ONLINE ONLINE rac1
ora.ons ora.ons.type 0/3 0/ ONLINE ONLINE rac1
ora....SM1.asm application 0/5 0/0 ONLINE ONLINE rac1
ora....C1.lsnr application 0/5 0/0 ONLINE ONLINE rac1
ora.rac1.gsd application 0/5 0/0 OFFLINE OFFLINE
ora.rac1.ons application 0/3 0/0 ONLINE ONLINE rac1
ora.rac1.vip ora....t1.type 0/0 0/0 ONLINE ONLINE rac1
ora....SM2.asm application 0/5 0/0 ONLINE ONLINE rac2
ora....C2.lsnr application 0/5 0/0 ONLINE ONLINE rac2
ora.rac2.gsd application 0/5 0/0 OFFLINE OFFLINE
ora.rac2.ons application 0/3 0/0 ONLINE ONLINE rac2
ora.rac2.vip ora....t1.type 0/0 0/0 ONLINE ONLINE rac2
ora.scan1.vip ora....ip.type 0/0 0/0 ONLINE ONLINE rac1
ora.*.db ora....se.type 0/2 0/1 ONLINE ONLINE rac1
real 0m27.927s
user 0m17.025s
sys 0m10.712s
Symptom 2: Zabbix raises alerts, occasionally reporting agent.ping timeouts; at the same time the Zabbix server log is full of errors:
14975:20210730:051954.737 Zabbix agent item "oracle.status_online.process" on host "ORAC-NODE2-" failed: first network error, wait for 15 seconds
14980:20210730:052049.401 resuming Zabbix agent checks on host "ORAC-NODE2": connection restored
14978:20210730:052250.841 Zabbix agent item "oracle.status_offline.process" on host "ORAC-NODE1" failed: first network error, wait for 15 seconds
14979:20210730:052300.585 Zabbix agent item "oracle.status_online.process" on host "ORAC-NODE2" failed: first network error, wait for 15 seconds
14980:20210730:052312.489 resuming Zabbix agent checks on host "ORAC-NODE1": connection restored
14980:20210730:052414.317 resuming Zabbix agent checks on host "ORAC-NODE2": connection restored
14976:20210730:052557.756 Zabbix agent item "oracle.status_online.process" on host "ORAC-NODE2" failed: first network error, wait for 15 seconds
14980:20210730:052647.517 resuming Zabbix agent checks on host "ORAC-NODE2": connection restored
14976:20210730:052750.795 Zabbix agent item "oracle.status_offline.process" on host "ORAC-NODE1" failed: first network error, wait for 15 seconds
14977:20210730:052800.474 Zabbix agent item "oracle.status_online.process" on host "ORAC-NODE2" failed: first network error, wait for 15 seconds
14980:20210730:052812.589 resuming Zabbix agent checks on host "ORAC-NODE1": connection restored
14980:20210730:052911.698 resuming Zabbix agent checks on host "ORAC-NODE2": connection restored
14978:20210730:052918.528 Zabbix agent item "oracle.status_online.process" on host "ORAC-NODE1" failed: first network error, wait for 15 seconds
14979:20210730:052956.146 Zabbix agent item "oracle.status_online.process" on host "ORAC-NODE2" failed: first network error, wait for 15 seconds
14980:20210730:053014.555 resuming Zabbix agent checks on host "ORAC-NODE1": connection restored
14980:20210730:053015.559 resuming Zabbix agent checks on host "ORAC-NODE2": connection restored
14977:20210730:053156.451 Zabbix agent item "oracle.status_offline.process" on host "ORAC-NODE2" failed: first network error, wait for 15 seconds
14980:20210730:053246.504 resuming Zabbix agent checks on host "ORAC-NODE2": connection restored
14978:20210730:053348.958 Zabbix agent item "oracle.status_online.process" on host "ORAC-NODE1" failed: first network error, wait for 15 seconds
14980:20210730:053412.577 resuming Zabbix agent checks on host "RAC-NODE1": connection restored
14975:20210730:053557.316 Zabbix agent item "oracle.status_online.process" on host "ORAC-NODE2-" failed: first network error, wait for 15 seconds
14980:20210730:053649.374 resuming Zabbix agent checks on host "ORAC-NODE2": connection restored
14962:20210730:054521.587 item "ORAC-NODE1:oracle.status_offline.process" became not supported: Timeout while executing a shell script.
14975:20210730:055156.431 Zabbix agent item "oracle.status_offline.process" on host "ORAC-NODE2" failed: first network error, wait for 15 seconds
14980:20210730:055248.282 resuming Zabbix agent checks on host "ORAC-NODE2": connection restored
14963:20210730:055347.878 item "ORAC-NODE1:oracle.status_offline.process" became supported
14975:20210730:055358.595 Zabbix agent item "oracle.status_offline.process" on host "ORAC-NODE2" failed: first network error, wait for 15 seconds
14980:20210730:055446.042 resuming Zabbix agent checks on host "ORAC-NODE2": connection restored
14965:20210730:055518.943 item "ORAC-NODE1:oracle.status_online.process" became not supported: Timeout while executing a shell script.
14964:20210730:055519.942 item "ORAC-NODE1:oracle.status_offline.process" became not supported: Timeout while executing a shell script
The monitoring graphs also show occasional gaps.
Analysis:
1. Check the Oracle alert logs
[grid@rac2 trace]$ pwd
/u01/app/grid/diag/asm/+asm/+ASM2/trace
[grid@rac2 trace]$ tail -n 100 alert_+ASM2.log
[oracle@rac2 trace]$ pwd
/u01/app/oracle/diag/rdbms/*/*2/trace
[oracle@rac2 trace]$ tail -n 100 alert_*.log
No relevant errors were found in either log.
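The manual tail-and-eyeball check above can be condensed into a small helper that greps the recent tail of each alert log for common Oracle error prefixes. This is a minimal sketch; the ORA-/TNS-/CRS- patterns are just the usual prefixes, and the example paths are the ones from this environment:

```shell
#!/bin/bash
# scan_alert_log FILE: print any ORA-/TNS-/CRS- errors from the last
# 1000 lines of an alert log (older entries are not relevant here).
scan_alert_log() {
    tail -n 1000 "$1" | grep -E 'ORA-[0-9]+|TNS-[0-9]+|CRS-[0-9]+'
}

# Typical use against this environment's logs (paths as shown above):
#   scan_alert_log /u01/app/grid/diag/asm/+asm/+ASM2/trace/alert_+ASM2.log
#   for f in /u01/app/oracle/diag/rdbms/*/*2/trace/alert_*.log; do
#       scan_alert_log "$f"
#   done
```

grep's non-zero exit status doubles as a "nothing found" signal, which matches the conclusion above: both logs came back clean.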
2. Check server performance metrics
[root@rac2 ~]# iostat -x 1
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.76    0.00    0.53    1.22    0.00   92.49
Device: rrqm/s wrqm/s   r/s  w/s    rsec/s  wsec/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda      0.00  19.00  0.00  6.00     0.00  200.00    33.33     0.00  0.00    0.00    0.00  0.00  0.00
up-0     0.00   0.00  2.00  1.00     2.00    1.00     1.00     0.00  0.00    0.00    0.00  0.00  0.00
up-1     0.00   0.00  0.00  0.00     0.00    0.00     0.00     0.00  0.00    0.00    0.00  0.00  0.00
up-2     0.00   0.00  0.00  0.00     0.00    0.00     0.00     0.00  0.00    0.00    0.00  0.00  0.00
up-3     0.00   0.00  0.00  0.00     0.00    0.00     0.00     0.00  0.00    0.00    0.00  0.00  0.00
up-4   237.00   0.00 562.00 1.00 203376.00    5.00   361.25     0.31  0.55    0.55    1.00  0.31 17.70
up-5     0.00   0.00  0.00  0.00     0.00    0.00     0.00     0.00  0.00    0.00    0.00  0.00  0.00
up-6     0.00   0.00  0.00  0.00     0.00    0.00     0.00     0.00  0.00    0.00    0.00  0.00  0.00
up-7     0.00   0.00  0.00  0.00     0.00    0.00     0.00     0.00  0.00    0.00    0.00  0.00  0.00
up-8     0.00   0.00  0.00  0.00     0.00    0.00     0.00     0.00  0.00    0.00    0.00  0.00  0.00
sdb      0.00   0.00  2.00  1.00     2.00    1.00     1.00     0.00  0.00    0.00    0.00  0.00  0.00
sdc      0.00   0.00  2.00  1.00     2.00    1.00     1.00     0.00  0.33    0.50    0.00  0.33  0.10
sdd      0.00   0.00  2.00  1.00     2.00    1.00     1.00     0.00  0.33    0.50    0.00  0.33  0.10
sde      0.00   0.00 1035.00 3.00 385232.00   39.00   371.17     0.55  0.58    0.57    1.67  0.31 32.40
sdf      0.00   0.00 1082.00 3.00 396736.00   39.00   365.69     0.59  0.60    0.60    1.00  0.32 34.30
sdg      0.00   0.00 1049.00 1.00 384224.00   32.00   365.96     0.55  0.56    0.56    1.00  0.32 33.30
sdh      0.00   0.00  0.00  3.00     0.00   39.00    13.00     0.00  2.00    0.00    2.00  0.33  0.10
sdi      0.00   0.00  0.00  0.00     0.00    0.00     0.00     0.00  0.00    0.00    0.00  0.00  0.00
sdj      0.00   0.00  0.00  0.00     0.00    0.00     0.00     0.00  0.00    0.00    0.00  0.00  0.00
up-10    0.00   0.00  0.00  0.00     0.00    0.00     0.00     0.00  0.00    0.00    0.00  0.00  0.00
up-12  233.00   0.00 490.00 3.00 184336.00   39.00   373.99     0.27  0.56    0.55    1.67  0.32 16.00
up-14  237.00   0.00 563.00 0.00 203744.00    0.00   361.89     0.31  0.55    0.55    0.00  0.33 18.70
up-16    0.00   0.00  0.00  0.00     0.00    0.00     0.00     0.00  0.00    0.00    0.00  0.00  0.00
up-19    0.00   0.00  2.00  1.00     2.00    1.00     1.00     0.00  0.33    0.50    0.00  0.33  0.10
up-21  244.00   0.00 545.00 0.00 200896.00    0.00   368.62     0.33  0.60    0.60    0.00  0.34 18.40
up-23  225.00   0.00 486.00 1.00 180480.00   32.00   370.66     0.29  0.59    0.59    1.00  0.35 17.20
up-25    0.00   0.00  0.00  0.00     0.00    0.00     0.00     0.00  0.00    0.00    0.00  0.00  0.00
up-27    0.00   0.00  0.00  0.00     0.00    0.00     0.00     0.00  0.00    0.00    0.00  0.00  0.00
up-29    0.00   0.00  2.00  1.00     2.00    1.00     1.00     0.00  0.33    0.50    0.00  0.33  0.10
up-31  242.00   0.00 520.00 2.00 193360.00   34.00   370.49     0.34  0.65    0.65    1.00  0.37 19.30
up-33    0.00   0.00  0.00  3.00     0.00   39.00    13.00     0.01  2.00    0.00    2.00  2.00  0.60
up-35    0.00   0.00  0.00  0.00     0.00    0.00     0.00     0.00  0.00    0.00    0.00  0.00  0.00
dm-0     0.00   0.00  0.00 25.00     0.00  200.00     8.00     0.00  0.00    0.00    0.00  0.00  0.00
dm-1     0.00   0.00  0.00  0.00     0.00    0.00     0.00     0.00  0.00    0.00    0.00  0.00  0.00
dm-2     0.00   0.00  0.00  0.00     0.00    0.00     0.00     0.00  0.00    0.00    0.00  0.00  0.00

[root@rac2 ~]# pidstat -d 1
Linux 2.6.32-642.el6.x86_64 (rac2)  07/30/2021  _x86_64_  (32 CPU)
08:45:30 PM   PID   kB_rd/s  kB_wr/s kB_ccwr/s  Command
08:45:31 PM  7829 137292.31     0.00      0.00  oracle
08:45:31 PM 11890      2.88     0.00      0.00  ocssd.bin
08:45:31 PM 12525     30.77     0.00      0.00  oracle
08:45:31 PM 16723     46.15     0.00      0.00  oracle
08:45:31 PM   PID   kB_rd/s  kB_wr/s kB_ccwr/s  Command
08:45:32 PM  7829 122400.00     0.00      0.00  oracle
08:45:32 PM 11890    836.00     4.00      0.00  ocssd.bin
08:45:32 PM 12089      0.00    32.00      0.00  ologgerd
08:45:32 PM 12385      0.00     4.00      0.00  orarootagent.bi
08:45:32 PM 12525     48.00     0.00      0.00  oracle
08:45:32 PM 16723     32.00     0.00      0.00  oracle
3. Generate and review an Oracle AWR report (using the RAC global report script awrgrpti.sql)
[oracle@rac1 admin]$ ll /u01/app/oracle/product/11.2.0/db_1/rdbms/admin/awrgrpti.sql
-rw-r--r-- 1 oracle oinstall 6444 Jul 24 2011 /u01/app/oracle/product/11.2.0/db_1/rdbms/admin/awrgrpti.sql
4. CRS-4700: The Cluster Time Synchronization Service is in Observer mode
Fix:
[grid@rac1 ~]$ cat /opt/synctime.sh
#!/bin/bash
ntpdate ***
hwclock -w
[grid@rac1 ~]$ cluvfy comp clocksync -verbose
Verifying Clock Synchronization across the cluster nodes
Checking if Clusterware is installed on all nodes...
Check of Clusterware install passed
Checking if CTSS Resource is running on all nodes...
Check: CTSS Resource running on all nodes
  Node Name                             Status
  ------------------------------------  ------------------------
  rac1                                  passed
Result: CTSS resource check passed
Querying CTSS for time offset on all nodes...
Result: Query of CTSS for time offset passed
Check CTSS state started...
Check: CTSS state
  Node Name                             State
  ------------------------------------  ------------------------
  rac1                                  Observer
CTSS is in Observer state. Switching over to clock synchronization checks using NTP
Starting Clock synchronization checks using Network Time Protocol(NTP)...
NTP Configuration file check started...
The NTP configuration file "/etc/ntp.conf" is available on all nodes
NTP Configuration file check passed
Checking daemon liveness...
Check: Liveness for "ntpd"
  Node Name                             Running?
  ------------------------------------  ------------------------
  rac1                                  no
Result: Liveness check failed for "ntpd"
PRVF-5494 : The NTP Daemon or Service was not alive on all nodes
PRVF-5415 : Check to see if NTP daemon or service is running failed
Result: Clock synchronization check using Network Time Protocol(NTP) failed
PRVF-9652 : Cluster Time Synchronization Services check failed
Verification of Clock Synchronization across the cluster nodes was unsuccessful on all the specified nodes.
[grid@rac1 ~]$ srvctl status listener
Listener LISTENER is enabled
Listener LISTENER is running on node(s): rac2,rac1
[grid@rac1 ~]$ ssh rac2 date;date
Fri Jul 30 21:56:39 * 2021
Fri Jul 30 21:56:39 * 2021
[grid@rac1 ~]$ crsctl check ctss
CRS-4700: The Cluster Time Synchronization Service is in Observer mode.
[grid@rac2 ~]$ crsctl check ctss
CRS-4700: The Cluster Time Synchronization Service is in Observer mode.
[root@rac1 ~]# mv /etc/ntp.conf /etc/ntp.conf.bak
[grid@rac1 ~]$ crsctl check ctss
CRS-4701: The Cluster Time Synchronization Service is in Active mode.
CRS-4702: Offset (in msec): 0
## Run on node 2
[root@rac2 ~]# mv /etc/ntp.conf /etc/ntp.conf.bk
[grid@rac2 ~]$ crsctl check ctss
CRS-4701: The Cluster Time Synchronization Service is in Active mode.
CRS-4702: Offset (in msec): 0
[grid@rac2 ~]$ cluvfy comp clocksync -verbose
Verifying Clock Synchronization across the cluster nodes
Checking if Clusterware is installed on all nodes...
Check of Clusterware install passed
Checking if CTSS Resource is running on all nodes...
Check: CTSS Resource running on all nodes
  Node Name                             Status
  ------------------------------------  ------------------------
  rac2                                  passed
Result: CTSS resource check passed
Querying CTSS for time offset on all nodes...
Result: Query of CTSS for time offset passed
Check CTSS state started...
Check: CTSS state
  Node Name                             State
  ------------------------------------  ------------------------
  rac2                                  Active
CTSS is in Active state. Proceeding with check of clock time offsets on all nodes...
Reference Time Offset Limit: 1000.0 msecs
Check: Reference Time Offset
  Node Name     Time Offset               Status
  ------------  ------------------------  ------------------------
  rac2          0.0                       passed
Time offset is within the specified limits on the following set of nodes: "[rac2]"
Result: Check of clock time offsets passed
Oracle Cluster Time Synchronization Services check passed
Verification of Clock Synchronization across the cluster nodes was successful.

[grid@rac1 ~]$ time srvctl status asm -a   ## execution time is still unchanged
ASM is running on rac2,rac1
ASM is enabled.
real 0m32.048s
user 0m20.051s
sys 0m11.758s
5. Check the listener log
[oracle@rac2 ~]$ tail -n 100 /u01/app/grid/diag/tnslsnr/rac2/listener/alert/log.xml
 host_addr='***'>
 <txt>31-JUL-2021 10:16:55 * (CONNECT_DATA=(CID=(PROGRAM=JDBC Thin Client)(HOST=__jdbc__)(USER=root))(service_name=***)) * (ADDRESS=(PROTOCOL=tcp)(HOST=***)(PORT=49407)) * establish * *** * 0
 </txt>
</msg>
<msg time='2021-07-31T10:17:15.788+08:00' org_id='oracle' comp_id='tnslsnr'
 type='UNKNOWN' level='16' host_id='rac2' host_addr='***'>
 <txt>31-JUL-2021 10:17:15 * service_update * **** * 0
 </txt>
6. Trace the command with strace
[grid@rac1 ~]$ strace crs_stat -t -v
getcwd("/home/grid", 4096) = 11
chdir("/u01/app/11.2.0/grid/log/rac1/client") = 0
getcwd("/u01/app/11.2.0/grid/log/rac1/client", 4096) = 37
chdir("/home/grid") = 0
stat("/u01/app/11.2.0/grid/log/rac1/client/clsc105036.log", {st_mode=S_IFREG|0644, st_size=262, ...}) = 0
stat("/u01/app/11.2.0/grid/log/rac1/client/clsc105036.log", {st_mode=S_IFREG|0644, st_size=262, ...}) = 0
access("/u01/app/11.2.0/grid/log/rac1/client/clsc105036.log", F_OK) = 0
statfs("/u01/app/11.2.0/grid/log/rac1/client/clsc105036.log", {f_type="EXT2_SUPER_MAGIC", f_bsize=4096, f_blocks=12868767, f_bfree=5387069, f_bavail=4731709, f_files=3276800, f_ffree=2483847, f_fsid={-1532779627, -1637007972}, f_namelen=255, f_frsize=4096}) = 0
open("/u01/app/11.2.0/grid/log/rac1/client/clsc105036.log", O_RDONLY) = 3
close(3) = 0
getcwd("/home/grid", 4096) = 11
chdir("/u01/app/11.2.0/grid/log/rac1/client") = 0
getcwd("/u01/app/11.2.0/grid/log/rac1/client", 4096) = 37
chdir("/home/grid") = 0
stat("/u01/app/11.2.0/grid/log/rac1/client/clsc105037.log", {st_mode=S_IFREG|0644, st_size=262, ...}) = 0
stat("/u01/app/11.2.0/grid/log/rac1/client/clsc105037.log", {st_mode=S_IFREG|0644, st_size=262, ...}) = 0
access("/u01/app/11.2.0/grid/log/rac1/client/clsc105037.log", F_OK) = 0
statfs("/u01/app/11.2.0/grid/log/rac1/client/clsc105037.log", {f_type="EXT2_SUPER_MAGIC", f_bsize=4096, f_blocks=12868767, f_bfree=5387069, f_bavail=4731709, f_files=3276800, f_ffree=2483847, f_fsid={-1532779627, -1637007972}, f_namelen=255, f_frsize=4096}) = 0
open("/u01/app/11.2.0/grid/log/rac1/client/clsc105037.log", O_RDONLY) = 3
close(3) = 0
getcwd("/home/grid", 4096) = 11
chdir("/u01/app/11.2.0/grid/log/rac1/client") = 0
getcwd("/u01/app/11.2.0/grid/log/rac1/client", 4096) = 37
chdir("/home/grid") = 0
stat("/u01/app/11.2.0/grid/log/rac1/client/clsc105038.log", {st_mode=S_IFREG|0644, st_size=262, ...}) = 0
stat("/u01/app/11.2.0/grid/log/rac1/client/clsc105038.log", {st_mode=S_IFREG|0644, st_size=262, ...}) = 0
access("/u01/app/11.2.0/grid/log/rac1/client/clsc105038.log", F_OK) = 0
statfs("/u01/app/11.2.0/grid/log/rac1/client/clsc105038.log", {f_type="EXT2_SUPER_MAGIC", f_bsize=4096, f_blocks=12868767, f_bfree=5387069, f_bavail=4731709, f_files=3276800, f_ffree=2483847, f_fsid={-1532779627, -1637007972}, f_namelen=255, f_frsize=4096}) = 0
open("/u01/app/11.2.0/grid/log/rac1/client/clsc105038.log", O_RDONLY) = 3
close(3) = 0
getcwd("/home/grid", 4096) = 11
chdir("/u01/app/11.2.0/grid/log/rac1/client") = 0
getcwd("/u01/app/11.2.0/grid/log/rac1/client", 4096) = 37
chdir("/home/grid") = 0
stat("/u01/app/11.2.0/grid/log/rac1/client/clsc105039.log", {st_mode=S_IFREG|0644, st_size=262, ...}) = 0
stat("/u01/app/11.2.0/grid/log/rac1/client/clsc105039.log", {st_mode=S_IFREG|0644, st_size=262, ...}) = 0
access("/u01/app/11.2.0/grid/log/rac1/client/clsc105039.log", F_OK) = 0
statfs("/u01/app/11.2.0/grid/log/rac1/client/clsc105039.log", {f_type="EXT2_SUPER_MAGIC", f_bsize=4096, f_blocks=12868767, f_bfree=5387069, f_bavail=4731709, f_files=3276800, f_ffree=2483847, f_fsid={-1532779627, -1637007972}, f_namelen=255, f_frsize=4096}) = 0
open("/u01/app/11.2.0/grid/log/rac1/client/clsc105039.log", O_RDONLY) = 3
close(3) = 0
getcwd("/home/grid", 4096) = 11
chdir("/u01/app/11.2.0/grid/log/rac1/client") = 0
getcwd("/u01/app/11.2.0/grid/log/rac1/client", 4096) = 37
chdir("/home/grid") = 0
stat("/u01/app/11.2.0/grid/log/rac1/client/clsc105040.log", {st_mode=S_IFREG|0644, st_size=262, ...}) = 0
stat("/u01/app/11.2.0/grid/log/rac1/client/clsc105040.log", {st_mode=S_IFREG|0644, st_size=262, ...}) = 0
access("/u01/app/11.2.0/grid/log/rac1/client/clsc105040.log", F_OK) = 0
statfs("/u01/app/11.2.0/grid/log/rac1/client/clsc105040.log", {f_type="EXT2_SUPER_MAGIC", f_bsize=4096, f_blocks=12868767, f_bfree=5387069, f_bavail=4731709, f_files=3276800, f_ffree=2483847, f_fsid={-1532779627, -1637007972}, f_namelen=255, f_frsize=4096}) = 0
open("/u01/app/11.2.0/grid/log/rac1/client/clsc105040.log", O_RDONLY) = 3
close(3) = 0
getcwd("/home/grid", 4096) = 11
chdir("/u01/app/11.2.0/grid/log/rac1/client") = 0
getcwd("/u01/app/11.2.0/grid/log/rac1/client", 4096) = 37
chdir("/home/grid") = 0
stat("/u01/app/11.2.0/grid/log/rac1/client/clsc105041.log", {st_mode=S_IFREG|0644, st_size=262, ...}) = 0
stat("/u01/app/11.2.0/grid/log/rac1/client/clsc105041.log", {st_mode=S_IFREG|0644, st_size=262, ...}) = 0
access("/u01/app/11.2.0/grid/log/rac1/client/clsc105041.log", F_OK) = 0
statfs("/u01/app/11.2.0/grid/log/rac1/client/clsc105041.log", {f_type="EXT2_SUPER_MAGIC", f_bsize=4096, f_blocks=12868767, f_bfree=5387069, f_bavail=4731709, f_files=3276800, f_ffree=2483847, f_fsid={-1532779627, -1637007972}, f_namelen=255, f_frsize=4096}) = 0
open("/u01/app/11.2.0/grid/log/rac1/client/clsc105041.log", O_RDONLY) = 3
close(3) = 0
getcwd("/home/grid", 4096) = 11
^Cchdir("/u01/app

[grid@rac1 ~]$ tail -n 100 /u01/app/11.2.0/grid/log/rac1/client/clsc105036.log
Oracle Database 11g Clusterware Release 11.2.0.4.0 - Production
Copyright 1996, 2011 Oracle. All rights reserved.
2021-04-28 08:01:57.908: [ CRSCOMM][3588262272] NAME: `UI_DATA` length=7
2021-04-28 08:01:57.908: [ CRSCOMM][3588262272] Successfully read response
The trace log above makes the problem obvious.
Solution:
The command issues a huge number of calls like access("/u01/app/11.2.0/grid/log/rac1/client/clsc105041.log", F_OK) = 0, so examine that directory:
[grid@rac1 client]$ ll |wc -l
576583
[grid@rac1 client]$ du -sh
2.3G .
[grid@rac1 client]$ ll /u01/app/11.2.0/grid/log/rac1/client/clsc105037.log
-rw-r--r-- 1 zabbix zabbix 262 Apr 28 08:02 /u01/app/11.2.0/grid/log/rac1/client/clsc105037.log
[grid@rac1 client]$ ll clsc*.log|wc -l
576561
[root@rac1 client]# find -type f -mtime -1|wc -l
2328
[root@rac1 client]# ll clsc575437.log
-rw-r--r-- 1 zabbix zabbix 262 Aug 1 10:16 clsc575437.log
[root@rac1 ~]# df -i
Filesystem                            Inodes   IUsed    IFree IUse% Mounted on
/dev/mapper/vgnode110102723-lv_root  3276800  793009  2483791  25% /
tmpfs                                1000000    1024   998976   1% /dev/shm
/dev/sda1                             128016      43   127973   1% /boot
/dev/mapper/vg_node110102723-lv_home 13926400     95 13926305   1% /home
[root@rac1 client]# find -amin -20
./clsc576616.log
./clsc576613.log
./clsc576615.log
./clsc576610.log
./clsc576614.log
./clsc576609.log
./clsc576611.log
./clsc576612.log
[root@rac1 client]# ll -h clsc576612.log
-rw-r--r-- 1 zabbix zabbix 262 Aug 1 22:31 clsc576612.log
[root@rac1 client]# ll clsc5766*.log |wc -l
34
You have mail in /var/spool/mail/root
There are a huge number of clsc*.log files, all owned by user and group zabbix. This points to a Zabbix monitoring item writing these files, and the creation rate matches the monitoring interval: one file per minute.
On another, healthy RAC cluster, this directory does not accumulate files like this.
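To catch this kind of buildup before it slows down the Clusterware tools, the file count in the client log directory can be checked against a threshold, e.g. from a monitoring script. A minimal sketch; the directory path and the threshold value are assumptions taken from this environment:

```shell
#!/bin/bash
# check_clsc_count DIR THRESHOLD: warn when DIR accumulates too many
# clsc*.log files (the Clusterware client trace files seen above).
check_clsc_count() {
    count=$(find "$1" -maxdepth 1 -name 'clsc*.log' -type f | wc -l)
    count=$((count))   # normalize wc output (strip any padding)
    if [ "$count" -gt "$2" ]; then
        echo "WARNING: $count clsc*.log files in $1 (threshold $2)"
    else
        echo "OK: $count clsc*.log files in $1"
    fi
}

# Example (path and threshold are assumptions for this environment):
#   check_clsc_count /u01/app/11.2.0/grid/log/rac1/client 10000
```

Using `find` rather than `ls clsc*.log | wc -l` avoids the argument-list limit that a glob over half a million files would hit.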
[root@rac1 client]# pwd
/u01/app/11.2.0/grid/log/rac1/client
[root@rac1 client]# rm -f clsc5*.log
[root@rac1 client]# ll |wc -l
[grid@rac1 ~]$ time crs_stat -t -v
Name           Type           R/RA   F/FT   Target    State     Host
----------------------------------------------------------------------
ora.DATA.dg    ora....up.type 0/5    0/     ONLINE    ONLINE    rac1
ora.FRA.dg     ora....up.type 0/5    0/     ONLINE    ONLINE    rac1
ora....ER.lsnr ora....er.type 0/5    0/     ONLINE    ONLINE    rac1
ora....N1.lsnr ora....er.type 0/5    0/0    ONLINE    ONLINE    rac1
ora.OCR.dg     ora....up.type 0/5    0/     ONLINE    ONLINE    rac1
ora.asm        ora.asm.type   0/5    0/     ONLINE    ONLINE    rac1
ora.***.db     ora....se.type 0/2    0/1    OFFLINE   OFFLINE
ora.***.db     ora....se.type 0/2    0/1    ONLINE    ONLINE    rac1
ora.cvu        ora.cvu.type   0/5    0/0    ONLINE    ONLINE    rac1
ora.gsd        ora.gsd.type   0/5    0/     OFFLINE   OFFLINE
ora....network ora....rk.type 0/5    0/     ONLINE    ONLINE    rac1
ora.oc4j       ora.oc4j.type  0/1    0/2    ONLINE    ONLINE    rac1
ora.ons        ora.ons.type   0/3    0/     ONLINE    ONLINE    rac1
ora....SM1.asm application    0/5    0/0    ONLINE    ONLINE    rac1
ora....C1.lsnr application    0/5    0/0    ONLINE    ONLINE    rac1
ora.rac1.gsd   application    0/5    0/0    OFFLINE   OFFLINE
ora.rac1.ons   application    0/3    0/0    ONLINE    ONLINE    rac1
ora.rac1.vip   ora....t1.type 0/0    0/0    ONLINE    ONLINE    rac1
ora....SM2.asm application    0/5    0/0    ONLINE    ONLINE    rac2
ora....C2.lsnr application    0/5    0/0    ONLINE    ONLINE    rac2
ora.rac2.gsd   application    0/5    0/0    OFFLINE   OFFLINE
ora.rac2.ons   application    0/3    0/0    ONLINE    ONLINE    rac2
ora.rac2.vip   ora....t1.type 0/0    0/0    ONLINE    ONLINE    rac2
ora.scan1.vip  ora....ip.type 0/0    0/0    ONLINE    ONLINE    rac1
ora.***.db     ora....se.type 0/2    0/1    ONLINE    ONLINE    rac1
real 0m0.049s
user 0m0.014s
sys 0m0.008s
For now, the problem is resolved:
Deleted the offending files: clsc*.log
Re-running the command now takes only real 0m0.049s.
[grid@rac1 ~]$ strace -tt -T -v -o /tmp/strace_crs_20210801.log crs_stat -t -v
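With the strace output saved to a file as above, the dominant system calls can be summarized instead of eyeballed. A minimal awk sketch; it assumes the `-tt` line layout shown (a timestamp in field 1, then the syscall in field 2):

```shell
#!/bin/bash
# summarize_strace FILE: count syscall frequency in a saved
# `strace -tt -T` log, most frequent first.
summarize_strace() {
    awk '{
        split($2, parts, "(")   # syscall name precedes the first "("
        counts[parts[1]]++
    }
    END {
        for (name in counts) print counts[name], name
    }' "$1" | sort -rn
}

# Example: summarize_strace /tmp/strace_crs_20210801.log
```

On a trace like the one above, the stat/access/statfs/open family dominating the counts would point straight at the filesystem churn.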
After re-enabling Zabbix monitoring, files are still generated at one per minute, so the root cause has not been fixed, only the symptom. If no root-cause fix can be found, a scheduled task can locate and delete files of this type.
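The scheduled-task workaround mentioned above can be sketched as a small cleanup function run from cron. The directory path is this environment's; the one-day retention and hourly schedule are assumptions:

```shell
#!/bin/bash
# clean_clsc_logs DIR: delete clsc*.log files older than one day
# under DIR. Safe to run repeatedly, e.g. hourly from cron.
clean_clsc_logs() {
    find "$1" -maxdepth 1 -name 'clsc*.log' -type f -mtime +1 -delete
}

# Example crontab entry (schedule and script path are assumptions),
# where the script calls clean_clsc_logs on the client log directory:
#   0 * * * * /opt/clean_clsc_logs.sh
# with: clean_clsc_logs /u01/app/11.2.0/grid/log/rac1/client
```

`find ... -delete` sidesteps the argument-list limit that `rm clsc*.log` would hit once the directory again holds hundreds of thousands of files.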
Open questions:
1. Why does this particular RAC cluster's Zabbix monitoring of Oracle generate so many files? (Monitored items: 1521, ora_pmon, asm.process, session_counts, etc.)
2. A review of the Zabbix configuration found no special settings for this Oracle environment — is something unusual there after all?
3. Are there any unusual Oracle environment variables or parameters?