Using Oracle Grid to Provide High Availability for GoldenGate or Other Third-Party Applications

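1. Overview

Oracle Grid provides high availability not only for Oracle Database itself but also for third-party applications.

For example, it can protect logical replication tools such as OGG and SharePlex, as well as applications such as Apache.

There are two main ways to have Oracle Grid manage a third-party application: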

  1. Oracle Grid Infrastructure Agents (XAG)
  2. Third-Party Script

Official documentation:

    Clusterware Administration and Deployment Guide
    Third-Party Applications Using the Script Agent

MOS reference:

    Oracle_GoldenGate_Best_Practices_-_Oracle_GoldenGate_high_availability_using_Oracle_Clusterware_v8_6_ID1313703_1_.pdf

Log locations for third-party applications:

    Oracle Grid 11.2, when the resource is added as the oracle user:
    $GRID_HOME/log/{node_name}/agent/crsd/scriptagent_oracle
    From 12c onward, Grid logs also use the standard ADR directory layout:
    $GRID_BASE/diag/crs/crs/agent/scriptagent_oracle.trc
    # If the resource is added as the grid user, the directory or log name uses scriptagent_grid instead.
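
The log locations above can be captured in a small helper so that troubleshooting always starts from the right file. This is only a sketch: the function name `scriptagent_log_path` and its three arguments are hypothetical, and the paths are simply the ones listed above, so verify them against your own Grid installation.

```shell
#!/bin/sh
# scriptagent_log_path VERSION OWNER NODE  (hypothetical helper)
#   VERSION: Grid major version as an integer, e.g. 11 or 12
#   OWNER:   OS user that added the resource (oracle or grid)
#   NODE:    node name, only used for the 11.2 layout
# Relies on GRID_HOME / GRID_BASE being set by the caller.
scriptagent_log_path () {
    version=$1
    owner=$2
    node=$3
    if [ "$version" -le 11 ]; then
        # 11.2 layout: per-node log directory under GRID_HOME
        echo "${GRID_HOME}/log/${node}/agent/crsd/scriptagent_${owner}"
    else
        # 12c and later: ADR layout under GRID_BASE (path as given in this article)
        echo "${GRID_BASE}/diag/crs/crs/agent/scriptagent_${owner}.trc"
    fi
}
```

For example, `scriptagent_log_path 11 oracle "$(hostname)"` prints the 11.2 location for a resource owned by oracle.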

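2. Managing a Third-Party Script with Grid

The test below uses the third-party script approach to provide high availability. For the XAG approach, refer to the official documentation.

Deployment steps at a glance:

  1. Configure an application VIP (this is not a RAC VIP; it exists only for the application). It gives clients a single IP address so that failover is transparent to them.
  2. Deploy the third-party script that starts and stops GoldenGate.
  3. Register the resource with crsctl and configure its permissions.
  4. Test high availability.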

2.1 Configuring the VIP

    (1) Log in as root
    # appvipcfg create -network=1 \
        -ip=192.168.204.242 \
        -vipname=czhvip \
        -user=root
    (2) View the VIP configuration
    # crsctl stat res -p |grep -ie .network -ie subnet |grep -ie name -ie subnet
    (3) Log in as root
    # crsctl setperm resource czhvip -u user:oracle:r-x
    -- This grants the oracle user permission to use the resource. The IP resource must be owned by root: other users cannot configure an IP address, so the VIP resource would fail to start.

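2.2 Deploying OGG

OGG installation is not covered in detail here. Any of the following layouts can be used:

  1. Use ACFS as the shared disk, with both the OGG software and its dir* directories on the ACFS file system. For supported ACFS versions and patches, see:
     ACFS Support On OS Platforms (Certification Matrix). (Doc ID 1369107.1).pdf
  2. Keep only the GoldenGate trail files and similar data on ACFS and install the OGG software on a local operating system mount point. Then create symbolic links at the corresponding local paths that point to the dir* directories on ACFS:
     $ ln -s /acfs_mount_point/dirdat dirdat
  3. Use a cluster file system such as ocfs2 or gpfs.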
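
For the second layout, the symbolic links can be created in a loop rather than one by one. The sketch below assumes a hypothetical helper named `link_ogg_dirs`; which dir* directories you move to shared storage is your decision, and dirdat and dirchk in the usage line are only examples.

```shell
#!/bin/sh
# link_ogg_dirs ACFS_DIR OGG_HOME DIR...  (hypothetical helper)
# For each DIR, make sure ACFS_DIR/DIR exists and link OGG_HOME/DIR to it.
# Existing files or links in OGG_HOME are left untouched.
link_ogg_dirs () {
    acfs_dir=$1
    ogg_home=$2
    shift 2
    for d in "$@"; do
        mkdir -p "${acfs_dir}/${d}" || return 1
        if [ -e "${ogg_home}/${d}" ] || [ -L "${ogg_home}/${d}" ]; then
            echo "skipping ${d}: ${ogg_home}/${d} already exists" >&2
            continue
        fi
        ln -s "${acfs_dir}/${d}" "${ogg_home}/${d}" || return 1
    done
}
```

A call such as `link_ogg_dirs /acfs_mount_point "$GGS_HOME" dirdat dirchk` would be run once on each node, as the OGG software owner, before the directories are populated.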

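2.2 Notes on the Action Script

The script below is only an example. A real script should add logic for the specific application; the check action, for instance, needs to determine the actual process state.

Entry points of a Grid third-party script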

1. A Grid 11.2 script must handle start/stop/clean/check/abort.

-- Example script

    #!/bin/sh
    case $1 in
    'start')
        echo $(date)' start'>>/tmp/crs.log
        exit 0
        ;;
    'stop')
        echo $(date)' stop'>>/tmp/crs.log
        exit 0
        ;;
    'clean')
        echo $(date)' clean'>>/tmp/crs.log
        echo $?'clean' >>/tmp/crs.log
        exit 0
        ;;
    'check')
        echo "CHECK entry point has been called.."
        echo $(date)' check'>>/tmp/crs.log
        exit 0
        ;;
    'abort')
        echo $(date)' abort'>>/tmp/crs.log
        exit 0
        ;;
    esac

2. Entry points

-- Only the two entry points introduced in 11gR2 are described here.
-- Versions from 12c onward introduce more entry points, which can be seen in the startup log.

CLEAN
Clean was introduced with Oracle Clusterware 11g Release 2. It will not be used for Oracle Clusterware 10g Release 2 or 11g Release 1. Clean is called when there is a need to clean up the resource. It is a non-graceful operation.

ABORT
Abort was introduced with Oracle Clusterware 11g Release 2. It will not be used for Oracle Clusterware 10g Release 2 or 11g Release 1. Abort is called if any of the resource components hang to abort the ongoing action. Abort is not required to be included.
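
3. Variables used in the script

If the programs launched by the start/stop/clean/check/abort actions depend on environment variables, define those variables inside the script. For example:

(1) If an OGG extract connects to the database for capture through the local ORACLE_SID rather than through a TNS alias, then ggsci> start extract depends on the ORACLE_SID environment variable. In that case, ORACLE_SID and ORACLE_HOME must be defined in the script above. Because the VIP is owned by root, when the OGG resource has a hard dependency on the VIP and Grid starts it, the script sees only root's environment, not the oracle user's, and the resource fails to start properly.
(2) Therefore, define every required environment variable in the script itself and do not rely on variables inherited from outside. Otherwise the resource may fail to start, or may start without launching the expected processes, and the cause is hard to track down.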
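
One way to make this rule hard to forget is to have the action script verify its own environment before it does anything else. The following is a minimal sketch, not part of the Oracle example: `require_env` is a hypothetical helper, and the paths in the commented usage lines are placeholders.

```shell
#!/bin/sh
# require_env NAME...  (hypothetical helper)
# Return 1, naming the first offender on stderr, if any of the given
# variables is unset or empty; return 0 otherwise.
require_env () {
    for name in "$@"; do
        # Indirect lookup of the variable whose name is stored in $name
        eval "value=\${${name}:-}"
        if [ -z "$value" ]; then
            echo "missing environment variable: ${name}" >&2
            return 1
        fi
    done
    return 0
}

# Intended use near the top of an action script (placeholder values):
#   ORACLE_HOME=/path/to/database/home
#   ORACLE_SID=mysid
#   export ORACLE_HOME ORACLE_SID
#   require_env ORACLE_HOME ORACLE_SID || exit 1
```

Failing early with a named variable leaves a clear message in the script agent log instead of an OCI error from extract later on.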

2.3 OGG High-Availability Script

How OGG connects to ASM depends on the versions involved:

    If the redo logs are stored in ASM, configure the Capture ASM connection as follows.
    Oracle versions earlier than 10.2.0.5 or 11.2.0.2:
    TRANLOGOPTIONS ASMUSER sys@asminst, asmpassword oracle
    Oracle 10.2.0.5, 11.2.0.2 or later, with GoldenGate 11g or later:
    TRANLOGOPTIONS DBLOGREADER
    If the database redo logs on AIX are on RAW devices, the TRANLOGOPTIONS RAWDEVICEOFFSET
    parameter may also be needed. Set it as follows:
    TRANLOGOPTIONS RAWDEVICEOFFSET 0
    Other platforms do not need this parameter.

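The script below applies when ASM is not used, or when Oracle is 10.2.0.5, 11.2.0.2 or later. Earlier versions, which need the ORACLE_SID of the ASM instance, require special handling.

For the complete example, see Oracle_GoldenGate_Best_Practices_-_Oracle_GoldenGate_high_availability_using_Oracle_Clusterware_v8_6_ID1313703_1_.pdf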

    #!/bin/sh
    # goldengate_action.scr
    # Load the oracle user's environment. The profile must define the required
    # variables; otherwise OGG cannot read ORACLE_SID at startup and starting
    # extract fails.
    . ~oracle/.bash_profile
    # Require an action argument: if the first argument is empty, print usage and exit.
    [ -z "$1" ] && echo "ERROR!! Usage $0 <start|stop|check|clean|abort>" && exit 99
    # GoldenGate installation directory
    GGS_HOME=<set the path here>
    #specify delay after start before checking for successful start
    start_delay_secs=5
    #Include the Oracle GoldenGate home in the library path to start GGSCI,AIX variable is LIBPATH
    export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${GGS_HOME}
    #set the oracle home to the database to ensure Oracle GoldenGate will get
    #the right environment settings to be able to connect to the database
    export ORACLE_HOME=<set the ORACLE_HOME path here>
    export CRS_HOME=<set the CRS_HOME path here>
    #Set NLS_LANG otherwise it will default to US7ASCII
    export NLS_LANG=American_America.US7ASCII
    logfile=/tmp/crs_gg_start.log
    # -f: do not complain if the log file does not exist yet
    rm -f ${logfile}
    # log: write a timestamped message to stdout and to the log file.
    # Declared without the "function" keyword so that it also works in a plain POSIX sh.
    log ()
    {
        DATETIME=`date +%d/%m/%y-%H:%M:%S`
        echo $DATETIME "goldengate_action.scr>>" $1
        echo $DATETIME "goldengate_action.scr>>" $1 >> $logfile
    }
    # check_process validates that a manager process is running at the PID
    # that Oracle GoldenGate specifies. It exits the script with 0 if the
    # manager is running and 1 otherwise.
    check_process ()
    {
        if [ -f "${GGS_HOME}/dirpcs/MGR.pcm" ]
        then
            pid=`cut -f8 "${GGS_HOME}/dirpcs/MGR.pcm"`
            running_pid=`ps -e | grep "${pid}" | grep mgr | awk '{ print $1 }'`
            # quoted so that an empty result does not break the test
            if [ "${pid}" = "${running_pid}" ]
            then
                #manager process is running on the PID . exit success
                echo "manager process is running on the PID . exit success">> /tmp/check.out
                exit 0
            else
                #manager process is not running on the PID
                echo "manager process is not running on the PID" >> /tmp/check.out
                exit 1
            fi
        else
            #manager is not running because there is no PID file
            echo "manager is not running because there is no PID file" >> /tmp/check.out
            exit 1
        fi
    }
    # call_ggsci is a generic routine that executes a ggsci command
    call_ggsci () {
        log "entering call_ggsci"
        ggsci_command=$1
        #log "about to execute $ggsci_command"
        log "id= $USER"
        cd ${GGS_HOME}
        ggsci_output=`${GGS_HOME}/ggsci << EOF
    ${ggsci_command}
    exit
    EOF
    `
        log "got output of : $ggsci_output"
    }
    case $1 in
    'start')
        #Updated by Sourav B (02/10/2011)
        # During failover if the "MGR.pcm" file is not deleted at the node crash
        # then Oracle clusterware won't start the manager on the new node assuming the
        # manager process is still running on the failed node. To get around this issue
        # we will delete the "MGR.pcm" file before starting up the manager on the new
        # node. We will also delete the other process files with pc* extension and to
        # avoid any file locking issue we will first backup the checkpoint files and then
        # delete them from the dirchk directory. After that we will restore the checkpoint
        # files from backup to the original location (dirchk directory).
        log "removing *.pc* files from dirpcs directory..."
        rm -f $GGS_HOME/dirpcs/*.pc*
        log "creating tmp directory to backup checkpoint file...."
        # -p: do not fail if a previous run left the directory behind
        mkdir -p $GGS_HOME/dirchk/tmp
        log "backing up checkpoint files..."
        cp $GGS_HOME/dirchk/*.cp* $GGS_HOME/dirchk/tmp
        log "Deleting checkpoint files under dirchk......"
        rm -f $GGS_HOME/dirchk/*.cp*
        log "Restore checkpoint files from backup to dirchk directory...."
        cp $GGS_HOME/dirchk/tmp/*.cp* $GGS_HOME/dirchk
        log "Deleting tmp directory...."
        rm -r $GGS_HOME/dirchk/tmp
        log "starting manager"
        call_ggsci 'start manager'
        #there is a small delay between issuing the start manager command
        #and the process being spawned on the OS . wait before checking
        log "sleeping for start_delay_secs"
        sleep ${start_delay_secs}
        #check whether manager is running and exit accordingly
        check_process
        ;;
    'stop')
        #attempt a clean stop for all non-manager processes
        call_ggsci 'stop er *'
        #ensure everything is stopped
        call_ggsci 'stop er *!'
        #stop manager without (y/n) confirmation
        call_ggsci 'stop manager!'
        #exit success
        exit 0
        ;;
    'check')
        #check_process exits with 0 or 1 itself, so nothing follows it
        check_process
        ;;
    'clean')
        #attempt a clean stop for all non-manager processes
        call_ggsci 'stop er *'
        #ensure everything is stopped
        call_ggsci 'stop er *!'
        #in case there are lingering processes
        call_ggsci 'kill er *'
        #stop manager without (y/n) confirmation
        call_ggsci 'stop manager!'
        #exit success
        exit 0
        ;;
    'abort')
        #ensure everything is stopped
        call_ggsci 'stop er *!'
        #in case there are lingering processes
        call_ggsci 'kill er *'
        #stop manager without (y/n) confirmation
        call_ggsci 'stop manager!'
        #exit success
        exit 0
        ;;
    esac
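
Before registering an action script with Grid, it can help to call its entry points by hand and look at the exit codes, since those codes are what Grid acts on. The harness below is a sketch under that assumption; `run_entry_points` is a hypothetical name, and the order of actions is just one reasonable choice. Note that running it against the real OGG script really does start and stop GoldenGate.

```shell
#!/bin/sh
# run_entry_points SCRIPT  (hypothetical helper)
# Call the start, check, stop and clean entry points of SCRIPT in turn and
# print the exit code of each, discarding the script's own output.
run_entry_points () {
    script=$1
    for action in start check stop clean; do
        "$script" "$action" >/dev/null 2>&1
        echo "${action}: exit $?"
    done
}
```

Run it as the OS user that will own the resource, for example `run_entry_points /acfs_mount_point/ogg.sh`, so that permission and environment problems show up before Grid is involved.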

2.4 Adding the OGG Grid Resource with CRSCTL

    # Log in as oracle:
    $ /u01/app/11.2/grid/bin/crsctl add resource oggapp -type cluster_resource -attr "ACTION_SCRIPT='/acfs_mount_point/ogg.sh',CHECK_INTERVAL=30,START_DEPENDENCIES='hard(czhvip) pullup(czhvip)',STOP_DEPENDENCIES='hard(czhvip)'"
    -- The script can also be stored in a local directory that the oracle user can read and execute. If it is stored locally, a copy of the file must exist on every Grid node.
    -- If OGG is installed on ACFS, START_DEPENDENCIES can also declare a hard dependency on ASM.
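
The -attr string is long and easy to mistype, especially the repeated VIP name. As a sketch, it can be assembled by a small function; `build_ogg_attr` is a hypothetical name, and it assumes the start and stop dependencies both refer to the same application VIP, as in the command above.

```shell
#!/bin/sh
# build_ogg_attr ACTION_SCRIPT VIP [CHECK_INTERVAL]  (hypothetical helper)
# Print the attribute string for "crsctl add resource ... -attr".
# CHECK_INTERVAL defaults to 30 seconds, the value used in this article.
build_ogg_attr () {
    action_script=$1
    vip=$2
    interval=${3:-30}
    printf "ACTION_SCRIPT='%s',CHECK_INTERVAL=%s,START_DEPENDENCIES='hard(%s) pullup(%s)',STOP_DEPENDENCIES='hard(%s)'\n" \
        "$action_script" "$interval" "$vip" "$vip" "$vip"
}
```

It would then be used as `crsctl add resource oggapp -type cluster_resource -attr "$(build_ogg_attr /acfs_mount_point/ogg.sh czhvip)"`, followed by `crsctl status resource oggapp` to confirm the registration.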

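With the steps above, the third-party application is now managed by Grid, which is both convenient and practical.

3. Troubleshooting

3.1 Resource Fails to Start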

1. Symptom

    $ crsctl start res czhapp
    CRS-2672: Attempting to start 'czhapp' on 'db-oracle-node1'
    CRS-2674: Start of 'czhapp' on 'db-oracle-node1' failed
    CRS-2679: Attempting to clean 'czhapp' on 'db-oracle-node1'
    CRS-2678: 'czhapp' on 'db-oracle-node1' has experienced an unrecoverable failure
    CRS-0267: Human intervention required to resume its availability.
    CRS-4000: Command Start failed, or completed with errors.
    # If the resource is owned by oracle, the log directory is:
    $GRID_HOME/log/{node_name}/agent/crsd/scriptagent_oracle
    -- Key lines from the log:
    2021-04-26 11:35:07.342: [czhapp][156428032]{1:39006:13462} [clean] Executing action script: /software/crs.sh[clean]
    2021-04-26 11:35:07.397: [ AGFW][156428032]{1:39006:13462} Command: clean for resource: czhapp 1 1 completed with invalid status: 209
    2021-04-26 11:35:07.397: [czhapp][156428032]{1:39006:13462} [check] Executing action script: /software/crs.sh[check]
    2021-04-26 11:35:07.397: [ AGFW][158529280]{1:39006:13462} Agent sending reply for: RESOURCE_CLEAN[czhapp 1 1] ID 4100:717590
    2021-04-26 11:35:07.454: [ AGFW][156428032]{1:39006:13462} Received unknown resource status code: 209
    2021-04-26 11:35:07.455: [ AGFW][158529280]{1:39006:13462} czhapp 1 1 state changed from: CLEANING to: UNKNOWN

2. Analysis

The log shows that the agent found the script. However, output statements placed at specific points in the script showed that it was never actually executed. The root cause turned out to be that the script did not declare its interpreter on the first line.

3. Solution

    #!/bin/sh

-- Scripts should be written properly. I had occasionally left out the declaration line before without any ill effect, so it was a surprise that a script managed by Oracle Grid cannot start without it. It is one more reason to follow conventions.
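
A quick pre-flight check can catch this problem before the resource is registered. The sketch below uses a hypothetical function `has_shebang` that only looks at the first two characters of the file; it does not verify that the named interpreter exists.

```shell
#!/bin/sh
# has_shebang FILE  (hypothetical helper)
# Succeed if the first line of FILE starts with "#!", fail otherwise.
has_shebang () {
    first_line=''
    # read fails on an empty file or a last line without newline; ignore that
    IFS= read -r first_line < "$1" || :
    case $first_line in
        '#!'*) return 0 ;;
        *)     return 1 ;;
    esac
}
```

For example: `has_shebang /software/crs.sh || echo "action script has no interpreter line"`. Combining it with a `[ -x file ]` test also covers a missing execute permission.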

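3.2 OGG Cannot Start Extract

1. Symptom

OCI errors are reported and extract cannot connect to the database.

2. Analysis

    AIX:
    ps -ef|grep goldengate
    ps eauwww <pid>

Inspecting the environment of the process shows that ORACLE_SID is missing.
Because the GoldenGate extract is not configured to connect through a TNS alias, it depends on the ORACLE_SID environment variable of the operating system user that starts it. The VIP resource created by appvipcfg did not grant the oracle user sufficient permissions, so oracle could not start the VIP resource. The VIP resource was therefore started as root, ORACLE_SID was not available in that environment, and extract failed to start.

3. Solution

    -- Log in as root
    # crsctl setperm resource oggvip -u user:oracle:rwx
    -- Log in as oracle and test
    $ crsctl start resource oggvip
    -- If oracle still cannot start the resource, adjust the oggvip permissions further
    -- Log in as root
    -- Setting the permissions for other to rwx resolves the problem
    # crsctl getperm resource oggvip
    # crsctl setperm resource oggvip -u other::rwx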
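
On Linux, the same environment inspection can be scripted by reading /proc instead of using the AIX `ps eauwww` form shown above. This is a sketch under that assumption: `proc_has_var` is a hypothetical helper, it works only where /proc/PID/environ exists, and it must be run as the process owner or as root.

```shell
#!/bin/sh
# proc_has_var PID NAME  (hypothetical helper, Linux only)
# Succeed if process PID was started with variable NAME in its environment.
# /proc/PID/environ holds NUL-separated NAME=value pairs as they were at
# process start, so later changes inside the process are not visible.
proc_has_var () {
    tr '\0' '\n' < "/proc/$1/environ" | grep -q "^$2="
}
```

For example, `proc_has_var <pid> ORACLE_SID || echo "ORACLE_SID is not set for this process"`, where `<pid>` is the GoldenGate process found with `ps -ef|grep goldengate`.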

3.3 appvipcfg Fails to Run

1. Symptom

    # ./appvipcfg create -network=1 \
        -ip=192.168.204.245 \
        -vipname=czhvip \
        -user=root
    /bin/ls: cannot access /ade/ade_88979932/perl/lib: No such file or directory

2. Cause

Patching with opatch changed the contents of appvipcfg. appvipcfg is a script under $GRID_HOME/bin/, not a binary. It defines ORACLE_HOME and ORA_CRS_HOME, and the patch left both variables with incorrect values. Changing them to the correct paths fixes the problem.

    $ cat /u01/app/11.2/grid/bin/appvipcfg
    #!/bin/sh
    #
    # This script is used for managing
    # user mode vip resource.
    #
    # Do not change the line below for ORACLE_HOME setting
    #ORACLE_HOME=/u01/app/11.2/grid
    ORACLE_HOME=/ade/ade_19289128/11.2/grid
    export ORACLE_HOME
    #ORA_CRS_HOME=/u01/app/11.2/grid
    ORA_CRS_HOME=/ade/ade_19289128/11.2/grid
    export ORA_CRS_HOME
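
After patching, this kind of damage can be detected by checking that the home directories assigned inside the script actually exist. The function below is a sketch with a hypothetical name, `check_home_vars`; it only looks at uncommented lines of the form NAME=value at the start of a line, which matches the layout of the file shown above.

```shell
#!/bin/sh
# check_home_vars FILE  (hypothetical helper)
# For ORACLE_HOME and ORA_CRS_HOME, take the last uncommented assignment
# in FILE and verify that it names an existing directory.
# Returns 0 if both are fine, 1 otherwise.
check_home_vars () {
    file=$1
    status=0
    for name in ORACLE_HOME ORA_CRS_HOME; do
        value=$(sed -n "s/^${name}=//p" "$file" | tail -n 1)
        if [ -z "$value" ] || [ ! -d "$value" ]; then
            echo "${name} in ${file} does not name an existing directory: ${value}"
            status=1
        fi
    done
    return $status
}
```

Running `check_home_vars /u01/app/11.2/grid/bin/appvipcfg` against the file shown above would report both variables, because the /ade/... paths do not exist on the server.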