Using Oracle Grid to Provide High Availability for GoldenGate and Other Third-Party Applications
阿新 • Published: 2021-09-05
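1. Overview
Oracle Grid provides high availability not only for Oracle Database itself but also for third-party applications.
This covers logical replication tools such as OGG and SharePlex, as well as applications such as Apache.
There are two main ways to put a third-party application under Oracle Grid control: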
- Oracle Grid Infrastructure Agents
- Third-Party Script
Official documentation:
Clusterware Administration and Deployment Guide
Third-Party Applications Using the Script Agent
MOS (My Oracle Support) reference:
Oracle_GoldenGate_Best_Practices_-_Oracle_GoldenGate_high_availability_using_Oracle_Clusterware_v8_6_ID1313703_1_.pdf
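- Log locations for third-party application resources
In Oracle Grid 11.2, if the resource was added as the oracle user, the agent log is under:
$GRID_HOME/log/{node_name}/agent/crsd/scriptagent_oracle
From 12c onward, Grid logs move to the standard ADR directory layout:
$GRID_BASE/diag/crs/crs/agent/scriptagent_oracle.trc
# If the resource was added as the grid user, the directory or log name is scriptagent_grid instead.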
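When troubleshooting, it helps to compute the agent log directory rather than type it by hand. The following is a minimal sketch for the 11.2 layout described above only. The function name scriptagent_log_dir is an illustrative assumption, and GRID_HOME is expected to be set already.

```shell
#!/bin/sh
# Sketch: print the 11.2 script-agent log directory for a resource owner.
# Arguments: owner (default: oracle), node name (default: short hostname).
# scriptagent_log_dir is a hypothetical helper name, not an Oracle tool.
scriptagent_log_dir() {
    owner=${1:-oracle}
    node=${2:-$(hostname | cut -d. -f1)}
    echo "${GRID_HOME:?GRID_HOME is not set}/log/${node}/agent/crsd/scriptagent_${owner}"
}

# Example (not executed here): ls -lt "$(scriptagent_log_dir oracle)"
```

Passing grid as the owner yields the scriptagent_grid directory instead.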
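2. Managing a Third-Party Script with Grid
The test below uses a third-party script under Grid control to provide high availability. For the XAG approach, refer to the official documentation.
Deployment steps in outline:
- Configure an application VIP (not a RAC VIP; it exists only for the application) so that clients see a single IP address and a switchover is transparent to them.
- Deploy the third-party script that starts and stops GoldenGate.
- Register the resource with crsctl and configure permissions.
- Test high availability.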
2.1 Configure the VIP
(1) Log in as root
# appvipcfg create -network=1 \
-ip=192.168.204.242 \
-vipname=czhvip \
-user=root
(2) Check the VIP network configuration
# crsctl stat res -p |grep -ie .network -ie subnet |grep -ie name -ie subnet
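(3) Log in as root and grant permissions
# crsctl setperm resource czhvip -u user:oracle:r-x
-- This grants the oracle user permission to use the resource. The VIP resource must be owned by root: other users cannot configure IP addresses, so a VIP owned by anyone else fails to start.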
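A mistyped address is easy to miss before the VIP is created, so the steps above can be wrapped in a small pre-check. This is only a sketch: valid_ipv4 and print_vip_commands are hypothetical helper names, network number 1 is taken from the example above, and the function prints the commands for review instead of running them.

```shell
#!/bin/sh
# Returns 0 when $1 looks like a dotted-quad IPv4 address with octets 0-255.
valid_ipv4() {
    echo "$1" | awk -F. 'NF != 4 { exit 1 }
        { for (i = 1; i <= 4; i++) if ($i !~ /^[0-9]+$/ || $i > 255) exit 1 }'
}

# Prints the appvipcfg and crsctl commands from this section; nothing is executed.
# Arguments: VIP address, VIP resource name, OS user that needs r-x on the VIP.
print_vip_commands() {
    vip_ip=$1; vip_name=$2; app_user=$3
    valid_ipv4 "$vip_ip" || { echo "invalid IP: $vip_ip" >&2; return 1; }
    echo "appvipcfg create -network=1 -ip=$vip_ip -vipname=$vip_name -user=root"
    echo "crsctl setperm resource $vip_name -u user:$app_user:r-x"
}
```

Calling print_vip_commands 192.168.204.242 czhvip oracle prints the two commands shown in steps (1) and (3), which can then be run as root.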
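2.2 Deploy OGG
OGG installation is not covered in detail here. The software can be laid out in one of the following ways:
1. Use ACFS as the shared disk, with both the OGG software and its dir* directories stored on the ACFS file system.
For supported ACFS versions and patches, see:
ACFS Support On OS Platforms (Certification Matrix) (Doc ID 1369107.1)
2. Store only the GoldenGate trail files and similar data on ACFS, and keep the OGG software itself on a local OS mount point. Create symbolic links in the local OGG path that point to the corresponding dir* directories on ACFS:
$ ln -s /acfs_mount_point/dirdat dirdat
3. Use another cluster file system such as ocfs2 or GPFS.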
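For the second layout, the symbolic links can be created in a loop rather than one at a time. The sketch below is an assumption about how one might do it, not part of the original procedure: link_ogg_dirs is a hypothetical helper, it copies any existing local content to the shared location first, and it keeps the old directory under a .local.bak name instead of deleting it.

```shell
#!/bin/sh
# Sketch: replace selected dir* subdirectories of an OGG home with symlinks
# into a shared mount. Usage: link_ogg_dirs <ogg_home> <shared_root> <dir>...
link_ogg_dirs() {
    ogg_home=$1; shared_root=$2; shift 2
    for d in "$@"; do
        mkdir -p "$shared_root/$d" || return 1
        if [ -L "$ogg_home/$d" ]; then
            continue                      # already a symlink, leave it alone
        elif [ -d "$ogg_home/$d" ]; then
            # copy existing local content to shared storage, keep a backup
            cp -Rp "$ogg_home/$d/." "$shared_root/$d/" || return 1
            mv "$ogg_home/$d" "$ogg_home/$d.local.bak" || return 1
        fi
        ln -s "$shared_root/$d" "$ogg_home/$d" || return 1
    done
}

# Example (not executed here): link_ogg_dirs /u01/ogg /acfs_mount_point dirdat dirchk
```

All OGG processes should be stopped before the directories are moved.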
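2.2 Action script overview
The script below is only an example. A real script should add logic appropriate to the application; the check entry point, for instance, needs to test the actual process state.
Grid third-party script entry points
1. A Grid 11.2 script handles the start/stop/clean/check/abort entry points
-- Example script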
#!/bin/sh
case $1 in
'start')
echo $(date)' start'>>/tmp/crs.log
exit 0
;;
'stop')
echo $(date)' stop'>>/tmp/crs.log
exit 0
;;
'clean')
echo $(date)' clean'>>/tmp/crs.log
echo $?'clean' >>/tmp/crs.log
exit 0
;;
'check')
echo "CHECK entry point has been called.."
echo $(date)' check'>>/tmp/crs.log
exit 0
;;
'abort')
echo $(date)' abort'>>/tmp/crs.log
exit 0
;;
esac
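2. Entry point notes
-- The two entry points introduced in 11gR2 are described here.
-- 12c and later add more entry points, which can be seen in the agent startup log.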
CLEAN
Clean was introduced with Oracle Clusterware 11g Release 2. It will not be used for Oracle
Clusterware 10g Release 2 or 11g Release 1. Clean is called when there is a need to clean up the
resource. It is a non-graceful operation.
ABORT
Abort was introduced with Oracle Clusterware 11g Release 2. It will not be used for Oracle
Clusterware 10g Release 2 or 11g Release 1. Abort is called if any of the resource components
hang to abort the ongoing action. Abort is not required to be included.
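3. Variables used in the script
If the programs launched by the start/stop/clean/check/abort entry points depend on environment variables, define them in the script. For example:
(1) If an OGG extract connects to the database for capture through the local ORACLE_SID rather than a TNS alias, then ggsci> start extract depends on the ORACLE_SID environment variable. In that case ORACLE_SID and ORACLE_HOME must be defined in the script above. The VIP is owned by root, so when the OGG resource has a hard dependency on the VIP and Grid starts it, only root's environment is available, not the oracle user's, and the resource fails to start properly.
(2) Always define the environment completely inside the script and do not rely on externally set variables. Otherwise failures are hard to diagnose, and the resource may fail to start or may start without launching the application's processes.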
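One way to follow this advice is to set every variable in a function at the top of the action script. The sketch below also picks ORACLE_SID per node, since a script shared across nodes would otherwise use the same SID on every node. All values are placeholders: the SIDs, both paths, and the second node name are assumptions for illustration, and sid_for_node and set_ogg_env are hypothetical names.

```shell
#!/bin/sh
# Sketch: define the OGG environment explicitly instead of inheriting it.
# Every path, SID, and the second node name below are placeholders (assumptions).

# Map a cluster node name to the local instance SID on that node.
sid_for_node() {
    case $1 in
        db-oracle-node1) echo orcl1 ;;   # hypothetical SID
        db-oracle-node2) echo orcl2 ;;   # hypothetical node name and SID
        *) return 1 ;;
    esac
}

# Export everything the OGG processes need; optional argument overrides the node name.
set_ogg_env() {
    node=${1:-$(hostname | cut -d. -f1)}
    ORACLE_SID=$(sid_for_node "$node") || { echo "unknown node: $node" >&2; return 1; }
    ORACLE_HOME=/u01/app/oracle/product/11.2.0/dbhome_1   # hypothetical path
    GGS_HOME=/acfs_mount_point/ogg                        # hypothetical path
    LD_LIBRARY_PATH=$ORACLE_HOME/lib:$GGS_HOME
    export ORACLE_SID ORACLE_HOME GGS_HOME LD_LIBRARY_PATH
}
```

An action script would call set_ogg_env before its case statement and exit non-zero if it fails, so that a node missing from the mapping shows up as a clear start failure.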
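2.3 OGG high-availability script
How OGG connects to ASM depends on the version.
If the redo logs are stored in ASM, configure the Capture ASM connection as follows:
Oracle versions earlier than 10.2.0.5 or 11.2.0.2:
TRANLOGOPTIONS ASMUSER sys@asminst, asmpassword oracle
Oracle 10.2.0.5, 11.2.0.2, or later, with GoldenGate 11g or later:
TRANLOGOPTIONS DBLOGREADER
If the database redo logs are on RAW devices on AIX, the TRANLOGOPTIONS RAWDEVICEOFFSET parameter may also be needed:
TRANLOGOPTIONS RAWDEVICEOFFSET 0
Other platforms do not need this parameter.
The script below is for environments that do not use ASM, or that run Oracle 10.2.0.5, 11.2.0.2, or later. Earlier versions, which need the ASM instance's ORACLE_SID, require special handling.
For the complete example, see Oracle_GoldenGate_Best_Practices_-_Oracle_GoldenGate_high_availability_using_Oracle_Clusterware_v8_6_ID1313703_1_.pdf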
#!/bin/sh
# goldengate_action.scr
# Load the oracle user's environment. The profile must define the variables OGG needs
# (such as ORACLE_SID), otherwise starting an extract fails.
. ~oracle/.bash_profile
# Require an action argument; print usage and exit if it is missing
[ -z "$1" ] && echo "ERROR!! Usage $0 <start|stop|check|clean|abort>" && exit 99
# GoldenGate installation directory
GGS_HOME=<set the path here>
#specify delay after start before checking for successful start
start_delay_secs=5
#Include the Oracle GoldenGate home in the library path to start GGSCI,AIX variable is LIBPATH
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${GGS_HOME}
#set the oracle home to the database to ensure Oracle GoldenGate will get
#the right environment settings to be able to connect to the database
export ORACLE_HOME=<set the ORACLE_HOME path here>
export CRS_HOME=<set the CRS_HOME path here>
#Set NLS_LANG otherwise it will default to US7ASCII
export NLS_LANG=American_America.US7ASCII
logfile=/tmp/crs_gg_start.log
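# remove the previous log; -f avoids an error when the file does not exist yet
rm -f ${logfile}
# log: write a timestamped message to stdout and to the log file
# (POSIX function syntax, because the script runs under /bin/sh)
log ()
{
DATETIME=`date +%d/%m/%y-%H:%M:%S`
echo $DATETIME "goldengate_action.scr>>" $1
echo $DATETIME "goldengate_action.scr>>" $1 >> $logfile
}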
# check_process validates that a manager process is running at the PID
# that Oracle GoldenGate specifies.
check_process ()
{
if [ -f "${GGS_HOME}/dirpcs/MGR.pcm" ]
then
pid=`cut -f8 "${GGS_HOME}/dirpcs/MGR.pcm"`
# quote both sides so an empty result does not break the test
if [ -n "${pid}" ] && [ "${pid}" = "`ps -e |grep ${pid} |grep mgr |awk '{ print $1 }'`" ]
then
#manager process is running on the PID . exit success
echo "manager process is running on the PID . exit success">> /tmp/check.out
exit 0
else
#manager process is not running on the PID
echo "manager process is not running on the PID" >> /tmp/check.out
exit 1
fi
else
#manager is not running because there is no PID file
echo "manager is not running because there is no PID file" >> /tmp/check.out
exit 1
fi
}
# call_ggsci is a generic routine that executes a ggsci command
call_ggsci () {
log "entering call_ggsci"
ggsci_command=$1
#log "about to execute $ggsci_command"
log "id= $USER"
cd ${GGS_HOME}
ggsci_output=`${GGS_HOME}/ggsci << EOF
${ggsci_command}
exit
EOF`
log "got output of : $ggsci_output"
}
case $1 in
'start')
#Updated by Sourav B (02/10/2011)
# During failover if the "mgr.pcm" file is not deleted at the node crash
# then Oracle clusterware won't start the manager on the new node assuming the
# manager process is still running on the failed node. To get around this issue
# we will delete the "mgr.pcm" file before starting up the manager on the new
# node. We will also delete the other process files with pc* extension and to
# avoid any file locking issue we will first backup the checkpoint files and then
# delete them from the dirchk directory.After that we will restore the checkpoint
# files from backup to the original location (dirchk directory).
log "removing *.pc* files from dirpcs directory..."
rm -f $GGS_HOME/dirpcs/*.pc*
log "creating tmp directory to backup checkpoint file...."
mkdir -p $GGS_HOME/dirchk/tmp
log "backing up checkpoint files..."
cp $GGS_HOME/dirchk/*.cp* $GGS_HOME/dirchk/tmp
log "Deleting checkpoint files under dirchk......"
rm -f $GGS_HOME/dirchk/*.cp*
log "Restore checkpoint files from backup to dirchk directory...."
cp $GGS_HOME/dirchk/tmp/*.cp* $GGS_HOME/dirchk
log "Deleting tmp directory...."
rm -r $GGS_HOME/dirchk/tmp
log "starting manager"
call_ggsci 'start manager'
#there is a small delay between issuing the start manager command
#and the process being spawned on the OS . wait before checking
log "sleeping for start_delay_secs"
sleep ${start_delay_secs}
#check whether manager is running and exit accordingly
check_process
;;
'stop')
#attempt a clean stop for all non-manager processes
call_ggsci 'stop er *'
#ensure everything is stopped
call_ggsci 'stop er *!'
#stop manager without (y/n) confirmation
call_ggsci 'stop manager!'
#exit success
exit 0
;;
'check')
check_process
exit 0
;;
'clean')
#attempt a clean stop for all non-manager processes
call_ggsci 'stop er *'
#ensure everything is stopped
call_ggsci 'stop er *!'
#in case there are lingering processes
call_ggsci 'kill er *'
#stop manager without (y/n) confirmation
call_ggsci 'stop manager!'
#exit success
exit 0
;;
'abort')
#ensure everything is stopped
call_ggsci 'stop er *!'
#in case there are lingering processes
call_ggsci 'kill er *'
#stop manager without (y/n) confirmation
call_ggsci 'stop manager!'
#exit success
exit 0
;;
esac
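The check_process function above finds the manager by piping ps -e through grep, which can also match an unrelated process whose PID or command line merely contains the same digits. A stricter test is sketched below under the assumption that ps supports the POSIX -p and -o options. pid_runs_command is a hypothetical helper name, and how the PID is read from MGR.pcm is left as in the script above.

```shell
#!/bin/sh
# Returns 0 only if process $1 exists and its command name contains $2.
pid_runs_command() {
    pid=$1; name=$2
    # reject empty or non-numeric PIDs before calling ps
    case $pid in ''|*[!0-9]*) return 1 ;; esac
    # ps -p exits non-zero when no process has this PID
    comm=$(ps -p "$pid" -o comm= 2>/dev/null) || return 1
    case $comm in
        *"$name"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example (not executed here), inside check_process:
#   pid_runs_command "$pid" mgr && exit 0 || exit 1
```

Because the lookup is by exact PID, a stale MGR.pcm whose PID has been reused by another program is reported as not running.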
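2.4 Add the OGG resource to Grid with crsctl
# login as oracle:
$ /u01/app/11.2/grid/bin/crsctl add resource oggapp -type cluster_resource -attr "ACTION_SCRIPT='/acfs_mount_point/ogg.sh',CHECK_INTERVAL=30,START_DEPENDENCIES='hard(czhvip) pullup(czhvip)',STOP_DEPENDENCIES='hard(czhvip)'"
-- The script can instead live in a local directory that the oracle user can read and execute. In that case every Grid node needs its own copy of the file.
-- If OGG is installed on ACFS, START_DEPENDENCIES can also include a hard dependency on ASM.
That completes the setup: the third-party application is now managed by Grid, which is convenient and practical.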
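The deployment outline ends with a high-availability test. One way to exercise it is to start the resource, relocate it to another node, and check its state after each step. The sketch below assumes the resource name oggapp from the command above and assumes that the invoking user is allowed to start and relocate it. The run and failover_test helpers are hypothetical, and with DRY_RUN=1 the commands are only printed.

```shell
#!/bin/sh
# Sketch of a manual failover test for a Grid-managed resource.

# Print a command, then execute it unless DRY_RUN=1 is set.
run() {
    echo "+ $*"
    [ "${DRY_RUN:-0}" = 1 ] || "$@"
}

# Start the resource, show its state, force-relocate it, show its state again.
# -f also relocates resources it depends on (here the application VIP).
failover_test() {
    res=$1
    run crsctl start resource "$res" &&
    run crsctl status resource "$res" &&
    run crsctl relocate resource "$res" -f &&
    run crsctl status resource "$res"
}

# Example (not executed here): DRY_RUN=1; failover_test oggapp
```

After the relocate, the status output should show the resource ONLINE on a different node, and /tmp/crs_gg_start.log on that node should contain the start messages written by the action script.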
3. Troubleshooting
3.1 Resource fails to start
1. Symptom
$ crsctl start res czhapp
CRS-2672: Attempting to start 'czhapp' on 'db-oracle-node1'
CRS-2674: Start of 'czhapp' on 'db-oracle-node1' failed
CRS-2679: Attempting to clean 'czhapp' on 'db-oracle-node1'
CRS-2678: 'czhapp' on 'db-oracle-node1' has experienced an unrecoverable failure
CRS-0267: Human intervention required to resume its availability.
CRS-4000: Command Start failed, or completed with errors.
# If the resource is owned by oracle, the log directory is:
$GRID_HOME/log/{node_name}/agent/crsd/scriptagent_oracle
-- Key lines:
2021-04-26 11:35:07.342: [czhapp][156428032]{1:39006:13462} [clean] Executing action script: /software/crs.sh[clean]
2021-04-26 11:35:07.397: [ AGFW][156428032]{1:39006:13462} Command: clean for resource: czhapp 1 1 completed with invalid status: 209
2021-04-26 11:35:07.397: [czhapp][156428032]{1:39006:13462} [check] Executing action script: /software/crs.sh[check]
2021-04-26 11:35:07.397: [ AGFW][158529280]{1:39006:13462} Agent sending reply for: RESOURCE_CLEAN[czhapp 1 1] ID 4100:717590
2021-04-26 11:35:07.454: [ AGFW][156428032]{1:39006:13462} Received unknown resource status code: 209
2021-04-26 11:35:07.455: [ AGFW][158529280]{1:39006:13462} czhapp 1 1 state changed from: CLEANING to: UNKNOWN
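2. Analysis
The log shows that the agent found the script, but debug output placed at specific points inside the script showed that it never actually ran.
The root cause turned out to be that the script did not declare its interpreter on the first line.
3. Solution
#!/bin/sh
-- Scripts should follow convention. Leaving out the interpreter line had never caused trouble before, so it was a surprise that a script run by the Oracle Grid agent fails to start without it. It is another reminder to write scripts properly.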
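This class of problem can be caught before the resource is registered. The following sketch checks that an action script is a regular executable file and that its first line is an interpreter line. check_action_script is a hypothetical helper name, and the check says nothing about whether the script's logic is correct.

```shell
#!/bin/sh
# Returns 0 if the file is executable and its first line starts with "#!".
check_action_script() {
    f=$1
    [ -f "$f" ] || { echo "$f: not a regular file" >&2; return 1; }
    [ -x "$f" ] || { echo "$f: not executable" >&2; return 1; }
    case $(head -n 1 "$f") in
        '#!'*) return 0 ;;
        *) echo "$f: first line is not a #! interpreter line" >&2; return 1 ;;
    esac
}

# Example (not executed here): check_action_script /software/crs.sh
```

Running the check as the resource owner (oracle in this article) also confirms that this user can execute the file.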
3.2 OGG cannot start the extract
1. Symptom
OCI-related errors; the extract cannot connect to the database
2. Analysis
AIX:
ps -ef|grep goldengate
ps eauwww <pid>
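The process environment shows that ORACLE_SID is not set.
The extract is not configured to connect through a TNS alias, so it depends on the ORACLE_SID in the OS environment of the user that starts it. The VIP resource created by appvipcfg did not grant the oracle user sufficient permissions, so oracle could not start the VIP resource. The VIP was therefore started as root, ORACLE_SID was not present in the resulting environment, and the extract failed to start.
3. Solution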
-- login as root
# crsctl setperm resource oggvip -u user:oracle:rwx
-- login as oracle and test
$ crsctl start resource oggvip
-- If oracle still cannot start the resource, adjust the oggvip permissions further
-- login as root
-- Setting the permissions for other to rwx resolves it
# crsctl getperm resource oggvip
# crsctl setperm resource oggvip -u other::rwx
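The analysis above inspects the process environment with ps on AIX. On Linux the same check can be done through /proc, as in the sketch below. This is a Linux-only assumption that is not part of the original procedure, proc_has_env is a hypothetical helper name, and reading another user's /proc/<pid>/environ requires being that user or root.

```shell
#!/bin/sh
# Linux only: returns 0 if variable $2 is set in the environment of process $1,
# 1 if it is not set, and 2 if the environment cannot be read.
proc_has_env() {
    pid=$1; var=$2
    [ -r "/proc/$pid/environ" ] || return 2
    # entries in environ are NUL-separated; turn them into lines for grep
    tr '\0' '\n' < "/proc/$pid/environ" | grep -q "^${var}="
}

# Example (not executed here): proc_has_env <pid of mgr> ORACLE_SID
```

The file reflects the environment the process was started with, which is exactly what matters for a manager or extract launched by the Grid agent.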
3.3 appvipcfg fails to run
1. Symptom
# ./appvipcfg create -network=1 \
-ip=192.168.204.245 \
-vipname=czhvip \
-user=root
/bin/ls: cannot access /ade/ade_88979932/perl/lib: No such file or directory
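2. Cause
Applying a patch with opatch changed the contents of appvipcfg. appvipcfg is a shell script under $GRID_HOME/bin/, not a binary. It defines ORACLE_HOME and ORA_CRS_HOME, and the patch left both variables pointing at the wrong path. Setting them back to the correct path fixes the problem.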
$ cat /u01/app/11.2/grid/bin/appvipcfg
#!/bin/sh
#
# This script is used for managing
# user mode vip resource.
#
# Do not change the line below for ORACLE_HOME setting
#ORACLE_HOME=/u01/app/11.2/grid
ORACLE_HOME=/ade/ade_19289128/11.2/grid
export ORACLE_HOME
#ORA_CRS_HOME=/u01/app/11.2/grid
ORA_CRS_HOME=/ade/ade_19289128/11.2/grid
export ORA_CRS_HOME
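After patching, the two assignments can be checked without opening the file in an editor. The sketch below reads the active (uncommented) value of each variable and compares it with the expected Grid home. active_assignment and check_appvipcfg_homes are hypothetical helper names, and the expected home is whatever path the local installation really uses.

```shell
#!/bin/sh
# Prints the last uncommented value assigned to variable $2 in script $1.
active_assignment() {
    sed -n "s/^[[:space:]]*$2=//p" "$1" | tail -n 1
}

# Returns 0 if ORACLE_HOME and ORA_CRS_HOME in script $1 both equal $2.
check_appvipcfg_homes() {
    script=$1; grid_home=$2
    for v in ORACLE_HOME ORA_CRS_HOME; do
        val=$(active_assignment "$script" "$v")
        [ "$val" = "$grid_home" ] || { echo "$v=$val (expected $grid_home)" >&2; return 1; }
    done
}

# Example (not executed here):
#   check_appvipcfg_homes /u01/app/11.2/grid/bin/appvipcfg /u01/app/11.2/grid
```

For the file shown above, the check fails and reports the /ade/... value, which matches the /bin/ls error in the symptom.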