KingbaseES R6 Cluster: Test Cases for the repmgr.conf 'recovery' Parameter (Part 2)
阿新 • Published: 2022-03-04
Case 2: Testing 'recovery = automatic'
1. Check the cluster node status:
[kingbase@node1 bin]$ ./repmgr cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+----------+----------+----------+----------+---------------------------
 1  | node243 | primary | * running |          | default  | 100      | 3        | host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node248 | standby |   running | node243  | default  | 100      | 3        | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
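When this check is repeated before and after a failover test, it can help to read the table programmatically. The following is a minimal Python sketch, not part of the original test procedure; the function name `parse_cluster_show` is hypothetical, the sample connection strings are shortened for readability, and the parser assumes the pipe-separated layout shown above.

```python
def parse_cluster_show(text):
    """Parse the table printed by `repmgr cluster show` into a list of dicts.

    Assumes the pipe-separated layout shown in this article.
    """
    rows = []
    header = None
    for line in text.splitlines():
        if "|" not in line:
            continue  # skips the ----+---- separator line and blank lines
        cells = [c.strip() for c in line.split("|")]
        if header is None:
            header = cells  # the first pipe-separated line is the header
            continue
        row = dict(zip(header, cells))
        # The primary is marked with a leading "*" in the Status column.
        row["Status"] = row["Status"].lstrip("* ").strip()
        rows.append(row)
    return rows


# Sample input: the table above, with the connection strings shortened.
sample = """\
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+----------+----------+----------+----------+------------------
 1  | node243 | primary | * running |          | default  | 100      | 3        | host=192.168.7.243 user=esrep dbname=esrep port=54321
 2  | node248 | standby |   running | node243  | default  | 100      | 3        | host=192.168.7.248 user=esrep dbname=esrep port=54321
"""

nodes = parse_cluster_show(sample)
primary = [n["Name"] for n in nodes if n["Role"] == "primary"]
print(primary)
```

Running the same parser on the output captured after the test makes it easy to compare which node holds the primary role in each state.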
2. Configure the recovery parameter
[kingbase@node3 bin]$ cat ../etc/repmgr.conf |egrep -i 'recovery|failover'
failover='automatic'
recovery='automatic'
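The same check can be scripted so that it runs identically on every node. This is a minimal sketch under stated assumptions: the helper names `read_repmgr_conf` and `matches_test_setup` are hypothetical, the parser handles only simple `key='value'` lines, and a `#` inside a quoted value would be cut off as if it were a comment.

```python
def read_repmgr_conf(text):
    """Return {key: value} for simple key=value lines; comments and blanks are skipped."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # simplification: treats any "#" as a comment start
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip().strip("'\"")
    return settings


def matches_test_setup(settings):
    """True when both parameters have the values used in this test case."""
    return (settings.get("failover") == "automatic"
            and settings.get("recovery") == "automatic")


# Sample input: the two lines printed by the egrep command above.
sample = """
failover='automatic'
recovery='automatic'
"""

conf = read_repmgr_conf(sample)
print(conf, matches_test_setup(conf))
```

In practice the text would come from reading `../etc/repmgr.conf` on each node rather than from an inline string.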
3. Reboot the primary node to test
[root@node3 ~]# reboot
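4. Check the hamgr log on the standby
=As shown below, the log indicates that after the primary node went down, the cluster performed a primary/standby failover, and once the former primary's operating system was back up, the former primary was automatically rejoined to the cluster as the new standby.=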
[2022-03-01 14:38:09] [NOTICE] starting monitoring of node "node248" (ID: 2)
[2022-03-01 14:38:09] [INFO] "connection_check_type" set to "ping"
[2022-03-01 14:38:10] [INFO] monitoring connection to upstream node "node243" (ID: 1)
[2022-03-01 14:38:10] [NOTICE] try to change wal catched_up state to 1
[2022-03-01 14:38:10] [INFO] primary flush lsn is 0/17000578, local flush lsn is 0/170004C0
[2022-03-01 14:38:10] [NOTICE] try to change streaming_sync state to TRUE
[2022-03-01 14:43:11] [INFO] node "node248" (ID: 2) monitoring upstream node "node243" (ID: 1) in normal state
[2022-03-01 14:46:42] [WARNING] unable to ping "host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"
[2022-03-01 14:46:42] [DETAIL] PQping() returned "PQPING_REJECT"
[2022-03-01 14:46:42] [WARNING] unable to connect to upstream node "node243" (ID: 1)
[2022-03-01 14:46:42] [INFO] sleeping 6 seconds until next reconnection attempt
[2022-03-01 14:46:48] [INFO] checking state of node 1, 1 of 10 attempts
[2022-03-01 14:46:58] [WARNING] unable to ping "user=esrep connect_timeout=10 dbname=esrep host=192.168.7.243 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
[2022-03-01 14:46:58] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2022-03-01 14:46:58] [INFO] sleeping 6 seconds until next reconnection attempt
......
[2022-03-01 14:48:59] [INFO] checking state of node 1, 10 of 10 attempts
[2022-03-01 14:48:59] [WARNING] unable to ping "user=esrep connect_timeout=10 dbname=esrep host=192.168.7.243 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
[2022-03-01 14:48:59] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2022-03-01 14:48:59] [WARNING] unable to reconnect to node 1 after 10 attempts
[2022-03-01 14:48:59] [NOTICE] setting "wal_retrieve_retry_interval" to 86405000 milliseconds
[2022-03-01 14:49:00] [WARNING] wal receiver not running
[2022-03-01 14:49:00] [NOTICE] WAL receiver disconnected on all sibling nodes
[2022-03-01 14:49:00] [INFO] WAL receiver disconnected on all 0 sibling nodes
[2022-03-01 14:49:00] [INFO] 0 active sibling nodes registered
[2022-03-01 14:49:00] [INFO] primary and this node have the same location ("default")
[2022-03-01 14:49:00] [INFO] no other sibling nodes - we win by default
[2022-03-01 14:49:00] [NOTICE] setting "wal_retrieve_retry_interval" to 5000 ms
[2022-03-01 14:49:00] [NOTICE] this node is the only available candidate and will now promote itself
[2022-03-01 14:49:00] [INFO] try to ping the trusted_servers "192.168.7.1" before execute promote_command
[2022-03-01 14:49:02] [NOTICE] PING 192.168.7.1 (192.168.7.1) 56(84) bytes of data.
--- 192.168.7.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 2.345/22.599/42.853/20.254 ms
[2022-03-01 14:49:02] [NOTICE] successfully ping one or more of the trusted_servers "192.168.7.1"
[2022-03-01 14:49:04] [NOTICE] PING 192.168.7.241 (192.168.7.241) 56(84) bytes of data.
--- 192.168.7.241 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 1999ms
[2022-03-01 14:49:04] [WARNING] ping host"192.168.7.241" failed
[2022-03-01 14:49:04] [DETAIL] average RTT value is not greater than zero
[2022-03-01 14:49:04] [INFO] loadvip result: 1, arping result: 1
[2022-03-01 14:49:04] [NOTICE] new primary node (ID: 2) acquire the virtual ip 192.168.7.241/24 success
[2022-03-01 14:49:04] [INFO] promote_command is: "/home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/repmgr standby promote -f /home/kingbase/cluster/R6C5/R6C5R/kingbase/etc/repmgr.conf"
NOTICE: promoting standby to primary
DETAIL: promoting server "node248" (ID: 2) using sys_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
INFO: SET synchronous TO "async" on primary host
[2022-03-01 14:49:07] [NOTICE] try to stop old primary db (host: "192.168.7.243")
NOTICE: STANDBY PROMOTE successful
DETAIL: server "node248" (ID: 2) was successfully promoted to primary
[2022-03-01 14:49:11] [INFO] switching to primary monitoring mode
[2022-03-01 14:49:11] [NOTICE] monitoring cluster primary "node248" (ID: 2)
[2022-03-01 14:49:11] [INFO] create a thread 0x7f1b4b125700 to check the cluster status
[2022-03-01 14:49:11] [INFO] child node: 1; attached: no
[2022-03-01 14:49:11] [INFO] check node status again, try 1 / 10 times
[2022-03-01 14:49:12] [INFO] node (ID: 1): no server running
.......
[2022-03-01 14:49:29] [INFO] check node status again, try 10 / 10 times
[2022-03-01 14:49:31] [INFO] child node: 1; attached: no
[2022-03-01 14:49:31] [INFO] found node down, recovery will be triggered after recovery delay time 20s
[2022-03-01 14:49:33] [INFO] child node: 1; attached: no
......
[2022-03-01 14:49:52] [INFO] child node: 1; attached: no
[2022-03-01 14:49:52] [INFO] recovery delay time reached. can do recovery now.
[2022-03-01 14:49:52] [INFO] [thread pid:11778] do_nodes_recovery thread begin. The pthread_t tid is 0x7f1b4b125700
[2022-03-01 14:49:52] [NOTICE] [thread pid:11778] node (ID: 1; host: "192.168.7.243") is not attached, ready to auto-recovery
[2022-03-01 14:49:52] [NOTICE] [thread pid:11778] Now, the primary host ip: 192.168.7.248
[2022-03-01 14:49:52] [INFO] [thread pid:11778] ES connection to host "192.168.7.243" succeeded, ready to do auto-recovery
[2022-03-01 14:49:53] [INFO] unlink file /tmp/.s.KINGBASE.54321.lock
[2022-03-01 14:49:53] [NOTICE] executing repmgr command "/home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/repmgr --dbname="host=192.168.7.248 dbname=esrep user=esrep port=54321" node rejoin --force-rewind"
NOTICE: sys_rewind execution required for this node to attach to rejoin target node 2
DETAIL: rejoin target server's timeline 8 forked off current database system timeline 7 before current recovery point 0/18000028
NOTICE: executing sys_rewind
DETAIL: sys_rewind command is "/home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/sys_rewind -D '/home/kingbase/cluster/R6C5/R6C5R/kingbase/data' --source-server='host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3'"
sys_rewind: servers diverged at WAL location 0/17000680 on timeline 7
sys_rewind: rewinding from last common checkpoint at 0/160007C8 on timeline 7
sys_rewind: find last common checkpoint start time from 2022-03-01 14:49:53.170681 CST to 2022-03-01 14:49:53.296332 CST, in "0.125651" seconds.
sys_rewind: update the control file: minRecoveryPoint is '0/1700DE58', minRecoveryPointTLI is '8', and database state is 'in archive recovery'
sys_rewind: we will remove the dir '/home/kingbase/cluster/R6C5/R6C5R/kingbase/data/sys_replslot/repmgr_slot_2.rewind' and all the file/dir in it.
sys_rewind: rewind start wal location 0/16000798 (file 000000070000000000000016), end wal location 0/1700DE58 (file 000000080000000000000017). time from 2022-03-01 14:49:53.170681 CST to 2022-03-01 14:50:06.920859 CST, in "13.750178" seconds.
sys_rewind: Done!
NOTICE: 0 files copied to /home/kingbase/cluster/R6C5/R6C5R/kingbase/data
NOTICE: setting node 1's upstream to node 2
WARNING: unable to ping "host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: begin to start server at 2022-03-01 14:50:07.530887
NOTICE: starting server using "/home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/sys_ctl -w -t 90 -D '/home/kingbase/cluster/R6C5/R6C5R/kingbase/data' -l /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/logfile start"
NOTICE: start server finish at 2022-03-01 14:50:08.952996
NOTICE: NODE REJOIN successful
DETAIL: node 1 is now attached to node 2
[2022-03-01 14:50:09] [NOTICE] kbha: node (ID: 1) rejoin success.
[2022-03-01 14:50:10] [NOTICE] [thread pid:11778] node "node243" (ID: 1) auto-recovery success
[2022-03-01 14:50:10] [INFO] [thread pid:11778] do_nodes_recovery thread ends. The pthread_t tid is 0x7f1b4b125700
[2022-03-01 14:50:10] [INFO] SET synchronous TO "sync" on primary host
[2022-03-01 14:50:10] [INFO] thread tid:0x7f1b4b125700 is not running
[2022-03-01 14:50:10] [INFO] the recovery thread was exited, reset tid
[2022-03-01 14:50:10] [NOTICE] Some nodes reconnect, all standby nodes are OK now
[2022-03-01 14:50:12] [NOTICE] new standby "node243" (ID: 1) has connected
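To see how long each phase of the failover and auto-recovery took, the timestamps in the log above can be extracted and subtracted. The sketch below is illustrative only: the milestone names and the choice of marker messages are assumptions made for this example, the helper names are hypothetical, and the sample contains just four lines copied from the log.

```python
import re
from datetime import datetime

# Matches lines such as: [2022-03-01 14:49:11] [INFO] switching to primary monitoring mode
STAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] \[(\w+)\] (.*)$")

# Milestone name -> substring of the log message that marks it (chosen for this example).
MILESTONES = {
    "upstream_lost": "unable to connect to upstream node",
    "promotion_decided": "will now promote itself",
    "primary_monitoring": "switching to primary monitoring mode",
    "old_primary_rejoined": "auto-recovery success",
}


def first_match(lines, needle):
    """Timestamp of the first timestamped line whose message contains `needle`."""
    for line in lines:
        m = STAMP.match(line.strip())
        if m and needle in m.group(3):
            return datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
    return None


def timeline(log_text):
    """Map each milestone name to the time it first appears in the log."""
    lines = log_text.splitlines()
    return {name: first_match(lines, needle) for name, needle in MILESTONES.items()}


def seconds_between(events, start, end):
    """Whole seconds elapsed between two milestones."""
    return int((events[end] - events[start]).total_seconds())


# Four lines copied from the hamgr log above.
sample = """\
[2022-03-01 14:46:42] [WARNING] unable to connect to upstream node "node243" (ID: 1)
[2022-03-01 14:49:00] [NOTICE] this node is the only available candidate and will now promote itself
[2022-03-01 14:49:11] [INFO] switching to primary monitoring mode
[2022-03-01 14:50:10] [NOTICE] [thread pid:11778] node "node243" (ID: 1) auto-recovery success
"""

events = timeline(sample)
print(seconds_between(events, "upstream_lost", "promotion_decided"))     # 138
print(seconds_between(events, "upstream_lost", "old_primary_rejoined"))  # 208
```

In this run, about 138 seconds passed between the first failed connection and the decision to promote, and about 208 seconds between the first failed connection and the former primary being reported as recovered.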
5. Check the database processes and cluster status on the new standby
[kingbase@node3 bin]$ ps -ef |grep kingbase
kingbase 2654 1 0 14:49 ? 00:00:00 /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/kbha -A daemon -f /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/../etc/repmgr.conf
kingbase 3462 1 0 14:50 ? 00:00:00 /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/kingbase -D /home/kingbase/cluster/R6C5/R6C5R/kingbase/data
kingbase 3463 3462 0 14:50 ? 00:00:00 kingbase: logger
kingbase 3464 3462 0 14:50 ? 00:00:00 kingbase: startup recovering 000000080000000000000017
kingbase 3465 3462 0 14:50 ? 00:00:00 kingbase: checkpointer
kingbase 3466 3462 0 14:50 ? 00:00:00 kingbase: background writer
kingbase 3467 3462 0 14:50 ? 00:00:00 kingbase: stats collector
kingbase 3468 3462 0 14:50 ? 00:00:00 kingbase: walreceiver streaming 0/1700F160
kingbase 3471 3462 0 14:50 ? 00:00:00 kingbase: esrep esrep 192.168.7.243(57348) idle
kingbase 3522 1 0 14:50 ? 00:00:00 /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/repmgrd -d -v -f /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/../etc/repmgr.conf
kingbase 3523 3462 0 14:50 ? 00:00:00 kingbase: esrep esrep 192.168.7.243(57351) idle
[kingbase@node3 bin]$ ./repmgr cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+----------+----------+----------+----------+--------------------------
 1  | node243 | standby |   running | node248  | default  | 100      | 7        | host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node248 | primary | * running |          | default  | 100      | 8        | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
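=The information above shows that once the former primary node's system returned to normal, the cluster automatically rejoined it as the new standby.=
=To be continued=
KINGBASE Research Institute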
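The process check in step 5 can also be automated by looking for the processes that appear in the `ps` listing of the rejoined node: the kbha and repmgrd daemons, the startup process in recovery, and a streaming walreceiver. This is a hedged sketch; the function name `standby_process_summary` and the choice of marker strings are assumptions based only on the listing shown in this article.

```python
def standby_process_summary(ps_output):
    """Report which of the processes seen in this article's ps listing are present."""
    # Marker substrings are taken from the listing in step 5 (an assumption, not an official list).
    markers = {
        "kbha": "/kbha ",
        "repmgrd": "/repmgrd ",
        "startup_recovering": "kingbase: startup recovering",
        "walreceiver_streaming": "kingbase: walreceiver streaming",
    }
    lines = ps_output.splitlines()
    return {name: any(marker in line for line in lines)
            for name, marker in markers.items()}


# Four lines copied from the ps output in step 5.
sample = """\
kingbase 2654 1 0 14:49 ? 00:00:00 /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/kbha -A daemon -f /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/../etc/repmgr.conf
kingbase 3464 3462 0 14:50 ? 00:00:00 kingbase: startup recovering 000000080000000000000017
kingbase 3468 3462 0 14:50 ? 00:00:00 kingbase: walreceiver streaming 0/1700F160
kingbase 3522 1 0 14:50 ? 00:00:00 /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/repmgrd -d -v -f /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/../etc/repmgr.conf
"""

summary = standby_process_summary(sample)
print(summary)
print(all(summary.values()))
```

A result in which every entry is true is consistent with the node running as a streaming standby; the `repmgr cluster show` output remains the authoritative check of the node's role.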