KingbaseES R6 Cluster: Test Cases for the repmgr.conf 'recovery' Parameter (Part 3)
阿新 • Published: 2022-03-04
Case 3: Testing 'recovery = manual'
1. Check the cluster node status:
[kingbase@node1 bin]$ ./repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+----------+----------+----------+----------+---------------------------
1 | node243 | primary | * running | | default | 100 | 3 | host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
2 | node248 | standby | running | node243 | default | 100 | 3 | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
2. Check the recovery configuration
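One way to check the setting is to read it straight from repmgr.conf. The following is a minimal sketch, assuming the file uses plain `key=value` lines with optional quotes and `#` comments; the helper name `read_recovery_mode` and the sample excerpt are illustrative, not part of KingbaseES.

```python
import re

# The three values discussed in this series of test cases.
KNOWN_RECOVERY_MODES = {"standby", "automatic", "manual"}


def read_recovery_mode(conf_text):
    """Return the value of the `recovery` key in repmgr.conf text, or None."""
    value = None
    for raw_line in conf_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        match = re.match(r"^recovery\s*=\s*(.*)$", line)
        if match:
            # Strip optional quotes. Assumption: a later assignment wins.
            value = match.group(1).strip().strip("'\"")
    return value


# Hypothetical excerpt; only the recovery line matters here.
sample_conf = """
node_id=1
node_name='node243'
recovery='manual'   # test case 3
"""

mode = read_recovery_mode(sample_conf)
print(mode, mode in KNOWN_RECOVERY_MODES)
```

On a real node the text would come from the repmgr.conf path shown in the process list later in this article.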
3. Reboot the primary node's host
[root@node3 ~]# reboot
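4. Check the standby's hamgr log
The log below shows that after the primary's host went down, the cluster performed a failover and the standby was promoted to primary.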
[2022-03-02 10:32:38] [NOTICE] starting monitoring of node "node248" (ID: 2)
[2022-03-02 10:32:38] [INFO] "connection_check_type" set to "ping"
[2022-03-02 10:32:38] [INFO] monitoring connection to upstream node "node243" (ID: 1)
[2022-03-02 10:32:38] [NOTICE] try to change wal catched_up state to 1
[2022-03-02 10:32:38] [INFO] primary flush lsn is 0/1F000D40, local flush lsn is 0/1F000D40
[2022-03-02 10:32:38] [NOTICE] try to change streaming_sync state to TRUE
[2022-03-02 10:34:24] [WARNING] unable to ping "host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"
[2022-03-02 10:34:24] [DETAIL] PQping() returned "PQPING_REJECT"
[2022-03-02 10:34:24] [WARNING] unable to connect to upstream node "node243" (ID: 1)
[2022-03-02 10:34:24] [INFO] sleeping 6 seconds until next reconnection attempt
[2022-03-02 10:34:30] [INFO] checking state of node 1, 1 of 10 attempts
[2022-03-02 10:34:40] [WARNING] unable to ping "user=esrep connect_timeout=10 dbname=esrep host=192.168.7.243 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
[2022-03-02 10:34:40] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2022-03-02 10:34:40] [INFO] sleeping 6 seconds until next reconnection attempt
......
[2022-03-02 10:35:47] [INFO] checking state of node 1, 10 of 10 attempts
[2022-03-02 10:35:47] [WARNING] unable to ping "user=esrep connect_timeout=10 dbname=esrep host=192.168.7.243 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
[2022-03-02 10:35:47] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2022-03-02 10:35:47] [WARNING] unable to reconnect to node 1 after 10 attempts
[2022-03-02 10:35:47] [NOTICE] setting "wal_retrieve_retry_interval" to 86405000 milliseconds
[2022-03-02 10:35:47] [WARNING] wal receiver not running
[2022-03-02 10:35:47] [NOTICE] WAL receiver disconnected on all sibling nodes
[2022-03-02 10:35:47] [INFO] WAL receiver disconnected on all 0 sibling nodes
[2022-03-02 10:35:47] [INFO] 0 active sibling nodes registered
[2022-03-02 10:35:47] [INFO] primary and this node have the same location ("default")
[2022-03-02 10:35:47] [INFO] no other sibling nodes - we win by default
[2022-03-02 10:35:47] [NOTICE] setting "wal_retrieve_retry_interval" to 5000 ms
[2022-03-02 10:35:48] [NOTICE] this node is the only available candidate and will now promote itself
[2022-03-02 10:35:48] [INFO] try to ping the trusted_servers "192.168.7.1" before execute promote_command
[2022-03-02 10:35:50] [NOTICE] PING 192.168.7.1 (192.168.7.1) 56(84) bytes of data.
--- 192.168.7.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1008ms
rtt min/avg/max/mdev = 2.473/2.535/2.598/0.080 ms
[2022-03-02 10:35:50] [NOTICE] successfully ping one or more of the trusted_servers "192.168.7.1"
[2022-03-02 10:35:51] [NOTICE] PING 192.168.7.241 (192.168.7.241) 56(84) bytes of data.
--- 192.168.7.241 ping statistics ---
2 packets transmitted, 0 received, +1 errors, 100% packet loss, time 1000ms
[2022-03-02 10:35:51] [WARNING] ping host"192.168.7.241" failed
[2022-03-02 10:35:51] [DETAIL] average RTT value is not greater than zero
[2022-03-02 10:35:51] [INFO] loadvip result: 1, arping result: 1
[2022-03-02 10:35:51] [NOTICE] new primary node (ID: 2) acquire the virtual ip 192.168.7.241/24 success
[2022-03-02 10:35:51] [INFO] promote_command is: "/home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/repmgr standby promote -f /home/kingbase/cluster/R6C5/R6C5R/kingbase/etc/repmgr.conf"
NOTICE: promoting standby to primary
DETAIL: promoting server "node248" (ID: 2) using sys_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
[2022-03-02 10:35:51] [NOTICE] try to stop old primary db (host: "192.168.7.243")
INFO: SET synchronous TO "async" on primary host
NOTICE: STANDBY PROMOTE successful
DETAIL: server "node248" (ID: 2) was successfully promoted to primary
[2022-03-02 10:35:56] [INFO] 0 followers to notify
[2022-03-02 10:35:56] [INFO] switching to primary monitoring mode
[2022-03-02 10:35:56] [NOTICE] monitoring cluster primary "node248" (ID: 2)
[2022-03-02 10:35:56] [INFO] create a thread 0x7fdeaa4b9700 to check the cluster status
[2022-03-02 10:35:57] [INFO] node (ID: 1): no server running
[2022-03-02 10:35:57] [INFO] [thread 0x7fdeaa4b9700] the cluster has no other running primary node, exit
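To see how long the failover took, the timestamps in the hamgr log can be compared. The sketch below assumes each timestamped entry has the form `[YYYY-MM-DD HH:MM:SS] [LEVEL] message`, as in the log above; the helper names are illustrative, and the sample is a five-line excerpt copied from that log.

```python
import re
from datetime import datetime

# Matches entries such as: [2022-03-02 10:34:24] [WARNING] some message
LOG_LINE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] \[(\w+)\] (.*)$")


def first_timestamp(log_text, needle):
    """Timestamp of the first timestamped entry whose message contains needle."""
    for line in log_text.splitlines():
        m = LOG_LINE.match(line.strip())
        if m and needle in m.group(3):
            return datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
    return None


def seconds_between(log_text, start_needle, end_needle):
    """Seconds elapsed between two log events, or None if either is missing."""
    start = first_timestamp(log_text, start_needle)
    end = first_timestamp(log_text, end_needle)
    if start is None or end is None:
        return None
    return (end - start).total_seconds()


# Excerpt copied from the hamgr log in step 4.
sample_log = """\
[2022-03-02 10:34:24] [WARNING] unable to connect to upstream node "node243" (ID: 1)
[2022-03-02 10:34:30] [INFO] checking state of node 1, 1 of 10 attempts
[2022-03-02 10:35:47] [WARNING] unable to reconnect to node 1 after 10 attempts
[2022-03-02 10:35:48] [NOTICE] this node is the only available candidate and will now promote itself
[2022-03-02 10:35:56] [INFO] switching to primary monitoring mode
"""

detect = seconds_between(sample_log, "unable to connect to upstream node", "will now promote itself")
total = seconds_between(sample_log, "unable to connect to upstream node", "switching to primary monitoring mode")
print(detect, total)  # 84.0 92.0
```

In this run the standby spent about 84 seconds retrying the old primary before it decided to promote itself, and was monitoring as the new primary about 8 seconds later.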
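5. The old primary's host boots up normally
1) Check the cluster status from the new primary
The output below shows that the cluster is now in a 'dual-primary' state, although the old primary is marked 'failed' and cannot be connected to.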
[kingbase@node1 bin]$ ./repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+----------+----------+----------+----------+----------------
1 | node243 | primary | - failed | | default | 100 | ? | host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
2 | node248 | primary | * running | | default | 100 | 10 | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
WARNING: following issues were detected
- unable to connect to node "node243" (ID: 1)
You have new mail in /var/spool/mail/kingbase
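The 'dual-primary' condition can also be spotted programmatically. This is a rough sketch that parses captured `repmgr cluster show` output, assuming the pipe-separated layout shown above; the function names are illustrative and the connection strings in the sample are truncated for brevity.

```python
def parse_cluster_show(text):
    """Parse the pipe-separated table printed by `repmgr cluster show`."""
    rows, header = [], None
    for line in text.splitlines():
        if "|" not in line:
            continue  # skips the separator row and WARNING lines
        cells = [cell.strip() for cell in line.split("|")]
        if header is None:
            header = cells
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def primaries_with_status(rows, status_word):
    """Names of nodes registered as primary whose status contains status_word."""
    return [r["Name"] for r in rows
            if r["Role"] == "primary" and status_word in r["Status"]]


# Captured output from this step; connection strings truncated for brevity.
sample = """\
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+----------+----------+----------+----------+----------------
1 | node243 | primary | - failed | | default | 100 | ? | host=192.168.7.243 user=esrep dbname=esrep port=54321
2 | node248 | primary | * running | | default | 100 | 10 | host=192.168.7.248 user=esrep dbname=esrep port=54321
WARNING: following issues were detected
- unable to connect to node "node243" (ID: 1)
"""

rows = parse_cluster_show(sample)
print(primaries_with_status(rows, "failed"))   # ['node243']
print(primaries_with_status(rows, "running"))  # ['node248']
```

A node that shows up as a failed primary next to a running primary is the candidate for the manual rejoin described in the following steps.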
2) Create replication slots on the new primary (the old standby)
# Create the replication slots
test=# select sys_create_physical_replication_slot('repmgr_slot_1');
sys_create_physical_replication_slot
--------------------------------------
(repmgr_slot_1,)
(1 row)
test=# select sys_create_physical_replication_slot('repmgr_slot_2');
sys_create_physical_replication_slot
--------------------------------------
(repmgr_slot_2,)
(1 row)
test=# select * from sys_replication_slots;
   slot_name   | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
---------------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
 repmgr_slot_1 |        | physical  |        |          | f         | f      |            |      |              |             |
 repmgr_slot_2 |        | physical  |        |          | f         | f      |            |      |              |             |
(2 rows)
3) On the old primary (which will become the new standby), perform the following recovery steps
# Back up the data directory
[kingbase@node3 kingbase]$ cp data data.bk -r
# Create the standby signal file
[kingbase@node3 kingbase]$ cd data
[kingbase@node3 data]$ touch standby.signal
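The two preparation commands can be wrapped in a small helper so that the backup is never skipped. This is only a sketch, roughly equivalent to `cp data data.bk -r` followed by `touch standby.signal`; the function name `prepare_old_primary` is illustrative, and the demonstration runs against a throwaway temporary directory rather than a real data directory.

```python
import shutil
import tempfile
from pathlib import Path


def prepare_old_primary(data_dir):
    """Back up data_dir to <data_dir>.bk, then create standby.signal inside data_dir."""
    data_dir = Path(data_dir)
    backup_dir = data_dir.with_name(data_dir.name + ".bk")
    if backup_dir.exists():
        # Refuse to overwrite an earlier backup.
        raise FileExistsError(f"backup already exists: {backup_dir}")
    shutil.copytree(data_dir, backup_dir)   # roughly: cp -r data data.bk
    signal_file = data_dir / "standby.signal"
    signal_file.touch()                     # roughly: touch standby.signal
    return backup_dir, signal_file


# Demonstration against a throwaway directory, not a real data directory.
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "data"
    data.mkdir()
    (data / "PLACEHOLDER").write_text("stand-in for database files\n")
    backup, signal = prepare_old_primary(data)
    print(backup.name, signal.name, signal.exists())
```

Because the backup is taken before the signal file is created, the copy in data.bk reflects the directory exactly as the old primary left it.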
4) On the old primary, run repmgr node rejoin to rejoin the cluster
[kingbase@node3 bin]$ ./repmgr node rejoin -h 192.168.7.248 -U esrep -d esrep --force-rewind
NOTICE: sys_rewind execution required for this node to attach to rejoin target node 2
DETAIL: rejoin target server's timeline 10 forked off current database system timeline 9 before current recovery point 0/200000A0
NOTICE: executing sys_rewind
DETAIL: sys_rewind command is "/home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/sys_rewind -D '/home/kingbase/cluster/R6C5/R6C5R/kingbase/data' --source-server='host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3'"
sys_rewind: servers diverged at WAL location 0/1F000D70 on timeline 9
sys_rewind: rewinding from last common checkpoint at 0/1E000A70 on timeline 9
sys_rewind: find last common checkpoint start time from 2022-03-02 10:52:34.133058 CST to 2022-03-02 10:52:34.358066 CST, in "0.225008" seconds.
sys_rewind: update the control file: minRecoveryPoint is '0/1F011AD0', minRecoveryPointTLI is '10', and database state is 'in archive recovery'
sys_rewind: rewind start wal location 0/1E000A40 (file 00000009000000000000001E), end wal location 0/1F011AD0 (file 0000000A000000000000001F). time from 2022-03-02 10:52:34.133058 CST to 2022-03-02 10:53:06.442270 CST, in "32.309212" seconds.
sys_rewind: Done!
NOTICE: 0 files copied to /home/kingbase/cluster/R6C5/R6C5R/kingbase/data
NOTICE: setting node 1's upstream to node 2
WARNING: unable to ping "host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: begin to start server at 2022-03-02 10:53:06.588331
NOTICE: starting server using "/home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/sys_ctl -w -t 90 -D '/home/kingbase/cluster/R6C5/R6C5R/kingbase/data' -l /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/logfile start"
NOTICE: start server finish at 2022-03-02 10:53:07.313294
NOTICE: NODE REJOIN successful
DETAIL: node 1 is now attached to node 2
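When this step is scripted, it helps to build the command as an argument list and to check the output for the success notice. The sketch below uses only the flags that appear in the command above (`-h`, `-U`, `-d`, `--force-rewind`); the helper names are illustrative, and nothing is executed here.

```python
import shlex


def build_rejoin_command(repmgr_path, upstream_host, user, dbname, force_rewind=True):
    """Build the argv list for the `repmgr node rejoin` call used in this step."""
    argv = [repmgr_path, "node", "rejoin",
            "-h", upstream_host, "-U", user, "-d", dbname]
    if force_rewind:
        argv.append("--force-rewind")
    return argv


def rejoin_succeeded(output):
    """True if captured repmgr output contains the success notice."""
    return "NODE REJOIN successful" in output


argv = build_rejoin_command("./repmgr", "192.168.7.248", "esrep", "esrep")
print(shlex.join(argv))
```

The list could then be passed to `subprocess.run(argv, capture_output=True, text=True)` on the old primary, and the captured output handed to `rejoin_succeeded`. Since it is not certain which stream repmgr writes its notices to, checking both stdout and stderr is the safer assumption.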
5) Confirm that the database service on the new standby is running (node rejoin has already started it)
[kingbase@node3 bin]$ ps -ef |grep kingbase
kingbase 3218 1 0 10:36 ? 00:00:00 /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/kbha -A daemon -f /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/../etc/repmgr.conf
kingbase 5817 1 0 10:49 ? 00:00:01 /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/repmgrd -d -v -f /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/../etc/repmgr.conf
kingbase 6730 1 0 10:53 ? 00:00:00 /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/kingbase -D /home/kingbase/cluster/R6C5/R6C5R/kingbase/data
kingbase 6731 6730 0 10:53 ? 00:00:00 kingbase: logger
kingbase 6732 6730 0 10:53 ? 00:00:00 kingbase: startup recovering 0000000A000000000000001F
kingbase 6736 6730 0 10:53 ? 00:00:00 kingbase: checkpointer
kingbase 6737 6730 0 10:53 ? 00:00:00 kingbase: background writer
kingbase 6738 6730 0 10:53 ? 00:00:00 kingbase: stats collector
kingbase 6739 6730 0 10:53 ? 00:00:00 kingbase: walreceiver streaming 0/1F012A78
kingbase 6743 6730 0 10:53 ? 00:00:00 kingbase: esrep esrep 192.168.7.243(55941) idle
kingbase 6750 6730 0 10:53 ? 00:00:00 kingbase: esrep esrep 192.168.7.243(55947) idle
6) Check the cluster node status
[kingbase@node3 bin]$ ./repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+----------+----------+----------+----------+----------------
1 | node243 | standby | running | node248 | default | 100 | 9 | host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
2 | node248 | primary | * running | | default | 100 | 10 | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
7) Restart the cluster as a test (optional)
[kingbase@node3 bin]$ ./sys_monitor.sh restart
2022-03-02 10:55:26 Ready to stop all DB ...
....
server started
2022-03-02 10:55:52 execute to start DB on "[192.168.7.248]" success, connect to check it.
2022-03-02 10:55:53 DB on "[192.168.7.248]" start success.
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+-----------+----------+----------+----------+---------------
1 | node243 | standby | running | ! node248 | default | 100 | 10 | host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
2 | node248 | primary | * running | | default | 100 | 10 | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
WARNING: following issues were detected
- node "node243" (ID: 1) is not attached to its upstream node "node248" (ID: 2)
2022-03-02 10:55:53 The primary DB is started.
......
2022-03-02 10:56:15 repmgrd on "[192.168.7.248]" start success.
ID | Name | Role | Status | Upstream | repmgrd | PID | Paused? | Upstream last seen
----+---------+---------+-----------+----------+---------+-------+---------+--------------------
1 | node243 | standby | running | node248 | running | 9500 | no | 1 second(s) ago
2 | node248 | primary | * running | | running | 27881 | no | n/a
[2022-03-02 10:56:18] [NOTICE] redirecting logging output to "/home/kingbase/cluster/R6C5/R6C5R/kingbase/log/kbha.log"
[2022-03-02 10:56:20] [NOTICE] redirecting logging output to "/home/kingbase/cluster/R6C5/R6C5R/kingbase/log/kbha.log"
2022-03-02 10:56:22 Done.
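The output above shows that after repmgr node rejoin was run manually, the old primary rejoined the cluster as the new standby.
Summary:
1. With recovery=standby, after the primary node's host goes down the cluster performs a failover. The old primary must be manually configured as a standby and its database service started; the cluster can then add it back automatically.
2. With recovery=automatic, after the primary node's host goes down the cluster performs a failover. No manual intervention is needed; the old primary rejoins the cluster automatically as the new standby.
3. With recovery=manual, after the primary node's host goes down the cluster performs a failover. Manual intervention is required: run 'repmgr node rejoin' on the old primary to add it back to the cluster as the new standby.
4. For production environments without routine DBA monitoring, consider setting recovery to automatic to improve the reliability of the cluster.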
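The three modes can be condensed into a small lookup table. This is only a restatement of the summary above in code; the class and field names are illustrative and do not correspond to anything in repmgr or KingbaseES.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryBehaviour:
    manual_steps: tuple   # what the operator must do on the old primary
    auto_rejoin: bool     # whether the cluster re-adds the node by itself


RECOVERY_MODES = {
    "standby": RecoveryBehaviour(
        manual_steps=("configure the old primary as a standby",
                      "start the database service"),
        auto_rejoin=True,
    ),
    "automatic": RecoveryBehaviour(manual_steps=(), auto_rejoin=True),
    "manual": RecoveryBehaviour(
        manual_steps=("run 'repmgr node rejoin' on the old primary",),
        auto_rejoin=False,
    ),
}


def needs_operator(mode):
    """True if the given recovery mode requires any manual step after a failover."""
    return bool(RECOVERY_MODES[mode].manual_steps)


for mode in RECOVERY_MODES:
    print(mode, "-> operator needed:", needs_operator(mode))
```

Only `automatic` leaves nothing for the operator to do, which is why it is the suggested setting for environments without routine DBA monitoring.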
KINGBASE Research Institute