How the recovery parameter affects switchover in a KingbaseES R6 cluster

阿新 • Published: 2022-03-03

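Case description: In a KingbaseES R6 cluster, when the primary node goes down (for example, through a reboot or shutdown), a primary/standby failover takes place. Once the original primary's host is back to normal, the question is how to handle that node so that the cluster's data stays consistent and safe. This is controlled by the recovery parameter in repmgr.conf.
This case records how the cluster recovers after the primary host is rebooted under each of the three settings of the 'recovery' parameter.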

Note: older KingbaseES R6 versions support only 'manual' and 'automatic' for the recovery parameter.
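
As a quick orientation, the behaviour seen in the three test runs of this article can be summarised in a small lookup. This is only a sketch: the descriptions are a reading of the logs recorded here, not a quotation from KingbaseES documentation, and the names `OBSERVED_BEHAVIOUR` and `describe_recovery` are hypothetical. The 'manual' entry in particular goes slightly beyond what the logged part shows and is labelled as an assumption.

```python
# Summary of what each `recovery` value did in the test runs of this article.
# The wording is an interpretation of the logs, not official documentation.
OBSERVED_BEHAVIOUR = {
    "standby": "failover happens; the old primary is rejoined automatically "
               "after it has been started as a standby",
    "automatic": "failover happens; the old primary is rewound and rejoined "
                 "automatically once its host is reachable again",
    "manual": "failover happens; the old primary stays 'failed' and is left "
              "for the operator to handle (assumption beyond the logged part)",
}


def describe_recovery(value):
    """Return the observed behaviour for a `recovery` setting.

    Accepts the value as written in repmgr.conf, e.g. "'automatic'".
    """
    mode = value.strip().strip("'\"").lower()
    if mode not in OBSERVED_BEHAVIOUR:
        raise ValueError(f"unexpected recovery value: {value!r}")
    return OBSERVED_BEHAVIOUR[mode]


if __name__ == "__main__":
    for setting in ("'standby'", "'automatic'", "'manual'"):
        print(setting, ":", describe_recovery(setting))
```

Keeping the accepted values in one place also makes it easy to reject a typo in the configuration before running a failover test.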

Database version:

Cluster architecture:

Cluster node information:

Case 1: Testing 'recovery = standby'

I. Primary/standby failover test

1. Configure the recovery parameter (on all nodes):

2. Check the cluster node status

[kingbase@node1 bin]$ ./repmgr cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                                                                                                
----+---------+---------+-----------+----------+----------+----------+----------+---------------------------
 1  | node243 | primary | * running |          | default  | 100      | 3        | host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node248 | standby |   running | node243  | default  | 100      | 3        | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

3. Reboot the primary node's host
[root@node3 ~]# reboot

4. Check the hamgr log on the standby

The hamgr log shows that after the original primary went down, the cluster failed over and the original standby was promoted to primary.

[kingbase@node1 log]$ tail -n 100 -f hamgr.log
[2022-03-01 13:12:23] [NOTICE] repmgrd (repmgrd 5.0.0) starting up
[2022-03-01 13:12:23] [INFO] connecting to database "host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"
INFO:  set_repmgrd_pid(): provided pidfile is /home/kingbase/cluster/R6C5/R6C5R/kingbase/etc/hamgrd.pid
[2022-03-01 13:12:23] [NOTICE] starting monitoring of node "node248" (ID: 2)
[2022-03-01 13:12:23] [INFO] "connection_check_type" set to "ping"
[2022-03-01 13:12:23] [INFO] monitoring connection to upstream node "node243" (ID: 1)
[2022-03-01 13:12:23] [NOTICE] try to change wal catched_up state to 1
[2022-03-01 13:12:23] [INFO] primary flush lsn is 0/12000900, local flush lsn is 0/12000848
[2022-03-01 13:12:23] [NOTICE] try to change streaming_sync state to TRUE
[2022-03-01 13:17:24] [INFO] node "node248" (ID: 2) monitoring upstream node "node243" (ID: 1) in normal state
[2022-03-01 13:20:00] [WARNING] unable to ping "host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"
[2022-03-01 13:20:00] [DETAIL] PQping() returned "PQPING_REJECT"
[2022-03-01 13:20:00] [WARNING] unable to connect to upstream node "node243" (ID: 1)
[2022-03-01 13:20:00] [INFO] sleeping 6 seconds until next reconnection attempt
[2022-03-01 13:20:06] [INFO] checking state of node 1, 1 of 10 attempts
[2022-03-01 13:20:16] [WARNING] unable to ping "user=esrep connect_timeout=10 dbname=esrep host=192.168.7.243 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
[2022-03-01 13:20:16] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2022-03-01 13:20:16] [INFO] sleeping 6 seconds until next reconnection attempt
......
[2022-03-01 13:21:23] [INFO] checking state of node 1, 10 of 10 attempts
[2022-03-01 13:21:23] [WARNING] unable to ping "user=esrep connect_timeout=10 dbname=esrep host=192.168.7.243 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
[2022-03-01 13:21:23] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2022-03-01 13:21:23] [WARNING] unable to reconnect to node 1 after 10 attempts
[2022-03-01 13:21:23] [NOTICE] setting "wal_retrieve_retry_interval" to 86405000 milliseconds
[2022-03-01 13:21:23] [WARNING] wal receiver not running
[2022-03-01 13:21:23] [NOTICE] WAL receiver disconnected on all sibling nodes
[2022-03-01 13:21:23] [INFO] WAL receiver disconnected on all 0 sibling nodes
[2022-03-01 13:21:23] [INFO] 0 active sibling nodes registered
[2022-03-01 13:21:23] [INFO] primary and this node have the same location ("default")
[2022-03-01 13:21:23] [INFO] no other sibling nodes - we win by default
[2022-03-01 13:21:23] [NOTICE] setting "wal_retrieve_retry_interval" to 5000 ms
[2022-03-01 13:21:23] [NOTICE] this node is the only available candidate and will now promote itself
[2022-03-01 13:21:23] [INFO] try to ping the trusted_servers "192.168.7.1" before execute promote_command
[2022-03-01 13:21:25] [NOTICE] PING 192.168.7.1 (192.168.7.1) 56(84) bytes of data.

--- 192.168.7.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 2.324/6.238/10.152/3.914 ms

[2022-03-01 13:21:25] [NOTICE] successfully ping one or more of the trusted_servers "192.168.7.1"
[2022-03-01 13:21:26] [NOTICE] try to stop old primary db (host: "192.168.7.243")
[2022-03-01 13:21:26] [NOTICE] PING 192.168.7.241 (192.168.7.241) 56(84) bytes of data.

--- 192.168.7.241 ping statistics ---
2 packets transmitted, 0 received, +1 errors, 100% packet loss, time 1000ms


[2022-03-01 13:21:26] [WARNING] ping host"192.168.7.241" failed
[2022-03-01 13:21:26] [DETAIL] average RTT value is not greater than zero
[2022-03-01 13:21:26] [INFO] loadvip result: 1, arping result: 1
[2022-03-01 13:21:26] [NOTICE] new primary node (ID: 2) acquire the virtual ip 192.168.7.241/24 success
[2022-03-01 13:21:26] [INFO] promote_command is:
  "/home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/repmgr  standby promote -f /home/kingbase/cluster/R6C5/R6C5R/kingbase/etc/repmgr.conf"
NOTICE: promoting standby to primary
DETAIL: promoting server "node248" (ID: 2) using sys_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "node248" (ID: 2) was successfully promoted to primary
[2022-03-01 13:21:30] [INFO] switching to primary monitoring mode
[2022-03-01 13:21:30] [NOTICE] monitoring cluster primary "node248" (ID: 2)
[2022-03-01 13:21:30] [INFO] create a thread 0x7fe7dbe15700 to check the cluster status
[2022-03-01 13:21:30] [INFO] node (ID: 1): no server running
[2022-03-01 13:21:31] [INFO] [thread 0x7fe7dbe15700] the cluster has no other running primary node, exit
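
To put a number on how long the failover took, the timestamps in the log can be compared. The following is a minimal sketch, assuming the `[YYYY-MM-DD HH:MM:SS]` prefix seen in the log above; the helper names are hypothetical and the sample contains three lines copied from that log.

```python
import re
from datetime import datetime

# Matches the "[2022-03-01 13:20:00]" prefix used by the hamgr log lines above.
TS = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]")


def first_timestamp(lines, needle):
    """Return the timestamp of the first timestamped line containing `needle`."""
    for line in lines:
        match = TS.match(line)
        if match and needle in line:
            return datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
    return None


def failover_seconds(log_text):
    """Seconds from the first failed upstream connection to primary monitoring mode."""
    lines = log_text.splitlines()
    start = first_timestamp(lines, "unable to connect to upstream node")
    end = first_timestamp(lines, "switching to primary monitoring mode")
    if start is None or end is None:
        return None
    return (end - start).total_seconds()


SAMPLE = """\
[2022-03-01 13:20:00] [WARNING] unable to connect to upstream node "node243" (ID: 1)
[2022-03-01 13:21:23] [WARNING] unable to reconnect to node 1 after 10 attempts
[2022-03-01 13:21:30] [INFO] switching to primary monitoring mode
"""

if __name__ == "__main__":
    print(failover_seconds(SAMPLE))
```

For the run above this gives 90 seconds, most of which is spent in the 10 reconnection attempts before the standby promotes itself.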

II. Rejoining the original primary to the cluster after its host recovers

1. Create replication slots on the new primary

test=# select sys_create_physical_replication_slot('repmgr_slot_1');
sys_create_physical_replication_slot 
--------------------------------------
(repmgr_slot_1,)
(1 row)

test=# select sys_create_physical_replication_slot('repmgr_slot_2');
sys_create_physical_replication_slot 
--------------------------------------
(repmgr_slot_2,)
(1 row)

test=# select * from sys_replication_slots;                         
   slot_name   | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn 
---------------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
 repmgr_slot_1 |        | physical  |        |          | f         | f      |            |      |              |             | 
 repmgr_slot_2 |        | physical  |        |          | f         | f      |            |      |              |             | 
(2 rows)

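2. After the original primary's host has finished booting:

1) Back up the data directory on the new standby node (the original primary)
[kingbase@node3 kingbase]$ cp -r data data.bk

2) Create the standby signal file under data (important)
[kingbase@node3 data]$ touch standby.signal

3) Check the new standby's connection string settings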

[kingbase@node3 data]$ cat kingbase.auto.conf 
# Do not edit this file manually!
# It will be overwritten by the ALTER SYSTEM command.
job_queue_processes = '5'
primary_conninfo = 'user=esrep connect_timeout=10 host=192.168.7.248 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 application_name=node243'
recovery_target_timeline = 'latest'
primary_slot_name = 'repmgr_slot_1'
wal_retrieve_retry_interval = '5000'
synchronous_standby_names = '1 (*)'
wal_retrieve_retry_interval = '5000'
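
Before starting the service, it helps to confirm that `primary_conninfo` points at the new primary (192.168.7.248 in this test) rather than at the node itself. A minimal sketch of that check follows; it assumes a simple space-separated `key=value` string with no quoted values containing spaces, and the function names are hypothetical.

```python
def parse_conninfo(conninfo):
    """Split a simple 'key=value key=value' string into a dict.

    Assumes every item has the form key=value and that no value
    contains spaces (true for the string shown above).
    """
    return dict(item.split("=", 1) for item in conninfo.split())


def points_at(conninfo, expected_host):
    """True if the conninfo's host entry equals `expected_host`."""
    return parse_conninfo(conninfo).get("host") == expected_host


# Value of primary_conninfo copied from kingbase.auto.conf above.
PRIMARY_CONNINFO = (
    "user=esrep connect_timeout=10 host=192.168.7.248 port=54321 "
    "keepalives=1 keepalives_idle=10 keepalives_interval=1 "
    "keepalives_count=3 application_name=node243"
)

if __name__ == "__main__":
    print(points_at(PRIMARY_CONNINFO, "192.168.7.248"))
```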

4) Start the database service on the new standby

[kingbase@node3 bin]$ ./sys_ctl start -D ../data
......
NOTICE: standby node "node243" (ID: 1) successfully registered

5) Check the current cluster node status

[kingbase@node3 bin]$ ./repmgr cluster show
 ID | Name    | Role    | Status               | Upstream | Location | Priority | Timeline | Connection string                                                                                                                                
----+---------+---------+----------------------+----------+----------+----------+----------+----------------
 1  | node243 | primary | ! running as standby |          | default  | 100      | 3        | host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node248 | standby | ! running as primary |          | default  | 100      | 4        | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

WARNING: following issues were detected
  - node "node243" (ID: 1) is registered as primary but running as standby
  - node "node248" (ID: 2) is registered as standby but running as primary
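
The same mismatch between registered and running roles can be picked out of the `repmgr cluster show` table with a few lines of code. This is a sketch only: it assumes the pipe-separated layout shown above, the function name is hypothetical, and the sample omits the connection string column for brevity.

```python
def role_mismatches(output):
    """Return (name, registered_role, running_role) for rows flagged 'running as'."""
    found = []
    marker = "running as "
    for line in output.splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        # Skip the header, the separator and anything that is not a node row.
        if len(cells) < 4 or not cells[0].isdigit():
            continue
        name, role, status = cells[1], cells[2], cells[3]
        if marker in status:
            found.append((name, role, status.split(marker, 1)[1]))
    return found


# First eight columns of the table above; the connection string is left out.
SAMPLE = """\
 ID | Name    | Role    | Status               | Upstream | Location | Priority | Timeline
----+---------+---------+----------------------+----------+----------+----------+----------
 1  | node243 | primary | ! running as standby |          | default  | 100      | 3
 2  | node248 | standby | ! running as primary |          | default  | 100      | 4
"""

if __name__ == "__main__":
    for name, registered, running in role_mismatches(SAMPLE):
        print(f"{name}: registered as {registered}, running as {running}")
```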

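6) The cluster automatically recovers the new standby

As the hamgr log below shows, once the database service on the new standby is started, the cluster automatically runs recovery on it and adds the original primary to the cluster as a standby.

[2022-03-01 13:26:31] [INFO] monitoring primary node "node248" (ID: 2) in normal state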
[2022-03-01 13:27:28] [INFO] child node: 1; attached: no
[2022-03-01 13:27:28] [INFO] check node status again, try 1 / 10 times
[2022-03-01 13:27:30] [INFO] child node: 1; attached: no
.....
[2022-03-01 13:27:46] [INFO] check node status again, try 10 / 10 times
[2022-03-01 13:27:48] [INFO] child node: 1; attached: no
[2022-03-01 13:27:48] [INFO] found node down, recovery will be triggered after recovery delay time 20s
[2022-03-01 13:27:50] [INFO] child node: 1; attached: no
......
[2022-03-01 13:28:08] [INFO] child node: 1; attached: no
[2022-03-01 13:28:08] [INFO] recovery delay time reached. can do recovery now.
[2022-03-01 13:28:09] [NOTICE] mark node "node243" (ID: 1) as inactive
[2022-03-01 13:28:09] [INFO] [thread pid:30763] do_nodes_recovery thread begin. The pthread_t tid is 0x7fe7dbe15700
[2022-03-01 13:28:09] [NOTICE] [thread pid:30763] node (ID: 1; host: "192.168.7.243") is not attached, ready to auto-recovery
[2022-03-01 13:28:09] [NOTICE] [thread pid:30763] Now, the primary host ip: 192.168.7.248
[2022-03-01 13:28:10] [INFO] [thread pid:30763] ES connection to host "192.168.7.243" succeeded, ready to do auto-recovery
[2022-03-01 13:28:10] [NOTICE] kbha: node (ID: 1) is running as standby, stop it and do rejoin.

[2022-03-01 13:28:15] [INFO] unlink file /tmp/.s.KINGBASE.54321.lock
[2022-03-01 13:28:15] [NOTICE] executing repmgr command "/home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/repmgr --dbname="host=192.168.7.248 dbname=esrep user=esrep port=54321" node rejoin --force-rewind"
NOTICE: sys_rewind execution required for this node to attach to rejoin target node 2
DETAIL: rejoin target server's timeline 4 forked off current database system timeline 3 before current recovery point 0/130000A0
NOTICE: executing sys_rewind
DETAIL: sys_rewind command is "/home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/sys_rewind -D '/home/kingbase/cluster/R6C5/R6C5R/kingbase/data' --source-server='host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3'"
sys_rewind: servers diverged at WAL location 0/12000A08 on timeline 3
sys_rewind: rewinding from last common checkpoint at 0/11000058 on timeline 3
sys_rewind: find last common checkpoint start time from 2022-03-01 13:28:15.600702 CST to 2022-03-01 13:28:16.200048 CST, in "0.599346" seconds.
sys_rewind: update the control file: minRecoveryPoint is '0/12011F70', minRecoveryPointTLI is '4', and database state is 'in archive recovery'
sys_rewind: rewind start wal location 0/11000028 (file 000000030000000000000011), end wal location 0/12011F70 (file 000000040000000000000012). time from 2022-03-01 13:28:15.600702 CST to 2022-03-01 13:28:36.045129 CST, in "20.444427" seconds.
sys_rewind: Done!
NOTICE: 0 files copied to /home/kingbase/cluster/R6C5/R6C5R/kingbase/data
NOTICE: setting node 1's upstream to node 2
WARNING: unable to ping "host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: begin to start server at 2022-03-01 13:28:36.437003
NOTICE: starting server using "/home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/sys_ctl  -w -t 90 -D '/home/kingbase/cluster/R6C5/R6C5R/kingbase/data' -l /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/logfile start"
NOTICE: start server finish at 2022-03-01 13:28:37.367954
NOTICE: NODE REJOIN successful
DETAIL: node 1 is now attached to node 2
[2022-03-01 13:28:38] [NOTICE] kbha: node (ID: 1) rejoin success.

[2022-03-01 13:28:38] [NOTICE] [thread pid:30763] node "node243" (ID: 1) auto-recovery success
[2022-03-01 13:28:38] [INFO] [thread pid:30763] do_nodes_recovery thread ends. The pthread_t tid is 0x7fe7dbe15700
[2022-03-01 13:28:39] [INFO] SET synchronous TO "sync" on primary host 
[2022-03-01 13:28:39] [INFO] thread tid:0x7fe7dbe15700 is not running
[2022-03-01 13:28:39] [INFO] the recovery thread was exited, reset tid
[2022-03-01 13:28:39] [NOTICE] Some nodes reconnect, all standby nodes are OK now
[2022-03-01 13:28:41] [NOTICE] new standby "node243" (ID: 1) has connected
[2022-03-01 13:31:31] [INFO] monitoring primary node "node248" (ID: 2) in normal state
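
The WAL locations reported by sys_rewind give a sense of how far back the rewind had to go. The sketch below converts two of them to byte positions and subtracts them. It assumes the LSN text format follows the PostgreSQL convention (two hexadecimal numbers for the high and low 32 bits), which matches how the values look in the log but is not confirmed by the article; the function names are hypothetical.

```python
def lsn_to_int(lsn):
    """Convert an LSN such as '0/12000A08' to a byte position.

    Assumption: the part before '/' is the high 32 bits and the part
    after it is the low 32 bits, both in hexadecimal.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wal_bytes_between(start_lsn, end_lsn):
    """Number of WAL bytes between two LSNs."""
    return lsn_to_int(end_lsn) - lsn_to_int(start_lsn)


# Values copied from the sys_rewind output above.
LAST_COMMON_CHECKPOINT = "0/11000058"
DIVERGENCE_POINT = "0/12000A08"

if __name__ == "__main__":
    print(wal_bytes_between(LAST_COMMON_CHECKPOINT, DIVERGENCE_POINT))
```

Under that assumption the two servers diverged about 16 MB of WAL after their last common checkpoint.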

7) Check the database processes on the standby

8) The original primary has rejoined the cluster as the new standby

[kingbase@node3 bin]$ ./repmgr cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                                                                                                
----+---------+---------+-----------+----------+----------+----------+----------+---------------------------
 1  | node243 | standby |   running | node248  | default  | 100      | 5        | host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node248 | primary | * running |          | default  | 100      | 6        | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

9) Query streaming replication status on the primary

test=# select * from sys_replication_slots;
   slot_name   | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn 
---------------+--------+-----------+--------+----------+-----------+--------+------------+------+-------------
 repmgr_slot_1 |        | physical  |        |          | f         | t      |      30928 | 1437 |              | 0/120130A8  | 
 repmgr_slot_2 |        | physical  |        |          | f         | f      |            |      |              |             | 
(2 rows)


test=# select * from sys_stat_replication;
  pid  | usesysid | usename | application_name |  client_addr  | client_hostname | client_port |         backend_start         | backend_xmin |   state   |  sent_lsn  | write_lsn  | flush_lsn  | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state |          reply_time           
-------+----------+---------+------------------+---------------+-----------------+-------------+-----------
 30928 |    16384 | esrep   | node243          | 192.168.7.243 |                 |       10817 | 2022-03-01 13:28:37.941077+08 |              | streaming | 0/120130A8 | 0/120130A8 | 0/120130A8 | 0/120130A8 |           |           |            |             1 | sync       | 2022-03-01 13:32:08.445325+08
(1 row)

Case 2: Testing 'recovery = automatic'

1. Check the cluster node status:

[kingbase@node1 bin]$ ./repmgr cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                                                                                                
----+---------+---------+-----------+----------+----------+----------+----------+---------------------------
 1  | node243 | primary | * running |          | default  | 100      | 3        | host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node248 | standby |   running | node243  | default  | 100      | 3        | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

2. Configure the recovery parameter

[kingbase@node3 bin]$ grep -Ei 'recovery|failover' ../etc/repmgr.conf
failover='automatic'
recovery='automatic'
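
When the same check has to be repeated on every node, the two settings can be read from the file contents with a small parser. This is a sketch under assumptions: it handles plain `key='value'` lines and `#` comments only, and `read_settings` is a hypothetical helper, not part of repmgr or KingbaseES.

```python
def read_settings(conf_text, wanted=("recovery", "failover")):
    """Extract the wanted keys from repmgr.conf-style text.

    Assumes one key=value pair per line, optional quotes around the
    value, and '#' starting a comment (values containing '#' are not handled).
    """
    settings = {}
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip().lower()
        if key in wanted:
            settings[key] = value.strip().strip("'\"")
    return settings


# The two lines shown in the grep output above.
SAMPLE = """\
failover='automatic'
recovery='automatic'
"""

if __name__ == "__main__":
    print(read_settings(SAMPLE))
```

Comparing the resulting dictionaries across nodes is a simple way to catch a node whose recovery setting differs from the rest.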

3. Reboot the primary node
[root@node3 ~]# reboot

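4. Check the hamgr log on the standby

The log below shows that after the primary node went down, the cluster failed over, and once the primary node's host was back to normal, the original primary was automatically added to the cluster as the new standby.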

[2022-03-01 14:38:09] [NOTICE] starting monitoring of node "node248" (ID: 2)
[2022-03-01 14:38:09] [INFO] "connection_check_type" set to "ping"
[2022-03-01 14:38:10] [INFO] monitoring connection to upstream node "node243" (ID: 1)
[2022-03-01 14:38:10] [NOTICE] try to change wal catched_up state to 1
[2022-03-01 14:38:10] [INFO] primary flush lsn is 0/17000578, local flush lsn is 0/170004C0
[2022-03-01 14:38:10] [NOTICE] try to change streaming_sync state to TRUE
[2022-03-01 14:43:11] [INFO] node "node248" (ID: 2) monitoring upstream node "node243" (ID: 1) in normal state
[2022-03-01 14:46:42] [WARNING] unable to ping "host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"
[2022-03-01 14:46:42] [DETAIL] PQping() returned "PQPING_REJECT"
[2022-03-01 14:46:42] [WARNING] unable to connect to upstream node "node243" (ID: 1)
[2022-03-01 14:46:42] [INFO] sleeping 6 seconds until next reconnection attempt
[2022-03-01 14:46:48] [INFO] checking state of node 1, 1 of 10 attempts
[2022-03-01 14:46:58] [WARNING] unable to ping "user=esrep connect_timeout=10 dbname=esrep host=192.168.7.243 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
[2022-03-01 14:46:58] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2022-03-01 14:46:58] [INFO] sleeping 6 seconds until next reconnection attempt
......
[2022-03-01 14:48:59] [INFO] checking state of node 1, 10 of 10 attempts
[2022-03-01 14:48:59] [WARNING] unable to ping "user=esrep connect_timeout=10 dbname=esrep host=192.168.7.243 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
[2022-03-01 14:48:59] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2022-03-01 14:48:59] [WARNING] unable to reconnect to node 1 after 10 attempts
[2022-03-01 14:48:59] [NOTICE] setting "wal_retrieve_retry_interval" to 86405000 milliseconds
[2022-03-01 14:49:00] [WARNING] wal receiver not running
[2022-03-01 14:49:00] [NOTICE] WAL receiver disconnected on all sibling nodes
[2022-03-01 14:49:00] [INFO] WAL receiver disconnected on all 0 sibling nodes
[2022-03-01 14:49:00] [INFO] 0 active sibling nodes registered
[2022-03-01 14:49:00] [INFO] primary and this node have the same location ("default")
[2022-03-01 14:49:00] [INFO] no other sibling nodes - we win by default
[2022-03-01 14:49:00] [NOTICE] setting "wal_retrieve_retry_interval" to 5000 ms
[2022-03-01 14:49:00] [NOTICE] this node is the only available candidate and will now promote itself
[2022-03-01 14:49:00] [INFO] try to ping the trusted_servers "192.168.7.1" before execute promote_command
[2022-03-01 14:49:02] [NOTICE] PING 192.168.7.1 (192.168.7.1) 56(84) bytes of data.

--- 192.168.7.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 2.345/22.599/42.853/20.254 ms

[2022-03-01 14:49:02] [NOTICE] successfully ping one or more of the trusted_servers "192.168.7.1"
[2022-03-01 14:49:04] [NOTICE] PING 192.168.7.241 (192.168.7.241) 56(84) bytes of data.

--- 192.168.7.241 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 1999ms


[2022-03-01 14:49:04] [WARNING] ping host"192.168.7.241" failed
[2022-03-01 14:49:04] [DETAIL] average RTT value is not greater than zero
[2022-03-01 14:49:04] [INFO] loadvip result: 1, arping result: 1
[2022-03-01 14:49:04] [NOTICE] new primary node (ID: 2) acquire the virtual ip 192.168.7.241/24 success
[2022-03-01 14:49:04] [INFO] promote_command is:
  "/home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/repmgr  standby promote -f /home/kingbase/cluster/R6C5/R6C5R/kingbase/etc/repmgr.conf"
NOTICE: promoting standby to primary
DETAIL: promoting server "node248" (ID: 2) using sys_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
INFO: SET synchronous TO "async" on primary host 
[2022-03-01 14:49:07] [NOTICE] try to stop old primary db (host: "192.168.7.243")
NOTICE: STANDBY PROMOTE successful
DETAIL: server "node248" (ID: 2) was successfully promoted to primary
[2022-03-01 14:49:11] [INFO] switching to primary monitoring mode
[2022-03-01 14:49:11] [NOTICE] monitoring cluster primary "node248" (ID: 2)
[2022-03-01 14:49:11] [INFO] create a thread 0x7f1b4b125700 to check the cluster status
[2022-03-01 14:49:11] [INFO] child node: 1; attached: no
[2022-03-01 14:49:11] [INFO] check node status again, try 1 / 10 times
[2022-03-01 14:49:12] [INFO] node (ID: 1): no server running
.......
[2022-03-01 14:49:29] [INFO] check node status again, try 10 / 10 times
[2022-03-01 14:49:31] [INFO] child node: 1; attached: no
[2022-03-01 14:49:31] [INFO] found node down, recovery will be triggered after recovery delay time 20s
[2022-03-01 14:49:33] [INFO] child node: 1; attached: no
......
[2022-03-01 14:49:52] [INFO] child node: 1; attached: no
[2022-03-01 14:49:52] [INFO] recovery delay time reached. can do recovery now.
[2022-03-01 14:49:52] [INFO] [thread pid:11778] do_nodes_recovery thread begin. The pthread_t tid is 0x7f1b4b125700
[2022-03-01 14:49:52] [NOTICE] [thread pid:11778] node (ID: 1; host: "192.168.7.243") is not attached, ready to auto-recovery
[2022-03-01 14:49:52] [NOTICE] [thread pid:11778] Now, the primary host ip: 192.168.7.248
[2022-03-01 14:49:52] [INFO] [thread pid:11778] ES connection to host "192.168.7.243" succeeded, ready to do auto-recovery
[2022-03-01 14:49:53] [INFO] unlink file /tmp/.s.KINGBASE.54321.lock
[2022-03-01 14:49:53] [NOTICE] executing repmgr command "/home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/repmgr --dbname="host=192.168.7.248 dbname=esrep user=esrep port=54321" node rejoin --force-rewind"
NOTICE: sys_rewind execution required for this node to attach to rejoin target node 2
DETAIL: rejoin target server's timeline 8 forked off current database system timeline 7 before current recovery point 0/18000028
NOTICE: executing sys_rewind
DETAIL: sys_rewind command is "/home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/sys_rewind -D '/home/kingbase/cluster/R6C5/R6C5R/kingbase/data' --source-server='host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3'"
sys_rewind: servers diverged at WAL location 0/17000680 on timeline 7
sys_rewind: rewinding from last common checkpoint at 0/160007C8 on timeline 7
sys_rewind: find last common checkpoint start time from 2022-03-01 14:49:53.170681 CST to 2022-03-01 14:49:53.296332 CST, in "0.125651" seconds.
sys_rewind: update the control file: minRecoveryPoint is '0/1700DE58', minRecoveryPointTLI is '8', and database state is 'in archive recovery'
sys_rewind: we will remove the dir '/home/kingbase/cluster/R6C5/R6C5R/kingbase/data/sys_replslot/repmgr_slot_2.rewind' and all the file/dir in it.
sys_rewind: rewind start wal location 0/16000798 (file 000000070000000000000016), end wal location 0/1700DE58 (file 000000080000000000000017). time from 2022-03-01 14:49:53.170681 CST to 2022-03-01 14:50:06.920859 CST, in "13.750178" seconds.
sys_rewind: Done!
NOTICE: 0 files copied to /home/kingbase/cluster/R6C5/R6C5R/kingbase/data
NOTICE: setting node 1's upstream to node 2
WARNING: unable to ping "host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: begin to start server at 2022-03-01 14:50:07.530887
NOTICE: starting server using "/home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/sys_ctl  -w -t 90 -D '/home/kingbase/cluster/R6C5/R6C5R/kingbase/data' -l /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/logfile start"
NOTICE: start server finish at 2022-03-01 14:50:08.952996
NOTICE: NODE REJOIN successful
DETAIL: node 1 is now attached to node 2
[2022-03-01 14:50:09] [NOTICE] kbha: node (ID: 1) rejoin success.

[2022-03-01 14:50:10] [NOTICE] [thread pid:11778] node "node243" (ID: 1) auto-recovery success
[2022-03-01 14:50:10] [INFO] [thread pid:11778] do_nodes_recovery thread ends. The pthread_t tid is 0x7f1b4b125700
[2022-03-01 14:50:10] [INFO] SET synchronous TO "sync" on primary host 
[2022-03-01 14:50:10] [INFO] thread tid:0x7f1b4b125700 is not running
[2022-03-01 14:50:10] [INFO] the recovery thread was exited, reset tid
[2022-03-01 14:50:10] [NOTICE] Some nodes reconnect, all standby nodes are OK now
[2022-03-01 14:50:12] [NOTICE] new standby "node243" (ID: 1) has connected

5. Check the standby's database processes and the cluster status

[kingbase@node3 bin]$ ps -ef |grep kingbase
kingbase  2654     1  0 14:49 ?        00:00:00 /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/kbha -A daemon -f /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/../etc/repmgr.conf
kingbase  3462     1  0 14:50 ?        00:00:00 /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/kingbase -D /home/kingbase/cluster/R6C5/R6C5R/kingbase/data
kingbase  3463  3462  0 14:50 ?        00:00:00 kingbase: logger   
kingbase  3464  3462  0 14:50 ?        00:00:00 kingbase: startup   recovering 000000080000000000000017
kingbase  3465  3462  0 14:50 ?        00:00:00 kingbase: checkpointer   
kingbase  3466  3462  0 14:50 ?        00:00:00 kingbase: background writer   
kingbase  3467  3462  0 14:50 ?        00:00:00 kingbase: stats collector   
kingbase  3468  3462  0 14:50 ?        00:00:00 kingbase: walreceiver   streaming 0/1700F160
kingbase  3471  3462  0 14:50 ?        00:00:00 kingbase: esrep esrep 192.168.7.243(57348) idle
kingbase  3522     1  0 14:50 ?        00:00:00 /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/repmgrd -d -v -f /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/../etc/repmgr.conf
kingbase  3523  3462  0 14:50 ?        00:00:00 kingbase: esrep esrep 192.168.7.243(57351) idle

[kingbase@node3 bin]$ ./repmgr cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                                                                                                
----+---------+---------+-----------+----------+----------+----------+----------+--------------------------
 1  | node243 | standby |   running | node248  | default  | 100      | 7        | host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node248 | primary | * running |          | default  | 100      | 8        | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

Case 3: Testing 'recovery = manual'

1. Check the cluster node status:

[kingbase@node1 bin]$ ./repmgr cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                                                                                                
----+---------+---------+-----------+----------+----------+----------+----------+---------------------------
 1  | node243 | primary | * running |          | default  | 100      | 3        | host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node248 | standby |   running | node243  | default  | 100      | 3        | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

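2. Check the recovery configuration

3. Reboot the primary host
[root@node3 ~]# reboot

4. Check the hamgr log on the standby

The log below shows that after the primary's host went down, the cluster failed over and the standby was promoted to primary.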

[2022-03-02 10:32:38] [NOTICE] starting monitoring of node "node248" (ID: 2)
[2022-03-02 10:32:38] [INFO] "connection_check_type" set to "ping"
[2022-03-02 10:32:38] [INFO] monitoring connection to upstream node "node243" (ID: 1)
[2022-03-02 10:32:38] [NOTICE] try to change wal catched_up state to 1
[2022-03-02 10:32:38] [INFO] primary flush lsn is 0/1F000D40, local flush lsn is 0/1F000D40
[2022-03-02 10:32:38] [NOTICE] try to change streaming_sync state to TRUE
[2022-03-02 10:34:24] [WARNING] unable to ping "host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"
[2022-03-02 10:34:24] [DETAIL] PQping() returned "PQPING_REJECT"
[2022-03-02 10:34:24] [WARNING] unable to connect to upstream node "node243" (ID: 1)
[2022-03-02 10:34:24] [INFO] sleeping 6 seconds until next reconnection attempt
[2022-03-02 10:34:30] [INFO] checking state of node 1, 1 of 10 attempts
[2022-03-02 10:34:40] [WARNING] unable to ping "user=esrep connect_timeout=10 dbname=esrep host=192.168.7.243 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
[2022-03-02 10:34:40] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2022-03-02 10:34:40] [INFO] sleeping 6 seconds until next reconnection attempt

......

[2022-03-02 10:35:47] [INFO] checking state of node 1, 10 of 10 attempts
[2022-03-02 10:35:47] [WARNING] unable to ping "user=esrep connect_timeout=10 dbname=esrep host=192.168.7.243 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
[2022-03-02 10:35:47] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2022-03-02 10:35:47] [WARNING] unable to reconnect to node 1 after 10 attempts
[2022-03-02 10:35:47] [NOTICE] setting "wal_retrieve_retry_interval" to 86405000 milliseconds
[2022-03-02 10:35:47] [WARNING] wal receiver not running
[2022-03-02 10:35:47] [NOTICE] WAL receiver disconnected on all sibling nodes
[2022-03-02 10:35:47] [INFO] WAL receiver disconnected on all 0 sibling nodes
[2022-03-02 10:35:47] [INFO] 0 active sibling nodes registered
[2022-03-02 10:35:47] [INFO] primary and this node have the same location ("default")
[2022-03-02 10:35:47] [INFO] no other sibling nodes - we win by default
[2022-03-02 10:35:47] [NOTICE] setting "wal_retrieve_retry_interval" to 5000 ms
[2022-03-02 10:35:48] [NOTICE] this node is the only available candidate and will now promote itself
[2022-03-02 10:35:48] [INFO] try to ping the trusted_servers "192.168.7.1" before execute promote_command
[2022-03-02 10:35:50] [NOTICE] PING 192.168.7.1 (192.168.7.1) 56(84) bytes of data.

--- 192.168.7.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1008ms
rtt min/avg/max/mdev = 2.473/2.535/2.598/0.080 ms

[2022-03-02 10:35:50] [NOTICE] successfully ping one or more of the trusted_servers "192.168.7.1"
[2022-03-02 10:35:51] [NOTICE] PING 192.168.7.241 (192.168.7.241) 56(84) bytes of data.

--- 192.168.7.241 ping statistics ---
2 packets transmitted, 0 received, +1 errors, 100% packet loss, time 1000ms


[2022-03-02 10:35:51] [WARNING] ping host"192.168.7.241" failed
[2022-03-02 10:35:51] [DETAIL] average RTT value is not greater than zero
[2022-03-02 10:35:51] [INFO] loadvip result: 1, arping result: 1
[2022-03-02 10:35:51] [NOTICE] new primary node (ID: 2) acquire the virtual ip 192.168.7.241/24 success
[2022-03-02 10:35:51] [INFO] promote_command is:
  "/home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/repmgr  standby promote -f /home/kingbase/cluster/R6C5/R6C5R/kingbase/etc/repmgr.conf"
NOTICE: promoting standby to primary
DETAIL: promoting server "node248" (ID: 2) using sys_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
[2022-03-02 10:35:51] [NOTICE] try to stop old primary db (host: "192.168.7.243")
INFO: SET synchronous TO "async" on primary host
NOTICE: STANDBY PROMOTE successful
DETAIL: server "node248" (ID: 2) was successfully promoted to primary
[2022-03-02 10:35:56] [INFO] 0 followers to notify
[2022-03-02 10:35:56] [INFO] switching to primary monitoring mode
[2022-03-02 10:35:56] [NOTICE] monitoring cluster primary "node248" (ID: 2)
[2022-03-02 10:35:56] [INFO] create a thread 0x7fdeaa4b9700 to check the cluster status
[2022-03-02 10:35:57] [INFO] node (ID: 1): no server running
[2022-03-02 10:35:57] [INFO] [thread 0x7fdeaa4b9700] the cluster has no other running primary node, exit
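
The log above is long, and only a few lines mark the actual failover steps. As a minimal sketch, the helper below filters a log file down to those milestones. The function name `failover_milestones` is hypothetical, and the patterns are copied from messages visible in this excerpt, so other versions may word them differently.

```shell
# failover_milestones LOG_FILE
# Prints only the lines that mark failover milestones.
# The patterns come from the hamgr.log excerpt above (an assumption:
# other repmgr/KingbaseES versions may use different wording).
failover_milestones() {
    grep -E 'unable to reconnect to node|will now promote itself|acquire the virtual ip|STANDBY PROMOTE successful|monitoring cluster primary' "$1"
}
```

For example, `failover_milestones hamgr.log` on the standby would show the reconnect failure, the self-promotion notice, the virtual IP takeover and the promotion result in order.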

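5. The original primary's host boots normally

1) Check the cluster status from the new primary

=The output below shows that the cluster is now in a 'dual-primary' state: both nodes are registered as primary, but the original primary is marked 'failed' and cannot be connected to.=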

[kingbase@node1 bin]$ ./repmgr cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                                                                                                
----+---------+---------+-----------+----------+----------+----------+----------+----------------
 1  | node243 | primary | - failed  |          | default  | 100      | ?        | host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node248 | primary | * running |          | default  | 100      | 10       | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

WARNING: following issues were detected
  - unable to connect to node "node243" (ID: 1)
You have new mail in /var/spool/mail/kingbase
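
The dual-primary condition shown above can also be detected by a script. The following is a rough sketch that parses the text of `repmgr cluster show` from standard input. The function names `list_primaries` and `single_primary` are hypothetical, and the column order (ID, Name, Role, Status) is assumed to match the output in this article.

```shell
# Reads `repmgr cluster show` output on stdin and prints one line per node
# whose Role column is "primary", in the form "<name> <status>".
# Header, separator and WARNING lines are skipped because they either lack
# '|' separators or do not carry "primary" in the third column.
list_primaries() {
    awk -F'|' 'NF >= 4 {
        role = $3; gsub(/^[ \t]+|[ \t]+$/, "", role)
        if (role == "primary") {
            name = $2; status = $4
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            gsub(/^[ \t]+|[ \t]+$/, "", status)
            print name, status
        }
    }'
}

# Succeeds (exit status 0) only when exactly one node is registered as primary.
single_primary() {
    [ "$(list_primaries | wc -l | tr -d ' ')" = "1" ]
}
```

A possible use is `./repmgr cluster show | single_primary || echo "more than one primary registered"`. With the output above, `list_primaries` would report both node243 (failed) and node248 (running).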

2) Create replication slots on the new primary (the former standby)

# Create the replication slots

test=# select sys_create_physical_replication_slot('repmgr_slot_1');
 sys_create_physical_replication_slot 
--------------------------------------
 (repmgr_slot_1,)
(1 row)

test=# select sys_create_physical_replication_slot('repmgr_slot_2');
 sys_create_physical_replication_slot 
--------------------------------------
 (repmgr_slot_2,)
(1 row)

test=# select * from sys_replication_slots;                         
   slot_name   | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn 
---------------+--------+-----------+--------+----------+-----------+--------+------------+-----
 repmgr_slot_1 |        | physical  |        |          | f         | f      |            |      |              |             | 
 repmgr_slot_2 |        | physical  |        |          | f         | f      |            |      |              |             | 
(2 rows)

3) On the original primary (which will become the new standby), perform the following recovery steps

# Back up the data directory
[kingbase@node3 kingbase]$ cp data data.bk -r

# Create the standby signal file
[kingbase@node3 kingbase]$ cd data
[kingbase@node3 data]$ touch standby.signal
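
The two manual steps above (back up the data directory, then create `standby.signal`) can be wrapped in one guarded helper. This is only a sketch: the function name `prepare_standby` is hypothetical, it assumes the database on this node is stopped, and it uses `cp -R -p` rather than the `cp -r` shown above so that file modes and timestamps are preserved.

```shell
# prepare_standby DATA_DIR
# Copies DATA_DIR to DATA_DIR.bk, refusing to overwrite an existing backup,
# then creates the standby.signal marker inside DATA_DIR.
# Assumption: the database instance using DATA_DIR is not running.
prepare_standby() {
    data_dir=${1%/}
    backup_dir="${data_dir}.bk"
    [ -d "$data_dir" ] || { echo "no such directory: $data_dir" >&2; return 1; }
    [ ! -e "$backup_dir" ] || { echo "backup already exists: $backup_dir" >&2; return 1; }
    cp -R -p "$data_dir" "$backup_dir" || return 1
    touch "$data_dir/standby.signal"
}
```

Taking the backup before touching `standby.signal` keeps the copy identical to the state the node was in when it went down, which is useful if the rejoin has to be retried.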

4) On the original primary, run repmgr node rejoin to rejoin the cluster

[kingbase@node3 bin]$ ./repmgr node rejoin -h 192.168.7.248 -U esrep -d esrep --force-rewind
NOTICE: sys_rewind execution required for this node to attach to rejoin target node 2
DETAIL: rejoin target server's timeline 10 forked off current database system timeline 9 before current recovery point 0/200000A0
NOTICE: executing sys_rewind
DETAIL: sys_rewind command is "/home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/sys_rewind -D '/home/kingbase/cluster/R6C5/R6C5R/kingbase/data' --source-server='host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3'"
sys_rewind: servers diverged at WAL location 0/1F000D70 on timeline 9
sys_rewind: rewinding from last common checkpoint at 0/1E000A70 on timeline 9
sys_rewind: find last common checkpoint start time from 2022-03-02 10:52:34.133058 CST to 2022-03-02 10:52:34.358066 CST, in "0.225008" seconds.
sys_rewind: update the control file: minRecoveryPoint is '0/1F011AD0', minRecoveryPointTLI is '10', and database state is 'in archive recovery'
sys_rewind: rewind start wal location 0/1E000A40 (file 00000009000000000000001E), end wal location 0/1F011AD0 (file 0000000A000000000000001F). time from 2022-03-02 10:52:34.133058 CST to 2022-03-02 10:53:06.442270 CST, in "32.309212" seconds.
sys_rewind: Done!
NOTICE: 0 files copied to /home/kingbase/cluster/R6C5/R6C5R/kingbase/data
NOTICE: setting node 1's upstream to node 2
WARNING: unable to ping "host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: begin to start server at 2022-03-02 10:53:06.588331
NOTICE: starting server using "/home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/sys_ctl  -w -t 90 -D '/home/kingbase/cluster/R6C5/R6C5R/kingbase/data' -l /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/logfile start"
NOTICE: start server finish at 2022-03-02 10:53:07.313294
NOTICE: NODE REJOIN successful
DETAIL: node 1 is now attached to node 2

5) Verify the database service on the new standby (started by the rejoin in the previous step)

[kingbase@node3 bin]$ ps -ef |grep kingbase
kingbase  3218     1  0 10:36 ?        00:00:00 /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/kbha -A daemon -f /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/../etc/repmgr.conf
kingbase  5817     1  0 10:49 ?        00:00:01 /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/repmgrd -d -v -f /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/../etc/repmgr.conf
kingbase  6730     1  0 10:53 ?        00:00:00 /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/kingbase -D /home/kingbase/cluster/R6C5/R6C5R/kingbase/data
kingbase  6731  6730  0 10:53 ?        00:00:00 kingbase: logger   
kingbase  6732  6730  0 10:53 ?        00:00:00 kingbase: startup   recovering 0000000A000000000000001F
kingbase  6736  6730  0 10:53 ?        00:00:00 kingbase: checkpointer   
kingbase  6737  6730  0 10:53 ?        00:00:00 kingbase: background writer   
kingbase  6738  6730  0 10:53 ?        00:00:00 kingbase: stats collector   
kingbase  6739  6730  0 10:53 ?        00:00:00 kingbase: walreceiver   streaming 0/1F012A78
kingbase  6743  6730  0 10:53 ?        00:00:00 kingbase: esrep esrep 192.168.7.243(55941) idle
kingbase  6750  6730  0 10:53 ?        00:00:00 kingbase: esrep esrep 192.168.7.243(55947) idle
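
In the process list above, the line that confirms replication has resumed is the `walreceiver   streaming` entry. A small sketch of an automated check follows. The function name `walreceiver_streaming` is hypothetical, and the check assumes process titles are reported in the same form as in this listing.

```shell
# Reads process-list text (for example from `ps -ef`) on stdin and succeeds
# only if a walreceiver process reports the "streaming" state.
# Assumption: process titles look like the listing in this article.
walreceiver_streaming() {
    grep -Eq 'walreceiver[[:space:]]+streaming'
}
```

On the new standby this could be used as `ps -ef | walreceiver_streaming && echo "standby is streaming"`.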

6) Check the cluster node status

[kingbase@node3 bin]$ ./repmgr cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                                                                                                
----+---------+---------+-----------+----------+----------+----------+----------+----------------
 1  | node243 | standby |   running | node248  | default  | 100      | 9        | host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node248 | primary | * running |          | default  | 100      | 10       | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

7) Restart the cluster as a test (optional)

[kingbase@node3 bin]$ ./sys_monitor.sh restart
2022-03-02 10:55:26 Ready to stop all DB ...
....
server started
2022-03-02 10:55:52 execute to start DB on "[192.168.7.248]" success, connect to check it.
2022-03-02 10:55:53 DB on "[192.168.7.248]" start success.
 ID | Name    | Role    | Status    | Upstream  | Location | Priority | Timeline | Connection string                                                                                                                                
----+---------+---------+-----------+-----------+----------+----------+----------+---------------
 1  | node243 | standby |   running | ! node248 | default  | 100      | 10       | host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node248 | primary | * running |           | default  | 100      | 10       | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
WARNING: following issues were detected
  - node "node243" (ID: 1) is not attached to its upstream node "node248" (ID: 2)
2022-03-02 10:55:53 The primary DB is started.
......
2022-03-02 10:56:15 repmgrd on "[192.168.7.248]" start success.
 ID | Name    | Role    | Status    | Upstream | repmgrd | PID   | Paused? | Upstream last seen
----+---------+---------+-----------+----------+---------+-------+---------+--------------------
 1  | node243 | standby |   running | node248  | running | 9500  | no      | 1 second(s) ago    
 2  | node248 | primary | * running |          | running | 27881 | no      | n/a                
[2022-03-02 10:56:18] [NOTICE] redirecting logging output to "/home/kingbase/cluster/R6C5/R6C5R/kingbase/log/kbha.log"
[2022-03-02 10:56:20] [NOTICE] redirecting logging output to "/home/kingbase/cluster/R6C5/R6C5R/kingbase/log/kbha.log"
2022-03-02 10:56:22 Done.

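=The output above shows that, after repmgr node rejoin was run manually, the original primary rejoined the cluster as the new standby.=

Summary:

   1. With recovery=standby, after the primary node's host goes down the cluster fails over to the standby. The original primary must be manually configured as a standby and its database service started; the cluster can then add it back automatically.
   2. With recovery=automatic, after the primary node's host goes down the cluster fails over to the standby. No manual intervention is needed; the original primary rejoins the cluster automatically as the new standby.
   3. With recovery=manual, after the primary node's host goes down the cluster fails over to the standby. Manual intervention is required: run 'repmgr node rejoin' on the original primary so that it rejoins the cluster as the new standby.
   4. For production environments without day-to-day DBA monitoring, consider setting recovery to automatic to improve the reliability of the cluster.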
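
The three behaviours summarised above can be turned into a small lookup that tells an operator what follow-up a node's current setting implies. This is a sketch only. The function names `recovery_mode` and `expected_action` are hypothetical, and the parser assumes the parameter appears in repmgr.conf as `recovery=<value>`, optionally quoted and with optional spaces around `=`.

```shell
# recovery_mode CONF_FILE
# Prints the value of the last uncommented "recovery" setting in CONF_FILE,
# with surrounding quotes and whitespace stripped.
# Assumption: the line has the form  recovery = 'value'  or  recovery=value.
recovery_mode() {
    sed -n "s/^[[:space:]]*recovery[[:space:]]*=[[:space:]]*['\"]\{0,1\}\([A-Za-z]*\).*/\1/p" "$1" | tail -n 1
}

# expected_action MODE
# Maps a recovery mode to the follow-up described in the summary above.
expected_action() {
    case "$1" in
        standby)   echo "manually set the old primary up as a standby and start it; the cluster then re-attaches it" ;;
        automatic) echo "none; the old primary rejoins as a standby on its own" ;;
        manual)    echo "run 'repmgr node rejoin' on the old primary" ;;
        *)         echo "unknown recovery mode: $1" >&2; return 1 ;;
    esac
}
```

For instance, `expected_action "$(recovery_mode /path/to/repmgr.conf)"` prints the follow-up for the configured mode, where the path is a placeholder for the cluster's actual configuration file.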

KINGBASE Research Institute
