KingbaseES R6 Cluster: Manually Configuring a VIP

Environment:

Operating system (UOS):
root@uos01:~# cat /etc/issue
Uniontech OS Server 20 Enterprise \n \l

Database:
test=# select version();
                                                       version                                                      
-------------------------------------------------------------------------------
 KingbaseES V008R006C003B0010 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-46), 64-bit
(1 row)

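Case description:
The cluster was originally deployed without a VIP. After it went into service, the application required one, so the VIP had to be added afterwards. Manually configuring a VIP on a KES R6 cluster is fairly simple: apart from granting setuid permission on the ip and arping executables, only repmgr.conf needs to be changed.

Summary of steps:

      1) Choose the VIP address. It must be in the same subnet as the physical IP and must not already be in use.
      2) Find the paths of the arping and ip executables and check the arping version.
      3) Set the setuid permission (the s bit) on the ip and arping executables.
      4) Add the VIP settings to repmgr.conf.
      5) Restart the cluster and verify its status.
      6) Test a primary/standby switchover.
      7) Test application access through the VIP.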
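
Step 1 requires the VIP to sit in the same subnet as the physical IP. The following is a minimal sketch of that check in plain shell arithmetic, using the addresses from this article (192.168.7.238 and 192.168.7.244 with a /24 prefix). The helper names ip_to_int and same_subnet are illustrative only and are not part of KingbaseES. The sketch does not check whether the address is already in use.

```shell
#!/bin/sh
# Convert a dotted-quad IPv4 address to a 32-bit integer.
# Runs in a subshell so the IFS change does not leak out.
ip_to_int() (
    IFS=.
    set -- $1
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# Print "yes" if both addresses share the same network for the
# given prefix length, otherwise "no".
same_subnet() {
    a=$(ip_to_int "$1")
    b=$(ip_to_int "$2")
    mask=$(( (0xFFFFFFFF << (32 - $3)) & 0xFFFFFFFF ))
    if [ $(( a & mask )) -eq $(( b & mask )) ]; then
        echo yes
    else
        echo no
    fi
}

# Physical IP of node238 and the planned VIP, both /24.
same_subnet 192.168.7.238 192.168.7.244 24   # prints: yes
```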

I. Cluster architecture
1. Initial deployment

2. Check cluster node status

kingbase@uos01:~/cluster/R6HA/kha/kingbase/bin$ ./repmgr cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                                                                                                
----+---------+---------+-----------+----------+----------+----------+----------+
 1  | node238 | primary | * running |          | default  | 100      | 1        | host=192.168.7.238 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node239 | standby |   running | node238  | default  | 100      | 1        | host=192.168.7.239 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

3. Review the repmgr.conf file

 kingbase@uos01:~/cluster/R6HA/kha/kingbase/etc$ cat repmgr.conf 
on_bmj=off
node_id=1
node_name='node238'
promote_command='/home/kingbase/cluster/R6HA/kha/kingbase/bin/repmgr  standby promote -f /home/kingbase/cluster/R6HA/kha/kingbase/etc/repmgr.conf'
follow_command='/home/kingbase/cluster/R6HA/kha/kingbase/bin/repmgr  standby follow  -f /home/kingbase/cluster/R6HA/kha/kingbase/etc/repmgr.conf -W --upstream-node-id=%n'
conninfo='host=192.168.7.238 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3'
log_file='/home/kingbase/cluster/R6HA/kha/kingbase/hamgr.log'
data_directory='/home/kingbase/cluster/R6HA/kha/kingbase/data'
sys_bindir='/home/kingbase/cluster/R6HA/kha/kingbase/bin'
ssh_options='-q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o ServerAliveInterval=2 -o ServerAliveCountMax=5 -p 22'
reconnect_attempts=3
reconnect_interval=5
failover='automatic'
recovery='manual'
monitoring_history='no'
trusted_servers='192.168.7.1'
synchronous='quorum'
repmgrd_pid_file='/home/kingbase/cluster/R6HA/kha/kingbase/hamgrd.pid'
ping_path='/usr/bin'

=== The configuration file above contains no virtual_ip setting. ===

4. Start the cluster with sys_monitor.sh

kingbase@uos01:~/cluster/R6HA/kha/kingbase/bin$ ./sys_monitor.sh restart
2021-03-01 12:07:25 Ready to stop all DB ...
Service process "node_export" was killed at process 12391
Service process "postgres_ex" was killed at process 12392
Service process "node_export" was killed at process 5229
Service process "postgres_ex" was killed at process 5230
2021-03-01 12:07:28 begin to stop repmgrd on "[192.168.7.238]".
2021-03-01 12:07:29 repmgrd on "[192.168.7.238]" stop success.
2021-03-01 12:07:29 begin to stop repmgrd on "[192.168.7.239]".
2021-03-01 12:07:29 repmgrd on "[192.168.7.239]" stop success.
2021-03-01 12:07:29 begin to stop DB on "[192.168.7.239]".
waiting for server to shut down.... done
server stopped
2021-03-01 12:07:30 DB on "[192.168.7.239]" stop success.
2021-03-01 12:07:30 begin to stop DB on "[192.168.7.238]".
waiting for server to shut down.... done
server stopped
2021-03-01 12:07:30 DB on "[192.168.7.238]" stop success.
2021-03-01 12:07:30 Done.
2021-03-01 12:07:30 Ready to start all DB ...
2021-03-01 12:07:30 begin to start DB on "[192.168.7.238]".
waiting for server to start.... done
server started
2021-03-01 12:07:31 execute to start DB on "[192.168.7.238]" success, connect to check it.
2021-03-01 12:07:32 DB on "[192.168.7.238]" start success.
2021-03-01 12:07:32 Try to ping trusted_servers on host 192.168.7.238 ...
2021-03-01 12:07:34 Try to ping trusted_servers on host 192.168.7.239 ...
2021-03-01 12:07:37 begin to start DB on "[192.168.7.239]".
waiting for server to start.... done
server started
2021-03-01 12:07:37 execute to start DB on "[192.168.7.239]" success, connect to check it.
2021-03-01 12:07:38 DB on "[192.168.7.239]" start success.
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                                                                                                
----+---------+---------+-----------+----------+----------+----------+----------+
 1  | node238 | primary | * running |          | default  | 100      | 1        | host=192.168.7.238 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node239 | standby |   running | node238  | default  | 100      | 1        | host=192.168.7.239 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
2021-03-01 12:07:38 The primary DB is started.
2021-03-01 12:07:38 begin to start repmgrd on "[192.168.7.238]".
[2021-03-01 12:07:39] [NOTICE] using provided configuration file "/home/kingbase/cluster/R6HA/kha/kingbase/bin/../etc/repmgr.conf"
[2021-03-01 12:07:39] [NOTICE] redirecting logging output to "/home/kingbase/cluster/R6HA/kha/kingbase/hamgr.log"

2021-03-01 12:07:39 repmgrd on "[192.168.7.238]" start success.
2021-03-01 12:07:39 begin to start repmgrd on "[192.168.7.239]".
[2021-03-01 12:07:35] [NOTICE] using provided configuration file "/home/kingbase/cluster/R6HA/kha/kingbase/bin/../etc/repmgr.conf"
[2021-03-01 12:07:35] [NOTICE] redirecting logging output to "/home/kingbase/cluster/R6HA/kha/kingbase/hamgr.log"

2021-03-01 12:07:40 repmgrd on "[192.168.7.239]" start success.
 ID | Name    | Role    | Status    | Upstream | repmgrd | PID   | Paused? | Upstream last seen
----+---------+---------+-----------+----------+---------+-------+---------+--------------------
 1  | node238 | primary | * running |          | running | 13285 | no      | n/a                
 2  | node239 | standby |   running | node238  | running | 5508  | no      | 0 second(s) ago    
2021-03-01 12:07:44 Done.
=== The output above shows that cluster startup does not include any VIP check. ===

II. Configure the VIP in repmgr.conf (required on every node)

1. Identify the network interface for the VIP

2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:56:02:82 brd ff:ff:ff:ff:ff:ff
    inet 192.168.7.238/24 brd 192.168.7.255 scope global noprefixroute enp0s3
       valid_lft forever preferred_lft forever
   === The VIP must be configured on the same network device as the physical IP. ===

2. Locate the ip and arping executables and check their permissions

Locate the ip and arping executables:

kingbase@uos01:~/cluster/R6HA/kha/kingbase/bin$ which arping
/usr/bin/arping
root@uos01:~# which ip
/usr/sbin/ip

Check the arping version:

kingbase@uos01:~/cluster/R6HA/kha/kingbase/bin$ arping -V
arping utility, iputils-s20180629
kingbase@uos01:~/cluster/R6HA/kha/kingbase/bin$ ls arping
arping
kingbase@uos01:~/cluster/R6HA/kha/kingbase/bin$ ./arping -V
arping utility, iputils-s20200808kb

Set the setuid permission on the ip and arping executables:

root@uos01:~# ls -lh /usr/bin/arping
-rwxr-xr-x 1 root root 27K Jan 14  2020 /usr/bin/arping
root@uos01:~# ls -lh /usr/bin/ip
-rwxr-xr-x 1 root root 575K Jun  4  2021 /usr/bin/ip
root@uos01:~# chmod 4755 /usr/bin/arping
root@uos01:~# chmod 4755 /usr/sbin/ip
root@uos01:~# ls -lh /usr/bin/arping
-rwsr-xr-x 1 root root 27K Jan 14  2020 /usr/bin/arping
root@uos01:~# ls -lh /usr/sbin/ip
lrwxrwxrwx 1 root root 7 Jun  4  2021 /usr/sbin/ip -> /bin/ip
root@uos01:~# ls -lh /bin/ip
-rwsr-xr-x 1 root root 575K Jun  4  2021 /bin/ip

Notes:
1) The ip command is used to add and remove the VIP.
2) The arping command is used to clear and test the ARP cache during a VIP switch.
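
The permission change above can be verified without reading the ls output by eye. This is a small sketch built on the standard `[ -u FILE ]` test, which follows symbolic links, so it also covers /usr/sbin/ip pointing at /bin/ip. The function name check_setuid is illustrative.

```shell
#!/bin/sh
# Print "setuid" if the file has the setuid bit set, "no-setuid" if it
# exists without the bit, and "missing" if it does not exist.
# The -u test dereferences symbolic links.
check_setuid() {
    if [ ! -e "$1" ]; then
        echo "missing"
    elif [ -u "$1" ]; then
        echo "setuid"
    else
        echo "no-setuid"
    fi
}

# Paths taken from the `which` output in this article.
for tool in /usr/bin/arping /usr/sbin/ip; do
    printf '%s: %s\n' "$tool" "$(check_setuid "$tool")"
done
```

Run it on every node, since the setuid change has to be made on each of them.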

3. Edit the repmgr.conf file
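
The listing of the added settings is not reproduced in this article. As a rough sketch only, the lines could be generated as shown here. The parameter names virtual_ip and arping_path are mentioned elsewhere in this article. The names net_device and ipaddr_path are assumptions and should be checked against the KingbaseES documentation for your version. The values are taken from this article: VIP 192.168.7.244/24, device enp0s3, ip found under /usr/sbin, and arping under /usr/bin. The function name vip_settings is illustrative.

```shell
#!/bin/sh
# Print VIP-related lines for repmgr.conf.
# ASSUMPTION: net_device and ipaddr_path are assumed parameter names;
# only virtual_ip and arping_path are mentioned in this article.
vip_settings() {
    printf "virtual_ip='%s'\n"  "$1"   # VIP with prefix length
    printf "net_device='%s'\n"  "$2"   # device carrying the physical IP
    printf "ipaddr_path='%s'\n" "$3"   # directory containing ip
    printf "arping_path='%s'\n" "$4"   # directory containing arping
}

vip_settings 192.168.7.244/24 enp0s3 /usr/sbin /usr/bin
```

Back up repmgr.conf first, then add the resulting lines to the file on every node.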

III. Restart the cluster (using sys_monitor.sh)

kingbase@uos01:~/cluster/R6HA/kha/kingbase/bin$ ./sys_monitor.sh restart
2021-03-01 12:22:39 Ready to stop all DB ...
There is no service "node_export" running currently.
There is no service "postgres_ex" running currently.
There is no service "node_export" running currently.
There is no service "postgres_ex" running currently.
2021-03-01 12:22:42 begin to stop repmgrd on "[192.168.7.238]".
2021-03-01 12:22:43 repmgrd on "[192.168.7.238]" already stopped.
2021-03-01 12:22:43 begin to stop repmgrd on "[192.168.7.239]".
2021-03-01 12:22:43 repmgrd on "[192.168.7.239]" already stopped.
2021-03-01 12:22:43 begin to stop DB on "[192.168.7.239]".
waiting for server to shut down.... done
server stopped
2021-03-01 12:22:44 DB on "[192.168.7.239]" stop success.
2021-03-01 12:22:44 begin to stop DB on "[192.168.7.238]".
waiting for server to shut down.... done
server stopped
2021-03-01 12:22:44 DB on "[192.168.7.238]" stop success.
2021-03-01 12:22:44 Done.
2021-03-01 12:22:44 Ready to start all DB ...
2021-03-01 12:22:44 begin to start DB on "[192.168.7.238]".
waiting for server to start.... done
server started
2021-03-01 12:22:45 execute to start DB on "[192.168.7.238]" success, connect to check it.
2021-03-01 12:22:46 DB on "[192.168.7.238]" start success.
2021-03-01 12:22:46 Try to ping trusted_servers on host 192.168.7.238 ...
2021-03-01 12:22:48 Try to ping trusted_servers on host 192.168.7.239 ...
2021-03-01 12:22:51 begin to start DB on "[192.168.7.239]".
waiting for server to start.... done
server started
2021-03-01 12:22:51 execute to start DB on "[192.168.7.239]" success, connect to check it.
2021-03-01 12:22:52 DB on "[192.168.7.239]" start success.
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                                                                                                
----+---------+---------+-----------+----------+----------+----------+----------+
 1  | node238 | primary | * running |          | default  | 100      | 1        | host=192.168.7.238 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node239 | standby |   running | node238  | default  | 100      | 1        | host=192.168.7.239 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
2021-03-01 12:22:53 The primary DB is started.
2021-03-01 12:22:57 Success to load virtual ip [192.168.7.244/24] on primary host [192.168.7.238].
2021-03-01 12:22:57 Try to ping vip on host 192.168.7.238 ...
2021-03-01 12:22:59 Try to ping vip on host 192.168.7.239 ...
2021-03-01 12:23:02 begin to start repmgrd on "[192.168.7.238]".
[2021-03-01 12:23:02] [NOTICE] using provided configuration file "/home/kingbase/cluster/R6HA/kha/kingbase/bin/../etc/repmgr.conf"
[2021-03-01 12:23:02] [NOTICE] redirecting logging output to "/home/kingbase/cluster/R6HA/kha/kingbase/hamgr.log"

2021-03-01 12:23:02 repmgrd on "[192.168.7.238]" start success.
2021-03-01 12:23:02 begin to start repmgrd on "[192.168.7.239]".
[2021-03-01 12:22:58] [NOTICE] using provided configuration file "/home/kingbase/cluster/R6HA/kha/kingbase/bin/../etc/repmgr.conf"
[2021-03-01 12:22:58] [NOTICE] redirecting logging output to "/home/kingbase/cluster/R6HA/kha/kingbase/hamgr.log"

2021-03-01 12:23:03 repmgrd on "[192.168.7.239]" start success.
 ID | Name    | Role    | Status    | Upstream | repmgrd | PID   | Paused? | Upstream last seen
----+---------+---------+-----------+----------+---------+-------+---------+--------------------
 1  | node238 | primary | * running |          | running | 15043 | no      | n/a                
 2  | node239 | standby |   running | node238  | running | 6440  | no      | n/a                
2021-03-01 12:23:07 Done.

=== The output above shows that after the restart the cluster loads the VIP [192.168.7.244/24]. ===

IV. Verify the cluster status

1. Check that the VIP is loaded

2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:56:02:82 brd ff:ff:ff:ff:ff:ff
    inet 192.168.7.238/24 brd 192.168.7.255 scope global noprefixroute enp0s3
       valid_lft forever preferred_lft forever
    inet 192.168.7.244/24 scope global secondary enp0s3:3
       valid_lft forever preferred_lft forever
 === The output above shows that the VIP was loaded successfully on the primary node. ===
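
The same check can be scripted. This sketch reads `ip addr show` output on standard input and reports which interface label carries the VIP. The function name vip_label is illustrative, and the sample lines are copied from the output above.

```shell
#!/bin/sh
# Read `ip addr show` output on stdin. Print the interface label of the
# line whose address equals the given VIP, or "absent" if none matches.
vip_label() {
    awk -v vip="$1" '
        $1 == "inet" && $2 == vip { print $NF; found = 1 }
        END { if (!found) print "absent" }
    '
}

# Sample input taken from the primary node in this article.
vip_label 192.168.7.244/24 <<'EOF'
    inet 192.168.7.238/24 brd 192.168.7.255 scope global noprefixroute enp0s3
    inet 192.168.7.244/24 scope global secondary enp0s3:3
EOF
# prints: enp0s3:3
```

On a live node the call would be `ip addr show | vip_label 192.168.7.244/24`.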

2. Check cluster node status

kingbase@uos01:~/cluster/R6HA/kha/kingbase/bin$ ./repmgr cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                                                                                                
----+---------+---------+-----------+----------+----------+----------+----------+
 1  | node238 | primary | * running |          | default  | 100      | 1        | host=192.168.7.238 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node239 | standby |   running | node238  | default  | 100      | 1        | host=192.168.7.239 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

3. Connect to the database through the VIP and check streaming replication

kingbase@uos01:~/cluster/R6HA/kha/kingbase/bin$ ./ksql -h 192.168.7.244 -U system test
ksql (V8.0)
Type "help" for help.

test=# select * from sys_stat_replication;
  pid  | usesysid | usename | application_name |  client_addr  | client_hostname | client_port |         backend_s
tart         | backend_xmin |   state   | sent_lsn  | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag |
 replay_lag | sync_priority | sync_state |          reply_time           
-------+----------+---------+------------------+---------------+-----------------+
 14935 |    16384 | esrep   | node239          | 192.168.7.239 |                 |       58172 | 2021-03-01 12:22:
51.831920+08 |              | streaming | 0/6000670 | 0/6000670 | 0/6000670 | 0/6000670  |           |           |
            |             1 | quorum     | 2021-03-01 12:24:30.751707+08
(1 row)

V. Primary/standby switchover test

1. Cluster node status before the switchover

 kingbase@uos02:~/cluster/R6HA/kha/kingbase/bin$ ./repmgr cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                                                                                                
----+---------+---------+-----------+----------+----------+----------+----------+
 1  | node238 | primary | * running |          | default  | 100      | 1        | host=192.168.7.238 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node239 | standby |   running | node238  | default  | 100      | 1        | host=192.168.7.239 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

2. Run the switchover

kingbase@uos02:~/cluster/R6HA/kha/kingbase/bin$ ./repmgr standby switchover --siblings-follow
NOTICE: executing switchover on node "node239" (ID: 2)
WARNING: option "--sibling-nodes" specified, but no sibling nodes exist
INFO: pausing repmgrd on node "node238" (ID 1)
INFO: pausing repmgrd on node "node239" (ID 2)
NOTICE: local node "node239" (ID: 2) will be promoted to primary; current primary "node238" (ID: 1) will be demoted to standby
NOTICE: stopping current primary node "node238" (ID: 1)
NOTICE: issuing CHECKPOINT
NOTICE: node (ID: 1) release the virtual ip 192.168.7.244/24 success
DETAIL: executing server command "/home/kingbase/cluster/R6HA/kha/kingbase/bin/sys_ctl  -D '/home/kingbase/cluster/R6HA/kha/kingbase/data' -l /home/kingbase/cluster/R6HA/kha/kingbase/bin/logfile -W -m fast stop"
INFO: checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout")
INFO: checking for primary shutdown; 2 of 60 attempts ("shutdown_check_timeout")
NOTICE: current primary has been cleanly shut down at location 0/7000028
NOTICE: PING 192.168.7.244 (192.168.7.244) 56(84) bytes of data.

--- 192.168.7.244 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 3ms


WARNING: ping host"192.168.7.244" failed
DETAIL: average RTT value is not greater than zero
NOTICE: new primary node (ID: 2) acquire the virtual ip 192.168.7.244/24 success
NOTICE: promoting standby to primary
DETAIL: promoting server "node239" (ID: 2) using sys_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "node239" (ID: 2) was successfully promoted to primary
NOTICE: issuing CHECKPOINT
INFO: local node 1 can attach to rejoin target node 2
DETAIL: local node's recovery point: 0/7000028; rejoin target node's fork point: 0/70000A0
NOTICE: setting node 1's upstream to node 2
WARNING: unable to ping "host=192.168.7.238 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: begin to start server at 2021-03-01 12:29:42.971664
NOTICE: starting server using "/home/kingbase/cluster/R6HA/kha/kingbase/bin/sys_ctl  -w -t 90 -D '/home/kingbase/cluster/R6HA/kha/kingbase/data' -l /home/kingbase/cluster/R6HA/kha/kingbase/bin/logfile start"
NOTICE: start server finish at 2021-03-01 12:29:43.087104
NOTICE: replication slot "repmgr_slot_2" deleted on node 1
NOTICE: NODE REJOIN successful
DETAIL: node 1 is now attached to node 2
NOTICE: switchover was successful
DETAIL: node "node239" is now primary and node "node238" is attached as standby
INFO: unpausing repmgrd on node "node238" (ID 1)
INFO: unpause node "node238" (ID 1) successfully
INFO: unpausing repmgrd on node "node239" (ID 2)
INFO: unpause node "node239" (ID 2) successfully
NOTICE: STANDBY SWITCHOVER has completed successfully

3. Check where the VIP is loaded after the switchover

kingbase@uos02:~/cluster/R6HA/kha/kingbase/bin$ ip add sh
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:c9:c0:27 brd ff:ff:ff:ff:ff:ff
    inet 192.168.7.239/24 brd 192.168.7.255 scope global noprefixroute enp0s3
       valid_lft forever preferred_lft forever
    inet 192.168.7.244/24 scope global secondary enp0s3:3
       valid_lft forever preferred_lft forever
=== The output above shows that the VIP is now loaded on the new primary. ===

4. Check node status after the switchover (the cluster is healthy)

kingbase@uos02:~/cluster/R6HA/kha/kingbase/bin$ ./repmgr cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                                                                                                
----+---------+---------+-----------+----------+----------+----------+----------+
 1  | node238 | standby |   running | node239  | default  | 100      | 1        | host=192.168.7.238 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node239 | primary | * running |          | default  | 100      | 2        | host=192.168.7.239 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

5. Check the VIP on the former primary (it has been removed)

2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
   link/ether 08:00:27:56:02:82 brd ff:ff:ff:ff:ff:ff
   inet 192.168.7.238/24 brd 192.168.7.255 scope global noprefixroute enp0s3
      valid_lft forever preferred_lft forever

VI. Cluster failover test

1. List the primary's database processes, then stop the database

List the database processes:

kingbase@uos02:~/cluster/R6HA/kha/kingbase/bin$ ps -ef |grep kingbase

kingbase  6403     1  0 12:22 ?        00:00:00 /home/kingbase/cluster/R6HA/kha/kingbase/bin/kingbase -D /home/kingbase/cluster/R6HA/kha/kingbase/data
kingbase  6404  6403  0 12:22 ?        00:00:00 kingbase: logger   
kingbase  6406  6403  0 12:22 ?        00:00:00 kingbase: checkpointer   
kingbase  6407  6403  0 12:22 ?        00:00:00 kingbase: background writer   
kingbase  6408  6403  0 12:22 ?        00:00:00 kingbase: stats collector   
kingbase  6438  6403  0 12:22 ?        00:00:00 kingbase: esrep esrep 192.168.7.239(26210) idle
kingbase  6440     1  0 12:22 ?        00:00:04 /home/kingbase/cluster/R6HA/kha/kingbase/bin/repmgrd -d -v -f /home/kingbase/cluster/R6HA/kha/kingbase/bin/../etc/repmgr.conf
kingbase  6478     1  0 12:23 ?        00:00:00 /home/kingbase/cluster/R6HA/kha/kingbase/bin/kbha -A daemon -f /home/kingbase/cluster/R6HA/kha/kingbase/bin/../etc/repmgr.conf
kingbase  6514     1  0 12:23 ?        00:00:00 /home/kingbase/cluster/R6HA/kha/kingbase/bin/../share/node_exporter
kingbase  6515     1  0 12:23 ?        00:00:00 /home/kingbase/cluster/R6HA/kha/kingbase/bin/../share/postgres_exporter
kingbase  6520  6403  0 12:23 ?        00:00:00 kingbase: system test ::1(19532) idle
root      6870  2168  0 12:29 pts/1    00:00:00 su - kingbase
kingbase  6871  6870  0 12:29 pts/1    00:00:00 -bash
kingbase  6934  6403  0 12:29 ?        00:00:00 kingbase: walwriter   
kingbase  6935  6403  0 12:29 ?        00:00:00 kingbase: autovacuum launcher   
kingbase  6936  6403  0 12:29 ?        00:00:00 kingbase: archiver   last was 000000010000000000000007.partial
kingbase  6937  6403  0 12:29 ?        00:00:00 kingbase: ksh writer   
kingbase  6938  6403  0 12:29 ?        00:00:00 kingbase: ksh collector   
kingbase  6939  6403  0 12:29 ?        00:00:00 kingbase: sys_kwr collector   
kingbase  6940  6403  0 12:29 ?        00:00:00 kingbase: logical replication launcher   
kingbase  6950  6403  0 12:29 ?        00:00:00 kingbase: walsender esrep 192.168.7.238(57878) streaming 0/7001AD8
kingbase  6960  6403  0 12:29 ?        00:00:00 kingbase: esrep esrep 192.168.7.238(57890) idle
kingbase  7422  6478  0 12:35 ?        00:00:00 ping -q -c3 -w2 192.168.7.1

# Stop the database service on the primary

kingbase@uos02:~/cluster/R6HA/kha/kingbase/bin$ ./sys_ctl stop -D ../data
waiting for server to shut down.... done
server stopped

2. Check cluster node status after the failover

kingbase@uos01:~/cluster/R6HA/kha/kingbase/bin$ ./repmgr cluster show
 ID | Name    | Role    | Status               | Upstream  | Location | Priority | Timeline | Connection string                                                                                                                                
----+---------+---------+----------------------+-----------+----------+----------+
 1  | node238 | standby | ! running as primary | ? node239 | default  | 100      | 3        | host=192.168.7.238 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node239 | primary | ? unreachable        |           | default  | 100      | ?        | host=192.168.7.239 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

WARNING: following issues were detected
  - node "node238" (ID: 1) is registered as standby but running as primary
  - unable to connect to node "node238" (ID: 1)'s upstream node "node239" (ID: 2)
  - unable to determine if node "node238" (ID: 1) is attached to its upstream node "node239" (ID: 2)
  - unable to connect to node "node239" (ID: 2)
  - node "node239" (ID: 2) is registered as an active primary but is unreachable

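=== The output above shows that after the primary's database service went down, a failover took place: the former standby was promoted to primary, and the former primary is now reported as "unreachable". ===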

3. Start the database service on the former primary


 kingbase@uos02:~/cluster/R6HA/kha/kingbase/bin$ ./sys_ctl start -D ../data
 waiting for server to start....2021-03-01 12:38:22.270 CST [7554] LOG:  sepapower extension initialized
 2021-03-01 12:38:22.272 CST [7554] LOG:  starting KingbaseES V008R006C003B0010 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-46), 64-bit
 2021-03-01 12:38:22.273 CST [7554] LOG:  listening on IPv4 address "0.0.0.0", port 54321
2021-03-01 12:38:22.274 CST [7554] LOG:  listening on IPv6 address "::", port 54321
2021-03-01 12:38:22.306 CST [7554] LOG:  listening on Unix socket "/tmp/.s.KINGBASE.54321"
2021-03-01 12:38:22.348 CST [7554] LOG:  redirecting log output to logging collector process
2021-03-01 12:38:22.348 CST [7554] HINT:  Future log output will appear in directory "sys_log".
 done
 server started

4. Check cluster node status

kingbase@uos01:~/cluster/R6HA/kha/kingbase/bin$ ./repmgr cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                                                                                                
----+---------+---------+-----------+----------+----------+----------+----------+
 1  | node238 | primary | * running |          | default  | 100      | 3        | host=192.168.7.238 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node239 | primary | ! running |          | default  | 100      | 2        | host=192.168.7.239 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

WARNING: following issues were detected
  - node "node239" (ID: 2) is running but the repmgr node record is inactive

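=== The output above shows that the restarted former primary does not rejoin the cluster as a standby on its own, which is consistent with recovery='manual' in repmgr.conf. The cluster is now in a "dual-primary" state and needs manual intervention: the former primary must be rejoined to the cluster. ===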
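
A dual-primary state like this can be spotted by counting the rows whose Role column reads primary. This sketch parses `repmgr cluster show` output from standard input. The function name count_primaries is illustrative, and the sample rows are abbreviated to the first columns of the output above.

```shell
#!/bin/sh
# Read `repmgr cluster show` output on stdin and print the number of
# rows whose third column (Role) is "primary".
count_primaries() {
    awk -F'|' '
        { role = $3; gsub(/ /, "", role) }
        role == "primary" { n++ }
        END { print n + 0 }
    '
}

# Rows abbreviated from the output above (first columns only).
count_primaries <<'EOF'
 ID | Name    | Role    | Status    | Upstream
 1  | node238 | primary | * running |
 2  | node239 | primary | ! running |
EOF
# prints: 2
```

A result greater than 1 means more than one node is registered as primary. On a live node the call would be `./repmgr cluster show | count_primaries`.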

5. Rejoin the former primary to the cluster

kingbase@uos02:~/cluster/R6HA/kha/kingbase/bin$ ./repmgr node rejoin -h 192.168.7.238 -U esrep -d esrep --force-rewind

NOTICE: sys_rewind execution required for this node to attach to rejoin target node 1
DETAIL: rejoin target server's timeline 3 forked off current database system timeline 2 before current recovery point 0/9000028
NOTICE: executing sys_rewind
DETAIL: sys_rewind command is "/home/kingbase/cluster/R6HA/kha/kingbase/bin/sys_rewind -D '/home/kingbase/cluster/R6HA/kha/kingbase/data' --source-server='host=192.168.7.238 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3'"
sys_rewind: servers diverged at WAL location 0/80000A0 on timeline 2
sys_rewind: rewinding from last common checkpoint at 0/7001A28 on timeline 2
sys_rewind: find last common checkpoint start time from 2021-03-01 12:40:10.934396 CST to 2021-03-01 12:40:10.957023 CST, in "0.022627" seconds.
sys_rewind: update the control file: minRecoveryPoint is '0/802A840', minRecoveryPointTLI is '3', and database state is 'in archive recovery'
sys_rewind: we will remove the dir '/home/kingbase/cluster/R6HA/kha/kingbase/data/sys_replslot/repmgr_slot_1.rewind' and all the file/dir in it.
sys_rewind: rewind start wal location 0/70019F0 (file 000000020000000000000007), end wal location 0/802A840 (file 000000030000000000000008). time from 2021-03-01 12:40:10.934396 CST to 2021-03-01 12:40:13.660121 CST, in "2.725725" seconds.
sys_rewind: Done!
NOTICE: 0 files copied to /home/kingbase/cluster/R6HA/kha/kingbase/data
NOTICE: setting node 2's upstream to node 1
WARNING: unable to ping "host=192.168.7.239 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: begin to start server at 2021-03-01 12:40:13.720043
NOTICE: starting server using "/home/kingbase/cluster/R6HA/kha/kingbase/bin/sys_ctl  -w -t 90 -D '/home/kingbase/cluster/R6HA/kha/kingbase/data' -l /home/kingbase/cluster/R6HA/kha/kingbase/bin/logfile start"
NOTICE: start server finish at 2021-03-01 12:40:14.273627
NOTICE: NODE REJOIN successful
DETAIL: node 2 is now attached to node 1

6. Check cluster node status

kingbase@uos01:~/cluster/R6HA/kha/kingbase/bin$ ./repmgr cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                                                                                                
----+---------+---------+-----------+----------+----------+----------+----------+
 1  | node238 | primary | * running |          | default  | 100      | 3        | host=192.168.7.238 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node239 | standby |   running | node238  | default  | 100      | 2        | host=192.168.7.239 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

=== The output above shows that the cluster node status is back to normal. ===

7. Check where the VIP is loaded after the failover

2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:56:02:82 brd ff:ff:ff:ff:ff:ff
    inet 192.168.7.238/24 brd 192.168.7.255 scope global noprefixroute enp0s3
       valid_lft forever preferred_lft forever
    inet 192.168.7.244/24 scope global secondary enp0s3:3
       valid_lft forever preferred_lft forever
=== The output above shows that the VIP is now loaded on the new primary. ===

8. Check primary/standby streaming replication status

kingbase@uos01:~/cluster/R6HA/kha/kingbase/bin$ ./ksql -U system test
ksql (V8.0)
Type "help" for help.

test=# select * from sys_stat_replication;
  pid  | usesysid | usename | application_name |  client_addr  | client_hostname | client_port |         backend_s
tart         | backend_xmin |   state   | sent_lsn  | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag |
 replay_lag | sync_priority | sync_state |          reply_time           
-------+----------+---------+------------------+---------------+-----------------+
 16412 |    16384 | esrep   | node239          | 192.168.7.239 |                 |       59958 | 2021-03-01 12:40:
19.069026+08 |              | streaming | 0/802B900 | 0/802B900 | 0/802B900 | 0/802B900  |           |           |
            |             1 | quorum     | 2021-03-01 12:42:08.616363+08
(1 row)     

VII. Errors encountered during configuration

kingbase@uos01:~/cluster/R6HA/kha/kingbase/bin$ ./sys_monitor.sh restart

the dir "/sbin" has no execute file "arping", please set [arping_path] in /home/kingbase/cluster/R6HA/kha/kingbase/bin/../etc/repmgr.conf

kingbase@uos01:~/cluster/R6HA/kha/kingbase/bin$ ./sys_monitor.sh restart
2021-03-01 12:19:27 Ready to stop all DB ...
Service process "node_export" was killed at process 13382
Service process "postgres_ex" was killed at process 13383
Service process "node_export" was killed at process 5575
Service process "postgres_ex" was killed at process 5576
2021-03-01 12:19:31 begin to stop repmgrd on "[192.168.7.238]".
2021-03-01 12:19:31 repmgrd on "[192.168.7.238]" stop success.
2021-03-01 12:19:31 begin to stop repmgrd on "[192.168.7.239]".
2021-03-01 12:19:32 repmgrd on "[192.168.7.239]" stop success.
2021-03-01 12:19:32 begin to stop DB on "[192.168.7.239]".
incorrect command permissions for the virtual ip.
waiting for server to shut down.... done
server stopped
2021-03-01 12:19:33 DB on "[192.168.7.239]" stop success.
2021-03-01 12:19:33 begin to stop DB on "[192.168.7.238]".
incorrect command permissions for the virtual ip.
waiting for server to shut down.... done
server stopped
2021-03-01 12:19:33 DB on "[192.168.7.238]" stop success.
2021-03-01 12:19:33 Done.
2021-03-01 12:19:33 Ready to start all DB ...
2021-03-01 12:19:33 begin to start DB on "[192.168.7.238]".
incorrect command permissions for the virtual ip.
waiting for server to start.... done
server started
2021-03-01 12:19:34 execute to start DB on "[192.168.7.238]" success, connect to check it.
2021-03-01 12:19:35 DB on "[192.168.7.238]" start success.
2021-03-01 12:19:35 Try to ping trusted_servers on host 192.168.7.238 ...
2021-03-01 12:19:37 Try to ping trusted_servers on host 192.168.7.239 ...
2021-03-01 12:19:40 begin to start DB on "[192.168.7.239]".
incorrect command permissions for the virtual ip.
waiting for server to start.... done
server started
2021-03-01 12:19:40 execute to start DB on "[192.168.7.239]" success, connect to check it.
2021-03-01 12:19:41 DB on "[192.168.7.239]" start success.
ERROR: No execute permission for "/usr/sbin/ip"
incorrect command permissions for the virtual ip.
2021-03-01 12:19:42 There is no primary DB running, will do nothing and exit.

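=== These errors point to two problems: the path of the arping executable was not set in the configuration file, and the ip and arping executables did not have the setuid permission. ===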
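
The first error reports that the directory "/sbin" holds no arping executable. Before restarting the cluster, the directory intended for arping_path can be checked with a sketch like this one. The function name tool_in_dir is illustrative, and the two directories tried are the one named in the error message and the one reported by `which arping` in this article.

```shell
#!/bin/sh
# Print "ok" if DIR contains an executable regular file named TOOL,
# otherwise print "not found".
tool_in_dir() {
    if [ -f "$1/$2" ] && [ -x "$1/$2" ]; then
        echo "ok"
    else
        echo "not found"
    fi
}

for dir in /sbin /usr/bin; do
    printf '%s/arping: %s\n' "$dir" "$(tool_in_dir "$dir" arping)"
done
```

This covers only the path problem. The setuid permission on ip and arping still has to be verified separately on each node.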