PostgreSQL high availability with repmgr, part 5: manual failover and node rejoin with 1 primary + 1 standby

OS: Ubuntu 16.04
PostgreSQL: 9.6.8
repmgr: 4.1.1

192.168.56.101 node1
192.168.56.102 node2

Contents of /etc/repmgr.conf before the operation

This is the file on node1; the file on node2 is similar.

$ cat /etc/repmgr.conf 

node_id=1
node_name=node1
conninfo='host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2'
data_directory='/var/lib/postgresql/9.6/main'
use_replication_slots=true
pg_bindir='/usr/lib/postgresql/9.6/bin'
service_start_command   = 'sudo pg_ctlcluster 9.6 main start'
service_stop_command    = 'sudo pg_ctlcluster 9.6 main stop'
service_restart_command = 'sudo pg_ctlcluster 9.6 main restart'
service_reload_command  = 'sudo pg_ctlcluster 9.6 main reload' 
service_promote_command  = 'sudo pg_ctlcluster 9.6 main promote'
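
Since node2's file is only described as "similar", here is a minimal sketch of how it could be derived from node1's. It assumes that only node_id, node_name and the host in conninfo differ between the two nodes, which matches the IDs and connection strings that cluster show reports for node2 further down. The function name make_node2_conf is made up for illustration.

```shell
# Hypothetical helper: read node1's repmgr.conf on stdin and print a node2 variant.
# Assumption: only node_id, node_name and the conninfo host differ between nodes.
make_node2_conf() {
  sed -e 's/^node_id=1$/node_id=2/' \
      -e 's/^node_name=node1$/node_name=node2/' \
      -e '/^conninfo=/s/host=192\.168\.56\.101/host=192.168.56.102/'
}

# Example (run on node1, then review the result before copying it to node2):
#   make_node2_conf < /etc/repmgr.conf > /tmp/repmgr.node2.conf
```

All other lines (data_directory, pg_bindir, the service_*_command entries) pass through unchanged.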

Manually shut down the primary to simulate a failure

On node1:

$ pg_ctl -D /var/lib/postgresql/9.6/main -m fast stop
or
$ sudo pg_ctlcluster 9.6 main stop

$ repmgr -f /etc/repmgr.conf cluster show
ERROR: connection to database failed:
  could not connect to server: Connection refused
	Is the server running on host "192.168.56.101" and accepting
	TCP/IP connections on port 5432?

DETAIL: attempted to connect using:
  user=repmgr connect_timeout=2 dbname=repmgr host=192.168.56.101 fallback_application_name=repmgr
  

On node2:

$ repmgr -f /etc/repmgr.conf cluster show

 ID | Name  | Role    | Status        | Upstream | Location | Connection string                                              
----+-------+---------+---------------+----------+----------+-----------------------------------------------------------------
 1  | node1 | primary | ? unreachable |          | default  | host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | standby |   running     | node1    | default  | host=192.168.56.102 user=repmgr dbname=repmgr connect_timeout=2

WARNING: following issues were detected
  - when attempting to connect to node "node1" (ID: 1), following error encountered :
"could not connect to server: Connection refused
	Is the server running on host "192.168.56.101" and accepting
	TCP/IP connections on port 5432?"
  - node "node1" (ID: 1) is registered as an active primary but is unreachable
 

The Status column shows node1 as unreachable.
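
If this check needs to be scripted rather than read by eye, the table can be filtered on its Status column. The following is a rough sketch that relies only on the pipe-separated layout shown above; the function name unreachable_nodes is hypothetical, and the column positions are an assumption tied to this repmgr version's output.

```shell
# Hypothetical helper: read "repmgr cluster show" output on stdin and print
# the names of nodes whose Status column reports "unreachable" or "failed".
unreachable_nodes() {
  awk -F'|' '
    # Data rows have at least 4 pipe-separated columns and a numeric ID first;
    # the header, the separator line and the WARNING text do not match.
    NF >= 4 && $1 ~ /^ *[0-9]+ *$/ && $4 ~ /unreachable|failed/ {
      gsub(/ /, "", $2)   # strip the padding around the node name
      print $2
    }'
}

# Example on node2:
#   repmgr -f /etc/repmgr.conf cluster show | unreachable_nodes
```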

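Promote the standby to primary

PostgreSQL on node1 is now unavailable (the same situation arises after a manual shutdown, a crashed process, or a host outage), so the standby on node2 has to be promoted to primary. On node2: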

$ repmgr -f /etc/repmgr.conf standby promote

NOTICE: promoting standby to primary
DETAIL: promoting server "node2" (ID: 2) using "sudo pg_ctlcluster 9.6 main promote"
DETAIL: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "node2" (ID: 2) was successfully promoted to primary

Check the cluster again on node2:

$ repmgr -f /etc/repmgr.conf cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Connection string                                              
----+-------+---------+-----------+----------+----------+-----------------------------------------------------------------
 1  | node1 | primary | - failed  |          | default  | host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | primary | * running |          | default  | host=192.168.56.102 user=repmgr dbname=repmgr connect_timeout=2

WARNING: following issues were detected
  - when attempting to connect to node "node1" (ID: 1), following error encountered :
"could not connect to server: Connection refused
	Is the server running on host "192.168.56.101" and accepting
	TCP/IP connections on port 5432?"
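
To double-check the promotion independently of repmgr's metadata, PostgreSQL itself can be asked whether it is still in recovery. The sketch below maps the result of pg_is_in_recovery() to a role name. The function name role_from_recovery_flag is hypothetical, and the usage line assumes psql can connect as the repmgr user without a password prompt.

```shell
# Hypothetical helper: map the output of "SELECT pg_is_in_recovery()"
# (t or f, as printed by psql -At) to a role name.
role_from_recovery_flag() {
  case "$1" in
    f) echo primary ;;            # not in recovery: accepts writes
    t) echo standby ;;            # in recovery: still replaying WAL
    *) echo unknown; return 1 ;;  # empty or unexpected value
  esac
}

# Example on node2 after the promotion:
#   flag=$(psql -h 192.168.56.102 -U repmgr -d repmgr -Atc 'SELECT pg_is_in_recovery()')
#   role_from_recovery_flag "$flag"
```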

Make node1 the new standby

On node1, start PostgreSQL:

# /etc/init.d/postgresql start
$ repmgr -f /etc/repmgr.conf cluster show
 ID | Name  | Role    | Status               | Upstream | Location | Connection string                                              
----+-------+---------+----------------------+----------+----------+-----------------------------------------------------------------
 1  | node1 | primary | * running            |          | default  | host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | standby | ! running as primary | node1    | default  | host=192.168.56.102 user=repmgr dbname=repmgr connect_timeout=2

WARNING: following issues were detected
  - node "node2" (ID: 2) is registered as standby but running as primary
  

On node2:

$ repmgr -f /etc/repmgr.conf cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Connection string                                            
----+-------+---------+-----------+----------+----------+-----------------------------------------------------------------
 1  | node1 | primary | ! running |          | default  | host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | primary | * running |          | default  | host=192.168.56.102 user=repmgr dbname=repmgr connect_timeout=2

WARNING: following issues were detected
  - node "node1" (ID: 1) is running but the repmgr node record is inactive

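Here is the problem: cluster show now reports a WARNING on both node1 and node2, because node1 came back up as a primary while node2 has already been promoted. Next, PostgreSQL on node1 has to be pointed at the new primary.

On node1, stop PostgreSQL: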

$ sudo pg_ctlcluster 9.6 main stop

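Use repmgr node rejoin to add node1 back to the cluster; it can optionally use pg_rewind. From the repmgr documentation: "This can optionally use pg_rewind to re-integrate a node which has diverged from the rest of the cluster, typically a failed primary." Here -d is given node2's conninfo, and the command is first run with --dry-run, which only reports what would be done.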

$ repmgr -f /etc/repmgr.conf node rejoin -d 'host=192.168.56.102 user=repmgr dbname=repmgr connect_timeout=2' --force-rewind --dry-run --verbose

NOTICE: using provided configuration file "/etc/repmgr.conf"
INFO: prerequisites for using pg_rewind are met
INFO: 0 files would have been copied to "/tmp/repmgr-config-archive-pgsql96"
INFO: temporary archive directory "/tmp/repmgr-config-archive-pgsql96" deleted
INFO: pg_rewind would now be executed
DETAIL: pg_rewind command is:
  /usr/lib/postgresql/9.6/bin/pg_rewind -D '/var/lib/postgresql/9.6/main' --source-server='host=192.168.56.102 user=repmgr dbname=repmgr connect_timeout=2'
INFO: prerequisites for executing NODE REJOIN are met

The dry run reports that the prerequisites are met, so run the same command without --dry-run:

$ repmgr -f /etc/repmgr.conf node rejoin -d 'host=192.168.56.102 user=repmgr dbname=repmgr connect_timeout=2' --force-rewind --verbose

NOTICE: using provided configuration file "/etc/repmgr.conf"
INFO: prerequisites for using pg_rewind are met
INFO: 0 files copied to "/tmp/repmgr-config-archive-pgsql96"
NOTICE: executing pg_rewind
NOTICE: 0 files copied to /var/lib/postgresql/9.6/main
INFO: directory "/tmp/repmgr-config-archive-pgsql96" deleted
INFO: deleting "recovery.done"
NOTICE: setting node 1's primary to node 2
NOTICE: starting server using "sudo pg_ctlcluster 9.6 main start"
INFO: demoted primary is pingable
INFO: node 1 has attached to its upstream node
NOTICE: NODE REJOIN successful
DETAIL: node 1 is now attached to node 2
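
The dry-run-then-rejoin sequence above could be wrapped so that the real rejoin only happens when the dry run succeeds. This is a sketch under assumptions: the function name rejoin_with_dry_run and the REPMGR_CONF variable are made up for illustration, and it assumes repmgr exits with a non-zero status when the dry run finds a problem. Only the flags already used above are passed.

```shell
# Hypothetical wrapper: run "node rejoin" as a dry run first and perform the
# real rejoin only if the dry run exits successfully.
# $1 = conninfo of the current primary; REPMGR_CONF is an assumed override.
rejoin_with_dry_run() {
  upstream_conninfo=$1
  conf=${REPMGR_CONF:-/etc/repmgr.conf}

  # Assumption: a failed prerequisite check yields a non-zero exit status.
  if ! repmgr -f "$conf" node rejoin -d "$upstream_conninfo" --force-rewind --dry-run; then
    echo "dry run failed, not rejoining" >&2
    return 1
  fi
  repmgr -f "$conf" node rejoin -d "$upstream_conninfo" --force-rewind
}

# Example on node1, with PostgreSQL stopped:
#   rejoin_with_dry_run 'host=192.168.56.102 user=repmgr dbname=repmgr connect_timeout=2'
```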

node1 has rejoined the cluster as a standby of node2, as expected.
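
One way to confirm the result from the new primary's side is to look for node1 in pg_stat_replication on node2. The sketch below checks psql output for a streaming connection from a given address. The function name is_streaming is hypothetical, and the usage line assumes the repmgr user can connect without a password prompt and is allowed to see the replication connection's details.

```shell
# Hypothetical helper: read "client_addr|state" rows (psql -At output) on stdin
# and succeed only if the given address appears with state "streaming".
is_streaming() {
  awk -F'|' -v addr="$1" '
    $1 == addr && $2 == "streaming" { found = 1 }
    END { exit (found ? 0 : 1) }'
}

# Example on node2:
#   psql -h 192.168.56.102 -U repmgr -d repmgr -Atc \
#     'SELECT client_addr, state FROM pg_stat_replication' | is_streaming 192.168.56.101
```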