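Resolving a Repmgr Cluster "Dual Primary" Failure: A Case Study
阿新 • Published: 2021-06-29
In production, a cluster can suffer split-brain, which leaves two primary nodes running at the same time. When that happens, an operator has to step in and decide manually which node holds the most recent data, so that data loss is kept to a minimum.
I. Test Environment Information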
Operating system:
[kingbase@node1 bin]$ cat /etc/centos-release
CentOS Linux release 7.2.1511 (Core)
Database:
[kingbase@node1 bin]$ ./ksql -U system test
ksql (V8.0)
Type "help" for help.

test=# select version();
 version
----------------------------------------------------------------------------------------
 KingbaseES V008R006C003B0010 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-46), 64-bit
(1 row)
II. "Dual Primary" Fault After Cluster Startup
1. Fault symptoms
[kingbase@node1 bin]$ ./sys_monitor.sh restart
2021-03-01 13:30:03 Ready to stop all DB ...
Service process "node_export" was killed at process 8253
Service process "postgres_ex" was killed at process 8254
Service process "node_export" was killed at process 8131
Service process "postgres_ex" was killed at process 8132
2021-03-01 13:30:09 begin to stop repmgrd on "[192.168.7.248]".
2021-03-01 13:30:10 repmgrd on "[192.168.7.248]" stop success.
2021-03-01 13:30:10 begin to stop repmgrd on "[192.168.7.249]".
2021-03-01 13:30:11 repmgrd on "[192.168.7.249]" stop success.
2021-03-01 13:30:11 begin to stop DB on "[192.168.7.249]".
waiting for server to shut down..... done
server stopped
2021-03-01 13:30:13 DB on "[192.168.7.249]" stop success.
2021-03-01 13:30:13 begin to stop DB on "[192.168.7.248]".
waiting for server to shut down.... done
server stopped
2021-03-01 13:30:14 DB on "[192.168.7.248]" stop success.
2021-03-01 13:30:14 Done.
2021-03-01 13:30:14 Ready to start all DB ...
2021-03-01 13:30:14 begin to start DB on "[192.168.7.248]".
waiting for server to start.... done
server started
2021-03-01 13:30:16 execute to start DB on "[192.168.7.248]" success, connect to check it.
2021-03-01 13:30:17 DB on "[192.168.7.248]" start success.
2021-03-01 13:30:17 Try to ping trusted_servers on host 192.168.7.248 ...
2021-03-01 13:30:19 Try to ping trusted_servers on host 192.168.7.249 ...
2021-03-01 13:30:22 begin to start DB on "[192.168.7.249]".
waiting for server to start.... done
server started
2021-03-01 13:30:23 execute to start DB on "[192.168.7.249]" success, connect to check it.
2021-03-01 13:30:24 DB on "[192.168.7.249]" start success.
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+----------+----------+----------+----------+-------
 1  | node248 | primary | * running |          | default  | 100      | 5        | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node249 | primary | ! running |          | default  | 100      | 4        | host=192.168.7.249 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

WARNING: following issues were detected
  - node "node249" (ID: 2) is running but the repmgr node record is inactive
2021-03-01 13:30:24 There are more than one primary DBs([2] DBs are running), will do nothing and exit
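As shown above, the cluster came up with two primaries during startup, and sys_monitor.sh refused to continue. A dual-primary fault needs manual intervention: work out which node is the most recent primary, then rebuild the cluster around it.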
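The dual-primary condition that sys_monitor.sh reports can also be spotted by scanning the text of `repmgr cluster show`. The following is only a minimal sketch: the function names `parse_cluster_show` and `running_primaries` are made up for this illustration, the sample text is an abbreviated copy of the table above (connection strings shortened), and it assumes the column layout shown in this article.

```python
# Abbreviated copy of the `repmgr cluster show` table from this article;
# the connection strings are shortened for readability.
SAMPLE = """\
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+----------+----------+----------+----------+-------
 1  | node248 | primary | * running |          | default  | 100      | 5        | host=192.168.7.248
 2  | node249 | primary | ! running |          | default  | 100      | 4        | host=192.168.7.249
WARNING: following issues were detected
"""


def parse_cluster_show(text):
    """Turn the table into a list of dicts keyed by column name (hypothetical helper)."""
    nodes = []
    header = None
    for line in text.splitlines():
        if "|" not in line:
            continue  # separator row, warnings, and log lines carry no column bars
        cells = [cell.strip() for cell in line.split("|")]
        if header is None:
            header = cells  # first row with bars is the column header
            continue
        nodes.append(dict(zip(header, cells)))
    return nodes


def running_primaries(nodes):
    """Names of nodes that are running as a primary, whatever their registered role."""
    names = []
    for node in nodes:
        status = node["Status"].lstrip("*!? ")  # drop the leading status marker
        registered_primary = node["Role"] == "primary" and status == "running"
        if registered_primary or status == "running as primary":
            names.append(node["Name"])
    return names


primaries = running_primaries(parse_cluster_show(SAMPLE))
if len(primaries) > 1:
    print("dual primary detected:", ", ".join(primaries))
```

A check like this only flags the problem. Which node to keep is still a manual decision.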
2. Check the original standby's database service status
node2 (original primary):
[kingbase@node2 bin]$ ./repmgr cluster show
 ID | Name    | Role    | Status               | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+----------------------+----------+----------+----------+--------
 1  | node248 | standby | ! running as primary |          | default  | 100      | 5        | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node249 | primary | * running            |          | default  | 100      | 4        | host=192.168.7.249 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

WARNING: following issues were detected
  - node "node248" (ID: 1) is running as primary but the repmgr node record is inactive
III. Compare Node Data Using the Control Files
node1:
[kingbase@node1 bin]$ ./sys_controldata -D ../data
sys_control version number:           1201
Catalog version number:               201909212
Database system identifier:           6950158917747347623
Database cluster state:               in production
sys_control last modified:            Mon 01 Mar 2021 01:35:16 PM CST
Latest checkpoint location:           1/F2008980
Latest checkpoint's REDO location:    1/F2008948
Latest checkpoint's REDO WAL file:    0000000500000001000000F2
Latest checkpoint's TimeLineID:       5
Latest checkpoint's PrevTimeLineID:   5
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          0:8813
Latest checkpoint's NextOID:          32951
Latest checkpoint's NextMultiXactId:  1
Latest checkpoint's NextMultiOffset:  0
Latest checkpoint's oldestXID:        839
Latest checkpoint's oldestXID's DB:   1
Latest checkpoint's oldestActiveXID:  8813
Latest checkpoint's oldestMultiXid:   1
Latest checkpoint's oldestMulti's DB: 1
Latest checkpoint's oldestCommitTsXid:0
Latest checkpoint's newestCommitTsXid:0
Time of latest checkpoint:            Mon 01 Mar 2021 01:35:16 PM CST
node2:
[kingbase@node2 bin]$ ./sys_controldata -D ../data
sys_control version number:           1201
Catalog version number:               201909212
Database system identifier:           6950158917747347623
Database cluster state:               in production
sys_control last modified:            Mon 01 Mar 2021 01:34:45 PM CST
Latest checkpoint location:           1/F2002AC0
Latest checkpoint's REDO location:    1/F2002A88
Latest checkpoint's REDO WAL file:    0000000400000001000000F2
Latest checkpoint's TimeLineID:       4
Latest checkpoint's PrevTimeLineID:   4
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          0:8810
Latest checkpoint's NextOID:          32951
Latest checkpoint's NextMultiXactId:  1
Latest checkpoint's NextMultiOffset:  0
Latest checkpoint's oldestXID:        839
Latest checkpoint's oldestXID's DB:   1
Latest checkpoint's oldestActiveXID:  8810
Latest checkpoint's oldestMultiXid:   1
Latest checkpoint's oldestMulti's DB: 1
Latest checkpoint's oldestCommitTsXid:0
Latest checkpoint's newestCommitTsXid:0
Time of latest checkpoint:            Mon 01 Mar 2021 01:34:45 PM CST
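Comparing the two control files shows that the new primary's timeline (5) is higher than the original primary's timeline (4), and that the new primary's next transaction ID (8813) is higher than the original primary's (8810). The new primary (node1, 192.168.7.248) is therefore chosen as the cluster's primary node, and the original primary (node2, 192.168.7.249) is demoted to standby.
Note: when deciding which node should be the primary, it is best, where possible, to also start the databases and check from the business application side which host actually holds the most recent data.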
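The timeline and transaction-ID comparison described above can be written down as a small script. This is a sketch only: `parse_controldata` and `freshness_key` are names invented for this example, the sample text keeps just the two fields the article compares, and ranking by (timeline, epoch, XID) mirrors the reasoning in this case rather than any rule built into repmgr or KingbaseES.

```python
# Two-field excerpts of the sys_controldata output shown above.
NODE1 = """\
Latest checkpoint's TimeLineID:       5
Latest checkpoint's NextXID:          0:8813
"""
NODE2 = """\
Latest checkpoint's TimeLineID:       4
Latest checkpoint's NextXID:          0:8810
"""


def parse_controldata(text):
    """Split each 'key: value' line on its first colon (hypothetical helper)."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def freshness_key(fields):
    """Sort key used in this article's reasoning: timeline first, then epoch:xid."""
    timeline = int(fields["Latest checkpoint's TimeLineID"])
    epoch, xid = fields["Latest checkpoint's NextXID"].split(":")
    return (timeline, int(epoch), int(xid))


candidates = {
    "node1": parse_controldata(NODE1),
    "node2": parse_controldata(NODE2),
}
newest = max(candidates, key=lambda name: freshness_key(candidates[name]))
print("candidate primary:", newest)
```

The result is a candidate, not a verdict. As the note above says, confirming against the application data is still the safer final check.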
IV. Rejoin the Original Primary to the Cluster
Rejoin node2 to the cluster:
[kingbase@node2 bin]$ ./sys_ctl stop -D ../data
waiting for server to shut down.... done
server stopped
[kingbase@node2 bin]$ ./repmgr node rejoin -h 192.168.7.248 -U esrep -d esrep
ERROR: this node cannot attach to rejoin target node 1
DETAIL: rejoin target server's timeline 5 forked off current database system timeline 4 before current recovery point 1/F2002B70
HINT: use --force-rewind to execute sys_rewind
[kingbase@node2 bin]$ ./repmgr node rejoin -h 192.168.7.248 -U esrep -d esrep --force-rewind
NOTICE: sys_rewind execution required for this node to attach to rejoin target node 1
DETAIL: rejoin target server's timeline 5 forked off current database system timeline 4 before current recovery point 1/F2002B70
NOTICE: executing sys_rewind
DETAIL: sys_rewind command is "/home/kingbase/cluster/R6HA/KHA/kingbase/bin/sys_rewind -D '/home/kingbase/cluster/R6HA/KHA/kingbase/data' --source-server='host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3'"
sys_rewind: servers diverged at WAL location 1/F20000D8 on timeline 4
sys_rewind: rewinding from last common checkpoint at 1/F2000060 on timeline 4
sys_rewind: find last common checkpoint start time from 2021-03-01 14:06:28.539405 CST to 2021-03-01 14:06:28.577794 CST, in "0.038389" seconds.
sys_rewind: update the control file: minRecoveryPoint is '1/F2031590', minRecoveryPointTLI is '5', and database state is 'in archive recovery'
sys_rewind: we will remove the dir '/home/kingbase/cluster/R6HA/KHA/kingbase/data/sys_replslot/repmgr_slot_1.rewind' and all the file/dir in it.
sys_rewind: we will remove the dir '/home/kingbase/cluster/R6HA/KHA/kingbase/data/base/syssql_tmp.rewind' and all the file/dir in it.
sys_rewind: rewind start wal location 1/F2000060 (file 0000000400000001000000F2), end wal location 1/F2031590 (file 0000000500000001000000F2). time from 2021-03-01 14:06:28.539405 CST to 2021-03-01 14:06:44.221603 CST, in "15.682198" seconds.
sys_rewind: Done!
NOTICE: 0 files copied to /home/kingbase/cluster/R6HA/KHA/kingbase/data
NOTICE: setting node 2's upstream to node 1
WARNING: unable to ping "host=192.168.7.249 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: begin to start server at 2021-03-01 14:06:44.800564
NOTICE: starting server using "/home/kingbase/cluster/R6HA/KHA/kingbase/bin/sys_ctl -w -t 90 -D '/home/kingbase/cluster/R6HA/KHA/kingbase/data' -l /home/kingbase/cluster/R6HA/KHA/kingbase/bin/logfile start"
NOTICE: start server finish at 2021-03-01 14:06:46.217825
NOTICE: NODE REJOIN successful
DETAIL: node 2 is now attached to node 1
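To read the sys_rewind messages above, it helps to see how a WAL location such as 1/F2000060 maps to a file name such as 0000000400000001000000F2. The sketch below assumes the PostgreSQL-style naming (timeline, log number, and segment number, each as 8 hex digits) and the default 16 MB segment size. Both are assumptions about this KingbaseES build, although they do reproduce the file names printed in the log. `lsn_to_int` and `wal_file_name` are illustrative names, not part of any tool.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default 16 MB WAL segments


def lsn_to_int(lsn):
    """Convert an 'X/Y' WAL location (two hex numbers) into a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wal_file_name(timeline, lsn, segment_size=WAL_SEGMENT_SIZE):
    """Name of the WAL segment holding `lsn`, assuming PostgreSQL-style naming."""
    segment_number = lsn_to_int(lsn) // segment_size
    segments_per_log = 0x100000000 // segment_size
    return "%08X%08X%08X" % (
        timeline,
        segment_number // segments_per_log,
        segment_number % segments_per_log,
    )


# Values copied from the sys_rewind output above.
rewind_start = "1/F2000060"   # last common checkpoint, timeline 4
diverged_at = "1/F20000D8"    # where the two primaries diverged
rewind_end = "1/F2031590"     # end location on the new primary, timeline 5

print(wal_file_name(4, rewind_start))
print(wal_file_name(5, rewind_end))
print("bytes of WAL after the divergence point:",
      lsn_to_int(rewind_end) - lsn_to_int(diverged_at))
```

The byte difference gives a rough feel for how far the new primary's WAL had advanced past the fork. It does not say how much was written on the original primary's discarded timeline.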
Check the cluster node status:
[kingbase@node2 bin]$ ./repmgr cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+----------+----------+----------+----------+--------
 1  | node248 | primary | * running |          | default  | 100      | 5        | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node249 | standby |   running | node248  | default  | 100      | 4        | host=192.168.7.249 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
Check the primary/standby streaming replication status:
[kingbase@node1 bin]$ ./ksql -U system test
ksql (V8.0)
Type "help" for help.

test=# select * from sys_stat_replication;
  pid  | usesysid | usename | application_name |  client_addr  | client_hostname | client_port | backend_start | backend_xmin | state | sent_lsn | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state | reply_time
-------+----------+---------+------------------+---------------+-----------------+-------
 22853 | 16384 | esrep | node249 | 192.168.7.249 | | 38638 | 2021-03-01 14:07:24.293687+08 | | streaming | 1/F20357A8 | 1/F20357A8 | 1/F20357A8 | 1/F20357A8 | | | | 0 | async | 2021-03-01 14:07:57.851500+08
(1 row)
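The row above is healthy because the state is streaming and all four LSN columns are equal, meaning the standby has replayed everything the primary has sent. A minimal sketch of that check follows. The dictionary simply restates values from the query result, and `standby_health` is a name made up for this example rather than a repmgr or KingbaseES function.

```python
def lsn_to_int(lsn):
    """Convert an 'X/Y' WAL location (two hex numbers) into a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def standby_health(row):
    """Summarize one sys_stat_replication row (hypothetical helper)."""
    lag_bytes = lsn_to_int(row["sent_lsn"]) - lsn_to_int(row["replay_lsn"])
    return {
        "name": row["application_name"],
        "streaming": row["state"] == "streaming",
        "replay_lag_bytes": lag_bytes,
        "sync_state": row["sync_state"],
    }


# Values copied from the query result above.
row = {
    "application_name": "node249",
    "state": "streaming",
    "sent_lsn": "1/F20357A8",
    "replay_lsn": "1/F20357A8",
    "sync_state": "async",
}

health = standby_health(row)
print(health)
```

Since sync_state is async here, a lag of zero bytes is a snapshot of one moment, not a guarantee for later writes.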
V. Restart the Cluster to Verify
[kingbase@node1 bin]$ ./sys_monitor.sh restart
2021-03-01 14:09:05 Ready to stop all DB ...
There is no service "node_export" running currently.
There is no service "postgres_ex" running currently.
There is no service "node_export" running currently.
There is no service "postgres_ex" running currently.
2021-03-01 14:09:10 begin to stop repmgrd on "[192.168.7.248]".
2021-03-01 14:09:11 repmgrd on "[192.168.7.248]" already stopped.
2021-03-01 14:09:11 begin to stop repmgrd on "[192.168.7.249]".
2021-03-01 14:09:11 repmgrd on "[192.168.7.249]" already stopped.
2021-03-01 14:09:11 begin to stop DB on "[192.168.7.249]".
waiting for server to shut down.... done
server stopped
2021-03-01 14:09:13 DB on "[192.168.7.249]" stop success.
2021-03-01 14:09:13 begin to stop DB on "[192.168.7.248]".
waiting for server to shut down...... done
server stopped
2021-03-01 14:09:16 DB on "[192.168.7.248]" stop success.
2021-03-01 14:09:16 Done.
2021-03-01 14:09:16 Ready to start all DB ...
2021-03-01 14:09:16 begin to start DB on "[192.168.7.248]".
waiting for server to start.... done
server started
2021-03-01 14:09:17 execute to start DB on "[192.168.7.248]" success, connect to check it.
2021-03-01 14:09:19 DB on "[192.168.7.248]" start success.
2021-03-01 14:09:19 Try to ping trusted_servers on host 192.168.7.248 ...
2021-03-01 14:09:21 Try to ping trusted_servers on host 192.168.7.249 ...
2021-03-01 14:09:24 begin to start DB on "[192.168.7.249]".
waiting for server to start.... done
server started
2021-03-01 14:09:25 execute to start DB on "[192.168.7.249]" success, connect to check it.
2021-03-01 14:09:26 DB on "[192.168.7.249]" start success.
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+----------+----------+----------+----------+-------
 1  | node248 | primary | * running |          | default  | 100      | 5        | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node249 | standby |   running | node248  | default  | 100      | 5        | host=192.168.7.249 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
2021-03-01 14:09:26 The primary DB is started.
2021-03-01 14:09:31 Success to load virtual ip [192.168.7.240/24] on primary host [192.168.7.248].
2021-03-01 14:09:31 Try to ping vip on host 192.168.7.248 ...
2021-03-01 14:09:33 Try to ping vip on host 192.168.7.249 ...
2021-03-01 14:09:36 begin to start repmgrd on "[192.168.7.248]".
[2021-03-01 14:09:37] [NOTICE] using provided configuration file "/home/kingbase/cluster/R6HA/KHA/kingbase/bin/../etc/repmgr.conf"
[2021-03-01 14:09:37] [NOTICE] redirecting logging output to "/home/kingbase/cluster/R6HA/KHA/kingbase/hamgr.log"
2021-03-01 14:09:37 repmgrd on "[192.168.7.248]" start success.
2021-03-01 14:09:37 begin to start repmgrd on "[192.168.7.249]".
[2021-03-01 14:09:00] [NOTICE] using provided configuration file "/home/kingbase/cluster/R6HA/KHA/kingbase/bin/../etc/repmgr.conf"
[2021-03-01 14:09:00] [NOTICE] redirecting logging output to "/home/kingbase/cluster/R6HA/KHA/kingbase/hamgr.log"
2021-03-01 14:09:38 repmgrd on "[192.168.7.249]" start success.
 ID | Name    | Role    | Status    | Upstream | repmgrd | PID   | Paused? | Upstream last seen
----+---------+---------+-----------+----------+---------+-------+---------+--------------------
 1  | node248 | primary | * running |          | running | 24725 | no      | n/a
 2  | node249 | standby |   running | node248  | running | 23587 | no      | n/a
2021-03-01 14:09:46 Done.
Check the cluster node status:
[kingbase@node1 bin]$ ./repmgr cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+----------+----------+----------+----------+--------
 1  | node248 | primary | * running |          | default  | 100      | 5        | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node249 | standby |   running | node248  | default  | 100      | 5        | host=192.168.7.249 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3