
Troubleshooting | Recent Pitfalls with Percona XtraBackup 8.0 - Differences Between XtraBackup 2.4 and 8.0

[TOC]

----

### Preface

While recently adapting `radondb/xenon` to Percona XtraBackup 8.0, I ran into a number of problems. After several days of research and experimentation they were finally resolved, so I am sharing the findings here.

**Version information:**

```
Percona-Server 8.0.19-10
Percona-Xtrabackup 8.0.13
```

For the meaning of each field in the version string, see https://www.percona.com/blog/2020/08/18/aligning-percona-xtrabackup-versions-with-percona-server-for-mysql/

### Pitfalls encountered during the adaptation

**Pitfall 1:** MySQL 8.0 + Semi-Sync, with data being written continuously while the node is rebuilt; afterwards, `change master to && start slave` fails with:

```verilog
Last_Error: Could not execute Write_rows event on table db1.t1; Duplicate entry '28646' for key 't1.PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql-bin.000052, end_log_pos 437
```

**Pitfall 2:** MySQL 8.0 + Group Replication, with data being written continuously while the node is rebuilt; afterwards, `change master to && start group_replication` fails with:

```verilog
2020-08-21T14:51:09.977606+08:00 61 [System] [MY-010597] [Repl] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_applier' executed'. Previous state master_host='', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''.
2020-08-21T14:51:09.987494+08:00 61 [ERROR] [MY-013124] [Repl] Slave SQL for channel 'group_replication_applier': Slave failed to initialize relay log info structure from the repository, Error_code: MY-013124
2020-08-21T14:51:09.987542+08:00 61 [ERROR] [MY-011534] [Repl] Plugin group_replication reported: 'Error while starting the group replication applier thread'
2020-08-21T14:51:09.987651+08:00 7 [ERROR] [MY-011669] [Repl] Plugin group_replication reported: 'Unable to initialize the Group Replication applier module.'
2020-08-21T14:51:09.987831+08:00 7 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] The member is leaving a group without being on one.'
```

To explain these errors, we first need to understand the differences between XtraBackup 2.4 and 8.0.

### Differences between XtraBackup 2.4 and 8.0

**A web search shows that XtraBackup 8.0 behaves differently from 2.4:**

>1. With XtraBackup 2.4, the GTID information recorded in the `xtrabackup_binlog_info` file produced by the backup is accurate, but the GTID shown by `show master status` after the backup is restored is not.
>2. With XtraBackup 8.0, when backing up an instance that contains only InnoDB tables, the GTID information recorded in `xtrabackup_binlog_info` is not necessarily accurate, but the GTID shown by `show master status` after the backup is restored is accurate.
>3. With XtraBackup 8.0, when backing up an instance that contains non-InnoDB tables, the GTID information recorded in `xtrabackup_binlog_info` is accurate, and the GTID shown by `show master status` after the backup is restored is accurate as well.

**I had looked into XtraBackup 2.4 before; its backup process is roughly:**

>1. start backup
>2. copy ibdata1 / copy .ibd file
>3. execute FTWRL
>4. backup non-InnoDB tables and files
>5. write xtrabackup_binlog_info
>6. execute FLUSH NO_WRITE_TO_BINLOG ENGINE LOGS
>7. execute UNLOCK TABLES
>8. copy ib_buffer_pool
>9. completed OK!
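To see for yourself which statements a given xtrabackup version issues against the server, the general query log can be enabled temporarily before taking a backup. A minimal sketch with standard MySQL system variables; the log file path is only an example, and the log should be switched off again afterwards since it records every statement:

```sql
-- Route the general log to a file (the path below is just an example)
SET GLOBAL log_output = 'FILE';
SET GLOBAL general_log_file = '/data/mysql/general.log';

-- Turn the general query log on, run the backup, then turn it off again
SET GLOBAL general_log = ON;
-- ... run xtrabackup --backup here ...
SET GLOBAL general_log = OFF;
```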
**Question 1: What does the xtrabackup 8.0 process look like?**

First, look at the `general log` captured during the rebuild:

```verilog
2020-08-26T16:20:18.136376+08:00 170 Query SET SESSION wait_timeout=2147483
2020-08-26T16:20:18.136439+08:00 170 Query SET SESSION autocommit=1
2020-08-26T16:20:18.136523+08:00 170 Query SET NAMES utf8
2020-08-26T16:20:18.136595+08:00 170 Query SHOW VARIABLES
2020-08-26T16:20:18.138840+08:00 170 Query SELECT COUNT(*) FROM information_schema.tables WHERE engine = 'MyISAM' OR engine = 'RocksDB'
2020-08-26T16:20:18.140203+08:00 170 Query SHOW ENGINES
2020-08-26T16:20:18.140407+08:00 170 Query SHOW ENGINE INNODB STATUS
2020-08-26T16:20:18.141570+08:00 170 Query SELECT PLUGIN_NAME, PLUGIN_LIBRARY FROM information_schema.plugins WHERE PLUGIN_STATUS = 'ACTIVE' AND PLUGIN_TYPE = 'KEYRING'
2020-08-26T16:20:18.142140+08:00 170 Query SELECT CONCAT(table_schema, '/', table_name), engine FROM information_schema.tables WHERE engine NOT IN ('MyISAM', 'InnoDB', 'CSV', 'MRG_MYISAM', 'ROCKSDB') AND table_schema NOT IN ( 'performance_schema', 'information_schema', 'mysql')
2020-08-26T16:20:18.209819+08:00 171 Query SET SESSION wait_timeout=2147483
2020-08-26T16:20:18.209879+08:00 171 Query SET SESSION autocommit=1
2020-08-26T16:20:18.209950+08:00 171 Query SET NAMES utf8
2020-08-26T16:20:18.210015+08:00 171 Query SHOW VARIABLES
2020-08-26T16:20:18.214030+08:00 170 Query SELECT T2.PATH, T2.NAME, T1.SPACE_TYPE FROM INFORMATION_SCHEMA.INNODB_TABLESPACES T1 JOIN INFORMATION_SCHEMA.INNODB_TABLESPACES_BRIEF T2 USING (SPACE) WHERE T1.SPACE_TYPE = 'Single' && T1.ROW_FORMAT != 'Undo'UNION SELECT T2.PATH, SUBSTRING_INDEX(SUBSTRING_INDEX(T2.PATH, '/', -1), '.', 1) NAME, T1.SPACE_TYPE FROM INFORMATION_SCHEMA .INNODB_TABLESPACES T1 JOIN INFORMATION_SCHEMA .INNODB_TABLESPACES_BRIEF T2 USING (SPACE) WHERE T1.SPACE_TYPE = 'General' && T1.ROW_FORMAT != 'Undo'
2020-08-26T16:20:19.533904+08:00 170 Query FLUSH NO_WRITE_TO_BINLOG BINARY LOGS
2020-08-26T16:20:19.543095+08:00 170 Query SELECT server_uuid, local, replication, storage_engines FROM performance_schema.log_status
2020-08-26T16:20:19.543418+08:00 170 Query SHOW VARIABLES
2020-08-26T16:20:19.545383+08:00 170 Query SHOW VARIABLES
2020-08-26T16:20:19.550641+08:00 170 Query FLUSH NO_WRITE_TO_BINLOG ENGINE LOGS
2020-08-26T16:20:20.556885+08:00 170 Query SELECT UUID()
2020-08-26T16:20:20.557118+08:00 170 Query SELECT VERSION()
```

From this we can see that, **by default, xtrabackup 8.0 roughly proceeds as follows:**

>1. start backup
>2. copy .ibd file
>3. backup non-InnoDB tables and files
>4. execute FLUSH NO_WRITE_TO_BINLOG BINARY LOGS
>5. select LSN and binary log position from p_s.log_status
>6. copy last binlog file
>7. write /mysql/backup/backup/binlog.index
>8. write xtrabackup_binlog_info
>9. execute FLUSH NO_WRITE_TO_BINLOG ENGINE LOGS
>10. copy ib_buffer_pool
>11. completed OK!
>
>**Note:** when non-InnoDB tables are present, xtrabackup 8.0 still executes FTWRL.

From the steps above, the **main difference** between xtrabackup 8.0 and 2.4 is that when the instance contains only InnoDB tables, FTWRL is no longer executed; instead, step 5 above (`SELECT server_uuid, local, replication, storage_engines FROM performance_schema.log_status`) is used to obtain the LSN, binlog position, and GTID.

The manual describes the [log_status](https://dev.mysql.com/doc/refman/8.0/en/performance-schema-log-status-table.html) table as follows:

>The [`log_status`](https://dev.mysql.com/doc/refman/8.0/en/performance-schema-log-status-table.html) table provides information that enables an online backup tool to copy the required log files without locking those resources for the duration of the copy process.
>
>When the [`log_status`](https://dev.mysql.com/doc/refman/8.0/en/performance-schema-log-status-table.html) table is queried, the server blocks logging and related administrative changes for just long enough to populate the table, then releases the resources. The [`log_status`](https://dev.mysql.com/doc/refman/8.0/en/performance-schema-log-status-table.html) table informs the online backup which point it should copy up to in the source's binary log and `gtid_executed` record, and the relay log for each replication channel. It also provides relevant information for individual storage engines, such as the last log sequence number (LSN) and the LSN of the last checkpoint taken for the `InnoDB` storage engine.

As the manual describes, `performance_schema.log_status` is a table MySQL 8.0 provides for online backup tools to obtain replication information. When it is queried, the server blocks logging and related changes just long enough to populate the table, then releases the resources. The `log_status` table tells the online backup tool the source's current binlog position and `gtid_executed` value, as well as the relay log of each replication channel. It also provides information from the individual storage engines, such as the last log sequence number (LSN) and the LSN of the last checkpoint for the InnoDB engine.

The contents of `performance_schema.log_status` look like this:

```sql
-- Semi-Sync
mysql> select * from performance_schema.log_status\G
*************************** 1. row ***************************
    SERVER_UUID: 6b437e80-e5d5-11ea-88e3-52549922fdbb
          LOCAL: {"gtid_executed": "6b437e80-e5d5-11ea-88e3-52549922fdbb:1-201094", "binary_log_file": "mysql-bin.000079", "binary_log_position": 195}
    REPLICATION: {"channels": []}
STORAGE_ENGINES: {"InnoDB": {"LSN": 23711425885, "LSN_checkpoint": 23711425885}}
1 row in set (0.00 sec)

-- Group Replication
mysql> select * from performance_schema.log_status\G
*************************** 1. row ***************************
    SERVER_UUID: 7bd32480-e5d5-11ea-8f8a-525499cfbb7d
          LOCAL: {"gtid_executed": "aaaaaaaa-aaaa-aaaa-aaaa-53ab6ea1210a:1-11", "binary_log_file": "mysql-bin.000003", "binary_log_position": 1274}
    REPLICATION: {"channels": [{"channel_name": "group_replication_applier", "relay_log_file": "mysql-relay-bin-group_replication_applier.000004", "relay_log_position": 311, "relay_master_log_file": "", "exec_master_log_position": 0}, {"channel_name": "group_replication_recovery", "relay_log_file": "mysql-relay-bin-group_replication_recovery.000003", "relay_log_position": 151, "relay_master_log_file": "", "exec_master_log_position": 0}]}
STORAGE_ENGINES: {"InnoDB": {"LSN": 20257208, "LSN_checkpoint": 20257208}}
1 row in set (0.00 sec)
```

**Question 2: Is the information provided by `performance_schema.log_status` accurate?**

Under heavy write load, the binlog position reported by this table is inconsistent with its GTID information, as the following comparison shows:
```sql
mysql> select * from performance_schema.log_status\G show master status;
*************************** 1. row ***************************
    SERVER_UUID: 6b437e80-e5d5-11ea-88e3-52549922fdbb
          LOCAL: {"gtid_executed": "6b437e80-e5d5-11ea-88e3-52549922fdbb:1-448709", "binary_log_file": "mysql-bin.000087", "binary_log_position": 341265185}
    REPLICATION: {"channels": []}
STORAGE_ENGINES: {"InnoDB": {"LSN": 33797305275, "LSN_checkpoint": 33433316246}}
1 row in set (0.11 sec)

+------------------+-----------+--------------+------------------+-----------------------------------------------+
| File             | Position  | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                             |
+------------------+-----------+--------------+------------------+-----------------------------------------------+
| mysql-bin.000087 | 343317905 |              |                  | 6b437e80-e5d5-11ea-88e3-52549922fdbb:1-448709 |
+------------------+-----------+--------------+------------------+-----------------------------------------------+
1 row in set (0.01 sec)
```

**Question 3: Since the binlog position in `log_status` is inaccurate, why is no GTID missing after the backup is restored, and why is the data intact?**

The reason is that after `FLUSH NO_WRITE_TO_BINLOG BINARY LOGS` in step 4, xtrabackup 8.0 copies the last binlog file in step 6. When the new instance restored from the backup starts up, it therefore updates its GTID state not only from the `gtid_executed` table but also from that copied binlog file:

```verilog
$ mysqlbinlog -vv /data/mysql/mysql-bin.000096
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 4
#200827 11:26:47 server id 575010000 end_log_pos 124 CRC32 0xb026e372 Start: binlog v 4, server v 8.0.19-10 created 200827 11:26:47
# Warning: this binlog is either in use or was not closed properly.
BINLOG '
9ydHXw/Q9EUieAAAAHwAAAABAAQAOC4wLjE5LTEwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAEwANAAgAAAAABAAEAAAAYAAEGggAAAAICAgCAAAACgoKKioAEjQA
CgFy4yaw
'/*!*/;
# at 124
#200827 11:26:47 server id 575010000 end_log_pos 195 CRC32 0xad060415 Previous-GTIDs
# 6b437e80-e5d5-11ea-88e3-52549922fdbb:1-465503
SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/;
DELIMITER ;
# End of log file
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;
```

### Troubleshooting

#### Pitfall 1: MySQL 8.0 + Semi-Sync rebuild problem

xenon's original rebuild logic was written for MySQL 5.6 and 5.7 (the xenon process stays alive during the rebuild) and had never caused problems:

>1. Disable raft and set the xenon state to LEARNER;
>2. If the mysql process exists, stop mysql;
>3. Wipe the MySQL data directory;
>4. Run `xtrabackup --backup` to fetch the data from the peer via `xbstream`;
>5. Run `xtrabackup --prepare` to apply the redo log;
>6. Start mysql;
>7. Run `stop slave; reset slave all`;
>8. Run `reset master`, then set `gtid_purged` according to the GTID recorded in the `xtrabackup_binlog_info` file (see the SQL sketch after this list);
>9. Enable raft and set the xenon state to FOLLOWER or IDLE;
>10. Wait for xenon to automatically `change master to` the master node;
>11. Run `start slave`.
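Steps 7 and 8 (plus the `change master to` / `start slave` that xenon issues in steps 10 and 11) boil down to statements like the following. This is only a sketch: the host and user mirror the values visible in the general log excerpt below, the password is redacted, and the GTID set stands for whatever was recorded in `xtrabackup_binlog_info` for that particular backup.

```sql
-- Step 7: drop any replication metadata carried over from the donor
STOP SLAVE;
RESET SLAVE ALL;

-- Step 8: clear the local binlog/GTID state, then seed gtid_purged from the
-- GTID set recorded in xtrabackup_binlog_info (example value shown)
RESET MASTER;
SET GLOBAL gtid_purged = '6b437e80-e5d5-11ea-88e3-52549922fdbb:1-102610';

-- Steps 10-11: xenon then points replication at the master and starts it
CHANGE MASTER TO
    MASTER_HOST = '192.168.0.3',
    MASTER_USER = 'qc_repl',
    MASTER_PASSWORD = '<redacted>',
    MASTER_PORT = 3306,
    MASTER_AUTO_POSITION = 1;
START SLAVE;
```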
**Question 1: Why does the Duplicate entry error appear with MySQL 8.0 + Semi-Sync?**

Tracing the general log during the rebuild shows that between steps 6 and 7, i.e. before `gtid_purged` is set, extra `change master to` and `start slave` operations appear out of nowhere:

```verilog
2020-08-24T21:55:22.817859+08:00 8 Query SET GLOBAL rpl_semi_sync_master_enabled=OFF
2020-08-24T21:55:22.818025+08:00 8 Query SET GLOBAL read_only = 1
2020-08-24T21:55:22.818143+08:00 8 Query SET GLOBAL super_read_only = 1
2020-08-24T21:55:22.818323+08:00 8 Query START SLAVE
2020-08-24T21:55:22.824449+08:00 8 Query STOP SLAVE
2020-08-24T21:55:22.824610+08:00 8 Query CHANGE MASTER TO MASTER_HOST = '192.168.0.3', MASTER_USER = 'qc_repl', MASTER_PASSWORD = , MASTER_PORT = 3306, MASTER_AUTO_POSITION = 1
2020-08-24T21:55:22.833710+08:00 8 Query START SLAVE
2020-08-24T21:55:22.935973+08:00 10 Query BEGIN
2020-08-24T21:55:22.936084+08:00 10 Query COMMIT /* implicit, from Xid_log_event */
......
2020-08-24T21:55:24.701711+08:00 10 Query BEGIN
2020-08-24T21:55:24.701901+08:00 10 Query COMMIT /* implicit, from Xid_log_event */
2020-08-24T21:55:24.816571+08:00 8 Query SET GLOBAL rpl_semi_sync_master_enabled=OFF
2020-08-24T21:55:24.816886+08:00 8 Query SET GLOBAL read_only = 1
2020-08-24T21:55:24.817177+08:00 8 Query SET GLOBAL super_read_only = 1
2020-08-24T21:55:24.817281+08:00 8 Query START SLAVE
2020-08-24T21:55:25.039581+08:00 10 Query BEGIN
2020-08-24T21:55:25.039749+08:00 10 Query COMMIT /* implicit, from Xid_log_event */
......
2020-08-24T21:55:25.152919+08:00 10 Query BEGIN
2020-08-24T21:55:25.153082+08:00 10 Query COMMIT /* implicit, from Xid_log_event */
2020-08-24T21:55:25.389776+08:00 8 Query STOP SLAVE
2020-08-24T21:55:25.392581+08:00 8 Query RESET SLAVE ALL
2020-08-24T21:55:25.407434+08:00 8 Query RESET MASTER
2020-08-24T21:55:25.417292+08:00 8 Query SET GLOBAL gtid_purged='6b437e80-e5d5-11ea-88e3-52549922fdbb:1-102610 '
2020-08-24T21:55:25.419835+08:00 8 Query START SLAVE
2020-08-24T21:55:25.427071+08:00 8 Query SET GLOBAL read_only = 1
2020-08-24T21:55:25.427178+08:00 8 Query SET GLOBAL super_read_only = 1
2020-08-24T21:55:25.427271+08:00 8 Query SET GLOBAL sync_binlog=1000
2020-08-24T21:55:25.427339+08:00 8 Query SET GLOBAL innodb_flush_log_at_trx_commit=1
2020-08-24T21:55:25.427423+08:00 8 Query SHOW SLAVE STATUS
2020-08-24T21:55:25.427600+08:00 8 Query SHOW MASTER STATUS
2020-08-24T21:55:26.817622+08:00 8 Query SET GLOBAL rpl_semi_sync_master_enabled=OFF
2020-08-24T21:55:26.817794+08:00 8 Query SET GLOBAL read_only = 1
2020-08-24T21:55:26.817897+08:00 8 Query SET GLOBAL super_read_only = 1
2020-08-24T21:55:26.817988+08:00 8 Query START SLAVE
2020-08-24T21:55:26.818381+08:00 8 Query SHOW SLAVE STATUS
2020-08-24T21:55:26.818570+08:00 8 Query SHOW MASTER STATUS
2020-08-24T21:55:26.818715+08:00 8 Query STOP SLAVE
2020-08-24T21:55:26.818823+08:00 8 Query CHANGE MASTER TO MASTER_HOST = '192.168.0.3', MASTER_USER = 'qc_repl', MASTER_PASSWORD = , MASTER_PORT = 3306, MASTER_AUTO_POSITION = 1
2020-08-24T21:55:26.832164+08:00 8 Query START SLAVE
```

In other words, replication had already been started and had pulled in some data before `gtid_purged` was set, so the content of `xtrabackup_binlog_info` was no longer accurate. The GTID set afterwards no longer matched the actual data: the instance actually held more data than the GTID claimed, which caused the primary key conflicts.

**Question 2: Why did this problem never show up with MySQL 5.6 and 5.7?**

After many tests, it turns out that on MySQL 5.6 and 5.7, running `change master to & start slave` before `set gtid_purged` fails with the replication error `Slave failed to initialize relay log info structure from the repository`, whereas running `change master to & start slave` after `reset slave all; reset master; set gtid_purged` replicates normally and the data is correct.

**Question 3: Which part of xenon's logic caused the extra `change master to` and `start slave`?**

The root cause is that during the rebuild xenon is set to the LEARNER role, and as soon as that role detects that MySQL is alive it runs `change master` to the master node. Normally this should only happen after the raft state has been switched to FOLLOWER, at which point the FOLLOWER's monitoring thread performs the `change master` to the master node. (See [pr104](https://github.com/radondb/xenon/pull/104) and [pr102](https://github.com/radondb/xenon/pull/102) for the code.)

#### Pitfall 2: MySQL 8.0 + Group Replication - MGR fails to start after rebuild

Judging from the error message `Slave failed to initialize relay log info structure from the repository`, the data directory rebuilt by xtrabackup retained slave replication metadata from the backup source. Running `reset slave` or `reset slave all` before starting group replication resolves the issue.
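In other words, the rebuilt node only needs its stale replication metadata cleared before Group Replication is started. A minimal sketch of the fix, assuming the group replication settings are already present in my.cnf; the recovery-channel credentials are placeholders and have to be set again because `RESET SLAVE ALL` also removes that channel's configuration:

```sql
-- Clear the slave metadata copied over from the donor's data directory;
-- without this, the group_replication_applier channel fails to initialize
-- its relay log info and START GROUP_REPLICATION errors out.
RESET SLAVE ALL;

-- Re-create the recovery channel credentials (placeholder user/password)
CHANGE MASTER TO
    MASTER_USER = 'repl_user',
    MASTER_PASSWORD = '<redacted>'
    FOR CHANNEL 'group_replication_recovery';

START GROUP_REPLICATION;
```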
### Summary

>1. With XtraBackup 2.4, the GTID information recorded in the `xtrabackup_binlog_info` file is accurate, but the GTID shown by `show master status` after the backup is restored is not.
>2. With XtraBackup 8.0, when backing up an instance that contains only InnoDB tables, the GTID recorded in `xtrabackup_binlog_info` is not necessarily accurate, but the GTID shown by `show master status` after the backup is restored is accurate.
>3. With XtraBackup 8.0, when backing up an instance that contains non-InnoDB tables, the GTID recorded in `xtrabackup_binlog_info` is accurate, and so is the GTID shown by `show master status` after the backup is restored.
>4. After rebuilding a Semi-Sync cluster node with XtraBackup 8.0, there is no need to run `reset master & set gtid_purged`.
>5. After rebuilding a Group Replication cluster node with XtraBackup 8.0, you must first run `reset slave` or `reset slave all` to clear the slave metadata before starting group replication; otherwise `start group_replication` will fail.

----

**Columns:**

You are welcome to follow my WeChat official account 【MySQL資料庫技術】.

![](https://img2020.cnblogs.com/blog/881517/202005/881517-20200508180828253-1311071614.png)

Zhihu - Database Technology column: https://zhuanlan.zhihu.com/mysqldb

**Blogs:**

SegmentFault: https://segmentfault.com/u/dbtech
OSChina: https://my.oschina.net/dbtech
Juejin: https://juejin.im/user/5e9d3ed251882538083fed1f/posts
cnblogs: https://www.cnblogs.com