MySQL MHA High-Availability Solution
IV. Configuring relay log purging (on every Node)
(1) Add the following to the cnf configuration file on every Node:
relay_log_purge=0
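During a failover, MHA relies on relay log information to recover the slaves, so automatic relay log purging must be turned OFF and the relay logs purged manually instead.
By default, relay logs on a slave are deleted automatically once the SQL thread has finished executing them. In an MHA environment, however, those relay logs may be needed to recover other slaves, so automatic deletion has to be disabled. Periodic purging has to take replication delay into account: on an ext3 file system, deleting a large file takes a noticeable amount of time and can cause serious replication delay. To avoid this, a hard link to the relay log is created temporarily, because on Linux removing a file name is fast as long as another hard link still references the data.
Tip: the same hard-link approach is commonly used when dropping large tables in MySQL.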
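The MHA Node package includes the purge_relay_logs tool. It creates hard links for the relay logs, executes SET GLOBAL relay_log_purge=1, waits a few seconds so that the SQL thread switches to a new relay log, and then executes SET GLOBAL relay_log_purge=0.
The purge_relay_logs script takes the following parameters:
--user: MySQL user name
--password: MySQL password
--port: port number
--workdir: the location where the hard links to the relay logs are created; the default is /var/tmp. Creating a hard link across different partitions fails, so the location has to be given explicitly when the relay logs live on another partition. After the script has run successfully, the hard-linked relay log files are deleted.
--disable_relay_log_purge: by default, if relay_log_purge=1 the script purges nothing and exits. With this parameter set, the script sets relay_log_purge to 0 in that case, purges the relay logs, and leaves the parameter OFF afterwards.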
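The hard-link idea can be tried out in isolation with a small shell sketch. This is not what purge_relay_logs runs internally; the function name hardlink_demo and the file names are made up for illustration, and a temporary directory stands in for the relay log directory.

```shell
#!/bin/bash
# Illustration only: hardlink_demo and the file names are hypothetical.
hardlink_demo() {
    local dir
    dir=$(mktemp -d) || return 1
    # Stand-in for a relay log file.
    echo "relay log payload" > "$dir/relay-bin.000001"
    # A second name for the same inode. ln fails across file systems,
    # which is why --workdir has to be on the relay log partition.
    ln "$dir/relay-bin.000001" "$dir/relay-bin.000001.link" || return 1
    # Removing the original name only drops a directory entry,
    # because the inode is still referenced by the link.
    rm -f "$dir/relay-bin.000001"
    # The data is still reachable through the remaining link.
    cat "$dir/relay-bin.000001.link"
    # The data blocks are freed only when the last link is removed.
    rm -rf "$dir"
}
```

Calling hardlink_demo prints the file content even though the original name has already been removed, which is the property that keeps the slow part of the deletion away from the SQL thread.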
(2) Create the following script on every slave Node:
vim /usr/local/mha/purge_relay_log.sh
#!/bin/bash
user=root
passwd=root    # make sure this user and password can log in through 127.0.0.1
host='127.0.0.1'
port=3306
work_dir='/mysql/data'
purge='/usr/local/mha/bin/purge_relay_logs'
$purge --user=$user --password=$passwd --host=$host --disable_relay_log_purge --port=$port --workdir=$work_dir >> /usr/local/mha/purge_relay_logs.log 2>&1
chmod u+x /usr/local/mha/purge_relay_log.sh
Add the script to the OS scheduled tasks (cron).
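A cron entry for the script could be generated roughly as follows. This is a sketch: the helper names cron_line and install_cron_line and the schedule values are assumptions, not part of MHA. Using a different time on each slave is a precaution so that the slaves do not all purge their relay logs at the same moment.

```shell
#!/bin/bash
# Sketch: build a crontab entry for the purge script.
# cron_line and install_cron_line are hypothetical helper names.
cron_line() {
    local minute=$1 hour=$2
    printf '%s %s * * * /usr/local/mha/purge_relay_log.sh\n' "$minute" "$hour"
}

# Install the entry for the current user, replacing any older entry
# that already points at the same script.
install_cron_line() {
    local line
    line=$(cron_line "$1" "$2")
    {
        crontab -l 2>/dev/null | grep -vF '/usr/local/mha/purge_relay_log.sh'
        echo "$line"
    } | crontab -
}
```

For example, install_cron_line 0 4 on one slave and install_cron_line 30 4 on the other would run the purge daily at 04:00 and 04:30. The script already appends its own output to purge_relay_logs.log, so the entry needs no redirection.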
V. Checking and starting MHA
1. Check the SSH configuration
masterha_check_ssh --conf=/usr/local/mha/ha1/ha1.cnf
[root@monitor ha1]# masterha_check_ssh --conf=/usr/local/mha/ha1/ha1.cnf
Thu Aug 25 14:53:30 2016 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Thu Aug 25 14:53:30 2016 - [info] Reading application default configurations from /usr/local/mha/ha1/ha1.cnf..
Thu Aug 25 14:53:30 2016 - [info] Reading server configurations from /usr/local/mha/ha1/ha1.cnf..
Thu Aug 25 14:53:30 2016 - [info] Starting SSH connection tests..
Thu Aug 25 14:53:35 2016 - [debug]
Thu Aug 25 14:53:31 2016 - [debug] Connecting via SSH from root@192.168.137.20(192.168.137.20:22) to root@192.168.137.10(192.168.137.10:22)..
Thu Aug 25 14:53:33 2016 - [debug] ok.
Thu Aug 25 14:53:33 2016 - [debug] Connecting via SSH from root@192.168.137.20(192.168.137.20:22) to root@192.168.137.30(192.168.137.30:22)..
Thu Aug 25 14:53:34 2016 - [debug] ok.
Thu Aug 25 14:53:35 2016 - [debug]
Thu Aug 25 14:53:31 2016 - [debug] Connecting via SSH from root@192.168.137.30(192.168.137.30:22) to root@192.168.137.10(192.168.137.10:22)..
Thu Aug 25 14:53:33 2016 - [debug] ok.
Thu Aug 25 14:53:33 2016 - [debug] Connecting via SSH from root@192.168.137.30(192.168.137.30:22) to root@192.168.137.20(192.168.137.20:22)..
Thu Aug 25 14:53:34 2016 - [debug] ok.
Thu Aug 25 14:53:36 2016 - [debug]
Thu Aug 25 14:53:30 2016 - [debug] Connecting via SSH from root@192.168.137.10(192.168.137.10:22) to root@192.168.137.20(192.168.137.20:22)..
Thu Aug 25 14:53:34 2016 - [debug] ok.
Thu Aug 25 14:53:34 2016 - [debug] Connecting via SSH from root@192.168.137.10(192.168.137.10:22) to root@192.168.137.30(192.168.137.30:22)..
Thu Aug 25 14:53:35 2016 - [debug] ok.
Thu Aug 25 14:53:36 2016 - [info] All SSH connection tests passed successfully.
The output shows that every Node can reach every other Node over SSH.
2. Check the whole replication environment
masterha_check_repl --conf=/usr/local/mha/ha1/ha1.cnf
[root@monitor ha1]# masterha_check_repl --conf=/usr/local/mha/ha1/ha1.cnf
Thu Aug 25 16:09:19 2016 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Thu Aug 25 16:09:19 2016 - [info] Reading application default configurations from /usr/local/mha/ha1/ha1.cnf..
Thu Aug 25 16:09:19 2016 - [info] Reading server configurations from /usr/local/mha/ha1/ha1.cnf..
Thu Aug 25 16:09:19 2016 - [info] MHA::MasterMonitor version 0.55.
Thu Aug 25 16:09:20 2016 - [info] Dead Servers:
Thu Aug 25 16:09:20 2016 - [info] Alive Servers:
Thu Aug 25 16:09:20 2016 - [info] 192.168.137.10(192.168.137.10:3306)
Thu Aug 25 16:09:20 2016 - [info] 192.168.137.20(192.168.137.20:3306)
Thu Aug 25 16:09:20 2016 - [info] 192.168.137.30(192.168.137.30:3306)
Thu Aug 25 16:09:20 2016 - [info] Alive Slaves:
Thu Aug 25 16:09:20 2016 - [info] 192.168.137.20(192.168.137.20:3306) Version=5.6.15-log (oldest major version between slaves) log-bin:enabled
Thu Aug 25 16:09:20 2016 - [info] Replicating from 192.168.137.10(192.168.137.10:3306)
Thu Aug 25 16:09:20 2016 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Aug 25 16:09:20 2016 - [info] 192.168.137.30(192.168.137.30:3306) Version=5.6.15-log (oldest major version between slaves) log-bin:enabled
Thu Aug 25 16:09:20 2016 - [info] Replicating from 192.168.137.10(192.168.137.10:3306)
Thu Aug 25 16:09:20 2016 - [info] Not candidate for the new Master (no_master is set)
Thu Aug 25 16:09:20 2016 - [info] Current Alive Master: 192.168.137.10(192.168.137.10:3306)
Thu Aug 25 16:09:20 2016 - [info] Checking slave configurations..
Thu Aug 25 16:09:20 2016 - [info] read_only=1 is not set on slave 192.168.137.20(192.168.137.20:3306).
Thu Aug 25 16:09:20 2016 - [info] Checking replication filtering settings..
Thu Aug 25 16:09:20 2016 - [info] binlog_do_db= , binlog_ignore_db=
Thu Aug 25 16:09:20 2016 - [info] Replication filtering check ok.
Thu Aug 25 16:09:20 2016 - [info] Starting SSH connection tests..
Thu Aug 25 16:09:25 2016 - [info] All SSH connection tests passed successfully.
Thu Aug 25 16:09:25 2016 - [info] Checking MHA Node version..
Thu Aug 25 16:09:26 2016 - [info] Version check ok.
Thu Aug 25 16:09:26 2016 - [info] Checking SSH publickey authentication settings on the current master..
Thu Aug 25 16:09:27 2016 - [info] HealthCheck: SSH to 192.168.137.10 is reachable.
Thu Aug 25 16:09:29 2016 - [info] Master MHA Node version is 0.54.
Thu Aug 25 16:09:29 2016 - [info] Checking recovery script configurations on the current master..
Thu Aug 25 16:09:29 2016 - [info] Executing command: save_binary_logs --command=test --start_pos=4 --binlog_dir=/mysql/log --output_file=/tmp/save_binary_logs_test --manager_version=0.55 --start_file=mysql-bin.000138
Thu Aug 25 16:09:29 2016 - [info] Connecting to root@192.168.137.10(192.168.137.10)..
Creating /tmp if not exists.. ok.
Checking output directory is accessible or not.. ok.
Binlog found at /mysql/log, up to mysql-bin.000138
Thu Aug 25 16:09:30 2016 - [info] Master setting check done.
Thu Aug 25 16:09:30 2016 - [info] Checking SSH publickey authentication and checking recovery script configurations on all alive slave servers..
Thu Aug 25 16:09:30 2016 - [info] Executing command : apply_diff_relay_logs --command=test --slave_user='root' --slave_host=192.168.137.20 --slave_ip=192.168.137.20 --slave_port=3306 --workdir=/tmp --target_version=5.6.15-log --manager_version=0.55 --relay_log_info=/mysql/data/relay-log.info --relay_dir=/mysql/data/ --slave_pass=xxx
Thu Aug 25 16:09:30 2016 - [info] Connecting to root@192.168.137.20(192.168.137.20:22)..
Checking slave recovery environment settings..
Opening /mysql/data/relay-log.info ... ok.
Relay log found at /mysql/data, up to mysql-relay-bin.000006
Temporary relay log file is /mysql/data/mysql-relay-bin.000006
Testing mysql connection and privileges..Warning: Using a password on the command line interface can be insecure. done.
Testing mysqlbinlog output.. done.
Cleaning up test file(s).. done.
Thu Aug 25 16:09:31 2016 - [info] Executing command : apply_diff_relay_logs --command=test --slave_user='root' --slave_host=192.168.137.30 --slave_ip=192.168.137.30 --slave_port=3306 --workdir=/tmp --target_version=5.6.15-log --manager_version=0.55 --relay_log_info=/mysql/data/relay-log.info --relay_dir=/mysql/data/ --slave_pass=xxx
Thu Aug 25 16:09:31 2016 - [info] Connecting to root@192.168.137.30(192.168.137.30:22)..
Checking slave recovery environment settings..
Opening /mysql/data/relay-log.info ... ok.
Relay log found at /mysql/data, up to mysql-relay-bin.000002
Temporary relay log file is /mysql/data/mysql-relay-bin.000002
Testing mysql connection and privileges..Warning: Using a password on the command line interface can be insecure. done.
Testing mysqlbinlog output.. done.
Cleaning up test file(s).. done.
Thu Aug 25 16:09:32 2016 - [info] Slaves settings check done.
Thu Aug 25 16:09:32 2016 - [info] 192.168.137.10 (current master)
+--192.168.137.20
+--192.168.137.30
Thu Aug 25 16:09:32 2016 - [info] Checking replication health on 192.168.137.20..
Thu Aug 25 16:09:32 2016 - [info] ok.
Thu Aug 25 16:09:32 2016 - [info] Checking replication health on 192.168.137.30..
Thu Aug 25 16:09:32 2016 - [info] ok.
Thu Aug 25 16:09:32 2016 - [info] Checking master_ip_failover_script status:
Thu Aug 25 16:09:32 2016 - [info] /usr/local/mha/ha1/fail_script/master_ip_failover --command=status --ssh_user=root --orig_master_host=192.168.137.10 --orig_master_ip=192.168.137.10 --orig_master_port=3306
IN SCRIPT TEST====/sbin/ifconfig eth0:1 down==/sbin/ifconfig eth0:1 192.168.137.50/24===
Checking the Status of the script.. OK
Thu Aug 25 16:09:32 2016 - [info] OK.
Thu Aug 25 16:09:32 2016 - [warning] shutdown_script is not defined.
Thu Aug 25 16:09:32 2016 - [info] Got exit code 0 (Not master dead).
MySQL Replication Health is OK.
3. Check the MHA Manager status
masterha_check_status --conf=/usr/local/mha/ha1/ha1.cnf
MHA has not been started yet, so the check reports stopped.
4. Start MHA
nohup masterha_manager --conf=/usr/local/mha/ha1/ha1.cnf --ignore_fail_on_start --ignore_last_failover < /dev/null > /usr/local/mha/ha1/start.log 2>&1 &
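--remove_dead_master_conf: after a master failover, the old master's IP is removed from the configuration file. It is not used here for now, because it would leave the ha1.cnf configuration file in a mess.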
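start.log: the log file that receives the output of the command above (it is the redirection target, not a command-line option).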
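--ignore_last_failover: after a master failover the MHA Manager service stops automatically and creates a failover.complete file in the manager_workdir directory (ha1.failover.complete in this setup). MHA cannot be started again until that file is deleted. This parameter tells MHA to ignore the file left behind by the previous failover, which is why --ignore_last_failover is set here. By default, if MHA detects consecutive master failures less than 8 hours apart, it does not perform a failover; the limit exists to avoid a ping-pong effect.
--ignore_fail_on_start: by default the manager does not start when a slave node is down. With --ignore_fail_on_start, MHA starts even when nodes are down: servers marked ignore_fail=1 in the configuration file are ignored at startup.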
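Whether a leftover failover file would still get in the way can be checked before starting the manager. The sketch below assumes that the age of the file is what counts and uses 480 minutes for the 8-hour window; failover_marker_blocks is a hypothetical helper name, not an MHA command.

```shell
#!/bin/bash
# Sketch: does a failover.complete file fall inside the 8-hour window?
# failover_marker_blocks is a hypothetical helper, not part of MHA.
failover_marker_blocks() {
    local marker=$1 window_min=${2:-480}
    # No file means nothing can block.
    [ -f "$marker" ] || return 1
    # find prints the file only if it was modified less than
    # window_min minutes ago.
    [ -n "$(find "$marker" -mmin "-$window_min" 2>/dev/null)" ]
}
```

A call such as failover_marker_blocks /usr/local/mha/ha1/ha1.failover.complete returns success when the file exists and is younger than the window, in which case either the file has to go or --ignore_last_failover has to be passed.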
(1) Check the MHA status again to confirm it is healthy:
[root@monitor ha1]# masterha_check_status --conf=/usr/local/mha/ha1/ha1.cnf
ha1 (pid:6371) is running(0:PING_OK), master:192.168.137.10
[root@monitor ha1]#
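For scripting, the master address can be pulled out of that status line. The following is a sketch that assumes the output has exactly the shape shown above; status_master is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: extract the master address from a status line such as
#   ha1 (pid:6371) is running(0:PING_OK), master:192.168.137.10
# Prints nothing when the line does not report PING_OK.
status_master() {
    sed -n 's/.*is running(0:PING_OK), master:\([^[:space:]]*\).*/\1/p'
}
```

It would be used as masterha_check_status --conf=/usr/local/mha/ha1/ha1.cnf | status_master, with an empty result meaning the manager is not running in the PING_OK state.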
(2) View the startup log
cat manager.log
Thu Aug 25 17:11:50 2016 - [info] 192.168.137.10 (current master)
+--192.168.137.20
+--192.168.137.30
Thu Aug 25 17:11:50 2016 - [info] Checking master_ip_failover_script status:
Thu Aug 25 17:11:50 2016 - [info] /usr/local/mha/ha1/fail_script/master_ip_failover --command=status --ssh_user=root --orig_master_host=192.168.137.10 --orig_master_ip=192.168.137.10 --orig_master_port=3306
IN SCRIPT TEST====/sbin/ifconfig eth0:1 down==/sbin/ifconfig eth0:1 192.168.137.50/24===
Checking the Status of the script.. OK
Thu Aug 25 17:11:50 2016 - [info] OK.
Thu Aug 25 17:11:50 2016 - [warning] shutdown_script is not defined.
Thu Aug 25 17:11:50 2016 - [info] Set master ping interval 1 seconds.
Thu Aug 25 17:11:50 2016 - [info] Set secondary check script: /usr/local/mha/bin/masterha_secondary_check -s backup -s master --user=root --master_host=master --master_ip=192.168.137.10 --master_port=3306
Thu Aug 25 17:11:50 2016 - [info] Starting ping health check on 192.168.137.10(192.168.137.10:3306)..
Thu Aug 25 17:11:50 2016 - [info] Ping(SELECT) succeeded, waiting until MySQL doesn't respond..
[root@monitor ha1]#
(3) Files generated
ha1.master_status.health: created when MHA starts normally
manager.log: the MHA monitoring log
start.log: the log written while MHA starts
5. Stop MHA
masterha_stop --conf=/usr/local/mha/ha1/ha1.cnf
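VI. Failure-handling steps
After a master failover, the MHA service stops automatically.
1. Check the log
Check the failover log to make sure the failover completed normally.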
cat /usr/local/mha/ha1/manager.log
2. Deal with the failed master
Repair the failed master and configure it as a slave of the new master with CHANGE MASTER; the statement can be found in manager.log.
grep "CHANGE MASTER TO MASTER" /usr/local/mha/ha1/manager.log | tail -1
Fri Aug 26 12:04:22 2016 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.137.10', MASTER_PORT=3306, MASTER_LOG_FILE='mysql-bin.000143', MASTER_LOG_POS=22123166, MASTER_USER='repl', MASTER_PASSWORD='xxx';
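The grep above returns the whole log line, timestamp included. A small sketch that keeps only the SQL statement is shown below; last_change_master is a hypothetical helper name. The log shows the password as xxx, so the real replication password presumably has to be substituted before the statement is run on the repaired server.

```shell
#!/bin/bash
# Sketch: print the last CHANGE MASTER statement found in a manager log,
# without the timestamp and log-level prefix.
# last_change_master is a hypothetical helper name.
last_change_master() {
    local log=$1
    sed -n 's/.*Statement should be: \(CHANGE MASTER TO .*;\).*/\1/p' "$log" | tail -1
}
```

For example, last_change_master /usr/local/mha/ha1/manager.log prints just the CHANGE MASTER TO ... ; part of the most recent matching line.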
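Note: make sure the slave's SQL_THREAD and IO_THREAD are running normally. If semi-synchronous replication is configured, also make sure it has started correctly; this can be checked with "show status like '%rpl_%';". See the earlier section on setting up semi-synchronous replication for details.
3. Modify the ha1.cnf configuration file
Change master_host, master_ip and master_port in the "secondary_check_script" option to the new master. If the two machines are configured identically, nothing else needs to change.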
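The edit can also be scripted with sed. The sketch below assumes that the option sits on a single line of ha1.cnf and already carries --master_host, --master_ip and --master_port arguments, as the "Set secondary check script" line in manager.log suggests; set_secondary_check_master is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: point secondary_check_script at a new master.
# Assumes the option is on one line and already has the three
# --master_* arguments; set_secondary_check_master is hypothetical.
set_secondary_check_master() {
    local cnf=$1 host=$2 ip=$3 port=$4
    # -i.bak keeps a backup copy next to the edited file.
    sed -i.bak \
        -e "/^secondary_check_script/ s/--master_host=[^ ]*/--master_host=$host/" \
        -e "/^secondary_check_script/ s/--master_ip=[^ ]*/--master_ip=$ip/" \
        -e "/^secondary_check_script/ s/--master_port=[^ ]*/--master_port=$port/" \
        "$cnf"
}
```

It would be called with the configuration file followed by the new master's host name, IP and port. Only lines starting with secondary_check_script are touched, and the previous version is kept as ha1.cnf.bak.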
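4. Delete the failover file (optional)
Because MHA is started with --ignore_last_failover, it starts even if the file generated by the failover is left in place; without that parameter, the generated file "ha1.failover.complete" has to be deleted first.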
rm -f /usr/local/mha/ha1/ha1.failover.complete
5. Run the checks
Check the SSH configuration:
masterha_check_ssh --conf=/usr/local/mha/ha1/ha1.cnf
Check replication:
masterha_check_repl --conf=/usr/local/mha/ha1/ha1.cnf
Check the status:
masterha_check_status --conf=/usr/local/mha/ha1/ha1.cnf
All of the checks must pass.
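The SSH and replication checks can be chained so that the manager is only started when both succeed. This is a sketch that assumes both commands exit with a non-zero status when a check fails; preflight is a hypothetical helper name. The status check is left out because the manager is still stopped at this point.

```shell
#!/bin/bash
# Sketch: run the SSH and replication checks, stop at the first failure.
# Assumes a non-zero exit status means the check failed.
# preflight is a hypothetical helper name.
preflight() {
    local conf=$1
    masterha_check_ssh --conf="$conf" || { echo "ssh check failed" >&2; return 1; }
    masterha_check_repl --conf="$conf" || { echo "replication check failed" >&2; return 1; }
    echo "checks passed"
}
```

Running preflight /usr/local/mha/ha1/ha1.cnf before the start command in the next step keeps a broken setup from being put under monitoring.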
6. Start MHA
nohup masterha_manager --conf=/usr/local/mha/ha1/ha1.cnf --ignore_fail_on_start --ignore_last_failover < /dev/null > /usr/local/mha/ha1/start.log 2>&1 &
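VII. Simulating a failover
1. Automatic failover
This setup uses asynchronous replication, and 137.20 is the current master. Concurrent inserts are run on 137.20 while the IO threads on 137.10 and 137.30 are stopped. After 137.20 has been under load for a while, the IO thread on 137.30 is started first and the IO thread on 137.10 a few seconds later, so that 137.30 has received more of the binlog than the candidate 137.10.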
master: 137.20 (pos=22497564)
candidate slave: 137.10 (pos=9857376)
slave with the newest relay log: 137.30 (pos=22461852)
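The three positions translate into two gaps that a failover has to close. The arithmetic below is a sketch that assumes all three positions refer to the same binlog file, which the figures above do not state explicitly.

```shell
#!/bin/bash
# Positions quoted above; assumes all three refer to the same binlog file.
master_pos=22497564
candidate_pos=9857376
latest_slave_pos=22461852

# Events the candidate is missing compared with the most advanced slave.
relay_gap=$((latest_slave_pos - candidate_pos))
# Events that no slave has received yet and that exist only on the old master.
binlog_gap=$((master_pos - latest_slave_pos))

echo "candidate behind latest slave: $relay_gap bytes"
echo "latest slave behind master: $binlog_gap bytes"
```

The first gap (12604476 bytes) is what the candidate would be expected to receive from the relay logs of 137.30, and the second (35712 bytes) is what has to be saved from the old master's binlog, which is possible here because the log below reports that SSH to the dead master is still reachable.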
Fri Aug 26 11:57:36 2016 - [warning] Got error on MySQL select ping: 2013 (Lost connection to MySQL server during query)
Fri Aug 26 11:57:36 2016 - [info] Executing SSH check script: save_binary_logs --command=test --start_pos=4 --binlog_dir=/mysql/log --output_file=/tmp/save_binary_logs_test --manager_version=0.55 --binlog_prefix=mysql-bin
Fri Aug 26 11:57:36 2016 - [info] Executing seconary network check script: /usr/local/mha/bin/masterha_secondary_check -s backup -s master --user=root --master_host=master --master_ip=192.168.137.10 --master_port=3306 --user=root --master_host=192.168.137.20 --master_ip=192.168.137.20 --master_port=3306
Fri Aug 26 11:57:37 2016 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '192.168.137.20' (111))
Fri Aug 26 11:57:37 2016 - [warning] Connection failed 1 time(s)..
Fri Aug 26 11:57:38 2016 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '192.168.137.20' (111))
Fri Aug 26 11:57:38 2016 - [warning] Connection failed 2 time(s)..
Fri Aug 26 11:57:38 2016 - [info] HealthCheck: SSH to 192.168.137.20 is reachable.
Monitoring server backup is reachable, Master is not reachable from backup. OK.
Fri Aug 26 11:57:39 2016 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '192.168.137.20' (111))
Fri Aug 26 11:57:39 2016 - [warning] Connection failed 3 time(s)..
Monitoring server master is reachable, Master is not reachable from master. OK.
Fri Aug 26 11:57:41 2016 - [info] Master is not reachable from all other monitoring servers. Failover should start.
Fri Aug 26 11:57:41 2016 - [warning] Master is not reachable from health checker!
Fri Aug 26 11:57:41 2016 - [warning] Master 192.168.137.20(192.168.137.20:3306) is not reachable!
Fri Aug 26 11:57:41 2016 - [warning] SSH is reachable.
Fri Aug 26 11:57:41 2016 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha_default.cnf and /usr/local/mha/ha1/ha1.cnf again, and trying to connect to all servers to check server status..
Fri Aug 26 11:57:41 2016 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Fri Aug 26 11:57:41 2016 - [info] Reading application default configurations from /usr/local/mha/ha1/ha1.cnf..
Fri Aug 26 11:57:41 2016 - [info] Reading server configurations from /usr/local/mha/ha1/ha1.cnf..
Fri Aug 26 11:57:42 2016 - [info] Dead Servers:
Fri Aug 26 11:57:42 2016 - [info] 192.168.137.20(192.168.137.20:3306)
Fri Aug 26 11:57:42 2016 - [info] Alive Servers:
Fri Aug 26 11:57:42 2016 - [info] 192.168.137.10(192.168.137.10:3306)
Fri Aug 26 11:57:42 2016 - [info] 192.168.137.30(192.168.137.30:3306)
Fri Aug 26 11:57:42 2016 - [info] Alive Slaves:
Fri Aug 26 11:57:42 2016 - [info] 192.168.137.10(192.168.137.10:3306) Version=5.6.15-log (oldest major version between slaves) log-bin:enabled
Fri Aug 26 11:57:42 2016 - [info] Replicating from 192.168.137.20(192.168.137.20:3306)
Fri Aug 26 11:57:42 2016 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Aug 26 11:57:42 2016 - [info] 192.168.137.30(192.168.137.30:3306) Version=5.6.15-log (oldest major version between slaves) log-bin:enabled
Fri Aug 26 11:57:42 2016 - [info] Replicating from 192.168.137.20(192.168.137.20:3306)
Fri Aug 26 11:57:42 2016 - [info] Not candidate for the new Master (no_master is set)
Fri Aug 26 11:57:42 2016 - [info] Checking slave configurations..
Fri Aug 26 11:57:42 2016 - [info] read_only=1 is not set on slave 192.168.137.10(192.168.137.10:3306).
Fri Aug 26 11:57:42 2016 - [info] Checking replication filtering settings..
Fri Aug 26 11:57:42 2016 - [info] Replication filtering check ok.
Fri Aug 26 11:57:42 2016 - [info] Master is down!
Fri Aug 26 11:57:42 2016 - [info] Terminating monitoring script.
Fri Aug 26 11:57:42 2016 - [info] Got exit code 20 (Master dead).
Fri Aug 26 11:57:42 2016 - [info] MHA::MasterFailover version 0.55.
Fri Aug 26 11: