多執行緒回放+flush tables with read lock 死鎖
阿新 • • 發佈:2018-11-21
文章目錄
多執行緒回放+flush tables with read lock 死鎖
一、場景描述
MySQL-5.7.18 slave例項上夜間進行備份操作時,
processlist 結果,只列出關鍵部分
mysql> show processlist;l +----+-------------+-----------+--------------------+---------+------+--------------------------------------------------------+--------------------------------------------+ | Id | User | Host | db | Command | Time | State | Info | +----+-------------+-----------+--------------------+---------+------+--------------------------------------------------------+--------------------------------------------+ | 3 | root | localhost | test_dead_lock | Query | 2264 | Waiting for commit lock | flush tables with read lock | | 4 | system user | | NULL | Connect | 2708 | Waiting for master to send event | NULL | | 5 | system user | | NULL | Connect | 2289 | Slave has read all relay log; waiting for more updates | NULL | | 6 | system user | | NULL | Connect | 2319 | Waiting for global read lock | insert into ashe(id,name) values(40,'aaa') | | 7 | system user | | NULL | Connect | 2308 | Waiting for preceding transaction to commit | NULL | | 8 | system user | | NULL | Connect | 2708 | Waiting for an event from Coordinator | NULL | | 9 | system user | | NULL | Connect | 2708 | Waiting for an event from Coordinator | NULL | | 10 | system user | | NULL | Connect | 2708 | Waiting for an event from Coordinator | NULL | | 11 | system user | | NULL | Connect | 2708 | Waiting for an event from Coordinator | NULL | | 12 | system user | | NULL | Connect | 2708 | Waiting for an event from Coordinator | NULL | | 13 | system user | | NULL | Connect | 2708 | Waiting for an event from Coordinator | NULL | | 14 | system user | | NULL | Connect | 2708 | Waiting for an event from Coordinator | NULL | | 15 | system user | | NULL | Connect | 2708 | Waiting for an event from Coordinator | NULL | | 16 | system user | | NULL | Connect | 2708 | Waiting for an event from Coordinator | NULL |
slave相關引數設定
mysql> show global variables like '%slave%' -> ; +-------------------------------------------+-----------------------+ | Variable_name | Value | +-------------------------------------------+-----------------------+ | init_slave | | | log_slave_updates | ON | | log_slow_slave_statements | ON | | rpl_semi_sync_master_wait_for_slave_count | 1 | | rpl_semi_sync_master_wait_no_slave | ON | | rpl_semi_sync_slave_enabled | ON | | rpl_semi_sync_slave_trace_level | 32 | | rpl_stop_slave_timeout | 31536000 | | slave_allow_batching | OFF | | slave_checkpoint_group | 512 | | slave_checkpoint_period | 300 | | slave_compressed_protocol | OFF | | slave_exec_mode | STRICT | | slave_load_tmpdir | /data/mysql/mysql_tmp | | slave_max_allowed_packet | 1073741824 | | slave_net_timeout | 30 | | slave_parallel_type | LOGICAL_CLOCK | | slave_parallel_workers | 16 | | slave_pending_jobs_size_max | 16777216 | | slave_preserve_commit_order | ON | | slave_rows_search_algorithms | INDEX_SCAN,HASH_SCAN | | slave_skip_errors | OFF | | slave_sql_verify_checksum | ON | | slave_transaction_retries | 128 | | slave_type_conversions | | | sql_slave_skip_counter | 0 | +-------------------------------------------+-----------------------+
此slave例項沒有為業務提供查詢服務,凌晨會有xtrabackup備份操作。
二、死鎖排查
從processlist中可以看到如下資訊
- 執行緒3是xtrabackup 下發的
flush tables with read lock
被其他執行緒阻塞,一般出現這個問題,證明是有慢語句存在。 - 執行緒6是sql回放執行緒,被執行緒3阻塞
- 執行緒7的狀態為
Waiting for preceding transaction to commit
排查思路
Waiting for preceding transaction to commit
的意思
這條資訊對應如下的PSI_stage_info
PSI_stage_info stage_worker_waiting_for_its_turn_to_commit= { 0, "Waiting for preceding transaction to commit", 0};
檢視合適進入stage_worker_waiting_for_its_turn_to_commit的狀態,是在Commit_order_manager::wait_for_its_turn中
bool Commit_order_manager::wait_for_its_turn(Slave_worker *worker,
bool all)
{
DBUG_ENTER("Commit_order_manager::wait_for_its_turn");
/*
When prior transaction fail, current trx should stop and wait for signal
to rollback itself
*/
if ((all || ending_single_stmt_trans(worker->info_thd, all) || m_rollback_trx) &&
m_workers[worker->id].status == OCS_WAIT)
{
PSI_stage_info old_stage;
mysql_cond_t *cond= &m_workers[worker->id].cond;
THD *thd= worker->info_thd;
DBUG_PRINT("info", ("Worker %lu is waiting for commit signal", worker->id));
mysql_mutex_lock(&m_mutex);
thd->ENTER_COND(cond, &m_mutex,
&stage_worker_waiting_for_its_turn_to_commit,
&old_stage);
其實Commit_order_manager這個類是為了解決slave並行回放時,slave binlog亂序的問題。就是slave的binlog提交順序嚴格按照主庫上的提交順序來,那麼如果兩個事務可以並行回放,並且編號更大的事務執行的比更小的事務要快,則此事務進入order commit階段時必須等待事務編號較小的事務。
- 根據如上所述可以分析出必然有事務未進入到提交階段,可以檢視其他的sql回放執行緒
如下,但是發現此事務已經被flush tables阻塞了,
| 6 | system user | | NULL | Connect | 2319 | Waiting for global read lock | insert into ashe(id,name) values(40,'aaa') |
其實出問題的時候,可以通過查詢全域性的gtid_owned來驗證猜測,如下:
gtid_owned | cdfe45e6-c227-11e8-abf5-001c42bf9720:19#6:20#7
可以看到執行緒6當前執行的事務的gtid的sid為19,執行緒7的為20,執行緒7確實是在等待執行緒6.
3.檢視執行緒6的狀態
Waiting for global read lock
這證明是在sql執行時,開啟表被阻塞。
4.基本上可以定位到問題了,
-
flush tables with read lock 被執行緒7阻塞
-
執行緒7要等待執行緒6,保證順序提交
-
執行緒6等待 flush tables with read lock。
三、解決辦法
猜想一下,能用什麼辦法解決呢?
- 假設一:stop slave
- 假設二:kill flush tables
- 假設三:kill sql回放執行緒
四、如何復現的?
需要對多執行緒複製,事務提交等有比較清晰的認識才可以。