1. 程式人生 > >多執行緒回放+flush tables with read lock 死鎖

多執行緒回放+flush tables with read lock 死鎖

文章目錄

多執行緒回放+flush tables with read lock 死鎖

一、場景描述

MySQL-5.7.18 slave例項上夜間進行備份操作時,

processlist 結果,只列出關鍵部分

mysql> show processlist;l
+----+-------------+-----------+--------------------+---------+------+--------------------------------------------------------+--------------------------------------------+
| Id | User        | Host      | db                 | Command | Time | State                                                  | Info                                       |
+----+-------------+-----------+--------------------+---------+------+--------------------------------------------------------+--------------------------------------------+
|  3 | root        | localhost | test_dead_lock     | Query   | 2264 | Waiting for commit lock                                | flush tables with read lock                |
|  4 | system user |           | NULL               | Connect | 2708 | Waiting for master to send event                       | NULL                                       |
|  5 | system user |           | NULL               | Connect | 2289 | Slave has read all relay log; waiting for more updates | NULL                                       |
|  6 | system user |           | NULL               | Connect | 2319 | Waiting for global read lock                           | insert into ashe(id,name) values(40,'aaa') |
|  7 | system user |           | NULL               | Connect | 2308 | Waiting for preceding transaction to commit            | NULL                                       |
|  8 | system user |           | NULL               | Connect | 2708 | Waiting for an event from Coordinator                  | NULL                                       |
|  9 | system user |           | NULL               | Connect | 2708 | Waiting for an event from Coordinator                  | NULL                                       |
| 10 | system user |           | NULL               | Connect | 2708 | Waiting for an event from Coordinator                  | NULL                                       |
| 11 | system user |           | NULL               | Connect | 2708 | Waiting for an event from Coordinator                  | NULL                                       |
| 12 | system user |           | NULL               | Connect | 2708 | Waiting for an event from Coordinator                  | NULL                                       |
| 13 | system user |           | NULL               | Connect | 2708 | Waiting for an event from Coordinator                  | NULL                                       |
| 14 | system user |           | NULL               | Connect | 2708 | Waiting for an event from Coordinator                  | NULL                                       |
| 15 | system user |           | NULL               | Connect | 2708 | Waiting for an event from Coordinator                  | NULL                                       |
| 16 | system user |           | NULL               | Connect | 2708 | Waiting for an event from Coordinator                  | NULL                                       |

slave相關引數設定

mysql> show global variables like '%slave%'
    -> ;
+-------------------------------------------+-----------------------+
| Variable_name                             | Value                 |
+-------------------------------------------+-----------------------+
| init_slave                                |                       |
| log_slave_updates                         | ON                    |
| log_slow_slave_statements                 | ON                    |
| rpl_semi_sync_master_wait_for_slave_count | 1                     |
| rpl_semi_sync_master_wait_no_slave        | ON                    |
| rpl_semi_sync_slave_enabled               | ON                    |
| rpl_semi_sync_slave_trace_level           | 32                    |
| rpl_stop_slave_timeout                    | 31536000              |
| slave_allow_batching                      | OFF                   |
| slave_checkpoint_group                    | 512                   |
| slave_checkpoint_period                   | 300                   |
| slave_compressed_protocol                 | OFF                   |
| slave_exec_mode                           | STRICT                |
| slave_load_tmpdir                         | /data/mysql/mysql_tmp |
| slave_max_allowed_packet                  | 1073741824            |
| slave_net_timeout                         | 30                    |
| slave_parallel_type                       | LOGICAL_CLOCK         |
| slave_parallel_workers                    | 16                    |
| slave_pending_jobs_size_max               | 16777216              |
| slave_preserve_commit_order               | ON                    |
| slave_rows_search_algorithms              | INDEX_SCAN,HASH_SCAN  |
| slave_skip_errors                         | OFF                   |
| slave_sql_verify_checksum                 | ON                    |
| slave_transaction_retries                 | 128                   |
| slave_type_conversions                    |                       |
| sql_slave_skip_counter                    | 0                     |
+-------------------------------------------+-----------------------+

此slave例項沒有為業務提供查詢服務,凌晨會有xtrabackup備份操作。

二、死鎖排查

從processlist中可以看到如下資訊

  1. 執行緒3是xtrabackup 下發的flush tables with read lock 被其他執行緒阻塞,一般出現這個問題,證明是有慢語句存在。
  2. 執行緒6是sql回放執行緒,被執行緒3阻塞
  3. 執行緒7的狀態為 Waiting for preceding transaction to commit

排查思路

  1. Waiting for preceding transaction to commit的意思
    這條資訊對應如下的PSI_stage_info
PSI_stage_info stage_worker_waiting_for_its_turn_to_commit= { 0, "Waiting for preceding transaction to commit", 0};

檢視合適進入stage_worker_waiting_for_its_turn_to_commit的狀態,是在Commit_order_manager::wait_for_its_turn中

bool Commit_order_manager::wait_for_its_turn(Slave_worker *worker,
                                                  bool all)
{
  DBUG_ENTER("Commit_order_manager::wait_for_its_turn");

  /*
    When prior transaction fail, current trx should stop and wait for signal
    to rollback itself
  */
  if ((all || ending_single_stmt_trans(worker->info_thd, all) || m_rollback_trx) &&
      m_workers[worker->id].status == OCS_WAIT)
  {
    PSI_stage_info old_stage;
    mysql_cond_t *cond= &m_workers[worker->id].cond;
    THD *thd= worker->info_thd;

    DBUG_PRINT("info", ("Worker %lu is waiting for commit signal", worker->id));

    mysql_mutex_lock(&m_mutex);
    thd->ENTER_COND(cond, &m_mutex,
                    &stage_worker_waiting_for_its_turn_to_commit,
                    &old_stage);

其實Commit_order_manager這個類是為了解決slave並行回放時,slave binlog亂序的問題。就是slave的binlog提交順序嚴格按照主庫上的提交順序來,那麼如果兩個事務可以並行回放,並且編號更大的事務執行的比更小的事務要快,則此事務進入order commit階段時必須等待事務編號較小的事務。

  1. 根據如上所述可以分析出必然有事務未進入到提交階段,可以檢視其他的sql回放執行緒
    如下,但是發現此事務已經被flush tables阻塞了,
|  6 | system user |           | NULL               | Connect | 2319 | Waiting for global read lock                           | insert into ashe(id,name) values(40,'aaa') |

其實出問題的時候,可以通過查詢全域性的gtid_owned來驗證猜測,如下:

 gtid_owned                       | cdfe45e6-c227-11e8-abf5-001c42bf9720:19#6:20#7

可以看到執行緒6當前執行的事務的gtid的sid為19,執行緒7的為20,執行緒7確實是在等待執行緒6.

3.檢視執行緒6的狀態

 Waiting for global read lock

這證明是在sql執行時,開啟表被阻塞。

4.基本上可以定位到問題了,

  • flush tables with read lock 被執行緒7阻塞

  • 執行緒7要等待執行緒6,保證順序提交

  • 執行緒6等待 flush tables with read lock。

三、解決辦法

猜想一下,能用什麼辦法解決呢?

  • 假設一:stop slave
  • 假設二:kill flush tables
  • 假設三:kill sql回放執行緒

四、如何復現的?

需要對多執行緒複製,事務提交等有比較清晰的認識才可以。