1. 程式人生 > >MySQL Waiting for table metadata lock故障分析

MySQL Waiting for table metadata lock故障分析

源起

線上執行ALTER時,通過show processlist檢視到出現Waiting for table metadata lock ,導致後面的查詢都無法執行。

5217122 | create_table_04 | 172.100.207.148:31291 | finance | Query   | 1829 | Waiting for table metadata lock  | ALTER TABLE `Pay` MODIFY COLUMN `pay`  smallint(6) NULL DEFAULT 0 COMMENT '付??'
5217155 | bx_live_dml     | 172.100.210.4:34730   | finance | Prepare | 1714 | Waiting for table metadata lock  | select * from Pay where (pay =4 or pay =8) and projectId ='CSZY0'

由於當時直接把ALTER給kill了,所以基本沒有現場,所以下面模擬什麼情況MySQL產生這種事故。

MySQL元資料鎖

MySQL DBA對於Waiting for table metadata lock肯定不會陌生,一般都是進行alter操作時被堵住了,導致了我們在show processlist 時,看到執行緒的狀態是在等metadata lock。

為了在併發環境下維護表元資料的資料一致性,在表上有活動事務(顯式或隱式)的時候,不可以對元資料進行寫入操作。因此從MySQL5.5版本開始引入了MDL鎖(metadata lock),來保護表的元資料資訊,用於解決或者保證DDL操作與DML操作之間的一致性。對於引入MDL,其主要解決了2個問題,一個是事務隔離問題,比如在可重複隔離級別下,會話A在2次查詢期間,會話B對錶結構做了修改,兩次查詢結果就會不一致,無法滿足可重複讀的要求;另外一個是資料複製的問題,比如會話A執行了多條更新語句期間,另外一個會話B做了表結構變更並且先提交,就會導致slave在重做時,先重做alter,再重做update時就會出現複製錯誤的現象。

所以在對錶進行上述操作時,如果表上有活動事務(未提交或回滾),請求寫入的會話會等待在Metadata lock wait 。例如下面的這種情形:

若沒有MDL鎖的保護,則事務2可以直接執行DDL操作,並且導致事務1出錯,5.1版本即是如此。5.5版本加入MDL鎖就在於保護這種情況的發生,由於事務1開啟了查詢,那麼獲得了MDL鎖,鎖的模式為SHARED_READ,事務2要執行DDL,則需獲得EXCLUSIVE鎖,兩者互斥,所以事務2需要等待。

注:支援事務的InnoDB引擎表和不支援事務的MyISAM引擎表,都會出現Metadata Lock Wait等待現象。一旦出現Metadata Lock Wait等待現象,後續所有對該表的訪問都會阻塞在該等待上,導致連線堆積,業務受影響。

引起Metadata lock wait出現的場景:Waiting for table metadata lock

場景一:當前有執行DML操作時執行ALTRE操作。

# SESSION A
mysql> insert into sbtest2 select * from sbtest1;

# SESSION B
mysql> alter table sbtest2 add test1 int;   //等待SESSION A執行完;

# SESSION C
mysql> show processlist;
+-----+------+-----------+--------+---------+------+---------------------------------+-------------------------------------------+
| Id  | User | Host      | db     | Command | Time | State                           | Info                                      |
+-----+------+-----------+--------+---------+------+---------------------------------+-------------------------------------------+
| 267 | root | localhost | sbtest | Query   |    7 | Sending data                    | insert into sbtest2 select * from sbtest1 |
| 271 | root | localhost | sbtest | Query   |    3 | Waiting for table metadata lock | alter table sbtest2 add test1 int         |
| 272 | root | localhost | NULL   | Query   |    0 | starting                        | show processlist                          |
+-----+------+-----------+--------+---------+------+---------------------------------+-------------------------------------------+
3 rows in set (0.00 sec)

# SESSION D
mysql> select * from sbtest2 limit 10;    //等待元資料鎖;

# SESSION E
mysql> show processlist;
+-----+------+-----------+--------+---------+------+---------------------------------+-------------------------------------------+
| Id  | User | Host      | db     | Command | Time | State                           | Info                                      |
+-----+------+-----------+--------+---------+------+---------------------------------+-------------------------------------------+
| 267 | root | localhost | sbtest | Query   |   20 | Sending data                    | insert into sbtest2 select * from sbtest1 |
| 271 | root | localhost | sbtest | Query   |   13 | Waiting for table metadata lock | alter table sbtest2 add test1 int         |
| 272 | root | localhost | NULL   | Query   |    0 | starting                        | show processlist                          |
| 308 | root | localhost | sbtest | Query   |    3 | Waiting for table metadata lock | select * from sbtest2 limit 10            |
+-----+------+-----------+--------+---------+------+---------------------------------+-------------------------------------------+
4 rows in set (0.00 sec)

從上述例子可以看出,我們在執行DDL語句的時候得事先看一下,程序中是否已經存在某些DML語句佔用了表的元資料鎖,這樣會導致DDL語句處於鎖等待狀態。一旦出現Waiting for table metadata lock等待現象,後續所有對該表的訪問都會阻塞在該等待上,包括讀操作,導致連線堆積,業務受影響。

場景二:當前有對錶的長時間查詢或使用mysqldump/mysqlpump時,使用alter會被堵住。

# SESSION A
mysql> select *,sleep(10) from sbtest2;

# SESSION B
mysql> alter table sbtest2 add test2 int;   //等待SESSION A執行完;

# SESSION C
mysql> show processlist;
+-----+------+-----------+--------+---------+------+---------------------------------+---------------------------------------+
| Id  | User | Host      | db     | Command | Time | State                           | Info                                  |
+-----+------+-----------+--------+---------+------+---------------------------------+---------------------------------------+
| 267 | root | localhost | sbtest | Query   |   12 | User sleep                      | select *,sleep(10) from sbtest2       |
| 271 | root | localhost | sbtest | Query   |    8 | Waiting for table metadata lock | alter table sbtest2 add test3 int     |
| 272 | root | localhost | NULL   | Query   |    0 | starting                        | show processlist                      |
| 311 | root | localhost | NULL   | Query   |    3 | Waiting for table metadata lock | select * from sbtest.sbtest2 limit 10 |
+-----+------+-----------+--------+---------+------+---------------------------------+---------------------------------------+
4 rows in set (0.00 sec)

場景三:顯示或者隱式開啟事務後未提交或回滾,比如查詢完成後未提交或者回滾,使用alter會被堵住。

# SESSION A
mysql> begin;
mysql> select * from sbtest2;

# SESSION B
mysql> alter table sbtest2 add test2 int;   //等待SESSION A執行完;

# SESSION C
mysql> show processlist;
+-----+------+-----------+--------+---------+------+---------------------------------+-----------------------------------+
| Id  | User | Host      | db     | Command | Time | State                           | Info                              |
+-----+------+-----------+--------+---------+------+---------------------------------+-----------------------------------+
| 267 | root | localhost | sbtest | Sleep   |   36 |                                 | NULL                              |
| 271 | root | localhost | sbtest | Query   |   30 | Waiting for table metadata lock | alter table sbtest2 add test2 int |
| 272 | root | localhost | NULL   | Query   |    0 | starting                        | show processlist                  |
+-----+------+-----------+--------+---------+------+---------------------------------+-----------------------------------+
3 rows in set (0.00 sec)

場景四:表上有失敗的查詢事務,比如查詢不存在的列,語句失敗返回,但是事務沒有提交,此時alter仍然會被堵住。

# SESSION A
mysql> begin;
mysql> select error from sbtest2; 
ERROR 1054 (42S22): Unknown column 'error' in 'field list'

# SESSION B
mysql> alter table sbtest2 add test3 int;    //等待SESSION A提交或回滾;

# SESSION C
mysql> show processlist;
+-----+------+-----------+--------+---------+------+---------------------------------+-----------------------------------+
| Id  | User | Host      | db     | Command | Time | State                           | Info                              |
+-----+------+-----------+--------+---------+------+---------------------------------+-----------------------------------+
| 267 | root | localhost | sbtest | Sleep   |    7 |                                 | NULL                              |
| 271 | root | localhost | sbtest | Query   |    3 | Waiting for table metadata lock | alter table sbtest2 add test3 int |
| 272 | root | localhost | NULL   | Query   |    0 | starting                        | show processlist                  |
| 311 | root | localhost | NULL   | Sleep   |  413 |                                 | NULL                              |
+-----+------+-----------+--------+---------+------+---------------------------------+-----------------------------------+
4 rows in set (0.00 sec)

# SESSION D
mysql> select * from information_schema.innodb_trx;
Empty set (0.00 sec)

其實SESSION A中的事務並未開啟,但是由於select獲取表元資料的語句,語法上是有效的,雖然執行失敗了,但是任然不會釋放元資料鎖,故而導致SESSION B的alter動作被阻塞。

通過SESSION D檢視當前開啟事務時,你會發現沒有,從而找不到原因。所以當出現這種場景時,如何判斷是哪個程序導致的呢,我們可以嘗試查看錶performance_schema. events_statements_current,分析程序狀態來進行判斷。

mysql> select * from performance_schema. events_statements_current\G
*************************** 1. row ***************************
              THREAD_ID: 293
               EVENT_ID: 32
           END_EVENT_ID: 32
             EVENT_NAME: statement/sql/select
                 SOURCE: socket_connection.cc:101
            TIMER_START: 212721717099954000
              TIMER_END: 212721717213807000
             TIMER_WAIT: 113853000
              LOCK_TIME: 0
               SQL_TEXT: select error from sbtest2
                 DIGEST: 0bbb2d5d1be45e77debea68111264885
            DIGEST_TEXT: SELECT ERROR FROM `sbtest2` 
         CURRENT_SCHEMA: sbtest
            OBJECT_TYPE: NULL
          OBJECT_SCHEMA: NULL
            OBJECT_NAME: NULL
  OBJECT_INSTANCE_BEGIN: NULL
            MYSQL_ERRNO: 1054
      RETURNED_SQLSTATE: 42S22
           MESSAGE_TEXT: Unknown column 'error' in 'field list'
                 ERRORS: 1

然後找到其sid, kill掉該session,也可以kill掉DDL所在的session解決可以解決此問題。

另外,測試時SESSION A要顯式開啟一個事務,否則查詢會隱式回滾結束,無法重現上面的場景。SESSION B執行alter後,沒有立即阻塞住,而是立馬開始copy to tmp table,這個過程結束後,才進行了MDL鎖等待。這怎麼解釋呢,應該是執行alter操作主要分為建立臨時新表->插入老表的資料->臨時新表rename to老表三個步驟,在這種情況下,到最後一步才需要MDL鎖,所以copy過程中不會阻塞。由於沒有查詢在進行,而且查詢也沒有進入innodb層 (失敗返回),所以show processlist和information_schema.innodb_trx沒有可以參考的資訊。

出現以上幾種情況時,這個時候如果進行如下操作就會引起MDL:

  • 建立、刪除索引。
  • 修改表結構。
  • 表維護操作(optimize table、repair table等)。
  • 刪除表。
  • 獲取表上表級寫鎖 (lock table tab_name write)。

使用Profile分析場景三:顯示或者隱式開啟事務後未提交或回滾,比如查詢完成後未提交或者回滾,使用alter會被堵住

# SESSION A
mysql> set profiling=on;
mysql> begin;
mysql> select * from sbtest.sbtest2 limit 1;

# SESSION B
mysql> set profiling=on;
mysql> alter table sbtest.sbtest5 add test2 int;   //等待SESSION A執行完;

# SESSION C
mysql> set profiling=on;
mysql> select * from sbtest.sbtest2 limit 1;

# SESSION D
mysql> set profiling=on;
mysql> show processlist;
+-----+------+-----------+------+---------+------+---------------------------------+------------------------------------------+
| Id  | User | Host      | db   | Command | Time | State                           | Info                                     |
+-----+------+-----------+------+---------+------+---------------------------------+------------------------------------------+
| 325 | root | localhost | NULL | Query   |   25 | Waiting for table metadata lock | alter table sbtest.sbtest2 add test5 int |
| 326 | root | localhost | NULL | Query   |    3 | Waiting for table metadata lock | select * from sbtest.sbtest2 limit 1     |
| 327 | root | localhost | NULL | Query   |    0 | starting                        | show processlist                         |
| 328 | root | localhost | NULL | Sleep   |   50 |                                 | NULL                                     |
+-----+------+-----------+------+---------+------+---------------------------------+------------------------------------------+
4 rows in set (0.00 sec)

然後回滾SESSION A,等待SESSION B和SESSION C執行完,檢視profile。

檢視SESSION A

# SESSION A
mysql> show profiles;
+----------+-------------+------------------------------------------+
| Query_ID | Duration    | Query                                    |
+----------+-------------+------------------------------------------+
|        1 | 42.81646375 | alter table sbtest.sbtest2 add test5 int |
+----------+-------------+------------------------------------------+
1 row in set, 1 warning (0.00 sec)

mysql> show profile for query 1;
+--------------------------------+----------+
| Status                         | Duration |
+--------------------------------+----------+
| starting                       | 0.000069 |
| checking permissions           | 0.000004 |
| checking permissions           | 0.000004 |
| init                           | 0.000005 |
| Opening tables                 | 0.000128 |
| setup                          | 0.000032 |
| creating table                 | 0.000654 |
| After create                   | 0.000053 |
| Waiting for table metadata loc | 1.000069 |
.....................
| After create                   | 0.000013 |
| Waiting for table metadata loc | 0.871435 |
| After create                   | 0.000042 |
| System lock                    | 0.000013 |
| preparing for alter table      | 0.002475 |
| altering table                 | 9.752928 |
| committing alter table to stor | 0.185624 |
| end                            | 0.000021 |
| query end                      | 0.000010 |
| closing tables                 | 0.000007 |
| freeing items                  | 0.000020 |
| cleaning up                    | 0.000011 |
+--------------------------------+----------+
83 rows in set, 1 warning (0.00 sec)

檢視SESSION C

mysql> show profiles;
+----------+-------------+--------------------------------------+
| Query_ID | Duration    | Query                                |
+----------+-------------+--------------------------------------+
|        1 | 10.75216050 | select * from sbtest.sbtest2 limit 1 |
+----------+-------------+--------------------------------------+
1 row in set, 1 warning (0.00 sec)

mysql> show profile for query 1;
+--------------------------------+-----------+
| Status                         | Duration  |
+--------------------------------+-----------+
| starting                       |  0.000080 |
| checking permissions           |  0.000007 |
| Opening tables                 |  0.000012 |
| Waiting for table metadata loc | 10.751829 |
| Opening tables                 |  0.000094 |
| init                           |  0.000019 |
| System lock                    |  0.000010 |
| optimizing                     |  0.000004 |
| statistics                     |  0.000011 |
| preparing                      |  0.000009 |
| executing                      |  0.000003 |
| Sending data                   |  0.000040 |
| end                            |  0.000006 |
| query end                      |  0.000008 |
| closing tables                 |  0.000008 |
| freeing items                  |  0.000014 |
| cleaning up                    |  0.000009 |
+--------------------------------+-----------+
17 rows in set, 1 warning (0.00 sec)

從上述測試可以看出,SESSION C需要開啟表時碰到了元資料鎖。MySQL不論SESSION A執行的是select還是delete,此時alter table語句無法獲取到metadata獨佔鎖,會進行等待;所以會影響SESSION C的讀取。

這是最基本的一種情形,這個和MySQL 5.6中的online ddl並不衝突。一般alter table的操作過程中,在after create步驟會獲取metadata獨佔鎖,當進行到altering table的過程時(通常是最花時間的步驟),對該表的讀寫都可以正常進行,這就是online ddl的表現,並不會像之前在整個alter table過程中阻塞寫入,當然並不是所有ALTER語句都支援online ddl。

總之,alter table的語句是很危險的(其實他的危險其實是未提交事物或者長事務導致的),在操作之前最好確認對要操作的表沒有任何進行中的操作、沒有未提交事務、也沒有顯式事務中的報錯語句。如果有alter table的維護任務,在無人監管的時候執行,最好通過lock_wait_timeout設定好超時時間,避免長時間的metedata鎖等待。

原文來自微信公眾號:運維那點事