Another reason why SQL_SLAVE_SKIP_COUNTER is bad in MySQL
It is common advice in the MySQL world that if replication breaks because an event caused a duplicate key error, or because a row to be updated or deleted was not found, you can run 'STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE;' and be done with it. In some cases this is fine, and you can repair the offending row or statements later. But what if the statement is part of a multi-statement transaction? Then it becomes more interesting, because skipping the offending statement causes the whole transaction to be skipped. This is well documented. Here is an example.
Three rows on the master:
    master> select * from t;
    +----+-----+
    | id | pid |
    +----+-----+
    |  1 |   1 |
    |  2 |   2 |
    |  3 |   3 |
    +----+-----+
Two rows on the slave:
    slave> select * from t;
    +----+-----+
    | id | pid |
    +----+-----+
    |  1 |   1 |
    |  3 |   3 |
    +----+-----+
    2 rows in set (0.00 sec)
Execute a transaction on the master to break replication:
    master> BEGIN;
    Query OK, 0 rows affected (0.00 sec)

    master> DELETE FROM t WHERE id = 1;
    Query OK, 1 row affected (0.00 sec)

    master> DELETE FROM t WHERE id = 2;
    Query OK, 1 row affected (0.00 sec)

    master> DELETE FROM t WHERE id = 3;
    Query OK, 1 row affected (0.00 sec)

    master> COMMIT;
    Query OK, 0 rows affected (0.01 sec)
Broken slave:
    slave> show slave status \G
    *************************** 1. row ***************************
    ...
    Last_SQL_Errno: 1032
    Last_SQL_Error: Could not execute Delete_rows event on table test.t; Can't find record in 't', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000002, end_log_pos 333
    ...
    1 row in set (0.00 sec)
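An attempt to fix replication with the skip counter only causes a bigger inconsistency on the slave. The deletes for id 1 and id 3 are skipped too, so both rows remain even though the master's table is now empty: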
    slave> STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE;
    Query OK, 0 rows affected (0.00 sec)

    slave> select * from t;
    +----+-----+
    | id | pid |
    +----+-----+
    |  1 |   1 |
    |  3 |   3 |
    +----+-----+
    2 rows in set (0.00 sec)
This happens because replication honors transaction boundaries: the skip counter does not stop in the middle of a transaction, so the entire transaction is skipped. Keep this in mind the next time you reach for this workaround on a broken slave. Of course, pt-table-checksum and pt-table-sync can rescue you when inconsistencies occur, but prevention is always better than cure. Make sure to put safeguards in place to prevent your slaves from drifting.
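The transaction boundaries can be seen in the master's binary log before deciding whether a skip is safe. A minimal sketch, run on the master, using the log file name from the error message above; the LIMIT of 20 is an arbitrary assumption and may need adjusting:

```sql
-- On the master: list the first events in the binary log named in Last_SQL_Error.
-- Look for the failing Delete_rows event (End_log_pos 333) and check whether it
-- sits between a BEGIN and the event that commits the transaction.
SHOW BINLOG EVENTS IN 'mysql-bin.000002' LIMIT 20;
```

If the failing event sits inside such a group, SQL_SLAVE_SKIP_COUNTER=1 skips the whole group, not just that one event.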
Lastly, as my colleague pointed out, the example above uses ROW-based replication, but the same thing can happen with STATEMENT-based replication, for example with a duplicate key error. You can optionally fix the error above by temporarily setting slave_exec_mode to IDEMPOTENT, so that errors caused by missing rows are skipped. Then again, this does not apply in all cases, such as an UPDATE statement that cannot be applied because the row is missing on the slave.
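The temporary slave_exec_mode change could look roughly like the following. This is a sketch only, assuming the slave normally runs in STRICT mode and that you check SHOW SLAVE STATUS yourself before switching back:

```sql
-- On the slave: tolerate missing-row errors while the stalled transaction applies.
STOP SLAVE;
SET GLOBAL slave_exec_mode = 'IDEMPOTENT';
START SLAVE;

-- Once SHOW SLAVE STATUS reports the slave has moved past the failing
-- transaction, restore the stricter mode (assumed here to be the original one).
STOP SLAVE;
SET GLOBAL slave_exec_mode = 'STRICT';
START SLAVE;
```

In the ROW-based example above, this should let the deletes for id 1 and id 3 apply while the delete for the missing id 2 is ignored, instead of discarding the whole transaction.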
Here is a demonstration of the problem with STATEMENT-based replication:
    master> select * from t;
    +----+-----+
    | id | pid |
    +----+-----+
    |  4 |   1 |
    |  6 |   3 |
    +----+-----+
    2 rows in set (0.00 sec)

    slave> select * from t;
    +----+-----+
    | id | pid |
    +----+-----+
    |  4 |   1 |
    |  5 |   2 |
    |  6 |   3 |
    +----+-----+
    3 rows in set (0.00 sec)

    master> BEGIN;
    Query OK, 0 rows affected (0.00 sec)

    master> delete from t where id = 4;
    Query OK, 1 row affected (0.00 sec)

    master> insert into t values (5,2);
    Query OK, 1 row affected (0.00 sec)

    master> delete from t where id = 6;
    Query OK, 1 row affected (0.00 sec)

    master> COMMIT;
    Query OK, 0 rows affected (0.15 sec)

    slave> show slave status \G
    *************************** 1. row ***************************
    ...
    Last_SQL_Errno: 1062
    Last_SQL_Error: Error 'Duplicate entry '5' for key 'PRIMARY'' on query. Default database: 'test'. Query: 'insert into t values (5,2)'
    ...
    1 row in set (0.00 sec)

    slave> stop slave; set global sql_slave_skip_counter = 1; start slave;
    Query OK, 0 rows affected (0.05 sec)

    slave> select * from t;
    +----+-----+
    | id | pid |
    +----+-----+
    |  4 |   1 |
    |  5 |   2 |