[MDEV-8196] Stopping slave IO thread between a row event and COMMIT causes data inconsistency between master and slave Created: 2015-05-21 Updated: 2022-09-08 |
|
| Status: | Open |
| Project: | MariaDB Server |
| Component/s: | Replication |
| Affects Version/s: | 5.5, 10.0, 10.1 |
| Fix Version/s: | 10.1 |
| Type: | Bug | Priority: | Major |
| Reporter: | Elena Stepanova | Assignee: | Kristian Nielsen |
| Resolution: | Unresolved | Votes: | 1 |
| Labels: | None | ||
| Description |
|
Note: the test below is extremely ugly; it was made up just to confirm the theory. The test is all about races, so it quite likely won't work on some machines.

The test generates a binary log with a bunch of tiny row events, BEGINs and COMMITs. It looks like this:
Also, the test stops the IO thread at a random moment, waits until the SQL thread has executed whatever was relayed to it, stops the SQL thread as well, and then restarts both threads.

The IO thread can stop at different points in this binary log. When it stops before BEGIN, after BEGIN, or after COMMIT, everything is fine. The problem starts when the IO thread stops after the row event but before COMMIT, e.g. at position 74558 in the snippet above. Then the SQL thread executes the event, but does not properly finish it because there is no COMMIT. Then everything stops, and when the threads are restarted, the event gets executed again, naturally causing data inconsistency and/or replication abort.

In non-parallel replication this situation is accompanied by a long list of complaints in the slave error log. It throws multiple warnings
... then the fatal error
... then yet another error
... and only after replication restart does it end up with the mentioned replication abort and/or data inconsistency. So, even though nothing good comes out of this, the server's reputation is at least protected by the disclaimers in the error log.

With parallel replication, it's different. No warnings about incomplete groups or errors about the relay log are raised; the slave jumps directly to the replication restart and the subsequent abort and/or data inconsistency.

So, I see two problems here:
|
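The failure sequence in the description can be illustrated with a toy model. This is only a sketch under stated assumptions, not MariaDB's implementation: the `Slave` class, its fields, and `simulate` are hypothetical names, and the model assumes that the slave's restart position advances only at COMMIT and that a partially applied group may or may not be undone when the threads stop.

```python
from dataclasses import dataclass, field

# Hypothetical master binlog: two tiny transactions, one row event each.
MASTER_LOG = [("BEGIN",), ("ROW", 1), ("COMMIT",),
              ("BEGIN",), ("ROW", 2), ("COMMIT",)]


@dataclass
class Slave:
    """Toy replication slave (illustrative only, not MariaDB code)."""
    relay_log: list = field(default_factory=list)  # events fetched by the IO thread
    table: list = field(default_factory=list)      # rows applied by the SQL thread
    io_pos: int = 0          # next master event the IO thread will fetch
    sql_pos: int = 0         # next relay-log event the SQL thread will execute
    group_start: int = 0     # restart position: end of the last committed group
    committed_rows: int = 0  # number of rows covered by a COMMIT

    def io_fetch(self, upto):
        """IO thread: copy master events [io_pos, upto) into the relay log."""
        self.relay_log.extend(MASTER_LOG[self.io_pos:upto])
        self.io_pos = upto

    def sql_apply(self):
        """SQL thread: execute everything currently in the relay log."""
        while self.sql_pos < len(self.relay_log):
            event = self.relay_log[self.sql_pos]
            if event[0] == "ROW":
                self.table.append(event[1])
            self.sql_pos += 1
            if event[0] == "COMMIT":
                # Assumption: the restart position only moves at COMMIT.
                self.group_start = self.sql_pos
                self.committed_rows = len(self.table)

    def stop_and_restart(self, rollback_partial):
        """Stop both threads, then resume from the last committed group."""
        if rollback_partial:
            # Undo rows from the incomplete group before replaying it.
            del self.table[self.committed_rows:]
        self.sql_pos = self.group_start


def simulate(stop_at, rollback_partial=False):
    """Stop the IO thread after `stop_at` events, restart, and finish."""
    slave = Slave()
    slave.io_fetch(stop_at)
    slave.sql_apply()
    slave.stop_and_restart(rollback_partial)
    slave.io_fetch(len(MASTER_LOG))
    slave.sql_apply()
    return slave.table


if __name__ == "__main__":
    for stop_at in range(len(MASTER_LOG) + 1):
        print(stop_at, simulate(stop_at))
```

In this model only `simulate(5)`, which stops after the second row event but before its COMMIT, leaves a duplicated row; stopping before BEGIN, after BEGIN, or after COMMIT replays cleanly, and passing `rollback_partial=True` removes the duplicate.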
| Comments |
| Comment by Phil Sweeney [ 2016-03-01 ] |
|
Just experienced this in production (running MariaDB 10.1.12) on a controlled shutdown:

2016-03-01 4:00:03 139962264959744 [Note] Slave I/O thread exiting, read up to log 'log-bin.002820', position 6460292

When the service finished starting up again (as part of an automated process), it immediately got this SQL error:

2016-03-01 4:00:11 140569318804224 [ERROR] Slave SQL: Could not execute Delete_rows_v1 event on table <table name>; Can't find record in '<table name>', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log log-bin.002745, end_log_pos 47635199, Gtid 0-10-514024699, Internal MariaDB error code: 1032

I skipped 1 statement and it ran OK after that. |