Details
-
Bug
-
Status: Closed
-
Critical
-
Resolution: Fixed
-
10.0.12
-
None
Description
When the master crashes in the middle of writing a multi-statement transaction to the binlog, after the slave has already received some events from that transaction, the IO thread reconnects to the restarted master and assumes that it will re-download the same binlog events. But the master will actually either send nothing (if no new transactions have executed) or send completely different events from new transactions, which leaves the data on the slave completely different from the data on the master.
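To make the failure mode concrete, here is a minimal toy model of it, not the server code. Events are plain strings, and `queue_after_reconnect` and its `skip_count` parameter are hypothetical names made up for illustration. It assumes the slave discards the first `skip_count` events after reconnecting because it expects them to repeat the partial transaction it already holds.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy model only: events are plain strings, not real binlog events.
using EventStream = std::vector<std::string>;

// Hypothetical helper: returns what the slave would queue after a reconnect
// if it blindly skips the first `skip_count` events, assuming they repeat the
// partial transaction it received before the master crashed.
EventStream queue_after_reconnect(const EventStream &sent_by_master,
                                  std::size_t skip_count)
{
  EventStream queued;
  for (std::size_t i= 0; i < sent_by_master.size(); i++)
  {
    if (i < skip_count)
      continue;                        // treated as "already received"
    queued.push_back(sent_by_master[i]);
  }
  // The slave can never queue more events than the master sent.
  assert(queued.size() <= sent_by_master.size());
  return queued;
}
```

If the restarted master sends a brand-new transaction of three events and the slave skips two, only the last event is queued. That event is then appended to an unrelated partial transaction in the relay log, which is the kind of divergence described above.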
I'd bet the root cause of the problem is in how the IO thread reconnects when GTID-based replication is turned on, specifically in these few lines of code starting at sql/slave.cc:5310:
    /*
      Do not queue any format description event that we receive after a
      reconnect where we are skipping over a partial event group received
      before the reconnect.

      (If we queued such an event, and it was the first format_description
      event after master restart, the slave SQL thread would think that
      the partial event group before it in the relay log was from a
      previous master crash and should be rolled back).
    */
    if (unlikely(mi->gtid_reconnect_event_skip_count && !mi->gtid_event_seen))
      gtid_skip_enqueue= true;
In the scenario I described above, the SQL thread actually must roll back the active transaction.
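The distinction being argued for can be sketched as a small decision function. This is only an illustration of the claim, not a proposed fix. `ReconnectState` is a simplified stand-in for the two `mi` fields in the quoted condition, and `FdeAction`, `on_format_description` and the `master_restarted` input are hypothetical. How the IO thread would actually detect a master restart is not covered here and is simply assumed to be known.

```cpp
// Hypothetical, simplified stand-in for the two fields used in the quoted
// condition; this is not the real class from the server source.
struct ReconnectState
{
  unsigned long gtid_reconnect_event_skip_count;
  bool gtid_event_seen;
};

enum class FdeAction
{
  QUEUE,              // write the format description event to the relay log
  SKIP_ENQUEUE,       // what the quoted condition does today
  QUEUE_AND_ROLLBACK  // what this report argues a master crash requires
};

// `master_restarted` is an assumed input; detecting it is out of scope here.
FdeAction on_format_description(const ReconnectState &st, bool master_restarted)
{
  // Same test as the quoted condition: we are still skipping over a partial
  // event group and have not yet seen a GTID event after the reconnect.
  bool skipping_partial_group=
    st.gtid_reconnect_event_skip_count != 0 && !st.gtid_event_seen;

  if (!skipping_partial_group)
    return FdeAction::QUEUE;
  if (master_restarted)
    return FdeAction::QUEUE_AND_ROLLBACK;  // partial group will never complete
  return FdeAction::SKIP_ENQUEUE;          // plain reconnect, same binlog
}
```

With a skip count of 2 and no GTID event seen yet, a plain reconnect yields `SKIP_ENQUEUE`, while a reconnect to a restarted master yields `QUEUE_AND_ROLLBACK`, so the SQL thread can discard the partial event group.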
The attached patch makes it possible to emulate this scenario. Apply it, run the rpl_gtid_crash test, and look at the results of the last two SELECTs: they will differ between master and slave.
I will look into a way to fix this problem myself, but would appreciate any help. I'll attach a patch if I manage to find a fix before anyone on the MariaDB side does.