MariaDB Server / MDEV-6462

Slave replicating using GTID doesn't recover correctly when master crashes in the middle of transaction

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 10.0.12
    • Fix Version/s: 10.0.14
    • Component/s: None

    Description

      When the master crashes in the middle of writing a multi-statement transaction to the binlog, so that the slave has already received some events from this transaction, the IO thread reconnects to the restarted master and thinks it will re-download the same binlog events. But the master will actually either send nothing (if no new transactions have been executed), or send completely different events from new transactions, which will result in completely different data on the slave compared to the master.

      I'd bet the root cause of the problem is in how the IO thread reconnects when GTID-based replication is turned on, specifically in these few lines of code starting at sql/slave.cc:5310:

          /*
            Do not queue any format description event that we receive after a
            reconnect where we are skipping over a partial event group received
            before the reconnect.
       
            (If we queued such an event, and it was the first format_description
            event after master restart, the slave SQL thread would think that
            the partial event group before it in the relay log was from a
            previous master crash and should be rolled back).
          */
          if (unlikely(mi->gtid_reconnect_event_skip_count && !mi->gtid_event_seen))
              gtid_skip_enqueue= true;

      In the scenario described above, the SQL thread actually must roll back the active transaction.
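The divergence can be modeled with a toy simulation (plain Python, not MariaDB code; the event names and the skip behavior are simplified for illustration): the IO thread's reconnect skip counter swallows the first events of the master's new transaction, so the slave ends up committing the orphaned half of the old one.

```python
# Toy model of the bug described above; not MariaDB code. An event group
# is BEGIN ... COMMIT, and events are plain strings.

def apply_transactional(events):
    """A transactional applier: only COMMIT makes a group durable, so a
    trailing partial group is implicitly rolled back."""
    committed, pending = [], []
    for ev in events:
        if ev == "BEGIN":
            pending = []
        elif ev == "COMMIT":
            committed.extend(pending)
            pending = []
        else:
            pending.append(ev)
    return committed

# The slave got half a transaction, then the master crashed and restarted.
partial_group = ["BEGIN", "INSERT t1 VALUES (1)"]         # COMMIT never arrives
new_events = ["BEGIN", "INSERT t2 VALUES (9)", "COMMIT"]  # post-restart txn

# Buggy IO thread: it thinks it is re-downloading the events it already
# queued, so it skips that many events of the NEW transaction and stitches
# the new COMMIT onto the orphaned group in the relay log.
skip = len(partial_group)
buggy_relay_log = partial_group + new_events[skip:]

assert apply_transactional(new_events) == ["INSERT t2 VALUES (9)"]       # master
assert apply_transactional(buggy_relay_log) == ["INSERT t1 VALUES (1)"]  # slave
# The slave commits a row the master rolled back, and misses t2 entirely.
```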

      The attached patch allows emulating this scenario. Apply it, run the rpl_gtid_crash test, and look at the results of the last two SELECTs – they will differ between master and slave.

      I will look into a way to fix this problem myself, but would appreciate any help. I'll attach a patch if I manage to find a fix before anyone on the MariaDB side does.

      Attachments

        Activity

          pivanof Pavel Ivanov created issue -
          elenst Elena Stepanova made changes -
          Fix Version/s 10.0 [ 16000 ]
          Assignee Kristian Nielsen [ knielsen ]
          Labels gtid
          pivanof Pavel Ivanov added a comment -

          Attaching a proposed fix together with the revised and expanded test case. The fix seems to work well, but please check whether it's appropriate for non-standard use cases, e.g. the parallel slave.

          pivanof Pavel Ivanov made changes -
          Attachment fix_reconnect_crashed_master.txt [ 32400 ]
          pivanof Pavel Ivanov made changes -
          Attachment fix_reconnect_crashed_master.txt [ 32400 ]
          pivanof Pavel Ivanov added a comment -

          Attaching the fix, modified a little after internal review.

          pivanof Pavel Ivanov made changes -
          Attachment fix_reconnect_crashed_master.txt [ 32600 ]
          knielsen Kristian Nielsen made changes -
          Status Open [ 1 ] In Progress [ 3 ]
          knielsen Kristian Nielsen made changes -
          Fix Version/s 10.0.14 [ 17101 ]
          Fix Version/s 10.0 [ 16000 ]

          knielsen Kristian Nielsen added a comment -

          Ok, I'm back from vacation and have got to this issue.

          There are actually more bugs related to a master crashing in the middle
          of an event. I discovered one: in this case, the (non-GTID) binlog
          position on the slave is not updated to point to the next master binlog
          file, because the crash leaves no rotate event at the end of the prior
          binlog file, and the following fake rotate event is skipped because a
          partial transaction is active.

          I suspect that this is poorly tested, and there may be other bugs
          lurking related to partially written events at master crash.

          I looked at MySQL 5.6; they took a different approach. During binlog
          recovery after a master crash, they truncate the crashed binlog file to
          just before the partially written event group at the end. This way, the
          slave never sees a partial event group.

          What do you think of this approach? It might make things more robust if
          the slave can never see a partial event group. I think LOCK_log on the
          master will ensure that a slave can never see a partial event group if
          we truncate it away during binlog crash recovery.

          (Even with this approach, I think something like your suggested patch
          could still be useful. I like the idea of sanity-checking things when we
          reconnect in the middle of a GTID event group; this is a fragile part of
          the system. We might also add a check that the server_id has not
          changed, in case someone tried to switch master by routing the IP
          address to a new server without using CHANGE MASTER on the slave.)
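The truncation idea can be sketched as follows (a toy Python model assuming every event group ends in an explicit COMMIT; real binlog recovery walks event headers and checksums, which this ignores):

```python
def truncate_partial_group(binlog_events):
    """Binlog crash recovery in the style described above: cut the binlog
    back to the end of the last complete event group, so a reconnecting
    slave can never be served half a transaction."""
    end_of_last_complete = 0
    for i, ev in enumerate(binlog_events):
        if ev == "COMMIT":
            end_of_last_complete = i + 1
    return binlog_events[:end_of_last_complete]

# A crash left a dangling half-written group at the end of the file:
crashed = ["BEGIN", "INSERT a", "COMMIT", "BEGIN", "INSERT b"]
assert truncate_partial_group(crashed) == ["BEGIN", "INSERT a", "COMMIT"]
```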

          knielsen Kristian Nielsen added a comment -

          BTW, I think it's not actually possible for the slave to see any of the
          partial event group prior to the master crashing, because LOCK_log is
          held while writing the complete group, and is also taken before reading
          any events to send to a slave.

          However, the bug would still be possible to trigger in real life: a
          slave could be reading events from further back after the master is
          restarted following the crash, and then a network error or similar
          could cause an I/O thread reconnect at the exact point of the partially
          written event group. So the test case should be valid.

          knielsen Kristian Nielsen added a comment -

          The other bug I mentioned is MDEV-6608.
          pivanof Pavel Ivanov added a comment -

          When I looked at the code I had the impression that LOCK_log is held only while binlog dump threads sleep waiting for new events to arrive. But if the dump thread is actively reading the binlog file without stopping, it can easily jump from the previous event group into the next, incomplete event group if that one has already been partially written (the event group probably needs to be big, so that it's not fully cached in the IO_CACHE during writing). So it's still possible for the slave to receive a partial event group before the master has crashed. Did I miss something?

          So I'd claim that binlog truncation won't solve this bug completely, but could be a useful addition anyway.

          knielsen Kristian Nielsen added a comment - - edited

          Hm, interesting...

          Well, I see this code in mysql_binlog_send():

            log_lock= mysql_bin_log.get_log_lock();
            error = Log_event::read_log_event(&log, packet, log_lock,
                                              info.current_checksum_alg,
                                              log_file_name,
                                              &is_active_binlog);

          And this in Log_event::read_log_event():

            if (log_lock)
              mysql_mutex_lock(log_lock);

          So it looks like LOCK_log is held while reading each event to be sent to the
          slave?

          However, if my understanding is correct, that's actually something I
          consider a performance bug. It's pretty bad that sending events to each
          slave contends on a mutex with committing to the binlog, so it is
          something we should try to remove, not rely on.

          And in any case, I agree that both approaches are good simultaneously -
          truncating on the master, as well as dealing sensibly with partial binlog data
          on the slave...

          pivanof Pavel Ivanov added a comment -

          Ah, I didn't notice such indirect locking. It seems odd that it's necessary at all, though...
          Then it indeed sounds like truncating the binlog file should fix this problem.


          knielsen Kristian Nielsen added a comment -

          Hm, I think I found an issue with the patch.

          If I understand the patch correctly, it works like this: at I/O thread
          reconnect, it looks at the file and position of the initial fake rotate
          event sent from the master. If these differ from the position of the
          GTID of a partial event group received prior to the reconnect, a
          warning is output and a fake format description event is created to
          make the SQL thread roll back the partial event group.

          But the position of that fake rotate event will always be 4, so it will
          never be equal to the prior GTID position - if my reading of the code
          is correct. So it seems the effect of the patch is to always roll back
          and re-queue a partial event group in case of reconnect?

          In fact, the test case rpl.rpl_gtid_reconnect fails because of this:

            rpl.rpl_gtid_reconnect 'mix,xtradb' [ fail ] Found warnings/errors in server log file!
            Test ended at 2014-08-26 12:35:50
            line
            140826 12:35:49 [Warning] Unexpected switch of master, GTID has changed binlog position. Expected: master-bin.000001:859, received: master-bin.000001:4. Assuming that the master has crashed.

          In fact, for InnoDB, always rolling back and re-queueing seems like a
          reasonable approach. It should also work for most DDL (not sure about
          CREATE ... SELECT in row-based mode). But it can cause replication
          failure for MyISAM, I think.

          (Not the first time things would have been easier if a transactional
          storage engine could be assumed...)
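The flaw is visible in a toy version of the check (a hypothetical function, not the actual patch code): the initial fake rotate event always carries position 4, the start of the binlog file, so a file-and-position comparison can never match a mid-file GTID offset.

```python
FAKE_ROTATE_POS = 4  # position carried by the initial fake rotate event

def master_unchanged(expected_file, expected_pos, rotate_file, rotate_pos):
    """The patch's sanity check as described: compare file AND position
    of the fake rotate event against the recorded GTID position."""
    return (rotate_file, rotate_pos) == (expected_file, expected_pos)

# Normal reconnect to the same master, partial group recorded at offset 859.
# The check still fails, because the fake rotate position is always 4:
ok = master_unchanged("master-bin.000001", 859,
                      "master-bin.000001", FAKE_ROTATE_POS)
assert ok is False   # so the patch always rolls back, even without a crash
```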

          knielsen Kristian Nielsen added a comment -

          An easy fix might be to compare only the file name, not the position?

          In case of crash, the file name has changed. In case of normal
          reconnect (no master crash), the file name will be the same.
          pivanof Pavel Ivanov added a comment -

          Hm... When I wrote that, I was indeed worried that the first rotate event would have position 4 and not point to the GTID event. But I was somehow convinced that this wasn't true and that it points to the GTID event; I don't remember now what made me think so. If you are sure that it always has position 4, then I guess we have no choice but to check only the file name. But this check won't be as reliable. E.g. the master crashes, its IP:port is routed to a new master, and that server has a different transaction at the same file but a different position.

          Can we do something better than just a file name check?


          knielsen Kristian Nielsen added a comment -

          We can detect a silent change of master server by checking whether the
          server_id of the master changes on reconnect.

          Maybe something like this:

          1. If during a reconnect we see that the server_id of the master has
          changed, we always abort (roll back) any partially received event
          group. I don't think it's safe to try to fetch the other half from a
          different server, as the event could be logged differently on the two.

          2. On a reconnect where the server_id of the master is unchanged, it
          seems reasonably safe to assume that the current master binlog file is
          unchanged. In this case, it should be enough to compare the file name:
          if the file name is the same, we can fetch the remainder of any
          partially received event group; if it changed, we abort any partial
          group, as in your patch.

          I think this should be safe for all transactional events. I suppose (1)
          could cause some MyISAM issues in some cases, but MyISAM replication in
          general cannot be made 100% safe, and it's more important to be 100%
          crash-safe for InnoDB...
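The two rules can be sketched as a toy decision function (hypothetical names; not the actual MariaDB implementation):

```python
def reconnect_decision(old_server_id, new_server_id, old_file, new_file):
    """What to do with a partially received event group at I/O thread
    reconnect, per the two rules above."""
    if new_server_id != old_server_id:
        # Rule 1: master was silently switched; the same group may be
        # logged differently there, so never fetch its second half.
        return "abort_partial_group"
    if new_file != old_file:
        # Rule 2: same master but a new binlog file, so it must have
        # crashed and restarted; the partial group can't be completed.
        return "abort_partial_group"
    # Same master, same binlog file: safe to fetch the remainder.
    return "resume_partial_group"

assert reconnect_decision(1, 2, "bin.000001", "bin.000001") == "abort_partial_group"
assert reconnect_decision(1, 1, "bin.000001", "bin.000002") == "abort_partial_group"
assert reconnect_decision(1, 1, "bin.000001", "bin.000001") == "resume_partial_group"
```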
          pivanof Pavel Ivanov added a comment -

          Yes, the approach of checking server_id and then file name sounds reasonable to me.


          knielsen Kristian Nielsen added a comment -

          I committed a patch for this that implements what is discussed above:

          http://lists.askmonty.org/pipermail/commits/2014-September/006478.html

          Monty, do you think you could review the patch?
          knielsen Kristian Nielsen made changes -
          Assignee Kristian Nielsen [ knielsen ] Michael Widenius [ monty ]
          Status In Progress [ 3 ] In Review [ 10002 ]
          knielsen Kristian Nielsen made changes -
          Summary Slave replicating using GTID doesn't recover correctly when master crashes in the middle of transaction NEED REVIEW: Slave replicating using GTID doesn't recover correctly when master crashes in the middle of transaction

          monty Michael Widenius added a comment -

          Have now reviewed it and it looks ok.

          monty Michael Widenius added a comment -

          ok to push
          monty Michael Widenius made changes -
          Status In Review [ 10002 ] Stalled [ 10000 ]
          knielsen Kristian Nielsen made changes -
          Assignee Michael Widenius [ monty ] Kristian Nielsen [ knielsen ]

          knielsen Kristian Nielsen added a comment -

          Pushed to 10.0.14.
          knielsen Kristian Nielsen made changes -
          Resolution Fixed [ 1 ]
          Status Stalled [ 10000 ] Closed [ 6 ]
          knielsen Kristian Nielsen made changes -
          Summary NEED REVIEW: Slave replicating using GTID doesn't recover correctly when master crashes in the middle of transaction Slave replicating using GTID doesn't recover correctly when master crashes in the middle of transaction
          ratzpo Rasmus Johansson (Inactive) made changes -
          Workflow MariaDB v2 [ 50420 ] MariaDB v3 [ 64183 ]
          serg Sergei Golubchik made changes -
          Workflow MariaDB v3 [ 64183 ] MariaDB v4 [ 148045 ]

          People

            knielsen Kristian Nielsen
            pivanof Pavel Ivanov
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

