[MDEV-6462] Slave replicating using GTID doesn't recover correctly when master crashes in the middle of transaction Created: 2014-07-19  Updated: 2014-09-03  Resolved: 2014-09-03

Status: Closed
Project: MariaDB Server
Component/s: None
Affects Version/s: 10.0.12
Fix Version/s: 10.0.14

Type: Bug Priority: Critical
Reporter: Pavel Ivanov Assignee: Kristian Nielsen
Resolution: Fixed Votes: 0
Labels: gtid

Attachments: Text File fix_reconnect_crashed_master.txt     Text File patch.txt    

 Description   

When the master crashes in the middle of writing a multi-statement transaction to the binlog, so that the slave has already received some events from that transaction, the IO thread reconnects to the restarted master and assumes it will re-download the same binlog events. But the master will actually either send nothing (if no new transactions have been executed) or send completely different events from new transactions, which results in completely different data on the slave compared to the master.

I'd bet the root cause of the problem is in how the IO thread reconnects when GTID-based replication is enabled, in these few lines of code starting at sql/slave.cc:5310:

    /*
      Do not queue any format description event that we receive after a
      reconnect where we are skipping over a partial event group received
      before the reconnect.
 
      (If we queued such an event, and it was the first format_description
      event after master restart, the slave SQL thread would think that
      the partial event group before it in the relay log was from a
      previous master crash and should be rolled back).
    */
    if (unlikely(mi->gtid_reconnect_event_skip_count && !mi->gtid_event_seen))
        gtid_skip_enqueue= true;

In the scenario described above, the SQL thread actually must roll back the active transaction.
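For context, here is a condensed model of the skip logic around the quoted code. This is my own simplification for illustration (the struct and function names are hypothetical, loosely following sql/slave.cc), not the actual MariaDB implementation: after reconnecting in the middle of an event group, the IO thread expects the master to re-send the events it has already queued, and skips them.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified model of the IO thread's reconnect handling.
struct Master_info_sim {
  uint64_t gtid_reconnect_event_skip_count;  // events queued before the reconnect
  bool gtid_event_seen;                      // first GTID event seen after reconnect
};

// Returns true when the incoming event should NOT be queued to the relay log.
bool skip_event_after_reconnect(Master_info_sim &mi, bool is_format_description)
{
  // The quoted condition: drop a format description event arriving while we
  // are still skipping over the partial event group, so the SQL thread does
  // not mistake it for a master restart and roll the partial group back.
  if (mi.gtid_reconnect_event_skip_count && !mi.gtid_event_seen &&
      is_format_description)
    return true;
  if (mi.gtid_reconnect_event_skip_count > 0) {
    mi.gtid_reconnect_event_skip_count--;  // a re-sent copy of an event
    return true;                           // we already have in the relay log
  }
  return false;  // genuinely new event: queue it
}
```

The bug scenario breaks the core assumption of this model: after a master crash and restart, the re-sent events are not the same events that were queued before the reconnect.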

The attached patch makes it possible to emulate this scenario. Apply it, run the rpl_gtid_crash test, and look at the results of the last two SELECTs – they will differ between master and slave.

I will look into a way to fix this problem myself, but would appreciate any help. I'll attach a patch if I manage to find a fix before anyone on the MariaDB side does.



 Comments   
Comment by Pavel Ivanov [ 2014-07-21 ]

Attaching a proposed fix together with the revised and expanded test case. The fix seems to work well, but please check whether it is appropriate for non-standard use cases, e.g. for a parallel slave.

Comment by Pavel Ivanov [ 2014-07-24 ]

Attaching the fix, modified slightly after internal review.

Comment by Kristian Nielsen [ 2014-08-19 ]

Ok, I'm back from vacation and got to this issue.

There are actually more bugs related to a master crashing in the middle of
an event. I discovered one: in this case, the (non-GTID) binlog position
on the slave is not updated to point to the next master binlog file, because
the crash leaves no rotate event at the end of the prior binlog file, and
the following fake rotate event is skipped because a partial transaction is
active.

I suspect that this is poorly tested, and there may be other bugs lurking
related to partially written events at master crash.

I looked at MySQL 5.6; they took a different approach. During binlog recovery
after a master crash, they truncate the crashed binlog file to just before the
partially written event group at its end. This way, the slave never sees a
partial event group.

What do you think of this approach? It might make things more robust if the
slave can never see a partial event group. I think LOCK_log on the master
will ensure that a slave can never see a partial event group if we
truncate it away during binlog crash recovery.
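A rough model of that recovery-time truncation, as a sketch of the idea rather than the actual MySQL 5.6 implementation (the struct and function here are hypothetical; real recovery parses event headers from the file):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Each entry models one event found while scanning the binlog at crash
// recovery (a simplification for illustration).
struct Scanned_event {
  uint64_t len;        // event length in bytes
  bool ends_group;     // commit/XID event that closes an event group
  bool fully_written;  // false for a partial event at the end of the file
};

// Return the file length to truncate to: just past the last complete event
// group. Position 4 is BIN_LOG_HEADER_SIZE (the binlog magic bytes).
uint64_t recovery_truncate_pos(const std::vector<Scanned_event> &events)
{
  uint64_t keep = 4, scan = 4;
  for (const Scanned_event &e : events) {
    if (!e.fully_written)
      break;            // partial event: everything past 'keep' is dropped
    scan += e.len;
    if (e.ends_group)
      keep = scan;      // group completed: safe to keep up to here
  }
  return keep;
}
```

Truncating at a group boundary is what guarantees the slave can never be served half a transaction from this file after recovery.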

(Even with this approach, I think something like your suggested patch could
still be useful. I like the idea of sanity-checking things when we reconnect
in the middle of a GTID event group; this is a fragile part of the system. We
might also add a check that the server_id has not changed, in case someone
tried to switch masters by routing the IP address to a new server without using
CHANGE MASTER on the slave.)

Comment by Kristian Nielsen [ 2014-08-19 ]

BTW, I think it's not actually possible for the slave to see any part of a
partial event group prior to the master crashing, because LOCK_log is held
while writing the complete group and is also taken before reading any events
to send to a slave.

However, the bug would still be possible to trigger in real life, because a
slave could be reading events from back in time after the master is restarted
following the crash, and then a network error or the like could cause an I/O
thread reconnect at the exact point of the partially written event group. So
the test case should be valid.

Comment by Kristian Nielsen [ 2014-08-19 ]

The other bug I mentioned is MDEV-6608.

Comment by Pavel Ivanov [ 2014-08-19 ]

When I looked at the code I had the impression that LOCK_log is held only while binlog dump threads sleep waiting for new events to arrive. But if the dump thread is actively reading the binlog file without stopping, then it can easily jump from the previous event group into the next, incomplete event group if that one has already been partially written (that event group probably needs to be big so that it's not fully cached in the IO_CACHE during writing). So it's still possible for the slave to receive a partial event group before the master has crashed. Did I miss something?

So I'd claim that binlog truncation won't completely solve this bug, but it could be a useful addition anyway.

Comment by Kristian Nielsen [ 2014-08-19 ]

Hm, interesting...

Well, I see this code in mysql_binlog_send():

  log_lock= mysql_bin_log.get_log_lock();
  error = Log_event::read_log_event(&log, packet, log_lock,
                                    info.current_checksum_alg,
                                    log_file_name,
                                    &is_active_binlog);

And this in Log_event::read_log_event():

  if (log_lock)
    mysql_mutex_lock(log_lock);

So it looks like LOCK_log is held while reading each event to be sent to the
slave?

However, if my understanding is correct, that's actually something I
consider a performance bug. It's pretty bad that sending events to each slave
contends on a mutex with committing to the binlog, so this is something we
should try to remove, not rely on.

And in any case, I agree that both approaches are good simultaneously -
truncating on the master, as well as dealing sensibly with partial binlog data
on the slave...

Comment by Pavel Ivanov [ 2014-08-19 ]

Ah, I didn't notice such indirect locking. It looks weird, though; why is it necessary at all?
Then it indeed sounds like truncating the binlog file should fix this problem.

Comment by Kristian Nielsen [ 2014-08-26 ]

Hm, I think I found an issue with the patch.

If I understand the patch correctly, it works like this: at I/O thread reconnect, it looks at the file name and position of the initial fake rotate event sent from the master. If these differ from the position of the GTID of a partial event group received before the reconnect, a warning is printed and a fake format description event is created to make the SQL thread roll back the partial event group.

But the position of that fake rotate event will always be 4, so it will never be equal to the prior GTID position – if my reading of the code is correct. So it seems the effect of the patch is to always roll back and re-queue a partial event group on reconnect?

In fact, the test case rpl.rpl_gtid_reconnect fails because of this:

rpl.rpl_gtid_reconnect 'mix,xtradb' [ fail ] Found warnings/errors in server log file!
Test ended at 2014-08-26 12:35:50
line
140826 12:35:49 [Warning] Unexpected switch of master, GTID has changed binlog position. Expected: master-bin.000001:859, received: master-bin.000001:4. Assuming that the master has crashed.
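To spell out why the check always fires (this is my reading of the patch, with hypothetical names, not its literal code): the fake Rotate event sent on connect always carries position 4, the start of the file just past the binlog magic, while the GTID of a partial group sits somewhere mid-file.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical illustration of the patch's comparison.
struct Binlog_pos { std::string file; uint64_t pos; };

constexpr uint64_t BIN_LOG_HEADER_SIZE = 4;  // fake Rotate always carries pos 4

// Compare the slave's expected position of the partial group's GTID event
// against the (file, pos) received in the fake Rotate event on reconnect.
bool master_unchanged(const Binlog_pos &expected, const std::string &rotate_file)
{
  Binlog_pos received{rotate_file, BIN_LOG_HEADER_SIZE};
  return expected.file == received.file && expected.pos == received.pos;
}
```

With the expected position at offset 859 (as in the warning above), the comparison can never succeed, so the partial group is always rolled back.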

In fact, for InnoDB, always rolling back and re-queueing seems like a reasonable approach. It should also work for most DDL (not sure about CREATE ... SELECT in row-based mode). But I think it can cause replication failure for MyISAM.

(Not the first time things would have been easier if a transactional storage engine could be assumed...)

Comment by Kristian Nielsen [ 2014-08-26 ]

An easy fix might be to compare only the file name, not the position?

In case of a crash, the file name has changed. In case of a normal reconnect (no master crash), the file name will be the same.

Comment by Pavel Ivanov [ 2014-08-26 ]

Hm... When I wrote that, I was indeed worried that the first rotate event would have position 4 and wouldn't point to the GTID event. But I somehow became convinced that that wasn't true and that it does point to the GTID event; I don't remember now what made me think so. If you are sure that it always has position 4, then I guess we have no choice but to check only the file name. But this check won't be as reliable. E.g. when the master crashes, its IP:port is routed to a new master, and that server has a different transaction in a file with the same name but a different position.

Can we do something better than just file name check?

Comment by Kristian Nielsen [ 2014-08-27 ]

We can detect silent change of master server by checking if the server_id of
the master changes in the reconnect.

Maybe something like this:

1. If during a reconnect we see that the master's server_id has changed, we
always abort (roll back) any partially received event group. I don't think
it's safe to try to fetch the other half from a different server, as the event
could be logged differently on the two servers.

2. On a reconnect where the master's server_id is unchanged, it seems
reasonably safe to assume that the current master binlog file is unchanged. In
this case, it should be enough to compare the file name: if the file name is
the same, we can fetch the remainder of any partially received event group; if
it changed, we abort any partial group, as in your patch.

I think this should be safe for all transactional events. I suppose (1) could
cause some MyISAM issues in some cases, but MyISAM replication in general
cannot be made 100% safe, and it's more important to be 100% crash-safe for
InnoDB...
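The two-step check above could be sketched as follows (an illustrative function with hypothetical names, not the committed MariaDB code):

```cpp
#include <cassert>
#include <cstdint>
#include <string>

enum class Reconnect_action { FETCH_REMAINDER, ROLLBACK_PARTIAL };

// Decide what to do with a partially received event group on reconnect:
// first compare the master's server_id, then the binlog file name.
Reconnect_action decide_on_reconnect(uint32_t old_server_id,
                                     uint32_t new_server_id,
                                     const std::string &old_file,
                                     const std::string &new_file)
{
  if (old_server_id != new_server_id)
    return Reconnect_action::ROLLBACK_PARTIAL;  // (1) silent master switch
  if (old_file != new_file)
    return Reconnect_action::ROLLBACK_PARTIAL;  // (2) master crashed/rotated
  return Reconnect_action::FETCH_REMAINDER;     // same master, same file
}
```

Note that this deliberately ignores the position, since the fake Rotate event always reports position 4 and cannot be compared meaningfully.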

Comment by Pavel Ivanov [ 2014-08-27 ]

Yes, the approach of checking server_id and then file name sounds reasonable to me.

Comment by Kristian Nielsen [ 2014-09-02 ]

I committed a patch for this that implements what was discussed above:

http://lists.askmonty.org/pipermail/commits/2014-September/006478.html

Monty, do you think you could review the patch?

Comment by Michael Widenius [ 2014-09-02 ]

Have now reviewed it and it looks ok.

Comment by Michael Widenius [ 2014-09-02 ]

ok to push

Comment by Kristian Nielsen [ 2014-09-03 ]

Pushed to 10.0.14.
