MariaDB Server / MDEV-6462

Slave replicating using GTID doesn't recover correctly when master crashes in the middle of transaction

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 10.0.12
    • Fix Version/s: 10.0.14
    • Component/s: None

    Description

      When the master crashes in the middle of writing a multi-statement transaction to the binlog, so that the slave has already received some events from this transaction, the IO thread reconnects to the restarted master and thinks it will re-download the same binlog events. But the master will actually either send nothing (if no new transactions have been executed), or send completely different events from new transactions, which will result in completely different data on the slave compared to the master.

      I'd bet the root cause of the problem is in how the IO thread reconnects when GTID-based replication is turned on, specifically in these few lines of code starting at sql/slave.cc:5310:

          /*
            Do not queue any format description event that we receive after a
            reconnect where we are skipping over a partial event group received
            before the reconnect.
       
            (If we queued such an event, and it was the first format_description
            event after master restart, the slave SQL thread would think that
            the partial event group before it in the relay log was from a
            previous master crash and should be rolled back).
          */
          if (unlikely(mi->gtid_reconnect_event_skip_count && !mi->gtid_event_seen))
              gtid_skip_enqueue= true;

      In the scenario described above, the SQL thread actually must roll back the active transaction.
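The divergence can be modeled with a toy simulation (plain Python, not MariaDB code; the event names and the skip behavior are simplified for illustration): the IO thread's reconnect skip counter swallows the first events of the master's new transaction, so the slave ends up committing the orphaned half of the old one.

```python
# Toy model of the bug described above; not MariaDB code. An event group
# is BEGIN ... COMMIT, and events are plain strings.

def apply_transactional(events):
    """A transactional applier: only COMMIT makes a group durable, so a
    trailing partial group is implicitly rolled back."""
    committed, pending = [], []
    for ev in events:
        if ev == "BEGIN":
            pending = []
        elif ev == "COMMIT":
            committed.extend(pending)
            pending = []
        else:
            pending.append(ev)
    return committed

# The slave got half a transaction, then the master crashed and restarted.
partial_group = ["BEGIN", "INSERT t1 VALUES (1)"]         # COMMIT never arrives
new_events = ["BEGIN", "INSERT t2 VALUES (9)", "COMMIT"]  # post-restart txn

# Buggy IO thread: it thinks it is re-downloading the events it already
# queued, so it skips that many events of the NEW transaction and stitches
# the new COMMIT onto the orphaned group in the relay log.
skip = len(partial_group)
buggy_relay_log = partial_group + new_events[skip:]

assert apply_transactional(new_events) == ["INSERT t2 VALUES (9)"]       # master
assert apply_transactional(buggy_relay_log) == ["INSERT t1 VALUES (1)"]  # slave
# The slave commits a row the master rolled back, and misses t2 entirely.
```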

      The attached patch allows emulating this scenario. Apply it, run the rpl_gtid_crash test, and look at the results of the last two SELECTs – they will differ between master and slave.

      I will look into a way to fix this problem myself, but would appreciate any help. I'll attach a patch if I manage to find a fix before anyone on the MariaDB side does.

      Attachments

        Activity

          pivanof Pavel Ivanov created issue -
          elenst Elena Stepanova made changes -
          Fix Version/s 10.0 [ 16000 ]
          Assignee Kristian Nielsen [ knielsen ]
          Labels gtid
          pivanof Pavel Ivanov added a comment -

          Attaching a proposed fix together with the revised and expanded test case. The fix seems to work well, but please check whether it's appropriate for non-standard use cases, e.g. the parallel slave.

          pivanof Pavel Ivanov made changes -
          Attachment fix_reconnect_crashed_master.txt [ 32400 ]
          pivanof Pavel Ivanov made changes -
          Attachment fix_reconnect_crashed_master.txt [ 32400 ]
          pivanof Pavel Ivanov added a comment -

          Attaching the fix, modified a little after internal review.

          pivanof Pavel Ivanov made changes -
          Attachment fix_reconnect_crashed_master.txt [ 32600 ]
          knielsen Kristian Nielsen made changes -
          Status Open [ 1 ] In Progress [ 3 ]
          knielsen Kristian Nielsen made changes -
          Fix Version/s 10.0.14 [ 17101 ]
          Fix Version/s 10.0 [ 16000 ]

          knielsen Kristian Nielsen added a comment -

          Ok, I'm back from vacation and have got to this issue.

          There are actually more bugs related to a master crashing in the middle
          of an event. I discovered one: in this case, the (non-GTID) binlog
          position on the slave is not updated to point to the next master binlog
          file, because the crash leaves no rotate event at the end of the prior
          binlog file, and the following fake rotate event is skipped because a
          partial transaction is active.

          I suspect that this is poorly tested, and there may be other bugs
          lurking related to partially written events at master crash.

          I looked at MySQL 5.6; they took a different approach. During binlog
          recovery after a master crash, they truncate the crashed binlog file to
          just before the partially written event group at the end. This way, the
          slave never sees a partial event group.

          What do you think of this approach? It might make things more robust if
          the slave can never see a partial event group. I think LOCK_log on the
          master will ensure that a slave can never see a partial event group if
          we truncate it away during binlog crash recovery.

          (Even with this approach, I think something like your suggested patch
          could still be useful. I like the idea of sanity-checking things when we
          reconnect in the middle of a GTID event group; this is a fragile part of
          the system. We might also add a check that the server_id has not
          changed, in case someone tried to switch master by routing the IP
          address to a new server without using CHANGE MASTER on the slave.)
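The truncation idea can be sketched as follows (a toy Python model assuming every event group ends in an explicit COMMIT; real binlog recovery walks event headers and checksums, which this ignores):

```python
def truncate_partial_group(binlog_events):
    """Binlog crash recovery in the style described above: cut the binlog
    back to the end of the last complete event group, so a reconnecting
    slave can never be served half a transaction."""
    end_of_last_complete = 0
    for i, ev in enumerate(binlog_events):
        if ev == "COMMIT":
            end_of_last_complete = i + 1
    return binlog_events[:end_of_last_complete]

# A crash left a dangling half-written group at the end of the file:
crashed = ["BEGIN", "INSERT a", "COMMIT", "BEGIN", "INSERT b"]
assert truncate_partial_group(crashed) == ["BEGIN", "INSERT a", "COMMIT"]
```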

          knielsen Kristian Nielsen added a comment -

          BTW, I think it's not actually possible for the slave to see any of the
          partial event group prior to the master crashing, because LOCK_log is
          held while writing the complete group, and is also taken before reading
          any events to send to a slave.

          However, the bug would still be possible to trigger in real life: a
          slave could be reading events from further back after the master is
          restarted following the crash, and then a network error or similar
          could cause an I/O thread reconnect at the exact point of the partially
          written event group. So the test case should be valid.

          knielsen Kristian Nielsen added a comment -

          The other bug I mentioned is MDEV-6608.
          pivanof Pavel Ivanov added a comment -

          When I looked at the code I had the impression that LOCK_log is held only while binlog dump threads sleep waiting for new events to arrive. But if the dump thread is actively reading the binlog file without stopping, it can easily jump from the previous event group into the next, incomplete event group if that one has already been partially written (the event group probably needs to be big, so that it's not fully cached in the IO_CACHE during writing). So it's still possible for the slave to receive a partial event group before the master has crashed. Did I miss something?

          So I'd claim that binlog truncation won't solve this bug completely, but could be a useful addition anyway.

          knielsen Kristian Nielsen added a comment - - edited

          Hm, interesting...

          Well, I see this code in mysql_binlog_send():

            log_lock= mysql_bin_log.get_log_lock();
            error = Log_event::read_log_event(&log, packet, log_lock,
                                              info.current_checksum_alg,
                                              log_file_name,
                                              &is_active_binlog);

          And this in Log_event::read_log_event():

            if (log_lock)
              mysql_mutex_lock(log_lock);

          So it looks like LOCK_log is held while reading each event to be sent to the
          slave?

          However, if my understanding is correct, that's actually something I
          consider a performance bug. It's pretty bad that sending events to each
          slave contends on a mutex with committing to the binlog, so it is
          something we should try to remove, not rely on.

          And in any case, I agree that both approaches are good simultaneously -
          truncating on the master, as well as dealing sensibly with partial binlog data
          on the slave...

          pivanof Pavel Ivanov added a comment -

          Ah, I didn't notice such indirect locking. It seems odd that it's necessary at all, though...
          Then it indeed sounds like truncating the binlog file should fix this problem.


          knielsen Kristian Nielsen added a comment -

          Hm, I think I found an issue with the patch.

          If I understand the patch correctly, it works like this: at I/O thread
          reconnect, it looks at the file and position of the initial fake rotate
          event sent from the master. If these differ from the position of the
          GTID of a partial event group received prior to the reconnect, a
          warning is output and a fake format description event is created to
          make the SQL thread roll back the partial event group.

          But the position of that fake rotate event will always be 4, so it will
          never be equal to the prior GTID position - if my reading of the code
          is correct. So it seems the effect of the patch is to always roll back
          and re-queue a partial event group in case of reconnect?

          In fact, the test case rpl.rpl_gtid_reconnect fails because of this:

            rpl.rpl_gtid_reconnect 'mix,xtradb' [ fail ] Found warnings/errors in server log file!
            Test ended at 2014-08-26 12:35:50
            line
            140826 12:35:49 [Warning] Unexpected switch of master, GTID has changed binlog position. Expected: master-bin.000001:859, received: master-bin.000001:4. Assuming that the master has crashed.

          In fact, for InnoDB, always rolling back and re-queueing seems like a
          reasonable approach. It should also work for most DDL (not sure about
          CREATE ... SELECT in row-based mode). But it can cause replication
          failure for MyISAM, I think.

          (Not the first time things would have been easier if a transactional
          storage engine could be assumed...)
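The flaw is visible in a toy version of the check (a hypothetical function, not the actual patch code): the initial fake rotate event always carries position 4, the start of the binlog file, so a file-and-position comparison can never match a mid-file GTID offset.

```python
FAKE_ROTATE_POS = 4  # position carried by the initial fake rotate event

def master_unchanged(expected_file, expected_pos, rotate_file, rotate_pos):
    """The patch's sanity check as described: compare file AND position
    of the fake rotate event against the recorded GTID position."""
    return (rotate_file, rotate_pos) == (expected_file, expected_pos)

# Normal reconnect to the same master, partial group recorded at offset 859.
# The check still fails, because the fake rotate position is always 4:
ok = master_unchanged("master-bin.000001", 859,
                      "master-bin.000001", FAKE_ROTATE_POS)
assert ok is False   # so the patch always rolls back, even without a crash
```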

          knielsen Kristian Nielsen added a comment -

          An easy fix might be to compare only the file name, not the position?

          In case of crash, the file name has changed. In case of normal
          reconnect (no master crash), the file name will be the same.
          pivanof Pavel Ivanov added a comment -

          Hm... When I wrote that, I was indeed worried that the first rotate event would have position 4 and not point to the GTID event. But I was somehow convinced that this wasn't true and that it points to the GTID event; I don't remember now what made me think so. If you are sure that it always has position 4, then I guess we have no choice but to check only the file name. But this check won't be as reliable. E.g. the master crashes, its IP:port is routed to a new master, and that server has a different transaction at the same file but a different position.

          Can we do something better than just a file name check?


          knielsen Kristian Nielsen added a comment -

          We can detect a silent change of master server by checking whether the
          server_id of the master changes on reconnect.

          Maybe something like this:

          1. If during a reconnect we see that the server_id of the master has
          changed, we always abort (roll back) any partially received event
          group. I don't think it's safe to try to fetch the other half from a
          different server, as the event could be logged differently on the two.

          2. On a reconnect where the server_id of the master is unchanged, it
          seems reasonably safe to assume that the current master binlog file is
          unchanged. In this case, it should be enough to compare the file name:
          if the file name is the same, we can fetch the remainder of any
          partially received event group; if it changed, we abort any partial
          group, as in your patch.

          I think this should be safe for all transactional events. I suppose (1)
          could cause some MyISAM issues in some cases, but MyISAM replication in
          general cannot be made 100% safe, and it's more important to be 100%
          crash-safe for InnoDB...
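The two rules can be sketched as a toy decision function (hypothetical names; not the actual MariaDB implementation):

```python
def reconnect_decision(old_server_id, new_server_id, old_file, new_file):
    """What to do with a partially received event group at I/O thread
    reconnect, per the two rules above."""
    if new_server_id != old_server_id:
        # Rule 1: master was silently switched; the same group may be
        # logged differently there, so never fetch its second half.
        return "abort_partial_group"
    if new_file != old_file:
        # Rule 2: same master but a new binlog file, so it must have
        # crashed and restarted; the partial group can't be completed.
        return "abort_partial_group"
    # Same master, same binlog file: safe to fetch the remainder.
    return "resume_partial_group"

assert reconnect_decision(1, 2, "bin.000001", "bin.000001") == "abort_partial_group"
assert reconnect_decision(1, 1, "bin.000001", "bin.000002") == "abort_partial_group"
assert reconnect_decision(1, 1, "bin.000001", "bin.000001") == "resume_partial_group"
```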
          pivanof Pavel Ivanov added a comment -

          Yes, the approach of checking server_id and then file name sounds reasonable to me.


          knielsen Kristian Nielsen added a comment -

          I committed a patch for this that implements what is discussed above:

          http://lists.askmonty.org/pipermail/commits/2014-September/006478.html

          Monty, do you think you could review the patch?
          knielsen Kristian Nielsen made changes -
          Assignee Kristian Nielsen [ knielsen ] Michael Widenius [ monty ]
          Status In Progress [ 3 ] In Review [ 10002 ]
          knielsen Kristian Nielsen made changes -
          Summary Slave replicating using GTID doesn't recover correctly when master crashes in the middle of transaction NEED REVIEW: Slave replicating using GTID doesn't recover correctly when master crashes in the middle of transaction

          monty Michael Widenius added a comment -

          Have now reviewed it and it looks ok.

          monty Michael Widenius added a comment -

          ok to push
          monty Michael Widenius made changes -
          Status In Review [ 10002 ] Stalled [ 10000 ]
          knielsen Kristian Nielsen made changes -
          Assignee Michael Widenius [ monty ] Kristian Nielsen [ knielsen ]

          knielsen Kristian Nielsen added a comment -

          Pushed to 10.0.14.
          knielsen Kristian Nielsen made changes -
          Resolution Fixed [ 1 ]
          Status Stalled [ 10000 ] Closed [ 6 ]
          knielsen Kristian Nielsen made changes -
          Summary NEED REVIEW: Slave replicating using GTID doesn't recover correctly when master crashes in the middle of transaction Slave replicating using GTID doesn't recover correctly when master crashes in the middle of transaction
          ratzpo Rasmus Johansson (Inactive) made changes -
          Workflow MariaDB v2 [ 50420 ] MariaDB v3 [ 64183 ]
          serg Sergei Golubchik made changes -
          Workflow MariaDB v3 [ 64183 ] MariaDB v4 [ 148045 ]

          People

            knielsen Kristian Nielsen
            pivanof Pavel Ivanov
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

