  1. MariaDB Server
  2. MDEV-6462

Slave replicating using GTID doesn't recover correctly when master crashes in the middle of transaction

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 10.0.12
    • Fix Version/s: 10.0.14
    • None

    Description

      When the master crashes in the middle of writing a multi-statement transaction to the binlog, after the slave has already received some events from that transaction, the IO thread reconnects to the restarted master and assumes that it will re-download the same binlog events. But the master will actually either send nothing (if no new transactions have been executed) or send completely different events from new transactions, which leaves the slave with data that differs from the data on the master.

      I'd bet the root cause of the problem is in how the IO thread reconnects when GTID-based replication is turned on, specifically in these few lines of code starting at sql/slave.cc:5310:

          /*
            Do not queue any format description event that we receive after a
            reconnect where we are skipping over a partial event group received
            before the reconnect.
       
            (If we queued such an event, and it was the first format_description
            event after master restart, the slave SQL thread would think that
            the partial event group before it in the relay log was from a
            previous master crash and should be rolled back).
          */
          if (unlikely(mi->gtid_reconnect_event_skip_count && !mi->gtid_event_seen))
              gtid_skip_enqueue= true;

      In the scenario I described above, the SQL thread actually must roll back the active transaction.
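
The distinction argued for above can be sketched as a small decision function. This is only an illustration under stated assumptions, not the code in sql/slave.cc and not the attached fix: `ReconnectAction`, `IoThreadState`, `on_reconnect`, and the `master_restarted` flag are hypothetical names, and how the slave would actually detect a master restart is left out.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified model of the reconnect decision.
// None of these names exist in MariaDB; they are for illustration only.
enum class ReconnectAction {
    ResumeNormally,        // no partial event group was pending
    SkipDuplicateEvents,   // master is expected to resend the partial group
    RollBackPartialGroup   // master restarted; the partial group is lost
};

struct IoThreadState {
    // Events of an incomplete event group already written to the relay
    // log before the connection dropped (loosely modelled on the role of
    // gtid_reconnect_event_skip_count in the quoted snippet).
    std::uint64_t partial_group_events = 0;
};

// Decide how to treat the stream that follows a reconnect.
// master_restarted is an assumed input: true if the first format
// description event after the reconnect indicates a master restart.
ReconnectAction on_reconnect(const IoThreadState &state, bool master_restarted)
{
    if (state.partial_group_events == 0)
        return ReconnectAction::ResumeNormally;

    // After a crash the master never committed the partial transaction,
    // so it will not resend those events. Skipping would drop events
    // that belong to unrelated new transactions.
    if (master_restarted)
        return ReconnectAction::RollBackPartialGroup;

    // Plain network reconnect: the same events arrive again and the
    // ones already received can safely be skipped.
    assert(state.partial_group_events > 0);
    return ReconnectAction::SkipDuplicateEvents;
}
```

The behaviour described in this report corresponds to always taking the `SkipDuplicateEvents` branch whenever a partial group is pending, regardless of whether the master restarted.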

      The attached patch makes it possible to emulate this scenario. Apply it, run the rpl_gtid_crash test, and look at the results of the last two SELECTs: they will differ between master and slave.

      I will look into a way to fix this problem myself, but would appreciate any help. I'll attach a patch if I manage to find a fix before anyone on the MariaDB side does.

      Attachments

        1. fix_reconnect_crashed_master.txt
          8 kB
          Pavel Ivanov
        2. patch.txt
          2 kB
          Pavel Ivanov

          People

            Assignee: knielsen Kristian Nielsen
            Reporter: pivanof Pavel Ivanov
            Votes: 0
            Watchers: 4