[MDEV-6462] Slave replicating using GTID doesn't recover correctly when master crashes in the middle of transaction - Jira

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 10.0.12
Fix Version/s: 10.0.14
Component/s: None
Labels:
- gtid

Description

When the master crashes in the middle of writing a multi-statement transaction to the binlog, and the slave has already received some events from this transaction, the IO thread reconnects to the restarted master and assumes that it will re-download the same binlog events. But the master will actually either send nothing (if no new transactions have executed) or send completely different events from new transactions, which leaves the slave with data completely different from the data on the master.

I'd bet the root cause of the problem is in how IO thread reconnects when GTID-based replication is turned on, and in these few lines of code starting at sql/slave.cc:5310:

/*
      Do not queue any format description event that we receive after a
      reconnect where we are skipping over a partial event group received
      before the reconnect.

      (If we queued such an event, and it was the first format_description
      event after master restart, the slave SQL thread would think that
      the partial event group before it in the relay log was from a
      previous master crash and should be rolled back).
*/
    if (unlikely(mi->gtid_reconnect_event_skip_count && !mi->gtid_event_seen))
        gtid_skip_enqueue= true;

In the scenario I described above, the SQL thread actually must roll back the active transaction.
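To make the failure mode concrete, here is a toy model of the reconnect logic. It is only a sketch under stated assumptions: the types `Event` and `Slave`, the flag `rollback_on_master_restart`, and the function `replay_crash_scenario` are hypothetical names made up for illustration and are not MariaDB code, and the "fixed" branch only illustrates the rollback behaviour argued for above, not necessarily what the committed patch does.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy model only: every name here is hypothetical, not taken from sql/slave.cc.
enum class EventType { FormatDescription, Gtid, Query, Xid };

struct Event {
    EventType type;
    std::string payload;    // statement text, or the GTID for Gtid events
    bool master_restarted;  // FormatDescription only: first one after a master restart
};

struct Slave {
    std::vector<std::string> applied;   // statements from committed event groups
    std::vector<std::string> pending;   // statements of the open (partial) group
    std::size_t received_in_group = 0;  // events of the open group already queued
    std::size_t skip_count = 0;         // events to skip after a reconnect
    bool rollback_on_master_restart;    // false: behaviour described in this report

    explicit Slave(bool fix) : rollback_on_master_restart(fix) {}

    // Connection lost: assume the master will resend the partial group.
    void on_disconnect() { skip_count = received_in_group; }

    void receive(const Event& ev) {
        if (ev.type == EventType::FormatDescription) {
            // Sketch of the fix: a restarted master means the partial group
            // never committed there, so drop it and stop skipping.
            if (skip_count > 0 && ev.master_restarted && rollback_on_master_restart) {
                pending.clear();
                received_in_group = 0;
                skip_count = 0;
            }
            return;  // otherwise not queued while skipping
        }
        if (skip_count > 0) {  // believed to be a duplicate of an already queued event
            --skip_count;
            return;
        }
        switch (ev.type) {
            case EventType::Gtid:
                pending.clear();
                received_in_group = 1;
                break;
            case EventType::Query:
                pending.push_back(ev.payload);
                ++received_in_group;
                break;
            case EventType::Xid:  // commit the open group
                applied.insert(applied.end(), pending.begin(), pending.end());
                pending.clear();
                received_in_group = 0;
                break;
            default:
                break;
        }
    }
};

// Master crashes after sending GTID + two statements of an uncommitted
// transaction, restarts, and then commits a different transaction.
std::vector<std::string> replay_crash_scenario(bool fix) {
    Slave slave(fix);
    slave.receive({EventType::Gtid, "0-1-1", false});
    slave.receive({EventType::Query, "INSERT 1", false});
    slave.receive({EventType::Query, "INSERT 2", false});
    slave.on_disconnect();  // master crashed mid-transaction

    slave.receive({EventType::FormatDescription, "", true});
    slave.receive({EventType::Gtid, "0-1-1", false});
    slave.receive({EventType::Query, "INSERT 10", false});
    slave.receive({EventType::Query, "INSERT 20", false});
    slave.receive({EventType::Xid, "", false});
    return slave.applied;
}
```

In this model, `replay_crash_scenario(false)` ends with the slave having applied the two statements of the transaction that the master lost in the crash, while `replay_crash_scenario(true)` applies the statements the restarted master actually committed.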

The attached patch makes it possible to emulate this scenario. Apply it, run the rpl_gtid_crash test, and look at the results of the last two SELECTs: they will differ between master and slave.

I will look into a way to fix this problem myself, but will appreciate any help. I'll attach a patch if I manage to find a fix before anyone on MariaDB side.

Attachments

- fix_reconnect_crashed_master.txt (8 kB, 2014-07-24 01:27)
- patch.txt (2 kB, 2014-07-19 02:42)

Activity

View 13 older comments

Kristian Nielsen added a comment - 2014-09-02 15:08

I committed a patch for this that implements what is discussed above:

http://lists.askmonty.org/pipermail/commits/2014-September/006478.html

Monty, do you think you could review the patch?

Michael Widenius added a comment - 2014-09-02 20:41

Have now reviewed it and it looks ok.

Michael Widenius added a comment - 2014-09-02 20:41

ok to push

Kristian Nielsen added a comment - 2014-09-03 10:12

Pushed to 10.0.14.

People

Assignee: Kristian Nielsen

Reporter: Pavel Ivanov

Votes: 0

Watchers: 4

Dates

Created: 2014-07-19 02:42

Updated: 2014-09-03 10:13

Resolved: 2014-09-03 10:12
