[MDEV-5804] If same GTID is received on multiple master connections in multi-source replication, the event is double-executed causing corruption or replication failure

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 10.0.9
Fix Version/s: 10.0.10
Component/s: None
Labels: None

Description

With multi-source replication, it is possible on a slave to receive the same
GTID twice, if there are multiple paths through which events can arrive from a
master. One example is a three-node multi-master setup where each server
replicates as a slave from the two others.

With correctly configured GTID, it is easy to detect that an event has already
been received once; just compare the sequence number against the previous GTID
received within that domain, and ignore the event if already applied. But
currently, the code does not handle this correctly; instead it applies
the event twice, or in GTID strict mode stops with an error.
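The sequence-number comparison described above can be sketched roughly as follows. This is only an illustration, not the server code: the Gtid tuple, parse_gtid() and already_applied() are hypothetical names, and the slave position is assumed to be a simple map from domain id to the last GTID applied in that domain.

```python
from typing import Dict, NamedTuple


class Gtid(NamedTuple):
    """A GTID written as domain-server-sequence, e.g. "0-1-100"."""
    domain_id: int
    server_id: int
    seq_no: int


def parse_gtid(text: str) -> Gtid:
    # Split "D-S-N" into its three integer components.
    domain, server, seq = (int(part) for part in text.split("-"))
    return Gtid(domain, server, seq)


def already_applied(position: Dict[int, Gtid], incoming: Gtid) -> bool:
    """Return True if the incoming GTID is not newer than the domain position.

    `position` is a hypothetical stand-in for the slave's GTID state:
    one entry per replication domain, holding the last applied GTID.
    """
    current = position.get(incoming.domain_id)
    return current is not None and incoming.seq_no <= current.seq_no
```

For example, with a position of 0-2-100 in domain 0, an incoming 0-1-100 counts as a duplicate even though the server id differs, while 0-1-101 does not.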

We cannot really fix this as default behaviour, as this could break upgrade of
existing setups, and also conflicts with the behaviour in strict mode of
giving an error, which is desired by some users. However, we can implement a
--gtid-ignore-duplicates option, which will enable this behaviour.

So instead of the gtid strict mode behaviour, which fails if we see D-S1-M
after D-S2-N (M<=N), we will handle it as follows:

If we receive T1=D-S1-M, and current position in D is D-S2-N with N >= M, we
drop T1.

If M > N, then the event needs to be applied. However, we need to protect
against two different master connections trying to apply the same GTID at the
same time. So the first connection to see the new GTID with M > N is set as
the current owner of the domain, and starts applying the transaction. When it
is done and has committed, the domain is released and the owner is
cleared. Any second connection that received a GTID with M > N while the
domain is already reserved by another owner will need to wait until the
current owner is done, and then make the decision to either discard, or apply
(then becoming the new owner).
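A minimal sketch of the drop/apply/wait decision above, assuming a single domain guarded by a condition variable. The class DomainState and its begin()/commit() methods are hypothetical names chosen for illustration; they are not taken from the patch.

```python
import threading


class DomainState:
    """Hypothetical per-domain state: current sequence number N and owner."""

    def __init__(self, seq_no: int = 0):
        self.cond = threading.Condition()
        self.seq_no = seq_no   # N: last committed sequence number in the domain
        self.owner = None      # connection currently applying an event, if any

    def begin(self, conn, seq_no: int) -> bool:
        """Return True if `conn` should apply the event, False to drop it."""
        with self.cond:
            # A new GTID (M > N) must wait while another owner is applying.
            while self.owner is not None and seq_no > self.seq_no:
                self.cond.wait()
            if seq_no <= self.seq_no:
                return False       # N >= M: already applied, drop the event
            self.owner = conn      # M > N and domain is free: become the owner
            return True

    def commit(self, conn, seq_no: int) -> None:
        """Record the commit, release the domain and wake up any waiters."""
        with self.cond:
            if self.owner is not conn:
                raise RuntimeError("commit from a connection that is not the owner")
            self.seq_no = seq_no
            self.owner = None
            self.cond.notify_all()
```

A second connection that calls begin() with the same sequence number while the first is still applying will block, and after the first connection commits it re-checks the position and discards its copy.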

A good way to implement it seems to be to have a current owner of each domain,
in the form of the Relay_log_info. In parallel replication, we also have a
reference count of worker threads active in that domain. When a worker thread
gets the lock on the domain, it sets the owner (if unset), and increases the
reference count. When it is done, it decreases the count, and if it reaches
zero it removes the owner and signals any other waiters that the domain is now
free to be grabbed by someone else.
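The owner plus reference count scheme could look something like the sketch below. It is an assumption-laden illustration only: DomainOwnership, acquire() and release() are hypothetical names, and a plain Python object stands in for the Relay_log_info of each master connection.

```python
import threading


class DomainOwnership:
    """Hypothetical owner + reference count for one replication domain."""

    def __init__(self):
        self.cond = threading.Condition()
        self.owner = None      # stands in for the owning Relay_log_info
        self.ref_count = 0     # worker threads of the owner active in the domain

    def acquire(self, rli) -> None:
        with self.cond:
            # Workers of another connection wait until the domain is free.
            while self.owner is not None and self.owner is not rli:
                self.cond.wait()
            self.owner = rli   # sets the owner if it was unset
            self.ref_count += 1

    def release(self, rli) -> None:
        with self.cond:
            if self.owner is not rli or self.ref_count == 0:
                raise RuntimeError("release without a matching acquire")
            self.ref_count -= 1
            if self.ref_count == 0:
                # Last worker is done: clear the owner and signal the waiters.
                self.owner = None
                self.cond.notify_all()
```

Several worker threads of the same connection can hold the domain at once; the domain only becomes available to another connection when the last of them releases it.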

Normally, if a slave asks to connect at GTID position D-S2-N, but the master is
only at D-S1-M, M < N, then the slave will get an error that it is ahead of
the master. However, with --gtid-ignore-duplicates enabled, this is a normal
situation (e.g. the GTID came directly from A->C, it did not yet come A->B, and
now C wants to connect as a slave to B). So in this case the slave needs to
tell the master not to give an error; instead the master must simply
wait for the GTID D-S2-N to turn up and then start sending events to the
slave.
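The connect-time behaviour described above amounts to a three-way decision on the master, sketched here for a single domain. The ConnectAction enum and on_slave_connect() are hypothetical; in particular, how the slave actually communicates the ignore-duplicates setting to the master is not specified in this description and is modelled as a plain boolean argument.

```python
from enum import Enum


class ConnectAction(Enum):
    SEND = "send"    # master has the requested position: start sending events
    ERROR = "error"  # slave is ahead of the master: refuse the connection
    WAIT = "wait"    # slave is ahead, but the master waits for the GTID to arrive


def on_slave_connect(master_seq_no: int, slave_seq_no: int,
                     ignore_duplicates: bool) -> ConnectAction:
    """Decide what the master does when a slave connects at D-S2-N.

    master_seq_no is M (the master's position in the domain) and
    slave_seq_no is N (the position the slave asks to start from).
    """
    if slave_seq_no <= master_seq_no:
        return ConnectAction.SEND
    # M < N: the slave has seen events the master has not received yet.
    return ConnectAction.WAIT if ignore_duplicates else ConnectAction.ERROR
```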

Attachments

Activity

Kristian Nielsen added a comment - 2014-03-09 11:29

I now have a patch for this issue, which is surprisingly simple:

lp:~maria-captains/maria/10.0-mdev5804

It is currently only implemented for parallel replication (@@slave_parallel_threads > 0). Other than that, it should be functional.

It would be good to have some testing of it, to get some confidence that it works as expected for the rather complex setups where it is needed.

Kristian Nielsen added a comment - 2014-03-09 11:31

To test, configure the servers with --gtid-ignore-duplicates and --slave-parallel-threads=10 (say). Then set up multi-source replication using GTID and with multiple paths between nodes.

There is an example in the test case:

mysql-test/suite/multi_source/gtid_ignore_duplicates.test

Kristian Nielsen added a comment - 2014-03-12 01:18

I pushed an updated patch to the feature tree:

lp:~maria-captains/maria/10.0-mdev5804

This patch should now be fairly complete. In particular, it no longer
has the limitation that it only works in parallel replication.

Kristian Nielsen added a comment - 2014-03-17 10:45

Pushed to 10.0

People

Assignee: Kristian Nielsen

Reporter: Kristian Nielsen

Votes: 0

Watchers: 6

Dates

Created: 2014-03-07 11:01

Updated: 2014-03-17 10:45

Resolved: 2014-03-17 10:45

MariaDB Server