[MDEV-8354] out-of-order error with --gtid-ignore-duplicates and row-based replication - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Major
Resolution: Fixed
Affects Version/s: 10.0.20
Fix Version/s: 10.0.21, 10.1.6
Component/s: Replication
Labels:
- gtid
Environment:
CentOS 7.1.1503 64bit

Description

I'm trying to set up multiple masters with multi-source replication slaves but i'm getting the following error even with gtid-ignore-duplicates ON:
"An attempt was made to binlog GTID (...) which would create an out-of-order sequence number with existing GTID (...), and gtid strict mode is enabled."

If I do just a single insert/update query, it seems to work just fine. Under any sort of load (even mysqlslap) replication stops for one of the slave connections on my slave with the error above. Here's my setup:

A and B are masters. C is the slave.
A is gtid domain ID 1, server ID 1.
B is gtid domain ID 2, server ID 2.
C doesn't have gtid domain ID configured (therefore it's the default of 0), server ID 3.

Replication is currently as follows (all are using master_use_gtid=current_pos)
A --> C
B --> C
A --> B
The goal, once it's working, is to also have B --> A.

The MariaDB-server package is installed from the repo:

# MariaDB 10.0 CentOS repository list

# http://mariadb.org/mariadb/repositories/

[mariadb]

name = MariaDB

baseurl = http://yum.mariadb.org/10.0/centos7-amd64

gpgkey=https://yum.mariadb.org/RPM-GPG-KEY-MariaDB

gpgcheck=1

Server A config:

[mariadb-10.0]

innodb-strict-mode=1

log-error=wopr-errorlog

relay-log=wopr-relaylog

log-basename=wopr

server-id=1

report-host=mn1-dbtest1

gtid-domain-id=1

auto_increment_increment=2

auto_increment_offset=1

log-bin=wopr-binlog

log-bin-index=wopr-binlog-index

binlog-format=ROW

log-slave-updates=1

slave-exec-mode=IDEMPOTENT

gtid-ignore-duplicates=1

gtid-strict-mode=1

Server B config:

[mariadb-10.0]

innodb-strict-mode=1

log-error=wopr-errorlog

relay-log=wopr-relaylog

log-basename=wopr

server-id=2

report-host=mn1-dbtest2

gtid-domain-id=2

auto_increment_increment=2

auto_increment_offset=2

log-bin=wopr-binlog

log-bin-index=wopr-binlog-index

binlog-format=ROW

log-slave-updates=1

slave-exec-mode=IDEMPOTENT

gtid-ignore-duplicates=1

gtid-strict-mode=1

Server C config:

[mariadb-10.0]

innodb-strict-mode=1

log-error=wopr-errorlog

relay-log=wopr-relaylog

log-basename=wopr

server-id=3

report-host=mn1-dbtest3

log-bin=wopr-binlog

log-bin-index=wopr-binlog-index

binlog-format=ROW

log-slave-updates=1

slave-exec-mode=IDEMPOTENT

gtid-ignore-duplicates=1

gtid-strict-mode=1

read-only=1

—
Steps to reproduce:
Set up the three servers with the configs above. Set up the slave replication from B to A, from C to A, and from C to B using master_use_gtid=current_pos for all of them.

Do a single insert/update/create and it'll work just fine. No error.
Now on server A do:
mysqlslap -u root -p --auto-generate-sql

Now Server C's slave connection to Server B will stop (Slave_SQL_Running is No) with the following error (the GTID position isn't the same every time)
"Last_SQL_Error: An attempt was made to binlog GTID 1-1-6 which would create an out-of-order sequence number with existing GTID 1-1-6, and gtid strict mode is enabled."

Server C's gtid_slave_pos is, however, 1-1-111 since the transactions worked through the replication with server A. If I simply do a

{reset master;}

on server C and then

{start slave 'serverB';}

then the error goes away. Doing the mysqlslap again will cause the error again as well.
—

I spoke to knielsen in the #maria IRC chat about this last week and he seemed to be puzzled by the issue as well.

Attachments

Issue Links

relates to

MDEV-8507 gtid_ignore_duplicates doesn't work with log_slave_updates and multi-source replication

Closed

Activity

Ascending order - Click to sort in descending order

Kristian Nielsen added a comment - 2015-06-23 08:40

Is that the exact mysqlslap command you used?

For me, mysqlslap -u root -p --auto-generate-sql does not do anything, and I do not get the error:

Average number of seconds to run all queries: 0.011 seconds
Minimum number of seconds to run all queries: 0.011 seconds
Maximum number of seconds to run all queries: 0.011 seconds
Number of clients running queries: 1
Average number of queries per client: 0

"number of queries" is 0, seems like it doesn't run anything. I will try to fiddle with mysqlslap command line arguments, but might be useful to know the exact command line that was used by reporter.

Kristian Nielsen added a comment - 2015-06-23 08:40 Is that the exact mysqlslap command you used? For me, mysqlslap -u root -p --auto-generate-sql does not do anything, and I do not get the error: Average number of seconds to run all queries: 0.011 seconds Minimum number of seconds to run all queries: 0.011 seconds Maximum number of seconds to run all queries: 0.011 seconds Number of clients running queries: 1 Average number of queries per client: 0 "number of queries" is 0, seems like it doesn't run anything. I will try to fiddle with mysqlslap command line arguments, but might be useful to know the exact command line that was used by reporter.

Kristian Nielsen added a comment - 2015-06-23 08:55

Ok, I could reproduce with this mysqlslap command line:

bld/client/mysqlslap --socket=bld/data1/mysql.sock -uroot --auto-generate-sql --auto-generate-sql-execute-number=10000 --concurrency=10

Kristian Nielsen added a comment - 2015-06-23 08:55 Ok, I could reproduce with this mysqlslap command line: bld/client/mysqlslap --socket=bld/data1/mysql.sock -uroot --auto-generate-sql --auto-generate-sql-execute-number=10000 --concurrency=10

Matt Neth added a comment - 2015-06-23 09:06

Yeah, that mysqlslap command is all that I need to run the create the error. gtid_binlog_pos on the master goes up by 109 transactions every time I run that so it's doing the queries.

Mysqlslap is just how I get it to cause the error in my test environment. My main environment has enough natural traffic to cause this issue too when I was originally trying to add a second master to it so it's not just mysqlslap that causes it.

Matt Neth added a comment - 2015-06-23 09:06 Yeah, that mysqlslap command is all that I need to run the create the error. gtid_binlog_pos on the master goes up by 109 transactions every time I run that so it's doing the queries. Mysqlslap is just how I get it to cause the error in my test environment. My main environment has enough natural traffic to cause this issue too when I was originally trying to add a second master to it so it's not just mysqlslap that causes it.

Kristian Nielsen added a comment - 2015-06-23 10:14

Investigation shows that rpl_global_gtid_slave_state.release_domain_owner()
is called too early in the error case. It is called before the GTID state
has been updated with the owned GTID. This then allows the second connection
to become owner of the same GTID and attempt to re-execute it, which creates
the duplicate GTID error.

release_domain_owner() is called from rpl_group_info::cleanup_context().
In the row-based case, cleanup_context() is called already at the end of the
statement, from rows_event_stmt_cleanup().

So it seems that the release_domain_owner() call is in the wrong place, but
this only triggers in case of row-based replication. So this could explain
why it was not seen before (if most users where using statement-based
replication).

More investigation is required to confirm this and find the correct fix, but
seems a plausible explanation, at least.

Kristian Nielsen added a comment - 2015-06-23 10:14 Investigation shows that rpl_global_gtid_slave_state.release_domain_owner() is called too early in the error case. It is called before the GTID state has been updated with the owned GTID. This then allows the second connection to become owner of the same GTID and attempt to re-execute it, which creates the duplicate GTID error. release_domain_owner() is called from rpl_group_info::cleanup_context(). In the row-based case, cleanup_context() is called already at the end of the statement, from rows_event_stmt_cleanup(). So it seems that the release_domain_owner() call is in the wrong place, but this only triggers in case of row-based replication. So this could explain why it was not seen before (if most users where using statement-based replication). More investigation is required to confirm this and find the correct fix, but seems a plausible explanation, at least.

Kristian Nielsen added a comment - 2015-06-23 12:08

I think this is the patch to fix the problem. I will try to create a
mysql-test-run test case to go with it.

diff --git a/sql/rpl_rli.cc b/sql/rpl_rli.cc

index 9bd0ca5..ef6d68c 100644

--- a/sql/rpl_rli.cc

+++ b/sql/rpl_rli.cc

@@ -1768,6 +1768,13 @@ void rpl_group_info::cleanup_context(THD *thd, bool error)

       rli->clear_flag(Relay_log_info::IN_STMT);

       rli->clear_flag(Relay_log_info::IN_TRANSACTION);

+    /*

+      Ensure we always release the domain for others to process, when using

+      --gtid-ignore-duplicates.

+    */

+    if (gtid_ignore_duplicate_state != GTID_DUPLICATE_NULL)

+      rpl_global_gtid_slave_state.release_domain_owner(this);

/*

@@ -1777,13 +1784,6 @@ void rpl_group_info::cleanup_context(THD *thd, bool error)

   thd->variables.option_bits&= ~OPTION_RELAXED_UNIQUE_CHECKS;

/*

-    Ensure we always release the domain for others to process, when using

-    --gtid-ignore-duplicates.

-  */

-  if (gtid_ignore_duplicate_state != GTID_DUPLICATE_NULL)

-    rpl_global_gtid_slave_state.release_domain_owner(this);

-  /*

     Reset state related to long_find_row notes in the error log:

     - timestamp

     - flag that decides whether the slave prints or not

Kristian Nielsen added a comment - 2015-06-23 12:08 I think this is the patch to fix the problem. I will try to create a mysql-test-run test case to go with it. diff --git a/sql/rpl_rli.cc b/sql/rpl_rli.cc index 9bd0ca5..ef6d68c 100644 --- a/sql/rpl_rli.cc +++ b/sql/rpl_rli.cc @@ -1768,6 +1768,13 @@ void rpl_group_info::cleanup_context(THD *thd, bool error) rli->clear_flag(Relay_log_info::IN_STMT); rli->clear_flag(Relay_log_info::IN_TRANSACTION); } + + /* + Ensure we always release the domain for others to process, when using + --gtid-ignore-duplicates. + */ + if (gtid_ignore_duplicate_state != GTID_DUPLICATE_NULL) + rpl_global_gtid_slave_state.release_domain_owner(this); } /* @@ -1777,13 +1784,6 @@ void rpl_group_info::cleanup_context(THD *thd, bool error) thd->variables.option_bits&= ~OPTION_RELAXED_UNIQUE_CHECKS; /* - Ensure we always release the domain for others to process, when using - --gtid-ignore-duplicates. - */ - if (gtid_ignore_duplicate_state != GTID_DUPLICATE_NULL) - rpl_global_gtid_slave_state.release_domain_owner(this); - - /* Reset state related to long_find_row notes in the error log: - timestamp - flag that decides whether the slave prints or not

Kristian Nielsen added a comment - 2015-06-23 15:23

http://lists.askmonty.org/pipermail/commits/2015-June/008083.html

Kristian Nielsen added a comment - 2015-06-23 15:23 http://lists.askmonty.org/pipermail/commits/2015-June/008083.html

People

Assignee:: Kristian Nielsen

Reporter:: Matt Neth

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 2015-06-22 18:46

Updated:: 2015-08-01 03:13

Resolved:: 2015-06-24 18:31

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server