[MDEV-8354] out-of-order error with --gtid-ignore-duplicates and row-based replication Created: 2015-06-22  Updated: 2015-08-01  Resolved: 2015-06-24

Status: Closed
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.0.20
Fix Version/s: 10.0.21, 10.1.6

Type: Bug Priority: Major
Reporter: Matt Neth Assignee: Kristian Nielsen
Resolution: Fixed Votes: 0
Labels: gtid
Environment:

CentOS 7.1.1503 64bit


Issue Links:
Relates
relates to MDEV-8507 gtid_ignore_duplicates doesn't work w... Closed

 Description   

I'm trying to set up multiple masters with multi-source replication slaves but i'm getting the following error even with gtid-ignore-duplicates ON:
"An attempt was made to binlog GTID (...) which would create an out-of-order sequence number with existing GTID (...), and gtid strict mode is enabled."

If I do just a single insert/update query, it seems to work just fine. Under any sort of load (even mysqlslap) replication stops for one of the slave connections on my slave with the error above. Here's my setup:

A and B are masters. C is the slave.
A is gtid domain ID 1, server ID 1.
B is gtid domain ID 2, server ID 2.
C doesn't have gtid domain ID configured (therefore it's the default of 0), server ID 3.

Replication is currently as follows (all are using master_use_gtid=current_pos)
A --> C
B --> C
A --> B
The goal, once it's working, is to also have B --> A.

The MariaDB-server package is installed from the repo:

# MariaDB 10.0 CentOS repository list
# http://mariadb.org/mariadb/repositories/
[mariadb]
name = MariaDB
baseurl = http://yum.mariadb.org/10.0/centos7-amd64
gpgkey=https://yum.mariadb.org/RPM-GPG-KEY-MariaDB
gpgcheck=1

Server A config:

[mariadb-10.0]
innodb-strict-mode=1
log-error=wopr-errorlog
relay-log=wopr-relaylog
log-basename=wopr
server-id=1
report-host=mn1-dbtest1
gtid-domain-id=1
auto_increment_increment=2
auto_increment_offset=1
log-bin=wopr-binlog
log-bin-index=wopr-binlog-index
binlog-format=ROW
log-slave-updates=1
slave-exec-mode=IDEMPOTENT
gtid-ignore-duplicates=1
gtid-strict-mode=1

Server B config:

[mariadb-10.0]
innodb-strict-mode=1
log-error=wopr-errorlog
relay-log=wopr-relaylog
log-basename=wopr
server-id=2
report-host=mn1-dbtest2
gtid-domain-id=2
auto_increment_increment=2
auto_increment_offset=2
log-bin=wopr-binlog
log-bin-index=wopr-binlog-index
binlog-format=ROW
log-slave-updates=1
slave-exec-mode=IDEMPOTENT
gtid-ignore-duplicates=1
gtid-strict-mode=1

Server C config:

[mariadb-10.0]
innodb-strict-mode=1
log-error=wopr-errorlog
relay-log=wopr-relaylog
log-basename=wopr
server-id=3
report-host=mn1-dbtest3
log-bin=wopr-binlog
log-bin-index=wopr-binlog-index
binlog-format=ROW
log-slave-updates=1
slave-exec-mode=IDEMPOTENT
gtid-ignore-duplicates=1
gtid-strict-mode=1
read-only=1


Steps to reproduce:
Set up the three servers with the configs above. Set up the slave replication from B to A, from C to A, and from C to B using master_use_gtid=current_pos for all of them.

Do a single insert/update/create and it'll work just fine. No error.
Now on server A do:
mysqlslap -u root -p --auto-generate-sql

Now Server C's slave connection to Server B will stop (Slave_SQL_Running is No) with the following error (the GTID position isn't the same every time)
"Last_SQL_Error: An attempt was made to binlog GTID 1-1-6 which would create an out-of-order sequence number with existing GTID 1-1-6, and gtid strict mode is enabled."

Server C's gtid_slave_pos is, however, 1-1-111 since the transactions worked through the replication with server A. If I simply do a

{reset master;}

on server C and then

{start slave 'serverB';}

then the error goes away. Doing the mysqlslap again will cause the error again as well.

I spoke to knielsen in the #maria IRC chat about this last week and he seemed to be puzzled by the issue as well.



 Comments   
Comment by Kristian Nielsen [ 2015-06-23 ]

Is that the exact mysqlslap command you used?

For me, mysqlslap -u root -p --auto-generate-sql does not do anything, and I do not get the error:

Average number of seconds to run all queries: 0.011 seconds
Minimum number of seconds to run all queries: 0.011 seconds
Maximum number of seconds to run all queries: 0.011 seconds
Number of clients running queries: 1
Average number of queries per client: 0

"number of queries" is 0, seems like it doesn't run anything. I will try to fiddle with mysqlslap command line arguments, but might be useful to know the exact command line that was used by reporter.

Comment by Kristian Nielsen [ 2015-06-23 ]

Ok, I could reproduce with this mysqlslap command line:

bld/client/mysqlslap --socket=bld/data1/mysql.sock -uroot --auto-generate-sql --auto-generate-sql-execute-number=10000 --concurrency=10

Comment by Matt Neth [ 2015-06-23 ]

Yeah, that mysqlslap command is all that I need to run the create the error. gtid_binlog_pos on the master goes up by 109 transactions every time I run that so it's doing the queries.

Mysqlslap is just how I get it to cause the error in my test environment. My main environment has enough natural traffic to cause this issue too when I was originally trying to add a second master to it so it's not just mysqlslap that causes it.

Comment by Kristian Nielsen [ 2015-06-23 ]

Investigation shows that rpl_global_gtid_slave_state.release_domain_owner()
is called too early in the error case. It is called before the GTID state
has been updated with the owned GTID. This then allows the second connection
to become owner of the same GTID and attempt to re-execute it, which creates
the duplicate GTID error.

release_domain_owner() is called from rpl_group_info::cleanup_context().
In the row-based case, cleanup_context() is called already at the end of the
statement, from rows_event_stmt_cleanup().

So it seems that the release_domain_owner() call is in the wrong place, but
this only triggers in case of row-based replication. So this could explain
why it was not seen before (if most users where using statement-based
replication).

More investigation is required to confirm this and find the correct fix, but
seems a plausible explanation, at least.

Comment by Kristian Nielsen [ 2015-06-23 ]

I think this is the patch to fix the problem. I will try to create a
mysql-test-run test case to go with it.

diff --git a/sql/rpl_rli.cc b/sql/rpl_rli.cc
index 9bd0ca5..ef6d68c 100644
--- a/sql/rpl_rli.cc
+++ b/sql/rpl_rli.cc
@@ -1768,6 +1768,13 @@ void rpl_group_info::cleanup_context(THD *thd, bool error)
       rli->clear_flag(Relay_log_info::IN_STMT);
       rli->clear_flag(Relay_log_info::IN_TRANSACTION);
     }
+
+    /*
+      Ensure we always release the domain for others to process, when using
+      --gtid-ignore-duplicates.
+    */
+    if (gtid_ignore_duplicate_state != GTID_DUPLICATE_NULL)
+      rpl_global_gtid_slave_state.release_domain_owner(this);
   }
 
   /*
@@ -1777,13 +1784,6 @@ void rpl_group_info::cleanup_context(THD *thd, bool error)
   thd->variables.option_bits&= ~OPTION_RELAXED_UNIQUE_CHECKS;
 
   /*
-    Ensure we always release the domain for others to process, when using
-    --gtid-ignore-duplicates.
-  */
-  if (gtid_ignore_duplicate_state != GTID_DUPLICATE_NULL)
-    rpl_global_gtid_slave_state.release_domain_owner(this);
-
-  /*
     Reset state related to long_find_row notes in the error log:
     - timestamp
     - flag that decides whether the slave prints or not

Comment by Kristian Nielsen [ 2015-06-23 ]

http://lists.askmonty.org/pipermail/commits/2015-June/008083.html

Generated at Thu Feb 08 07:26:32 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.