Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-8354

out-of-order error with --gtid-ignore-duplicates and row-based replication

Details

    • Bug
    • Status: Closed (View Workflow)
    • Major
    • Resolution: Fixed
    • 10.0.20
    • 10.0.21, 10.1.6
    • Replication
    • CentOS 7.1.1503 64bit

    Description

      I'm trying to set up multiple masters with multi-source replication slaves but i'm getting the following error even with gtid-ignore-duplicates ON:
      "An attempt was made to binlog GTID (...) which would create an out-of-order sequence number with existing GTID (...), and gtid strict mode is enabled."

      If I do just a single insert/update query, it seems to work just fine. Under any sort of load (even mysqlslap) replication stops for one of the slave connections on my slave with the error above. Here's my setup:

      A and B are masters. C is the slave.
      A is gtid domain ID 1, server ID 1.
      B is gtid domain ID 2, server ID 2.
      C doesn't have gtid domain ID configured (therefore it's the default of 0), server ID 3.

      Replication is currently as follows (all are using master_use_gtid=current_pos)
      A --> C
      B --> C
      A --> B
      The goal, once it's working, is to also have B --> A.

      The MariaDB-server package is installed from the repo:

      # MariaDB 10.0 CentOS repository list
      # http://mariadb.org/mariadb/repositories/
      [mariadb]
      name = MariaDB
      baseurl = http://yum.mariadb.org/10.0/centos7-amd64
      gpgkey=https://yum.mariadb.org/RPM-GPG-KEY-MariaDB
      gpgcheck=1

      Server A config:

      [mariadb-10.0]
      innodb-strict-mode=1
      log-error=wopr-errorlog
      relay-log=wopr-relaylog
      log-basename=wopr
      server-id=1
      report-host=mn1-dbtest1
      gtid-domain-id=1
      auto_increment_increment=2
      auto_increment_offset=1
      log-bin=wopr-binlog
      log-bin-index=wopr-binlog-index
      binlog-format=ROW
      log-slave-updates=1
      slave-exec-mode=IDEMPOTENT
      gtid-ignore-duplicates=1
      gtid-strict-mode=1

      Server B config:

      [mariadb-10.0]
      innodb-strict-mode=1
      log-error=wopr-errorlog
      relay-log=wopr-relaylog
      log-basename=wopr
      server-id=2
      report-host=mn1-dbtest2
      gtid-domain-id=2
      auto_increment_increment=2
      auto_increment_offset=2
      log-bin=wopr-binlog
      log-bin-index=wopr-binlog-index
      binlog-format=ROW
      log-slave-updates=1
      slave-exec-mode=IDEMPOTENT
      gtid-ignore-duplicates=1
      gtid-strict-mode=1

      Server C config:

      [mariadb-10.0]
      innodb-strict-mode=1
      log-error=wopr-errorlog
      relay-log=wopr-relaylog
      log-basename=wopr
      server-id=3
      report-host=mn1-dbtest3
      log-bin=wopr-binlog
      log-bin-index=wopr-binlog-index
      binlog-format=ROW
      log-slave-updates=1
      slave-exec-mode=IDEMPOTENT
      gtid-ignore-duplicates=1
      gtid-strict-mode=1
      read-only=1

      —
      Steps to reproduce:
      Set up the three servers with the configs above. Set up the slave replication from B to A, from C to A, and from C to B using master_use_gtid=current_pos for all of them.

      Do a single insert/update/create and it'll work just fine. No error.
      Now on server A do:
      mysqlslap -u root -p --auto-generate-sql

      Now Server C's slave connection to Server B will stop (Slave_SQL_Running is No) with the following error (the GTID position isn't the same every time)
      "Last_SQL_Error: An attempt was made to binlog GTID 1-1-6 which would create an out-of-order sequence number with existing GTID 1-1-6, and gtid strict mode is enabled."

      Server C's gtid_slave_pos is, however, 1-1-111 since the transactions worked through the replication with server A. If I simply do a

      {reset master;}

      on server C and then

      {start slave 'serverB';}

      then the error goes away. Doing the mysqlslap again will cause the error again as well.
      —

      I spoke to knielsen in the #maria IRC chat about this last week and he seemed to be puzzled by the issue as well.

      Attachments

        Issue Links

          Activity

            Is that the exact mysqlslap command you used?

            For me, mysqlslap -u root -p --auto-generate-sql does not do anything, and I do not get the error:

            Average number of seconds to run all queries: 0.011 seconds
            Minimum number of seconds to run all queries: 0.011 seconds
            Maximum number of seconds to run all queries: 0.011 seconds
            Number of clients running queries: 1
            Average number of queries per client: 0

            "number of queries" is 0, seems like it doesn't run anything. I will try to fiddle with mysqlslap command line arguments, but might be useful to know the exact command line that was used by reporter.

            knielsen Kristian Nielsen added a comment - Is that the exact mysqlslap command you used? For me, mysqlslap -u root -p --auto-generate-sql does not do anything, and I do not get the error: Average number of seconds to run all queries: 0.011 seconds Minimum number of seconds to run all queries: 0.011 seconds Maximum number of seconds to run all queries: 0.011 seconds Number of clients running queries: 1 Average number of queries per client: 0 "number of queries" is 0, seems like it doesn't run anything. I will try to fiddle with mysqlslap command line arguments, but might be useful to know the exact command line that was used by reporter.

            Ok, I could reproduce with this mysqlslap command line:

            bld/client/mysqlslap --socket=bld/data1/mysql.sock -uroot --auto-generate-sql --auto-generate-sql-execute-number=10000 --concurrency=10

            knielsen Kristian Nielsen added a comment - Ok, I could reproduce with this mysqlslap command line: bld/client/mysqlslap --socket=bld/data1/mysql.sock -uroot --auto-generate-sql --auto-generate-sql-execute-number=10000 --concurrency=10
            mneth1 Matt Neth added a comment -

            Yeah, that mysqlslap command is all that I need to run the create the error. gtid_binlog_pos on the master goes up by 109 transactions every time I run that so it's doing the queries.

            Mysqlslap is just how I get it to cause the error in my test environment. My main environment has enough natural traffic to cause this issue too when I was originally trying to add a second master to it so it's not just mysqlslap that causes it.

            mneth1 Matt Neth added a comment - Yeah, that mysqlslap command is all that I need to run the create the error. gtid_binlog_pos on the master goes up by 109 transactions every time I run that so it's doing the queries. Mysqlslap is just how I get it to cause the error in my test environment. My main environment has enough natural traffic to cause this issue too when I was originally trying to add a second master to it so it's not just mysqlslap that causes it.

            Investigation shows that rpl_global_gtid_slave_state.release_domain_owner()
            is called too early in the error case. It is called before the GTID state
            has been updated with the owned GTID. This then allows the second connection
            to become owner of the same GTID and attempt to re-execute it, which creates
            the duplicate GTID error.

            release_domain_owner() is called from rpl_group_info::cleanup_context().
            In the row-based case, cleanup_context() is called already at the end of the
            statement, from rows_event_stmt_cleanup().

            So it seems that the release_domain_owner() call is in the wrong place, but
            this only triggers in case of row-based replication. So this could explain
            why it was not seen before (if most users where using statement-based
            replication).

            More investigation is required to confirm this and find the correct fix, but
            seems a plausible explanation, at least.

            knielsen Kristian Nielsen added a comment - Investigation shows that rpl_global_gtid_slave_state.release_domain_owner() is called too early in the error case. It is called before the GTID state has been updated with the owned GTID. This then allows the second connection to become owner of the same GTID and attempt to re-execute it, which creates the duplicate GTID error. release_domain_owner() is called from rpl_group_info::cleanup_context(). In the row-based case, cleanup_context() is called already at the end of the statement, from rows_event_stmt_cleanup(). So it seems that the release_domain_owner() call is in the wrong place, but this only triggers in case of row-based replication. So this could explain why it was not seen before (if most users where using statement-based replication). More investigation is required to confirm this and find the correct fix, but seems a plausible explanation, at least.

            I think this is the patch to fix the problem. I will try to create a
            mysql-test-run test case to go with it.

            diff --git a/sql/rpl_rli.cc b/sql/rpl_rli.cc
            index 9bd0ca5..ef6d68c 100644
            --- a/sql/rpl_rli.cc
            +++ b/sql/rpl_rli.cc
            @@ -1768,6 +1768,13 @@ void rpl_group_info::cleanup_context(THD *thd, bool error)
                   rli->clear_flag(Relay_log_info::IN_STMT);
                   rli->clear_flag(Relay_log_info::IN_TRANSACTION);
                 }
            +
            +    /*
            +      Ensure we always release the domain for others to process, when using
            +      --gtid-ignore-duplicates.
            +    */
            +    if (gtid_ignore_duplicate_state != GTID_DUPLICATE_NULL)
            +      rpl_global_gtid_slave_state.release_domain_owner(this);
               }
             
               /*
            @@ -1777,13 +1784,6 @@ void rpl_group_info::cleanup_context(THD *thd, bool error)
               thd->variables.option_bits&= ~OPTION_RELAXED_UNIQUE_CHECKS;
             
               /*
            -    Ensure we always release the domain for others to process, when using
            -    --gtid-ignore-duplicates.
            -  */
            -  if (gtid_ignore_duplicate_state != GTID_DUPLICATE_NULL)
            -    rpl_global_gtid_slave_state.release_domain_owner(this);
            -
            -  /*
                 Reset state related to long_find_row notes in the error log:
                 - timestamp
                 - flag that decides whether the slave prints or not

            knielsen Kristian Nielsen added a comment - I think this is the patch to fix the problem. I will try to create a mysql-test-run test case to go with it. diff --git a/sql/rpl_rli.cc b/sql/rpl_rli.cc index 9bd0ca5..ef6d68c 100644 --- a/sql/rpl_rli.cc +++ b/sql/rpl_rli.cc @@ -1768,6 +1768,13 @@ void rpl_group_info::cleanup_context(THD *thd, bool error) rli->clear_flag(Relay_log_info::IN_STMT); rli->clear_flag(Relay_log_info::IN_TRANSACTION); } + + /* + Ensure we always release the domain for others to process, when using + --gtid-ignore-duplicates. + */ + if (gtid_ignore_duplicate_state != GTID_DUPLICATE_NULL) + rpl_global_gtid_slave_state.release_domain_owner(this); } /* @@ -1777,13 +1784,6 @@ void rpl_group_info::cleanup_context(THD *thd, bool error) thd->variables.option_bits&= ~OPTION_RELAXED_UNIQUE_CHECKS; /* - Ensure we always release the domain for others to process, when using - --gtid-ignore-duplicates. - */ - if (gtid_ignore_duplicate_state != GTID_DUPLICATE_NULL) - rpl_global_gtid_slave_state.release_domain_owner(this); - - /* Reset state related to long_find_row notes in the error log: - timestamp - flag that decides whether the slave prints or not
            knielsen Kristian Nielsen added a comment - http://lists.askmonty.org/pipermail/commits/2015-June/008083.html

            People

              knielsen Kristian Nielsen
              mneth1 Matt Neth
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.