Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-34346

[ERROR] Slave (additional info): Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964 on parallel replica

Details

    Description

      --source include/have_binlog_format_statement.inc
      --source include/have_innodb.inc
      --source include/master-slave.inc
       
      --connection slave
      STOP SLAVE;
      SET GLOBAL slave_parallel_threads=10;
      START SLAVE;
       
      --connection master
      SET sql_mode='';
      CREATE TABLE t (c INT) ENGINE=InnoDB;
      INSERT INTO t VALUES (REPEAT (0,200000));
      INSERT INTO t SELECT * FROM t;
      INSERT INTO t VALUES (1);
      

      Leads to:

      11.5.0 e4afa610539ae01164485554e2de839bea9de816 (Optimized, MTR Output)

      INSERT INTO t VALUES (1);
      ***Warnings generated in error logs during shutdown after running tests: main.test
       
      2024-06-10 13:57:45 19 [Warning] Slave: Duplicate entry '0-3' for key 'PRIMARY' Error_code: 1062
      2024-06-10 13:57:45 18 [Warning] Slave: Connection was killed Error_code: 1927
      2024-06-10 13:57:45 18 [Warning] Slave: Connection was killed Error_code: 1927
      2024-06-10 13:57:45 18 [Warning] Slave: Deadlock found when trying to get lock; try restarting transaction Error_code: 1213
      2024-06-10 13:57:45 18 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
       
      main.test                                [ fail ]
      

      Slave log (var/log/mysqld.2.err):

      11.5.0 e4afa610539ae01164485554e2de839bea9de816 (Optimized, Slave)

      2024-06-10 13:53:06 6 [Note] Slave I/O thread: connected to master 'root@127.0.0.1:16000',replication starts at GTID position ''
      2024-06-10 13:53:06 7 [Note] Error reading relay log event: slave SQL thread was killed
      2024-06-10 13:53:06 7 [Note] Slave SQL thread exiting, replication stopped in log 'master-bin.000001' at position 329; GTID position '', master: 127.0.0.1:16000
      2024-06-10 13:53:06 6 [Note] Slave I/O thread exiting, read up to log 'master-bin.000001', position 329; GTID position , master 127.0.0.1:16000
      2024-06-10 13:53:06 10 [Note] Slave I/O thread: Start asynchronous replication to master 'root@127.0.0.1:16000' in log 'master-bin.000001' at position 329
      2024-06-10 13:53:06 11 [Note] Slave SQL thread initialized, starting replication in log 'master-bin.000001' at position 329, relay log './slave-relay-bin.000001' position: 4; GTID position ''
      2024-06-10 13:53:06 10 [Note] Slave I/O thread: connected to master 'root@127.0.0.1:16000',replication starts at GTID position ''
      2024-06-10 13:53:06 19 [ERROR] Slave SQL: Error during XID COMMIT: failed to update GTID state in mysql.gtid_slave_pos: 1062:  Duplicate entry '0-3' for key 'PRIMARY', Error_code: 1062; the event's master log master-bin.000001, end_log_pos 832, Gtid 0-1-3, Internal MariaDB error code: 1942
      2024-06-10 13:53:06 19 [ERROR] Slave (additional info): Duplicate entry '0-3' for key 'PRIMARY' Error_code: 1062
      2024-06-10 13:53:06 19 [Warning] Slave: Duplicate entry '0-3' for key 'PRIMARY' Error_code: 1062
      2024-06-10 13:53:06 19 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'master-bin.000001' position 663; GTID position '0-1-2'
      2024-06-10 13:53:06 11 [Note] Error reading relay log event: slave SQL thread was killed
      2024-06-10 13:53:06 17 [ERROR] Slave (additional info): Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
      2024-06-10 13:53:06 17 [Warning] Slave: Connection was killed Error_code: 1927
      2024-06-10 13:53:06 17 [Warning] Slave: Connection was killed Error_code: 1927
      2024-06-10 13:53:06 17 [Warning] Slave: Deadlock found when trying to get lock; try restarting transaction Error_code: 1213
      2024-06-10 13:53:06 17 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
      2024-06-10 13:53:06 17 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'master-bin.000001' position 663; GTID position '0-1-2'
      2024-06-10 13:53:06 11 [Note] Slave SQL thread exiting, replication stopped in log 'master-bin.000001' at position 663; GTID position '0-1-2', master: 127.0.0.1:16000
      2024-06-10 13:53:07 0 [Note] /test/MD170524-mariadb-11.5.0-linux-x86_64-opt/bin/mariadbd (initiated by: root[root] @ localhost [127.0.0.1]): Normal shutdown
      2024-06-10 13:53:07 10 [Note] Slave I/O thread exiting, read up to log 'master-bin.000001', position 996; GTID position 0-1-4, master 127.0.0.1:16000
      2024-06-10 13:53:07 0 [Note] InnoDB: FTS optimize thread exiting.
      2024-06-10 13:53:07 0 [Note] InnoDB: Starting shutdown...
      2024-06-10 13:53:07 0 [Note] InnoDB: Dumping buffer pool(s) to /test/MD170524-mariadb-11.5.0-linux-x86_64-opt/mariadb-test/var/mysqld.2/data/ib_buffer_pool
      2024-06-10 13:53:07 0 [Note] InnoDB: Restricted to 124 pages due to innodb_buf_pool_dump_pct=25
      2024-06-10 13:53:07 0 [Note] InnoDB: Buffer pool(s) dump completed at 240610 13:53:07
      2024-06-10 13:53:07 0 [Note] InnoDB: Removed temporary tablespace data file: "./ibtmp1"
      2024-06-10 13:53:07 0 [Note] InnoDB: Shutdown completed; log sequence number 52835; transaction id 28
      2024-06-10 13:53:07 0 [Note] /test/MD170524-mariadb-11.5.0-linux-x86_64-opt/bin/mariadbd: Shutdown complete
      

      If the issue does not immediately reproduce in full (it may pass, or just show the 1062 only, but not 1964), try --repeat 10; it is lightly sporadic.

      Confirmed present in 10.5-11.6.

      Attachments

        Issue Links

          Activity

            Reproduced the issue locally

            11.5.1 b2fc885469a7db2c25eab67fbc8e220dc2f805f8 (Optimized)

            ==============================================================================
             
            TEST                                      RESULT   TIME (ms) or COMMENT
            --------------------------------------------------------------------------
             
            worker[01] Using MTR_BUILD_THREAD 300, with reserved ports 16000..16019
            include/master-slave.inc
            [connection master]
            connection slave;
            STOP SLAVE;
            SET GLOBAL slave_parallel_threads=10;
            START SLAVE;
            connection master;
            SET sql_mode='';
            CREATE TABLE t (c INT) ENGINE=InnoDB;
            INSERT INTO t VALUES (REPEAT (0,200000));
            INSERT INTO t SELECT * FROM t;
            INSERT INTO t VALUES (1);
            ***Warnings generated in error logs during shutdown after running tests: rpl.mytest
             
            2024-06-10  7:14:06 19 [Warning] Slave: Duplicate entry '0-3' for key 'PRIMARY' Error_code: 1062
            2024-06-10  7:14:06 18 [Warning] Slave: Deadlock found when trying to get lock; try restarting transaction Error_code: 1213
            2024-06-10  7:14:06 18 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
             
            rpl.mytest                               [ fail ]
                    Test ended at 2024-06-10 07:14:06
            

            ramesh Ramesh Sivaraman added a comment - Reproduced the issue locally 11.5.1 b2fc885469a7db2c25eab67fbc8e220dc2f805f8 (Optimized) ==============================================================================   TEST RESULT TIME (ms) or COMMENT --------------------------------------------------------------------------   worker[01] Using MTR_BUILD_THREAD 300, with reserved ports 16000..16019 include/master-slave.inc [connection master] connection slave; STOP SLAVE; SET GLOBAL slave_parallel_threads=10; START SLAVE; connection master; SET sql_mode=''; CREATE TABLE t (c INT) ENGINE=InnoDB; INSERT INTO t VALUES (REPEAT (0,200000)); INSERT INTO t SELECT * FROM t; INSERT INTO t VALUES (1); ***Warnings generated in error logs during shutdown after running tests: rpl.mytest   2024-06-10 7:14:06 19 [Warning] Slave: Duplicate entry '0-3' for key 'PRIMARY' Error_code: 1062 2024-06-10 7:14:06 18 [Warning] Slave: Deadlock found when trying to get lock; try restarting transaction Error_code: 1213 2024-06-10 7:14:06 18 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964   rpl.mytest [ fail ] Test ended at 2024-06-10 07:14:06
            Elkin Andrei Elkin added a comment -

            ramesh, howdy! What is mysql.gtid_slave_pos' engine? Does not look like Innodb.

            Elkin Andrei Elkin added a comment - ramesh , howdy! What is mysql.gtid_slave_pos' engine? Does not look like Innodb.

            Alternative testcase

            --source include/have_binlog_format_statement.inc
            --source include/have_innodb.inc
            --source include/master-slave.inc
             
            --connection slave
            STOP SLAVE;
            SET GLOBAL slave_parallel_threads=10;
            START SLAVE;
             
            --connection master
            CREATE TABLE t ENGINE=InnoDB AS SELECT 1 c1;
            INSERT INTO t VALUES (1);
            DELETE FROM t WHERE c1=NULL;
            INSERT INTO t VALUES(1) ON DUPLICATE KEY UPDATE c1=c1;
            INSERT INTO t VALUES(1);
            

            Roel Roel Van de Paar added a comment - Alternative testcase --source include/have_binlog_format_statement.inc --source include/have_innodb.inc --source include/master-slave.inc   --connection slave STOP SLAVE; SET GLOBAL slave_parallel_threads=10; START SLAVE;   --connection master CREATE TABLE t ENGINE=InnoDB AS SELECT 1 c1; INSERT INTO t VALUES (1); DELETE FROM t WHERE c1= NULL ; INSERT INTO t VALUES (1) ON DUPLICATE KEY UPDATE c1=c1; INSERT INTO t VALUES (1);
            Roel Roel Van de Paar added a comment - - edited

            Elkin Hi! ramesh only verified the bug in MTR. I can assist.

            The mysql.gtid_slave_pos table is InnoDB and unchanged from default in the CLI.

            The default for the same table in MTR is Aria apparently, where the issue equally reproduces.

            Roel Roel Van de Paar added a comment - - edited Elkin Hi! ramesh only verified the bug in MTR. I can assist. The mysql.gtid_slave_pos table is InnoDB and unchanged from default in the CLI. The default for the same table in MTR is Aria apparently, where the issue equally reproduces.

            The behaviour described is a duplicate of MDEV-10242.
            Please, make sure that mysql.gtid_slave_pos is using a transactional engine whenever testing optimistic or aggressive parallel replication.

            If a bug was observed with mysql.gtid_slave_pos being InnoDB, please describe the expected and actual behaviour independently from this one, which is caused by the inability of Aria to roll back the update to mysql.gtid_slave_pos.

            knielsen Kristian Nielsen added a comment - The behaviour described is a duplicate of MDEV-10242 . Please, make sure that mysql.gtid_slave_pos is using a transactional engine whenever testing optimistic or aggressive parallel replication. If a bug was observed with mysql.gtid_slave_pos being InnoDB, please describe the expected and actual behaviour independently from this one, which is caused by the inability of Aria to roll back the update to mysql.gtid_slave_pos.
            Elkin Andrei Elkin added a comment -

            Howdy Roel, I suggest to add at least SHOW CREATE TABLE for gsp table, for
            being explicit in this part.

            On Mon, 10 Jun 2024, 09:48 Roel Van de Paar (Jira), <jira@mariadb.org>

            Elkin Andrei Elkin added a comment - Howdy Roel, I suggest to add at least SHOW CREATE TABLE for gsp table, for being explicit in this part. On Mon, 10 Jun 2024, 09:48 Roel Van de Paar (Jira), <jira@mariadb.org>
            Roel Roel Van de Paar added a comment - - edited

            Hmm yes, it seems various issues currently existing in replication are intertwined:
            1. 'Optimistic' being the default parallel replication mode, which fails when mysql.gtid_slave_pos is Aria/MyISAM.
            2. For MTR, where the gtid_slave_pos table is Aria, it can be MDEV-10242.

            Still, there seems to be more than meets the eye here. For the CLI, the issue reproduces when LIMIT (unsafe SBR stmt) and slave_skip_errors is used:

            ./mtr --mysqld=--slave_skip_errors=ALL test; grep '1964' var/log/mysqld.2.err
            

            --source include/have_binlog_format_statement.inc
            --source include/have_innodb.inc
            --source include/master-slave.inc
             
            --connection slave
            STOP SLAVE;
            ALTER TABLE mysql.gtid_slave_pos ENGINE=InnoDB;   ##### < Change to InnoDB
            SET GLOBAL slave_parallel_threads=10;
            START SLAVE;
             
            --connection master
            SET sql_mode='';
            CREATE TABLE t (c INT) ENGINE=InnoDB;
            INSERT INTO t VALUES (REPEAT (0,200000));
            INSERT INTO t SELECT c FROM t LIMIT 1,1;  ##### < LIMIT
            INSERT INTO t VALUES (1);
            

            (Run until you see '1964' output from the grep, which is generally most of the time)
            1. Here we have an InnoDB gtid_slave_pos, but the same error occurs
            2. Why does a single unsafe SBR statement produce the same outcome?
            3. When the slave_skip_errors=ALL is removed, the issue no longer reproduces - why?
            4. When the slave_skip_errors=ALL is removed, there is no error at all - how can this be?
            etc.

            Roel Roel Van de Paar added a comment - - edited Hmm yes, it seems various issues currently existing in replication are intertwined: 1. 'Optimistic' being the default parallel replication mode, which fails when mysql.gtid_slave_pos is Aria/MyISAM. 2. For MTR, where the gtid_slave_pos table is Aria, it can be MDEV-10242 . Still, there seems to be more than meets the eye here. For the CLI, the issue reproduces when LIMIT (unsafe SBR stmt) and slave_skip_errors is used: . /mtr --mysqld=--slave_skip_errors=ALL test ; grep '1964' var /log/mysqld .2.err --source include/have_binlog_format_statement.inc --source include/have_innodb.inc --source include/master-slave.inc   --connection slave STOP SLAVE; ALTER TABLE mysql.gtid_slave_pos ENGINE=InnoDB; ##### < Change to InnoDB SET GLOBAL slave_parallel_threads=10; START SLAVE;   --connection master SET sql_mode= '' ; CREATE TABLE t (c INT ) ENGINE=InnoDB; INSERT INTO t VALUES (REPEAT (0,200000)); INSERT INTO t SELECT c FROM t LIMIT 1,1; ##### < LIMIT INSERT INTO t VALUES (1); (Run until you see '1964' output from the grep, which is generally most of the time) 1. Here we have an InnoDB gtid_slave_pos, but the same error occurs 2. Why does a single unsafe SBR statement produce the same outcome? 3. When the slave_skip_errors=ALL is removed, the issue no longer reproduces - why? 4. When the slave_skip_errors=ALL is removed, there is no error at all - how can this be? etc.

            Roel, I would recommend you spend most of your test efforts without highly unsafe configurations such as --slave-skip-errors and unsafe statements in --binlog-format=statement. This will make your great efforts in testing a lot more useful. The --slave-skip-errors=ALL in particular is a crazy configuration. Note, "most", but not "all" - it's highly valuable to get these corner cases tested to know about issues in unusual and even nonsensical configurations. But the main challenge in replication for the moment is the many regressions like MDEV-32020 etc. in normal and useful configurations, and fixing these should have priority.

            Note that error 1964 isn't a meaningful identification of "the same error occurs". Error 1964 is always a follow-up error from the real issue. It is thrown by a running transaction T2 in parallel replication if the prior transaction T1 got an error and cannot commit. I'm wondering if we shouldn't simply remove the reporting of 1964 in the error log, it seems to cause more confusion than it helps.

            The issue with --slave-skip-errors=ALL was already discussed in MDEV-33954. Optimistic parallel replication can cause a lot of intermediate errors internally, they are handled transparently by retrying affected transactions. Using --slave-skip-errors with this is currently buggy though (MDEV-33954) and interferes with the error handling, causing these issues to pop up.

            Hope this helps,

            - Kristian.

            knielsen Kristian Nielsen added a comment - Roel, I would recommend you spend most of your test efforts without highly unsafe configurations such as --slave-skip-errors and unsafe statements in --binlog-format=statement. This will make your great efforts in testing a lot more useful. The --slave-skip-errors=ALL in particular is a crazy configuration. Note, "most", but not "all" - it's highly valuable to get these corner cases tested to know about issues in unusual and even nonsensical configurations. But the main challenge in replication for the moment is the many regressions like MDEV-32020 etc. in normal and useful configurations, and fixing these should have priority. Note that error 1964 isn't a meaningful identification of "the same error occurs". Error 1964 is always a follow-up error from the real issue. It is thrown by a running transaction T2 in parallel replication if the prior transaction T1 got an error and cannot commit. I'm wondering if we shouldn't simply remove the reporting of 1964 in the error log, it seems to cause more confusion than it helps. The issue with --slave-skip-errors=ALL was already discussed in MDEV-33954 . Optimistic parallel replication can cause a lot of intermediate errors internally, they are handled transparently by retrying affected transactions. Using --slave-skip-errors with this is currently buggy though ( MDEV-33954 ) and interferes with the error handling, causing these issues to pop up. Hope this helps, - Kristian.

            People

              Elkin Andrei Elkin
              Roel Roel Van de Paar
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.