Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-34010

[ERROR] Slave SQL: Commit failed due to failure of an earlier commit on which this one depends, Gtid ..., Internal MariaDB error code: 1964

Details

    Description

      Likely related: MDEV-9386, MDEV-32372 (related assertion), MDEV-33761, though duplication is not confirmed with any of them.
      Possibly related: MDEV-33909.

      During testing I regularly see this error:

      [ERROR] Slave SQL: Commit failed due to failure of an earlier commit on which this one depends, Gtid 0-1-48, Internal MariaDB error code: 1964
      

      I have reduced various testcases for this sporadic bug. Note tables are by default InnoDB.

      set sql_mode='';
      CREATE TABLE t (c YEAR KEY);
      INSERT INTO t VALUES (ST_GEOMFROMTEXT ('POINT(1 1)'));
      DELETE FROM t WHERE c IN (SELECT 1) LIMIT 2;
      INSERT INTO t VALUES (1)
      

      SET sql_mode='';
      CREATE TABLE t (f INT) ENGINE=mrg_innodb UNION (t);
      CREATE TEMPORARY TABLE t (i INT) ENGINE=InnoDB;
      CREATE TABLE t2 (c1 CHAR(1)) ENGINE=InnoDB;
      DROP TABLE IF EXISTS t,mysqlt,mysqlt2;
      INSERT INTO t VALUES (1);
      DELETE FROM t WHERE f=29;
      INSERT INTO t2 VALUES (1);
      

      SET sql_mode='';
      CREATE TABLE t (c INT);
      INSERT INTO t VALUES (0);
      UPDATE t SET c=REPEAT (0,0) WHERE c=REPEAT (0,0);
      INSERT INTO t VALUES (0);
      

      CREATE TABLE t0 (a BLOB) ;
      CREATE TABLE t3 (a TINYINT,b INT,c CHAR(0),d VARCHAR(0),e VARCHAR(0),f VARBINARY(0),g MEDIUMBLOB,h BLOB,id BIGINT,KEY(b),KEY(e)) ;
      SET SESSION autocommit=0;
      DELETE FROM t0;
      CREATE TABLE t3 (a INT,b INT UNSIGNED,c BINARY (0),d VARCHAR(0),e VARBINARY(0),f VARCHAR(0),g MEDIUMBLOB,h MEDIUMBLOB,id BIGINT,KEY(b),KEY(e)) ;
      INSERT INTO mysql.user SELECT * FROM t0;
      INSERT INTO t0 VALUES (0xCFD0);
      SELECT REPEAT (
      ALTER TABLE t CHANGE COLUMN c c BINARY (0);
      INSERT INTO t0 VALUES (1);
      ALTER TABLE t CHANGE COLUMN a a BINARY (0);
      

      Sporadicity is about 1 in 3, where at least 150+ mariadbd's are started simultaneously (i.e. system load may play a factor).

      All testcases require MBR or SBR to be active (but not RBR!) with slave_parallel_threads >1, best set to =10, so the issue reproduces more readily. Also set --slave_skip_errors=ALL.

      Here are the options I used:

      MASTER_EXTRA="--log_bin=binlog --binlog_format=MIXED --server_id=1"  # or =STATEMENT
      SLAVE_EXTRA="--slave-parallel-threads=10 --slave_skip_errors=ALL --server_id=2"
      

      The bug was manually confirmed in 10.5, 10.6 and 11.5 using trunk builds a few days old. I have also added in-between versions as they are almost definitely affected also.

      As the error happen on the slave and not the master, it is highly likely that this leads to a discrepant slave, likely without being noticed by the main application writing to the master.

      This combined with a very commonplace testcase like the 3rd one, and the issues seen in the wild, makes this bug a top priority to be fixed for parallel slaves, and as such the Critical priority.

      NTS: ~/MDEV-34010

      Attachments

        Issue Links

          Activity

            Roel Roel Van de Paar created issue -
            Roel Roel Van de Paar made changes -
            Field Original Value New Value
            Roel Roel Van de Paar made changes -
            Roel Roel Van de Paar made changes -
            Roel Roel Van de Paar made changes -
            Roel Roel Van de Paar made changes -
            Attachment MDEV-34010.tar.xz [ 73452 ]
            Roel Roel Van de Paar made changes -
            Labels parallelslave sporadic divergent_slave parallelslave sporadic
            Roel Roel Van de Paar made changes -
            Description Likely related: MDEV-9386, MDEV-32372, MDEV-33761, though duplication is not confirmed with any of them. Possibly related: MDEV-33909.

            During testing I regularly see this error:
            {noformat}
            [ERROR] Slave SQL: Commit failed due to failure of an earlier commit on which this one depends, Gtid 0-1-48, Internal MariaDB error code: 1964
            {noformat}

            I have reduced various testcases for this sporadic bug. Note tables are by default InnoDB in the CLI.
            {code:sql}
            set sql_mode='';
            CREATE TABLE t (c YEAR KEY);
            INSERT INTO t VALUES (ST_GEOMFROMTEXT ('POINT(1 1)'));
            DELETE FROM t WHERE c IN (SELECT 1) LIMIT 2;
            INSERT INTO t VALUES (1)
            {code}
            {code:sql}
            SET sql_mode='';
            CREATE TABLE t (f INT) ENGINE=mrg_innodb UNION (t);
            CREATE TEMPORARY TABLE t (i INT) ENGINE=InnoDB;
            CREATE TABLE t2 (c1 CHAR(1)) ENGINE=InnoDB;
            DROP TABLE IF EXISTS t,mysqlt,mysqlt2;
            INSERT INTO t VALUES (1);
            DELETE FROM t WHERE f=29;
            INSERT INTO t2 VALUES (1);
            {code}
            {code:sql}
            SET sql_mode='';
            CREATE TABLE t (c INT);
            INSERT INTO t VALUES (0);
            UPDATE t SET c=REPEAT (0,0) WHERE c=REPEAT (0,0);
            INSERT INTO t VALUES (0);
            {code}
            {code:sql}
            CREATE TABLE t0 (a BLOB) ;
            CREATE TABLE t3 (a TINYINT,b INT,c CHAR(0),d VARCHAR(0),e VARCHAR(0),f VARBINARY(0),g MEDIUMBLOB,h BLOB,id BIGINT,KEY(b),KEY(e)) ;
            SET SESSION autocommit=0;
            DELETE FROM t0;
            CREATE TABLE t3 (a INT,b INT UNSIGNED,c BINARY (0),d VARCHAR(0),e VARBINARY(0),f VARCHAR(0),g MEDIUMBLOB,h MEDIUMBLOB,id BIGINT,KEY(b),KEY(e)) ;
            INSERT INTO mysql.user SELECT * FROM t0;
            INSERT INTO t0 VALUES (0xCFD0);
            SELECT REPEAT (
            ALTER TABLE t CHANGE COLUMN c c BINARY (0);
            INSERT INTO t0 VALUES (1);
            ALTER TABLE t CHANGE COLUMN a a BINARY (0);
            {code}

            Sporadicity is about 1 in 3, where at least 150+ mariadbd's are started simultaneously (i.e. system load may play a factor).

            All testcases require MBR or SBR to be active (but not RBR!) with slave_parallel_threads >1, best set to =10, so the issue reproduces more readily. Also set --slave_skip_errors=ALL.

            Here are the options I used:
            {noformat}
            MASTER_EXTRA="--log_bin=binlog --binlog_format=MIXED --server_id=1" # or =STATEMENT
            SLAVE_EXTRA="--slave-parallel-threads=10 --slave_skip_errors=ALL --server_id=2"
            {noformat}

            The bug was manually confirmed in 10.5, 10.6 and 11.5 using trunk builds a few days old. I have also added in-between versions as they are almost definitely affected also.

            As the error happen on the slave and not the master, it is highly likely that this leads to a discrepant slave, likely without being noticed by the main application writing to the master.

            This combined with a very commonplace testcase like the 3rd one, and the issues seen in the wild, makes this bug a top priority to be fixed for parallel slaves, and as such the Critical priority.
            Likely related: MDEV-9386, MDEV-32372, MDEV-33761, though duplication is not confirmed with any of them. Possibly related: MDEV-33909.

            During testing I regularly see this error:
            {noformat}
            [ERROR] Slave SQL: Commit failed due to failure of an earlier commit on which this one depends, Gtid 0-1-48, Internal MariaDB error code: 1964
            {noformat}

            I have reduced various testcases for this sporadic bug. Note tables are by default InnoDB in the CLI.
            {code:sql:title=Testcase #1}
            set sql_mode='';
            CREATE TABLE t (c YEAR KEY);
            INSERT INTO t VALUES (ST_GEOMFROMTEXT ('POINT(1 1)'));
            DELETE FROM t WHERE c IN (SELECT 1) LIMIT 2;
            INSERT INTO t VALUES (1)
            {code}
            {code:sql}
            SET sql_mode='';
            CREATE TABLE t (f INT) ENGINE=mrg_innodb UNION (t);
            CREATE TEMPORARY TABLE t (i INT) ENGINE=InnoDB;
            CREATE TABLE t2 (c1 CHAR(1)) ENGINE=InnoDB;
            DROP TABLE IF EXISTS t,mysqlt,mysqlt2;
            INSERT INTO t VALUES (1);
            DELETE FROM t WHERE f=29;
            INSERT INTO t2 VALUES (1);
            {code}
            {code:sql}
            SET sql_mode='';
            CREATE TABLE t (c INT);
            INSERT INTO t VALUES (0);
            UPDATE t SET c=REPEAT (0,0) WHERE c=REPEAT (0,0);
            INSERT INTO t VALUES (0);
            {code}
            {code:sql}
            CREATE TABLE t0 (a BLOB) ;
            CREATE TABLE t3 (a TINYINT,b INT,c CHAR(0),d VARCHAR(0),e VARCHAR(0),f VARBINARY(0),g MEDIUMBLOB,h BLOB,id BIGINT,KEY(b),KEY(e)) ;
            SET SESSION autocommit=0;
            DELETE FROM t0;
            CREATE TABLE t3 (a INT,b INT UNSIGNED,c BINARY (0),d VARCHAR(0),e VARBINARY(0),f VARCHAR(0),g MEDIUMBLOB,h MEDIUMBLOB,id BIGINT,KEY(b),KEY(e)) ;
            INSERT INTO mysql.user SELECT * FROM t0;
            INSERT INTO t0 VALUES (0xCFD0);
            SELECT REPEAT (
            ALTER TABLE t CHANGE COLUMN c c BINARY (0);
            INSERT INTO t0 VALUES (1);
            ALTER TABLE t CHANGE COLUMN a a BINARY (0);
            {code}

            Sporadicity is about 1 in 3, where at least 150+ mariadbd's are started simultaneously (i.e. system load may play a factor).

            All testcases require MBR or SBR to be active (but not RBR!) with slave_parallel_threads >1, best set to =10, so the issue reproduces more readily. Also set --slave_skip_errors=ALL.

            Here are the options I used:
            {noformat}
            MASTER_EXTRA="--log_bin=binlog --binlog_format=MIXED --server_id=1" # or =STATEMENT
            SLAVE_EXTRA="--slave-parallel-threads=10 --slave_skip_errors=ALL --server_id=2"
            {noformat}

            The bug was manually confirmed in 10.5, 10.6 and 11.5 using trunk builds a few days old. I have also added in-between versions as they are almost definitely affected also.

            As the error happen on the slave and not the master, it is highly likely that this leads to a discrepant slave, likely without being noticed by the main application writing to the master.

            This combined with a very commonplace testcase like the 3rd one, and the issues seen in the wild, makes this bug a top priority to be fixed for parallel slaves, and as such the Critical priority.
            Roel Roel Van de Paar made changes -
            Description Likely related: MDEV-9386, MDEV-32372, MDEV-33761, though duplication is not confirmed with any of them. Possibly related: MDEV-33909.

            During testing I regularly see this error:
            {noformat}
            [ERROR] Slave SQL: Commit failed due to failure of an earlier commit on which this one depends, Gtid 0-1-48, Internal MariaDB error code: 1964
            {noformat}

            I have reduced various testcases for this sporadic bug. Note tables are by default InnoDB in the CLI.
            {code:sql:title=Testcase #1}
            set sql_mode='';
            CREATE TABLE t (c YEAR KEY);
            INSERT INTO t VALUES (ST_GEOMFROMTEXT ('POINT(1 1)'));
            DELETE FROM t WHERE c IN (SELECT 1) LIMIT 2;
            INSERT INTO t VALUES (1)
            {code}
            {code:sql}
            SET sql_mode='';
            CREATE TABLE t (f INT) ENGINE=mrg_innodb UNION (t);
            CREATE TEMPORARY TABLE t (i INT) ENGINE=InnoDB;
            CREATE TABLE t2 (c1 CHAR(1)) ENGINE=InnoDB;
            DROP TABLE IF EXISTS t,mysqlt,mysqlt2;
            INSERT INTO t VALUES (1);
            DELETE FROM t WHERE f=29;
            INSERT INTO t2 VALUES (1);
            {code}
            {code:sql}
            SET sql_mode='';
            CREATE TABLE t (c INT);
            INSERT INTO t VALUES (0);
            UPDATE t SET c=REPEAT (0,0) WHERE c=REPEAT (0,0);
            INSERT INTO t VALUES (0);
            {code}
            {code:sql}
            CREATE TABLE t0 (a BLOB) ;
            CREATE TABLE t3 (a TINYINT,b INT,c CHAR(0),d VARCHAR(0),e VARCHAR(0),f VARBINARY(0),g MEDIUMBLOB,h BLOB,id BIGINT,KEY(b),KEY(e)) ;
            SET SESSION autocommit=0;
            DELETE FROM t0;
            CREATE TABLE t3 (a INT,b INT UNSIGNED,c BINARY (0),d VARCHAR(0),e VARBINARY(0),f VARCHAR(0),g MEDIUMBLOB,h MEDIUMBLOB,id BIGINT,KEY(b),KEY(e)) ;
            INSERT INTO mysql.user SELECT * FROM t0;
            INSERT INTO t0 VALUES (0xCFD0);
            SELECT REPEAT (
            ALTER TABLE t CHANGE COLUMN c c BINARY (0);
            INSERT INTO t0 VALUES (1);
            ALTER TABLE t CHANGE COLUMN a a BINARY (0);
            {code}

            Sporadicity is about 1 in 3, where at least 150+ mariadbd's are started simultaneously (i.e. system load may play a factor).

            All testcases require MBR or SBR to be active (but not RBR!) with slave_parallel_threads >1, best set to =10, so the issue reproduces more readily. Also set --slave_skip_errors=ALL.

            Here are the options I used:
            {noformat}
            MASTER_EXTRA="--log_bin=binlog --binlog_format=MIXED --server_id=1" # or =STATEMENT
            SLAVE_EXTRA="--slave-parallel-threads=10 --slave_skip_errors=ALL --server_id=2"
            {noformat}

            The bug was manually confirmed in 10.5, 10.6 and 11.5 using trunk builds a few days old. I have also added in-between versions as they are almost definitely affected also.

            As the error happen on the slave and not the master, it is highly likely that this leads to a discrepant slave, likely without being noticed by the main application writing to the master.

            This combined with a very commonplace testcase like the 3rd one, and the issues seen in the wild, makes this bug a top priority to be fixed for parallel slaves, and as such the Critical priority.
            Likely related: MDEV-9386, MDEV-32372, MDEV-33761, though duplication is not confirmed with any of them. Possibly related: MDEV-33909.

            During testing I regularly see this error:
            {noformat}
            [ERROR] Slave SQL: Commit failed due to failure of an earlier commit on which this one depends, Gtid 0-1-48, Internal MariaDB error code: 1964
            {noformat}

            I have reduced various testcases for this sporadic bug. Note tables are by default InnoDB in the CLI.
            {code:sql,title=Testcase #1}
            set sql_mode='';
            CREATE TABLE t (c YEAR KEY);
            INSERT INTO t VALUES (ST_GEOMFROMTEXT ('POINT(1 1)'));
            DELETE FROM t WHERE c IN (SELECT 1) LIMIT 2;
            INSERT INTO t VALUES (1)
            {code}
            {code:sql}
            SET sql_mode='';
            CREATE TABLE t (f INT) ENGINE=mrg_innodb UNION (t);
            CREATE TEMPORARY TABLE t (i INT) ENGINE=InnoDB;
            CREATE TABLE t2 (c1 CHAR(1)) ENGINE=InnoDB;
            DROP TABLE IF EXISTS t,mysqlt,mysqlt2;
            INSERT INTO t VALUES (1);
            DELETE FROM t WHERE f=29;
            INSERT INTO t2 VALUES (1);
            {code}
            {code:sql}
            SET sql_mode='';
            CREATE TABLE t (c INT);
            INSERT INTO t VALUES (0);
            UPDATE t SET c=REPEAT (0,0) WHERE c=REPEAT (0,0);
            INSERT INTO t VALUES (0);
            {code}
            {code:sql}
            CREATE TABLE t0 (a BLOB) ;
            CREATE TABLE t3 (a TINYINT,b INT,c CHAR(0),d VARCHAR(0),e VARCHAR(0),f VARBINARY(0),g MEDIUMBLOB,h BLOB,id BIGINT,KEY(b),KEY(e)) ;
            SET SESSION autocommit=0;
            DELETE FROM t0;
            CREATE TABLE t3 (a INT,b INT UNSIGNED,c BINARY (0),d VARCHAR(0),e VARBINARY(0),f VARCHAR(0),g MEDIUMBLOB,h MEDIUMBLOB,id BIGINT,KEY(b),KEY(e)) ;
            INSERT INTO mysql.user SELECT * FROM t0;
            INSERT INTO t0 VALUES (0xCFD0);
            SELECT REPEAT (
            ALTER TABLE t CHANGE COLUMN c c BINARY (0);
            INSERT INTO t0 VALUES (1);
            ALTER TABLE t CHANGE COLUMN a a BINARY (0);
            {code}

            Sporadicity is about 1 in 3, where at least 150+ mariadbd's are started simultaneously (i.e. system load may play a factor).

            All testcases require MBR or SBR to be active (but not RBR!) with slave_parallel_threads >1, best set to =10, so the issue reproduces more readily. Also set --slave_skip_errors=ALL.

            Here are the options I used:
            {noformat}
            MASTER_EXTRA="--log_bin=binlog --binlog_format=MIXED --server_id=1" # or =STATEMENT
            SLAVE_EXTRA="--slave-parallel-threads=10 --slave_skip_errors=ALL --server_id=2"
            {noformat}

            The bug was manually confirmed in 10.5, 10.6 and 11.5 using trunk builds a few days old. I have also added in-between versions as they are almost definitely affected also.

            As the error happen on the slave and not the master, it is highly likely that this leads to a discrepant slave, likely without being noticed by the main application writing to the master.

            This combined with a very commonplace testcase like the 3rd one, and the issues seen in the wild, makes this bug a top priority to be fixed for parallel slaves, and as such the Critical priority.
            Roel Roel Van de Paar made changes -
            Description Likely related: MDEV-9386, MDEV-32372, MDEV-33761, though duplication is not confirmed with any of them. Possibly related: MDEV-33909.

            During testing I regularly see this error:
            {noformat}
            [ERROR] Slave SQL: Commit failed due to failure of an earlier commit on which this one depends, Gtid 0-1-48, Internal MariaDB error code: 1964
            {noformat}

            I have reduced various testcases for this sporadic bug. Note tables are by default InnoDB in the CLI.
            {code:sql,title=Testcase #1}
            set sql_mode='';
            CREATE TABLE t (c YEAR KEY);
            INSERT INTO t VALUES (ST_GEOMFROMTEXT ('POINT(1 1)'));
            DELETE FROM t WHERE c IN (SELECT 1) LIMIT 2;
            INSERT INTO t VALUES (1)
            {code}
            {code:sql}
            SET sql_mode='';
            CREATE TABLE t (f INT) ENGINE=mrg_innodb UNION (t);
            CREATE TEMPORARY TABLE t (i INT) ENGINE=InnoDB;
            CREATE TABLE t2 (c1 CHAR(1)) ENGINE=InnoDB;
            DROP TABLE IF EXISTS t,mysqlt,mysqlt2;
            INSERT INTO t VALUES (1);
            DELETE FROM t WHERE f=29;
            INSERT INTO t2 VALUES (1);
            {code}
            {code:sql}
            SET sql_mode='';
            CREATE TABLE t (c INT);
            INSERT INTO t VALUES (0);
            UPDATE t SET c=REPEAT (0,0) WHERE c=REPEAT (0,0);
            INSERT INTO t VALUES (0);
            {code}
            {code:sql}
            CREATE TABLE t0 (a BLOB) ;
            CREATE TABLE t3 (a TINYINT,b INT,c CHAR(0),d VARCHAR(0),e VARCHAR(0),f VARBINARY(0),g MEDIUMBLOB,h BLOB,id BIGINT,KEY(b),KEY(e)) ;
            SET SESSION autocommit=0;
            DELETE FROM t0;
            CREATE TABLE t3 (a INT,b INT UNSIGNED,c BINARY (0),d VARCHAR(0),e VARBINARY(0),f VARCHAR(0),g MEDIUMBLOB,h MEDIUMBLOB,id BIGINT,KEY(b),KEY(e)) ;
            INSERT INTO mysql.user SELECT * FROM t0;
            INSERT INTO t0 VALUES (0xCFD0);
            SELECT REPEAT (
            ALTER TABLE t CHANGE COLUMN c c BINARY (0);
            INSERT INTO t0 VALUES (1);
            ALTER TABLE t CHANGE COLUMN a a BINARY (0);
            {code}

            Sporadicity is about 1 in 3, where at least 150+ mariadbd's are started simultaneously (i.e. system load may play a factor).

            All testcases require MBR or SBR to be active (but not RBR!) with slave_parallel_threads >1, best set to =10, so the issue reproduces more readily. Also set --slave_skip_errors=ALL.

            Here are the options I used:
            {noformat}
            MASTER_EXTRA="--log_bin=binlog --binlog_format=MIXED --server_id=1" # or =STATEMENT
            SLAVE_EXTRA="--slave-parallel-threads=10 --slave_skip_errors=ALL --server_id=2"
            {noformat}

            The bug was manually confirmed in 10.5, 10.6 and 11.5 using trunk builds a few days old. I have also added in-between versions as they are almost definitely affected also.

            As the error happen on the slave and not the master, it is highly likely that this leads to a discrepant slave, likely without being noticed by the main application writing to the master.

            This combined with a very commonplace testcase like the 3rd one, and the issues seen in the wild, makes this bug a top priority to be fixed for parallel slaves, and as such the Critical priority.
            Likely related: MDEV-9386, MDEV-32372, MDEV-33761, though duplication is not confirmed with any of them. Possibly related: MDEV-33909.

            During testing I regularly see this error:
            {noformat}
            [ERROR] Slave SQL: Commit failed due to failure of an earlier commit on which this one depends, Gtid 0-1-48, Internal MariaDB error code: 1964
            {noformat}

            I have reduced various testcases for this sporadic bug. Note tables are by default InnoDB in the CLI.
            {code:sql}
            set sql_mode='';
            CREATE TABLE t (c YEAR KEY);
            INSERT INTO t VALUES (ST_GEOMFROMTEXT ('POINT(1 1)'));
            DELETE FROM t WHERE c IN (SELECT 1) LIMIT 2;
            INSERT INTO t VALUES (1)
            {code}
            {code:sql}
            SET sql_mode='';
            CREATE TABLE t (f INT) ENGINE=mrg_innodb UNION (t);
            CREATE TEMPORARY TABLE t (i INT) ENGINE=InnoDB;
            CREATE TABLE t2 (c1 CHAR(1)) ENGINE=InnoDB;
            DROP TABLE IF EXISTS t,mysqlt,mysqlt2;
            INSERT INTO t VALUES (1);
            DELETE FROM t WHERE f=29;
            INSERT INTO t2 VALUES (1);
            {code}
            {code:sql}
            SET sql_mode='';
            CREATE TABLE t (c INT);
            INSERT INTO t VALUES (0);
            UPDATE t SET c=REPEAT (0,0) WHERE c=REPEAT (0,0);
            INSERT INTO t VALUES (0);
            {code}
            {code:sql}
            CREATE TABLE t0 (a BLOB) ;
            CREATE TABLE t3 (a TINYINT,b INT,c CHAR(0),d VARCHAR(0),e VARCHAR(0),f VARBINARY(0),g MEDIUMBLOB,h BLOB,id BIGINT,KEY(b),KEY(e)) ;
            SET SESSION autocommit=0;
            DELETE FROM t0;
            CREATE TABLE t3 (a INT,b INT UNSIGNED,c BINARY (0),d VARCHAR(0),e VARBINARY(0),f VARCHAR(0),g MEDIUMBLOB,h MEDIUMBLOB,id BIGINT,KEY(b),KEY(e)) ;
            INSERT INTO mysql.user SELECT * FROM t0;
            INSERT INTO t0 VALUES (0xCFD0);
            SELECT REPEAT (
            ALTER TABLE t CHANGE COLUMN c c BINARY (0);
            INSERT INTO t0 VALUES (1);
            ALTER TABLE t CHANGE COLUMN a a BINARY (0);
            {code}

            Sporadicity is about 1 in 3, where at least 150+ mariadbd's are started simultaneously (i.e. system load may play a factor).

            All testcases require MBR or SBR to be active (but not RBR!) with slave_parallel_threads >1, best set to =10, so the issue reproduces more readily. Also set --slave_skip_errors=ALL.

            Here are the options I used:
            {noformat}
            MASTER_EXTRA="--log_bin=binlog --binlog_format=MIXED --server_id=1" # or =STATEMENT
            SLAVE_EXTRA="--slave-parallel-threads=10 --slave_skip_errors=ALL --server_id=2"
            {noformat}

            The bug was manually confirmed in 10.5, 10.6 and 11.5 using trunk builds a few days old. I have also added in-between versions as they are almost definitely affected also.

            As the error happen on the slave and not the master, it is highly likely that this leads to a discrepant slave, likely without being noticed by the main application writing to the master.

            This combined with a very commonplace testcase like the 3rd one, and the issues seen in the wild, makes this bug a top priority to be fixed for parallel slaves, and as such the Critical priority.
            Roel Roel Van de Paar added a comment - - edited

            I am able to reliably reproduce the issue about 5-15 times for every 30-40 threads in reducer and using pquery (as uploaded in MDEV-34010.tar.xz).

            Roel Roel Van de Paar added a comment - - edited I am able to reliably reproduce the issue about 5-15 times for every 30-40 threads in reducer and using pquery (as uploaded in MDEV-34010.tar.xz ).
            Roel Roel Van de Paar made changes -
            Description Likely related: MDEV-9386, MDEV-32372, MDEV-33761, though duplication is not confirmed with any of them. Possibly related: MDEV-33909.

            During testing I regularly see this error:
            {noformat}
            [ERROR] Slave SQL: Commit failed due to failure of an earlier commit on which this one depends, Gtid 0-1-48, Internal MariaDB error code: 1964
            {noformat}

            I have reduced various testcases for this sporadic bug. Note tables are by default InnoDB in the CLI.
            {code:sql}
            set sql_mode='';
            CREATE TABLE t (c YEAR KEY);
            INSERT INTO t VALUES (ST_GEOMFROMTEXT ('POINT(1 1)'));
            DELETE FROM t WHERE c IN (SELECT 1) LIMIT 2;
            INSERT INTO t VALUES (1)
            {code}
            {code:sql}
            SET sql_mode='';
            CREATE TABLE t (f INT) ENGINE=mrg_innodb UNION (t);
            CREATE TEMPORARY TABLE t (i INT) ENGINE=InnoDB;
            CREATE TABLE t2 (c1 CHAR(1)) ENGINE=InnoDB;
            DROP TABLE IF EXISTS t,mysqlt,mysqlt2;
            INSERT INTO t VALUES (1);
            DELETE FROM t WHERE f=29;
            INSERT INTO t2 VALUES (1);
            {code}
            {code:sql}
            SET sql_mode='';
            CREATE TABLE t (c INT);
            INSERT INTO t VALUES (0);
            UPDATE t SET c=REPEAT (0,0) WHERE c=REPEAT (0,0);
            INSERT INTO t VALUES (0);
            {code}
            {code:sql}
            CREATE TABLE t0 (a BLOB) ;
            CREATE TABLE t3 (a TINYINT,b INT,c CHAR(0),d VARCHAR(0),e VARCHAR(0),f VARBINARY(0),g MEDIUMBLOB,h BLOB,id BIGINT,KEY(b),KEY(e)) ;
            SET SESSION autocommit=0;
            DELETE FROM t0;
            CREATE TABLE t3 (a INT,b INT UNSIGNED,c BINARY (0),d VARCHAR(0),e VARBINARY(0),f VARCHAR(0),g MEDIUMBLOB,h MEDIUMBLOB,id BIGINT,KEY(b),KEY(e)) ;
            INSERT INTO mysql.user SELECT * FROM t0;
            INSERT INTO t0 VALUES (0xCFD0);
            SELECT REPEAT (
            ALTER TABLE t CHANGE COLUMN c c BINARY (0);
            INSERT INTO t0 VALUES (1);
            ALTER TABLE t CHANGE COLUMN a a BINARY (0);
            {code}

            Sporadicity is about 1 in 3, where at least 150+ mariadbd's are started simultaneously (i.e. system load may play a factor).

            All testcases require MBR or SBR to be active (but not RBR!) with slave_parallel_threads >1, best set to =10, so the issue reproduces more readily. Also set --slave_skip_errors=ALL.

            Here are the options I used:
            {noformat}
            MASTER_EXTRA="--log_bin=binlog --binlog_format=MIXED --server_id=1" # or =STATEMENT
            SLAVE_EXTRA="--slave-parallel-threads=10 --slave_skip_errors=ALL --server_id=2"
            {noformat}

            The bug was manually confirmed in 10.5, 10.6 and 11.5 using trunk builds a few days old. I have also added in-between versions as they are almost definitely affected also.

            As the error happen on the slave and not the master, it is highly likely that this leads to a discrepant slave, likely without being noticed by the main application writing to the master.

            This combined with a very commonplace testcase like the 3rd one, and the issues seen in the wild, makes this bug a top priority to be fixed for parallel slaves, and as such the Critical priority.
            Likely related: MDEV-9386, MDEV-32372 (which is RBR, ref below), MDEV-33761, though duplication is not confirmed with any of them. Possibly related: MDEV-33909.

            During testing I regularly see this error:
            {noformat}
            [ERROR] Slave SQL: Commit failed due to failure of an earlier commit on which this one depends, Gtid 0-1-48, Internal MariaDB error code: 1964
            {noformat}

            I have reduced various testcases for this sporadic bug. Note tables are by default InnoDB in the CLI.
            {code:sql}
            set sql_mode='';
            CREATE TABLE t (c YEAR KEY);
            INSERT INTO t VALUES (ST_GEOMFROMTEXT ('POINT(1 1)'));
            DELETE FROM t WHERE c IN (SELECT 1) LIMIT 2;
            INSERT INTO t VALUES (1)
            {code}
            {code:sql}
            SET sql_mode='';
            CREATE TABLE t (f INT) ENGINE=mrg_innodb UNION (t);
            CREATE TEMPORARY TABLE t (i INT) ENGINE=InnoDB;
            CREATE TABLE t2 (c1 CHAR(1)) ENGINE=InnoDB;
            DROP TABLE IF EXISTS t,mysqlt,mysqlt2;
            INSERT INTO t VALUES (1);
            DELETE FROM t WHERE f=29;
            INSERT INTO t2 VALUES (1);
            {code}
            {code:sql}
            SET sql_mode='';
            CREATE TABLE t (c INT);
            INSERT INTO t VALUES (0);
            UPDATE t SET c=REPEAT (0,0) WHERE c=REPEAT (0,0);
            INSERT INTO t VALUES (0);
            {code}
            {code:sql}
            CREATE TABLE t0 (a BLOB) ;
            CREATE TABLE t3 (a TINYINT,b INT,c CHAR(0),d VARCHAR(0),e VARCHAR(0),f VARBINARY(0),g MEDIUMBLOB,h BLOB,id BIGINT,KEY(b),KEY(e)) ;
            SET SESSION autocommit=0;
            DELETE FROM t0;
            CREATE TABLE t3 (a INT,b INT UNSIGNED,c BINARY (0),d VARCHAR(0),e VARBINARY(0),f VARCHAR(0),g MEDIUMBLOB,h MEDIUMBLOB,id BIGINT,KEY(b),KEY(e)) ;
            INSERT INTO mysql.user SELECT * FROM t0;
            INSERT INTO t0 VALUES (0xCFD0);
            SELECT REPEAT (
            ALTER TABLE t CHANGE COLUMN c c BINARY (0);
            INSERT INTO t0 VALUES (1);
            ALTER TABLE t CHANGE COLUMN a a BINARY (0);
            {code}

            Sporadicity is about 1 in 3, where at least 150+ mariadbd's are started simultaneously (i.e. system load may play a factor).

            All testcases require MBR or SBR to be active (but not RBR!) with slave_parallel_threads >1, best set to =10, so the issue reproduces more readily. Also set --slave_skip_errors=ALL.

            Here are the options I used:
            {noformat}
            MASTER_EXTRA="--log_bin=binlog --binlog_format=MIXED --server_id=1" # or =STATEMENT
            SLAVE_EXTRA="--slave-parallel-threads=10 --slave_skip_errors=ALL --server_id=2"
            {noformat}

            The bug was manually confirmed in 10.5, 10.6 and 11.5 using trunk builds a few days old. I have also added in-between versions as they are almost definitely affected also.

            As the error happen on the slave and not the master, it is highly likely that this leads to a discrepant slave, likely without being noticed by the main application writing to the master.

            This combined with a very commonplace testcase like the 3rd one, and the issues seen in the wild, makes this bug a top priority to be fixed for parallel slaves, and as such the Critical priority.
            Roel Roel Van de Paar made changes -
            Description Likely related: MDEV-9386, MDEV-32372 (which is RBR, ref below), MDEV-33761, though duplication is not confirmed with any of them. Possibly related: MDEV-33909.

            During testing I regularly see this error:
            {noformat}
            [ERROR] Slave SQL: Commit failed due to failure of an earlier commit on which this one depends, Gtid 0-1-48, Internal MariaDB error code: 1964
            {noformat}

            I have reduced various testcases for this sporadic bug. Note tables are by default InnoDB in the CLI.
            {code:sql}
            set sql_mode='';
            CREATE TABLE t (c YEAR KEY);
            INSERT INTO t VALUES (ST_GEOMFROMTEXT ('POINT(1 1)'));
            DELETE FROM t WHERE c IN (SELECT 1) LIMIT 2;
            INSERT INTO t VALUES (1)
            {code}
            {code:sql}
            SET sql_mode='';
            CREATE TABLE t (f INT) ENGINE=mrg_innodb UNION (t);
            CREATE TEMPORARY TABLE t (i INT) ENGINE=InnoDB;
            CREATE TABLE t2 (c1 CHAR(1)) ENGINE=InnoDB;
            DROP TABLE IF EXISTS t,mysqlt,mysqlt2;
            INSERT INTO t VALUES (1);
            DELETE FROM t WHERE f=29;
            INSERT INTO t2 VALUES (1);
            {code}
            {code:sql}
            SET sql_mode='';
            CREATE TABLE t (c INT);
            INSERT INTO t VALUES (0);
            UPDATE t SET c=REPEAT (0,0) WHERE c=REPEAT (0,0);
            INSERT INTO t VALUES (0);
            {code}
            {code:sql}
            CREATE TABLE t0 (a BLOB) ;
            CREATE TABLE t3 (a TINYINT,b INT,c CHAR(0),d VARCHAR(0),e VARCHAR(0),f VARBINARY(0),g MEDIUMBLOB,h BLOB,id BIGINT,KEY(b),KEY(e)) ;
            SET SESSION autocommit=0;
            DELETE FROM t0;
            CREATE TABLE t3 (a INT,b INT UNSIGNED,c BINARY (0),d VARCHAR(0),e VARBINARY(0),f VARCHAR(0),g MEDIUMBLOB,h MEDIUMBLOB,id BIGINT,KEY(b),KEY(e)) ;
            INSERT INTO mysql.user SELECT * FROM t0;
            INSERT INTO t0 VALUES (0xCFD0);
            SELECT REPEAT (
            ALTER TABLE t CHANGE COLUMN c c BINARY (0);
            INSERT INTO t0 VALUES (1);
            ALTER TABLE t CHANGE COLUMN a a BINARY (0);
            {code}

            Sporadicity is about 1 in 3, where at least 150+ mariadbd's are started simultaneously (i.e. system load may play a factor).

            All testcases require MBR or SBR to be active (but not RBR!) with slave_parallel_threads >1, best set to =10, so the issue reproduces more readily. Also set --slave_skip_errors=ALL.

            Here are the options I used:
            {noformat}
            MASTER_EXTRA="--log_bin=binlog --binlog_format=MIXED --server_id=1" # or =STATEMENT
            SLAVE_EXTRA="--slave-parallel-threads=10 --slave_skip_errors=ALL --server_id=2"
            {noformat}

            The bug was manually confirmed in 10.5, 10.6 and 11.5 using trunk builds a few days old. I have also added in-between versions as they are almost definitely affected also.

            As the error happen on the slave and not the master, it is highly likely that this leads to a discrepant slave, likely without being noticed by the main application writing to the master.

            This combined with a very commonplace testcase like the 3rd one, and the issues seen in the wild, makes this bug a top priority to be fixed for parallel slaves, and as such the Critical priority.
            Likely related: MDEV-9386, MDEV-32372 (which is RBR, ref below), MDEV-33761, though duplication is not confirmed with any of them.
            Possibly related: MDEV-33909.

            During testing I regularly see this error:
            {noformat}
            [ERROR] Slave SQL: Commit failed due to failure of an earlier commit on which this one depends, Gtid 0-1-48, Internal MariaDB error code: 1964
            {noformat}

            I have reduced various testcases for this sporadic bug. Note tables are by default InnoDB in the CLI.
            {code:sql}
            set sql_mode='';
            CREATE TABLE t (c YEAR KEY);
            INSERT INTO t VALUES (ST_GEOMFROMTEXT ('POINT(1 1)'));
            DELETE FROM t WHERE c IN (SELECT 1) LIMIT 2;
            INSERT INTO t VALUES (1)
            {code}
            {code:sql}
            SET sql_mode='';
            CREATE TABLE t (f INT) ENGINE=mrg_innodb UNION (t);
            CREATE TEMPORARY TABLE t (i INT) ENGINE=InnoDB;
            CREATE TABLE t2 (c1 CHAR(1)) ENGINE=InnoDB;
            DROP TABLE IF EXISTS t,mysqlt,mysqlt2;
            INSERT INTO t VALUES (1);
            DELETE FROM t WHERE f=29;
            INSERT INTO t2 VALUES (1);
            {code}
            {code:sql}
            SET sql_mode='';
            CREATE TABLE t (c INT);
            INSERT INTO t VALUES (0);
            UPDATE t SET c=REPEAT (0,0) WHERE c=REPEAT (0,0);
            INSERT INTO t VALUES (0);
            {code}
            {code:sql}
            CREATE TABLE t0 (a BLOB) ;
            CREATE TABLE t3 (a TINYINT,b INT,c CHAR(0),d VARCHAR(0),e VARCHAR(0),f VARBINARY(0),g MEDIUMBLOB,h BLOB,id BIGINT,KEY(b),KEY(e)) ;
            SET SESSION autocommit=0;
            DELETE FROM t0;
            CREATE TABLE t3 (a INT,b INT UNSIGNED,c BINARY (0),d VARCHAR(0),e VARBINARY(0),f VARCHAR(0),g MEDIUMBLOB,h MEDIUMBLOB,id BIGINT,KEY(b),KEY(e)) ;
            INSERT INTO mysql.user SELECT * FROM t0;
            INSERT INTO t0 VALUES (0xCFD0);
            SELECT REPEAT (
            ALTER TABLE t CHANGE COLUMN c c BINARY (0);
            INSERT INTO t0 VALUES (1);
            ALTER TABLE t CHANGE COLUMN a a BINARY (0);
            {code}

            Sporadicity is about 1 in 3, where at least 150+ mariadbd's are started simultaneously (i.e. system load may play a factor).

            All testcases require MBR or SBR to be active (but not RBR!) with slave_parallel_threads >1, best set to =10, so the issue reproduces more readily. Also set --slave_skip_errors=ALL.

            Here are the options I used:
            {noformat}
            MASTER_EXTRA="--log_bin=binlog --binlog_format=MIXED --server_id=1" # or =STATEMENT
            SLAVE_EXTRA="--slave-parallel-threads=10 --slave_skip_errors=ALL --server_id=2"
            {noformat}

            The bug was manually confirmed in 10.5, 10.6 and 11.5 using trunk builds a few days old. I have also added in-between versions as they are almost definitely affected also.

            As the error happen on the slave and not the master, it is highly likely that this leads to a discrepant slave, likely without being noticed by the main application writing to the master.

            This combined with a very commonplace testcase like the 3rd one, and the issues seen in the wild, makes this bug a top priority to be fixed for parallel slaves, and as such the Critical priority.
            Roel Roel Van de Paar made changes -
            Comment [ MTR seems to fail to reproduce the issue:
            {code:sql}
            --source include/have_innodb.inc
            --source include/have_binlog_format_statement.inc
            --source include/master-slave.inc
            SET sql_mode='';
            CREATE TABLE t (c INT) ENGINE=InnoDB;
            INSERT INTO t VALUES (0);
            UPDATE t SET c=REPEAT (0,0) WHERE c=REPEAT (0,0);
            --sync_slave_with_master
            INSERT INTO t VALUES (0);
            --source include/rpl_end.inc
            {code}
            Executed as
            {code:bash}
            #Loop MTR on main/test.test till it fails
            LOG="$(mktemp)"; echo "Logfile: ${LOG}"; LOOP=0; while true; do LOOP=$[ ${LOOP} + 1 ]; echo "Loop: ${LOOP}"; ./mtr --mysqld=--slave_skip_errors=ALL --mysqld=--slave-parallel-threads=10 test 2>&1 >>${LOG}; if grep -q '1964' ${LOG}; then break; fi; done
            {code} ]
            Roel Roel Van de Paar added a comment - - edited

            MTR seems to fail to reproduce the issue:

            --source include/have_innodb.inc
            --source include/have_binlog_format_statement.inc
            --source include/master-slave.inc
            SET sql_mode='';
            CREATE TABLE t (c INT) ENGINE=InnoDB;
            INSERT INTO t VALUES (0);
            UPDATE t SET c=REPEAT (0,0) WHERE c=REPEAT (0,0);
            --sync_slave_with_master
            INSERT INTO t VALUES (0);
            --source include/rpl_end.inc
            

            Executed as:

            #Loop MTR on main/test.test till it fails
            LOG="$(mktemp)"; echo "Logfile: ${LOG}"; LOOP=0; while true; do LOOP=$[ ${LOOP} + 1 ]; echo "Loop: ${LOOP}"; ./mtr --mysqld=--slave_skip_errors=ALL --mysqld=--slave-parallel-threads=10 test 2>&1 >>${LOG}; if grep -q '1964' ${LOG}; then break; fi; done
            

            Does not reproduce the issue.

            Roel Roel Van de Paar added a comment - - edited MTR seems to fail to reproduce the issue: --source include/have_innodb.inc --source include/have_binlog_format_statement.inc --source include/master-slave.inc SET sql_mode= '' ; CREATE TABLE t (c INT ) ENGINE=InnoDB; INSERT INTO t VALUES (0); UPDATE t SET c=REPEAT (0,0) WHERE c=REPEAT (0,0); --sync_slave_with_master INSERT INTO t VALUES (0); --source include/rpl_end.inc Executed as: #Loop MTR on main/test.test till it fails LOG= "$(mktemp)" ; echo "Logfile: ${LOG}" ; LOOP=0; while true ; do LOOP=$[ ${LOOP} + 1 ]; echo "Loop: ${LOOP}" ; . /mtr --mysqld=--slave_skip_errors=ALL --mysqld=--slave-parallel-threads=10 test 2>&1 >>${LOG}; if grep -q '1964' ${LOG}; then break ; fi ; done Does not reproduce the issue.
            Roel Roel Van de Paar made changes -
            Description Likely related: MDEV-9386, MDEV-32372 (which is RBR, ref below), MDEV-33761, though duplication is not confirmed with any of them.
            Possibly related: MDEV-33909.

            During testing I regularly see this error:
            {noformat}
            [ERROR] Slave SQL: Commit failed due to failure of an earlier commit on which this one depends, Gtid 0-1-48, Internal MariaDB error code: 1964
            {noformat}

            I have reduced various testcases for this sporadic bug. Note tables are by default InnoDB in the CLI.
            {code:sql}
            set sql_mode='';
            CREATE TABLE t (c YEAR KEY);
            INSERT INTO t VALUES (ST_GEOMFROMTEXT ('POINT(1 1)'));
            DELETE FROM t WHERE c IN (SELECT 1) LIMIT 2;
            INSERT INTO t VALUES (1)
            {code}
            {code:sql}
            SET sql_mode='';
            CREATE TABLE t (f INT) ENGINE=mrg_innodb UNION (t);
            CREATE TEMPORARY TABLE t (i INT) ENGINE=InnoDB;
            CREATE TABLE t2 (c1 CHAR(1)) ENGINE=InnoDB;
            DROP TABLE IF EXISTS t,mysqlt,mysqlt2;
            INSERT INTO t VALUES (1);
            DELETE FROM t WHERE f=29;
            INSERT INTO t2 VALUES (1);
            {code}
            {code:sql}
            SET sql_mode='';
            CREATE TABLE t (c INT);
            INSERT INTO t VALUES (0);
            UPDATE t SET c=REPEAT (0,0) WHERE c=REPEAT (0,0);
            INSERT INTO t VALUES (0);
            {code}
            {code:sql}
            CREATE TABLE t0 (a BLOB) ;
            CREATE TABLE t3 (a TINYINT,b INT,c CHAR(0),d VARCHAR(0),e VARCHAR(0),f VARBINARY(0),g MEDIUMBLOB,h BLOB,id BIGINT,KEY(b),KEY(e)) ;
            SET SESSION autocommit=0;
            DELETE FROM t0;
            CREATE TABLE t3 (a INT,b INT UNSIGNED,c BINARY (0),d VARCHAR(0),e VARBINARY(0),f VARCHAR(0),g MEDIUMBLOB,h MEDIUMBLOB,id BIGINT,KEY(b),KEY(e)) ;
            INSERT INTO mysql.user SELECT * FROM t0;
            INSERT INTO t0 VALUES (0xCFD0);
            SELECT REPEAT (
            ALTER TABLE t CHANGE COLUMN c c BINARY (0);
            INSERT INTO t0 VALUES (1);
            ALTER TABLE t CHANGE COLUMN a a BINARY (0);
            {code}

            Sporadicity is about 1 in 3, where at least 150+ mariadbd's are started simultaneously (i.e. system load may play a factor).

            All testcases require MBR or SBR to be active (but not RBR!) with slave_parallel_threads >1, best set to =10, so the issue reproduces more readily. Also set --slave_skip_errors=ALL.

            Here are the options I used:
            {noformat}
            MASTER_EXTRA="--log_bin=binlog --binlog_format=MIXED --server_id=1" # or =STATEMENT
            SLAVE_EXTRA="--slave-parallel-threads=10 --slave_skip_errors=ALL --server_id=2"
            {noformat}

            The bug was manually confirmed in 10.5, 10.6 and 11.5 using trunk builds a few days old. I have also added in-between versions as they are almost definitely affected also.

            As the error happen on the slave and not the master, it is highly likely that this leads to a discrepant slave, likely without being noticed by the main application writing to the master.

            This combined with a very commonplace testcase like the 3rd one, and the issues seen in the wild, makes this bug a top priority to be fixed for parallel slaves, and as such the Critical priority.
            Likely related: MDEV-9386, MDEV-32372 (related assertion), MDEV-33761, though duplication is not confirmed with any of them.
            Possibly related: MDEV-33909.

            During testing I regularly see this error:
            {noformat}
            [ERROR] Slave SQL: Commit failed due to failure of an earlier commit on which this one depends, Gtid 0-1-48, Internal MariaDB error code: 1964
            {noformat}

            I have reduced various testcases for this sporadic bug. Note tables are by default InnoDB in the CLI.
            {code:sql}
            set sql_mode='';
            CREATE TABLE t (c YEAR KEY);
            INSERT INTO t VALUES (ST_GEOMFROMTEXT ('POINT(1 1)'));
            DELETE FROM t WHERE c IN (SELECT 1) LIMIT 2;
            INSERT INTO t VALUES (1)
            {code}
            {code:sql}
            SET sql_mode='';
            CREATE TABLE t (f INT) ENGINE=mrg_innodb UNION (t);
            CREATE TEMPORARY TABLE t (i INT) ENGINE=InnoDB;
            CREATE TABLE t2 (c1 CHAR(1)) ENGINE=InnoDB;
            DROP TABLE IF EXISTS t,mysqlt,mysqlt2;
            INSERT INTO t VALUES (1);
            DELETE FROM t WHERE f=29;
            INSERT INTO t2 VALUES (1);
            {code}
            {code:sql}
            SET sql_mode='';
            CREATE TABLE t (c INT);
            INSERT INTO t VALUES (0);
            UPDATE t SET c=REPEAT (0,0) WHERE c=REPEAT (0,0);
            INSERT INTO t VALUES (0);
            {code}
            {code:sql}
            CREATE TABLE t0 (a BLOB) ;
            CREATE TABLE t3 (a TINYINT,b INT,c CHAR(0),d VARCHAR(0),e VARCHAR(0),f VARBINARY(0),g MEDIUMBLOB,h BLOB,id BIGINT,KEY(b),KEY(e)) ;
            SET SESSION autocommit=0;
            DELETE FROM t0;
            CREATE TABLE t3 (a INT,b INT UNSIGNED,c BINARY (0),d VARCHAR(0),e VARBINARY(0),f VARCHAR(0),g MEDIUMBLOB,h MEDIUMBLOB,id BIGINT,KEY(b),KEY(e)) ;
            INSERT INTO mysql.user SELECT * FROM t0;
            INSERT INTO t0 VALUES (0xCFD0);
            SELECT REPEAT (
            ALTER TABLE t CHANGE COLUMN c c BINARY (0);
            INSERT INTO t0 VALUES (1);
            ALTER TABLE t CHANGE COLUMN a a BINARY (0);
            {code}

            Sporadicity is about 1 in 3, where at least 150+ mariadbd's are started simultaneously (i.e. system load may play a factor).

            All testcases require MBR or SBR to be active (but not RBR!) with slave_parallel_threads >1, best set to =10, so the issue reproduces more readily. Also set --slave_skip_errors=ALL.

            Here are the options I used:
            {noformat}
            MASTER_EXTRA="--log_bin=binlog --binlog_format=MIXED --server_id=1" # or =STATEMENT
            SLAVE_EXTRA="--slave-parallel-threads=10 --slave_skip_errors=ALL --server_id=2"
            {noformat}

            The bug was manually confirmed in 10.5, 10.6 and 11.5 using trunk builds a few days old. I have also added in-between versions as they are almost definitely affected also.

            As the error happen on the slave and not the master, it is highly likely that this leads to a discrepant slave, likely without being noticed by the main application writing to the master.

            This combined with a very commonplace testcase like the 3rd one, and the issues seen in the wild, makes this bug a top priority to be fixed for parallel slaves, and as such the Critical priority.

            Roel: --slave-skip-errors doesn't work with optimistic parallel replication currently due to MDEV-33954. You will have to avoid using that option in tests until MDEV-33954 is fixed.

            Do any of these reproduce without --slave-skip-errors?

            Also, it was not clear to me from the description exactly what the observed behaviour is and how it differs from expected behaviour. You get a line in the error log about the 1964? Is that the only wrong behaviour observed (stray message in the error log)? Or Are there other errors before it (the 1964 error is only expected as follow-up error from a "real" error in an earlier transaction)? Does replication continue, or does the SQL thread stop from the error?

            I think 1964 cannot be a skippable error, that's a good point, I'll add a note to the MDEV-33954 about this.

            knielsen Kristian Nielsen added a comment - Roel : --slave-skip-errors doesn't work with optimistic parallel replication currently due to MDEV-33954 . You will have to avoid using that option in tests until MDEV-33954 is fixed. Do any of these reproduce without --slave-skip-errors? Also, it was not clear to me from the description exactly what the observed behaviour is and how it differs from expected behaviour. You get a line in the error log about the 1964? Is that the only wrong behaviour observed (stray message in the error log)? Or Are there other errors before it (the 1964 error is only expected as follow-up error from a "real" error in an earlier transaction)? Does replication continue, or does the SQL thread stop from the error? I think 1964 cannot be a skippable error, that's a good point, I'll add a note to the MDEV-33954 about this.
            Roel Roel Van de Paar made changes -
            Description Likely related: MDEV-9386, MDEV-32372 (related assertion), MDEV-33761, though duplication is not confirmed with any of them.
            Possibly related: MDEV-33909.

            During testing I regularly see this error:
            {noformat}
            [ERROR] Slave SQL: Commit failed due to failure of an earlier commit on which this one depends, Gtid 0-1-48, Internal MariaDB error code: 1964
            {noformat}

            I have reduced various testcases for this sporadic bug. Note tables are by default InnoDB in the CLI.
            {code:sql}
            set sql_mode='';
            CREATE TABLE t (c YEAR KEY);
            INSERT INTO t VALUES (ST_GEOMFROMTEXT ('POINT(1 1)'));
            DELETE FROM t WHERE c IN (SELECT 1) LIMIT 2;
            INSERT INTO t VALUES (1)
            {code}
            {code:sql}
            SET sql_mode='';
            CREATE TABLE t (f INT) ENGINE=mrg_innodb UNION (t);
            CREATE TEMPORARY TABLE t (i INT) ENGINE=InnoDB;
            CREATE TABLE t2 (c1 CHAR(1)) ENGINE=InnoDB;
            DROP TABLE IF EXISTS t,mysqlt,mysqlt2;
            INSERT INTO t VALUES (1);
            DELETE FROM t WHERE f=29;
            INSERT INTO t2 VALUES (1);
            {code}
            {code:sql}
            SET sql_mode='';
            CREATE TABLE t (c INT);
            INSERT INTO t VALUES (0);
            UPDATE t SET c=REPEAT (0,0) WHERE c=REPEAT (0,0);
            INSERT INTO t VALUES (0);
            {code}
            {code:sql}
            CREATE TABLE t0 (a BLOB) ;
            CREATE TABLE t3 (a TINYINT,b INT,c CHAR(0),d VARCHAR(0),e VARCHAR(0),f VARBINARY(0),g MEDIUMBLOB,h BLOB,id BIGINT,KEY(b),KEY(e)) ;
            SET SESSION autocommit=0;
            DELETE FROM t0;
            CREATE TABLE t3 (a INT,b INT UNSIGNED,c BINARY (0),d VARCHAR(0),e VARBINARY(0),f VARCHAR(0),g MEDIUMBLOB,h MEDIUMBLOB,id BIGINT,KEY(b),KEY(e)) ;
            INSERT INTO mysql.user SELECT * FROM t0;
            INSERT INTO t0 VALUES (0xCFD0);
            SELECT REPEAT (
            ALTER TABLE t CHANGE COLUMN c c BINARY (0);
            INSERT INTO t0 VALUES (1);
            ALTER TABLE t CHANGE COLUMN a a BINARY (0);
            {code}

            Sporadicity is about 1 in 3, where at least 150+ mariadbd's are started simultaneously (i.e. system load may play a factor).

            All testcases require MBR or SBR to be active (but not RBR!) with slave_parallel_threads >1, best set to =10, so the issue reproduces more readily. Also set --slave_skip_errors=ALL.

            Here are the options I used:
            {noformat}
            MASTER_EXTRA="--log_bin=binlog --binlog_format=MIXED --server_id=1" # or =STATEMENT
            SLAVE_EXTRA="--slave-parallel-threads=10 --slave_skip_errors=ALL --server_id=2"
            {noformat}

            The bug was manually confirmed in 10.5, 10.6 and 11.5 using trunk builds a few days old. I have also added in-between versions as they are almost definitely affected also.

            As the error happen on the slave and not the master, it is highly likely that this leads to a discrepant slave, likely without being noticed by the main application writing to the master.

            This combined with a very commonplace testcase like the 3rd one, and the issues seen in the wild, makes this bug a top priority to be fixed for parallel slaves, and as such the Critical priority.
            Likely related: MDEV-9386, MDEV-32372 (related assertion), MDEV-33761, though duplication is not confirmed with any of them.
            Possibly related: MDEV-33909.

            During testing I regularly see this error:
            {noformat}
            [ERROR] Slave SQL: Commit failed due to failure of an earlier commit on which this one depends, Gtid 0-1-48, Internal MariaDB error code: 1964
            {noformat}

            I have reduced various testcases for this sporadic bug. Note tables are by default InnoDB.
            {code:sql}
            set sql_mode='';
            CREATE TABLE t (c YEAR KEY);
            INSERT INTO t VALUES (ST_GEOMFROMTEXT ('POINT(1 1)'));
            DELETE FROM t WHERE c IN (SELECT 1) LIMIT 2;
            INSERT INTO t VALUES (1)
            {code}
            {code:sql}
            SET sql_mode='';
            CREATE TABLE t (f INT) ENGINE=mrg_innodb UNION (t);
            CREATE TEMPORARY TABLE t (i INT) ENGINE=InnoDB;
            CREATE TABLE t2 (c1 CHAR(1)) ENGINE=InnoDB;
            DROP TABLE IF EXISTS t,mysqlt,mysqlt2;
            INSERT INTO t VALUES (1);
            DELETE FROM t WHERE f=29;
            INSERT INTO t2 VALUES (1);
            {code}
            {code:sql}
            SET sql_mode='';
            CREATE TABLE t (c INT);
            INSERT INTO t VALUES (0);
            UPDATE t SET c=REPEAT (0,0) WHERE c=REPEAT (0,0);
            INSERT INTO t VALUES (0);
            {code}
            {code:sql}
            CREATE TABLE t0 (a BLOB) ;
            CREATE TABLE t3 (a TINYINT,b INT,c CHAR(0),d VARCHAR(0),e VARCHAR(0),f VARBINARY(0),g MEDIUMBLOB,h BLOB,id BIGINT,KEY(b),KEY(e)) ;
            SET SESSION autocommit=0;
            DELETE FROM t0;
            CREATE TABLE t3 (a INT,b INT UNSIGNED,c BINARY (0),d VARCHAR(0),e VARBINARY(0),f VARCHAR(0),g MEDIUMBLOB,h MEDIUMBLOB,id BIGINT,KEY(b),KEY(e)) ;
            INSERT INTO mysql.user SELECT * FROM t0;
            INSERT INTO t0 VALUES (0xCFD0);
            SELECT REPEAT (
            ALTER TABLE t CHANGE COLUMN c c BINARY (0);
            INSERT INTO t0 VALUES (1);
            ALTER TABLE t CHANGE COLUMN a a BINARY (0);
            {code}

            Sporadicity is about 1 in 3, where at least 150+ mariadbd's are started simultaneously (i.e. system load may play a factor).

            All testcases require MBR or SBR to be active (but not RBR!) with slave_parallel_threads >1, best set to =10, so the issue reproduces more readily. Also set --slave_skip_errors=ALL.

            Here are the options I used:
            {noformat}
            MASTER_EXTRA="--log_bin=binlog --binlog_format=MIXED --server_id=1" # or =STATEMENT
            SLAVE_EXTRA="--slave-parallel-threads=10 --slave_skip_errors=ALL --server_id=2"
            {noformat}

            The bug was manually confirmed in 10.5, 10.6 and 11.5 using trunk builds a few days old. I have also added in-between versions as they are almost definitely affected also.

            As the error happen on the slave and not the master, it is highly likely that this leads to a discrepant slave, likely without being noticed by the main application writing to the master.

            This combined with a very commonplace testcase like the 3rd one, and the issues seen in the wild, makes this bug a top priority to be fixed for parallel slaves, and as such the Critical priority.

            Thank you knielsen for the feedback.
            The --slave-skip-errors is indeed required.
            The slave does halt, and 1964 is the main error/reason given for the SQL thread being killed afaics:

            10.5.25 55cb2c2916a26b699ed8260427002243682c7110 (Debug)

            2024-05-02 16:16:47 6 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 4, relay log './slave_pid-relay-bin.000001' position: 4; GTID position ''
            2024-05-02 16:16:50 6 [Note] Error reading relay log event: slave SQL thread was killed
            2024-05-02 16:16:50 16 [ERROR] Slave SQL: Commit failed due to failure of an earlier commit on which this one depends, Gtid 0-1-8, Internal MariaDB error code: 1964
            2024-05-02 16:16:50 16 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
            2024-05-02 16:16:50 16 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'binlog.000001' position 1549; GTID position '0-1-7'
            2024-05-02 16:16:50 6 [Note] Slave SQL thread exiting, replication stopped in log 'binlog.000001' at position 1549; GTID position '0-1-7'
            

            Roel Roel Van de Paar added a comment - Thank you knielsen for the feedback. The --slave-skip-errors is indeed required. The slave does halt, and 1964 is the main error/reason given for the SQL thread being killed afaics: 10.5.25 55cb2c2916a26b699ed8260427002243682c7110 (Debug) 2024-05-02 16:16:47 6 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 4, relay log './slave_pid-relay-bin.000001' position: 4; GTID position '' 2024-05-02 16:16:50 6 [Note] Error reading relay log event: slave SQL thread was killed 2024-05-02 16:16:50 16 [ERROR] Slave SQL: Commit failed due to failure of an earlier commit on which this one depends, Gtid 0-1-8, Internal MariaDB error code: 1964 2024-05-02 16:16:50 16 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964 2024-05-02 16:16:50 16 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'binlog.000001' position 1549; GTID position '0-1-7' 2024-05-02 16:16:50 6 [Note] Slave SQL thread exiting, replication stopped in log 'binlog.000001' at position 1549; GTID position '0-1-7'
            Roel Roel Van de Paar made changes -
            Roel Roel Van de Paar made changes -
            Description Likely related: MDEV-9386, MDEV-32372 (related assertion), MDEV-33761, though duplication is not confirmed with any of them.
            Possibly related: MDEV-33909.

            During testing I regularly see this error:
            {noformat}
            [ERROR] Slave SQL: Commit failed due to failure of an earlier commit on which this one depends, Gtid 0-1-48, Internal MariaDB error code: 1964
            {noformat}

            I have reduced various testcases for this sporadic bug. Note tables are by default InnoDB.
            {code:sql}
            set sql_mode='';
            CREATE TABLE t (c YEAR KEY);
            INSERT INTO t VALUES (ST_GEOMFROMTEXT ('POINT(1 1)'));
            DELETE FROM t WHERE c IN (SELECT 1) LIMIT 2;
            INSERT INTO t VALUES (1)
            {code}
            {code:sql}
            SET sql_mode='';
            CREATE TABLE t (f INT) ENGINE=mrg_innodb UNION (t);
            CREATE TEMPORARY TABLE t (i INT) ENGINE=InnoDB;
            CREATE TABLE t2 (c1 CHAR(1)) ENGINE=InnoDB;
            DROP TABLE IF EXISTS t,mysqlt,mysqlt2;
            INSERT INTO t VALUES (1);
            DELETE FROM t WHERE f=29;
            INSERT INTO t2 VALUES (1);
            {code}
            {code:sql}
            SET sql_mode='';
            CREATE TABLE t (c INT);
            INSERT INTO t VALUES (0);
            UPDATE t SET c=REPEAT (0,0) WHERE c=REPEAT (0,0);
            INSERT INTO t VALUES (0);
            {code}
            {code:sql}
            CREATE TABLE t0 (a BLOB) ;
            CREATE TABLE t3 (a TINYINT,b INT,c CHAR(0),d VARCHAR(0),e VARCHAR(0),f VARBINARY(0),g MEDIUMBLOB,h BLOB,id BIGINT,KEY(b),KEY(e)) ;
            SET SESSION autocommit=0;
            DELETE FROM t0;
            CREATE TABLE t3 (a INT,b INT UNSIGNED,c BINARY (0),d VARCHAR(0),e VARBINARY(0),f VARCHAR(0),g MEDIUMBLOB,h MEDIUMBLOB,id BIGINT,KEY(b),KEY(e)) ;
            INSERT INTO mysql.user SELECT * FROM t0;
            INSERT INTO t0 VALUES (0xCFD0);
            SELECT REPEAT (
            ALTER TABLE t CHANGE COLUMN c c BINARY (0);
            INSERT INTO t0 VALUES (1);
            ALTER TABLE t CHANGE COLUMN a a BINARY (0);
            {code}

            Sporadicity is about 1 in 3, where at least 150+ mariadbd's are started simultaneously (i.e. system load may play a factor).

            All testcases require MBR or SBR to be active (but not RBR!) with slave_parallel_threads >1, best set to =10, so the issue reproduces more readily. Also set --slave_skip_errors=ALL.

            Here are the options I used:
            {noformat}
            MASTER_EXTRA="--log_bin=binlog --binlog_format=MIXED --server_id=1" # or =STATEMENT
            SLAVE_EXTRA="--slave-parallel-threads=10 --slave_skip_errors=ALL --server_id=2"
            {noformat}

            The bug was manually confirmed in 10.5, 10.6 and 11.5 using trunk builds a few days old. I have also added in-between versions as they are almost definitely affected also.

            As the error happen on the slave and not the master, it is highly likely that this leads to a discrepant slave, likely without being noticed by the main application writing to the master.

            This combined with a very commonplace testcase like the 3rd one, and the issues seen in the wild, makes this bug a top priority to be fixed for parallel slaves, and as such the Critical priority.
            Likely related: MDEV-9386, MDEV-32372 (related assertion), MDEV-33761, though duplication is not confirmed with any of them.
            Possibly related: MDEV-33909.

            During testing I regularly see this error:
            {noformat}
            [ERROR] Slave SQL: Commit failed due to failure of an earlier commit on which this one depends, Gtid 0-1-48, Internal MariaDB error code: 1964
            {noformat}

            I have reduced various testcases for this sporadic bug. Note tables are by default InnoDB.
            {code:sql}
            set sql_mode='';
            CREATE TABLE t (c YEAR KEY);
            INSERT INTO t VALUES (ST_GEOMFROMTEXT ('POINT(1 1)'));
            DELETE FROM t WHERE c IN (SELECT 1) LIMIT 2;
            INSERT INTO t VALUES (1)
            {code}
            {code:sql}
            SET sql_mode='';
            CREATE TABLE t (f INT) ENGINE=mrg_innodb UNION (t);
            CREATE TEMPORARY TABLE t (i INT) ENGINE=InnoDB;
            CREATE TABLE t2 (c1 CHAR(1)) ENGINE=InnoDB;
            DROP TABLE IF EXISTS t,mysqlt,mysqlt2;
            INSERT INTO t VALUES (1);
            DELETE FROM t WHERE f=29;
            INSERT INTO t2 VALUES (1);
            {code}
            {code:sql}
            SET sql_mode='';
            CREATE TABLE t (c INT);
            INSERT INTO t VALUES (0);
            UPDATE t SET c=REPEAT (0,0) WHERE c=REPEAT (0,0);
            INSERT INTO t VALUES (0);
            {code}
            {code:sql}
            CREATE TABLE t0 (a BLOB) ;
            CREATE TABLE t3 (a TINYINT,b INT,c CHAR(0),d VARCHAR(0),e VARCHAR(0),f VARBINARY(0),g MEDIUMBLOB,h BLOB,id BIGINT,KEY(b),KEY(e)) ;
            SET SESSION autocommit=0;
            DELETE FROM t0;
            CREATE TABLE t3 (a INT,b INT UNSIGNED,c BINARY (0),d VARCHAR(0),e VARBINARY(0),f VARCHAR(0),g MEDIUMBLOB,h MEDIUMBLOB,id BIGINT,KEY(b),KEY(e)) ;
            INSERT INTO mysql.user SELECT * FROM t0;
            INSERT INTO t0 VALUES (0xCFD0);
            SELECT REPEAT (
            ALTER TABLE t CHANGE COLUMN c c BINARY (0);
            INSERT INTO t0 VALUES (1);
            ALTER TABLE t CHANGE COLUMN a a BINARY (0);
            {code}

            Sporadicity is about 1 in 3, where at least 150+ mariadbd's are started simultaneously (i.e. system load may play a factor).

            All testcases require MBR or SBR to be active (but not RBR!) with slave_parallel_threads >1, best set to =10, so the issue reproduces more readily. Also set --slave_skip_errors=ALL.

            Here are the options I used:
            {noformat}
            MASTER_EXTRA="--log_bin=binlog --binlog_format=MIXED --server_id=1" # or =STATEMENT
            SLAVE_EXTRA="--slave-parallel-threads=10 --slave_skip_errors=ALL --server_id=2"
            {noformat}

            The bug was manually confirmed in 10.5, 10.6 and 11.5 using trunk builds a few days old. I have also added in-between versions as they are almost definitely affected also.

            As the error happen on the slave and not the master, it is highly likely that this leads to a discrepant slave, likely without being noticed by the main application writing to the master.

            This combined with a very commonplace testcase like the 3rd one, and the issues seen in the wild, makes this bug a top priority to be fixed for parallel slaves, and as such the Critical priority.

            NTS: ~/MDEV-34010

            "slave SQL thread was killed" - this normally means that a "STOP SLAVE" was run I think, could that be the case?

            Agree that "Commit failed due to failure of an earlier commit" should not be a root cause of slave error (I'm wondering if the error should ever be printed in the error log at all, as it seems more confusing that helpful to users most of the time? But that's a separate issue). Maybe that error was ignored earlier due to --slave-skip-errors, but then was caught in a later place.

            It's clear that parallel replication is not implemented to correctly work with --slave-skip-errors in current code. Errors should only be ignored once they are confirmed as real errors (non-speculative apply). And some errors (like "Commit failed due to failure of an earlier commit") are needed internally in parallel replication and must never be ignored despite --slave-skip-errors.

            knielsen Kristian Nielsen added a comment - "slave SQL thread was killed" - this normally means that a "STOP SLAVE" was run I think, could that be the case? Agree that "Commit failed due to failure of an earlier commit" should not be a root cause of slave error (I'm wondering if the error should ever be printed in the error log at all, as it seems more confusing that helpful to users most of the time? But that's a separate issue). Maybe that error was ignored earlier due to --slave-skip-errors, but then was caught in a later place. It's clear that parallel replication is not implemented to correctly work with --slave-skip-errors in current code. Errors should only be ignored once they are confirmed as real errors (non-speculative apply). And some errors (like "Commit failed due to failure of an earlier commit") are needed internally in parallel replication and must never be ignored despite --slave-skip-errors.
            Roel Roel Van de Paar added a comment - - edited

            > This normally means that a "STOP SLAVE" was run I think, could that be the case?

            Right, we want to make sure the master is not being terminated. I did further testing to make sure that this was not the case. I inserted a 10 minute sleep at the end of the testcase (#3 used).

            While the sleep is running, the slave has already halted:

            10.5.25 55cb2c2916a26b699ed8260427002243682c7110 (Debug)

            MariaDB [(none)]> show slave status\G
            *************************** 1. row ***************************
                            Slave_IO_State: Waiting for master to send event
                               Master_Host: 127.0.0.1
                               Master_User: repl_user
                               Master_Port: 44373
                             Connect_Retry: 60
                           Master_Log_File: binlog.000001
                       Read_Master_Log_Pos: 1684
                            Relay_Log_File: slave_pid-relay-bin.000002
                             Relay_Log_Pos: 1816
                     Relay_Master_Log_File: binlog.000001
                          Slave_IO_Running: Yes
                         Slave_SQL_Running: No
                           Replicate_Do_DB: 
                       Replicate_Ignore_DB: 
                        Replicate_Do_Table: 
                    Replicate_Ignore_Table: 
                   Replicate_Wild_Do_Table: 
               Replicate_Wild_Ignore_Table: 
                                Last_Errno: 1964
                                Last_Error: Commit failed due to failure of an earlier commit on which this one depends
                              Skip_Counter: 0
                       Exec_Master_Log_Pos: 1520
                           Relay_Log_Space: 2293
                           Until_Condition: None
                            Until_Log_File: 
                             Until_Log_Pos: 0
                        Master_SSL_Allowed: No
                        Master_SSL_CA_File: 
                        Master_SSL_CA_Path: 
                           Master_SSL_Cert: 
                         Master_SSL_Cipher: 
                            Master_SSL_Key: 
                     Seconds_Behind_Master: NULL
             Master_SSL_Verify_Server_Cert: No
                             Last_IO_Errno: 0
                             Last_IO_Error: 
                            Last_SQL_Errno: 1964
                            Last_SQL_Error: Commit failed due to failure of an earlier commit on which this one depends
               Replicate_Ignore_Server_Ids: 
                          Master_Server_Id: 1
                            Master_SSL_Crl: 
                        Master_SSL_Crlpath: 
                                Using_Gtid: Slave_Pos
                               Gtid_IO_Pos: 0-1-8
                   Replicate_Do_Domain_Ids: 
               Replicate_Ignore_Domain_Ids: 
                             Parallel_Mode: optimistic
                                 SQL_Delay: 0
                       SQL_Remaining_Delay: NULL
                   Slave_SQL_Running_State: 
                          Slave_DDL_Groups: 3
            Slave_Non_Transactional_Groups: 2
                Slave_Transactional_Groups: 3
            1 row in set (0.000 sec)
            

            So it looks like the "slave SQL thread was killed" is the direct result of the 1964 afaics.

            10.5.25 55cb2c2916a26b699ed8260427002243682c7110 (Debug)

            2024-05-03  8:49:57 6 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 4, relay log './slave_pid-relay-bin.000001' position: 4; GTID position ''
            2024-05-03  8:50:00 6 [Note] Error reading relay log event: slave SQL thread was killed
            2024-05-03  8:50:00 12 [ERROR] Slave SQL: Commit failed due to failure of an earlier commit on which this one depends, Gtid 0-1-8, Internal MariaDB error code: 1964
            2024-05-03  8:50:00 12 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
            2024-05-03  8:50:00 12 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'binlog.000001' position 1520; GTID position '0-1-7'
            2024-05-03  8:50:00 6 [Note] Slave SQL thread exiting, replication stopped in log 'binlog.000001' at position 1520; GTID position '0-1-7'
            2024-05-03  8:50:00 6 [Note] master was 127.0.0.1:44373
            

            Also noting that issue occurrence is about 1 in 5.

            From another angle, and provided the slave SQL thread termination issue is fixed first, it indeed may (not sure either) make sense to make these silent, provided there is some recovery mechanism for these - i.e. how would a client writing to the master know about these failed slave-side transactions? Catch 22 it seems?

            A manual START SLAVE auto-resolves the issue (verified twice) - it seems that this should not be possible and may lead to data corruption?

            10.5.25 55cb2c2916a26b699ed8260427002243682c7110 (Debug)

            MariaDB [(none)]> start slave;
            Query OK, 0 rows affected (0.004 sec)
             
            MariaDB [(none)]> show slave status\G
            *************************** 1. row ***************************
                            Slave_IO_State: Waiting for master to send event
                               Master_Host: 127.0.0.1
                               Master_User: repl_user
                               Master_Port: 44373
                             Connect_Retry: 60
                           Master_Log_File: binlog.000001
                       Read_Master_Log_Pos: 1684
                            Relay_Log_File: slave_pid-relay-bin.000002
                             Relay_Log_Pos: 1980
                     Relay_Master_Log_File: binlog.000001
                          Slave_IO_Running: Yes
                         Slave_SQL_Running: Yes
                           Replicate_Do_DB: 
                       Replicate_Ignore_DB: 
                        Replicate_Do_Table: 
                    Replicate_Ignore_Table: 
                   Replicate_Wild_Do_Table: 
               Replicate_Wild_Ignore_Table: 
                                Last_Errno: 0
                                Last_Error: 
                              Skip_Counter: 0
                       Exec_Master_Log_Pos: 1684
                           Relay_Log_Space: 2293
                           Until_Condition: None
                            Until_Log_File: 
                             Until_Log_Pos: 0
                        Master_SSL_Allowed: No
                        Master_SSL_CA_File: 
                        Master_SSL_CA_Path: 
                           Master_SSL_Cert: 
                         Master_SSL_Cipher: 
                            Master_SSL_Key: 
                     Seconds_Behind_Master: 0
             Master_SSL_Verify_Server_Cert: No
                             Last_IO_Errno: 0
                             Last_IO_Error: 
                            Last_SQL_Errno: 0
                            Last_SQL_Error: 
               Replicate_Ignore_Server_Ids: 
                          Master_Server_Id: 1
                            Master_SSL_Crl: 
                        Master_SSL_Crlpath: 
                                Using_Gtid: Slave_Pos
                               Gtid_IO_Pos: 0-1-8
                   Replicate_Do_Domain_Ids: 
               Replicate_Ignore_Domain_Ids: 
                             Parallel_Mode: optimistic
                                 SQL_Delay: 0
                       SQL_Remaining_Delay: NULL
                   Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
                          Slave_DDL_Groups: 3
            Slave_Non_Transactional_Groups: 2
                Slave_Transactional_Groups: 4
            1 row in set (0.000 sec)
            

            Roel Roel Van de Paar added a comment - - edited > This normally means that a "STOP SLAVE" was run I think, could that be the case? Right, we want to make sure the master is not being terminated. I did further testing to make sure that this was not the case. I inserted a 10 minute sleep at the end of the testcase (#3 used). While the sleep is running, the slave has already halted: 10.5.25 55cb2c2916a26b699ed8260427002243682c7110 (Debug) MariaDB [(none)]> show slave status\G *************************** 1. row *************************** Slave_IO_State: Waiting for master to send event Master_Host: 127.0.0.1 Master_User: repl_user Master_Port: 44373 Connect_Retry: 60 Master_Log_File: binlog.000001 Read_Master_Log_Pos: 1684 Relay_Log_File: slave_pid-relay-bin.000002 Relay_Log_Pos: 1816 Relay_Master_Log_File: binlog.000001 Slave_IO_Running: Yes Slave_SQL_Running: No Replicate_Do_DB: Replicate_Ignore_DB: Replicate_Do_Table: Replicate_Ignore_Table: Replicate_Wild_Do_Table: Replicate_Wild_Ignore_Table: Last_Errno: 1964 Last_Error: Commit failed due to failure of an earlier commit on which this one depends Skip_Counter: 0 Exec_Master_Log_Pos: 1520 Relay_Log_Space: 2293 Until_Condition: None Until_Log_File: Until_Log_Pos: 0 Master_SSL_Allowed: No Master_SSL_CA_File: Master_SSL_CA_Path: Master_SSL_Cert: Master_SSL_Cipher: Master_SSL_Key: Seconds_Behind_Master: NULL Master_SSL_Verify_Server_Cert: No Last_IO_Errno: 0 Last_IO_Error: Last_SQL_Errno: 1964 Last_SQL_Error: Commit failed due to failure of an earlier commit on which this one depends Replicate_Ignore_Server_Ids: Master_Server_Id: 1 Master_SSL_Crl: Master_SSL_Crlpath: Using_Gtid: Slave_Pos Gtid_IO_Pos: 0-1-8 Replicate_Do_Domain_Ids: Replicate_Ignore_Domain_Ids: Parallel_Mode: optimistic SQL_Delay: 0 SQL_Remaining_Delay: NULL Slave_SQL_Running_State: Slave_DDL_Groups: 3 Slave_Non_Transactional_Groups: 2 Slave_Transactional_Groups: 3 1 row in set (0.000 sec) So it looks like the "slave SQL thread was killed" is the direct result of the 1964 afaics. 10.5.25 55cb2c2916a26b699ed8260427002243682c7110 (Debug) 2024-05-03 8:49:57 6 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 4, relay log './slave_pid-relay-bin.000001' position: 4; GTID position '' 2024-05-03 8:50:00 6 [Note] Error reading relay log event: slave SQL thread was killed 2024-05-03 8:50:00 12 [ERROR] Slave SQL: Commit failed due to failure of an earlier commit on which this one depends, Gtid 0-1-8, Internal MariaDB error code: 1964 2024-05-03 8:50:00 12 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964 2024-05-03 8:50:00 12 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'binlog.000001' position 1520; GTID position '0-1-7' 2024-05-03 8:50:00 6 [Note] Slave SQL thread exiting, replication stopped in log 'binlog.000001' at position 1520; GTID position '0-1-7' 2024-05-03 8:50:00 6 [Note] master was 127.0.0.1:44373 Also noting that issue occurrence is about 1 in 5. From another angle, and provided the slave SQL thread termination issue is fixed first, it indeed may (not sure either) make sense to make these silent, provided there is some recovery mechanism for these - i.e. how would a client writing to the master know about these failed slave-side transactions? Catch 22 it seems? A manual START SLAVE auto-resolves the issue (verified twice) - it seems that this should not be possible and may lead to data corruption? 10.5.25 55cb2c2916a26b699ed8260427002243682c7110 (Debug) MariaDB [(none)]> start slave; Query OK, 0 rows affected (0.004 sec)   MariaDB [(none)]> show slave status\G *************************** 1. row *************************** Slave_IO_State: Waiting for master to send event Master_Host: 127.0.0.1 Master_User: repl_user Master_Port: 44373 Connect_Retry: 60 Master_Log_File: binlog.000001 Read_Master_Log_Pos: 1684 Relay_Log_File: slave_pid-relay-bin.000002 Relay_Log_Pos: 1980 Relay_Master_Log_File: binlog.000001 Slave_IO_Running: Yes Slave_SQL_Running: Yes Replicate_Do_DB: Replicate_Ignore_DB: Replicate_Do_Table: Replicate_Ignore_Table: Replicate_Wild_Do_Table: Replicate_Wild_Ignore_Table: Last_Errno: 0 Last_Error: Skip_Counter: 0 Exec_Master_Log_Pos: 1684 Relay_Log_Space: 2293 Until_Condition: None Until_Log_File: Until_Log_Pos: 0 Master_SSL_Allowed: No Master_SSL_CA_File: Master_SSL_CA_Path: Master_SSL_Cert: Master_SSL_Cipher: Master_SSL_Key: Seconds_Behind_Master: 0 Master_SSL_Verify_Server_Cert: No Last_IO_Errno: 0 Last_IO_Error: Last_SQL_Errno: 0 Last_SQL_Error: Replicate_Ignore_Server_Ids: Master_Server_Id: 1 Master_SSL_Crl: Master_SSL_Crlpath: Using_Gtid: Slave_Pos Gtid_IO_Pos: 0-1-8 Replicate_Do_Domain_Ids: Replicate_Ignore_Domain_Ids: Parallel_Mode: optimistic SQL_Delay: 0 SQL_Remaining_Delay: NULL Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates Slave_DDL_Groups: 3 Slave_Non_Transactional_Groups: 2 Slave_Transactional_Groups: 4 1 row in set (0.000 sec)
            julien.fritsch Julien Fritsch made changes -
            Fix Version/s 11.3 [ 28565 ]
            Elkin Andrei Elkin made changes -
            Priority Critical [ 2 ] Major [ 3 ]
            Elkin Andrei Elkin added a comment - The priority is lowered as --slave-skip-errors=ALL is impractical in most configurations .
            Roel Roel Van de Paar made changes -
            julien.fritsch Julien Fritsch made changes -
            Fix Version/s 11.1 [ 28549 ]
            julien.fritsch Julien Fritsch made changes -
            Fix Version/s 11.2(EOL) [ 28603 ]

            People

              Elkin Andrei Elkin
              Roel Roel Van de Paar
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.