[MDEV-34010] [ERROR] Slave SQL: Commit failed due to failure of an earlier commit on which this one depends, Gtid ..., Internal MariaDB error code: 1964 - Jira

Roel Van de Paar created issue - 2024-04-27 01:59

Roel Van de Paar made changes - 2024-04-27 01:59

Field	Original Value	New Value
Link		This issue relates to TODO-4062 [ TODO-4062 ]

Roel Van de Paar made changes - 2024-04-27 02:00

Link

This issue relates to MDEV-33761 [ MDEV-33761 ]

Roel Van de Paar made changes - 2024-04-27 02:00

Link

This issue relates to MDEV-9386 [ MDEV-9386 ]

Roel Van de Paar made changes - 2024-04-27 02:00

Link

This issue relates to MDEV-32372 [ MDEV-32372 ]

Roel Van de Paar made changes - 2024-04-27 02:09

Attachment

MDEV-34010.tar.xz [ 73452 ]

Roel Van de Paar made changes - 2024-04-27 02:10

Labels

parallelslave sporadic

divergent_slave parallelslave sporadic

Roel Van de Paar made changes - 2024-04-27 02:11

Description

Likely related: MDEV-9386, MDEV-32372, MDEV-33761, though duplication is not confirmed with any of them. Possibly related: MDEV-33909.

During testing I regularly see this error:
{noformat}
[ERROR] Slave SQL: Commit failed due to failure of an earlier commit on which this one depends, Gtid 0-1-48, Internal MariaDB error code: 1964
{noformat}

I have reduced various testcases for this sporadic bug. Note tables are by default InnoDB in the CLI.
{code:sql}
set sql_mode='';
CREATE TABLE t (c YEAR KEY);
INSERT INTO t VALUES (ST_GEOMFROMTEXT ('POINT(1 1)'));
DELETE FROM t WHERE c IN (SELECT 1) LIMIT 2;
INSERT INTO t VALUES (1)
{code}
{code:sql}
SET sql_mode='';
CREATE TABLE t (f INT) ENGINE=mrg_innodb UNION (t);
CREATE TEMPORARY TABLE t (i INT) ENGINE=InnoDB;
CREATE TABLE t2 (c1 CHAR(1)) ENGINE=InnoDB;
DROP TABLE IF EXISTS t,mysqlt,mysqlt2;
INSERT INTO t VALUES (1);
DELETE FROM t WHERE f=29;
INSERT INTO t2 VALUES (1);
{code}
{code:sql}
SET sql_mode='';
CREATE TABLE t (c INT);
INSERT INTO t VALUES (0);
UPDATE t SET c=REPEAT (0,0) WHERE c=REPEAT (0,0);
INSERT INTO t VALUES (0);
{code}
{code:sql}
CREATE TABLE t0 (a BLOB) ;
CREATE TABLE t3 (a TINYINT,b INT,c CHAR(0),d VARCHAR(0),e VARCHAR(0),f VARBINARY(0),g MEDIUMBLOB,h BLOB,id BIGINT,KEY(b),KEY(e)) ;
SET SESSION autocommit=0;
DELETE FROM t0;
CREATE TABLE t3 (a INT,b INT UNSIGNED,c BINARY (0),d VARCHAR(0),e VARBINARY(0),f VARCHAR(0),g MEDIUMBLOB,h MEDIUMBLOB,id BIGINT,KEY(b),KEY(e)) ;
INSERT INTO mysql.user SELECT * FROM t0;
INSERT INTO t0 VALUES (0xCFD0);
SELECT REPEAT (
ALTER TABLE t CHANGE COLUMN c c BINARY (0);
INSERT INTO t0 VALUES (1);
ALTER TABLE t CHANGE COLUMN a a BINARY (0);
{code}

Sporadicity is about 1 in 3, where at least 150+ mariadbd's are started simultaneously (i.e. system load may play a factor).

All testcases require MBR or SBR to be active (but not RBR!) with slave_parallel_threads >1, best set to =10, so the issue reproduces more readily. Also set --slave_skip_errors=ALL.

Here are the options I used:
{noformat}
MASTER_EXTRA="--log_bin=binlog --binlog_format=MIXED --server_id=1" # or =STATEMENT
SLAVE_EXTRA="--slave-parallel-threads=10 --slave_skip_errors=ALL --server_id=2"
{noformat}

The bug was manually confirmed in 10.5, 10.6 and 11.5 using trunk builds a few days old. I have also added in-between versions as they are almost definitely affected also.

As the error happen on the slave and not the master, it is highly likely that this leads to a discrepant slave, likely without being noticed by the main application writing to the master.

This combined with a very commonplace testcase like the 3rd one, and the issues seen in the wild, makes this bug a top priority to be fixed for parallel slaves, and as such the Critical priority.

Likely related: MDEV-9386, MDEV-32372, MDEV-33761, though duplication is not confirmed with any of them. Possibly related: MDEV-33909.

During testing I regularly see this error:
{noformat}
[ERROR] Slave SQL: Commit failed due to failure of an earlier commit on which this one depends, Gtid 0-1-48, Internal MariaDB error code: 1964
{noformat}

I have reduced various testcases for this sporadic bug. Note tables are by default InnoDB in the CLI.
{code:sql:title=Testcase #1}
set sql_mode='';
CREATE TABLE t (c YEAR KEY);
INSERT INTO t VALUES (ST_GEOMFROMTEXT ('POINT(1 1)'));
DELETE FROM t WHERE c IN (SELECT 1) LIMIT 2;
INSERT INTO t VALUES (1)
{code}
{code:sql}
SET sql_mode='';
CREATE TABLE t (f INT) ENGINE=mrg_innodb UNION (t);
CREATE TEMPORARY TABLE t (i INT) ENGINE=InnoDB;
CREATE TABLE t2 (c1 CHAR(1)) ENGINE=InnoDB;
DROP TABLE IF EXISTS t,mysqlt,mysqlt2;
INSERT INTO t VALUES (1);
DELETE FROM t WHERE f=29;
INSERT INTO t2 VALUES (1);
{code}
{code:sql}
SET sql_mode='';
CREATE TABLE t (c INT);
INSERT INTO t VALUES (0);
UPDATE t SET c=REPEAT (0,0) WHERE c=REPEAT (0,0);
INSERT INTO t VALUES (0);
{code}
{code:sql}
CREATE TABLE t0 (a BLOB) ;
CREATE TABLE t3 (a TINYINT,b INT,c CHAR(0),d VARCHAR(0),e VARCHAR(0),f VARBINARY(0),g MEDIUMBLOB,h BLOB,id BIGINT,KEY(b),KEY(e)) ;
SET SESSION autocommit=0;
DELETE FROM t0;
CREATE TABLE t3 (a INT,b INT UNSIGNED,c BINARY (0),d VARCHAR(0),e VARBINARY(0),f VARCHAR(0),g MEDIUMBLOB,h MEDIUMBLOB,id BIGINT,KEY(b),KEY(e)) ;
INSERT INTO mysql.user SELECT * FROM t0;
INSERT INTO t0 VALUES (0xCFD0);
SELECT REPEAT (
ALTER TABLE t CHANGE COLUMN c c BINARY (0);
INSERT INTO t0 VALUES (1);
ALTER TABLE t CHANGE COLUMN a a BINARY (0);
{code}

Sporadicity is about 1 in 3, where at least 150+ mariadbd's are started simultaneously (i.e. system load may play a factor).

All testcases require MBR or SBR to be active (but not RBR!) with slave_parallel_threads >1, best set to =10, so the issue reproduces more readily. Also set --slave_skip_errors=ALL.

Here are the options I used:
{noformat}
MASTER_EXTRA="--log_bin=binlog --binlog_format=MIXED --server_id=1" # or =STATEMENT
SLAVE_EXTRA="--slave-parallel-threads=10 --slave_skip_errors=ALL --server_id=2"
{noformat}

The bug was manually confirmed in 10.5, 10.6 and 11.5 using trunk builds a few days old. I have also added in-between versions as they are almost definitely affected also.

As the error happen on the slave and not the master, it is highly likely that this leads to a discrepant slave, likely without being noticed by the main application writing to the master.

This combined with a very commonplace testcase like the 3rd one, and the issues seen in the wild, makes this bug a top priority to be fixed for parallel slaves, and as such the Critical priority.

Roel Van de Paar made changes - 2024-04-27 02:11

Description

Likely related: MDEV-9386, MDEV-32372, MDEV-33761, though duplication is not confirmed with any of them. Possibly related: MDEV-33909.

During testing I regularly see this error:
{noformat}
[ERROR] Slave SQL: Commit failed due to failure of an earlier commit on which this one depends, Gtid 0-1-48, Internal MariaDB error code: 1964
{noformat}

I have reduced various testcases for this sporadic bug. Note tables are by default InnoDB in the CLI.
{code:sql:title=Testcase #1}
set sql_mode='';
CREATE TABLE t (c YEAR KEY);
INSERT INTO t VALUES (ST_GEOMFROMTEXT ('POINT(1 1)'));
DELETE FROM t WHERE c IN (SELECT 1) LIMIT 2;
INSERT INTO t VALUES (1)
{code}
{code:sql}
SET sql_mode='';
CREATE TABLE t (f INT) ENGINE=mrg_innodb UNION (t);
CREATE TEMPORARY TABLE t (i INT) ENGINE=InnoDB;
CREATE TABLE t2 (c1 CHAR(1)) ENGINE=InnoDB;
DROP TABLE IF EXISTS t,mysqlt,mysqlt2;
INSERT INTO t VALUES (1);
DELETE FROM t WHERE f=29;
INSERT INTO t2 VALUES (1);
{code}
{code:sql}
SET sql_mode='';
CREATE TABLE t (c INT);
INSERT INTO t VALUES (0);
UPDATE t SET c=REPEAT (0,0) WHERE c=REPEAT (0,0);
INSERT INTO t VALUES (0);
{code}
{code:sql}
CREATE TABLE t0 (a BLOB) ;
CREATE TABLE t3 (a TINYINT,b INT,c CHAR(0),d VARCHAR(0),e VARCHAR(0),f VARBINARY(0),g MEDIUMBLOB,h BLOB,id BIGINT,KEY(b),KEY(e)) ;
SET SESSION autocommit=0;
DELETE FROM t0;
CREATE TABLE t3 (a INT,b INT UNSIGNED,c BINARY (0),d VARCHAR(0),e VARBINARY(0),f VARCHAR(0),g MEDIUMBLOB,h MEDIUMBLOB,id BIGINT,KEY(b),KEY(e)) ;
INSERT INTO mysql.user SELECT * FROM t0;
INSERT INTO t0 VALUES (0xCFD0);
SELECT REPEAT (
ALTER TABLE t CHANGE COLUMN c c BINARY (0);
INSERT INTO t0 VALUES (1);
ALTER TABLE t CHANGE COLUMN a a BINARY (0);
{code}

Sporadicity is about 1 in 3, where at least 150+ mariadbd's are started simultaneously (i.e. system load may play a factor).

All testcases require MBR or SBR to be active (but not RBR!) with slave_parallel_threads >1, best set to =10, so the issue reproduces more readily. Also set --slave_skip_errors=ALL.

Here are the options I used:
{noformat}
MASTER_EXTRA="--log_bin=binlog --binlog_format=MIXED --server_id=1" # or =STATEMENT
SLAVE_EXTRA="--slave-parallel-threads=10 --slave_skip_errors=ALL --server_id=2"
{noformat}

The bug was manually confirmed in 10.5, 10.6 and 11.5 using trunk builds a few days old. I have also added in-between versions as they are almost definitely affected also.

As the error happen on the slave and not the master, it is highly likely that this leads to a discrepant slave, likely without being noticed by the main application writing to the master.

This combined with a very commonplace testcase like the 3rd one, and the issues seen in the wild, makes this bug a top priority to be fixed for parallel slaves, and as such the Critical priority.

Likely related: MDEV-9386, MDEV-32372, MDEV-33761, though duplication is not confirmed with any of them. Possibly related: MDEV-33909.

During testing I regularly see this error:
{noformat}
[ERROR] Slave SQL: Commit failed due to failure of an earlier commit on which this one depends, Gtid 0-1-48, Internal MariaDB error code: 1964
{noformat}

I have reduced various testcases for this sporadic bug. Note tables are by default InnoDB in the CLI.
{code:sql,title=Testcase #1}
set sql_mode='';
CREATE TABLE t (c YEAR KEY);
INSERT INTO t VALUES (ST_GEOMFROMTEXT ('POINT(1 1)'));
DELETE FROM t WHERE c IN (SELECT 1) LIMIT 2;
INSERT INTO t VALUES (1)
{code}
{code:sql}
SET sql_mode='';
CREATE TABLE t (f INT) ENGINE=mrg_innodb UNION (t);
CREATE TEMPORARY TABLE t (i INT) ENGINE=InnoDB;
CREATE TABLE t2 (c1 CHAR(1)) ENGINE=InnoDB;
DROP TABLE IF EXISTS t,mysqlt,mysqlt2;
INSERT INTO t VALUES (1);
DELETE FROM t WHERE f=29;
INSERT INTO t2 VALUES (1);
{code}
{code:sql}
SET sql_mode='';
CREATE TABLE t (c INT);
INSERT INTO t VALUES (0);
UPDATE t SET c=REPEAT (0,0) WHERE c=REPEAT (0,0);
INSERT INTO t VALUES (0);
{code}
{code:sql}
CREATE TABLE t0 (a BLOB) ;
CREATE TABLE t3 (a TINYINT,b INT,c CHAR(0),d VARCHAR(0),e VARCHAR(0),f VARBINARY(0),g MEDIUMBLOB,h BLOB,id BIGINT,KEY(b),KEY(e)) ;
SET SESSION autocommit=0;
DELETE FROM t0;
CREATE TABLE t3 (a INT,b INT UNSIGNED,c BINARY (0),d VARCHAR(0),e VARBINARY(0),f VARCHAR(0),g MEDIUMBLOB,h MEDIUMBLOB,id BIGINT,KEY(b),KEY(e)) ;
INSERT INTO mysql.user SELECT * FROM t0;
INSERT INTO t0 VALUES (0xCFD0);
SELECT REPEAT (
ALTER TABLE t CHANGE COLUMN c c BINARY (0);
INSERT INTO t0 VALUES (1);
ALTER TABLE t CHANGE COLUMN a a BINARY (0);
{code}

Sporadicity is about 1 in 3, where at least 150+ mariadbd's are started simultaneously (i.e. system load may play a factor).

All testcases require MBR or SBR to be active (but not RBR!) with slave_parallel_threads >1, best set to =10, so the issue reproduces more readily. Also set --slave_skip_errors=ALL.

Here are the options I used:
{noformat}
MASTER_EXTRA="--log_bin=binlog --binlog_format=MIXED --server_id=1" # or =STATEMENT
SLAVE_EXTRA="--slave-parallel-threads=10 --slave_skip_errors=ALL --server_id=2"
{noformat}

The bug was manually confirmed in 10.5, 10.6 and 11.5 using trunk builds a few days old. I have also added in-between versions as they are almost definitely affected also.

As the error happen on the slave and not the master, it is highly likely that this leads to a discrepant slave, likely without being noticed by the main application writing to the master.

This combined with a very commonplace testcase like the 3rd one, and the issues seen in the wild, makes this bug a top priority to be fixed for parallel slaves, and as such the Critical priority.

Roel Van de Paar made changes - 2024-04-27 02:11

Description

Likely related: MDEV-9386, MDEV-32372, MDEV-33761, though duplication is not confirmed with any of them. Possibly related: MDEV-33909.

During testing I regularly see this error:
{noformat}
[ERROR] Slave SQL: Commit failed due to failure of an earlier commit on which this one depends, Gtid 0-1-48, Internal MariaDB error code: 1964
{noformat}

I have reduced various testcases for this sporadic bug. Note tables are by default InnoDB in the CLI.
{code:sql,title=Testcase #1}
set sql_mode='';
CREATE TABLE t (c YEAR KEY);
INSERT INTO t VALUES (ST_GEOMFROMTEXT ('POINT(1 1)'));
DELETE FROM t WHERE c IN (SELECT 1) LIMIT 2;
INSERT INTO t VALUES (1)
{code}
{code:sql}
SET sql_mode='';
CREATE TABLE t (f INT) ENGINE=mrg_innodb UNION (t);
CREATE TEMPORARY TABLE t (i INT) ENGINE=InnoDB;
CREATE TABLE t2 (c1 CHAR(1)) ENGINE=InnoDB;
DROP TABLE IF EXISTS t,mysqlt,mysqlt2;
INSERT INTO t VALUES (1);
DELETE FROM t WHERE f=29;
INSERT INTO t2 VALUES (1);
{code}
{code:sql}
SET sql_mode='';
CREATE TABLE t (c INT);
INSERT INTO t VALUES (0);
UPDATE t SET c=REPEAT (0,0) WHERE c=REPEAT (0,0);
INSERT INTO t VALUES (0);
{code}
{code:sql}
CREATE TABLE t0 (a BLOB) ;
CREATE TABLE t3 (a TINYINT,b INT,c CHAR(0),d VARCHAR(0),e VARCHAR(0),f VARBINARY(0),g MEDIUMBLOB,h BLOB,id BIGINT,KEY(b),KEY(e)) ;
SET SESSION autocommit=0;
DELETE FROM t0;
CREATE TABLE t3 (a INT,b INT UNSIGNED,c BINARY (0),d VARCHAR(0),e VARBINARY(0),f VARCHAR(0),g MEDIUMBLOB,h MEDIUMBLOB,id BIGINT,KEY(b),KEY(e)) ;
INSERT INTO mysql.user SELECT * FROM t0;
INSERT INTO t0 VALUES (0xCFD0);
SELECT REPEAT (
ALTER TABLE t CHANGE COLUMN c c BINARY (0);
INSERT INTO t0 VALUES (1);
ALTER TABLE t CHANGE COLUMN a a BINARY (0);
{code}

Sporadicity is about 1 in 3, where at least 150+ mariadbd's are started simultaneously (i.e. system load may play a factor).

All testcases require MBR or SBR to be active (but not RBR!) with slave_parallel_threads >1, best set to =10, so the issue reproduces more readily. Also set --slave_skip_errors=ALL.

Here are the options I used:
{noformat}
MASTER_EXTRA="--log_bin=binlog --binlog_format=MIXED --server_id=1" # or =STATEMENT
SLAVE_EXTRA="--slave-parallel-threads=10 --slave_skip_errors=ALL --server_id=2"
{noformat}

The bug was manually confirmed in 10.5, 10.6 and 11.5 using trunk builds a few days old. I have also added in-between versions as they are almost definitely affected also.

As the error happen on the slave and not the master, it is highly likely that this leads to a discrepant slave, likely without being noticed by the main application writing to the master.

This combined with a very commonplace testcase like the 3rd one, and the issues seen in the wild, makes this bug a top priority to be fixed for parallel slaves, and as such the Critical priority.

Likely related: MDEV-9386, MDEV-32372, MDEV-33761, though duplication is not confirmed with any of them. Possibly related: MDEV-33909.

During testing I regularly see this error:
{noformat}
[ERROR] Slave SQL: Commit failed due to failure of an earlier commit on which this one depends, Gtid 0-1-48, Internal MariaDB error code: 1964
{noformat}

I have reduced various testcases for this sporadic bug. Note tables are by default InnoDB in the CLI.
{code:sql}
set sql_mode='';
CREATE TABLE t (c YEAR KEY);
INSERT INTO t VALUES (ST_GEOMFROMTEXT ('POINT(1 1)'));
DELETE FROM t WHERE c IN (SELECT 1) LIMIT 2;
INSERT INTO t VALUES (1)
{code}
{code:sql}
SET sql_mode='';
CREATE TABLE t (f INT) ENGINE=mrg_innodb UNION (t);
CREATE TEMPORARY TABLE t (i INT) ENGINE=InnoDB;
CREATE TABLE t2 (c1 CHAR(1)) ENGINE=InnoDB;
DROP TABLE IF EXISTS t,mysqlt,mysqlt2;
INSERT INTO t VALUES (1);
DELETE FROM t WHERE f=29;
INSERT INTO t2 VALUES (1);
{code}
{code:sql}
SET sql_mode='';
CREATE TABLE t (c INT);
INSERT INTO t VALUES (0);
UPDATE t SET c=REPEAT (0,0) WHERE c=REPEAT (0,0);
INSERT INTO t VALUES (0);
{code}
{code:sql}
CREATE TABLE t0 (a BLOB) ;
CREATE TABLE t3 (a TINYINT,b INT,c CHAR(0),d VARCHAR(0),e VARCHAR(0),f VARBINARY(0),g MEDIUMBLOB,h BLOB,id BIGINT,KEY(b),KEY(e)) ;
SET SESSION autocommit=0;
DELETE FROM t0;
CREATE TABLE t3 (a INT,b INT UNSIGNED,c BINARY (0),d VARCHAR(0),e VARBINARY(0),f VARCHAR(0),g MEDIUMBLOB,h MEDIUMBLOB,id BIGINT,KEY(b),KEY(e)) ;
INSERT INTO mysql.user SELECT * FROM t0;
INSERT INTO t0 VALUES (0xCFD0);
SELECT REPEAT (
ALTER TABLE t CHANGE COLUMN c c BINARY (0);
INSERT INTO t0 VALUES (1);
ALTER TABLE t CHANGE COLUMN a a BINARY (0);
{code}

Sporadicity is about 1 in 3, where at least 150+ mariadbd's are started simultaneously (i.e. system load may play a factor).

All testcases require MBR or SBR to be active (but not RBR!) with slave_parallel_threads >1, best set to =10, so the issue reproduces more readily. Also set --slave_skip_errors=ALL.

Here are the options I used:
{noformat}
MASTER_EXTRA="--log_bin=binlog --binlog_format=MIXED --server_id=1" # or =STATEMENT
SLAVE_EXTRA="--slave-parallel-threads=10 --slave_skip_errors=ALL --server_id=2"
{noformat}

The bug was manually confirmed in 10.5, 10.6 and 11.5 using trunk builds a few days old. I have also added in-between versions as they are almost definitely affected also.

As the error happen on the slave and not the master, it is highly likely that this leads to a discrepant slave, likely without being noticed by the main application writing to the master.

This combined with a very commonplace testcase like the 3rd one, and the issues seen in the wild, makes this bug a top priority to be fixed for parallel slaves, and as such the Critical priority.

Roel Van de Paar added a comment - 2024-04-27 02:13 - edited

I am able to reliably reproduce the issue about 5-15 times for every 30-40 threads in reducer and using pquery (as uploaded in MDEV-34010.tar.xz).

Roel Van de Paar added a comment - 2024-04-27 02:13 - edited I am able to reliably reproduce the issue about 5-15 times for every 30-40 threads in reducer and using pquery (as uploaded in MDEV-34010.tar.xz ).

Roel Van de Paar made changes - 2024-04-27 02:52

Description

Likely related: MDEV-9386, MDEV-32372, MDEV-33761, though duplication is not confirmed with any of them. Possibly related: MDEV-33909.

During testing I regularly see this error:
{noformat}
[ERROR] Slave SQL: Commit failed due to failure of an earlier commit on which this one depends, Gtid 0-1-48, Internal MariaDB error code: 1964
{noformat}

I have reduced various testcases for this sporadic bug. Note tables are by default InnoDB in the CLI.
{code:sql}
set sql_mode='';
CREATE TABLE t (c YEAR KEY);
INSERT INTO t VALUES (ST_GEOMFROMTEXT ('POINT(1 1)'));
DELETE FROM t WHERE c IN (SELECT 1) LIMIT 2;
INSERT INTO t VALUES (1)
{code}
{code:sql}
SET sql_mode='';
CREATE TABLE t (f INT) ENGINE=mrg_innodb UNION (t);
CREATE TEMPORARY TABLE t (i INT) ENGINE=InnoDB;
CREATE TABLE t2 (c1 CHAR(1)) ENGINE=InnoDB;
DROP TABLE IF EXISTS t,mysqlt,mysqlt2;
INSERT INTO t VALUES (1);
DELETE FROM t WHERE f=29;
INSERT INTO t2 VALUES (1);
{code}
{code:sql}
SET sql_mode='';
CREATE TABLE t (c INT);
INSERT INTO t VALUES (0);
UPDATE t SET c=REPEAT (0,0) WHERE c=REPEAT (0,0);
INSERT INTO t VALUES (0);
{code}
{code:sql}
CREATE TABLE t0 (a BLOB) ;
CREATE TABLE t3 (a TINYINT,b INT,c CHAR(0),d VARCHAR(0),e VARCHAR(0),f VARBINARY(0),g MEDIUMBLOB,h BLOB,id BIGINT,KEY(b),KEY(e)) ;
SET SESSION autocommit=0;
DELETE FROM t0;
CREATE TABLE t3 (a INT,b INT UNSIGNED,c BINARY (0),d VARCHAR(0),e VARBINARY(0),f VARCHAR(0),g MEDIUMBLOB,h MEDIUMBLOB,id BIGINT,KEY(b),KEY(e)) ;
INSERT INTO mysql.user SELECT * FROM t0;
INSERT INTO t0 VALUES (0xCFD0);
SELECT REPEAT (
ALTER TABLE t CHANGE COLUMN c c BINARY (0);
INSERT INTO t0 VALUES (1);
ALTER TABLE t CHANGE COLUMN a a BINARY (0);
{code}

Sporadicity is about 1 in 3, where at least 150+ mariadbd's are started simultaneously (i.e. system load may play a factor).

All testcases require MBR or SBR to be active (but not RBR!) with slave_parallel_threads >1, best set to =10, so the issue reproduces more readily. Also set --slave_skip_errors=ALL.

Here are the options I used:
{noformat}
MASTER_EXTRA="--log_bin=binlog --binlog_format=MIXED --server_id=1" # or =STATEMENT
SLAVE_EXTRA="--slave-parallel-threads=10 --slave_skip_errors=ALL --server_id=2"
{noformat}

The bug was manually confirmed in 10.5, 10.6 and 11.5 using trunk builds a few days old. I have also added in-between versions as they are almost definitely affected also.

As the error happen on the slave and not the master, it is highly likely that this leads to a discrepant slave, likely without being noticed by the main application writing to the master.

This combined with a very commonplace testcase like the 3rd one, and the issues seen in the wild, makes this bug a top priority to be fixed for parallel slaves, and as such the Critical priority.

Likely related: MDEV-9386, MDEV-32372 (which is RBR, ref below), MDEV-33761, though duplication is not confirmed with any of them. Possibly related: MDEV-33909.

During testing I regularly see this error:
{noformat}
[ERROR] Slave SQL: Commit failed due to failure of an earlier commit on which this one depends, Gtid 0-1-48, Internal MariaDB error code: 1964
{noformat}

I have reduced various testcases for this sporadic bug. Note tables are by default InnoDB in the CLI.
{code:sql}
set sql_mode='';
CREATE TABLE t (c YEAR KEY);
INSERT INTO t VALUES (ST_GEOMFROMTEXT ('POINT(1 1)'));
DELETE FROM t WHERE c IN (SELECT 1) LIMIT 2;
INSERT INTO t VALUES (1)
{code}
{code:sql}
SET sql_mode='';
CREATE TABLE t (f INT) ENGINE=mrg_innodb UNION (t);
CREATE TEMPORARY TABLE t (i INT) ENGINE=InnoDB;
CREATE TABLE t2 (c1 CHAR(1)) ENGINE=InnoDB;
DROP TABLE IF EXISTS t,mysqlt,mysqlt2;
INSERT INTO t VALUES (1);
DELETE FROM t WHERE f=29;
INSERT INTO t2 VALUES (1);
{code}
{code:sql}
SET sql_mode='';
CREATE TABLE t (c INT);
INSERT INTO t VALUES (0);
UPDATE t SET c=REPEAT (0,0) WHERE c=REPEAT (0,0);
INSERT INTO t VALUES (0);
{code}
{code:sql}
CREATE TABLE t0 (a BLOB) ;
CREATE TABLE t3 (a TINYINT,b INT,c CHAR(0),d VARCHAR(0),e VARCHAR(0),f VARBINARY(0),g MEDIUMBLOB,h BLOB,id BIGINT,KEY(b),KEY(e)) ;
SET SESSION autocommit=0;
DELETE FROM t0;
CREATE TABLE t3 (a INT,b INT UNSIGNED,c BINARY (0),d VARCHAR(0),e VARBINARY(0),f VARCHAR(0),g MEDIUMBLOB,h MEDIUMBLOB,id BIGINT,KEY(b),KEY(e)) ;
INSERT INTO mysql.user SELECT * FROM t0;
INSERT INTO t0 VALUES (0xCFD0);
SELECT REPEAT (
ALTER TABLE t CHANGE COLUMN c c BINARY (0);
INSERT INTO t0 VALUES (1);
ALTER TABLE t CHANGE COLUMN a a BINARY (0);
{code}

Sporadicity is about 1 in 3, where at least 150+ mariadbd's are started simultaneously (i.e. system load may play a factor).

All testcases require MBR or SBR to be active (but not RBR!) with slave_parallel_threads >1, best set to =10, so the issue reproduces more readily. Also set --slave_skip_errors=ALL.

Here are the options I used:
{noformat}
MASTER_EXTRA="--log_bin=binlog --binlog_format=MIXED --server_id=1" # or =STATEMENT
SLAVE_EXTRA="--slave-parallel-threads=10 --slave_skip_errors=ALL --server_id=2"
{noformat}

The bug was manually confirmed in 10.5, 10.6 and 11.5 using trunk builds a few days old. I have also added in-between versions as they are almost definitely affected also.

As the error happen on the slave and not the master, it is highly likely that this leads to a discrepant slave, likely without being noticed by the main application writing to the master.

This combined with a very commonplace testcase like the 3rd one, and the issues seen in the wild, makes this bug a top priority to be fixed for parallel slaves, and as such the Critical priority.

Roel Van de Paar made changes - 2024-04-27 02:52

Description

Likely related: MDEV-9386, MDEV-32372 (which is RBR, ref below), MDEV-33761, though duplication is not confirmed with any of them. Possibly related: MDEV-33909.

During testing I regularly see this error:
{noformat}
[ERROR] Slave SQL: Commit failed due to failure of an earlier commit on which this one depends, Gtid 0-1-48, Internal MariaDB error code: 1964
{noformat}

I have reduced various testcases for this sporadic bug. Note tables are by default InnoDB in the CLI.
{code:sql}
set sql_mode='';
CREATE TABLE t (c YEAR KEY);
INSERT INTO t VALUES (ST_GEOMFROMTEXT ('POINT(1 1)'));
DELETE FROM t WHERE c IN (SELECT 1) LIMIT 2;
INSERT INTO t VALUES (1)
{code}
{code:sql}
SET sql_mode='';
CREATE TABLE t (f INT) ENGINE=mrg_innodb UNION (t);
CREATE TEMPORARY TABLE t (i INT) ENGINE=InnoDB;
CREATE TABLE t2 (c1 CHAR(1)) ENGINE=InnoDB;
DROP TABLE IF EXISTS t,mysqlt,mysqlt2;
INSERT INTO t VALUES (1);
DELETE FROM t WHERE f=29;
INSERT INTO t2 VALUES (1);
{code}
{code:sql}
SET sql_mode='';
CREATE TABLE t (c INT);
INSERT INTO t VALUES (0);
UPDATE t SET c=REPEAT (0,0) WHERE c=REPEAT (0,0);
INSERT INTO t VALUES (0);
{code}
{code:sql}
CREATE TABLE t0 (a BLOB) ;
CREATE TABLE t3 (a TINYINT,b INT,c CHAR(0),d VARCHAR(0),e VARCHAR(0),f VARBINARY(0),g MEDIUMBLOB,h BLOB,id BIGINT,KEY(b),KEY(e)) ;
SET SESSION autocommit=0;
DELETE FROM t0;
CREATE TABLE t3 (a INT,b INT UNSIGNED,c BINARY (0),d VARCHAR(0),e VARBINARY(0),f VARCHAR(0),g MEDIUMBLOB,h MEDIUMBLOB,id BIGINT,KEY(b),KEY(e)) ;
INSERT INTO mysql.user SELECT * FROM t0;
INSERT INTO t0 VALUES (0xCFD0);
SELECT REPEAT (
ALTER TABLE t CHANGE COLUMN c c BINARY (0);
INSERT INTO t0 VALUES (1);
ALTER TABLE t CHANGE COLUMN a a BINARY (0);
{code}

Sporadicity is about 1 in 3, where at least 150+ mariadbd's are started simultaneously (i.e. system load may play a factor).

All testcases require MBR or SBR to be active (but not RBR!) with slave_parallel_threads >1, best set to =10, so the issue reproduces more readily. Also set --slave_skip_errors=ALL.

Here are the options I used:
{noformat}
MASTER_EXTRA="--log_bin=binlog --binlog_format=MIXED --server_id=1" # or =STATEMENT
SLAVE_EXTRA="--slave-parallel-threads=10 --slave_skip_errors=ALL --server_id=2"
{noformat}

The bug was manually confirmed in 10.5, 10.6 and 11.5 using trunk builds a few days old. I have also added in-between versions as they are almost definitely affected also.

As the error happen on the slave and not the master, it is highly likely that this leads to a discrepant slave, likely without being noticed by the main application writing to the master.

This combined with a very commonplace testcase like the 3rd one, and the issues seen in the wild, makes this bug a top priority to be fixed for parallel slaves, and as such the Critical priority.

Likely related: MDEV-9386, MDEV-32372 (which is RBR, ref below), MDEV-33761, though duplication is not confirmed with any of them.
Possibly related: MDEV-33909.

During testing I regularly see this error:
{noformat}
[ERROR] Slave SQL: Commit failed due to failure of an earlier commit on which this one depends, Gtid 0-1-48, Internal MariaDB error code: 1964
{noformat}

I have reduced various testcases for this sporadic bug. Note tables are by default InnoDB in the CLI.
{code:sql}
set sql_mode='';
CREATE TABLE t (c YEAR KEY);
INSERT INTO t VALUES (ST_GEOMFROMTEXT ('POINT(1 1)'));
DELETE FROM t WHERE c IN (SELECT 1) LIMIT 2;
INSERT INTO t VALUES (1)
{code}
{code:sql}
SET sql_mode='';
CREATE TABLE t (f INT) ENGINE=mrg_innodb UNION (t);
CREATE TEMPORARY TABLE t (i INT) ENGINE=InnoDB;
CREATE TABLE t2 (c1 CHAR(1)) ENGINE=InnoDB;
DROP TABLE IF EXISTS t,mysqlt,mysqlt2;
INSERT INTO t VALUES (1);
DELETE FROM t WHERE f=29;
INSERT INTO t2 VALUES (1);
{code}
{code:sql}
SET sql_mode='';
CREATE TABLE t (c INT);
INSERT INTO t VALUES (0);
UPDATE t SET c=REPEAT (0,0) WHERE c=REPEAT (0,0);
INSERT INTO t VALUES (0);
{code}
{code:sql}
CREATE TABLE t0 (a BLOB) ;
CREATE TABLE t3 (a TINYINT,b INT,c CHAR(0),d VARCHAR(0),e VARCHAR(0),f VARBINARY(0),g MEDIUMBLOB,h BLOB,id BIGINT,KEY(b),KEY(e)) ;
SET SESSION autocommit=0;
DELETE FROM t0;
CREATE TABLE t3 (a INT,b INT UNSIGNED,c BINARY (0),d VARCHAR(0),e VARBINARY(0),f VARCHAR(0),g MEDIUMBLOB,h MEDIUMBLOB,id BIGINT,KEY(b),KEY(e)) ;
INSERT INTO mysql.user SELECT * FROM t0;
INSERT INTO t0 VALUES (0xCFD0);
SELECT REPEAT (
ALTER TABLE t CHANGE COLUMN c c BINARY (0);
INSERT INTO t0 VALUES (1);
ALTER TABLE t CHANGE COLUMN a a BINARY (0);
{code}

Sporadicity is about 1 in 3, where at least 150+ mariadbd's are started simultaneously (i.e. system load may play a factor).

All testcases require MBR or SBR to be active (but not RBR!) with slave_parallel_threads >1, best set to =10, so the issue reproduces more readily. Also set --slave_skip_errors=ALL.

Here are the options I used:
{noformat}
MASTER_EXTRA="--log_bin=binlog --binlog_format=MIXED --server_id=1" # or =STATEMENT
SLAVE_EXTRA="--slave-parallel-threads=10 --slave_skip_errors=ALL --server_id=2"
{noformat}

The bug was manually confirmed in 10.5, 10.6 and 11.5 using trunk builds a few days old. I have also added in-between versions as they are almost definitely affected also.

As the error happen on the slave and not the master, it is highly likely that this leads to a discrepant slave, likely without being noticed by the main application writing to the master.

This combined with a very commonplace testcase like the 3rd one, and the issues seen in the wild, makes this bug a top priority to be fixed for parallel slaves, and as such the Critical priority.

Roel Van de Paar made changes - 2024-04-27 03:27

Comment

[ MTR seems to fail to reproduce the issue:
{code:sql}
--source include/have_innodb.inc
--source include/have_binlog_format_statement.inc
--source include/master-slave.inc
SET sql_mode='';
CREATE TABLE t (c INT) ENGINE=InnoDB;
INSERT INTO t VALUES (0);
UPDATE t SET c=REPEAT (0,0) WHERE c=REPEAT (0,0);
--sync_slave_with_master
INSERT INTO t VALUES (0);
--source include/rpl_end.inc
{code}
Executed as
{code:bash}
#Loop MTR on main/test.test till it fails
LOG="$(mktemp)"; echo "Logfile: ${LOG}"; LOOP=0; while true; do LOOP=$[ ${LOOP} + 1 ]; echo "Loop: ${LOOP}"; ./mtr --mysqld=--slave_skip_errors=ALL --mysqld=--slave-parallel-threads=10 test 2>&1 >>${LOG}; if grep -q '1964' ${LOG}; then break; fi; done
{code} ]

Roel Van de Paar added a comment - 2024-04-27 03:27 - edited

MTR seems to fail to reproduce the issue:

--source include/have_innodb.inc

--source include/have_binlog_format_statement.inc

--source include/master-slave.inc

SET sql_mode='';

CREATE TABLE t (c INT) ENGINE=InnoDB;

INSERT INTO t VALUES (0);

UPDATE t SET c=REPEAT (0,0) WHERE c=REPEAT (0,0);

--sync_slave_with_master

INSERT INTO t VALUES (0);

--source include/rpl_end.inc

Executed as:

#Loop MTR on main/test.test till it fails

LOG="$(mktemp)"; echo "Logfile: ${LOG}"; LOOP=0; while true; do LOOP=$[ ${LOOP} + 1 ]; echo "Loop: ${LOOP}"; ./mtr --mysqld=--slave_skip_errors=ALL --mysqld=--slave-parallel-threads=10 test 2>&1 >>${LOG}; if grep -q '1964' ${LOG}; then break; fi; done

Does not reproduce the issue.

Roel Van de Paar added a comment - 2024-04-27 03:27 - edited MTR seems to fail to reproduce the issue: --source include/have_innodb.inc --source include/have_binlog_format_statement.inc --source include/master-slave.inc SET sql_mode= '' ; CREATE TABLE t (c INT ) ENGINE=InnoDB; INSERT INTO t VALUES (0); UPDATE t SET c=REPEAT (0,0) WHERE c=REPEAT (0,0); --sync_slave_with_master INSERT INTO t VALUES (0); --source include/rpl_end.inc Executed as: #Loop MTR on main/test.test till it fails LOG= "$(mktemp)" ; echo "Logfile: ${LOG}" ; LOOP=0; while true ; do LOOP=$[ ${LOOP} + 1 ]; echo "Loop: ${LOOP}" ; . /mtr --mysqld=--slave_skip_errors=ALL --mysqld=--slave-parallel-threads=10 test 2>&1 >>${LOG}; if grep -q '1964' ${LOG}; then break ; fi ; done Does not reproduce the issue.

Roel Van de Paar made changes - 2024-04-27 04:03

Description

Likely related: MDEV-9386, MDEV-32372 (which is RBR, ref below), MDEV-33761, though duplication is not confirmed with any of them.
Possibly related: MDEV-33909.

During testing I regularly see this error:
{noformat}
[ERROR] Slave SQL: Commit failed due to failure of an earlier commit on which this one depends, Gtid 0-1-48, Internal MariaDB error code: 1964
{noformat}

I have reduced various testcases for this sporadic bug. Note tables are by default InnoDB in the CLI.
{code:sql}
set sql_mode='';
CREATE TABLE t (c YEAR KEY);
INSERT INTO t VALUES (ST_GEOMFROMTEXT ('POINT(1 1)'));
DELETE FROM t WHERE c IN (SELECT 1) LIMIT 2;
INSERT INTO t VALUES (1)
{code}
{code:sql}
SET sql_mode='';
CREATE TABLE t (f INT) ENGINE=mrg_innodb UNION (t);
CREATE TEMPORARY TABLE t (i INT) ENGINE=InnoDB;
CREATE TABLE t2 (c1 CHAR(1)) ENGINE=InnoDB;
DROP TABLE IF EXISTS t,mysqlt,mysqlt2;
INSERT INTO t VALUES (1);
DELETE FROM t WHERE f=29;
INSERT INTO t2 VALUES (1);
{code}
{code:sql}
SET sql_mode='';
CREATE TABLE t (c INT);
INSERT INTO t VALUES (0);
UPDATE t SET c=REPEAT (0,0) WHERE c=REPEAT (0,0);
INSERT INTO t VALUES (0);
{code}
{code:sql}
CREATE TABLE t0 (a BLOB) ;
CREATE TABLE t3 (a TINYINT,b INT,c CHAR(0),d VARCHAR(0),e VARCHAR(0),f VARBINARY(0),g MEDIUMBLOB,h BLOB,id BIGINT,KEY(b),KEY(e)) ;
SET SESSION autocommit=0;
DELETE FROM t0;
CREATE TABLE t3 (a INT,b INT UNSIGNED,c BINARY (0),d VARCHAR(0),e VARBINARY(0),f VARCHAR(0),g MEDIUMBLOB,h MEDIUMBLOB,id BIGINT,KEY(b),KEY(e)) ;
INSERT INTO mysql.user SELECT * FROM t0;
INSERT INTO t0 VALUES (0xCFD0);
SELECT REPEAT (
ALTER TABLE t CHANGE COLUMN c c BINARY (0);
INSERT INTO t0 VALUES (1);
ALTER TABLE t CHANGE COLUMN a a BINARY (0);
{code}

Sporadicity is about 1 in 3, where at least 150+ mariadbd's are started simultaneously (i.e. system load may play a factor).

All testcases require MBR or SBR to be active (but not RBR!) with slave_parallel_threads >1, best set to =10, so the issue reproduces more readily. Also set --slave_skip_errors=ALL.

Here are the options I used:
{noformat}
MASTER_EXTRA="--log_bin=binlog --binlog_format=MIXED --server_id=1" # or =STATEMENT
SLAVE_EXTRA="--slave-parallel-threads=10 --slave_skip_errors=ALL --server_id=2"
{noformat}

The bug was manually confirmed in 10.5, 10.6 and 11.5 using trunk builds a few days old. I have also added in-between versions as they are almost definitely affected also.

As the error happen on the slave and not the master, it is highly likely that this leads to a discrepant slave, likely without being noticed by the main application writing to the master.

This combined with a very commonplace testcase like the 3rd one, and the issues seen in the wild, makes this bug a top priority to be fixed for parallel slaves, and as such the Critical priority.

Likely related: MDEV-9386, MDEV-32372 (related assertion), MDEV-33761, though duplication is not confirmed with any of them.
Possibly related: MDEV-33909.

During testing I regularly see this error:
{noformat}
[ERROR] Slave SQL: Commit failed due to failure of an earlier commit on which this one depends, Gtid 0-1-48, Internal MariaDB error code: 1964
{noformat}

I have reduced various testcases for this sporadic bug. Note tables are by default InnoDB in the CLI.
{code:sql}
set sql_mode='';
CREATE TABLE t (c YEAR KEY);
INSERT INTO t VALUES (ST_GEOMFROMTEXT ('POINT(1 1)'));
DELETE FROM t WHERE c IN (SELECT 1) LIMIT 2;
INSERT INTO t VALUES (1)
{code}
{code:sql}
SET sql_mode='';
CREATE TABLE t (f INT) ENGINE=mrg_innodb UNION (t);
CREATE TEMPORARY TABLE t (i INT) ENGINE=InnoDB;
CREATE TABLE t2 (c1 CHAR(1)) ENGINE=InnoDB;
DROP TABLE IF EXISTS t,mysqlt,mysqlt2;
INSERT INTO t VALUES (1);
DELETE FROM t WHERE f=29;
INSERT INTO t2 VALUES (1);
{code}
{code:sql}
SET sql_mode='';
CREATE TABLE t (c INT);
INSERT INTO t VALUES (0);
UPDATE t SET c=REPEAT (0,0) WHERE c=REPEAT (0,0);
INSERT INTO t VALUES (0);
{code}
{code:sql}
CREATE TABLE t0 (a BLOB) ;
CREATE TABLE t3 (a TINYINT,b INT,c CHAR(0),d VARCHAR(0),e VARCHAR(0),f VARBINARY(0),g MEDIUMBLOB,h BLOB,id BIGINT,KEY(b),KEY(e)) ;
SET SESSION autocommit=0;
DELETE FROM t0;
CREATE TABLE t3 (a INT,b INT UNSIGNED,c BINARY (0),d VARCHAR(0),e VARBINARY(0),f VARCHAR(0),g MEDIUMBLOB,h MEDIUMBLOB,id BIGINT,KEY(b),KEY(e)) ;
INSERT INTO mysql.user SELECT * FROM t0;
INSERT INTO t0 VALUES (0xCFD0);
SELECT REPEAT (
ALTER TABLE t CHANGE COLUMN c c BINARY (0);
INSERT INTO t0 VALUES (1);
ALTER TABLE t CHANGE COLUMN a a BINARY (0);
{code}

Sporadicity is about 1 in 3, where at least 150+ mariadbd's are started simultaneously (i.e. system load may play a factor).

All testcases require MBR or SBR to be active (but not RBR!) with slave_parallel_threads >1, best set to =10, so the issue reproduces more readily. Also set --slave_skip_errors=ALL.

Here are the options I used:
{noformat}
MASTER_EXTRA="--log_bin=binlog --binlog_format=MIXED --server_id=1" # or =STATEMENT
SLAVE_EXTRA="--slave-parallel-threads=10 --slave_skip_errors=ALL --server_id=2"
{noformat}

The bug was manually confirmed in 10.5, 10.6 and 11.5 using trunk builds a few days old. I have also added in-between versions as they are almost definitely affected also.

As the error happen on the slave and not the master, it is highly likely that this leads to a discrepant slave, likely without being noticed by the main application writing to the master.

This combined with a very commonplace testcase like the 3rd one, and the issues seen in the wild, makes this bug a top priority to be fixed for parallel slaves, and as such the Critical priority.

Kristian Nielsen added a comment - 2024-04-27 09:44

Roel: --slave-skip-errors doesn't work with optimistic parallel replication currently due to MDEV-33954. You will have to avoid using that option in tests until MDEV-33954 is fixed.

Do any of these reproduce without --slave-skip-errors?

Also, it was not clear to me from the description exactly what the observed behaviour is and how it differs from expected behaviour. You get a line in the error log about the 1964? Is that the only wrong behaviour observed (stray message in the error log)? Or Are there other errors before it (the 1964 error is only expected as follow-up error from a "real" error in an earlier transaction)? Does replication continue, or does the SQL thread stop from the error?

I think 1964 cannot be a skippable error, that's a good point, I'll add a note to the MDEV-33954 about this.

Kristian Nielsen added a comment - 2024-04-27 09:44 Roel : --slave-skip-errors doesn't work with optimistic parallel replication currently due to MDEV-33954 . You will have to avoid using that option in tests until MDEV-33954 is fixed. Do any of these reproduce without --slave-skip-errors? Also, it was not clear to me from the description exactly what the observed behaviour is and how it differs from expected behaviour. You get a line in the error log about the 1964? Is that the only wrong behaviour observed (stray message in the error log)? Or Are there other errors before it (the 1964 error is only expected as follow-up error from a "real" error in an earlier transaction)? Does replication continue, or does the SQL thread stop from the error? I think 1964 cannot be a skippable error, that's a good point, I'll add a note to the MDEV-33954 about this.

Roel Van de Paar made changes - 2024-05-02 06:14

Description

Likely related: MDEV-9386, MDEV-32372 (related assertion), MDEV-33761, though duplication is not confirmed with any of them.
Possibly related: MDEV-33909.

During testing I regularly see this error:
{noformat}
[ERROR] Slave SQL: Commit failed due to failure of an earlier commit on which this one depends, Gtid 0-1-48, Internal MariaDB error code: 1964
{noformat}

I have reduced various testcases for this sporadic bug. Note tables are by default InnoDB in the CLI.
{code:sql}
set sql_mode='';
CREATE TABLE t (c YEAR KEY);
INSERT INTO t VALUES (ST_GEOMFROMTEXT ('POINT(1 1)'));
DELETE FROM t WHERE c IN (SELECT 1) LIMIT 2;
INSERT INTO t VALUES (1)
{code}
{code:sql}
SET sql_mode='';
CREATE TABLE t (f INT) ENGINE=mrg_innodb UNION (t);
CREATE TEMPORARY TABLE t (i INT) ENGINE=InnoDB;
CREATE TABLE t2 (c1 CHAR(1)) ENGINE=InnoDB;
DROP TABLE IF EXISTS t,mysqlt,mysqlt2;
INSERT INTO t VALUES (1);
DELETE FROM t WHERE f=29;
INSERT INTO t2 VALUES (1);
{code}
{code:sql}
SET sql_mode='';
CREATE TABLE t (c INT);
INSERT INTO t VALUES (0);
UPDATE t SET c=REPEAT (0,0) WHERE c=REPEAT (0,0);
INSERT INTO t VALUES (0);
{code}
{code:sql}
CREATE TABLE t0 (a BLOB) ;
CREATE TABLE t3 (a TINYINT,b INT,c CHAR(0),d VARCHAR(0),e VARCHAR(0),f VARBINARY(0),g MEDIUMBLOB,h BLOB,id BIGINT,KEY(b),KEY(e)) ;
SET SESSION autocommit=0;
DELETE FROM t0;
CREATE TABLE t3 (a INT,b INT UNSIGNED,c BINARY (0),d VARCHAR(0),e VARBINARY(0),f VARCHAR(0),g MEDIUMBLOB,h MEDIUMBLOB,id BIGINT,KEY(b),KEY(e)) ;
INSERT INTO mysql.user SELECT * FROM t0;
INSERT INTO t0 VALUES (0xCFD0);
SELECT REPEAT (
ALTER TABLE t CHANGE COLUMN c c BINARY (0);
INSERT INTO t0 VALUES (1);
ALTER TABLE t CHANGE COLUMN a a BINARY (0);
{code}

Sporadicity is about 1 in 3, where at least 150+ mariadbd's are started simultaneously (i.e. system load may play a factor).

All testcases require MBR or SBR to be active (but not RBR!) with slave_parallel_threads >1, best set to =10, so the issue reproduces more readily. Also set --slave_skip_errors=ALL.

Here are the options I used:
{noformat}
MASTER_EXTRA="--log_bin=binlog --binlog_format=MIXED --server_id=1" # or =STATEMENT
SLAVE_EXTRA="--slave-parallel-threads=10 --slave_skip_errors=ALL --server_id=2"
{noformat}

The bug was manually confirmed in 10.5, 10.6 and 11.5 using trunk builds a few days old. I have also added in-between versions as they are almost definitely affected also.

As the error happen on the slave and not the master, it is highly likely that this leads to a discrepant slave, likely without being noticed by the main application writing to the master.

This combined with a very commonplace testcase like the 3rd one, and the issues seen in the wild, makes this bug a top priority to be fixed for parallel slaves, and as such the Critical priority.

Likely related: MDEV-9386, MDEV-32372 (related assertion), MDEV-33761, though duplication is not confirmed with any of them.
Possibly related: MDEV-33909.

During testing I regularly see this error:
{noformat}
[ERROR] Slave SQL: Commit failed due to failure of an earlier commit on which this one depends, Gtid 0-1-48, Internal MariaDB error code: 1964
{noformat}

I have reduced various testcases for this sporadic bug. Note tables are by default InnoDB.
{code:sql}
set sql_mode='';
CREATE TABLE t (c YEAR KEY);
INSERT INTO t VALUES (ST_GEOMFROMTEXT ('POINT(1 1)'));
DELETE FROM t WHERE c IN (SELECT 1) LIMIT 2;
INSERT INTO t VALUES (1)
{code}
{code:sql}
SET sql_mode='';
CREATE TABLE t (f INT) ENGINE=mrg_innodb UNION (t);
CREATE TEMPORARY TABLE t (i INT) ENGINE=InnoDB;
CREATE TABLE t2 (c1 CHAR(1)) ENGINE=InnoDB;
DROP TABLE IF EXISTS t,mysqlt,mysqlt2;
INSERT INTO t VALUES (1);
DELETE FROM t WHERE f=29;
INSERT INTO t2 VALUES (1);
{code}
{code:sql}
SET sql_mode='';
CREATE TABLE t (c INT);
INSERT INTO t VALUES (0);
UPDATE t SET c=REPEAT (0,0) WHERE c=REPEAT (0,0);
INSERT INTO t VALUES (0);
{code}
{code:sql}
CREATE TABLE t0 (a BLOB) ;
CREATE TABLE t3 (a TINYINT,b INT,c CHAR(0),d VARCHAR(0),e VARCHAR(0),f VARBINARY(0),g MEDIUMBLOB,h BLOB,id BIGINT,KEY(b),KEY(e)) ;
SET SESSION autocommit=0;
DELETE FROM t0;
CREATE TABLE t3 (a INT,b INT UNSIGNED,c BINARY (0),d VARCHAR(0),e VARBINARY(0),f VARCHAR(0),g MEDIUMBLOB,h MEDIUMBLOB,id BIGINT,KEY(b),KEY(e)) ;
INSERT INTO mysql.user SELECT * FROM t0;
INSERT INTO t0 VALUES (0xCFD0);
SELECT REPEAT (
ALTER TABLE t CHANGE COLUMN c c BINARY (0);
INSERT INTO t0 VALUES (1);
ALTER TABLE t CHANGE COLUMN a a BINARY (0);
{code}

Sporadicity is about 1 in 3, where at least 150+ mariadbd's are started simultaneously (i.e. system load may play a factor).

All testcases require MBR or SBR to be active (but not RBR!) with slave_parallel_threads >1, best set to =10, so the issue reproduces more readily. Also set --slave_skip_errors=ALL.

Here are the options I used:
{noformat}
MASTER_EXTRA="--log_bin=binlog --binlog_format=MIXED --server_id=1" # or =STATEMENT
SLAVE_EXTRA="--slave-parallel-threads=10 --slave_skip_errors=ALL --server_id=2"
{noformat}

The bug was manually confirmed in 10.5, 10.6 and 11.5 using trunk builds a few days old. I have also added in-between versions as they are almost definitely affected also.

As the error happen on the slave and not the master, it is highly likely that this leads to a discrepant slave, likely without being noticed by the main application writing to the master.

This combined with a very commonplace testcase like the 3rd one, and the issues seen in the wild, makes this bug a top priority to be fixed for parallel slaves, and as such the Critical priority.

Roel Van de Paar added a comment - 2024-05-02 06:21

Thank you knielsen for the feedback.
The --slave-skip-errors is indeed required.
The slave does halt, and 1964 is the main error/reason given for the SQL thread being killed afaics:

10.5.25 55cb2c2916a26b699ed8260427002243682c7110 (Debug)
2024-05-02 16:16:47 6 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 4, relay log './slave_pid-relay-bin.000001' position: 4; GTID position ''
2024-05-02 16:16:50 6 [Note] Error reading relay log event: slave SQL thread was killed
2024-05-02 16:16:50 16 [ERROR] Slave SQL: Commit failed due to failure of an earlier commit on which this one depends, Gtid 0-1-8, Internal MariaDB error code: 1964
2024-05-02 16:16:50 16 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2024-05-02 16:16:50 16 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'binlog.000001' position 1549; GTID position '0-1-7'
2024-05-02 16:16:50 6 [Note] Slave SQL thread exiting, replication stopped in log 'binlog.000001' at position 1549; GTID position '0-1-7'

Roel Van de Paar added a comment - 2024-05-02 06:21 Thank you knielsen for the feedback. The --slave-skip-errors is indeed required. The slave does halt, and 1964 is the main error/reason given for the SQL thread being killed afaics: 10.5.25 55cb2c2916a26b699ed8260427002243682c7110 (Debug) 2024-05-02 16:16:47 6 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 4, relay log './slave_pid-relay-bin.000001' position: 4; GTID position '' 2024-05-02 16:16:50 6 [Note] Error reading relay log event: slave SQL thread was killed 2024-05-02 16:16:50 16 [ERROR] Slave SQL: Commit failed due to failure of an earlier commit on which this one depends, Gtid 0-1-8, Internal MariaDB error code: 1964 2024-05-02 16:16:50 16 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964 2024-05-02 16:16:50 16 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'binlog.000001' position 1549; GTID position '0-1-7' 2024-05-02 16:16:50 6 [Note] Slave SQL thread exiting, replication stopped in log 'binlog.000001' at position 1549; GTID position '0-1-7'

Roel Van de Paar made changes - 2024-05-02 06:22

Link

This issue relates to MDEV-33954 [ MDEV-33954 ]

Roel Van de Paar made changes - 2024-05-02 07:10

Description

Likely related: MDEV-9386, MDEV-32372 (related assertion), MDEV-33761, though duplication is not confirmed with any of them.
Possibly related: MDEV-33909.

During testing I regularly see this error:
{noformat}
[ERROR] Slave SQL: Commit failed due to failure of an earlier commit on which this one depends, Gtid 0-1-48, Internal MariaDB error code: 1964
{noformat}

I have reduced various testcases for this sporadic bug. Note tables are by default InnoDB.
{code:sql}
set sql_mode='';
CREATE TABLE t (c YEAR KEY);
INSERT INTO t VALUES (ST_GEOMFROMTEXT ('POINT(1 1)'));
DELETE FROM t WHERE c IN (SELECT 1) LIMIT 2;
INSERT INTO t VALUES (1)
{code}
{code:sql}
SET sql_mode='';
CREATE TABLE t (f INT) ENGINE=mrg_innodb UNION (t);
CREATE TEMPORARY TABLE t (i INT) ENGINE=InnoDB;
CREATE TABLE t2 (c1 CHAR(1)) ENGINE=InnoDB;
DROP TABLE IF EXISTS t,mysqlt,mysqlt2;
INSERT INTO t VALUES (1);
DELETE FROM t WHERE f=29;
INSERT INTO t2 VALUES (1);
{code}
{code:sql}
SET sql_mode='';
CREATE TABLE t (c INT);
INSERT INTO t VALUES (0);
UPDATE t SET c=REPEAT (0,0) WHERE c=REPEAT (0,0);
INSERT INTO t VALUES (0);
{code}
{code:sql}
CREATE TABLE t0 (a BLOB) ;
CREATE TABLE t3 (a TINYINT,b INT,c CHAR(0),d VARCHAR(0),e VARCHAR(0),f VARBINARY(0),g MEDIUMBLOB,h BLOB,id BIGINT,KEY(b),KEY(e)) ;
SET SESSION autocommit=0;
DELETE FROM t0;
CREATE TABLE t3 (a INT,b INT UNSIGNED,c BINARY (0),d VARCHAR(0),e VARBINARY(0),f VARCHAR(0),g MEDIUMBLOB,h MEDIUMBLOB,id BIGINT,KEY(b),KEY(e)) ;
INSERT INTO mysql.user SELECT * FROM t0;
INSERT INTO t0 VALUES (0xCFD0);
SELECT REPEAT (
ALTER TABLE t CHANGE COLUMN c c BINARY (0);
INSERT INTO t0 VALUES (1);
ALTER TABLE t CHANGE COLUMN a a BINARY (0);
{code}

Sporadicity is about 1 in 3, where at least 150+ mariadbd's are started simultaneously (i.e. system load may play a factor).

All testcases require MBR or SBR to be active (but not RBR!) with slave_parallel_threads >1, best set to =10, so the issue reproduces more readily. Also set --slave_skip_errors=ALL.

Here are the options I used:
{noformat}
MASTER_EXTRA="--log_bin=binlog --binlog_format=MIXED --server_id=1" # or =STATEMENT
SLAVE_EXTRA="--slave-parallel-threads=10 --slave_skip_errors=ALL --server_id=2"
{noformat}

The bug was manually confirmed in 10.5, 10.6 and 11.5 using trunk builds a few days old. I have also added in-between versions as they are almost definitely affected also.

As the error happen on the slave and not the master, it is highly likely that this leads to a discrepant slave, likely without being noticed by the main application writing to the master.

This combined with a very commonplace testcase like the 3rd one, and the issues seen in the wild, makes this bug a top priority to be fixed for parallel slaves, and as such the Critical priority.

Likely related: MDEV-9386, MDEV-32372 (related assertion), MDEV-33761, though duplication is not confirmed with any of them.
Possibly related: MDEV-33909.

During testing I regularly see this error:
{noformat}
[ERROR] Slave SQL: Commit failed due to failure of an earlier commit on which this one depends, Gtid 0-1-48, Internal MariaDB error code: 1964
{noformat}

I have reduced various testcases for this sporadic bug. Note tables are by default InnoDB.
{code:sql}
set sql_mode='';
CREATE TABLE t (c YEAR KEY);
INSERT INTO t VALUES (ST_GEOMFROMTEXT ('POINT(1 1)'));
DELETE FROM t WHERE c IN (SELECT 1) LIMIT 2;
INSERT INTO t VALUES (1)
{code}
{code:sql}
SET sql_mode='';
CREATE TABLE t (f INT) ENGINE=mrg_innodb UNION (t);
CREATE TEMPORARY TABLE t (i INT) ENGINE=InnoDB;
CREATE TABLE t2 (c1 CHAR(1)) ENGINE=InnoDB;
DROP TABLE IF EXISTS t,mysqlt,mysqlt2;
INSERT INTO t VALUES (1);
DELETE FROM t WHERE f=29;
INSERT INTO t2 VALUES (1);
{code}
{code:sql}
SET sql_mode='';
CREATE TABLE t (c INT);
INSERT INTO t VALUES (0);
UPDATE t SET c=REPEAT (0,0) WHERE c=REPEAT (0,0);
INSERT INTO t VALUES (0);
{code}
{code:sql}
CREATE TABLE t0 (a BLOB) ;
CREATE TABLE t3 (a TINYINT,b INT,c CHAR(0),d VARCHAR(0),e VARCHAR(0),f VARBINARY(0),g MEDIUMBLOB,h BLOB,id BIGINT,KEY(b),KEY(e)) ;
SET SESSION autocommit=0;
DELETE FROM t0;
CREATE TABLE t3 (a INT,b INT UNSIGNED,c BINARY (0),d VARCHAR(0),e VARBINARY(0),f VARCHAR(0),g MEDIUMBLOB,h MEDIUMBLOB,id BIGINT,KEY(b),KEY(e)) ;
INSERT INTO mysql.user SELECT * FROM t0;
INSERT INTO t0 VALUES (0xCFD0);
SELECT REPEAT (
ALTER TABLE t CHANGE COLUMN c c BINARY (0);
INSERT INTO t0 VALUES (1);
ALTER TABLE t CHANGE COLUMN a a BINARY (0);
{code}

Sporadicity is about 1 in 3, where at least 150+ mariadbd's are started simultaneously (i.e. system load may play a factor).

All testcases require MBR or SBR to be active (but not RBR!) with slave_parallel_threads >1, best set to =10, so the issue reproduces more readily. Also set --slave_skip_errors=ALL.

Here are the options I used:
{noformat}
MASTER_EXTRA="--log_bin=binlog --binlog_format=MIXED --server_id=1" # or =STATEMENT
SLAVE_EXTRA="--slave-parallel-threads=10 --slave_skip_errors=ALL --server_id=2"
{noformat}

The bug was manually confirmed in 10.5, 10.6 and 11.5 using trunk builds a few days old. I have also added in-between versions as they are almost definitely affected also.

As the error happen on the slave and not the master, it is highly likely that this leads to a discrepant slave, likely without being noticed by the main application writing to the master.

This combined with a very commonplace testcase like the 3rd one, and the issues seen in the wild, makes this bug a top priority to be fixed for parallel slaves, and as such the Critical priority.

NTS: ~/MDEV-34010

Kristian Nielsen added a comment - 2024-05-02 12:14

"slave SQL thread was killed" - this normally means that a "STOP SLAVE" was run I think, could that be the case?

Agree that "Commit failed due to failure of an earlier commit" should not be a root cause of slave error (I'm wondering if the error should ever be printed in the error log at all, as it seems more confusing that helpful to users most of the time? But that's a separate issue). Maybe that error was ignored earlier due to --slave-skip-errors, but then was caught in a later place.

It's clear that parallel replication is not implemented to correctly work with --slave-skip-errors in current code. Errors should only be ignored once they are confirmed as real errors (non-speculative apply). And some errors (like "Commit failed due to failure of an earlier commit") are needed internally in parallel replication and must never be ignored despite --slave-skip-errors.

Kristian Nielsen added a comment - 2024-05-02 12:14 "slave SQL thread was killed" - this normally means that a "STOP SLAVE" was run I think, could that be the case? Agree that "Commit failed due to failure of an earlier commit" should not be a root cause of slave error (I'm wondering if the error should ever be printed in the error log at all, as it seems more confusing that helpful to users most of the time? But that's a separate issue). Maybe that error was ignored earlier due to --slave-skip-errors, but then was caught in a later place. It's clear that parallel replication is not implemented to correctly work with --slave-skip-errors in current code. Errors should only be ignored once they are confirmed as real errors (non-speculative apply). And some errors (like "Commit failed due to failure of an earlier commit") are needed internally in parallel replication and must never be ignored despite --slave-skip-errors.

Roel Van de Paar added a comment - 2024-05-02 23:03 - edited

> This normally means that a "STOP SLAVE" was run I think, could that be the case?

Right, we want to make sure the master is not being terminated. I did further testing to make sure that this was not the case. I inserted a 10 minute sleep at the end of the testcase (#3 used).

While the sleep is running, the slave has already halted:

10.5.25 55cb2c2916a26b699ed8260427002243682c7110 (Debug)
MariaDB [(none)]> show slave status\G
************************* 1. row *************************
Slave_IO_State: Waiting for master to send event
Master_Host: 127.0.0.1
Master_User: repl_user
Master_Port: 44373
Connect_Retry: 60
Master_Log_File: binlog.000001
Read_Master_Log_Pos: 1684
Relay_Log_File: slave_pid-relay-bin.000002
Relay_Log_Pos: 1816
Relay_Master_Log_File: binlog.000001
Slave_IO_Running: Yes
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 1964
Last_Error: Commit failed due to failure of an earlier commit on which this one depends
Skip_Counter: 0
Exec_Master_Log_Pos: 1520
Relay_Log_Space: 2293
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 1964
Last_SQL_Error: Commit failed due to failure of an earlier commit on which this one depends
Replicate_Ignore_Server_Ids:
Master_Server_Id: 1
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: Slave_Pos
Gtid_IO_Pos: 0-1-8
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: optimistic
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State:
Slave_DDL_Groups: 3
Slave_Non_Transactional_Groups: 2
Slave_Transactional_Groups: 3
1 row in set (0.000 sec)

So it looks like the "slave SQL thread was killed" is the direct result of the 1964 afaics.

10.5.25 55cb2c2916a26b699ed8260427002243682c7110 (Debug)
2024-05-03 8:49:57 6 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 4, relay log './slave_pid-relay-bin.000001' position: 4; GTID position ''
2024-05-03 8:50:00 6 [Note] Error reading relay log event: slave SQL thread was killed
2024-05-03 8:50:00 12 [ERROR] Slave SQL: Commit failed due to failure of an earlier commit on which this one depends, Gtid 0-1-8, Internal MariaDB error code: 1964
2024-05-03 8:50:00 12 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2024-05-03 8:50:00 12 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'binlog.000001' position 1520; GTID position '0-1-7'
2024-05-03 8:50:00 6 [Note] Slave SQL thread exiting, replication stopped in log 'binlog.000001' at position 1520; GTID position '0-1-7'
2024-05-03 8:50:00 6 [Note] master was 127.0.0.1:44373

Also noting that issue occurrence is about 1 in 5.

From another angle, and provided the slave SQL thread termination issue is fixed first, it indeed may (not sure either) make sense to make these silent, provided there is some recovery mechanism for these - i.e. how would a client writing to the master know about these failed slave-side transactions? Catch 22 it seems?

A manual START SLAVE auto-resolves the issue (verified twice) - it seems that this should not be possible and may lead to data corruption?

10.5.25 55cb2c2916a26b699ed8260427002243682c7110 (Debug)
MariaDB [(none)]> start slave;
Query OK, 0 rows affected (0.004 sec)

MariaDB [(none)]> show slave status\G
************************* 1. row *************************
Slave_IO_State: Waiting for master to send event
Master_Host: 127.0.0.1
Master_User: repl_user
Master_Port: 44373
Connect_Retry: 60
Master_Log_File: binlog.000001
Read_Master_Log_Pos: 1684
Relay_Log_File: slave_pid-relay-bin.000002
Relay_Log_Pos: 1980
Relay_Master_Log_File: binlog.000001
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 1684
Relay_Log_Space: 2293
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 1
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: Slave_Pos
Gtid_IO_Pos: 0-1-8
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: optimistic
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
Slave_DDL_Groups: 3
Slave_Non_Transactional_Groups: 2
Slave_Transactional_Groups: 4
1 row in set (0.000 sec)

Roel Van de Paar added a comment - 2024-05-02 23:03 - edited > This normally means that a "STOP SLAVE" was run I think, could that be the case? Right, we want to make sure the master is not being terminated. I did further testing to make sure that this was not the case. I inserted a 10 minute sleep at the end of the testcase (#3 used). While the sleep is running, the slave has already halted: 10.5.25 55cb2c2916a26b699ed8260427002243682c7110 (Debug) MariaDB [(none)]> show slave status\G *************************** 1. row *************************** Slave_IO_State: Waiting for master to send event Master_Host: 127.0.0.1 Master_User: repl_user Master_Port: 44373 Connect_Retry: 60 Master_Log_File: binlog.000001 Read_Master_Log_Pos: 1684 Relay_Log_File: slave_pid-relay-bin.000002 Relay_Log_Pos: 1816 Relay_Master_Log_File: binlog.000001 Slave_IO_Running: Yes Slave_SQL_Running: No Replicate_Do_DB: Replicate_Ignore_DB: Replicate_Do_Table: Replicate_Ignore_Table: Replicate_Wild_Do_Table: Replicate_Wild_Ignore_Table: Last_Errno: 1964 Last_Error: Commit failed due to failure of an earlier commit on which this one depends Skip_Counter: 0 Exec_Master_Log_Pos: 1520 Relay_Log_Space: 2293 Until_Condition: None Until_Log_File: Until_Log_Pos: 0 Master_SSL_Allowed: No Master_SSL_CA_File: Master_SSL_CA_Path: Master_SSL_Cert: Master_SSL_Cipher: Master_SSL_Key: Seconds_Behind_Master: NULL Master_SSL_Verify_Server_Cert: No Last_IO_Errno: 0 Last_IO_Error: Last_SQL_Errno: 1964 Last_SQL_Error: Commit failed due to failure of an earlier commit on which this one depends Replicate_Ignore_Server_Ids: Master_Server_Id: 1 Master_SSL_Crl: Master_SSL_Crlpath: Using_Gtid: Slave_Pos Gtid_IO_Pos: 0-1-8 Replicate_Do_Domain_Ids: Replicate_Ignore_Domain_Ids: Parallel_Mode: optimistic SQL_Delay: 0 SQL_Remaining_Delay: NULL Slave_SQL_Running_State: Slave_DDL_Groups: 3 Slave_Non_Transactional_Groups: 2 Slave_Transactional_Groups: 3 1 row in set (0.000 sec) So it looks like the "slave SQL thread was killed" is the direct result of the 1964 afaics. 10.5.25 55cb2c2916a26b699ed8260427002243682c7110 (Debug) 2024-05-03 8:49:57 6 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 4, relay log './slave_pid-relay-bin.000001' position: 4; GTID position '' 2024-05-03 8:50:00 6 [Note] Error reading relay log event: slave SQL thread was killed 2024-05-03 8:50:00 12 [ERROR] Slave SQL: Commit failed due to failure of an earlier commit on which this one depends, Gtid 0-1-8, Internal MariaDB error code: 1964 2024-05-03 8:50:00 12 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964 2024-05-03 8:50:00 12 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'binlog.000001' position 1520; GTID position '0-1-7' 2024-05-03 8:50:00 6 [Note] Slave SQL thread exiting, replication stopped in log 'binlog.000001' at position 1520; GTID position '0-1-7' 2024-05-03 8:50:00 6 [Note] master was 127.0.0.1:44373 Also noting that issue occurrence is about 1 in 5. From another angle, and provided the slave SQL thread termination issue is fixed first, it indeed may (not sure either) make sense to make these silent, provided there is some recovery mechanism for these - i.e. how would a client writing to the master know about these failed slave-side transactions? Catch 22 it seems? A manual START SLAVE auto-resolves the issue (verified twice) - it seems that this should not be possible and may lead to data corruption? 10.5.25 55cb2c2916a26b699ed8260427002243682c7110 (Debug) MariaDB [(none)]> start slave; Query OK, 0 rows affected (0.004 sec) MariaDB [(none)]> show slave status\G *************************** 1. row *************************** Slave_IO_State: Waiting for master to send event Master_Host: 127.0.0.1 Master_User: repl_user Master_Port: 44373 Connect_Retry: 60 Master_Log_File: binlog.000001 Read_Master_Log_Pos: 1684 Relay_Log_File: slave_pid-relay-bin.000002 Relay_Log_Pos: 1980 Relay_Master_Log_File: binlog.000001 Slave_IO_Running: Yes Slave_SQL_Running: Yes Replicate_Do_DB: Replicate_Ignore_DB: Replicate_Do_Table: Replicate_Ignore_Table: Replicate_Wild_Do_Table: Replicate_Wild_Ignore_Table: Last_Errno: 0 Last_Error: Skip_Counter: 0 Exec_Master_Log_Pos: 1684 Relay_Log_Space: 2293 Until_Condition: None Until_Log_File: Until_Log_Pos: 0 Master_SSL_Allowed: No Master_SSL_CA_File: Master_SSL_CA_Path: Master_SSL_Cert: Master_SSL_Cipher: Master_SSL_Key: Seconds_Behind_Master: 0 Master_SSL_Verify_Server_Cert: No Last_IO_Errno: 0 Last_IO_Error: Last_SQL_Errno: 0 Last_SQL_Error: Replicate_Ignore_Server_Ids: Master_Server_Id: 1 Master_SSL_Crl: Master_SSL_Crlpath: Using_Gtid: Slave_Pos Gtid_IO_Pos: 0-1-8 Replicate_Do_Domain_Ids: Replicate_Ignore_Domain_Ids: Parallel_Mode: optimistic SQL_Delay: 0 SQL_Remaining_Delay: NULL Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates Slave_DDL_Groups: 3 Slave_Non_Transactional_Groups: 2 Slave_Transactional_Groups: 4 1 row in set (0.000 sec)

Julien Fritsch made changes - 2024-05-03 14:30

Fix Version/s

11.3 [ 28565 ]

Andrei Elkin made changes - 2024-06-19 12:13

Priority

Critical [ 2 ]

Major [ 3 ]

Andrei Elkin added a comment - 2024-06-19 12:17

The priority is lowered as --slave-skip-errors=ALL is impractical in most configurations.

Andrei Elkin added a comment - 2024-06-19 12:17 The priority is lowered as --slave-skip-errors=ALL is impractical in most configurations .

Roel Van de Paar made changes - 2024-06-21 04:21

Link

This issue relates to MDEV-34346 [ MDEV-34346 ]

Julien Fritsch made changes - 2024-09-10 15:28

Fix Version/s

11.1 [ 28549 ]

Julien Fritsch made changes - 2024-11-22 17:09

Fix Version/s

11.2(EOL) [ 28603 ]

MariaDB Server

[ERROR] Slave SQL: Commit failed due to failure of an earlier commit on which this one depends, Gtid ..., Internal MariaDB error code: 1964

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates

Git Integration