[ERROR] Slave SQL: Commit failed due to failure of an earlier commit on which this one depends, Gtid 0-1-48, Internal MariaDB error code: 1964
I have reduced various testcases for this sporadic bug. Note: tables are InnoDB by default.
SET sql_mode='';
CREATE TABLE t (c YEAR KEY);
INSERT INTO t VALUES (ST_GEOMFROMTEXT('POINT(1 1)'));
DELETE FROM t WHERE c IN (SELECT 1) LIMIT 2;
INSERT INTO t VALUES (1);
SET sql_mode='';
CREATE TABLE t (f INT) ENGINE=mrg_innodb UNION (t);
CREATE TEMPORARY TABLE t (i INT) ENGINE=InnoDB;
CREATE TABLE t2 (c1 CHAR(1)) ENGINE=InnoDB;
DROP TABLE IF EXISTS t,mysqlt,mysqlt2;
INSERT INTO t VALUES (1);
DELETE FROM t WHERE f=29;
INSERT INTO t2 VALUES (1);
SET sql_mode='';
CREATE TABLE t (c INT);
INSERT INTO t VALUES (0);
UPDATE t SET c=REPEAT(0,0) WHERE c=REPEAT(0,0);
INSERT INTO t VALUES (0);
CREATE TABLE t0 (a BLOB);
CREATE TABLE t3 (a TINYINT,b INT,c CHAR(0),d VARCHAR(0),e VARCHAR(0),f VARBINARY(0),g MEDIUMBLOB,h BLOB,id BIGINT,KEY(b),KEY(e));
SET SESSION autocommit=0;
DELETE FROM t0;
CREATE TABLE t3 (a INT,b INT UNSIGNED,c BINARY(0),d VARCHAR(0),e VARBINARY(0),f VARCHAR(0),g MEDIUMBLOB,h MEDIUMBLOB,id BIGINT,KEY(b),KEY(e));
INSERT INTO mysql.user SELECT * FROM t0;
INSERT INTO t0 VALUES (0xCFD0);
SELECT REPEAT (
ALTER TABLE t CHANGE COLUMN c c BINARY(0);
INSERT INTO t0 VALUES (1);
ALTER TABLE t CHANGE COLUMN a a BINARY(0);
Sporadicity is about 1 in 3 when at least 150 mariadbd instances are started simultaneously (i.e. system load may be a factor).
All testcases require mixed-based (MBR) or statement-based (SBR) replication to be active (but not RBR!) with slave_parallel_threads > 1, best set to 10 so the issue reproduces more readily. Also set --slave_skip_errors=ALL.
Here are the options I used:
MASTER_EXTRA="--log_bin=binlog --binlog_format=MIXED --server_id=1" # or =STATEMENT
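The matching slave-side options were not listed; a sketch of what they would look like, based on the description above (the SLAVE_EXTRA name and --server_id=2 are assumptions, not taken from the original run):
SLAVE_EXTRA="--slave_parallel_threads=10 --slave_parallel_mode=optimistic --slave_skip_errors=ALL --server_id=2" # sketch; optimistic is believed to be the default parallel mode in these versions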
The bug was manually confirmed in 10.5, 10.6 and 11.5 using trunk builds that were a few days old. I have also added the in-between versions, as they are almost certainly affected as well.
As the error happens on the slave and not the master, it is highly likely that this leads to a discrepant slave, likely without the main application writing to the master noticing.
This, combined with a very commonplace testcase like the third one and the issues seen in the wild, makes this bug a top priority to fix for parallel slaves, hence the Critical priority.
Related issues:
MDEV-9386 Commit failed due to failure of an earlier commit on which this one depends (Open)
MDEV-32372 Assertion `thd->transaction->stmt.is_empty() || thd->in_sub_stmt || (thd->state_flags & Open_tables_state::BACKUPS_AVAIL)' failed in close_thread_tables upon ALTER (Open)
MDEV-33761 WSREP: Parallel slave worker failed at wsrep_before_command() hook (Open)
MDEV-34346 [ERROR] Slave (additional info): Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964 on parallel replica (Confirmed)
MDEV-33954 --slave-skip-errors=all Incompatible with Optimistic Parallel Replication
Roel Van de Paar added a comment - edited:
I am able to reliably reproduce the issue about 5-15 times for every 30-40 threads in reducer and using pquery (as uploaded in MDEV-34010.tar.xz).
Roel Van de Paar added a comment - edited:
MTR seems to fail to reproduce the issue:
--source include/have_innodb.inc
--source include/have_binlog_format_statement.inc
--source include/master-slave.inc
SET sql_mode='';
CREATE TABLE t (c INT) ENGINE=InnoDB;
INSERT INTO t VALUES (0);
UPDATE t SET c=REPEAT(0,0) WHERE c=REPEAT(0,0);
--sync_slave_with_master
INSERT INTO t VALUES (0);
--source include/rpl_end.inc
Executed as:
#Loop MTR on main/test.test till it fails
LOG= "$(mktemp)" ; echo "Logfile: ${LOG}" ; LOOP=0; while true ; do LOOP=$[ ${LOOP} + 1 ]; echo "Loop: ${LOOP}" ; . /mtr --mysqld=--slave_skip_errors=ALL --mysqld=--slave-parallel-threads=10 test 2>&1 >>${LOG}; if grep -q '1964' ${LOG}; then break ; fi ; done
Does not reproduce the issue.
Kristian Nielsen added a comment:
Roel: --slave-skip-errors doesn't work with optimistic parallel replication currently due to MDEV-33954. You will have to avoid using that option in tests until MDEV-33954 is fixed.
Do any of these reproduce without --slave-skip-errors?
Also, it was not clear to me from the description exactly what the observed behaviour is and how it differs from the expected behaviour. You get a line in the error log about the 1964? Is that the only wrong behaviour observed (a stray message in the error log)? Or are there other errors before it (the 1964 error is only expected as a follow-up error from a "real" error in an earlier transaction)? Does replication continue, or does the SQL thread stop from the error?
I think 1964 cannot be a skippable error; that's a good point. I'll add a note to MDEV-33954 about this.
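(As a sketch, not part of the comment above: a minimal slave-side setup for re-running the testcases with optimistic parallel apply but without --slave-skip-errors could look roughly as follows; the thread count is taken from the report, the rest is assumed.)
STOP SLAVE;
SET GLOBAL slave_parallel_threads = 10;         -- > 1 as required by the testcases
SET GLOBAL slave_parallel_mode = 'optimistic';  -- matches Parallel_Mode shown in SHOW SLAVE STATUS later in this thread
START SLAVE;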
Roel Van de Paar added a comment:
Thank you knielsen for the feedback.
The --slave-skip-errors is indeed required.
The slave does halt, and 1964 is the main error/reason given for the SQL thread being killed afaics:
10.5.25 55cb2c2916a26b699ed8260427002243682c7110 (Debug)
2024-05-02 16:16:47 6 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 4, relay log './slave_pid-relay-bin.000001' position: 4; GTID position ''
2024-05-02 16:16:50 6 [Note] Error reading relay log event: slave SQL thread was killed
2024-05-02 16:16:50 16 [ERROR] Slave SQL: Commit failed due to failure of an earlier commit on which this one depends, Gtid 0-1-8, Internal MariaDB error code: 1964
2024-05-02 16:16:50 16 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2024-05-02 16:16:50 16 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'binlog.000001' position 1549; GTID position '0-1-7'
2024-05-02 16:16:50 6 [Note] Slave SQL thread exiting, replication stopped in log 'binlog.000001' at position 1549; GTID position '0-1-7'
"slave SQL thread was killed" - this normally means that a "STOP SLAVE" was run I think, could that be the case?
Agree that "Commit failed due to failure of an earlier commit" should not be a root cause of slave error (I'm wondering if the error should ever be printed in the error log at all, as it seems more confusing that helpful to users most of the time? But that's a separate issue). Maybe that error was ignored earlier due to --slave-skip-errors, but then was caught in a later place.
It's clear that parallel replication is not implemented to correctly work with --slave-skip-errors in current code. Errors should only be ignored once they are confirmed as real errors (non-speculative apply). And some errors (like "Commit failed due to failure of an earlier commit") are needed internally in parallel replication and must never be ignored despite --slave-skip-errors.
Roel Van de Paar added a comment - edited:
> This normally means that a "STOP SLAVE" was run I think, could that be the case?
Right, we want to make sure the master is not being terminated. I did further testing to confirm this was not the case: I inserted a 10-minute sleep at the end of the testcase (#3 was used).
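(The exact statement used for the sleep is not shown; a plausible form, appended to the end of testcase #3, would be:)
SELECT SLEEP(600); -- assumed form of the 10-minute sleep that keeps the master-side session alive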
While the sleep is running, the slave has already halted:
10.5.25 55cb2c2916a26b699ed8260427002243682c7110 (Debug)
MariaDB [(none)]> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 127.0.0.1
Master_User: repl_user
Master_Port: 44373
Connect_Retry: 60
Master_Log_File: binlog.000001
Read_Master_Log_Pos: 1684
Relay_Log_File: slave_pid-relay-bin.000002
Relay_Log_Pos: 1816
Relay_Master_Log_File: binlog.000001
Slave_IO_Running: Yes
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 1964
Last_Error: Commit failed due to failure of an earlier commit on which this one depends
Skip_Counter: 0
Exec_Master_Log_Pos: 1520
Relay_Log_Space: 2293
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 1964
Last_SQL_Error: Commit failed due to failure of an earlier commit on which this one depends
Replicate_Ignore_Server_Ids:
Master_Server_Id: 1
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: Slave_Pos
Gtid_IO_Pos: 0-1-8
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: optimistic
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State:
Slave_DDL_Groups: 3
Slave_Non_Transactional_Groups: 2
Slave_Transactional_Groups: 3
1 row in set (0.000 sec)
So it looks like the "slave SQL thread was killed" is the direct result of the 1964 afaics.
10.5.25 55cb2c2916a26b699ed8260427002243682c7110 (Debug)
2024-05-03 8:49:57 6 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 4, relay log './slave_pid-relay-bin.000001' position: 4; GTID position ''
2024-05-03 8:50:00 6 [Note] Error reading relay log event: slave SQL thread was killed
2024-05-03 8:50:00 12 [ERROR] Slave SQL: Commit failed due to failure of an earlier commit on which this one depends, Gtid 0-1-8, Internal MariaDB error code: 1964
2024-05-03 8:50:00 12 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2024-05-03 8:50:00 12 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'binlog.000001' position 1520; GTID position '0-1-7'
2024-05-03 8:50:00 6 [Note] Slave SQL thread exiting, replication stopped in log 'binlog.000001' at position 1520; GTID position '0-1-7'
2024-05-03 8:50:00 6 [Note] master was 127.0.0.1:44373
Also noting that the issue occurrence is about 1 in 5.
From another angle, and provided the slave SQL thread termination issue is fixed first, it may indeed (though I am not sure either) make sense to make these errors silent, provided there is some recovery mechanism for them - i.e. how would a client writing to the master know about these failed slave-side transactions? It seems like a Catch-22.
A manual START SLAVE auto-resolves the issue (verified twice) - it seems that this should not be possible and may lead to data corruption?
10.5.25 55cb2c2916a26b699ed8260427002243682c7110 (Debug)
MariaDB [(none)]> start slave;
Query OK, 0 rows affected (0.004 sec)
MariaDB [(none)]> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 127.0.0.1
Master_User: repl_user
Master_Port: 44373
Connect_Retry: 60
Master_Log_File: binlog.000001
Read_Master_Log_Pos: 1684
Relay_Log_File: slave_pid-relay-bin.000002
Relay_Log_Pos: 1980
Relay_Master_Log_File: binlog.000001
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 1684
Relay_Log_Space: 2293
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 1
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: Slave_Pos
Gtid_IO_Pos: 0-1-8
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: optimistic
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
Slave_DDL_Groups: 3
Slave_Non_Transactional_Groups: 2
Slave_Transactional_Groups: 4
1 row in set (0.000 sec)
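(As a sketch, not verified above: one way to check whether the auto-resolved slave actually diverged would be to compare the affected table on both servers, e.g. for testcase #3's table t:)
CHECKSUM TABLE t EXTENDED; -- run on both master and slave and compare the results
SELECT * FROM t ORDER BY c; -- or diff the full contents of the small test table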