[MDEV-34346] [ERROR] Slave (additional info): Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964 on parallel replica - Jira

Details

Type: Bug
Status: Confirmed (View Workflow)
Priority: Critical
Resolution: Unresolved
Affects Version/s: 10.5, 10.6, 10.11, 11.1(EOL), 11.2(EOL), 11.4, 11.5(EOL), 11.6(EOL)
Fix Version/s: 10.5, 10.6, 10.11, 11.4
Component/s: Locking, Replication, Storage Engine - InnoDB
Labels:

Description

--source include/have_binlog_format_statement.inc

--source include/have_innodb.inc

--source include/master-slave.inc

--connection slave

STOP SLAVE;

SET GLOBAL slave_parallel_threads=10;

START SLAVE;

--connection master

SET sql_mode='';

CREATE TABLE t (c INT) ENGINE=InnoDB;

INSERT INTO t VALUES (REPEAT (0,200000));

INSERT INTO t SELECT * FROM t;

INSERT INTO t VALUES (1);

Leads to:

11.5.0 e4afa610539ae01164485554e2de839bea9de816 (Optimized, MTR Output)
INSERT INTO t VALUES (1);
***Warnings generated in error logs during shutdown after running tests: main.test

2024-06-10 13:57:45 19 [Warning] Slave: Duplicate entry '0-3' for key 'PRIMARY' Error_code: 1062
2024-06-10 13:57:45 18 [Warning] Slave: Connection was killed Error_code: 1927
2024-06-10 13:57:45 18 [Warning] Slave: Connection was killed Error_code: 1927
2024-06-10 13:57:45 18 [Warning] Slave: Deadlock found when trying to get lock; try restarting transaction Error_code: 1213
2024-06-10 13:57:45 18 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964

main.test [ fail ]

Slave log (var/log/mysqld.2.err):

11.5.0 e4afa610539ae01164485554e2de839bea9de816 (Optimized, Slave)
2024-06-10 13:53:06 6 [Note] Slave I/O thread: connected to master 'root@127.0.0.1:16000',replication starts at GTID position ''
2024-06-10 13:53:06 7 [Note] Error reading relay log event: slave SQL thread was killed
2024-06-10 13:53:06 7 [Note] Slave SQL thread exiting, replication stopped in log 'master-bin.000001' at position 329; GTID position '', master: 127.0.0.1:16000
2024-06-10 13:53:06 6 [Note] Slave I/O thread exiting, read up to log 'master-bin.000001', position 329; GTID position , master 127.0.0.1:16000
2024-06-10 13:53:06 10 [Note] Slave I/O thread: Start asynchronous replication to master 'root@127.0.0.1:16000' in log 'master-bin.000001' at position 329
2024-06-10 13:53:06 11 [Note] Slave SQL thread initialized, starting replication in log 'master-bin.000001' at position 329, relay log './slave-relay-bin.000001' position: 4; GTID position ''
2024-06-10 13:53:06 10 [Note] Slave I/O thread: connected to master 'root@127.0.0.1:16000',replication starts at GTID position ''
2024-06-10 13:53:06 19 [ERROR] Slave SQL: Error during XID COMMIT: failed to update GTID state in mysql.gtid_slave_pos: 1062: Duplicate entry '0-3' for key 'PRIMARY', Error_code: 1062; the event's master log master-bin.000001, end_log_pos 832, Gtid 0-1-3, Internal MariaDB error code: 1942
2024-06-10 13:53:06 19 [ERROR] Slave (additional info): Duplicate entry '0-3' for key 'PRIMARY' Error_code: 1062
2024-06-10 13:53:06 19 [Warning] Slave: Duplicate entry '0-3' for key 'PRIMARY' Error_code: 1062
2024-06-10 13:53:06 19 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'master-bin.000001' position 663; GTID position '0-1-2'
2024-06-10 13:53:06 11 [Note] Error reading relay log event: slave SQL thread was killed
2024-06-10 13:53:06 17 [ERROR] Slave (additional info): Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2024-06-10 13:53:06 17 [Warning] Slave: Connection was killed Error_code: 1927
2024-06-10 13:53:06 17 [Warning] Slave: Connection was killed Error_code: 1927
2024-06-10 13:53:06 17 [Warning] Slave: Deadlock found when trying to get lock; try restarting transaction Error_code: 1213
2024-06-10 13:53:06 17 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2024-06-10 13:53:06 17 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'master-bin.000001' position 663; GTID position '0-1-2'
2024-06-10 13:53:06 11 [Note] Slave SQL thread exiting, replication stopped in log 'master-bin.000001' at position 663; GTID position '0-1-2', master: 127.0.0.1:16000
2024-06-10 13:53:07 0 [Note] /test/MD170524-mariadb-11.5.0-linux-x86_64-opt/bin/mariadbd (initiated by: root[root] @ localhost [127.0.0.1]): Normal shutdown
2024-06-10 13:53:07 10 [Note] Slave I/O thread exiting, read up to log 'master-bin.000001', position 996; GTID position 0-1-4, master 127.0.0.1:16000
2024-06-10 13:53:07 0 [Note] InnoDB: FTS optimize thread exiting.
2024-06-10 13:53:07 0 [Note] InnoDB: Starting shutdown...
2024-06-10 13:53:07 0 [Note] InnoDB: Dumping buffer pool(s) to /test/MD170524-mariadb-11.5.0-linux-x86_64-opt/mariadb-test/var/mysqld.2/data/ib_buffer_pool
2024-06-10 13:53:07 0 [Note] InnoDB: Restricted to 124 pages due to innodb_buf_pool_dump_pct=25
2024-06-10 13:53:07 0 [Note] InnoDB: Buffer pool(s) dump completed at 240610 13:53:07
2024-06-10 13:53:07 0 [Note] InnoDB: Removed temporary tablespace data file: "./ibtmp1"
2024-06-10 13:53:07 0 [Note] InnoDB: Shutdown completed; log sequence number 52835; transaction id 28
2024-06-10 13:53:07 0 [Note] /test/MD170524-mariadb-11.5.0-linux-x86_64-opt/bin/mariadbd: Shutdown complete

If the issue does not immediately reproduce in full (it may pass, or just show the 1062 only, but not 1964), try --repeat 10; it is lightly sporadic.

Confirmed present in 10.5-11.6.

Attachments

Issue Links

relates to

MDEV-10242 Non-transactional mysql.gtid_slave_pos breaks --slave-parallel-mode=[optimistic|aggressive] replication

Confirmed

MDEV-34010 [ERROR] Slave SQL: Commit failed due to failure of an earlier commit on which this one depends, Gtid ..., Internal MariaDB error code: 1964

Open

Activity

Ascending order - Click to sort in descending order

Ramesh Sivaraman added a comment - 2024-06-10 04:18

Reproduced the issue locally

11.5.1 b2fc885469a7db2c25eab67fbc8e220dc2f805f8 (Optimized)
==============================================================================

TEST RESULT TIME (ms) or COMMENT
--------------------------------------------------------------------------

worker[01] Using MTR_BUILD_THREAD 300, with reserved ports 16000..16019
include/master-slave.inc
[connection master]
connection slave;
STOP SLAVE;
SET GLOBAL slave_parallel_threads=10;
START SLAVE;
connection master;
SET sql_mode='';
CREATE TABLE t (c INT) ENGINE=InnoDB;
INSERT INTO t VALUES (REPEAT (0,200000));
INSERT INTO t SELECT * FROM t;
INSERT INTO t VALUES (1);
***Warnings generated in error logs during shutdown after running tests: rpl.mytest

2024-06-10 7:14:06 19 [Warning] Slave: Duplicate entry '0-3' for key 'PRIMARY' Error_code: 1062
2024-06-10 7:14:06 18 [Warning] Slave: Deadlock found when trying to get lock; try restarting transaction Error_code: 1213
2024-06-10 7:14:06 18 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964

rpl.mytest [ fail ]
Test ended at 2024-06-10 07:14:06

Ramesh Sivaraman added a comment - 2024-06-10 04:18 Reproduced the issue locally 11.5.1 b2fc885469a7db2c25eab67fbc8e220dc2f805f8 (Optimized) ============================================================================== TEST RESULT TIME (ms) or COMMENT -------------------------------------------------------------------------- worker[01] Using MTR_BUILD_THREAD 300, with reserved ports 16000..16019 include/master-slave.inc [connection master] connection slave; STOP SLAVE; SET GLOBAL slave_parallel_threads=10; START SLAVE; connection master; SET sql_mode=''; CREATE TABLE t (c INT) ENGINE=InnoDB; INSERT INTO t VALUES (REPEAT (0,200000)); INSERT INTO t SELECT * FROM t; INSERT INTO t VALUES (1); ***Warnings generated in error logs during shutdown after running tests: rpl.mytest 2024-06-10 7:14:06 19 [Warning] Slave: Duplicate entry '0-3' for key 'PRIMARY' Error_code: 1062 2024-06-10 7:14:06 18 [Warning] Slave: Deadlock found when trying to get lock; try restarting transaction Error_code: 1213 2024-06-10 7:14:06 18 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964 rpl.mytest [ fail ] Test ended at 2024-06-10 07:14:06

Andrei Elkin added a comment - 2024-06-10 05:58

ramesh, howdy! What is mysql.gtid_slave_pos' engine? Does not look like Innodb.

Andrei Elkin added a comment - 2024-06-10 05:58 ramesh , howdy! What is mysql.gtid_slave_pos' engine? Does not look like Innodb.

Roel Van de Paar added a comment - 2024-06-10 06:34

Alternative testcase

--source include/have_binlog_format_statement.inc

--source include/have_innodb.inc

--source include/master-slave.inc

--connection slave

STOP SLAVE;

SET GLOBAL slave_parallel_threads=10;

START SLAVE;

--connection master

CREATE TABLE t ENGINE=InnoDB AS SELECT 1 c1;

INSERT INTO t VALUES (1);

DELETE FROM t WHERE c1=NULL;

INSERT INTO t VALUES(1) ON DUPLICATE KEY UPDATE c1=c1;

INSERT INTO t VALUES(1);

Roel Van de Paar added a comment - 2024-06-10 06:34 Alternative testcase --source include/have_binlog_format_statement.inc --source include/have_innodb.inc --source include/master-slave.inc --connection slave STOP SLAVE; SET GLOBAL slave_parallel_threads=10; START SLAVE; --connection master CREATE TABLE t ENGINE=InnoDB AS SELECT 1 c1; INSERT INTO t VALUES (1); DELETE FROM t WHERE c1= NULL ; INSERT INTO t VALUES (1) ON DUPLICATE KEY UPDATE c1=c1; INSERT INTO t VALUES (1);

Roel Van de Paar added a comment - 2024-06-10 06:38 - edited

Elkin Hi! ramesh only verified the bug in MTR. I can assist.

The mysql.gtid_slave_pos table is InnoDB and unchanged from default in the CLI.

The default for the same table in MTR is Aria apparently, where the issue equally reproduces.

Roel Van de Paar added a comment - 2024-06-10 06:38 - edited Elkin Hi! ramesh only verified the bug in MTR. I can assist. The mysql.gtid_slave_pos table is InnoDB and unchanged from default in the CLI. The default for the same table in MTR is Aria apparently, where the issue equally reproduces.

Kristian Nielsen added a comment - 2024-06-10 06:56

The behaviour described is a duplicate of MDEV-10242.
Please, make sure that mysql.gtid_slave_pos is using a transactional engine whenever testing optimistic or aggressive parallel replication.

If a bug was observed with mysql.gtid_slave_pos being InnoDB, please describe the expected and actual behaviour independently from this one, which is caused by the inability of Aria to roll back the update to mysql.gtid_slave_pos.

Kristian Nielsen added a comment - 2024-06-10 06:56 The behaviour described is a duplicate of MDEV-10242 . Please, make sure that mysql.gtid_slave_pos is using a transactional engine whenever testing optimistic or aggressive parallel replication. If a bug was observed with mysql.gtid_slave_pos being InnoDB, please describe the expected and actual behaviour independently from this one, which is caused by the inability of Aria to roll back the update to mysql.gtid_slave_pos.

Andrei Elkin added a comment - 2024-06-10 07:25

Howdy Roel, I suggest to add at least SHOW CREATE TABLE for gsp table, for
being explicit in this part.

On Mon, 10 Jun 2024, 09:48 Roel Van de Paar (Jira), <jira@mariadb.org>

Andrei Elkin added a comment - 2024-06-10 07:25 Howdy Roel, I suggest to add at least SHOW CREATE TABLE for gsp table, for being explicit in this part. On Mon, 10 Jun 2024, 09:48 Roel Van de Paar (Jira), <jira@mariadb.org>

Roel Van de Paar added a comment - 2024-06-11 03:21 - edited

Hmm yes, it seems various issues currently existing in replication are intertwined:
1. 'Optimistic' being the default parallel replication mode, which fails when mysql.gtid_slave_pos is Aria/MyISAM.
2. For MTR, where the gtid_slave_pos table is Aria, it can be MDEV-10242.

Still, there seems to be more than meets the eye here. For the CLI, the issue reproduces when LIMIT (unsafe SBR stmt) and slave_skip_errors is used:

./mtr --mysqld=--slave_skip_errors=ALL test; grep '1964' var/log/mysqld.2.err

--source include/have_binlog_format_statement.inc

--source include/have_innodb.inc

--source include/master-slave.inc

--connection slave

STOP SLAVE;

ALTER TABLE mysql.gtid_slave_pos ENGINE=InnoDB;   ##### < Change to InnoDB

SET GLOBAL slave_parallel_threads=10;

START SLAVE;

--connection master

SET sql_mode='';

CREATE TABLE t (c INT) ENGINE=InnoDB;

INSERT INTO t VALUES (REPEAT (0,200000));

INSERT INTO t SELECT c FROM t LIMIT 1,1;  ##### < LIMIT

INSERT INTO t VALUES (1);

(Run until you see '1964' output from the grep, which is generally most of the time)
1. Here we have an InnoDB gtid_slave_pos, but the same error occurs
2. Why does a single unsafe SBR statement produce the same outcome?
3. When the slave_skip_errors=ALL is removed, the issue no longer reproduces - why?
4. When the slave_skip_errors=ALL is removed, there is no error at all - how can this be?
etc.

Roel Van de Paar added a comment - 2024-06-11 03:21 - edited Hmm yes, it seems various issues currently existing in replication are intertwined: 1. 'Optimistic' being the default parallel replication mode, which fails when mysql.gtid_slave_pos is Aria/MyISAM. 2. For MTR, where the gtid_slave_pos table is Aria, it can be MDEV-10242 . Still, there seems to be more than meets the eye here. For the CLI, the issue reproduces when LIMIT (unsafe SBR stmt) and slave_skip_errors is used: . /mtr --mysqld=--slave_skip_errors=ALL test ; grep '1964' var /log/mysqld .2.err --source include/have_binlog_format_statement.inc --source include/have_innodb.inc --source include/master-slave.inc --connection slave STOP SLAVE; ALTER TABLE mysql.gtid_slave_pos ENGINE=InnoDB; ##### < Change to InnoDB SET GLOBAL slave_parallel_threads=10; START SLAVE; --connection master SET sql_mode= '' ; CREATE TABLE t (c INT ) ENGINE=InnoDB; INSERT INTO t VALUES (REPEAT (0,200000)); INSERT INTO t SELECT c FROM t LIMIT 1,1; ##### < LIMIT INSERT INTO t VALUES (1); (Run until you see '1964' output from the grep, which is generally most of the time) 1. Here we have an InnoDB gtid_slave_pos, but the same error occurs 2. Why does a single unsafe SBR statement produce the same outcome? 3. When the slave_skip_errors=ALL is removed, the issue no longer reproduces - why? 4. When the slave_skip_errors=ALL is removed, there is no error at all - how can this be? etc.

Kristian Nielsen added a comment - 2024-06-11 07:09

Roel, I would recommend you spend most of your test efforts without highly unsafe configurations such as --slave-skip-errors and unsafe statements in --binlog-format=statement. This will make your great efforts in testing a lot more useful. The --slave-skip-errors=ALL in particular is a crazy configuration. Note, "most", but not "all" - it's highly valuable to get these corner cases tested to know about issues in unusual and even nonsensical configurations. But the main challenge in replication for the moment is the many regressions like MDEV-32020 etc. in normal and useful configurations, and fixing these should have priority.

Note that error 1964 isn't a meaningful identification of "the same error occurs". Error 1964 is always a follow-up error from the real issue. It is thrown by a running transaction T2 in parallel replication if the prior transaction T1 got an error and cannot commit. I'm wondering if we shouldn't simply remove the reporting of 1964 in the error log, it seems to cause more confusion than it helps.

The issue with --slave-skip-errors=ALL was already discussed in MDEV-33954. Optimistic parallel replication can cause a lot of intermediate errors internally, they are handled transparently by retrying affected transactions. Using --slave-skip-errors with this is currently buggy though (MDEV-33954) and interferes with the error handling, causing these issues to pop up.

Hope this helps,

- Kristian.

Kristian Nielsen added a comment - 2024-06-11 07:09 Roel, I would recommend you spend most of your test efforts without highly unsafe configurations such as --slave-skip-errors and unsafe statements in --binlog-format=statement. This will make your great efforts in testing a lot more useful. The --slave-skip-errors=ALL in particular is a crazy configuration. Note, "most", but not "all" - it's highly valuable to get these corner cases tested to know about issues in unusual and even nonsensical configurations. But the main challenge in replication for the moment is the many regressions like MDEV-32020 etc. in normal and useful configurations, and fixing these should have priority. Note that error 1964 isn't a meaningful identification of "the same error occurs". Error 1964 is always a follow-up error from the real issue. It is thrown by a running transaction T2 in parallel replication if the prior transaction T1 got an error and cannot commit. I'm wondering if we shouldn't simply remove the reporting of 1964 in the error log, it seems to cause more confusion than it helps. The issue with --slave-skip-errors=ALL was already discussed in MDEV-33954 . Optimistic parallel replication can cause a lot of intermediate errors internally, they are handled transparently by retrying affected transactions. Using --slave-skip-errors with this is currently buggy though ( MDEV-33954 ) and interferes with the error handling, causing these issues to pop up. Hope this helps, - Kristian.

People

Assignee:: Andrei Elkin

Reporter:: Roel Van de Paar

Votes:: 0 Vote for this issue

Watchers:: 5 Start watching this issue

Dates

Created:: 2024-06-10 03:56

Updated:: 2024-11-22 17:06

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server