[MDEV-29898] rpl.rpl_parallel_optimistic_xa fail in BB with Slave SQL: XAER_DUPID: The XID already exists Created: 2022-10-27  Updated: 2023-11-28

Status: Stalled
Project: MariaDB Server
Component/s: Replication, Tests
Affects Version/s: 10.5, 10.6, 10.7, 10.8, 10.9, 10.10, 10.11
Fix Version/s: 10.5, 10.6, 10.11

Type: Bug Priority: Major
Reporter: Angelique Sklavounos (Inactive) Assignee: Andrei Elkin
Resolution: Unresolved Votes: 0
Labels: None

Attachments: File mysqld.1.err     File mysqld.2.err    

 Description   

https://buildbot.mariadb.org/#/builders/538/builds/136

10.11 baf276e6d

rpl.rpl_parallel_optimistic_xa 'innodb,row' w40 [ fail ]
        Test ended at 2022-10-25 06:42:52
 
CURRENT_TEST: rpl.rpl_parallel_optimistic_xa
mysqltest: In included file "./include/sync_with_master_gtid.inc": 
included from /home/buildbot/aarch64-ubuntu-2004-debug/build/mysql-test/suite/rpl/t/rpl_parallel_optimistic_xa.test at line 194:
At line 48: Failed to sync with master
 
The result from queries just before the failure was:
< snip >
SET @old_parallel_mode               = @@GLOBAL.slave_parallel_mode;
SET @@global.slave_parallel_mode     ='optimistic';
SET @old_gtid_cleanup_batch_size     = @@GLOBAL.gtid_cleanup_batch_size;
SET @@global.gtid_cleanup_batch_size = 1000000;
CHANGE MASTER TO master_use_gtid=slave_pos;
connection master;
CREATE TABLE t0 (a int,             b INT) ENGINE=InnoDB;
CREATE TABLE t1 (a int PRIMARY KEY, b INT) ENGINE=InnoDB;
INSERT INTO t1 VALUES (1, 0);
include/save_master_gtid.inc
connection slave;
include/start_slave.inc
include/sync_with_master_gtid.inc
include/stop_slave.inc
connection master;
include/save_master_gtid.inc
connection slave;
include/start_slave.inc
include/sync_with_master_gtid.inc
Timeout in master_gtid_wait('0-1-491', 120), current slave GTID position is: 0-1-317.

Error logs attached. The replica error log shows:

2022-10-25  6:40:51 21 [Note] Slave SQL thread initialized, starting replication in log 'master-bin.000001' at position 44456, relay log './slave-relay-bin.000001' position: 4; GTID position '0-1-161'
2022-10-25  6:40:51 27 [ERROR] Slave SQL: XAER_DUPID: The XID already exists, Gtid 0-1-318, Internal MariaDB error code: 1440
2022-10-25  6:40:51 27 [Warning] Slave: XAER_DUPID: The XID already exists Error_code: 1440
2022-10-25  6:40:51 27 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'master-bin.000001' position 107867; GTID position '0-1-317'
2022-10-25  6:40:51 21 [Note] Error reading relay log event: slave SQL thread was killed
2022-10-25  6:41:41 24 [ERROR] Slave (additional info): Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2022-10-25  6:41:41 24 [Warning] Slave: Duplicate entry '29' for key 'PRIMARY' Error_code: 1062
2022-10-25  6:41:41 24 [Warning] Slave: Deadlock found when trying to get lock; try restarting transaction Error_code: 1213
2022-10-25  6:41:41 24 [Warning] Slave: XAER_RMFAIL: The command cannot be executed when global transaction is in the  ACTIVE state Error_code: 1399
2022-10-25  6:41:41 24 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2022-10-25  6:41:41 24 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'master-bin.000001' position 107867; GTID position '0-1-317'
2022-10-25  6:41:41 23 [ERROR] Slave (additional info): Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2022-10-25  6:41:41 23 [Warning] Slave: Connection was killed Error_code: 1927
2022-10-25  6:41:41 23 [Warning] Slave: Connection was killed Error_code: 1927
2022-10-25  6:41:41 23 [Warning] Slave: Deadlock found when trying to get lock; try restarting transaction Error_code: 1213
2022-10-25  6:41:41 23 [Warning] Slave: XAER_RMFAIL: The command cannot be executed when global transaction is in the  ACTIVE state Error_code: 1399
2022-10-25  6:41:41 23 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2022-10-25  6:41:41 23 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'master-bin.000001' position 107867; GTID position '0-1-317'
2022-10-25  6:41:41 25 [ERROR] Slave (additional info): Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2022-10-25  6:41:41 25 [Warning] Slave: Connection was killed Error_code: 1927
2022-10-25  6:41:41 25 [Warning] Slave: Deadlock found when trying to get lock; try restarting transaction Error_code: 1213
2022-10-25  6:41:41 25 [Warning] Slave: XAER_RMFAIL: The command cannot be executed when global transaction is in the  IDLE state Error_code: 1399
2022-10-25  6:41:41 25 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2022-10-25  6:41:41 25 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'master-bin.000001' position 107867; GTID position '0-1-317'
2022-10-25  6:41:41 26 [ERROR] Slave (additional info): Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2022-10-25  6:41:41 26 [Warning] Slave: Connection was killed Error_code: 1927
2022-10-25  6:41:41 26 [Warning] Slave: Connection was killed Error_code: 1927
2022-10-25  6:41:41 26 [Warning] Slave: Deadlock found when trying to get lock; try restarting transaction Error_code: 1213
2022-10-25  6:41:41 26 [Warning] Slave: XAER_RMFAIL: The command cannot be executed when global transaction is in the  ACTIVE state Error_code: 1399
2022-10-25  6:41:41 26 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2022-10-25  6:41:41 26 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'master-bin.000001' position 107867; GTID position '0-1-317'
2022-10-25  6:41:41 28 [ERROR] Slave (additional info): Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2022-10-25  6:41:41 28 [Warning] Slave: Connection was killed Error_code: 1927
2022-10-25  6:41:41 28 [Warning] Slave: Deadlock found when trying to get lock; try restarting transaction Error_code: 1213
2022-10-25  6:41:41 28 [Warning] Slave: XAER_RMFAIL: The command cannot be executed when global transaction is in the  IDLE state Error_code: 1399
2022-10-25  6:41:41 28 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2022-10-25  6:41:41 28 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'master-bin.000001' position 107867; GTID position '0-1-317'
2022-10-25  6:41:41 21 [Note] Slave SQL thread exiting, replication stopped in log 'master-bin.000001' at position 107867; GTID position '0-1-317', master: 127.0.0.1:17080



 Comments   
Comment by Andrei Elkin [ 2022-10-28 ]

The commit is in bb-10.5-andrei branch. It actually implements a bit different approach from what I explained at the zoom session. I took a route to refine the GTID ctor instead 'cos playing with the cache types on a higher level of binlog_commit/rollback which led to side effects that found out by testing. The chosen approach clearly is (more) robust.

I don't test the slave side being content with the fact that only marked with `trans` GTID:s are allowed for the optimistic mode. Specifically

        if (!(gtid_flags & Gtid_log_event::FL_TRANSACTIONAL) ||
            ( (!(gtid_flags & Gtid_log_event::FL_ALLOW_PARALLEL) ||
               (gtid_flags & Gtid_log_event::FL_WAITED)) &&
              (mode < SLAVE_PARALLEL_AGGRESSIVE)))
        {
          /*
            This transaction should not be speculatively run in parallel with
            what came before, either because it cannot safely be rolled back in
            case of a conflict, or because it was marked as likely to conflict
            and require expensive rollback and retry.
 
            Here we mark it as such, and then the worker thread will do a
            wait_for_prior_commit() before starting it. We do not introduce a
            new group_commit_orderer, since we still want following transactions
            to run in parallel with transactions prior to this one.
          */
          speculation= rpl_group_info::SPECULATE_WAIT;
        }
        else
          speculation= rpl_group_info::SPECULATE_OPTIMISTIC;
 

So binlog.binlog_xa_prepared_disconnect watches over that.

Comment by Andrei Elkin [ 2022-11-15 ]

Taking it back to properly
explain the original error.

Comment by Angelique Sklavounos (Inactive) [ 2022-11-21 ]

Also seen with rpl.rpl_parallel_optimistic_xa_lsu_off https://buildbot.mariadb.org/#/builders/172/builds/9873

Generated at Thu Feb 08 10:12:07 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.