[MDEV-26652] xa transactions binlogged in wrong order Created: 2021-09-20  Updated: 2022-12-19

Status: Open
Project: MariaDB Server
Component/s: Replication, XA
Affects Version/s: 10.5
Fix Version/s: 10.5, 10.6

Type: Bug Priority: Major
Reporter: Sergei Golubchik Assignee: Andrei Elkin
Resolution: Unresolved Votes: 1
Labels: None

Issue Links:
PartOf
is part of MDEV-21117 refine the server binlog-based recove... Closed
Problem/Incident
is caused by MDEV-742 LP:803649 - Xa recovery failed on cli... Closed
Relates
relates to MDEV-21469 Implement crash-safe logging of the u... Stalled
relates to MDEV-26682 slave lock timeout with xa and gap locks Closed

 Description   

XA COMMIT is done by running hton->commit() for all participants. In a typical InnoDB-only transaction there are only two participants, binlog and InnoDB, and binlog comes first. But if we trick binlog into being second (by starting the transaction with a non-InnoDB statement), then InnoDB commits first. That releases all its locks, possibly allowing some other transaction to continue, commit, and reach the binlog before this transaction's XA COMMIT gets binlogged. Test case:

source include/have_binlog_format_row.inc;
source include/have_innodb.inc;
let $datadir= `select @@datadir`;
create table t1 (a int primary key, b int) engine=innodb;
insert t1 values (1,1),(3,3),(5,5),(7,7);
create table t2 (m int);
# Start the XA transaction with a non-InnoDB statement so that binlog
# is not the first participant.
xa start '1';
insert t2 values (1);
update t1 set b=50 where b=5;
xa end '1';
xa prepare '1';
# This update blocks on the row lock held by the prepared transaction.
connect con1, localhost, root;
send update t1 set b=10 where a=5;
connection default;
# InnoDB commits first and releases the lock before XA COMMIT is binlogged.
xa commit '1';
connection con1;
reap;
flush binary logs;
exec $MYSQL_BINLOG --verbose $datadir/master-bin.000001;
drop table t1,t2;

The binlog clearly shows that the first transaction updates the row from (5,5) to (5,50), then the second transaction updates (5,50) to (5,10), and only then does the first transaction commit.

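In replication, the slave would get stuck on the second update, because the row is still locked by the prepared first transaction whose XA COMMIT comes later in the binlog, and would ultimately time out.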
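
The ordering problem can be illustrated with a small stand-alone model. This is only a sketch and not MariaDB code: the names Participant and simulate_xa_commit are hypothetical, and it assumes the worst-case schedule in which the blocked second transaction runs and reaches the binlog as soon as InnoDB releases the row lock.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model, not MariaDB code: the two participants of the
// XA transaction from the test case.
enum class Participant { Binlog, InnoDB };

// Walk the participant list of transaction T1 in the given order, the way
// XA COMMIT calls commit for each participant in turn. T2 is blocked on
// T1's row lock and (by assumption) runs as soon as the lock is released.
// Returns the events in the order they reach the binlog.
std::vector<std::string> simulate_xa_commit(const std::vector<Participant>& order) {
    assert(!order.empty());
    // XA PREPARE has already been binlogged together with T1's row change.
    std::vector<std::string> binlog = {"T1: UPDATE (5,5) -> (5,50), XA PREPARE"};
    bool t1_locks_held = true;
    bool t2_done = false;
    for (Participant p : order) {
        if (p == Participant::InnoDB) {
            t1_locks_held = false;              // engine commit releases row locks
        } else {
            binlog.push_back("T1: XA COMMIT");  // binlog participant logs the commit
        }
        // Window between participants: an unblocked T2 updates and is binlogged.
        if (!t1_locks_held && !t2_done) {
            binlog.push_back("T2: UPDATE (5,50) -> (5,10)");
            t2_done = true;
        }
    }
    return binlog;
}
```

Calling simulate_xa_commit({Participant::InnoDB, Participant::Binlog}) yields the sequence seen in the test case, with T2's update logged before T1's XA COMMIT, while the order {Participant::Binlog, Participant::InnoDB} logs the XA COMMIT first.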

This bug would be easier to reproduce if one adds a delay between the commits of different participants:

--- a/sql/handler.cc
+++ b/sql/handler.cc
@@ -1964,6 +1964,7 @@ commit_one_phase_2(THD *thd, bool all
         ++count;
       ha_info_next= ha_info->next();
       ha_info->reset(); /* keep it conveniently zero-filled */
+      my_sleep(1000);
     }
     trans->ha_list= 0;
     trans->no_2pc=0;



 Comments   
Comment by Sergei Golubchik [ 2021-09-21 ]

This should be fixed by the MDEV-21117 semisync recovery patch.

Comment by Andrei Elkin [ 2021-09-21 ]

Right. (I forgot that the former 21469 part has already made it into the sources.)
