[MDEV-26682] slave lock timeout with xa and gap locks Created: 2021-09-25  Updated: 2023-11-10  Resolved: 2021-10-18

Status: Closed
Project: MariaDB Server
Component/s: Replication, Storage Engine - InnoDB, XA
Affects Version/s: 10.5, 10.6
Fix Version/s: 10.5.13, 10.6.5, 10.7.1

Type: Bug
Priority: Major
Reporter: Sergei Golubchik
Assignee: Marko Mäkelä
Resolution: Fixed
Votes: 0
Labels: None

Issue Links:
Blocks
Duplicate
duplicates MDEV-26670 Unable to maintain replication since ... Closed
Problem/Incident
causes MDEV-32272 lock_release_on_prepare_try() does no... Closed
is caused by MDEV-742 LP:803649 - Xa recovery failed on cli... Closed
Relates
relates to MDEV-26652 xa transactions binlogged in wrong order Open
relates to MDEV-28709 unexpected X lock on Supremum in READ... Closed
relates to MDEV-16142 Merge new release of InnoDB MySQL 5.7... Closed

 Description   

There are probably many ways in which gap locks can be placed differently on the master and on the slave. Combined with the 10.5+ XA binlogging (MDEV-742), this can cause lock wait timeouts on slaves, breaking replication. For example:

# this test case can be run with read-committed (or higher) isolation level
source include/have_innodb.inc;
source include/have_binlog_format_row.inc;
source include/master-slave.inc;
create table t1 (a int primary key, b int unique) engine=innodb;
insert t1 values (1,1),(3,3),(5,5);
sync_slave_with_master;
 
# set a strong isolation level to keep the read view below.
# alternatively a long-running select can do that too even in read-committed
set session tx_isolation='repeatable-read';
start transaction;
# opens a read view to disable purge on the slave
select * from t1;
 
connect m2, localhost, root;
# now, delete a value, purge it on the master, but not on the slave
delete from t1 where a=3;
xa start 'x1';
# this sets a gap lock on <3>, when it exists (so, on the slave)
update t1 set b=3 where a=5;
xa end 'x1';
xa prepare 'x1';
 
connect m3, localhost, root;
# and this tries to insert straight into the locked gap
insert t1 values (2, 2);
 
echo -->slave;
sync_slave_with_master;
commit;
select * from t1;
 
connection m2;
xa rollback 'x1';
drop table t1;
source include/rpl_end.inc;

A possible way to fix all lock timeouts on the slave caused by gap locks and XA is to release gap locks on XA prepare.



 Comments   
Comment by Sergei Golubchik [ 2021-09-28 ]

Another example of gap locks and XA. This time there is no purge and there are no selects on the slave side. This case exploits a lock asymmetry: a gap lock blocks an insert intention lock, but an insert intention lock doesn't block a gap lock. We need to execute statements on the master in one order, but have them binlogged in the opposite order:

source include/have_innodb.inc;
source include/have_binlog_format_row.inc;
source include/master-slave.inc;
 
create table t1 (id int not null auto_increment primary key, c1 int not null, unique key(c1)) engine=innodb;
create table t2 (id int not null auto_increment primary key, c1 int not null, foreign key(c1) references t1(c1), unique key(c1)) engine=innodb;
insert t1 values (869,1), (871,3), (873,4), (872,5), (870,6), (877,7);
insert t2 values (795,6), (800,7);
 
xa start '1';
update `t2` set `id` = 9, `c1` = 5 where `c1` in ( null, null, null, null, null, 7, 3 );
 
connect con1, localhost,root;
xa start '2';
delete from `t1` where `c1` like concat( 3, '%' );
xa end '2';
xa prepare '2';
 
connection master;
xa end '1';
xa prepare '1';
 
echo ->slave;
sync_slave_with_master;
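
The asymmetry described above can be illustrated with a toy model. This is only a sketch in Python, not InnoDB code: the names LockRequest, must_wait and replay are hypothetical, transactions "A" and "B" are placeholders, and the model assumes both requests target the same gap. It shows that the same two lock requests succeed in one order and block in the other, which is what happens when the slave applies the transactions in binlog order rather than in master execution order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LockRequest:
    trx: str   # hypothetical transaction name
    kind: str  # "gap" or "insert_intention"


def must_wait(request, granted):
    """True if another transaction holds a gap lock that blocks this request."""
    return any(
        held.trx != request.trx
        and held.kind == "gap"
        and request.kind == "insert_intention"
        for held in granted
    )


def replay(requests):
    """Grant requests in order; return the first one that has to wait, or None."""
    granted = []
    for request in requests:
        if must_wait(request, granted):
            return request
        granted.append(request)
    return None


# Order seen on the master: insert intention first, gap lock second.
insert_first = [LockRequest("A", "insert_intention"), LockRequest("B", "gap")]
# Opposite order, as it could be replayed from the binlog.
gap_first = list(reversed(insert_first))

print(replay(insert_first))  # nothing has to wait in this order
print(replay(gap_first))     # A's insert intention request waits behind B's gap lock
```

In the model, a prepared XA transaction that still holds the gap lock plays the role of "B": the waiter cannot proceed until "B" is committed or rolled back, which on a slave ends in a lock wait timeout.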

Comment by Marko Mäkelä [ 2021-10-18 ]

A possible fix would be to release all non-exclusive locks on XA PREPARE.
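
As a rough illustration of that idea, the following Python sketch filters the set of locks a transaction holds at XA PREPARE. It is a toy model under stated assumptions, not the InnoDB implementation: HeldLock and keep_on_prepare are hypothetical names, and dropping pure gap locks in addition to shared locks combines this suggestion with the one from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeldLock:
    record: int     # hypothetical key value the lock is attached to
    mode: str       # "S" (shared) or "X" (exclusive)
    gap_only: bool  # True if the lock covers only the gap before the record


def keep_on_prepare(locks):
    """Keep exclusive locks on records; drop shared locks and pure gap locks."""
    return [lock for lock in locks if lock.mode == "X" and not lock.gap_only]


held = [
    HeldLock(record=5, mode="X", gap_only=False),  # the updated row
    HeldLock(record=3, mode="X", gap_only=True),   # gap lock next to a delete-marked row
    HeldLock(record=1, mode="S", gap_only=False),  # shared lock from a check
]
remaining = keep_on_prepare(held)
print(remaining)
```

In this model only the exclusive lock on the modified record survives the prepare, so a later insert into the neighbouring gap would no longer wait for the prepared transaction.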

Comment by Sergei Golubchik [ 2021-10-18 ]

c1a9b1c2f21e6796b687 is almost ok.

  1. please add a comment near thd_sql_command() explaining why it was needed
  2. add tests, e.g. as above.

after that — ok to push, thanks!
