MariaDB Server / MDEV-32020

XA transaction replicates incorrectly, must be applied at XA COMMIT, not XA PREPARE

Details

    Description

      XA changes done in 10.5 introduced a regression that breaks replication.

      The problem is that the slave now applies XA transactions while replicating
      an XA_prepare_log_event binlogged by "XA PREPARE" on the master. This is
      wrong: the transaction must not be applied on the slave until "XA COMMIT",
      as is done correctly in 10.4.

      Applying the XA PREPARE on the slave leaves dangling InnoDB row locks that
      can conflict with the replication of later transactions and cause
      replication to break. The below test case (also attached) demonstrates one
      simple instance of this.

      Another problem is that splitting a transaction in this way in the binlog
      means there is no longer a unique binlog position corresponding to the
      database state. This is demonstrated by the attached test case
      rpl_xa_provision.test.

      This test case takes a mysqldump while an XA PREPARED transaction is active
      on the master, and uses it to provision a new slave. The new slave's GTID
      position cannot be set correctly. Setting it after the XA PREPARED
      transaction means the XA COMMIT will fail. But setting it before the XA
      PREPARE would also not be correct, as it would duplicate transactions
      binlogged after the XA PREPARE. Thus, in 10.5, the provisioned slave breaks
      its replication.
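      The dilemma can be sketched as follows (the GTIDs and xid below are
      illustrative only, not taken from the attached rpl_xa_provision.test):

```sql
-- On the master; suppose the current GTID position is 0-1-100:
XA START 'x1';
UPDATE t1 SET c = c + 1;      -- changes go to the binlog trx cache
XA END 'x1';
XA PREPARE 'x1';              -- 10.5 binlogs this with its own GTID, e.g. 0-1-101
INSERT INTO t2 VALUES (1);    -- an unrelated transaction, GTID 0-1-102
-- mysqldump is taken here and used to provision a new slave.
XA COMMIT 'x1';               -- GTID 0-1-103
-- Choices for the new slave's GTID position:
--   0-1-102: the replicated XA COMMIT 'x1' fails, no prepared trx exists there
--   0-1-100: GTIDs 0-1-101 and 0-1-102 get applied a second time (duplicates)
```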

      The fix is to revert the change so that XA transactions are applied on the
      slave only as part of the XA COMMIT event. When the XA PREPARE event is
      received by the slave, it must not be applied. Instead it can be saved
      somewhere (there are several possible designs). If the master crashes and
      the slave is promoted to be the new master, those saved XA PREPAREd events
      can then be used to recover the XA transactions into the prepared state for
      the application to XA COMMIT or XA ROLLBACK.

      High-level design description for this fix:

      1. At XA PREPARE, the master will binlog a special Xa_prepared_trx_log_event
      event containing the contents of the binlog trx cache. This is binlogged
      without a GTID. The existing binlog_checkpoint mechanism is used to preserve
      the binlog file while this XA transaction is in the prepared state.

      2. The server keeps a record of pending XA PREPAREs and their positions in
      the binlog, for later reference. This is reconstructed by scanning the
      binlog at server startup/crash recovery. Optionally, this can be optimized
      by saving the information (like the GTID state) at clean shutdown to avoid
      the scan in the non-crash-recovery case.

      3. At XA COMMIT, the corresponding binlog trx cache data is read from the
      Xa_prepared_trx_log_event entry in the binlog, and a full, normal commit
      transaction is binlogged containing the event data.
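      Steps 1 and 3 together imply a binlog layout roughly like the following
      sketch (event names from the design above; GTID values are illustrative):

```
# Hypothetical master binlog contents under this design:
Binlog_checkpoint             # preserves this file while 'x1' stays prepared
Xa_prepared_trx_log_event     # XA PREPARE 'x1': saved binlog trx cache, no GTID
Gtid 0-1-102                  # unrelated transactions continue normally
...
Gtid 0-1-103                  # XA COMMIT 'x1': a full, normal commit transaction
...                           #   (event data read back from the
                              #    Xa_prepared_trx_log_event entry above)
```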

      4. Optionally, we can optimize XA COMMIT to only binlog a placeholder with
      reference back to the Xa_prepared_trx_log_event entry, and use the binlog
      checkpoint mechanism to preserve the binlog file containing the entry for as
      long as needed.

      5. By default, the slave will ignore the Xa_prepared_trx_log_event (the dump
      thread can simply skip sending these events to the slave, or the slave can
      just ignore them), and the XA COMMIT will be replicated as a normal commit.

      6. The user can enable some --replicate-xa option, which will make the
      slave process the Xa_prepared_trx_log_event. This will require
      --log-slave-updates to be enabled on the slave. When the
      Xa_prepared_trx_log_event is processed on the slave, it is simply
      binlogged; no events are applied. When the commit events are received,
      they are applied as a normal transaction, and the pending
      Xa_prepared_trx_log_event in the slave's binlog is released.
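      Under point 6, opting in to XA-aware replication on a slave might look like
      the following configuration sketch (--replicate-xa is a placeholder name
      from the design, not an existing server variable):

```ini
# my.cnf on the slave; hypothetical option name from point 6
[mysqld]
replicate-xa      = ON          # process Xa_prepared_trx_log_event, do not skip it
log-slave-updates = ON          # required: the events must reach the slave's binlog
log-bin           = slave-bin
```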

      7. If the optimization in (4) is implemented, the placeholder commit event
      on the slave will read the event data to be applied from the relay log, or
      from the slave's binlog if the data has been processed and the
      corresponding relay log purged. The dump thread on the master will check,
      for each placeholder commit event, whether the slave has already been sent
      the corresponding Xa_prepared_trx_log_event, and if not will read out the
      full commit event and send it to the slave for normal commit processing.

      8. When the slave is promoted to a new master, any pending XA PREPAREd
      events that have been processed on the slave can be explicitly instantiated
      by the user into the XA prepared, committed, or rolled back state. The
      corresponding XIDs are listed in XA RECOVER on the slave, and can be
      specified in SQL statements XA PREPARE <xid>, XA COMMIT <xid>, XA ROLLBACK
      <xid>. These statements will then read out the pending events and apply them
      as normal. (The exact syntax for this is up for discussion, maybe a separate
      option keyword or different statement should be used for slave promotion).
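      Concretely, promotion under point 8 might look like this sketch (the exact
      syntax is explicitly up for discussion in the design; XIDs are made up):

```sql
-- On the promoted slave:
XA RECOVER;           -- lists XIDs with pending saved Xa_prepared_trx_log_event data
XA PREPARE 'xid1';    -- instantiate the saved events into a real prepared trx
XA COMMIT 'xid2';     -- or apply the saved events and commit in one step
XA ROLLBACK 'xid3';   -- or discard the pending transaction
-- The external transaction coordinator later decides 'xid1' with
-- XA COMMIT 'xid1' or XA ROLLBACK 'xid1', as on any master.
```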

      9. If the locking happens to be different on the slave than on the master
      when promoting pending XA prepared transactions, these might hit locking
      conflicts. However, since they can be run by the user in any order and in
      parallel, promotion can still proceed; once the external transaction
      coordinator issues a commit or rollback decision for one transaction, the
      blocked promotion can then continue. This way, statement-based replication
      and non-primary-key row-based replication can still work for XA promotion
      on the slave.

      A proof-of-concept implementation of this design has been pushed to the
      branch knielsen_mdev32020. This patch is not yet ready for testing, but it
      demonstrates the feasibility of all parts of the design. Once the
      high-level design has been finalized wrt. the precise syntax to use etc.,
      and a decision has been made on which version to implement this in, the
      patch can easily be completed according to the embedded "ToDo" comments.

      --source include/have_innodb.inc
      --source include/have_binlog_format_row.inc
      --source include/master-slave.inc
       
      --connection master
       
      CREATE TABLE t1 (a int, b int, c int,
        INDEX i1(a),
        INDEX i2(b))
        ENGINE=InnoDB;
       
      INSERT INTO t1 VALUES
        (1,1,0), (1,2,0),
        (2,1,0), (2,2,0);
      --sync_slave_with_master
       
      --source include/stop_slave.inc
      SET @old_timeout= @@GLOBAL.innodb_lock_wait_timeout;
      SET @old_retries= @@GLOBAL.slave_transaction_retries;
      SET GLOBAL innodb_lock_wait_timeout= 2;
      SET GLOBAL slave_transaction_retries= 3;
      --source include/start_slave.inc
       
      --connection master
      XA START "t1";
      UPDATE t1 FORCE INDEX (i2) SET c=c+1 WHERE a=1 AND b=1;
      XA END "t1";
      XA PREPARE "t1";
       
      --connection master1
      XA START "t2";
      UPDATE t1 FORCE INDEX (i2) SET c=c+1 WHERE a=1 AND b=2;
      XA END "t2";
      XA PREPARE "t2";
       
      --connection master
      XA COMMIT "t1";
       
      --connection master1
      XA COMMIT "t2";
       
      --connection master
      SELECT * FROM t1 ORDER BY a,b,c;
       
      --sync_slave_with_master
      SELECT * FROM t1 ORDER BY a,b,c;
       
      # Cleanup
      --connection master
      DROP TABLE t1;
       
      --connection slave
      SET GLOBAL innodb_lock_wait_timeout= @old_timeout;
      SET GLOBAL slave_transaction_retries= @old_retries;
       
      --source include/rpl_end.inc
      

      Attachments

        Issue Links

          Activity

            knielsen Kristian Nielsen added a comment - - edited

            > Brandon Nesterenko and Kristian Nielsen, this bug is critical and is blocking MDEV-32014,
            > which is part of the 11.7 preview. Can you please see what can be done to fix it?

            julien.fritsch, that makes no sense. I have been trying to get the XA developers to even acknowledge the problem described in this bug for >2 years without success. And why would MDEV-32020 be blocking MDEV-32014, which seems completely unrelated?

            • Kristian.
            Elkin Andrei Elkin added a comment - - edited

            julien.fritsch, MDEV-32014 does not relate to the current one. This one relates to MDEV-34481, which covers the slave hang part of the MDEV-34466 ticket.


            marko Marko Mäkelä added a comment -

            For the record, I ran the test case in the Description with two options (2×2=4 combinations), but each time it would fail as follows:

            10.11 8a6a4c947a0ca3d2fdca752d7440bdc5c6c83e37

            analyze: sync_with_master
            mysqltest: At line 45: sync_slave_with_master failed: 'select master_pos_wait('master-bin.000001', 1786, 300, '')' returned NULL indicating slave SQL thread failure
            

            I tried

            ./mtr --mysqld=--innodb-snapshot-isolation=ON
            

            and the following patch to revert a work-around of this bug:

            diff --git a/storage/innobase/trx/trx0trx.cc b/storage/innobase/trx/trx0trx.cc
            index 2b304868b14..02c03ccdcae 100644
            --- a/storage/innobase/trx/trx0trx.cc
            +++ b/storage/innobase/trx/trx0trx.cc
            @@ -2037,8 +2037,6 @@ trx_prepare(
             			/* Do not release locks for XA COMMIT ONE PHASE
             			or for internal distributed transactions
             			(XID::get_my_xid() would be nonzero). */
            -		} else {
            -			lock_release_on_prepare(trx);
             		}
             	}
             }
            

            I feel that releasing any locks before a transaction is committed can lead to potential transaction isolation violations and introduce read/write conflicts. I am glad that with MDEV-34690, we stop releasing the locks on the primary server, but we’d still keep doing this on the replicas. I was really hoping that setting innodb_snapshot_isolation=ON would rescue the situation, but based on my experiment above, it does not seem to be the case. I did not debug this in detail; maybe the ER_CHECKREAD handling could be improved?

            As far as I understand, the only argument against changing the way XA PREPARE and XA COMMIT are applied on a replica is the following: when the primary server fails and a replica is promoted to be the new master, the promoted replica would need to apply each XA PREPAREd transaction, and failure scenarios such as the one in the Description would still be possible.

            To my understanding, a failover situation should be very rare. The following regression test files do mention fail-over:

            mysql-test/suite/rpl/t/rpl_circular_for_4_hosts.test
            mysql-test/suite/rpl/t/rpl_gtid_basic.test
            mysql-test/suite/rpl/t/rpl_semi_sync_crash.inc
            mysql-test/suite/rpl/t/rpl_semi_sync_fail_over.test
            

            None of these test files make use of any XA transactions. Hence, such a failover scenario is not being tested and might not work. For proper testing, I think that we would need to involve MaxScale as well.

            I do not quite understand the logic of the argument. I paraphrase it as: "We must not do this, because it would not work in (some rare scenario that may not currently work anyway)." It feels a bit like "perfect is the enemy of good enough." Users of XA transactions in "normal" replication (without any fail-over) would stop having any trouble if we implement the suggestion. From the storage engine point of view, replicas would only execute regular transactions when the replication layer processes a XA COMMIT statement. Storage engines would not see any XA PREPARE; anything that ends in XA ROLLBACK would be filtered out by the replication layer. The lock conflict resolution for non-XA transactions works just fine.


            I forgot to mention some key points from yesterday’s discussion:

            • It was suggested that only binlog_format=row be allowed for XA transactions. (This is not a complete solution, because the test case in the Description is already using that format.)
            • Also innodb_force_primary_key=ON was brought up in the discussions. I think that it would be a reasonable limitation to refuse to replicate XA transactions if the involved tables lack PRIMARY KEY. The test case in the Description would pass if we add PRIMARY KEY(a,b,c).
            • I suggested investigating the impact of enabling innodb_snapshot_isolation=ON on replicas. Would we get any ER_CHECKREAD errors? (They will lead to transaction rollback and retry, just like ER_LOCK_DEADLOCK.)
            Elkin Andrei Elkin added a comment - - edited

            marko, let me reply to the points in your penultimate comment.

            1. Regarding innodb-snapshot-isolation=ON on the Description test, I recommend
            discussing that in MDEV-34481, which I forked out in order to focus on the technical matter of the hang.
            On the MDEV-34481 fixes branch I ran the mtr test with the option and saw it pass, as it also does without.
            To my understanding, --innodb-snapshot-isolation is irrelevant to the fixes, which essentially implement a READ PAST locking mode. That patch must make it perfectly clear that the whole issue is that
            InnoDB X-locks records it was never going to update. For ROW-format replication of tables with only non-unique indexes, therefore, the MDEV-33454/MDEV-34466 policy is insurmountable.

            2. Let me confirm it from my side:
            > the only argument against changing the way how XA PREPARE and XA COMMIT are applied on a replica
            is indeed the failover one.
            As to the rare failover, unfortunately there is no way to ban it altogether, especially the crash-recovery one.
            The rarity may not be a concern for the user, but what is at stake is always at least one transaction loss.
            Let it be an XA-prepared one.
            The logic of the argument is simple: follow the XA specification, at least unless the user willingly succumbs to disgrace, like MDEV-35019.

            3. XA recovery in replication is tested e.g. in rpl suite tests invoking rpl_xa.inc.

            4. Sure, converting XA transactions to normal transactions is a good thing, as we would not have any special branches to implement and maintain; we should just be careful not to throw away the p.2 constraint with the bath water.


            People

              knielsen Kristian Nielsen
              knielsen Kristian Nielsen
              Votes: 3
              Watchers: 17
