[MDEV-32020] XA transaction replicates incorrectly, must be applied at XA COMMIT, not XA PREPARE Created: 2023-08-26 Updated: 2024-01-21 |
|
| Status: | In Progress |
| Project: | MariaDB Server |
| Component/s: | Replication, XA |
| Affects Version/s: | 10.5.2 |
| Fix Version/s: | 10.5 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Kristian Nielsen | Assignee: | Kristian Nielsen |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
||||||||||||||||
| Issue Links: |
|
||||||||||||||||
| Description |
|
The XA changes done in 10.5 introduce a regression that breaks replication. The problem is that the slave now applies XA transactions while replicating: the XA PREPARE is applied as one event group and the XA COMMIT as another, instead of the whole transaction being applied at XA COMMIT. Applying the XA PREPARE on the slave leaves dangling InnoDB row locks that can block later replicated transactions and hang replication. Another problem is that splitting a transaction in this way in the binlog means it is no longer applied atomically on the slave. This test case takes a mysqldump while an XA PREPARED transaction is active, which produces an inconsistent dump. The fix is to revert the change so that XA transactions are applied on the slave in one piece at XA COMMIT, as before.
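The sequence described above can be sketched as follows (a hedged illustration; the table name and XID are invented for this sketch, not taken from the attached test case):

```sql
-- On the master:
XA START 'xid1';
UPDATE t1 SET c = c + 1 WHERE a = 1;
XA END 'xid1';
XA PREPARE 'xid1';   -- binlogged and replicated immediately; the slave
                     -- applies it and is left holding InnoDB row locks
-- ... a mysqldump taken at this point sees a half-applied state ...
XA COMMIT 'xid1';    -- binlogged and replicated later, as a separate
                     -- event group
```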
|
| Comments |
| Comment by Andrei Elkin [ 2023-08-28 ] | ||||||||||||||||||||||||||||||
|
knielsen, indeed employing a non-default non-unique index is problematic. Thanks for revealing it! I will not argue over how critical this issue is; I doubt it's critical, since, as you know, the use of a non-unique index in replication is generally fragile: not always leading to hangs, but prone to data inconsistency.
As to the separate mysqldump issue, it obviously needs proper integration with XA binlogging. | ||||||||||||||||||||||||||||||
| Comment by Vladislav Lesin [ 2023-08-28 ] | ||||||||||||||||||||||||||||||
|
The issue is in the locking order. After the initial INSERT we have the following indexes:
I set the RC isolation level on both master and slave to simplify debugging. All the locks taken are non-gap X-locks. The locking order is the following. Locking order on master for XA "t1" (UPDATE t1 FORCE INDEX (i2) SET c=c+1 WHERE a=1 AND b=1):
Locking order on master for XA "t2" (UPDATE t1 FORCE INDEX (i2) SET c=c+1 WHERE a=1 AND b=2):
As we can see, the transactions don't conflict with each other, since index i2 is forced. Now let's consider the slave. Locking order on slave for XA "t1", which executes the row events for "UPDATE t1 SET c=c+1 WHERE a=1 AND b=1"; index i1 is used:
Locking order on slave for XA "t2", executes row events for "UPDATE t1 SET c=c+1 WHERE a=1 AND b=2", index i1 is used:
As index i1 is used on the slave ("force index" is not passed along with the row-based events), we have a lock conflict which does not happen on the master. Note that for non-XA transactions we would not have such an issue, because once a transaction is committed its locks are released, and the conflicting transaction can be retried. For XA, 'XA PREPARE "t1"' holds its locks while XA "t2" is executing, and, if I understood it correctly (Elkin, correct me if I am wrong), XA "t1" can't be committed before XA "t2" is prepared, as for XA not only the commit order but also the prepare order is preserved. | ||||||||||||||||||||||||||||||
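The scenario analysed above can be reproduced with a sketch like the following. The comment does not show the exact schema, so the column types and index definitions (i1 on (a), i2 on (a, b)) are assumptions:

```sql
CREATE TABLE t1 (a INT, b INT, c INT,
                 KEY i1 (a), KEY i2 (a, b)) ENGINE=InnoDB;
INSERT INTO t1 VALUES (1, 1, 0), (1, 2, 0);

-- On the master, the two XA transactions lock disjoint i2 records:
XA START 't1';
UPDATE t1 FORCE INDEX (i2) SET c = c + 1 WHERE a = 1 AND b = 1;
XA END 't1';
XA PREPARE 't1';

XA START 't2';
UPDATE t1 FORCE INDEX (i2) SET c = c + 1 WHERE a = 1 AND b = 2;
XA END 't2';
XA PREPARE 't2';   -- on the slave, the row events use index i1 instead,
                   -- so this blocks on locks held by prepared "t1"
```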
| Comment by Andrei Elkin [ 2023-08-28 ] | ||||||||||||||||||||||||||||||
|
> XA "t1" can't be committed before XA "t2" is prepared | ||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-09-21 ] | ||||||||||||||||||||||||||||||
|
How would this proposal affect MDEV-31999? | ||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-09-21 ] | ||||||||||||||||||||||||||||||
|
The only potential problem that I (as someone who does not know replication well) would foresee with this proposal is that the primary server would have to retain all binlog data between XA START and XA PREPARE until an XA ROLLBACK has been executed or an XA COMMIT replicated. Hopefully all binlog writing handles out-of-space issues gracefully; I vaguely remember that otto pointed out a problematic case in the past, but I can't remember any details. | ||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-09-21 ] | ||||||||||||||||||||||||||||||
|
knielsen, a drawback of not replicating anything before XA COMMIT would be that when a replica is promoted to a master, it would be missing all transactions that had successfully reached the XA PREPARE state. But, what if we change the logic at the replica, and not the primary server?
| ||||||||||||||||||||||||||||||
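Under the proposal above, an administrator on a promoted replica could discover and resolve the surviving prepared transactions with the standard XA statements (a sketch; the XID is illustrative):

```sql
XA RECOVER;          -- lists transactions still in XA PREPARE state
                     -- (formatID, gtrid_length, bqual_length, data)
XA COMMIT 'xid1';    -- or XA ROLLBACK 'xid1', as the user/TM decides
```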
| Comment by Kristian Nielsen [ 2023-09-21 ] | ||||||||||||||||||||||||||||||
|
marko, I'm not sure about MDEV-31999 ("Can not XA COMMIT a recovered prepared XA transaction when autocommit off"); I think it is unrelated to replication or to this proposal. Agreed that the primary server would need to retain the binlog of the XA PREPARE until XA COMMIT or XA ROLLBACK. (Not from XA START, I believe, which is not binlogged.) The existing binlog checkpoint mechanism could be used to ensure this, and also to ensure that the replication state of the XA PREPARE will be recovered from the binlog after a crash. I don't see any problems retaining the binlog between XA PREPARE and XA COMMIT. This period will (hopefully) be short compared to the period for which binlogs are normally preserved, which is usually long (eg. days), since a binlog needs to be retained for the maximum time any slave can be off-line and still be able to re-connect and catch up. Agreed with the idea to require slaves/replicas to write and retain binlogs for any replicated XA PREPARE that needs to be recoverable on a slave promoted to master; I think this is a reasonable requirement to support XA PREPARE on one server and XA COMMIT on another across async replication. And exactly as you wrote, applying the transaction would start upon XA COMMIT, which is how it used to work, and which solves a lot of fundamental issues and regressions with the current approach. The replication of XA PREPARE would just involve updating some in-memory state that can be used to recover the transaction later, if the slave is promoted to master and an XA COMMIT of that XID is attempted by the user. This in-memory state would be recovered from the slave's binlog in case of a slave crash and restart, again re-using the binlog checkpoint mechanism. | ||||||||||||||||||||||||||||||
| Comment by Andrei Elkin [ 2023-09-21 ] | ||||||||||||||||||||||||||||||
|
knielsen, just to let you know | ||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-09-22 ] | ||||||||||||||||||||||||||||||
|
There is a concern that an application could create arbitrarily many transactions in the XA PREPARE state, and the architecture would have to deal with that. Long-running transactions have a well-known impact on InnoDB: they block the purge of committed transaction history, and the undo logs keep growing. This is limited by the file system, not by any RAM data structure. I would claim that a well-behaving application would not leave behind any XA PREPARE transactions. To my understanding, replication events are primarily buffered in RAM, sometimes backed by temporary files. Without using temporary files, the maximum size of a transaction would be limited by the size of available RAM. Compared to non-distributed transactions, which are written to the binlog at COMMIT time, distributed transactions add the extra complexity that the number of transactions needing buffering is not limited by the number of active client connections; multiple transactions may be "stashed" (XA PREPARE), and this must be persistent: the transactions must be retained across any server restarts until an explicit XA ROLLBACK or XA COMMIT is received. Sorry, I should have been more careful when looking up possibly related tickets. For this discussion, MDEV-31998 and MDEV-21469 look relevant to me. | ||||||||||||||||||||||||||||||
| Comment by Kristian Nielsen [ 2023-09-22 ] | ||||||||||||||||||||||||||||||
|
marko, the binlog events for active transactions are stored in an IO_CACHE, which is always disk-backed AFAIK; RAM usage is limited to what's needed for the buffered I/O. By using the binlog file as the recovery source of (replication events for) XA PREPARED transactions, there would be no need to hold event data in RAM for active transactions, so that should be fine. |