[MDEV-29410] abort-and-replay prepared XA transactions on the slave - Jira

Details

Type: Task
Status: Open
Priority: Minor
Resolution: Unresolved
Fix Version/s: None
Component/s: Replication, XA
Labels: None

Description

The parallel applier guarantees that transactions are committed in a fixed, predefined order (the same as on the master). If trx1 must be committed before trx2, but the parallel applier executes them concurrently and trx2 happens to block trx1, then the applier aborts trx2, allows trx1 to finish, and then re-executes trx2.
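As an illustration only, the abort-and-replay rule described above can be sketched as a toy model. The function `apply_in_order` and its `blocked_by` argument are hypothetical names invented for this sketch; they do not correspond to anything in the MariaDB source.

```python
def apply_in_order(order, blocked_by):
    """Toy model of in-order parallel apply (not MariaDB code).

    order: transaction names in master commit order.
    blocked_by: maps a transaction to the transaction currently blocking it.
    Returns the sequence of actions the applier would take.
    """
    log = []
    aborted = set()
    for trx in order:
        blocker = blocked_by.get(trx)
        # Only a *later* transaction may be sacrificed for an earlier one.
        if blocker is not None and order.index(blocker) > order.index(trx):
            log.append(f"abort {blocker}")
            aborted.add(blocker)
        if trx in aborted:
            # The aborted transaction is re-executed from the relay log.
            log.append(f"replay {trx}")
        log.append(f"commit {trx}")
    return log


print(apply_in_order(["trx1", "trx2"], {"trx1": "trx2"}))
```

The replay step is exactly what is unavailable once the later transaction's event group has been considered complete, which is the situation this issue is about.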

This does not work if trx1 is an XA transaction. It becomes persistent on XA PREPARE, so it's XA PREPARE that must happen before trx2, not XA COMMIT. But XA PREPARE doesn't release all locks, so XA PREPARE is not a guarantee that a conflicting trx2 will be able to continue.

How can this be fixed?

Issue Links

is caused by

MDEV-742 LP:803649 - Xa recovery failed on client disconnection

Closed

Activity

Kristian Nielsen added a comment - 2022-09-11 15:00

First, I don't understand why XA PREPARE cannot be rolled back? The whole
point of XA PREPARE is to leave the transaction in a state where it can both
be rolled back and retried. Why not do an XA ROLLBACK in case of conflict
and then re-try?

Is it because the global transaction id is persisted also after rollback,
and a new XA START with the same id will fail? Though that doesn't seem
possible, as it would require being able to look up all ids forever in the
server. Even if this is the case, it can be solved by allocating a new
replication-specific transaction id name in the slave applier (ISTR LOAD
FILE is handled similarly to avoid file name conflict). This could also help
identify slave XA transactions left in the PREPARED state and roll them back
at server startup.
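A minimal sketch of the renaming idea in the paragraph above, under stated assumptions: the prefix `_repl_`, the helper names, and the retry counter are all hypothetical and are not an existing MariaDB convention.

```python
REPL_PREFIX = "_repl_"  # hypothetical marker for applier-owned XA ids


def slave_xid(master_xid, attempt):
    """Derive a slave-local XA id so a retry never reuses a previous id."""
    return f"{REPL_PREFIX}{attempt}_{master_xid}"


def xids_to_roll_back_at_startup(prepared_xids):
    """Pick out prepared transactions that were created by the applier."""
    return [xid for xid in prepared_xids if xid.startswith(REPL_PREFIX)]


print(slave_xid("order-17", 2))
print(xids_to_roll_back_at_startup(["user-xa", "_repl_1_order-17"]))
```

A real implementation would also need to map the master's id in a later XA COMMIT or XA ROLLBACK event back to the slave-local id; that mapping is left out here.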

As far as I can see there should be no problem with rolling back and
re-trying an XA PREPAREd transaction, in fact the XA system seems to
guarantee that this is possible.

Second, why binlog XA PREPARE at all? XA is a mechanism to ensure consistent
commit between multiple transactional systems, one of them the original
MariaDB master server. Replication slave servers are not involved in this in
any way. Normal transactions are not replicated until and unless they commit
on the master, why should XA transactions work differently? Maybe this is
the real bug here?

Replicating XA PREPARE leaves a prepared transaction active on the slave,
with all of the complexity that incurs - and it leaves dangling locks on the
slave potentially for a long time, if the user XA COMMIT is delayed for some
reason. It would be much preferable to keep the binlog cache on the master
across XA PREPARE, and binlog it only at XA COMMIT time - just like other
transactions. It doesn't even have to be binlogged as an XA transaction,
just a normal transaction is fine.

This does require persisting the binlog cache to preserve an XA PREPAREd
transaction across master restart; that can be done in a system InnoDB
table. This is some work but relatively straight-forward, and surely simpler
than trying to implement sparse relaylogs on the slave.

Third, the XA COMMIT (and XA ROLLBACK) event groups must be marked non-trans
in the binlog (they are currently marked as "trans"). Unlike XA PREPARE,
these cannot be rolled back, and also they cannot be safely applied in
parallel with earlier transactions (in case of their own XA PREPARE event
group). This seems clearly a bug (with trivial fix). When XA COMMIT and XA
ROLLBACK are not marked transactional, the parallel replication will wait
for all prior commits to complete before executing them.

The bug description mentions that "it can be purged from relay log after
that". I don't see why this is the case? The inuse_relaylog mechanism exists
to ensure that relay logs are kept as long as needed by parallel replication
retry. I also don't see how relaylogs would become sparse. It doesn't seem
justifiable to introduce all this complexity with sparse relaylogs and
re-fetching from master in SQL thread just for the sake of a little-used
feature like XA - nor does it seem necessary as per above suggestions?

Hope this helps,

Kristian.

Sergei Golubchik added a comment - 2022-09-12 10:33

Technically, a transaction after XA PREPARE can be rolled back, and should. This MDEV is about doing exactly that.

But currently an "XA transaction" in the relay log is a sequence of events from XA START to XA PREPARE. This is what the master writes to the binlog: the binlog trx_cache in THD is flushed to the binlog on XA PREPARE. So, while a transaction in the sql worker thread can be rolled back after XA PREPARE, from the relay log's point of view the transaction is done; the relay log forgets about it and it cannot be re-applied. This is what this MDEV wants to fix: to preserve XA transactions over XA PREPARE up to XA COMMIT or XA ROLLBACK. Somehow.
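To make the event-group boundary concrete, here is a small sketch that splits a stream of statement strings into groups the way the paragraph above describes. The function name and the list of terminating statements are assumptions for illustration; the real binlog uses typed events, not SQL text.

```python
# Statements that close an event group in this toy model (an assumption).
GROUP_TERMINATORS = ("XA PREPARE", "XA COMMIT", "XA ROLLBACK", "COMMIT")


def split_event_groups(events):
    """Split a flat event stream into event groups."""
    groups, current = [], []
    for event in events:
        current.append(event)
        if event.startswith(GROUP_TERMINATORS):
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups


stream = [
    "XA START 'x'", "INSERT INTO t VALUES (1)", "XA END 'x'", "XA PREPARE 'x'",
    "BEGIN", "UPDATE t SET a = 2", "COMMIT",
    "XA COMMIT 'x'",
]
for group in split_event_groups(stream):
    print(group)
```

In this model the XA transaction 'x' spans the first and the third group. The unit of retry is a single group, so once the first group is finished nothing ties the later XA COMMIT back to the statements that would have to be re-applied.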

"why binlog XA PREPARE at all" — this was ~~MDEV-742~~, a way to make binlog 2PC capable, so that a binlog would be able to prepare a transaction (make it persistent), and later commit it, or roll it back.

Kristian Nielsen added a comment - 2022-09-12 16:02

I still think the inuse_relaylog should ensure that the relaylog does not go away too early.

When the slave worker executes XA PREPARE, this should participate in binlog group commit (it writes to slave binlog, right), which includes doing a wait_for_prior_commit().

Until wait_for_prior_commit() completes, the transaction can be safely rolled back, the XA PREPARE is not yet persisted, the relay log is not yet deleted.

After wait_for_prior_commit(), there are no earlier commits to conflict with, the optimistic parallel replication will not need to rollback and retry the XA PREPARE.
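The ordering argument can be illustrated with a toy model in which each worker may execute in any order but completes only after its predecessor has completed. Only the name wait_for_prior_commit() comes from this discussion; `run_workers` and the use of `threading.Event` are assumptions made for the sketch.

```python
import threading


def run_workers(names):
    """Run one thread per event group; return the order of completion.

    names must be unique and listed in master commit order.
    """
    done = {name: threading.Event() for name in names}
    commit_order = []

    def worker(index):
        name = names[index]
        # The event group would be executed here; until the wait below
        # returns, the work can still be rolled back and retried.
        if index > 0:
            done[names[index - 1]].wait()  # stand-in for wait_for_prior_commit()
        commit_order.append(name)  # point of no return: group is complete
        done[name].set()

    # Start the later groups first to show that start order does not matter.
    threads = [threading.Thread(target=worker, args=(i,))
               for i in reversed(range(len(names)))]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return commit_order


print(run_workers(["trx1", "xa_prepare_x", "trx3"]))
```

If XA PREPARE takes part in this chain like any other group-completing event, it cannot become persistent while an earlier transaction is still pending.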

I think if this doesn't work, there is a (simple) bug that should be fixed. Or is there something I'm missing?

Is there a test case that shows the problem?

To the second point, the XA PREPARE is written to the binlog for the sake of 2PC persistency, ok. This doesn't explain why it is replicated to the slaves? There seems to be no benefit in having the transaction in XA PREPAREd state on the slave (and a number of disadvantages).

Save the binlog cache in memory after XA PREPARE on the master. Then at XA COMMIT, write it to the binlog for the slave to replicate (with a normal START TRANSACTION/COMMIT). In case of crash, load it into binlog cache again during crash recovery.
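A rough sketch of that proposal, assuming a hypothetical `MasterBinlogSketch` class: the cached statements are held per XA id at XA PREPARE and reach the replicated binlog only at XA COMMIT, wrapped as an ordinary transaction. Crash recovery (reloading the cache after a restart) is not modelled.

```python
class MasterBinlogSketch:
    """Toy model of deferring binlogging of an XA transaction to XA COMMIT."""

    def __init__(self):
        self.binlog = []    # event groups visible to slaves
        self.prepared = {}  # xid -> statements cached at XA PREPARE

    def xa_prepare(self, xid, statements):
        # Keep the cache instead of writing an XA PREPARE group for slaves.
        self.prepared[xid] = list(statements)

    def xa_commit(self, xid):
        statements = self.prepared.pop(xid)
        # Replicate as a normal transaction, not as an XA transaction.
        self.binlog.append(["START TRANSACTION", *statements, "COMMIT"])

    def xa_rollback(self, xid):
        # Nothing was replicated, so nothing needs to be undone on slaves.
        self.prepared.pop(xid)


master = MasterBinlogSketch()
master.xa_prepare("x", ["INSERT INTO t VALUES (1)"])
print(master.binlog)  # still empty after XA PREPARE
master.xa_commit("x")
print(master.binlog)
```

With this shape the slave never holds a prepared transaction or its locks.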

The XA PREPARE binlog event group can be there, just don't send it to the slave, or send it but ignore it on the slave.

It seems needlessly complicated to have replicated transactions in XA PREPAREd state on the slave. For example, what happens if the slave is switched to a different master while an XA PREPAREd transaction is in the middle of being replicated?

Hope this helps,

Kristian.

Kristian Nielsen added a comment - 2022-09-12 16:13

Reading the original description again:

"trx2 is an XA transaction that managed to do XA PREPARE before trx1 is blocked"

This shouldn't be possible. The XA PREPARE is similar to a commit/XID event: it completes the event group. So it must not complete until all prior transactions have committed (i.e. it must do wait_for_prior_commit() before completing).

If it does not currently do that, then maybe that is the real bug here?

If XA PREPARE writes to the binlog (as I would think), there is an optimized code path that does the wait_for_prior_commit() implicitly as part of binlog group commit.

A less optimal way is to just run wait_for_prior_commit() at the start of XA PREPARE.

I don't see a reason that the normal wait_for_prior_commit mechanism to ensure correct parallel replication order and rollback/retry from relay log files should not also work for XA PREPARE.

Andrei Elkin added a comment - 2022-09-15 09:14

knielsen, let me reply to some of your questions (I am not yet regular at kbd).

> This shouldn't be possible. The XA PREPARE is ...
> ... then maybe that is the real bug here?

Indeed: ~~MDEV-28709~~, ~~MDEV-26682~~. The latter aimed to circumvent asymmetric locking behaviour in InnoDB. Namely, GAP and InsertIntention (II) locks conflict when II is granted first and GAP is requested last. Combine that with the fact that master and slave can execute the lock requests of two transactions in different orders.

Getting rid of GAP locks, which are useless and harmful for replication, must be the right way to go, but this ticket is rather cautious about implementing that objective.
So if for any reason a prepared XA transaction blocks a later (in binlog order) trx, we would temporarily remove it out of the way.

As for why ~~MDEV-742~~ replicates the XA transaction in the prepared state: that is to address failover. A slave becomes promotable to master at any time without losing the prepared trx, since the user sees it as prepared.

(I'll respond to other questions a bit later)
