First, I don't understand why an XA PREPAREd transaction cannot be rolled
back. The whole point of XA PREPARE is to leave the transaction in a state
where it can be both rolled back and retried. Why not do an XA ROLLBACK in
case of conflict and then retry?
Is it because the global transaction id is persisted even after rollback,
so that a new XA START with the same id will fail? That doesn't seem
possible, though, as it would require being able to look up all ids forever
in the server. Even if this is the case, it can be solved by allocating a new
replication-specific transaction id name in the slave applier (I seem to
recall LOAD FILE is handled similarly to avoid file name conflicts). This
could also help identify slave XA transactions left in the PREPARED state and
roll them back at server startup.
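
To make the renaming idea concrete, here is a rough sketch in Python. It is a toy model only, not server code: the prefix `_repl_xa_`, the class `ApplierXidAllocator` and the function `replication_xid_candidates` are all hypothetical names invented for illustration. Whether every candidate found at startup should really be rolled back is a policy question the sketch does not settle.

```python
import itertools

# Hypothetical prefix marking XIDs allocated by the slave applier,
# so they can never collide with user-chosen XIDs.
REPL_PREFIX = "_repl_xa_"


class ApplierXidAllocator:
    """Toy model: give each (re)try of a master XA transaction a fresh local XID."""

    def __init__(self):
        self._counter = itertools.count(1)
        self._active = {}  # master xid -> slave-local xid currently in use

    def allocate(self, master_xid):
        # A retry after XA ROLLBACK gets a new name, so it does not matter
        # whether the old name is still remembered anywhere.
        local = f"{REPL_PREFIX}{next(self._counter)}_{master_xid}"
        self._active[master_xid] = local
        return local

    def resolve(self, master_xid):
        # Used when the XA COMMIT / XA ROLLBACK for master_xid arrives.
        return self._active.pop(master_xid)


def replication_xid_candidates(prepared_xids):
    """Return the prepared XIDs that were allocated by the applier."""
    return [x for x in prepared_xids if x.startswith(REPL_PREFIX)]
```

The point of the design is only that the local name is decoupled from the master's name, which makes rollback-and-retry safe regardless of how XIDs are tracked.
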
As far as I can see there should be no problem with rolling back and
retrying an XA PREPAREd transaction; in fact, the XA system seems to
guarantee that this is possible.
Second, why binlog XA PREPARE at all? XA is a mechanism to ensure consistent
commit between multiple transactional systems, one of them the original
MariaDB master server. Replication slave servers are not involved in this in
any way. Normal transactions are not replicated until and unless they commit
on the master; why should XA transactions work differently? Maybe this is
the real bug here?
Replicating XA PREPARE leaves a prepared transaction active on the slave,
with all of the complexity that incurs - and it leaves dangling locks on the
slave potentially for a long time, if the user XA COMMIT is delayed for some
reason. It would be much preferable to keep the binlog cache on the master
across XA PREPARE, and binlog it only at XA COMMIT time - just like other
transactions. It doesn't even have to be binlogged as an XA transaction;
a normal transaction is fine.
This does require persisting the binlog cache in order to preserve an XA
PREPAREd transaction across master restart; that can be done in a system
InnoDB table. This is some work but relatively straightforward, and surely
simpler than trying to implement sparse relaylogs on the slave.
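
A minimal sketch of what I mean, as a toy Python model (nothing here is actual server code; `ToyMaster` and its methods are hypothetical, and a plain dict stands in for the system InnoDB table that would hold the persisted binlog caches):

```python
class ToyMaster:
    """Toy model: binlog an XA transaction only at XA COMMIT time."""

    def __init__(self, persisted=None):
        self.binlog = []  # list of event groups, each a list of statements
        # Stand-in for a system InnoDB table: xid -> saved binlog cache.
        self.prepared = dict(persisted or {})
        self.open = {}    # xid -> in-memory binlog cache of a running trx

    def xa_start(self, xid):
        self.open[xid] = []

    def execute(self, xid, stmt):
        self.open[xid].append(stmt)

    def xa_prepare(self, xid):
        # Keep (and persist) the cache; nothing is written to the binlog.
        self.prepared[xid] = self.open.pop(xid)

    def xa_commit(self, xid):
        # Binlogged as an ordinary transaction, not as an XA one.
        stmts = self.prepared.pop(xid)
        self.binlog.append(["BEGIN", *stmts, "COMMIT"])

    def xa_rollback(self, xid):
        # A rolled-back XA transaction never reaches the binlog at all.
        self.prepared.pop(xid)
```

With this model the slave never sees a prepared transaction, so it holds no locks while the user delays XA COMMIT; passing `persisted=` to a new `ToyMaster` imitates a master restart.
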
Third, the XA COMMIT (and XA ROLLBACK) event groups must be marked non-trans
in the binlog (they are currently marked as "trans"). Unlike XA PREPARE,
these cannot be rolled back, and they also cannot be safely applied in
parallel with earlier transactions (in particular, with their own XA PREPARE
event group). This seems clearly a bug (with a trivial fix). When XA COMMIT
and XA ROLLBACK are not marked transactional, parallel replication will wait
for all prior commits to complete before executing them.
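
The effect of the marking can be illustrated with a deliberately simplified Python model. This is not how the server's scheduler is written; `EventGroup` and `must_wait_for` are hypothetical names, and the model ignores in-order commit and every other dependency check. It only encodes the rule stated above: a non-transactional event group waits for all prior commits before it starts.

```python
from dataclasses import dataclass


@dataclass
class EventGroup:
    name: str
    transactional: bool


def must_wait_for(groups):
    """Map each group to the earlier groups that must commit before it starts."""
    waits = {}
    for i, group in enumerate(groups):
        if group.transactional:
            # May start optimistically; it can be rolled back and retried.
            waits[group.name] = []
        else:
            # Cannot be rolled back, so wait for everything before it.
            waits[group.name] = [prior.name for prior in groups[:i]]
    return waits
```

In this model, once XA COMMIT is marked non-trans it can no longer start before its own XA PREPARE event group has committed.
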
The bug description mentions that "it can be purged from relay log after
that". I don't see why this is the case. The inuse_relaylog mechanism exists
to ensure that relay logs are kept as long as needed by parallel replication
retry. I also don't see how relaylogs would become sparse. It doesn't seem
justifiable to introduce all this complexity with sparse relaylogs and
re-fetching from master in the SQL thread just for the sake of a little-used
feature like XA - nor does it seem necessary, given the suggestions above.
Hope this helps,
knielsen, let me reply to some of your questions (I am not yet regularly at the keyboard).
> This shouldn't be possible. The XA PREPARE is ...
> ... then maybe that is the real bug here?
Indeed:
MDEV-28709, MDEV-26682. The latter aimed to circumvent asymmetric locking behaviour in InnoDB. Namely, GAP and InsertIntention (II) locks conflict when II is granted first and GAP is requested last. Combine that with the fact that master and slave can execute the lock requests of two transactions in different orders.
The idea of getting rid of GAP locks, which are useless and harmful for replication, must be the right way to go, but this ticket is rather cautious about implementing that objective.
So if for any reason a prepared XA transaction blocks a later (in binlog order) transaction, we'd move it temporarily out of the way.
As for the reason behind
MDEV-742's replication of XA transactions in the prepared state: that is to address failover. A slave becomes promotable to master at any time without losing the prepared transaction, since the user sees it as prepared.
(I'll respond to other questions a bit later)