  1. MariaDB Server
  2. MDEV-29410

abort-and-replay prepared XA transactions on the slave

Details

    • Task
    • Status: Open
    • Minor
    • Resolution: Unresolved
    • None
    • Replication, XA
    • None

    Description

      The parallel applier guarantees that transactions are committed in a fixed predefined order (the same as on the master). If trx1 must be committed before trx2, but the parallel applier executes them concurrently and trx2 happens to block trx1, then the applier aborts trx2, allows trx1 to finish, and then re-executes trx2.
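
      The abort-and-replay behaviour described above can be sketched as a toy Python model. This is only an illustration, not MariaDB code: `Trx`, `apply_parallel` and the one-exclusive-lock-per-row model are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trx:
    name: str
    order: int        # commit order, fixed by the master's binlog
    rows: frozenset   # rows this transaction locks (toy model)


def apply_parallel(start_order):
    """Simulate workers that start in `start_order` but must commit in `order`."""
    locks = {}        # row -> Trx currently holding the row lock
    executed = set()  # transactions holding all their locks, awaiting commit
    log = []

    def release(t):
        for r in t.rows:
            if locks.get(r) is t:
                del locks[r]

    def execute(t):
        holders = {locks[r] for r in t.rows if r in locks}
        if any(h.order < t.order for h in holders):
            log.append(("wait", t.name))   # blocked by an earlier trx: wait
            return
        for h in holders:                  # blocked by a later trx: abort it
            release(h)
            executed.discard(h)
            log.append(("abort", h.name))
        for r in t.rows:
            locks[r] = t
        executed.add(t)
        log.append(("execute", t.name))

    for t in start_order:                  # concurrent execution phase
        execute(t)
    for t in sorted(start_order, key=lambda t: t.order):
        if t not in executed:
            execute(t)                     # replay after an abort or a wait
        release(t)
        executed.discard(t)
        log.append(("commit", t.name))
    return log


trx1 = Trx("trx1", 1, frozenset({"row_a"}))
trx2 = Trx("trx2", 2, frozenset({"row_a"}))
log = apply_parallel([trx2, trx1])         # trx2 happens to get the lock first
```

      In this run trx2 is executed, aborted when trx1 needs the same row, and re-executed only after trx1 has committed.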

      This does not work if trx1 is an XA transaction. It becomes persistent on XA PREPARE, so it's XA PREPARE that must happen before trx2, not XA COMMIT. But XA PREPARE doesn't release all locks, so XA PREPARE is not a guarantee that a conflicting trx2 will be able to continue.
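
      A minimal sketch of why a prepared XA transaction can keep blocking trx2. The `ToyEngine` class is hypothetical, and it simplifies the point above by keeping every row lock across XA PREPARE (the real server releases some locks, but not all).

```python
class ToyEngine:
    """Toy lock manager: one exclusive lock per row, keyed by transaction id."""

    def __init__(self):
        self.locks = {}   # row -> id of the transaction holding the lock
        self.state = {}   # transaction id -> "ACTIVE" or "PREPARED"

    def lock(self, trx, row):
        """Try to lock `row` for `trx`; return False if another trx holds it."""
        holder = self.locks.get(row)
        if holder is not None and holder != trx:
            return False
        self.locks[row] = trx
        self.state.setdefault(trx, "ACTIVE")
        return True

    def xa_prepare(self, trx):
        # Simplification: the prepared transaction keeps every row lock.
        self.state[trx] = "PREPARED"

    def finish(self, trx):
        """XA COMMIT or XA ROLLBACK: only now are the locks released."""
        self.locks = {r: t for r, t in self.locks.items() if t != trx}
        self.state.pop(trx, None)


engine = ToyEngine()
engine.lock("trx1", "row_a")
engine.xa_prepare("trx1")
blocked_after_prepare = not engine.lock("trx2", "row_a")   # still blocked
engine.finish("trx1")
granted_after_commit = engine.lock("trx2", "row_a")        # now it proceeds
```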

      How can this be fixed?

          Activity

            First, I don't understand why XA PREPARE cannot be rolled back? The whole
            point of XA PREPARE is to leave the transaction in a state where it can both
            be rolled back and retried. Why not do an XA ROLLBACK in case of conflict
            and then re-try?

            Is it because the global transaction id is persisted also after rollback,
            and a new XA START with the same id will fail? Though that doesn't seem
            possible, as it would require being able to look up all ids forever in the
            server. Even if this is the case, it can be solved by allocating a new
            replication-specific transaction id name in the slave applier (ISTR LOAD
            FILE is handled similarly to avoid file name conflict). This could also help
            identify slave XA transactions left in the PREPARED state and roll them back
            at server startup.

            As far as I can see there should be no problem with rolling back and
            re-trying an XA PREPAREd transaction, in fact the XA system seems to
            guarantee that this is possible.

            Second, why binlog XA PREPARE at all? XA is a mechanism to ensure consistent
            commit between multiple transactional systems, one of them the original
            MariaDB master server. Replication slave servers are not involved in this in
            any way. Normal transactions are not replicated until and unless they commit
            on the master, why should XA transactions work differently? Maybe this is
            the real bug here?

            Replicating XA PREPARE leaves a prepared transaction active on the slave,
            with all of the complexity that incurs - and it leaves dangling locks on the
            slave potentially for a long time, if the user XA COMMIT is delayed for some
            reason. It would be much preferable to keep the binlog cache on the master
            across XA PREPARE, and binlog it only at XA COMMIT time - just like other
            transactions. It doesn't even have to be binlogged as an XA transaction,
            just a normal transaction is fine.

            This does require persisting the binlog cache so that an XA PREPAREd
            transaction survives a master restart; that can be done in a system InnoDB
            table. This is some work but relatively straightforward, and surely simpler
            than trying to implement sparse relaylogs on the slave.
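
            The proposal in the preceding paragraphs could look roughly like the following toy model. Everything here is an assumption for illustration: `ToyMaster` is hypothetical, a plain dict stands in for the system InnoDB table, and JSON stands in for whatever serialization the binlog cache would really use.

```python
import json


class ToyMaster:
    """Keeps the binlog cache across XA PREPARE; binlogs only at XA COMMIT."""

    def __init__(self, durable_store):
        self.binlog = []
        self.durable = durable_store   # stand-in for a system InnoDB table
        # Crash recovery: reload the caches of transactions left PREPARED.
        self.cache = {xid: json.loads(v) for xid, v in self.durable.items()}

    def execute(self, xid, statement):
        self.cache.setdefault(xid, []).append(statement)

    def xa_prepare(self, xid):
        # Persist the cache instead of writing it to the binlog.
        self.durable[xid] = json.dumps(self.cache[xid])

    def xa_commit(self, xid):
        events = self.cache.pop(xid)
        self.durable.pop(xid, None)
        # Binlogged as an ordinary transaction, not as an XA one.
        self.binlog.append(["BEGIN", *events, "COMMIT"])

    def xa_rollback(self, xid):
        self.cache.pop(xid, None)
        self.durable.pop(xid, None)    # nothing ever reaches the binlog


store = {}
master = ToyMaster(store)
master.execute("x1", "INSERT INTO t1 VALUES (1)")
master.xa_prepare("x1")
binlog_after_prepare = list(master.binlog)   # nothing for the slave yet

restarted = ToyMaster(store)                 # simulate a master restart
restarted.xa_commit("x1")
```

            In this model the slave never sees a PREPARED transaction: it receives one ordinary event group at XA COMMIT time.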

            Third, the XA COMMIT (and XA ROLLBACK) event groups must be marked non-trans
            in the binlog (they are currently marked as "trans"). Unlike XA PREPARE,
            these cannot be rolled back, and also they cannot be safely applied in
            parallel with earlier transactions (in case of their own XA PREPARE event
            group). This seems clearly a bug (with trivial fix). When XA COMMIT and XA
            ROLLBACK are not marked transactional, the parallel replication will wait
            for all prior commits to complete before executing them.
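
            The effect of the trans/non-trans marking on scheduling can be sketched as below. This is a hypothetical model of the rule stated above, not the server's scheduler; `EventGroup` and `may_start` are made-up names.

```python
from dataclasses import dataclass


@dataclass
class EventGroup:
    name: str
    transactional: bool   # the "trans" flag discussed above


def may_start(index, groups, committed):
    """Return True if groups[index] may start executing.

    `committed` is the number of leading groups that have already committed.
    """
    if groups[index].transactional:
        return True               # may run optimistically; retried on conflict
    return committed >= index     # must wait for every prior commit


groups = [
    EventGroup("trx1", True),
    EventGroup("XA PREPARE 'x1'", True),
    EventGroup("XA COMMIT 'x1'", False),   # marked non-trans, as proposed
]
```

            With this marking, `may_start(2, groups, 1)` is false until both earlier groups have committed, so XA COMMIT cannot run in parallel with its own XA PREPARE event group.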

            The bug description mentions that "it can be purged from relay log after
            that". I don't see why this is the case? The inuse_relaylog mechanism exists
            to ensure that relay logs are kept as long as needed by parallel replication
            retry. I also don't see how relaylogs would become sparse. It doesn't seem
            justifiable to introduce all this complexity with sparse relaylogs and
            re-fetching from master in SQL thread just for the sake of a little-used
            feature like XA - nor does it seem necessary as per above suggestions?

            Hope this helps,

            • Kristian.
            (Comment by knielsen, Kristian Nielsen)

            Technically, a transaction after XA PREPARE can be rolled back, and should. This MDEV is about doing exactly that.

            But currently an "XA transaction" in the relay log is a sequence of events from XA START to XA PREPARE. This is what the master writes to the binlog: the binlog trx_cache in THD is flushed to the binlog on XA PREPARE. So, while a transaction in the SQL worker thread can be rolled back after XA PREPARE, from the relay log's point of view the transaction is done; the relay log forgets about it and it cannot be re-applied. This is what this MDEV wants to fix: to preserve XA transactions over XA PREPARE up to XA COMMIT or XA ROLLBACK. Somehow.
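
            To illustrate the point about event groups, here is a simplified sketch of how a relay log stream splits into groups. The event strings and `split_event_groups` are assumptions for the example; real binlog events (GTID, XA END, row events) are more detailed.

```python
GROUP_TERMINATORS = ("XA PREPARE", "XA COMMIT", "XA ROLLBACK", "COMMIT")


def split_event_groups(events):
    """Split a flat event stream into event groups (toy model)."""
    groups, current = [], []
    for event in events:
        current.append(event)
        if event.startswith(GROUP_TERMINATORS):
            groups.append(current)   # this event completes the group
            current = []
    return groups


relay_log = [
    "XA START 'x1'",
    "INSERT INTO t1 VALUES (1)",
    "XA PREPARE 'x1'",
    "BEGIN",
    "UPDATE t1 SET a = 2",
    "COMMIT",
    "XA COMMIT 'x1'",
]
groups = split_event_groups(relay_log)
```

            The row changes of 'x1' live only in the first group, which ends at XA PREPARE. The XA COMMIT group carries nothing that could be replayed, which is why the transaction cannot be re-applied once the first group counts as done.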

            "why binlog XA PREPARE at all" — this was MDEV-742, a way to make binlog 2PC capable, so that a binlog would be able to prepare a transaction (make it persistent), and later commit it, or roll it back.

            (Comment by serg, Sergei Golubchik)

            I still think the inuse_relaylog should ensure that the relaylog does not go away too early.

            When the slave worker executes XA PREPARE, this should participate in binlog group commit (it writes to the slave binlog, right?), which includes doing a wait_for_prior_commit().

            Until wait_for_prior_commit() completes, the transaction can be safely rolled back: the XA PREPARE is not yet persisted and the relay log is not yet deleted.

            After wait_for_prior_commit(), there are no earlier commits to conflict with, so the optimistic parallel replication will not need to roll back and retry the XA PREPARE.
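
            The ordering argument can be sketched with Python threads. This is a toy analogue only: `CommitOrder` and `run_workers` are hypothetical, and the method names merely borrow from the server's wait_for_prior_commit().

```python
import threading


class CommitOrder:
    """Each event group may complete only after the previous one has."""

    def __init__(self, count):
        self.committed = [threading.Event() for _ in range(count)]

    def wait_for_prior_commit(self, index):
        if index > 0:
            self.committed[index - 1].wait()

    def mark_committed(self, index):
        self.committed[index].set()


def run_workers(count):
    order = CommitOrder(count)
    commit_log = []

    def worker(index):
        # The event group would be executed here, possibly concurrently.
        # Until the wait below returns, the work can still be rolled back
        # and retried, because nothing has been made persistent yet.
        order.wait_for_prior_commit(index)
        commit_log.append(index)      # XA PREPARE or COMMIT becomes durable
        order.mark_committed(index)

    # Start the workers in reverse order to show that start order is irrelevant.
    threads = [threading.Thread(target=worker, args=(i,))
               for i in reversed(range(count))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return commit_log
```

            Whatever order the workers start in, `run_workers(5)` completes the groups in binlog order, so a group that has passed its wait has no earlier commit left to conflict with.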

            I think if this doesn't work, there is a (simple) bug that should be fixed. Or is there something I'm missing?

            Is there a test case that shows the problem?

            To the second point, the XA PREPARE is written to the binlog for the sake of 2PC persistency, ok. This doesn't explain why it is replicated to the slaves. There seems to be no benefit in having the transaction in XA PREPAREd state on the slave (and a number of disadvantages).

            Save the binlog cache in memory after XA PREPARE on the master. Then at XA COMMIT, write it to the binlog for the slave to replicate (with a normal START TRANSACTION/COMMIT). In case of crash, load it into binlog cache again during crash recovery.

            The XA PREPARE binlog event group can be there, just don't send it to the slave, or send it but ignore it on the slave.

            It seems needlessly complicated to have replicated transactions in XA PREPAREd state on the slave. For example, what happens if the slave is switched to a different master while an XA PREPAREd transaction is in the middle of being replicated?

            Hope this helps,

            • Kristian.
            (Comment by knielsen, Kristian Nielsen)

            Reading the original description again:

            "trx2 is an XA transaction that managed to do XA PREPARE before trx1 is blocked"

            This shouldn't be possible. The XA PREPARE is similar to a commit/XID event: it completes the event group. So it must not complete until all prior transactions have committed (i.e. it must do wait_for_prior_commit() before completing).

            If it does not currently do that, then maybe that is the real bug here?

            If XA PREPARE writes to the binlog (as I would think), there is an optimized code path that does the wait_for_prior_commit() implicitly as part of binlog group commit.

            A less optimal way is to just run wait_for_prior_commit() at the start of XA PREPARE.

            I don't see a reason that the normal wait_for_prior_commit mechanism to ensure correct parallel replication order and rollback/retry from relay log files should not also work for XA PREPARE.

            (Comment by knielsen, Kristian Nielsen)
            Elkin Andrei Elkin added a comment -

            knielsen, let me reply to some of your questions (I am not yet regular at kbd).

            > This shouldn't be possible. The XA PREPARE is ...
            > ... then maybe that is the real bug here?

            Indeed: MDEV-28709, MDEV-26682. The latter aimed to circumvent asymmetric locking behaviour in InnoDB. Namely, GAP and InsertIntention (II) locks conflict when II is granted
            first and GAP is requested last. Combine that with the fact that the master and the slave can execute the lock requests of two transactions in different orders.

            Getting rid of GAP locks that are useless and harmful for replication must be the right way to go, but this ticket is rather cautious about implementing that objective.
            So if for any reason a prepared XA blocks a later (in binlog order) trx, we'd temporarily move it out of the way.

            Also, as to the reason for MDEV-742's replicating the XA in prepared state: that is to address
            failover. The slave becomes promotable to master at any time without losing the prepared trx, as the user sees it prepared.

            (I'll respond to other questions a bit later)


            People

              Elkin Andrei Elkin
              serg Sergei Golubchik
              Votes: 0
              Watchers: 6
