MariaDB Server / MDEV-32020

XA transaction replicates incorrectly, must be applied at XA COMMIT, not XA PREPARE

Details

    Description

      XA changes done in 10.5 introduced a regression that breaks replication.

      The problem is that the slave now applies XA transactions while replicating
      an XA_prepare_log_event binlogged by "XA PREPARE" on the master. This is
      wrong: the transaction must not be applied on the slave until "XA COMMIT",
      as is done correctly in 10.4.

      Applying the XA PREPARE on the slave leaves dangling InnoDB row locks that
      can conflict with the replication of later transactions and cause
      replication to break. The below test case (also attached) demonstrates one
      simple instance of this.

      Another problem is that splitting a transaction in this way in the binlog
      means there is no longer a unique binlog position corresponding to the
      database state. This is demonstrated by the attached test case
      rpl_xa_provision.test.

      This test case takes a mysqldump while an XA PREPARED transaction is active
      on the master, and uses it to provision a new slave. The new slave's GTID
      position cannot be set correctly. Setting it after the XA PREPARED
      transaction means the XA COMMIT will fail. But setting it before the XA
      PREPARE would also not be correct, as it would duplicate transactions
      binlogged after the XA PREPARE. Thus, in 10.5, the provisioned slave breaks
      its replication.
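      The dilemma can be sketched as follows (the GTIDs and xid below are
      illustrative only, not taken from the attached rpl_xa_provision.test):

```sql
-- On the master; suppose the current GTID position is 0-1-100:
XA START 'x1';
UPDATE t1 SET c = c + 1;      -- changes go to the binlog trx cache
XA END 'x1';
XA PREPARE 'x1';              -- 10.5 binlogs this with its own GTID, e.g. 0-1-101
INSERT INTO t2 VALUES (1);    -- an unrelated transaction, GTID 0-1-102
-- mysqldump is taken here and used to provision a new slave.
XA COMMIT 'x1';               -- GTID 0-1-103
-- Choices for the new slave's GTID position:
--   0-1-102: the replicated XA COMMIT 'x1' fails, no prepared trx exists there
--   0-1-100: GTIDs 0-1-101 and 0-1-102 get applied a second time (duplicates)
```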

      The fix is to revert the change so that XA transactions are applied on the
      slave only as part of the XA COMMIT event. When the XA PREPARE event is
      received by the slave, it must not be applied. Instead it can be saved
      somewhere (there are several possible designs). If the master crashes and
      the slave is promoted to be the new master, those saved XA PREPAREd events
      can then be used to recover the XA transactions into the prepared state for
      the application to XA COMMIT or XA ROLLBACK.

      High-level design description for this fix:

      1. At XA PREPARE, the master will binlog a special Xa_prepared_trx_log_event
      event containing the contents of the binlog trx cache. This is binlogged
      without a GTID. The existing binlog_checkpoint mechanism is used to preserve
      the binlog file while this XA transaction is in the prepared state.

      2. The server keeps a record of pending XA PREPAREs and their positions in
      the binlog, for later reference. This is reconstructed by scanning the
      binlog at server startup/crash recovery. Optionally, this can be optimized
      by saving the information (like the GTID state) at clean shutdown to avoid
      the scan in the non-crash-recovery case.

      3. At XA COMMIT, the corresponding binlog trx cache data is read from the
      Xa_prepared_trx_log_event entry in the binlog, and a full, normal commit
      transaction is binlogged containing the event data.
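      Steps 1 and 3 together imply a binlog layout roughly like the following
      sketch (event names from the design above; GTID values are illustrative):

```
# Hypothetical master binlog contents under this design:
Binlog_checkpoint             # preserves this file while 'x1' stays prepared
Xa_prepared_trx_log_event     # XA PREPARE 'x1': saved binlog trx cache, no GTID
Gtid 0-1-102                  # unrelated transactions continue normally
...
Gtid 0-1-103                  # XA COMMIT 'x1': a full, normal commit transaction
...                           #   (event data read back from the
                              #    Xa_prepared_trx_log_event entry above)
```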

      4. Optionally, we can optimize XA COMMIT to only binlog a placeholder with
      reference back to the Xa_prepared_trx_log_event entry, and use the binlog
      checkpoint mechanism to preserve the binlog file containing the entry for as
      long as needed.

      5. By default, the slave will ignore the Xa_prepared_trx_log_event (the dump
      thread can simply skip sending these events to the slave, or the slave can
      just ignore them), and the XA COMMIT will be replicated as a normal commit.

      6. The user can enable some --replicate-xa option, which will make the
      slave process the Xa_prepared_trx_log_event. This will require
      --log-slave-updates to be enabled on the slave. When the
      Xa_prepared_trx_log_event is processed on the slave, it is simply
      binlogged; no events are applied. When the commit events are received,
      they are applied as a normal transaction, and the pending
      Xa_prepared_trx_log_event in the slave's binlog is released.
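      Under point 6, opting in to XA-aware replication on a slave might look like
      the following configuration sketch (--replicate-xa is a placeholder name
      from the design, not an existing server variable):

```ini
# my.cnf on the slave; hypothetical option name from point 6
[mysqld]
replicate-xa      = ON          # process Xa_prepared_trx_log_event, do not skip it
log-slave-updates = ON          # required: the events must reach the slave's binlog
log-bin           = slave-bin
```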

      7. If the optimization in (4) is implemented, the placeholder commit event
      on the slave will read the event data to be applied from the relay log, or
      from the slave's binlog if the data has been processed and the
      corresponding relay log purged. The dump thread on the master will check,
      for each placeholder commit event, whether the slave has already been sent
      the corresponding Xa_prepared_trx_log_event, and if not will read out the
      full commit event and send it to the slave for normal commit processing.

      8. When the slave is promoted to a new master, any pending XA PREPAREd
      events that have been processed on the slave can be explicitly instantiated
      by the user into the XA prepared, committed, or rolled back state. The
      corresponding XIDs are listed in XA RECOVER on the slave, and can be
      specified in SQL statements XA PREPARE <xid>, XA COMMIT <xid>, XA ROLLBACK
      <xid>. These statements will then read out the pending events and apply them
      as normal. (The exact syntax for this is up for discussion, maybe a separate
      option keyword or different statement should be used for slave promotion).
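      Concretely, promotion under point 8 might look like this sketch (the exact
      syntax is explicitly up for discussion in the design; XIDs are made up):

```sql
-- On the promoted slave:
XA RECOVER;           -- lists XIDs with pending saved Xa_prepared_trx_log_event data
XA PREPARE 'xid1';    -- instantiate the saved events into a real prepared trx
XA COMMIT 'xid2';     -- or apply the saved events and commit in one step
XA ROLLBACK 'xid3';   -- or discard the pending transaction
-- The external transaction coordinator later decides 'xid1' with
-- XA COMMIT 'xid1' or XA ROLLBACK 'xid1', as on any master.
```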

      9. If the locking happens to be different on the slave than on the master
      when promoting pending XA prepared transactions, these might hit locking
      conflicts. However, since they can be run by the user in any order and in
      parallel, promotion can still proceed; once the external transaction
      coordinator issues a commit or rollback decision for one transaction, the
      blocked promotion can then continue. This way, statement-based replication
      and non-primary-key row-based replication can still work for XA promotion
      on the slave.

      A proof-of-concept implementation of this design has been pushed to the
      branch knielsen_mdev32020. This patch is not yet ready for testing, but it
      demonstrates the feasibility of all parts of the design. Once the
      high-level design has been finalized wrt. the precise syntax to use etc.,
      and a decision has been made on which version to implement this in, the
      patch can easily be completed according to the embedded "ToDo" comments.

      --source include/have_innodb.inc
      --source include/have_binlog_format_row.inc
      --source include/master-slave.inc
       
      --connection master
       
      CREATE TABLE t1 (a int, b int, c int,
        INDEX i1(a),
        INDEX i2(b))
        ENGINE=InnoDB;
       
      INSERT INTO t1 VALUES
        (1,1,0), (1,2,0),
        (2,1,0), (2,2,0);
      --sync_slave_with_master
       
      --source include/stop_slave.inc
      SET @old_timeout= @@GLOBAL.innodb_lock_wait_timeout;
      SET @old_retries= @@GLOBAL.slave_transaction_retries;
      SET GLOBAL innodb_lock_wait_timeout= 2;
      SET GLOBAL slave_transaction_retries= 3;
      --source include/start_slave.inc
       
      --connection master
      XA START "t1";
      UPDATE t1 FORCE INDEX (i2) SET c=c+1 WHERE a=1 AND b=1;
      XA END "t1";
      XA PREPARE "t1";
       
      --connection master1
      XA START "t2";
      UPDATE t1 FORCE INDEX (i2) SET c=c+1 WHERE a=1 AND b=2;
      XA END "t2";
      XA PREPARE "t2";
       
      --connection master
      XA COMMIT "t1";
       
      --connection master1
      XA COMMIT "t2";
       
      --connection master
      SELECT * FROM t1 ORDER BY a,b,c;
       
      --sync_slave_with_master
      SELECT * FROM t1 ORDER BY a,b,c;
       
      # Cleanup
      --connection master
      DROP TABLE t1;
       
      --connection slave
      SET GLOBAL innodb_lock_wait_timeout= @old_timeout;
      SET GLOBAL slave_transaction_retries= @old_retries;
       
      --source include/rpl_end.inc
      

      Attachments

        Issue Links

          Activity

            knielsen Kristian Nielsen added a comment - - edited

            > Brandon Nesterenko and Kristian Nielsen, this bug is critical and is blocking MDEV-32014,
            > which is part of the 11.7 preview. Can you please see what can be done to fix it?

            julien.fritsch, that makes no sense. I have been trying to get the XA developers to even acknowledge the problem described in this bug for >2 years without success. And why would MDEV-32020 be blocking MDEV-32014, which seems completely unrelated?

            • Kristian.
            Elkin Andrei Elkin added a comment - - edited

            julien.fritsch, MDEV-32014 does not relate to the current one. This one relates to MDEV-34481, which covers the slave hang part of the MDEV-34466 ticket.


            marko Marko Mäkelä added a comment -

            For the record, I ran the test case in the Description with two options (2×2=4 combinations), but each time it would fail as follows:

            10.11 8a6a4c947a0ca3d2fdca752d7440bdc5c6c83e37

            analyze: sync_with_master
            mysqltest: At line 45: sync_slave_with_master failed: 'select master_pos_wait('master-bin.000001', 1786, 300, '')' returned NULL indicating slave SQL thread failure
            

            I tried

            ./mtr --mysqld=--innodb-snapshot-isolation=ON
            

            and the following patch to revert a work-around of this bug:

            diff --git a/storage/innobase/trx/trx0trx.cc b/storage/innobase/trx/trx0trx.cc
            index 2b304868b14..02c03ccdcae 100644
            --- a/storage/innobase/trx/trx0trx.cc
            +++ b/storage/innobase/trx/trx0trx.cc
            @@ -2037,8 +2037,6 @@ trx_prepare(
             			/* Do not release locks for XA COMMIT ONE PHASE
             			or for internal distributed transactions
             			(XID::get_my_xid() would be nonzero). */
            -		} else {
            -			lock_release_on_prepare(trx);
             		}
             	}
             }
            

            I feel that releasing any locks before a transaction is committed can lead to potential transaction isolation violations and introduce read/write conflicts. I am glad that with MDEV-34690, we stop releasing the locks on the primary server, but we’d still keep doing this on the replicas. I was really hoping that setting innodb_snapshot_isolation=ON would rescue the situation, but based on my experiment above, it does not seem to be the case. I did not debug this in detail; maybe the ER_CHECKREAD handling could be improved?

            As far as I understand, the only argument against changing the way XA PREPARE and XA COMMIT are applied on a replica is the following: when the primary server fails and a replica is promoted to be the new master, the promoted replica would need to apply each XA PREPAREd transaction, and failure scenarios such as the one in the Description would still be possible.

            To my understanding, a failover situation should be very rare. The following regression test files do mention fail-over:

            mysql-test/suite/rpl/t/rpl_circular_for_4_hosts.test
            mysql-test/suite/rpl/t/rpl_gtid_basic.test
            mysql-test/suite/rpl/t/rpl_semi_sync_crash.inc
            mysql-test/suite/rpl/t/rpl_semi_sync_fail_over.test
            

            None of these test files make use of any XA transactions. Hence, such a failover scenario is not being tested and might not work. For proper testing, I think that we would need to involve MaxScale as well.

            I do not quite understand the logic of the argument. I paraphrase it as: "We must not do this, because it would not work in (some rare scenario that may not currently work anyway)." It feels a bit like "perfect is the enemy of good enough." Users of XA transactions in "normal" replication (without any fail-over) would stop having any trouble if we implement the suggestion. From the storage engine point of view, replicas would only execute regular transactions when the replication layer processes a XA COMMIT statement. Storage engines would not see any XA PREPARE; anything that ends in XA ROLLBACK would be filtered out by the replication layer. The lock conflict resolution for non-XA transactions works just fine.


            I forgot to mention some key points from yesterday’s discussion:

            • It was suggested that only binlog_format=row be allowed for XA transactions. (This is not a complete solution, because the test case in the Description is already using that format.)
            • Also innodb_force_primary_key=ON was brought up in the discussions. I think that it would be a reasonable limitation to refuse to replicate XA transactions if the involved tables lack PRIMARY KEY. The test case in the Description would pass if we add PRIMARY KEY(a,b,c).
            • I suggested investigating the impact of enabling innodb_snapshot_isolation=ON on replicas. Would we get any ER_CHECKREAD errors? (They will lead to transaction rollback and retry, just like ER_LOCK_DEADLOCK.)
            Elkin Andrei Elkin added a comment - - edited

            marko, let me reply to the points in your penultimate comment.

            1. Regarding innodb-snapshot-isolation=ON on the Description test, I recommend
            discussing that in MDEV-34481, which I forked out in order to focus on the technical matter of the hang.
            On the MDEV-34481 fixes branch I ran the mtr test with the option and saw it pass, as it also does without.
            To my understanding, --innodb-snapshot-isolation is irrelevant to the fixes, which essentially implement a READ PAST locking mode. That patch must make it perfectly clear that the whole issue is that
            InnoDB X-locks records it was never going to update. For ROW-format replication of tables with only non-unique indexes, therefore, the MDEV-33454/MDEV-34466 policy is insurmountable.

            2. Let me confirm it from my side:
            > the only argument against changing the way how XA PREPARE and XA COMMIT are applied on a replica
            is indeed the failover one.
            As to the rare failover, unfortunately there is no way to ban it altogether, especially the crash-recovery one.
            The rarity may not be a concern for the user, but what is at stake is always at least one transaction loss.
            Let it be an XA-prepared one.
            The logic of the argument is simple: follow the XA specification, at least unless the user willingly succumbs to disgrace, like MDEV-35019.

            3. XA recovery in replication is tested e.g. in rpl suite tests invoking rpl_xa.inc.

            4. Sure, converting XA transactions to normal transactions is a good thing, as we would not have any special branches to implement and maintain; we should just be careful not to throw away the p.2 constraint with the bath water.


            People

              knielsen Kristian Nielsen
              knielsen Kristian Nielsen
              Votes: 3
              Watchers: 17
