Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-15740

Galera does not recover prepared XA-transactions correctly

Details

    Description

      A stripped down version of galera.galera_gcache_recover:

      mysql-test/suite/galera/t/foo.test (no need for a .result file):
       
      --source include/galera_cluster.inc
       
      --connection node_1
      CREATE TABLE t1 (f1 INTEGER PRIMARY KEY) ENGINE=InnoDB;
      INSERT INTO t1 VALUES (1);
       
      --connection node_2
      --let $wait_condition = SELECT COUNT(*) > 0 FROM t1;
      --source include/wait_condition.inc
       
      # If we don't sleep here, the tail command below prints:
      #   WSREP: Recovered position: ...:1
      # If we do sleep then we recover to seqno=2:
      #   WSREP: Recovered position: ...:2
      #--sleep 1
       
      --source include/kill_galera.inc
       
      --connection node_2
      --let $galera_wsrep_recover_server_id=2
      --source suite/galera/include/galera_wsrep_recover.inc
       
      --echo Recovered node2
      --exec tail -1 $MYSQL_TMP_DIR/galera_wsrep_recover.log
      --echo sleep 1000...
      --sleep 1000
      

      The problem with that test is that (without the sleep) the contents of the t1 table on node2 goes out of sync with the recorded wsrep xid. Wsrep xid seqno=2 is lost and when it recovers later to 1 it would try to re-apply the changes behind seqno=2 from another node, BUT that would cause a duplicate key error because the row is already in the table t1.

      The reason for this is that the mini-transaction which writes wsrep xid commits, but the redo log is never flushed up to that LSN before the node is killed. This bug is not present in MySQL because there the log is flushed properly during commit:

      trx_sys_update_wsrep_checkpoint() writes seqno=2 using mtr=0x7fffdf9b4ff8
      mtr_commit(mtr=0x7fffdf9b4ff8) end_lsn=1631956
       
      innobase_commit()
        trx_commit_complete_for_mysql()
          trx_flush_log_if_needed()  // this is innodb_flush_log_at_trx_commit handling
            trx_flush_log_if_needed_low()
              log_write_up_to(lsn=1631956)
      

      In MariaDB this code path is cancelled at trx_commit_complete_for_mysql():

      1d0f70c2f894 (Michael Widenius 2243) trx_commit_complete_for_mysql(
      1d0f70c2f894 (Michael Widenius 2244) /*==========================*/
      068c61978e3a (Michael Widenius 2245)     trx_t*    trx)    /*!< in/out: transaction */
      1d0f70c2f894 (Michael Widenius 2246) {
      2e814d4702d7 (Jan Lindström    2247)     if (trx->id != 0
      2e814d4702d7 (Jan Lindström    2248)         || !trx->must_flush_log_later
      36e81a23c567 (Kristian Nielsen 2249)         || (srv_flush_log_at_trx_commit == 1 && trx->active_commit_ordered)) {
      1d0f70c2f894 (Michael Widenius 2250) 
      068c61978e3a (Michael Widenius 2251)         return;
      1d0f70c2f894 (Michael Widenius 2252)     }
      1d0f70c2f894 (Michael Widenius 2253) 
      068c61978e3a (Michael Widenius 2254)     trx_flush_log_if_needed(trx->commit_lsn, trx);
      

      The return; on line 2251 is executed and trx_flush_log_if_needed() is never called. This is because line 2249 evaluates to true - both srv_flush_log_at_trx_commit and trx->active_commit_ordered are 1.

      commit 36e81a23c56
      Parent: 5ae598390aa
      Author:     Kristian Nielsen <knielsen@knielsen-hq.org>
      AuthorDate: Mon Aug 7 12:38:47 2017 +0200
      Commit:     Kristian Nielsen <knielsen@knielsen-hq.org>
      CommitDate: Mon Aug 7 18:23:55 2017 +0200
       
          MDEV-11937: InnoDB flushes redo log too often
       
          Problem was introduced with the InnoDB 5.7 merge, the code related to
          avoiding extra fsync at the end of commit when binlog is enabled. The
          MariaDB method for this was removed, but the replacement MySQL method
          based on thd_get_durability_property() is not functional in MariaDB.
       
          This commit reverts the offending parts of the merge and adds a test
          case, to fix the problem for InnoDB. But other storage engines are
          likely to have a similar problem.
       
      ...
       
      -           || thd_requested_durability(trx->mysql_thd)
      -              == HA_IGNORE_DURABILITY) {
      +           || (srv_flush_log_at_trx_commit == 1 && trx->active_commit_ordered)) {
      

      This looks particularly strange because changing innodb_flush_log_at_trx_commit from 1 to 2 "fixes" the bug due to the condition srv_flush_log_at_trx_commit == 1 not being true anymore. But 1 is supposed to be more durable and slow compared to 2. Now, with the current code 2 is more durable!

      Attachments

        Issue Links

          Activity

            marko Yes exactly as in trx_write_serialization_history has a condition

            	trx_sysf_t* sys_header = trx_sysf_get(mtr);
            #ifdef WITH_WSREP
            	/* Update latest MySQL wsrep XID in trx sys header. */
            	if (wsrep_is_wsrep_xid(trx->xid)) {
            		trx_sys_update_wsrep_checkpoint(trx->xid, sys_header, mtr);
            	}
            #endif /* WITH_WSREP */
            

            And in that wsrep_is_wsrep_xid is

            int wsrep_is_wsrep_xid(const XID* xid)
            {
              return (xid->formatID      == 1                   &&
                      xid->gtrid_length  == WSREP_XID_GTRID_LEN &&
                      xid->bqual_length  == 0                   &&
                      !memcmp(xid->data, WSREP_XID_PREFIX, WSREP_XID_PREFIX_LEN));
            }
            

            jplindst Jan Lindström (Inactive) added a comment - marko Yes exactly as in trx_write_serialization_history has a condition trx_sysf_t* sys_header = trx_sysf_get(mtr); #ifdef WITH_WSREP /* Update latest MySQL wsrep XID in trx sys header. */ if (wsrep_is_wsrep_xid(trx->xid)) { trx_sys_update_wsrep_checkpoint(trx->xid, sys_header, mtr); } #endif /* WITH_WSREP */ And in that wsrep_is_wsrep_xid is int wsrep_is_wsrep_xid(const XID* xid) { return (xid->formatID == 1 && xid->gtrid_length == WSREP_XID_GTRID_LEN && xid->bqual_length == 0 && !memcmp(xid->data, WSREP_XID_PREFIX, WSREP_XID_PREFIX_LEN)); }

            Would the following patch work? I did not test whether it even compiles.

            The purpose of invalidating the trx->xid is to ensure that XA RECOVER is not returning the same XIDs multiple times. With this patch, we would skip the invalidation only for Galera transactions.

            diff --git a/storage/innobase/trx/trx0trx.cc b/storage/innobase/trx/trx0trx.cc
            index 93fc1bb0ed2..230edd7377e 100644
            --- a/storage/innobase/trx/trx0trx.cc
            +++ b/storage/innobase/trx/trx0trx.cc
            @@ -2737,6 +2737,12 @@ trx_get_trx_by_xid_low(
             		if (trx->is_recovered
             		    && trx_state_eq(trx, TRX_STATE_PREPARED)
             			&& xid->eq((XID*)trx->xid)) {
            +#ifdef WITH_WSREP
            +			/* The commit of a prepared recovered Galera
            +			transaction needs a valid trx->xid for
            +			invoking trx_sys_update_wsrep_checkpoint(). */
            +			if (wsrep_is_wsrep_xid(trx->xid)) break;
            +#endif
             
             			/* Invalidate the XID, so that subsequent calls
             			will not find it. */
            

            marko Marko Mäkelä added a comment - Would the following patch work? I did not test whether it even compiles. The purpose of invalidating the trx->xid is to ensure that XA RECOVER is not returning the same XIDs multiple times. With this patch, we would skip the invalidation only for Galera transactions. diff --git a/storage/innobase/trx/trx0trx.cc b/storage/innobase/trx/trx0trx.cc index 93fc1bb0ed2..230edd7377e 100644 --- a/storage/innobase/trx/trx0trx.cc +++ b/storage/innobase/trx/trx0trx.cc @@ -2737,6 +2737,12 @@ trx_get_trx_by_xid_low( if (trx->is_recovered && trx_state_eq(trx, TRX_STATE_PREPARED) && xid->eq((XID*)trx->xid)) { +#ifdef WITH_WSREP + /* The commit of a prepared recovered Galera + transaction needs a valid trx->xid for + invoking trx_sys_update_wsrep_checkpoint(). */ + if (wsrep_is_wsrep_xid(trx->xid)) break; +#endif /* Invalidate the XID, so that subsequent calls will not find it. */

            commit 1d56d875fe28f34cefc1bb4e162331ae9f4ce9bb (HEAD -> 10.1, origin/bb-10.1-galera, origin/10.1, bb-10.1-galera)
            Author: Jan Lindström <jan.lindstrom@mariadb.com>
            Date: Mon Jan 7 12:12:30 2019 +0200

            MDEV-15740: InnoDB does not flush redo log when it shoul

            During database recovery, a transaction with wsrep XID is
            recovered from InnoDB in prepared state. However, when the
            transaction is looked up with trx_get_trx_by_xid() in
            innobase_commit_by_xid(), trx->xid gets cleared in
            trx_get_trx_by_xid_low() and commit time serialization history
            write does not update the wsrep XID in trx sys header for
            that recovered trx. As a result the transaction gets
            committed during recovery but the wsrep position does not
            get updated appropriately.

            As a fix, we preserve trx->xid for Galera over transaction
            commit in recovery phase.

            Fix authored by: Teemu Ollakka (GaleraCluster) and Marko Mäkelä.

            jplindst Jan Lindström (Inactive) added a comment - commit 1d56d875fe28f34cefc1bb4e162331ae9f4ce9bb (HEAD -> 10.1, origin/bb-10.1-galera, origin/10.1, bb-10.1-galera) Author: Jan Lindström <jan.lindstrom@mariadb.com> Date: Mon Jan 7 12:12:30 2019 +0200 MDEV-15740 : InnoDB does not flush redo log when it shoul During database recovery, a transaction with wsrep XID is recovered from InnoDB in prepared state. However, when the transaction is looked up with trx_get_trx_by_xid() in innobase_commit_by_xid(), trx->xid gets cleared in trx_get_trx_by_xid_low() and commit time serialization history write does not update the wsrep XID in trx sys header for that recovered trx. As a result the transaction gets committed during recovery but the wsrep position does not get updated appropriately. As a fix, we preserve trx->xid for Galera over transaction commit in recovery phase. Fix authored by: Teemu Ollakka (GaleraCluster) and Marko Mäkelä.

            The test galera.galera_gcache_recover_manytrx is failing in 10.2:

            10.2 248dc12e60d5e3ba269fe2a13689bca32790f9f5

            2019-01-14 11:19:38 140592221543424 [ERROR] Found 1 prepared transactions! It means that mysqld was not shut down properly last time and critical recovery information (last binlog or tc.log file) was manually deleted after a crash. You have to start mysqld with --tc-heuristic-recover switch to commit or rollback pending transactions.
            

            Also, I think that the bug title is misleading. There is no evidence of missed redo log writes, and no change was made related to that.

            marko Marko Mäkelä added a comment - The test galera.galera_gcache_recover_manytrx is failing in 10.2: 10.2 248dc12e60d5e3ba269fe2a13689bca32790f9f5 2019-01-14 11:19:38 140592221543424 [ERROR] Found 1 prepared transactions! It means that mysqld was not shut down properly last time and critical recovery information (last binlog or tc.log file) was manually deleted after a crash. You have to start mysqld with --tc-heuristic-recover switch to commit or rollback pending transactions. Also, I think that the bug title is misleading. There is no evidence of missed redo log writes, and no change was made related to that.

            teemu.ollakka Can you have a look this failure on 10.2.

            jplindst Jan Lindström (Inactive) added a comment - teemu.ollakka Can you have a look this failure on 10.2.

            People

              teemu.ollakka Teemu Ollakka
              Dimov Vasil (Inactive)
              Votes:
              2 Vote for this issue
              Watchers:
              11 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.