[MDEV-15740] Galera does not recover prepared XA-transactions correctly - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Major
Resolution: Fixed
Affects Version/s: 10.2.8, 10.3.1, 10.1(EOL)
Fix Version/s: 10.4.2, 10.1.38, 10.2.22, 10.3.13
Component/s: Galera
Labels:
None

Description

A stripped down version of galera.galera_gcache_recover:

mysql-test/suite/galera/t/foo.test (no need for a .result file):

--source include/galera_cluster.inc

--connection node_1

CREATE TABLE t1 (f1 INTEGER PRIMARY KEY) ENGINE=InnoDB;

INSERT INTO t1 VALUES (1);

--connection node_2

--let $wait_condition = SELECT COUNT(*) > 0 FROM t1;

--source include/wait_condition.inc

# If we don't sleep here, the tail command below prints:

#   WSREP: Recovered position: ...:1

# If we do sleep then we recover to seqno=2:

#   WSREP: Recovered position: ...:2

#--sleep 1

--source include/kill_galera.inc

--connection node_2

--let $galera_wsrep_recover_server_id=2

--source suite/galera/include/galera_wsrep_recover.inc

--echo Recovered node2

--exec tail -1 $MYSQL_TMP_DIR/galera_wsrep_recover.log

--echo sleep 1000...

--sleep 1000

The problem with that test is that (without the sleep) the contents of the t1 table on node2 goes out of sync with the recorded wsrep xid. Wsrep xid seqno=2 is lost and when it recovers later to 1 it would try to re-apply the changes behind seqno=2 from another node, BUT that would cause a duplicate key error because the row is already in the table t1.

The reason for this is that the mini-transaction which writes wsrep xid commits, but the redo log is never flushed up to that LSN before the node is killed. This bug is not present in MySQL because there the log is flushed properly during commit:

trx_sys_update_wsrep_checkpoint() writes seqno=2 using mtr=0x7fffdf9b4ff8

mtr_commit(mtr=0x7fffdf9b4ff8) end_lsn=1631956

innobase_commit()

  trx_commit_complete_for_mysql()

    trx_flush_log_if_needed()  // this is innodb_flush_log_at_trx_commit handling

      trx_flush_log_if_needed_low()

        log_write_up_to(lsn=1631956)

In MariaDB this code path is cancelled at trx_commit_complete_for_mysql():

1d0f70c2f894 (Michael Widenius 2243) trx_commit_complete_for_mysql(

1d0f70c2f894 (Michael Widenius 2244) /*==========================*/

068c61978e3a (Michael Widenius 2245)     trx_t*    trx)    /*!< in/out: transaction */

1d0f70c2f894 (Michael Widenius 2246) {

2e814d4702d7 (Jan Lindström    2247)     if (trx->id != 0

2e814d4702d7 (Jan Lindström    2248)         || !trx->must_flush_log_later

36e81a23c567 (Kristian Nielsen 2249)         || (srv_flush_log_at_trx_commit == 1 && trx->active_commit_ordered)) {

1d0f70c2f894 (Michael Widenius 2250)

068c61978e3a (Michael Widenius 2251)         return;

1d0f70c2f894 (Michael Widenius 2252)     }

1d0f70c2f894 (Michael Widenius 2253)

068c61978e3a (Michael Widenius 2254)     trx_flush_log_if_needed(trx->commit_lsn, trx);

The return; on line 2251 is executed and trx_flush_log_if_needed() is never called. This is because line 2249 evaluates to true - both srv_flush_log_at_trx_commit and trx->active_commit_ordered are 1.

commit 36e81a23c56

Parent: 5ae598390aa

Author:     Kristian Nielsen <knielsen@knielsen-hq.org>

AuthorDate: Mon Aug 7 12:38:47 2017 +0200

Commit:     Kristian Nielsen <knielsen@knielsen-hq.org>

CommitDate: Mon Aug 7 18:23:55 2017 +0200

    MDEV-11937: InnoDB flushes redo log too often

    Problem was introduced with the InnoDB 5.7 merge, the code related to

    avoiding extra fsync at the end of commit when binlog is enabled. The

    MariaDB method for this was removed, but the replacement MySQL method

    based on thd_get_durability_property() is not functional in MariaDB.

    This commit reverts the offending parts of the merge and adds a test

    case, to fix the problem for InnoDB. But other storage engines are

    likely to have a similar problem.

...

-           || thd_requested_durability(trx->mysql_thd)

-              == HA_IGNORE_DURABILITY) {

+           || (srv_flush_log_at_trx_commit == 1 && trx->active_commit_ordered)) {

This looks particularly strange because changing innodb_flush_log_at_trx_commit from 1 to 2 "fixes" the bug due to the condition srv_flush_log_at_trx_commit == 1 not being true anymore. But 1 is supposed to be more durable and slow compared to 2. Now, with the current code 2 is more durable!

Attachments

Issue Links

blocks

MDEV-15540 Galera suite MTR tests issuing wsrep_recover fail

Closed

causes

MDEV-16571 backup tests fail with missing data after restore

Open

is caused by

MDEV-11937 InnoDB flushes redo log too often

Closed

relates to

MDEV-18009 Missing redo log flush in innodb.instant_alter_crash

Closed

MDEV-14188 mariabackup.incremental_encrypted failed in buildbot with wrong result

Closed

Activity

Ascending order - Click to sort in descending order

Marko Mäkelä added a comment - 2018-11-01 10:15

knielsen, could you please take a look at this? Not having proper durable transactions has been hurting some tests during development too. At least we should have proper durability when binlog or replication are disabled.

Related to this, I discussed an idea with Elkin: Actually there should be no need to flush the InnoDB redo log at transaction commit if logical logging is enabled. On recovery, we could retrieve the latest committed logical log position from InnoDB, and then replay the logical log from that point of time. We would only have to wait for InnoDB redo log write before discarding the head of the logical log. For this, we could implement some API call like "flush logs up to the LSN corresponding to this binlog position".

Marko Mäkelä added a comment - 2018-11-01 10:15 knielsen , could you please take a look at this? Not having proper durable transactions has been hurting some tests during development too. At least we should have proper durability when binlog or replication are disabled. Related to this, I discussed an idea with Elkin : Actually there should be no need to flush the InnoDB redo log at transaction commit if logical logging is enabled. On recovery, we could retrieve the latest committed logical log position from InnoDB, and then replay the logical log from that point of time. We would only have to wait for InnoDB redo log write before discarding the head of the logical log. For this, we could implement some API call like "flush logs up to the LSN corresponding to this binlog position".

Marko Mäkelä added a comment - 2018-11-02 08:19

In 10.2 abcd09c95a49d02bf136a21d62d38a2d03672f26, this seems to be affecting Galera only. I tested with the following:

diff --git a/mysql-test/suite/innodb/t/innodb_force_recovery.test b/mysql-test/suite/innodb/t/innodb_force_recovery.test

index f9af16f6609..724f422566b 100644

--- a/mysql-test/suite/innodb/t/innodb_force_recovery.test

+++ b/mysql-test/suite/innodb/t/innodb_force_recovery.test

@@ -144,7 +144,7 @@ connect (con1,localhost,root,,);

 create table t3(a int)engine=innodb;

 --echo # Force a redo log flush of the above uncommitted UPDATE

 SET GLOBAL innodb_flush_log_at_trx_commit=1;

-drop table t3;

+insert into t3 set a=0;

 disconnect con1;

 connection default;

@@ -167,5 +167,6 @@ insert into t2 values(9,10);

 --source include/restart_mysqld.inc

 select * from t2;

-drop table t2;

+select * from t3;

+drop table t2,t3;

 show tables;

Also in this case, I am seeing the correct results for the two different transaction isolation levels:

--let $restart_parameters= --innodb-force-recovery=3

--source include/start_mysqld.inc

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

select * from t2;

SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;

select * from t2;

Marko Mäkelä added a comment - 2018-11-02 08:19 In 10.2 abcd09c95a49d02bf136a21d62d38a2d03672f26, this seems to be affecting Galera only. I tested with the following: diff --git a/mysql-test/suite/innodb/t/innodb_force_recovery.test b/mysql-test/suite/innodb/t/innodb_force_recovery.test index f9af16f6609..724f422566b 100644 --- a/mysql-test/suite/innodb/t/innodb_force_recovery.test +++ b/mysql-test/suite/innodb/t/innodb_force_recovery.test @@ -144,7 +144,7 @@ connect (con1,localhost,root,,); create table t3(a int)engine=innodb; --echo # Force a redo log flush of the above uncommitted UPDATE SET GLOBAL innodb_flush_log_at_trx_commit=1; -drop table t3; +insert into t3 set a=0; disconnect con1; connection default; @@ -167,5 +167,6 @@ insert into t2 values(9,10); --source include/restart_mysqld.inc select * from t2; -drop table t2; +select * from t3; +drop table t2,t3; show tables; Also in this case, I am seeing the correct results for the two different transaction isolation levels: --let $restart_parameters= --innodb-force-recovery=3 --source include/start_mysqld.inc SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED ; select * from t2; SET TRANSACTION ISOLATION LEVEL REPEATABLE READ ; select * from t2;

Teemu Ollakka added a comment - 2019-01-03 13:06

The test galera_gcache_recover fails for the following reason: During database recovery, a transaction with wsrep XID is recovered from InnoDB in prepared state. This transaction gets then committed from code path

ha_recover()

  xarecover_handlerton()

    innobase_commit_by_xid()

      innobase_commit_low()

However, when the transaction is looked up with trx_get_trx_by_xid() in innobase_commit_by_xid(), trx->xid gets cleared in trx_get_trx_by_xid_low() and commit time serailisation history write does not update the wsrep XID in trx sys header for that recovered trx. As a result the transaction gets committed during recovery but the wsrep position does not get updated appropriately.

As a fix, either the trx->xid should be preserved over transaction commit in recovery phase or all wsrep transactions in prepared state should be rolled back during recovery. If the commit approach is taken, it must be made sure that the transactions commit in order of increasing XID seqno.

Teemu Ollakka added a comment - 2019-01-03 13:06 The test galera_gcache_recover fails for the following reason: During database recovery, a transaction with wsrep XID is recovered from InnoDB in prepared state. This transaction gets then committed from code path ha_recover() xarecover_handlerton() innobase_commit_by_xid() innobase_commit_low() However, when the transaction is looked up with trx_get_trx_by_xid() in innobase_commit_by_xid() , trx->xid gets cleared in trx_get_trx_by_xid_low() and commit time serailisation history write does not update the wsrep XID in trx sys header for that recovered trx. As a result the transaction gets committed during recovery but the wsrep position does not get updated appropriately. As a fix, either the trx->xid should be preserved over transaction commit in recovery phase or all wsrep transactions in prepared state should be rolled back during recovery. If the commit approach is taken, it must be made sure that the transactions commit in order of increasing XID seqno.

Marko Mäkelä added a comment - 2019-01-03 16:52

teemu.ollakka, your observation is a consequence of what the bug title says.
When there is an important state change that needs to be made durable, InnoDB has to write the redo log buffers to the file at least up to the log sequence number of the mini-transaction commit that implemented the state change. For example, the mini-transaction commit that marked a user transaction committed.

Like Dimov found out, the necessary redo log write is not taking place in Galera. I had an email exchange over this with knielsen, and I was convinced by his explanation that there is no problem outside Galera. Someone (not me) should study how the durability is supposed to work in Galera, and how to fix the logic.

Marko Mäkelä added a comment - 2019-01-03 16:52 teemu.ollakka , your observation is a consequence of what the bug title says. When there is an important state change that needs to be made durable, InnoDB has to write the redo log buffers to the file at least up to the log sequence number of the mini-transaction commit that implemented the state change. For example, the mini-transaction commit that marked a user transaction committed. Like Dimov found out, the necessary redo log write is not taking place in Galera. I had an email exchange over this with knielsen , and I was convinced by his explanation that there is no problem outside Galera. Someone (not me) should study how the durability is supposed to work in Galera, and how to fix the logic.

Teemu Ollakka added a comment - 2019-01-04 13:12 - edited

marko thanks for your comments.

I'm getting convinced that this is recovery time problem. As knielsen explains here: https://lists.launchpad.net/maria-developers/msg11497.html

1. There is 2-phase commit (eg. between InnoDB and binlog). The innodb redo
log is synced to disk in prepare, commit_ordered() is called where
trx->active_commit_ordered is set, and we skip syncing to disk in commit
(because we ca recover from the prepared transaction).

In 10.2 all Galera transactions are committed in 2PC. This is because wsrep replication is implemented as handlerton and during commit time both InnoDB and wsrep htons are registered and are in RW mode, therefore 2PC happens (see ha_commit_trans()). Moreover, this process happens one-by-one for each transaction under Galera commit monitor (this is why group commit is not effective for Galera), so I can't see how Galera could break the commit process.

I run the test with debug tracing on, and observed that the following happens for the problematic transaction during commit time:

T@4    : | | | >innobase_commit_ordered

...

T@4    : | | | | >innobase_commit_ordered_2

...

T@4    : | | | | <innobase_commit_ordered_2

T@4    : | | | <innobase_commit_ordered

...

T@4    : | | | | >innobase_commit

T@4    : | | | | | enter: commit_trx: 1

T@4    : | | | | | trans: ending transaction

...

T@4    : | | | | | info: readonly: 1

T@4    : | | | | | trans: id: 0 must_fush_log_later: 1 active_commit_ordered: 1

T@4    : | | | | <innobase_commit

---

The line

T@4    : | | | | | trans: id: 0 must_fush_log_later: 1 active_commit_ordered: 1

was added by me in trx_commit_complete_for_mysql(), which shows that the function was called with trx->active_commit_ordered set, which indeed verifies that innobase_commit_ordered() was called.

At this point during the test galera_gcache_recover the transaction is visible on node_2 and the node is killed

--connection node_2

--let $wait_condition = SELECT COUNT(*) > 0 FROM t1;

--source include/wait_condition.inc

SET SESSION wsrep_sync_wait = 0;

--source include/kill_galera.inc

When the wsrep recovery step is run later in the test, the log for recovery process shows:

2019-01-04 13:35:31 139971066427200 [Note] InnoDB: Starting recovery for XA transactions...

2019-01-04 13:35:31 139971066427200 [Note] InnoDB: Transaction 1796 in prepared state after recovery

2019-01-04 13:35:31 139971066427200 [Note] InnoDB: Transaction contains changes to 1 rows

2019-01-04 13:35:31 139971066427200 [Note] InnoDB: 1 transactions in prepared state after recovery

2019-01-04 13:35:31 139971066427200 [Note] Found 1 prepared transaction(s) in InnoDB

This matches to Kristian's explanation above that the prepared transaction can be recovered.

This transaction gets committed during recovery process, but in innodbase_commit_by_xid()

if (trx_t* trx = trx_get_trx_by_xid(xid)) {

	/* use cases are: disconnected xa, slave xa, recovery */

	innobase_commit_low(trx);

	ut_ad(trx->mysql_thd == NULL);

	trx_deregister_from_2pc(trx);

	ut_ad(!trx->will_lock);    /* trx cache requirement */

	trx_free_for_background(trx);

	return(XA_OK);

} else {

	return(XAER_NOTA);

Call to trx_get_trx_by_xid() clears the trx->xid in trx_get_trx_by_xid_low()

		/* Invalidate the XID, so that subsequent calls

		will not find it. */

		trx->xid->null();

		break;

This is why the serialisation history write does not update the wsrep XID in trx sys header during innobase_commit_low().

I made an experiment with the following patch:

diff --git a/storage/innobase/handler/ha_innodb.cc b/storage/innobase/handler/ha_innodb.cc

index 62924ab..a1a1952 100644

--- a/storage/innobase/handler/ha_innodb.cc

+++ b/storage/innobase/handler/ha_innodb.cc

@@ -17231,8 +17231,10 @@ innobase_commit_by_xid(

                ut_ad(trx->mysql_thd == NULL);

                trx_deregister_from_2pc(trx);

                ut_ad(!trx->will_lock);    /* trx cache requirement */

+               /* Invalidate the XID, so that subsequent calls

+               will not find it. */

+               trx->xid->null();

                trx_free_for_background(trx);

                return(XA_OK);

        } else {

                return(XAER_NOTA);

@@ -17261,6 +17263,9 @@ innobase_rollback_by_xid(

                int ret = innobase_rollback_trx(trx);

                trx_deregister_from_2pc(trx);

                ut_ad(!trx->will_lock);

+               /* Invalidate the XID, so that subsequent calls

+               will not find it. */

+               trx->xid->null();

                trx_free_for_background(trx);

                return(ret);

diff --git a/storage/innobase/trx/trx0trx.cc b/storage/innobase/trx/trx0trx.cc

index 93fc1bb..61d7330 100644

--- a/storage/innobase/trx/trx0trx.cc

+++ b/storage/innobase/trx/trx0trx.cc

@@ -2738,9 +2738,6 @@ trx_get_trx_by_xid_low(

                    && trx_state_eq(trx, TRX_STATE_PREPARED)

                        && xid->eq((XID*)trx->xid)) {

-                       /* Invalidate the XID, so that subsequent calls

-                       will not find it. */

-                       trx->xid->null();

                        break;

which clears the trx->xid only after the transaction has been committed in innodbase_commit_by_xid(). This makes the test pass with TC_LOG_MMAP. Unfortunately with binlog enabled this does not work because binlog does not store full XID and the recovery process fails for other reasons.

Does this sound clear or am I still missing something?

Teemu Ollakka added a comment - 2019-01-04 13:12 - edited marko thanks for your comments. I'm getting convinced that this is recovery time problem. As knielsen explains here: https://lists.launchpad.net/maria-developers/msg11497.html 1. There is 2-phase commit (eg. between InnoDB and binlog). The innodb redo log is synced to disk in prepare, commit_ordered() is called where trx->active_commit_ordered is set, and we skip syncing to disk in commit (because we ca recover from the prepared transaction). In 10.2 all Galera transactions are committed in 2PC. This is because wsrep replication is implemented as handlerton and during commit time both InnoDB and wsrep htons are registered and are in RW mode, therefore 2PC happens (see ha_commit_trans() ). Moreover, this process happens one-by-one for each transaction under Galera commit monitor (this is why group commit is not effective for Galera), so I can't see how Galera could break the commit process. I run the test with debug tracing on, and observed that the following happens for the problematic transaction during commit time: T@4 : | | | >innobase_commit_ordered ... T@4 : | | | | >innobase_commit_ordered_2 ... T@4 : | | | | <innobase_commit_ordered_2 T@4 : | | | <innobase_commit_ordered ... T@4 : | | | | >innobase_commit T@4 : | | | | | enter: commit_trx: 1 T@4 : | | | | | trans: ending transaction ... T@4 : | | | | | info: readonly: 1 T@4 : | | | | | trans: id: 0 must_fush_log_later: 1 active_commit_ordered: 1 T@4 : | | | | <innobase_commit --- The line T@4 : | | | | | trans: id: 0 must_fush_log_later: 1 active_commit_ordered: 1 was added by me in trx_commit_complete_for_mysql() , which shows that the function was called with trx->active_commit_ordered set, which indeed verifies that innobase_commit_ordered() was called. At this point during the test galera_gcache_recover the transaction is visible on node_2 and the node is killed --connection node_2 --let $wait_condition = SELECT COUNT(*) > 0 FROM t1; --source include/wait_condition.inc SET SESSION wsrep_sync_wait = 0; --source include/kill_galera.inc When the wsrep recovery step is run later in the test, the log for recovery process shows: 2019-01-04 13:35:31 139971066427200 [Note] InnoDB: Starting recovery for XA transactions... 2019-01-04 13:35:31 139971066427200 [Note] InnoDB: Transaction 1796 in prepared state after recovery 2019-01-04 13:35:31 139971066427200 [Note] InnoDB: Transaction contains changes to 1 rows 2019-01-04 13:35:31 139971066427200 [Note] InnoDB: 1 transactions in prepared state after recovery 2019-01-04 13:35:31 139971066427200 [Note] Found 1 prepared transaction(s) in InnoDB This matches to Kristian's explanation above that the prepared transaction can be recovered. This transaction gets committed during recovery process, but in innodbase_commit_by_xid() if (trx_t* trx = trx_get_trx_by_xid(xid)) { /* use cases are: disconnected xa, slave xa, recovery */ innobase_commit_low(trx); ut_ad(trx->mysql_thd == NULL); trx_deregister_from_2pc(trx); ut_ad(!trx->will_lock); /* trx cache requirement */ trx_free_for_background(trx); return(XA_OK); } else { return(XAER_NOTA); } Call to trx_get_trx_by_xid() clears the trx->xid in trx_get_trx_by_xid_low() /* Invalidate the XID, so that subsequent calls will not find it. */ trx->xid->null(); break; This is why the serialisation history write does not update the wsrep XID in trx sys header during innobase_commit_low() . I made an experiment with the following patch: diff --git a/storage/innobase/handler/ha_innodb.cc b/storage/innobase/handler/ha_innodb.cc index 62924ab..a1a1952 100644 --- a/storage/innobase/handler/ha_innodb.cc +++ b/storage/innobase/handler/ha_innodb.cc @@ -17231,8 +17231,10 @@ innobase_commit_by_xid( ut_ad(trx->mysql_thd == NULL); trx_deregister_from_2pc(trx); ut_ad(!trx->will_lock); /* trx cache requirement */ + /* Invalidate the XID, so that subsequent calls + will not find it. */ + trx->xid->null(); trx_free_for_background(trx); - return(XA_OK); } else { return(XAER_NOTA); @@ -17261,6 +17263,9 @@ innobase_rollback_by_xid( int ret = innobase_rollback_trx(trx); trx_deregister_from_2pc(trx); ut_ad(!trx->will_lock); + /* Invalidate the XID, so that subsequent calls + will not find it. */ + trx->xid->null(); trx_free_for_background(trx); return(ret); diff --git a/storage/innobase/trx/trx0trx.cc b/storage/innobase/trx/trx0trx.cc index 93fc1bb..61d7330 100644 --- a/storage/innobase/trx/trx0trx.cc +++ b/storage/innobase/trx/trx0trx.cc @@ -2738,9 +2738,6 @@ trx_get_trx_by_xid_low( && trx_state_eq(trx, TRX_STATE_PREPARED) && xid->eq((XID*)trx->xid)) { - /* Invalidate the XID, so that subsequent calls - will not find it. */ - trx->xid->null(); break; } } which clears the trx->xid only after the transaction has been committed in innodbase_commit_by_xid() . This makes the test pass with TC_LOG_MMAP. Unfortunately with binlog enabled this does not work because binlog does not store full XID and the recovery process fails for other reasons. Does this sound clear or am I still missing something?

Marko Mäkelä added a comment - 2019-01-05 05:27

teemu.ollakka, did I understand correctly that Galera only allows one transaction to commit at a time, and that the second phase of the commit is supposed to invoke trx_sys_update_wsrep_checkpoint()?

You seem to say that the issue is that when a recovered prepared transaction is committed as part of the Galera recovery process, wsrep_is_wsrep_xid() would not hold, because trx->xid->null() had been called. Is this correct?

Marko Mäkelä added a comment - 2019-01-05 05:27 teemu.ollakka , did I understand correctly that Galera only allows one transaction to commit at a time, and that the second phase of the commit is supposed to invoke trx_sys_update_wsrep_checkpoint() ? You seem to say that the issue is that when a recovered prepared transaction is committed as part of the Galera recovery process, wsrep_is_wsrep_xid() would not hold, because trx->xid->null() had been called. Is this correct?

Jan Lindström (Inactive) added a comment - 2019-01-05 07:20

marko Yes exactly as in trx_write_serialization_history has a condition

	trx_sysf_t* sys_header = trx_sysf_get(mtr);

#ifdef WITH_WSREP

	/* Update latest MySQL wsrep XID in trx sys header. */

	if (wsrep_is_wsrep_xid(trx->xid)) {

		trx_sys_update_wsrep_checkpoint(trx->xid, sys_header, mtr);

#endif /* WITH_WSREP */

And in that wsrep_is_wsrep_xid is

int wsrep_is_wsrep_xid(const XID* xid)

  return (xid->formatID      == 1                   &&

          xid->gtrid_length  == WSREP_XID_GTRID_LEN &&

          xid->bqual_length  == 0                   &&

          !memcmp(xid->data, WSREP_XID_PREFIX, WSREP_XID_PREFIX_LEN));

Jan Lindström (Inactive) added a comment - 2019-01-05 07:20 marko Yes exactly as in trx_write_serialization_history has a condition trx_sysf_t* sys_header = trx_sysf_get(mtr); #ifdef WITH_WSREP /* Update latest MySQL wsrep XID in trx sys header. */ if (wsrep_is_wsrep_xid(trx->xid)) { trx_sys_update_wsrep_checkpoint(trx->xid, sys_header, mtr); } #endif /* WITH_WSREP */ And in that wsrep_is_wsrep_xid is int wsrep_is_wsrep_xid(const XID* xid) { return (xid->formatID == 1 && xid->gtrid_length == WSREP_XID_GTRID_LEN && xid->bqual_length == 0 && !memcmp(xid->data, WSREP_XID_PREFIX, WSREP_XID_PREFIX_LEN)); }

Marko Mäkelä added a comment - 2019-01-05 15:00

Would the following patch work? I did not test whether it even compiles.

The purpose of invalidating the trx->xid is to ensure that XA RECOVER is not returning the same XIDs multiple times. With this patch, we would skip the invalidation only for Galera transactions.

diff --git a/storage/innobase/trx/trx0trx.cc b/storage/innobase/trx/trx0trx.cc

index 93fc1bb0ed2..230edd7377e 100644

--- a/storage/innobase/trx/trx0trx.cc

+++ b/storage/innobase/trx/trx0trx.cc

@@ -2737,6 +2737,12 @@ trx_get_trx_by_xid_low(

 		if (trx->is_recovered

 		    && trx_state_eq(trx, TRX_STATE_PREPARED)

 			&& xid->eq((XID*)trx->xid)) {

+#ifdef WITH_WSREP

+			/* The commit of a prepared recovered Galera

+			transaction needs a valid trx->xid for

+			invoking trx_sys_update_wsrep_checkpoint(). */

+			if (wsrep_is_wsrep_xid(trx->xid)) break;

+#endif

 			/* Invalidate the XID, so that subsequent calls

 			will not find it. */

Marko Mäkelä added a comment - 2019-01-05 15:00 Would the following patch work? I did not test whether it even compiles. The purpose of invalidating the trx->xid is to ensure that XA RECOVER is not returning the same XIDs multiple times. With this patch, we would skip the invalidation only for Galera transactions. diff --git a/storage/innobase/trx/trx0trx.cc b/storage/innobase/trx/trx0trx.cc index 93fc1bb0ed2..230edd7377e 100644 --- a/storage/innobase/trx/trx0trx.cc +++ b/storage/innobase/trx/trx0trx.cc @@ -2737,6 +2737,12 @@ trx_get_trx_by_xid_low( if (trx->is_recovered && trx_state_eq(trx, TRX_STATE_PREPARED) && xid->eq((XID*)trx->xid)) { +#ifdef WITH_WSREP + /* The commit of a prepared recovered Galera + transaction needs a valid trx->xid for + invoking trx_sys_update_wsrep_checkpoint(). */ + if (wsrep_is_wsrep_xid(trx->xid)) break; +#endif /* Invalidate the XID, so that subsequent calls will not find it. */

Jan Lindström (Inactive) added a comment - 2019-01-07 17:03

commit 1d56d875fe28f34cefc1bb4e162331ae9f4ce9bb (HEAD -> 10.1, origin/bb-10.1-galera, origin/10.1, bb-10.1-galera)
Author: Jan Lindström <jan.lindstrom@mariadb.com>
Date: Mon Jan 7 12:12:30 2019 +0200

~~MDEV-15740~~: InnoDB does not flush redo log when it shoul

During database recovery, a transaction with wsrep XID is
recovered from InnoDB in prepared state. However, when the
transaction is looked up with trx_get_trx_by_xid() in
innobase_commit_by_xid(), trx->xid gets cleared in
trx_get_trx_by_xid_low() and commit time serialization history
write does not update the wsrep XID in trx sys header for
that recovered trx. As a result the transaction gets
committed during recovery but the wsrep position does not
get updated appropriately.

As a fix, we preserve trx->xid for Galera over transaction
commit in recovery phase.

Fix authored by: Teemu Ollakka (GaleraCluster) and Marko Mäkelä.

Jan Lindström (Inactive) added a comment - 2019-01-07 17:03 commit 1d56d875fe28f34cefc1bb4e162331ae9f4ce9bb (HEAD -> 10.1, origin/bb-10.1-galera, origin/10.1, bb-10.1-galera) Author: Jan Lindström <jan.lindstrom@mariadb.com> Date: Mon Jan 7 12:12:30 2019 +0200 MDEV-15740 : InnoDB does not flush redo log when it shoul During database recovery, a transaction with wsrep XID is recovered from InnoDB in prepared state. However, when the transaction is looked up with trx_get_trx_by_xid() in innobase_commit_by_xid(), trx->xid gets cleared in trx_get_trx_by_xid_low() and commit time serialization history write does not update the wsrep XID in trx sys header for that recovered trx. As a result the transaction gets committed during recovery but the wsrep position does not get updated appropriately. As a fix, we preserve trx->xid for Galera over transaction commit in recovery phase. Fix authored by: Teemu Ollakka (GaleraCluster) and Marko Mäkelä.

Marko Mäkelä added a comment - 2019-01-14 12:26

The test galera.galera_gcache_recover_manytrx is failing in 10.2:

10.2 248dc12e60d5e3ba269fe2a13689bca32790f9f5

2019-01-14 11:19:38 140592221543424 [ERROR] Found 1 prepared transactions! It means that mysqld was not shut down properly last time and critical recovery information (last binlog or tc.log file) was manually deleted after a crash. You have to start mysqld with --tc-heuristic-recover switch to commit or rollback pending transactions.

Also, I think that the bug title is misleading. There is no evidence of missed redo log writes, and no change was made related to that.

Marko Mäkelä added a comment - 2019-01-14 12:26 The test galera.galera_gcache_recover_manytrx is failing in 10.2: 10.2 248dc12e60d5e3ba269fe2a13689bca32790f9f5 2019-01-14 11:19:38 140592221543424 [ERROR] Found 1 prepared transactions! It means that mysqld was not shut down properly last time and critical recovery information (last binlog or tc.log file) was manually deleted after a crash. You have to start mysqld with --tc-heuristic-recover switch to commit or rollback pending transactions. Also, I think that the bug title is misleading. There is no evidence of missed redo log writes, and no change was made related to that.

Jan Lindström (Inactive) added a comment - 2019-01-24 07:47

teemu.ollakka Can you have a look this failure on 10.2.

Jan Lindström (Inactive) added a comment - 2019-01-24 07:47 teemu.ollakka Can you have a look this failure on 10.2.

People

Assignee:: Teemu Ollakka

Reporter:: Vasil (Inactive)

Votes:: 2 Vote for this issue

Watchers:: 11 Start watching this issue

Dates

Created:: 2018-03-30 07:59

Updated:: 2019-01-28 10:02

Resolved:: 2019-01-28 10:02

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server