marko thanks for your comments.
I'm getting convinced that this is recovery time problem. As knielsen explains here: https://lists.launchpad.net/maria-developers/msg11497.html
1. There is 2-phase commit (eg. between InnoDB and binlog). The innodb redo
log is synced to disk in prepare, commit_ordered() is called where
trx->active_commit_ordered is set, and we skip syncing to disk in commit
(because we ca recover from the prepared transaction).
In 10.2 all Galera transactions are committed in 2PC. This is because wsrep replication is implemented as handlerton and during commit time both InnoDB and wsrep htons are registered and are in RW mode, therefore 2PC happens (see ha_commit_trans()). Moreover, this process happens one-by-one for each transaction under Galera commit monitor (this is why group commit is not effective for Galera), so I can't see how Galera could break the commit process.
I run the test with debug tracing on, and observed that the following happens for the problematic transaction during commit time:
T@4 : | | | >innobase_commit_ordered
|
...
|
T@4 : | | | | >innobase_commit_ordered_2
|
...
|
T@4 : | | | | <innobase_commit_ordered_2
|
T@4 : | | | <innobase_commit_ordered
|
...
|
T@4 : | | | | >innobase_commit
|
T@4 : | | | | | enter: commit_trx: 1
|
T@4 : | | | | | trans: ending transaction
|
...
|
T@4 : | | | | | info: readonly: 1
|
T@4 : | | | | | trans: id: 0 must_fush_log_later: 1 active_commit_ordered: 1
|
T@4 : | | | | <innobase_commit
|
---
|
The line
T@4 : | | | | | trans: id: 0 must_fush_log_later: 1 active_commit_ordered: 1
|
was added by me in trx_commit_complete_for_mysql(), which shows that the function was called with trx->active_commit_ordered set, which indeed verifies that innobase_commit_ordered() was called.
At this point during the test galera_gcache_recover the transaction is visible on node_2 and the node is killed
--connection node_2
|
--let $wait_condition = SELECT COUNT(*) > 0 FROM t1;
|
--source include/wait_condition.inc
|
|
SET SESSION wsrep_sync_wait = 0;
|
--source include/kill_galera.inc
|
When the wsrep recovery step is run later in the test, the log for recovery process shows:
2019-01-04 13:35:31 139971066427200 [Note] InnoDB: Starting recovery for XA transactions...
|
2019-01-04 13:35:31 139971066427200 [Note] InnoDB: Transaction 1796 in prepared state after recovery
|
2019-01-04 13:35:31 139971066427200 [Note] InnoDB: Transaction contains changes to 1 rows
|
2019-01-04 13:35:31 139971066427200 [Note] InnoDB: 1 transactions in prepared state after recovery
|
2019-01-04 13:35:31 139971066427200 [Note] Found 1 prepared transaction(s) in InnoDB
|
This matches to Kristian's explanation above that the prepared transaction can be recovered.
This transaction gets committed during recovery process, but in innodbase_commit_by_xid()
if (trx_t* trx = trx_get_trx_by_xid(xid)) {
|
/* use cases are: disconnected xa, slave xa, recovery */
|
innobase_commit_low(trx);
|
ut_ad(trx->mysql_thd == NULL);
|
trx_deregister_from_2pc(trx);
|
ut_ad(!trx->will_lock); /* trx cache requirement */
|
trx_free_for_background(trx);
|
|
return(XA_OK);
|
} else {
|
return(XAER_NOTA);
|
}
|
Call to trx_get_trx_by_xid() clears the trx->xid in trx_get_trx_by_xid_low()
/* Invalidate the XID, so that subsequent calls
|
will not find it. */
|
trx->xid->null();
|
break;
|
This is why the serialisation history write does not update the wsrep XID in trx sys header during innobase_commit_low().
I made an experiment with the following patch:
diff --git a/storage/innobase/handler/ha_innodb.cc b/storage/innobase/handler/ha_innodb.cc
|
index 62924ab..a1a1952 100644
|
--- a/storage/innobase/handler/ha_innodb.cc
|
+++ b/storage/innobase/handler/ha_innodb.cc
|
@@ -17231,8 +17231,10 @@ innobase_commit_by_xid(
|
ut_ad(trx->mysql_thd == NULL);
|
trx_deregister_from_2pc(trx);
|
ut_ad(!trx->will_lock); /* trx cache requirement */
|
+ /* Invalidate the XID, so that subsequent calls
|
+ will not find it. */
|
+ trx->xid->null();
|
trx_free_for_background(trx);
|
-
|
return(XA_OK);
|
} else {
|
return(XAER_NOTA);
|
@@ -17261,6 +17263,9 @@ innobase_rollback_by_xid(
|
int ret = innobase_rollback_trx(trx);
|
trx_deregister_from_2pc(trx);
|
ut_ad(!trx->will_lock);
|
+ /* Invalidate the XID, so that subsequent calls
|
+ will not find it. */
|
+ trx->xid->null();
|
trx_free_for_background(trx);
|
|
return(ret);
|
diff --git a/storage/innobase/trx/trx0trx.cc b/storage/innobase/trx/trx0trx.cc
|
index 93fc1bb..61d7330 100644
|
--- a/storage/innobase/trx/trx0trx.cc
|
+++ b/storage/innobase/trx/trx0trx.cc
|
@@ -2738,9 +2738,6 @@ trx_get_trx_by_xid_low(
|
&& trx_state_eq(trx, TRX_STATE_PREPARED)
|
&& xid->eq((XID*)trx->xid)) {
|
|
- /* Invalidate the XID, so that subsequent calls
|
- will not find it. */
|
- trx->xid->null();
|
break;
|
}
|
}
|
which clears the trx->xid only after the transaction has been committed in innodbase_commit_by_xid(). This makes the test pass with TC_LOG_MMAP. Unfortunately with binlog enabled this does not work because binlog does not store full XID and the recovery process fails for other reasons.
Does this sound clear or am I still missing something?
knielsen, could you please take a look at this? Not having proper durable transactions has been hurting some tests during development too. At least we should have proper durability when binlog or replication are disabled.
Related to this, I discussed an idea with Elkin: Actually there should be no need to flush the InnoDB redo log at transaction commit if logical logging is enabled. On recovery, we could retrieve the latest committed logical log position from InnoDB, and then replay the logical log from that point of time. We would only have to wait for InnoDB redo log write before discarding the head of the logical log. For this, we could implement some API call like "flush logs up to the LSN corresponding to this binlog position".