[MDEV-15740] Galera does not recover prepared XA-transactions correctly Created: 2018-03-30 Updated: 2019-01-28 Resolved: 2019-01-28 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Galera |
| Affects Version/s: | 10.1, 10.2.8, 10.3.1 |
| Fix Version/s: | 10.4.2, 10.1.38, 10.2.22, 10.3.13 |
| Type: | Bug | Priority: | Major |
| Reporter: | Vasil (Inactive) | Assignee: | Teemu Ollakka |
| Resolution: | Fixed | Votes: | 2 |
| Labels: | None | ||
| Issue Links: |
|
||||||||||||||||||||||||||||||||
| Description |
|
A stripped down version of galera.galera_gcache_recover:
The problem with that test is that (without the sleep) the contents of the t1 table on node2 goes out of sync with the recorded wsrep xid. Wsrep xid seqno=2 is lost and when it recovers later to 1 it would try to re-apply the changes behind seqno=2 from another node, BUT that would cause a duplicate key error because the row is already in the table t1. The reason for this is that the mini-transaction which writes wsrep xid commits, but the redo log is never flushed up to that LSN before the node is killed. This bug is not present in MySQL because there the log is flushed properly during commit:
In MariaDB this code path is cancelled at trx_commit_complete_for_mysql():
The return; on line 2251 is executed and trx_flush_log_if_needed() is never called. This is because line 2249 evaluates to true - both srv_flush_log_at_trx_commit and trx->active_commit_ordered are 1.
This looks particularly strange because changing innodb_flush_log_at_trx_commit from 1 to 2 "fixes" the bug due to the condition srv_flush_log_at_trx_commit == 1 not being true anymore. But 1 is supposed to be more durable and slow compared to 2. Now, with the current code 2 is more durable! |
| Comments |
| Comment by Marko Mäkelä [ 2018-11-01 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
knielsen, could you please take a look at this? Not having proper durable transactions has been hurting some tests during development too. At least we should have proper durability when binlog or replication are disabled. Related to this, I discussed an idea with Elkin: Actually there should be no need to flush the InnoDB redo log at transaction commit if logical logging is enabled. On recovery, we could retrieve the latest committed logical log position from InnoDB, and then replay the logical log from that point of time. We would only have to wait for InnoDB redo log write before discarding the head of the logical log. For this, we could implement some API call like "flush logs up to the LSN corresponding to this binlog position". | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2018-11-02 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
In 10.2 abcd09c95a49d02bf136a21d62d38a2d03672f26, this seems to be affecting Galera only. I tested with the following:
Also in this case, I am seeing the correct results for the two different transaction isolation levels:
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Teemu Ollakka [ 2019-01-03 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The test galera_gcache_recover fails for the following reason: During database recovery, a transaction with wsrep XID is recovered from InnoDB in prepared state. This transaction gets then committed from code path
However, when the transaction is looked up with trx_get_trx_by_xid() in innobase_commit_by_xid(), trx->xid gets cleared in trx_get_trx_by_xid_low() and commit time serailisation history write does not update the wsrep XID in trx sys header for that recovered trx. As a result the transaction gets committed during recovery but the wsrep position does not get updated appropriately. As a fix, either the trx->xid should be preserved over transaction commit in recovery phase or all wsrep transactions in prepared state should be rolled back during recovery. If the commit approach is taken, it must be made sure that the transactions commit in order of increasing XID seqno. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-01-03 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
teemu.ollakka, your observation is a consequence of what the bug title says. Like Dimov found out, the necessary redo log write is not taking place in Galera. I had an email exchange over this with knielsen, and I was convinced by his explanation that there is no problem outside Galera. Someone (not me) should study how the durability is supposed to work in Galera, and how to fix the logic. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Teemu Ollakka [ 2019-01-04 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
marko thanks for your comments. I'm getting convinced that this is recovery time problem. As knielsen explains here: https://lists.launchpad.net/maria-developers/msg11497.html
In 10.2 all Galera transactions are committed in 2PC. This is because wsrep replication is implemented as handlerton and during commit time both InnoDB and wsrep htons are registered and are in RW mode, therefore 2PC happens (see ha_commit_trans()). Moreover, this process happens one-by-one for each transaction under Galera commit monitor (this is why group commit is not effective for Galera), so I can't see how Galera could break the commit process. I run the test with debug tracing on, and observed that the following happens for the problematic transaction during commit time:
The line
was added by me in trx_commit_complete_for_mysql(), which shows that the function was called with trx->active_commit_ordered set, which indeed verifies that innobase_commit_ordered() was called. At this point during the test galera_gcache_recover the transaction is visible on node_2 and the node is killed
When the wsrep recovery step is run later in the test, the log for recovery process shows:
This matches to Kristian's explanation above that the prepared transaction can be recovered. This transaction gets committed during recovery process, but in innodbase_commit_by_xid()
Call to trx_get_trx_by_xid() clears the trx->xid in trx_get_trx_by_xid_low()
This is why the serialisation history write does not update the wsrep XID in trx sys header during innobase_commit_low(). I made an experiment with the following patch:
which clears the trx->xid only after the transaction has been committed in innodbase_commit_by_xid(). This makes the test pass with TC_LOG_MMAP. Unfortunately with binlog enabled this does not work because binlog does not store full XID and the recovery process fails for other reasons. Does this sound clear or am I still missing something? | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-01-05 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
teemu.ollakka, did I understand correctly that Galera only allows one transaction to commit at a time, and that the second phase of the commit is supposed to invoke trx_sys_update_wsrep_checkpoint()? You seem to say that the issue is that when a recovered prepared transaction is committed as part of the Galera recovery process, wsrep_is_wsrep_xid() would not hold, because trx->xid->null() had been called. Is this correct? | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Jan Lindström (Inactive) [ 2019-01-05 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
marko Yes exactly as in trx_write_serialization_history has a condition
And in that wsrep_is_wsrep_xid is
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-01-05 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Would the following patch work? I did not test whether it even compiles. The purpose of invalidating the trx->xid is to ensure that XA RECOVER is not returning the same XIDs multiple times. With this patch, we would skip the invalidation only for Galera transactions.
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Jan Lindström (Inactive) [ 2019-01-07 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
commit 1d56d875fe28f34cefc1bb4e162331ae9f4ce9bb (HEAD -> 10.1, origin/bb-10.1-galera, origin/10.1, bb-10.1-galera) During database recovery, a transaction with wsrep XID is As a fix, we preserve trx->xid for Galera over transaction Fix authored by: Teemu Ollakka (GaleraCluster) and Marko Mäkelä. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-01-14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The test galera.galera_gcache_recover_manytrx is failing in 10.2:
Also, I think that the bug title is misleading. There is no evidence of missed redo log writes, and no change was made related to that. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Jan Lindström (Inactive) [ 2019-01-24 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
teemu.ollakka Can you have a look this failure on 10.2. |