Details
- Bug
- Status: Closed
- Major
- Resolution: Fixed
- 10.3.5
- None
Description
The wsrep XID may be read incorrectly from the rollback segment headers if the rollback segment containing the highest trx id was not written by a wsrep thread.
For example, running the following MTR test in the galera test suite demonstrates the problem:
--source include/have_innodb.inc
--source include/galera_cluster.inc

# Initialize table on node_1
CREATE TABLE t1 (f1 INT PRIMARY KEY) ENGINE=InnoDB;
INSERT INTO t1 VALUES (1);

# Go to node_2, verify that the previous INSERT completed.
# Take node_2 out of the cluster, insert and delete a record
# on a table with wsrep_on.
--connection node_2
SELECT * FROM t1;
SET GLOBAL wsrep_cluster_address='';
SET SESSION wsrep_on=0;
INSERT INTO t1 VALUES (2);
DELETE FROM t1 WHERE f1 = 2;

# Shutdown node_2
--source include/shutdown_mysqld.inc

# On node_1, verify that the node has left the cluster.
--connection node_1
--let $wait_condition = SELECT VARIABLE_VALUE = 1 FROM INFORMATION_SCHEMA.GLOBAL_STATUS WHERE VARIABLE_NAME = 'wsrep_cluster_size';
--source include/wait_condition.inc

# Insert into t1 to enforce IST on node_2 when it is restarted.
INSERT INTO t1 VALUES (2);

# Restart node_2
--connection node_2
--source include/start_mysqld.inc

--connection node_1
DROP TABLE t1;
When node_2 is started at the end of the test, the wsrep seqnos stored in the rollback segments look like the following (a zero seqno means an invalid wsrep XID):
rseg_id: 0 trx_id: 40 wsrep seqno: 1
rseg_id: 1 trx_id: 5 wsrep seqno: 0
rseg_id: 2 trx_id: 40 wsrep seqno: 0
rseg_id: 3 trx_id: 42 wsrep seqno: 2
rseg_id: 4 trx_id: 44 wsrep seqno: 0
rseg_id: 5 trx_id: 46 wsrep seqno: 0
rseg_id: 6 trx_id: 15 wsrep seqno: 0
rseg_id: 7 trx_id: 17 wsrep seqno: 0
rseg_id: 8 trx_id: 19 wsrep seqno: 0
rseg_id: 9 trx_id: 0 wsrep seqno: 0
rseg_id: 10 trx_id: 22 wsrep seqno: 0
rseg_id: 11 trx_id: 24 wsrep seqno: 0
rseg_id: 12 trx_id: 26 wsrep seqno: 0
rseg_id: 13 trx_id: 32 wsrep seqno: 0
rseg_id: 14 trx_id: 29 wsrep seqno: 0
rseg_id: 15 trx_id: 31 wsrep seqno: 0
rseg_id: 16 trx_id: 0 wsrep seqno: 0
The rest of the rsegs haven't been written into (have trx_id: 0).
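To make the mismatch in the dump easier to see, the following is a minimal sketch that parses lines in the dump format shown above and locates both the rseg with the highest trx id and the rseg with the highest nonzero wsrep seqno. The `RsegDumpLine` struct and the helper names are hypothetical and exist only for this illustration; they are not InnoDB types or functions.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical record for one line of the dump; not an InnoDB type.
struct RsegDumpLine {
    unsigned long rseg_id;
    unsigned long long trx_id;
    long long wsrep_seqno;  // 0 means "no valid wsrep XID", as in the dump
};

// Parse one line of the form "rseg_id: 3 trx_id: 42 wsrep seqno: 2".
// Returns false if the line does not match that format.
bool parse_dump_line(const std::string& line, RsegDumpLine& out) {
    return std::sscanf(line.c_str(),
                       "rseg_id: %lu trx_id: %llu wsrep seqno: %lld",
                       &out.rseg_id, &out.trx_id, &out.wsrep_seqno) == 3;
}

// Index of the entry with the highest trx_id (first one wins on ties),
// or -1 if the vector is empty.
int index_of_max_trx_id(const std::vector<RsegDumpLine>& v) {
    int best = -1;
    for (std::size_t i = 0; i < v.size(); ++i) {
        if (best < 0 || v[i].trx_id > v[best].trx_id) {
            best = static_cast<int>(i);
        }
    }
    return best;
}

// Index of the entry with the highest nonzero wsrep seqno,
// or -1 if no entry carries a valid seqno.
int index_of_max_seqno(const std::vector<RsegDumpLine>& v) {
    int best = -1;
    for (std::size_t i = 0; i < v.size(); ++i) {
        if (v[i].wsrep_seqno > 0 &&
            (best < 0 || v[i].wsrep_seqno > v[best].wsrep_seqno)) {
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

Applied to the dump above, the first helper selects rseg 5 (trx_id 46, seqno 0), while the second selects rseg 3 (trx_id 42, seqno 2), so the two criteria disagree.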
Now, the function trx_rseg_read_wsrep_checkpoint(XID& xid) reads the XID from the rseg with the highest trx id:
    trx_id_t id = mach_read_from_8(rseg_header
                                   + TRX_RSEG_MAX_TRX_ID);

    if (id < max_id) {
        continue;
    }

    max_id = id;
    found = trx_rseg_read_wsrep_checkpoint(rseg_header, xid)
        || found;
In the example dump above, the highest trx id is in rseg 5, which does not contain a valid wsrep XID. As a result, trx_rseg_read_wsrep_checkpoint(rseg_header, xid) overwrites the previously found XID with zeros, and an all-zero XID is returned from this call. This leads to the following error:
2018-03-01 4:35:43 0 [Note] WSREP: Read WSREPXid from InnoDB: 00000000-0000-0000-0000-000000000000:-1
2018-03-01 4:35:43 0 [Note] WSREP: SST received: 00000000-0000-0000-0000-000000000000:2
2018-03-01 4:35:43 1 [ERROR] WSREP: Application received wrong state:
Received: 00000000-0000-0000-0000-000000000000
Required: 2cba85c0-1cf9-11e8-b0ce-17781a6132b8
2018-03-01 4:35:43 1 [ERROR] WSREP: Application state transfer failed. This is unrecoverable condition, restart required.
Expected result:
node_2 restarts and rejoins the cluster via IST.
This affects only 10.3; the test passes with 10.2.
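The overwrite described in this report can be reproduced in isolation with a small model of the scanning loop. The sketch below uses hypothetical stand-in types (`ModelXid`, `ModelRseg`) rather than the real InnoDB structures, and treats seqno 0 as invalid, following the convention of the dump. `scan_by_max_trx_id` mirrors the quoted loop; `scan_by_max_seqno` shows one possible guard (an assumption for illustration, not necessarily the change that was committed), which reads into a temporary and keeps only the valid XID with the highest seqno.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the InnoDB structures; not the real types.
struct ModelXid {
    long long seqno;  // 0 = invalid, mirroring the dump convention
};

struct ModelRseg {
    std::uint64_t max_trx_id;
    ModelXid stored;
};

// Models the per-rseg read: it always overwrites `xid` and
// returns whether the stored XID is valid.
bool read_one(const ModelRseg& rseg, ModelXid& xid) {
    xid = rseg.stored;
    return xid.seqno > 0;
}

// Models the loop quoted in the description: the rseg with the highest
// trx id wins, even if its stored XID is invalid.
bool scan_by_max_trx_id(const std::vector<ModelRseg>& rsegs, ModelXid& xid) {
    std::uint64_t max_id = 0;
    bool found = false;
    for (const ModelRseg& rseg : rsegs) {
        if (rseg.max_trx_id < max_id) {
            continue;
        }
        max_id = rseg.max_trx_id;
        found = read_one(rseg, xid) || found;  // may clobber a valid xid
    }
    return found;
}

// One possible guard (assumption): read into a temporary and accept it
// only if it is valid and has a higher seqno than anything seen so far.
bool scan_by_max_seqno(const std::vector<ModelRseg>& rsegs, ModelXid& xid) {
    long long max_seqno = 0;
    bool found = false;
    for (const ModelRseg& rseg : rsegs) {
        ModelXid tmp{0};
        if (read_one(rseg, tmp) && tmp.seqno > max_seqno) {
            max_seqno = tmp.seqno;
            xid = tmp;
            found = true;
        }
    }
    return found;
}
```

With the first six rsegs of the dump as input, `scan_by_max_trx_id` reports success but leaves an invalid XID in `xid`, whereas `scan_by_max_seqno` returns the XID with seqno 2.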
Issue Links
- is caused by MDEV-15158 On commit, do not write to the TRX_SYS page (Closed)