[MDEV-10406] IST does not occur after schema changes in RSU mode with specific delete query Created: 2016-07-20  Updated: 2019-12-12  Resolved: 2019-12-12

Status: Closed
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.1.14, 10.1.16
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Mark Wadham Assignee: Jan Lindström (Inactive)
Resolution: Fixed Votes: 0
Labels: galera


 Description   

If we put a node into RSU mode, temporarily take it out of the cluster by setting the gcomm string to "gcomm://", perform a schema change, and on the nodes that remain in the cluster execute a delete statement in the format:

delete from <table> where <pk> >= <n>;

then when the node rejoins the cluster (by restoring the gcomm string), IST is not performed and the node determines that it has the same state as the other cluster nodes, despite those nodes having had a bunch of rows deleted from a table. If we then try to delete the rows manually on the affected node, mysqld on the other nodes crashes immediately because it can't process the delete transaction for rows that don't exist.

If we do not alter the schema while the node is out of the cluster, the error does not occur and IST is performed correctly. We did not get any issues with insert or update statements, and also deletes in the format of:

delete from <table> where id = <b>;

also did not trigger a problem, so it seems to be a specific issue with:

delete from <table> where id >= <n>;

In our tests the table that had the schema changes was always a different table to the one that had rows deleted. The latter table had around 120 rows, and our delete statement deleted around ~20 of them.

Steps to reproduce:

1. Set up a 3-node cluster with two tables, one that has thousands of rows (table a) and another that has around ~120 (table b).

2. On node 0, enter RSU mode:

SET GLOBAL wsrep_OSU_method='RSU';

3. Disconnect the node from the cluster by resetting the gcomm string:

SET GLOBAL wsrep_cluster_address="gcomm://";

4. On node 0, add a single column to table a with a default value of null to ensure backward-compatibility.

alter table table_a add column foo int(11) unsigned null;

5. On node 1 or node 2, delete ~20 rows from table b using a query with a >= where condition, eg:

delete from table_b where id >= <n>;

6. Rejoin node 0 to the cluster by setting the gcomm string back to the correct gcomm string for the cluster that references all three nodes:

SET GLOBAL wsrep_cluster_address='node1,node2,node3';

EXPECTED RESULT:

The node will rejoin the cluster and perform IST to synchronise the changes. The 20 rows from table_b will be deleted on node 0.

ACTUAL RESULT:

The node joins the cluster without performing IST. All three nodes are deemed to be in sync but the deleted rows remain on node 0.



 Comments   
Comment by Nirbhay Choubey (Inactive) [ 2016-07-20 ]

Some related scenarios :
https://gist.github.com/nirbhayc/5f84cc9a243fd1a58d287b191313f5a6

Comment by Nirbhay Choubey (Inactive) [ 2016-07-20 ]

2. On node 0, enter RSU mode:
SET GLOBAL wsrep_OSU_method='RSU';
...
alter table table_a add column foo int(11) unsigned null;

You should update wsrep_osu_method's SESSION value (instead of GLOBAL), in case you are
performing the subsequent ALTER in the same session.

Comment by Nirbhay Choubey (Inactive) [ 2016-07-20 ]

10.0-galera HEAD shows similar behavior.

Comment by Jan Lindström (Inactive) [ 2019-12-12 ]

Support for 10.0-galera has ended.

Generated at Thu Feb 08 07:41:59 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.