Details
-
Bug
-
Status: Closed (View Workflow)
-
Major
-
Resolution: Fixed
-
10.1.14, 10.1.16
Description
If we put a node into RSU mode, temporarily take it out of the cluster by setting the gcomm string to "gcomm://", perform a schema change, and on the nodes that remain in the cluster execute a delete statement in the format:
delete from <table> where <pk> >= <n>;
then when the node rejoins the cluster (by restoring the gcomm string), IST is not performed and the node determines that it has the same state as the other cluster nodes, despite those nodes having had a bunch of rows deleted from a table. If we then try to delete the rows manually on the affected node, mysqld on the other nodes crashes immediately because it can't process the delete transaction for rows that don't exist.
If we do not alter the schema while the node is out of the cluster, the error does not occur and IST is performed correctly. We did not get any issues with insert or update statements, and also deletes in the format of:
delete from <table> where id = <b>;
also did not trigger a problem, so it seems to be a specific issue with:
delete from <table> where id >= <n>;
In our tests the table that had the schema changes was always a different table to the one that had rows deleted. The latter table had around 120 rows, and our delete statement deleted around ~20 of them.
Steps to reproduce:
1. Set up a 3-node cluster with two tables, one that has thousands of rows (table a) and another that has around ~120 (table b).
2. On node 0, enter RSU mode:
SET GLOBAL wsrep_OSU_method='RSU';
3. Disconnect the node from the cluster by resetting the gcomm string:
SET GLOBAL wsrep_cluster_address="gcomm://";
4. On node 0, add a single column to table a with a default value of null to ensure backward-compatibility.
alter table table_a add column foo int(11) unsigned null;
5. On node 1 or node 2, delete ~20 rows from table b using a query with a >= where condition, eg:
delete from table_b where id >= <n>;
6. Rejoin node 0 to the cluster by setting the gcomm string back to the correct gcomm string for the cluster that references all three nodes:
SET GLOBAL wsrep_cluster_address='node1,node2,node3';
EXPECTED RESULT:
The node will rejoin the cluster and perform IST to synchronise the changes. The 20 rows from table_b will be deleted on node 0.
ACTUAL RESULT:
The node joins the cluster without performing IST. All three nodes are deemed to be in sync but the deleted rows remain on node 0.