Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 10.1.21
Environment: CentOS Linux release 7.3.1611 (Core), Galera 25.3.19(r3667)
Fix Version/s: 10.1.23
Description
We have two tables linked by a foreign key. Operations (insert, select, update, delete) are performed on both tables on many cluster nodes simultaneously.
All nodes then "crash" with the following error message:
2017-03-28 14:47:46 140353592244992 [ERROR] Slave SQL: Could not execute Delete_rows_v1 event on table jfd.pbs; Cannot delete or update a parent row: a foreign key constraint fails (`jfd`.`pbt`, CONSTRAINT `pbt_ibfk_1` FOREIGN KEY (`pbs_uid`) REFERENCES `pbs` (`pbs_uid`)), Error_code: 1451; handler error HA_ERR_ROW_IS_REFERENCED; the event's master log FIRST, end_log_pos 143, Internal MariaDB error code: 1451
2017-03-28 14:47:46 140353592244992 [Warning] WSREP: RBR event 2 Delete_rows_v1 apply warning: 152, 43
2017-03-28 14:47:46 140353592244992 [Warning] WSREP: failed to replay trx: source: 79b129b2-13e6-11e7-be78-1ef9b108083a version: 3 local: 1 state: REPLAYING flags: 1 conn_id: 81 trx_id: 48291 seqnos (l: 50, g: 43, s: 41, d: 42, ts: 21192340744708)
2017-03-28 14:47:46 140353592244992 [Warning] WSREP: Failed to apply trx 43 4 times
2017-03-28 14:47:46 140353592244992 [ERROR] WSREP: trx_replay failed for: 6, schema: (null), query: void
2017-03-28 14:47:46 140353592244992 [ERROR] Aborting
We expect foreign key failures when executing the delete queries, and we expect deadlocks since many nodes modify the same tables simultaneously; what we do not expect is for the nodes to abort.
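For illustration, this is the kind of error we expect a client session to receive locally instead of a node abort (the row value is hypothetical; the message matches the log above):
DELETE FROM pbs WHERE pbs_uid = 1;
-- ERROR 1451 (23000): Cannot delete or update a parent row: a foreign key
-- constraint fails (`jfd`.`pbt`, CONSTRAINT `pbt_ibfk_1` FOREIGN KEY
-- (`pbs_uid`) REFERENCES `pbs` (`pbs_uid`))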
To reproduce this issue we created a two-node cluster with a simulated 50 ms round-trip latency. The following command was executed on each cluster node to add 25 ms of latency:
tc qdisc add dev eth0 root handle 1: netem delay 25ms
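For reference, the delay can be removed again with the matching tc command (assuming the same interface):
tc qdisc del dev eth0 root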
Without the latency we have a hard time reproducing this issue.
We are currently using MariaDB 10.1.21 and Galera 25.3.19(r3667).
We created a Perl script (publicTester.pl, attached) which can reproduce this issue.
The script needs to be edited to set the location of the mysql command-line client as well as the database user name and password.
The database and tables are created with the following command:
./publicTester.pl create
Then on the first node we execute the following command:
./publicTester.pl s
This inserts, queries, and then deletes a single row in the parent table (pbs).
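The exact statements are in the attached script; as a rough sketch (column names from the table definitions below, values hypothetical), each iteration does something like:
INSERT INTO pbs (type, unique_key, priority, scheduled_time) VALUES ('test', 'key-s-1', 1, 0);
SELECT pbs_uid FROM pbs WHERE type = 'test' AND unique_key = 'key-s-1';
DELETE FROM pbs WHERE type = 'test' AND unique_key = 'key-s-1';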
On the second node we execute the following command:
./publicTester.pl t
This inserts, queries, and then deletes a single row in the child table (pbt).
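Again as a hypothetical sketch, a child row has to reference an existing parent row, so an iteration looks roughly like:
-- assuming a parent row with pbs_uid = 1 currently exists:
INSERT INTO pbt (pbs_uid, created_time, handler) VALUES (1, 0, 'h1');
SELECT pbt_uid FROM pbt WHERE pbs_uid = 1;
DELETE FROM pbt WHERE pbs_uid = 1;
Presumably it is the interleaving of these child-table inserts with the parent-table deletes from the other node that the replication applier trips over, as shown by the Delete_rows_v1 failure in the log above.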
Once the processes are started on each node, the failure occurs within five minutes, bringing all cluster nodes down with the same error message.
The script can also be used to clean the tables between attempts:
./publicTester.pl clean
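Note for anyone cleaning up manually instead: the child table has to be emptied before the parent, since remaining pbt rows would trigger the same 1451 error:
DELETE FROM pbt;
DELETE FROM pbs;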
If there is any issue with the Perl script, or if you need more information, just let me know.
Here are the table structure definitions:
CREATE TABLE pbs (
  pbs_uid bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  type varchar(16) COLLATE latin1_bin NOT NULL,
  unique_key varchar(161) COLLATE latin1_bin NOT NULL,
  schema_version varchar(30) COLLATE latin1_bin DEFAULT NULL,
  priority smallint(6) NOT NULL,
  scheduled_time bigint(20) unsigned NOT NULL,
  attempt_nb int(11) NOT NULL DEFAULT 0,
  assigned_to varchar(30) COLLATE latin1_bin DEFAULT NULL,
  assigned_time bigint(20) unsigned DEFAULT NULL,
  PRIMARY KEY (pbs_uid),
  UNIQUE KEY IDX_pbs_03 (type,unique_key),
  KEY IDX_pbs_01 (assigned_to),
  KEY IDX_pbs_02 (priority,scheduled_time,pbs_uid)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=latin1 COLLATE=latin1_bin ROW_FORMAT=COMPRESSED;

CREATE TABLE pbt (
  pbt_uid bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  pbs_uid bigint(20) unsigned NOT NULL,
  created_time bigint(20) unsigned NOT NULL,
  handler varchar(80) COLLATE latin1_bin NOT NULL,
  handler_data varchar(161) COLLATE latin1_bin DEFAULT NULL,
  PRIMARY KEY (pbt_uid),
  KEY pbs_uid (pbs_uid),
  CONSTRAINT pbt_ibfk_1 FOREIGN KEY (pbs_uid) REFERENCES pbs (pbs_uid)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=latin1 COLLATE=latin1_bin ROW_FORMAT=COMPRESSED;