[MDEV-15611] Due to the failure of foreign key detection, Galera slave node killed himself.
Created: 2018-03-21  Updated: 2021-05-02  Resolved: 2018-04-19

Status: Closed
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.2.8, 10.2.10, 10.2.12, 10.3.1
Fix Version/s: 10.1.35, 10.2.15

Type: Bug Priority: Critical
Reporter: Devin Yu Assignee: Jan Lindström (Inactive)
Resolution: Fixed Votes: 0
Labels: galera
Environment:

CentOS Linux release 7.3.1611 (Core) 3.10.0-514.26.2.el7.x86_64
CentOS Linux release 7.4.1708 (Core) 3.10.0-693.5.2.el7.x86_64

MariaDB Galera Cluster (3 nodes)
10.2.12-MariaDB-log / 10.2.10-MariaDB-log
wsrep_25.21 / wsrep_25.20


Attachments: galera_mdev_15611.cnf, galera_mdev_15611.test, image-2018-03-21-10-45-36-084.png
Issue Links:
Duplicate
is duplicated by MDEV-15252 DELETE with FKs crashes Galera nodes ... Closed
Relates
relates to MDEV-13246 Stale rows despite ON DELETE CASCADE ... Closed
relates to MDEV-13498 DELETE with CASCADE constraints takes... Closed
relates to MDEV-13678 DELETE with CASCADE takes a long time... Closed
relates to MDEV-14222 Unnecessary 'cascade' memory allocati... Closed
relates to MDEV-18393 Galera node kills after DELETE statem... Closed
relates to MDEV-18032 MDEV-15611 apparently still occuring ... Closed
relates to MDEV-18174 Galera node terminated due to foreign... Closed

 Description   

err.log

[ERROR] Slave SQL: Could not execute Delete_rows_v1 event on table aaa.p; Cannot delete or update a parent row: a foreign key constraint fails (`aaa`.`f`, CONSTRAINT `f_ibfk_1` FOREIGN KEY (`f_id`) REFERENCES `p` (`id`)), Error_code: 1451; handler error HA_ERR_ROW_IS_REFERENCED; the event's master log FIRST, end_log_pos 108, Internal MariaDB error code: 1451
2018-03-16 14:33:39 140457990018816 [Warning] WSREP: RBR event 3 Delete_rows_v1 apply warning: 152, 128
2018-03-16 14:33:39 140457990018816 [Warning] WSREP: Failed to apply app buffer: seqno: 128, status: 1
         at galera/src/trx_handle.cpp:apply():351

Reproduce
wsrep_slave_threads>1

CREATE TABLE `p` (
  `id` int(11) NOT NULL,
  `a` varchar(33) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci ;
 
CREATE TABLE `f` (
  `id` int(11) NOT NULL,
  `f_id` int(11) DEFAULT NULL,
  KEY `f_id` (`f_id`),
  CONSTRAINT `f_ibfk_1` FOREIGN KEY (`f_id`) REFERENCES `p` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci ;
 
insert into p select 1,'aaa';
 
insert into f select 1,1;
 
#Repeatedly execute the following SQL until the table holds thousands of rows (each run doubles the row count)
insert into f select a.id + b.a,a.f_id from f a join (select max(id) as a from f b) b on 1=1;
 
select count(*) from f;
+----------+
| count(*) |
+----------+
|   131072 |
+----------+
 
#Slave node (the failure occurs whether wsrep_slave_fk_checks is ON or OFF)
show variables like 'wsrep_slave_fk_checks';
+-----------------------+-------+
| Variable_name         | Value |
+-----------------------+-------+
| wsrep_slave_fk_checks | ON    |
+-----------------------+-------+
 
#Master (Write) node
delete from f;delete from p;
 
#Slave node goes down with the following errors
2018-03-16 14:33:39 140457990018816 [ERROR] Slave SQL: Could not execute Delete_rows_v1 event on table aaa.p; Cannot delete or update a parent row: a foreign key constraint fails (`aaa`.`f`, CONSTRAINT `f_ibfk_1` FOREIGN KEY (`f_id`) REFERENCES `p` (`id`)), Error_code: 1451; handler error HA_ERR_ROW_IS_REFERENCED; the event's master log FIRST, end_log_pos 108, Internal MariaDB error code: 1451
2018-03-16 14:33:39 140457990018816 [Warning] WSREP: RBR event 3 Delete_rows_v1 apply warning: 152, 128
2018-03-16 14:33:39 140457990018816 [Warning] WSREP: Failed to apply app buffer: seqno: 128, status: 1
         at galera/src/trx_handle.cpp:apply():351
Retrying 2th time
2018-03-16 14:33:39 140457990018816 [ERROR] Slave SQL: Could not execute Delete_rows_v1 event on table aaa.p; Cannot delete or update a parent row: a foreign key constraint fails (`aaa`.`f`, CONSTRAINT `f_ibfk_1` FOREIGN KEY (`f_id`) REFERENCES `p` (`id`)), Error_code: 1451; handler error HA_ERR_ROW_IS_REFERENCED; the event's master log FIRST, end_log_pos 108, Internal MariaDB error code: 1451
2018-03-16 14:33:39 140457990018816 [Warning] WSREP: RBR event 3 Delete_rows_v1 apply warning: 152, 128
2018-03-16 14:33:39 140457990018816 [Warning] WSREP: Failed to apply app buffer: seqno: 128, status: 1
         at galera/src/trx_handle.cpp:apply():351
Retrying 3th time
2018-03-16 14:33:39 140457990018816 [ERROR] Slave SQL: Could not execute Delete_rows_v1 event on table aaa.p; Cannot delete or update a parent row: a foreign key constraint fails (`aaa`.`f`, CONSTRAINT `f_ibfk_1` FOREIGN KEY (`f_id`) REFERENCES `p` (`id`)), Error_code: 1451; handler error HA_ERR_ROW_IS_REFERENCED; the event's master log FIRST, end_log_pos 108, Internal MariaDB error code: 1451
2018-03-16 14:33:39 140457990018816 [Warning] WSREP: RBR event 3 Delete_rows_v1 apply warning: 152, 128
2018-03-16 14:33:39 140457990018816 [Warning] WSREP: Failed to apply app buffer: seqno: 128, status: 1
         at galera/src/trx_handle.cpp:apply():351
Retrying 4th time
2018-03-16 14:33:39 140457990018816 [ERROR] Slave SQL: Could not execute Delete_rows_v1 event on table aaa.p; Cannot delete or update a parent row: a foreign key constraint fails (`aaa`.`f`, CONSTRAINT `f_ibfk_1` FOREIGN KEY (`f_id`) REFERENCES `p` (`id`)), Error_code: 1451; handler error HA_ERR_ROW_IS_REFERENCED; the event's master log FIRST, end_log_pos 108, Internal MariaDB error code: 1451
2018-03-16 14:33:39 140457990018816 [Warning] WSREP: RBR event 3 Delete_rows_v1 apply warning: 152, 128
2018-03-16 14:33:39 140457990018816 [ERROR] WSREP: Failed to apply trx: source: 849e805c-1dd2-11e8-aa79-eb31f2b81368 version: 3 local: 0 state: APPLYING flags: 1 conn_id: 87 trx_id: 12277 seqnos (l: 33, g: 128, s: 127, d: 126, ts: 1358419270394303)
2018-03-16 14:33:39 140457990018816 [ERROR] WSREP: Failed to apply trx 128 4 times
2018-03-16 14:33:39 140457990018816 [ERROR] WSREP: Node consistency compromised, aborting...
 
#Slave processlist
#Write set 127 is "delete from f;"
#Write set 128 is "delete from p;"
#The slave applies the two statements concurrently
[Screenshot of the slave processlist: image-2018-03-21-10-45-36-084.png]

This error does not reproduce on MariaDB 10.2.7 with wsrep_slave_threads>1.



 Comments   
Comment by Sachin Setiya (Inactive) [ 2018-04-02 ]

Thanks for the test case (galera_mdev_15611.cnf, galera_mdev_15611.test); I am looking into it.

Comment by Sachin Setiya (Inactive) [ 2018-04-02 ]

Actually, I am not sure whether this is a bug or not. If we increase max_apply_attempts:

diff --git a/galera/src/replicator_smm.cpp b/galera/src/replicator_smm.cpp
index 48026f5..a45ee16 100644
--- a/galera/src/replicator_smm.cpp
+++ b/galera/src/replicator_smm.cpp
@@ -24,7 +24,7 @@ apply_trx_ws(void*                    recv_ctx,
              const wsrep_trx_meta_t&  meta)
 {
     using galera::TrxHandle;
-    static const size_t max_apply_attempts(4);
+    static const size_t max_apply_attempts(100);
     size_t attempts(1);
 
     do

With this change, the test case passes.
Hi seppo, what do you think?
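The retry behaviour visible in the slave log ("Retrying 2th time" up to "Failed to apply trx 128 4 times", then the abort) can be pictured as a bounded loop. This is a hypothetical C++ model, not the code in galera/src/replicator_smm.cpp: `apply_with_retries` and `parent_delete` are illustrative names, and the parent-table delete (write set 128) is assumed to fail with error 1451 until the concurrently applied child delete (write set 127) has finished.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Runs try_apply(attempt) for attempt = 1..max_apply_attempts.
// Returns the attempt number that succeeded, or 0 if every attempt failed
// (the point where the real node reports "Node consistency compromised").
std::size_t apply_with_retries(const std::function<bool(std::size_t)>& try_apply,
                               std::size_t max_apply_attempts) {
    for (std::size_t attempt = 1; attempt <= max_apply_attempts; ++attempt) {
        if (try_apply(attempt)) return attempt;
    }
    return 0;
}

// Hypothetical model of "delete from p;": it keeps failing until the child
// delete has removed all referencing rows, which here takes
// `child_finishes_after` failed attempts.
std::function<bool(std::size_t)> parent_delete(std::size_t child_finishes_after) {
    return [child_finishes_after](std::size_t attempt) {
        return attempt > child_finishes_after;
    };
}
```

Under this model, raising the limit from 4 to 100 only lets the parent delete outlast a longer-running child delete; it does not create any ordering between write sets 127 and 128, so a slow enough child delete would still exhaust the retries.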

Comment by Sachin Setiya (Inactive) [ 2018-04-03 ]

Actually, we don't need the complicated insert; even a plain insert fails:

--source include/galera_cluster.inc
--source include/have_innodb.inc
 
--connection node_1
CREATE TABLE t1 (
  id int primary key
);
 
CREATE TABLE t2 (
  id int primary key ,
  f_id int DEFAULT NULL, FOREIGN KEY(f_id)  REFERENCES t1 (id)
);
 
insert into t1 select 1;
 
--let $count=200
while($count)
{
  #Insert 200 child rows into t2, each referencing t1.id = 1
  --eval insert into t2 values ($count, 1);
  --dec $count
}
 
select count(*) from t1;
delete from t2;
delete from t1;

If we remove the primary key from t2, the slave node (node 2) crashes; with the primary key, node 2 is aborted. This is not the correct behaviour.

Comment by Sachin Setiya (Inactive) [ 2018-04-03 ]

Comment from seppo

I did some FK experiments with MariaDB 10.2 (latest development HEAD), and it turns out that, for a child table delete, only the child table key is pushed into the write set.
This means that the reference to the parent table row is not present in the write set at all, and therefore parallel applying allows the parent table delete to be processed in parallel with the child table delete.
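Seppo's observation can be illustrated with a deliberately simplified model of write-set keys. In this hypothetical C++ sketch (not Galera's certification code), a write set is reduced to a set of (table, primary key) pairs, and two write sets are treated as safe to apply in parallel only when their key sets are disjoint. Real Galera distinguishes key types and computes a depends-on sequence number rather than a yes/no answer. Tables t1 and t2 are the ones from the test case above.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>

// A write-set key in this model: (table name, primary key value).
using Key = std::pair<std::string, int>;
using KeySet = std::set<Key>;

// Simplified rule: write sets that share no key may be applied in parallel.
bool may_apply_in_parallel(const KeySet& a, const KeySet& b) {
    for (const Key& k : a)
        if (b.count(k) != 0) return false;
    return true;
}

// Keys for "delete from t2;" where rows t2.id = 1..rows all reference t1.id = 1.
// include_parent_ref == false records only child-table keys (the behaviour
// described above); true also records the referenced parent row.
KeySet child_delete_keys(int rows, bool include_parent_ref) {
    KeySet keys;
    for (int id = 1; id <= rows; ++id) keys.emplace("t2", id);
    if (include_parent_ref) keys.emplace("t1", 1);
    return keys;
}

// Keys for "delete from t1;" removing the single parent row t1.id = 1.
KeySet parent_delete_keys() {
    KeySet keys;
    keys.emplace("t1", 1);
    return keys;
}
```

With only child keys, the two deletes look independent and may run concurrently, which is the race in this report; once the parent reference is part of the child write set, the key sets overlap and the applier must order them.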

Comment by Devin Yu [ 2018-04-03 ]

Table t2 has no PK on purpose: it makes "delete from t2;" run slowly, so the slave applies "delete from t2;" and "delete from t1;" concurrently.
In the real environment where we encountered this bug, the table does have a PK.
BTW, the two statements must not be in the same transaction, because parallel applying works at transaction granularity.
Please rerun this test on MariaDB 10.2.7: the bug never reproduces there, whether or not t2 has a PK.

Thank you.

Comment by Sachin Setiya (Inactive) [ 2018-04-03 ]

Hi 920895156@qq.com,

Thanks for pointing out that it does not fail in 10.2.7. I ran git bisect, and 2f342c450755fe7b6c39ec69930d240047c8242d is the first bad commit; I am trying to understand what it does.
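git bisect locates the first bad commit by binary search between a known-good and a known-bad revision (in this report, presumably 10.2.7 as good and a failing later 10.2 build as bad). A minimal C++ sketch of that search, assuming commits are numbered 0..n-1 in history order and that "bad" is monotone; `first_bad` and the predicate are illustrative, and real git bisect walks a commit graph rather than a linear index.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Returns the index of the first bad commit.
// Preconditions: n >= 2, commit 0 is good, commit n-1 is bad, and once a
// commit is bad every later commit is bad too.
std::size_t first_bad(std::size_t n, const std::function<bool(std::size_t)>& is_bad) {
    assert(n >= 2 && !is_bad(0) && is_bad(n - 1));
    std::size_t good = 0, bad = n - 1;  // invariant: !is_bad(good) && is_bad(bad)
    while (bad - good > 1) {
        std::size_t mid = good + (bad - good) / 2;
        if (is_bad(mid)) bad = mid; else good = mid;
    }
    return bad;  // adjacent to a good commit, so it is the first bad one
}
```

Each test build halves the remaining range, so about 1000 candidate commits need roughly 10 builds of the test case.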

Comment by Sachin Setiya (Inactive) [ 2018-04-04 ]

Patch link http://lists.askmonty.org/pipermail/commits/2018-April/012274.html

Comment by Marko Mäkelä [ 2018-04-05 ]

The submitted patch would seem to reintroduce MDEV-13498, because the Galera adjustments would be executed even when Galera is disabled (!wsrep_on_trx(trx)).

The function wsrep_must_process_fk() was introduced in MDEV-13498/MDEV-13246 and subsequently changed in MDEV-13678.

sachin.setiya.007, in which exact code revisions can the failure be repeated? Note that MDEV-14222 (MariaDB 10.2.13) changed some FOREIGN KEY processing by reverting a change that was made in MySQL 5.7.2 and reverted in 5.7.21.

Comment by Marko Mäkelä [ 2018-04-05 ]

I think that the predicate should be rewritten like this:

inline bool wsrep_must_process_fk(const upd_node_t* node, const trx_t* trx)
{
	if (!wsrep_on_trx(trx)) {
		return false;
	}
	return que_node_get_type(node->common.parent) != QUE_NODE_UPDATE
		|| static_cast<upd_node_t*>(node->common.parent)->cascade_node
		!= node;
}

That is, the condition on WSREP_ON is preserved, but the condition on node is negated.
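To make the behaviour of the proposed predicate concrete, here it is unchanged but compiled against minimal stand-in types. The stubs (`que_node_type`, `que_common_t`, `upd_node_t`, `trx_t`, `wsrep_on_trx`, `que_node_get_type`) are hypothetical simplifications for illustration; the real InnoDB definitions differ and carry many more fields.

```cpp
#include <cassert>

// Hypothetical stand-ins for InnoDB query-graph types.
enum que_node_type { QUE_NODE_UPDATE, QUE_NODE_OTHER };

struct upd_node_t;

struct que_common_t {
	que_node_type	type;
	upd_node_t*	parent;		/* simplified: parent is always an upd_node_t here */
};

struct upd_node_t {
	que_common_t	common;
	upd_node_t*	cascade_node;	/* child node used for cascaded FK operations */
};

struct trx_t {
	bool	wsrep;			/* stub for "wsrep is enabled for this trx" */
};

inline bool wsrep_on_trx(const trx_t* trx) { return trx->wsrep; }

inline que_node_type que_node_get_type(const upd_node_t* node)
{
	return node->common.type;
}

/* The predicate as proposed in the comment above, unchanged. */
inline bool wsrep_must_process_fk(const upd_node_t* node, const trx_t* trx)
{
	if (!wsrep_on_trx(trx)) {
		return false;
	}
	return que_node_get_type(node->common.parent) != QUE_NODE_UPDATE
		|| static_cast<upd_node_t*>(node->common.parent)->cascade_node
		!= node;
}
```

With these stubs, the predicate returns false whenever wsrep is off for the transaction, false for the node that its parent update node holds as `cascade_node`, and true for any other node.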

Comment by Sachin Setiya (Inactive) [ 2018-04-08 ]

Hi Marko!

Thanks for the review.
The issue was introduced by 2f342c4 in all versions up to 10.2.

Comment by Sachin Setiya (Inactive) [ 2018-04-08 ]

Hi Marko!

I pushed it to the bb-mdev-15611 branch.

Comment by Guillaume Lefranc [ 2018-05-30 ]

Is there a fix for MariaDB 10.1?

Comment by TAO ZHOU [ 2019-01-09 ]

Is this already fixed? I am running 10.2.19 and still getting the same error, although the node did not kill itself.

2019-01-09 13:51:46 50734833152 [ERROR] Slave SQL: Could not execute Delete_rows_v1 event on table w2live_college.WebSiteLayout; Cannot delete or update a parent row: a foreign key constraint fails (`w2live_college`.`WebNodeType`, CONSTRAINT `WebNodeType_ibfk_1` FOREIGN KEY (`webSiteLayoutId`) REFERENCES `WebSiteLayout` (`id`)), Error_code: 1451; handler error HA_ERR_ROW_IS_REFERENCED; the event's master log FIRST, end_log_pos 196, Internal MariaDB error code: 1451
2019-01-09 13:51:46 50734833152 [Warning] WSREP: RBR event 3 Delete_rows_v1 apply warning: 152, 1398192281
2019-01-09 13:51:46 50734833152 [Warning] WSREP: Failed to apply app buffer: seqno: 1398192281, status: 1
	 at galera/src/trx_handle.cpp:apply():353
Retrying 2th time

Comment by TAO ZHOU [ 2019-01-09 ]

It's still crashing.

Generated at Thu Feb 08 08:22:42 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.