[MDEV-16690] node hang due to conflicting inserts into foreign key child table Created: 2018-07-04  Updated: 2019-01-15  Resolved: 2019-01-15

Status: Closed
Project: MariaDB Server
Component/s: Galera, Storage Engine - InnoDB
Affects Version/s: 10.2.16, 10.3.8
Fix Version/s: 10.3.11, 10.2.19

Type: Bug Priority: Major
Reporter: Seppo Jaakola Assignee: Marko Mäkelä
Resolution: Duplicate Votes: 0
Labels: None

Issue Links:
Relates
relates to MDEV-17541 KILL QUERY during lock wait in FOREIG... Closed
relates to MDEV-18174 Galera node terminated due to foreign... Closed

 Description   

A cluster node may enter an unresolved conflict state when two inserts with the same primary key are made into a table that has a foreign key constraint referencing a parent table. The inserts must be issued on separate cluster nodes, and there must be simultaneous writes (updates or deletes) to the referenced parent row.
As a result, the replication applier thread may end up in an unresolved conflict state, and the error log will fill with messages of the type:

"WSREP: BF lock wait long"

followed by InnoDB monitor outputs



 Comments   
Comment by Seppo Jaakola [ 2018-07-05 ]

Submitted a pull request that includes an mtr test for reproducing this issue with the 10.2 and 10.3 HEAD versions.
The pull request fixes a race condition in row0ins.cc; assigning this for review.

Comment by Seppo Jaakola [ 2018-07-05 ]

Please take a look at the fix in row0ins.cc

This is the earliest point in execution where the overwrite of a hard error code in trx::error_state with the DB_LOCK_WAIT code originates. If 'err' still has the value DB_LOCK_WAIT here, it is returned up through a few levels of the call stack and finally blindly assigned to trx::error_state in row_ins_step() / error_handling:

The fix here is protected with the trx mutex; this may be redundant.
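
To illustrate the overwrite described in this comment, here is a minimal, self-contained C++ sketch. It is not the actual patch and does not use InnoDB's real types: DbErr, Trx, resolve_lock_wait() and assign_blindly() are hypothetical stand-ins, and the choice of a deadlock code as the "hard error" is an assumption made only for the example. The idea shown is that a pending DB_LOCK_WAIT-style return value should not replace a hard error already recorded in the transaction.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical stand-ins for InnoDB error codes (not the real dberr_t).
enum class DbErr { Success, LockWait, Deadlock };

// Hypothetical stand-in for the transaction object (not the real trx_t).
struct Trx {
    std::mutex mutex;                    // plays the role of the trx mutex
    DbErr error_state = DbErr::Success;  // plays the role of trx::error_state
};

// Guarded variant: if the caller is about to propagate LockWait but the
// transaction already carries a hard error, return the hard error instead,
// so that a later assignment cannot downgrade it to LockWait.
inline DbErr resolve_lock_wait(Trx& trx, DbErr err) {
    if (err != DbErr::LockWait) {
        return err;  // nothing to reconcile
    }
    // The lock mirrors the "protected with trx mutex" remark above; the
    // comment itself notes that this protection may be redundant.
    std::lock_guard<std::mutex> guard(trx.mutex);
    if (trx.error_state != DbErr::Success &&
        trx.error_state != DbErr::LockWait) {
        return trx.error_state;  // keep the hard error
    }
    return err;
}

// Unguarded variant: models the blind assignment at the end of the call
// chain, which loses a previously recorded hard error.
inline void assign_blindly(Trx& trx, DbErr err) {
    trx.error_state = err;
}
```

In this sketch, calling assign_blindly(trx, DbErr::LockWait) on a transaction whose error_state is DbErr::Deadlock silently discards the deadlock, whereas passing the value through resolve_lock_wait() first preserves it.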

Comment by Marko Mäkelä [ 2018-07-05 ]

I like the solution, but I think that it can be cleaned up a little.

Comment by Marko Mäkelä [ 2018-07-06 ]

thiru, please check if trx->error_state can be modified by other threads than the one that is executing trx (I think not), and then merge (or cherry-pick) the fix to 10.2.

Comment by Marko Mäkelä [ 2019-01-15 ]

It looks like this has been fixed in MDEV-17541.

Comment by Marko Mäkelä [ 2019-01-15 ]

This issue was fixed as part of MDEV-17541.
