  MariaDB Server / MDEV-37133

Parallel replication hang and stall due to incorrectly handled row lock conflict, missing deadlock kill


Details

    • Type: Bug
    • Status: In Review
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Versions: 11.4.0, 10.6, 10.11, 11.4, 11.8, 10.4 (EOL)
    • Fix Versions: 10.6, 10.11, 11.4
    • Component: Replication
    • Labels: None
    • Related to performance
    • Q4/2025 Server Maintenance

    Description

      I have discovered problems with optimistic parallel replication that seem to have been there forever, and which have surely been experienced by many users.

      The problem is seen as occasional row lock conflicts that are not correctly handled, so the blocking thread is not deadlock-killed and parallel replication hangs/stalls until --innodb-lock-wait-timeout expires (default 50 seconds).

      With the 50 second default timeout, the problem can instead appear as transient replication lag that disappears by itself, which can make it hard to diagnose (but note that 50 seconds of replication lag due to a code error is still a very serious bug). Setting a large timeout will make parallel replication hang completely.

      The only way I currently have to reproduce this is, very sporadically, by running the attached test case in a RelWithDebInfo build on a host with many cores. Running it for a couple of hours usually triggers the condition. Use the following patch to make the server crash on lock wait timeout, or set a large --innodb-lock-wait-timeout so that the test case fails in case of a hang:

      diff --git a/storage/innobase/lock/lock0lock.cc b/storage/innobase/lock/lock0lock.cc
      index 33c8bf3bb16..3dee4cf17bf 100644
      --- a/storage/innobase/lock/lock0lock.cc
      +++ b/storage/innobase/lock/lock0lock.cc
      @@ -2408,6 +2422,7 @@ dberr_t lock_wait(que_thr_t *thr)
       #endif
             else
             {
      +        abort();
               trx->error_state= DB_LOCK_WAIT_TIMEOUT;
               lock_sys.timeouts++;
             }
      

      In 10.4/10.5, the following patch can be used instead:

      diff --git a/storage/innobase/lock/lock0wait.cc b/storage/innobase/lock/lock0wait.cc
      index bd65704d657..f53ffe48ab3 100644
      --- a/storage/innobase/lock/lock0wait.cc
      +++ b/storage/innobase/lock/lock0wait.cc
      @@ -515,6 +515,7 @@ lock_wait_check_and_cancel(
       #ifdef WITH_WSREP
                               if (!wsrep_is_BF_lock_timeout(trx)) {
       #endif /* WITH_WSREP */
      +abort();
                                      lock_cancel_waiting_and_release(trx->lock.wait_lock);
       #ifdef WITH_WSREP
                               }
      

      I managed to track down what happens in this test case, though I do not yet have a complete picture of how to reliably reproduce it, due to the difficulty of triggering the issue for debugging:

      • Some transaction T2 is granted a row lock.
      • Another transaction T1 requests the lock, is blocked by T2, sends a deadlock kill to T2.
      • Meanwhile, another transaction T3 requests the lock and is granted it, due to asymmetries in the locking rules (T2 and T3 block T1, but T2 does not block T3, IIUC). From debugging, T1 has LOCK_INSERT_INTENTION set in its lock mode while T3 does not.
      • Again meanwhile, the page needs reorganising; lock_move_reorganize_page() gets called, which eventually calls into lock_move_granted_locks_to_front(), now putting T3 ahead of T1 in the lock queue for the page.
      • Now the pending deadlock kill of T2 goes through, T2 rolls back.
      • At this point, T1 is not granted the lock, because a conflicting lock is held by T3 earlier in the queue.
      • Thus, at this point, T1 is waiting for T3, but this wait was never reported to the replication layer with thd_rpl_deadlock_check(), so replication deadlocks: T1 is waiting on a row lock held by T3, and T3 is waiting for T1's prior commit.
        The problem is that waits are reported to replication when a transaction enqueues a waiting lock and goes to sleep; this reports a wait against all conflicting locks already in the queue. But when a new lock is granted to a transaction (T3 in this case), no wait is reported for any waiting transactions with a conflicting lock request already in the queue. A simplified sketch of this sequence is shown below.
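
      To make the sequence easier to follow, here is a minimal, self-contained C++ sketch of it. This is a toy model, not InnoDB code: the data structures are invented for illustration and the lock-compatibility rule is deliberately reduced to the asymmetry described above.

      // Toy model (not InnoDB code) of the lock-queue race described above.
      // Transaction numbers double as replication commit order: T1 must commit before T3.
      #include <algorithm>
      #include <cstdio>
      #include <set>
      #include <utility>
      #include <vector>

      struct Lock {
        int trx;               // owning transaction
        bool granted;          // granted vs. waiting
        bool insert_intention; // models the asymmetric LOCK_INSERT_INTENTION mode
      };

      // Deliberately simplified conflict rule mirroring the asymmetry above:
      // an insert-intention request conflicts with a granted ordinary lock,
      // but two ordinary locks are compatible with each other.
      static bool conflicts(const Lock &waiter, const Lock &holder) {
        return waiter.insert_intention && !holder.insert_intention;
      }

      int main() {
        std::vector<Lock> queue;                 // lock queue for one record/page
        std::set<std::pair<int, int>> reported;  // waits reported to the replication layer

        // 1. T2 is granted an ordinary lock.
        queue.push_back({2, true, false});

        // 2. T1 enqueues a waiting insert-intention lock; at *this* point the
        //    existing code reports a wait against every conflicting lock already
        //    in the queue (only T2 here) via thd_rpl_deadlock_check().
        Lock t1{1, false, true};
        for (const Lock &l : queue)
          if (conflicts(t1, l)) reported.insert({1, l.trx});
        queue.push_back(t1);

        // 3. T3 is granted an ordinary lock (compatible with T2's). The bug:
        //    no wait is reported for the already-queued waiter T1 against T3.
        queue.push_back({3, true, false});

        // 4. Page reorganisation moves granted locks ahead of waiters, as
        //    lock_move_granted_locks_to_front() does.
        std::stable_partition(queue.begin(), queue.end(),
                              [](const Lock &l) { return l.granted; });

        // 5. T2 is deadlock-killed and rolls back; its lock is released.
        queue.erase(std::remove_if(queue.begin(), queue.end(),
                                   [](const Lock &l) { return l.trx == 2; }),
                    queue.end());

        // 6. Try to grant T1: it is still blocked, now by T3 sitting ahead of it.
        int blocker = 0;
        for (const Lock &l : queue)
          if (l.granted && conflicts(t1, l)) { blocker = l.trx; break; }

        std::printf("T1 is blocked by T%d\n", blocker);
        std::printf("wait T1->T%d reported to replication: %s\n", blocker,
                    reported.count({1, blocker}) ? "yes" : "no");
        // Prints: blocked by T3, wait never reported. So the replication layer
        // cannot deadlock-kill T3 even though T3 in turn waits for T1's prior
        // commit: both layers wait on each other and replication stalls.
        return 0;
      }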

      I also saw a secondary problem: there is a missing check for kill in lock_wait_suspend_thread() in 10.4/10.5. This can cause a deadlock kill to get lost; the kill is still checked once per second in lock_wait_timeout_thread, so this only leads to a replication stall of less than one second and is even harder to detect. This problem seems to no longer exist in the reworked lock code from 10.6 onwards.
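
      For reference, this secondary problem is the classic lost-wakeup pattern. Below is a minimal sketch, assuming a condition-variable style wait; this is not the actual 10.4/10.5 InnoDB code, only an illustration of why the kill flag must be re-checked under the mutex before suspending.

      // Toy illustration (not the actual 10.4/10.5 code) of the lost-wakeup
      // pattern: if the kill flag is not re-checked under the mutex before
      // suspending, a kill delivered just before the wait is only noticed by
      // the once-per-second timeout thread.
      #include <condition_variable>
      #include <mutex>

      struct WaitSlot {
        std::mutex mtx;
        std::condition_variable cv;
        bool killed = false;   // set by the deadlock-kill path
        bool granted = false;  // set when the lock is finally granted
      };

      // Correct pattern: the predicate is evaluated while holding the mutex, so a
      // kill flagged after the caller decided to wait but before it actually
      // suspended is seen immediately instead of stalling until the periodic check.
      void suspend_until_granted_or_killed(WaitSlot &slot) {
        std::unique_lock<std::mutex> lk(slot.mtx);
        slot.cv.wait(lk, [&] { return slot.granted || slot.killed; });
      }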

      Attached are two RFC patches (different ones for 10.4 and 11.4, as the involved code was rewritten in 10.6). They work by adding wait reports to lock_rec_add_to_queue() when a granted lock is added to a record lock queue for a page. With these patches, the problem is no longer reproducible for me, even after 48 hours of testing.
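
      For reviewers who have not opened the patches yet, here is a rough conceptual sketch of the idea, reusing the toy structures from the sketch above. The helper names are hypothetical; the real patches operate on InnoDB's lock_t queues inside lock_rec_add_to_queue() and report through the thd_rpl_deadlock_check() mechanism.

      #include <cstdio>
      #include <vector>

      struct ToyLock {
        int trx;
        bool granted;
        bool insert_intention;
      };

      // Same deliberately simplified conflict rule as in the earlier sketch.
      static bool toy_conflicts(const ToyLock &waiter, const ToyLock &holder) {
        return waiter.insert_intention && !holder.insert_intention;
      }

      // Hypothetical stand-in for reporting "waiter waits for holder" to the
      // replication layer; in the server this goes through thd_rpl_deadlock_check(),
      // which lets the SQL layer deadlock-kill the transaction that must commit later.
      static void report_wait_to_replication(int waiter_trx, int holder_trx) {
        std::printf("report: T%d waits for T%d\n", waiter_trx, holder_trx);
      }

      // The missing direction of reporting: when a *granted* lock is added to a
      // record's lock queue, also report a wait for every already-queued *waiting*
      // lock that conflicts with it.  (The existing code only reports in the other
      // direction, when a new waiter enqueues behind existing locks.)
      static void add_granted_lock_to_queue(std::vector<ToyLock> &queue,
                                            const ToyLock &granted) {
        for (const ToyLock &l : queue)
          if (!l.granted && toy_conflicts(l, granted))
            report_wait_to_replication(l.trx, granted.trx);
        queue.push_back(granted);
      }

      int main() {
        // MDEV-37133 scenario: T1 already has a waiting insert-intention lock in
        // the queue when T3's ordinary lock is granted, so the wait T1 -> T3 now
        // gets reported instead of being silently dropped.
        std::vector<ToyLock> queue = {{2, true, false}, {1, false, true}};
        add_granted_lock_to_queue(queue, {3, true, false});
        return 0;
      }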

      Attachments

        1. knielsen.test
          2 kB
          Kristian Nielsen
        2. parallel_rpl_hang_10.4.patch
          5 kB
          Kristian Nielsen
        3. parallel_rpl_hang_11.4.patch
          2 kB
          Kristian Nielsen


            People

              Assignee: Andrei Elkin
              Reporter: Kristian Nielsen
              Votes: 0
              Watchers: 4

