[MDEV-32530] Race condition in lock_wait_rpl_report()

Details

Type: Bug
Status: Closed
Priority: Blocker
Resolution: Fixed
Affects Version/s: N/A
Fix Version/s: 10.6.16, 10.10.7, 10.11.6, 11.0.4, 11.1.3, 11.2.2
Component/s: Locking, Storage Engine - InnoDB
Labels:
Environment: GNU/Linux, rr 5.6.0

Description

MDEV-32096 introduced an incorrect optimization to lock_wait_rpl_report(), which manifested itself as the following debug assertion failure:

lock/lock0lock.cc:2068: dberr_t lock_wait(que_thr_t*): Assertion `!wait_lock == !trx->lock.wait_lock' failed.

In the rr replay trace, the problem occurs when we are able to acquire the exclusive lock_sys.latch without waiting. The following patch should fix this:

diff --git a/storage/innobase/lock/lock0lock.cc b/storage/innobase/lock/lock0lock.cc
index 31e02d2451a..df51ceb16d8 100644
--- a/storage/innobase/lock/lock0lock.cc
+++ b/storage/innobase/lock/lock0lock.cc
@@ -1812,8 +1812,14 @@ static lock_t *lock_wait_rpl_report(trx_t *trx)
   else if (!wait_lock->is_waiting())
-    wait_lock= nullptr;
-    goto func_exit;
+    wait_lock= trx->lock.wait_lock;
+    if (!wait_lock)
+      goto func_exit;
+    if (!wait_lock->is_waiting())
+    {
+      wait_lock= nullptr;
+      goto func_exit;
+    }
   if (wait_lock->is_table())

While this function was about to enter lock_sys.wr_lock_try(), another thread had updated the lock while holding a shared lock_sys.latch. The lock_sys.latch happened to have been released by the time lock_sys.wr_lock_try() executed the std::atomic::compare_exchange_strong() on the lock word, so the exclusive lock_sys.latch was granted without waiting.
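The interleaving described above can be replayed deterministically in a single thread. The following is only a minimal sketch under stated assumptions: ToyLatch, ToyLock, ToyTrx and replay_interleaving() are hypothetical stand-ins, not InnoDB types, and the sketch assumes that the update made by the other thread was to grant the lock and clear the wait_lock pointer. It shows that a successful compare_exchange_strong() on the lock word says nothing about what happened before it executed.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical toy latch word: 0 = free, TOY_WRITER = exclusively held,
// any other value = number of shared holders. Not the real lock_sys.latch.
constexpr uint32_t TOY_WRITER = 1U << 31;

struct ToyLatch {
  std::atomic<uint32_t> word{0};

  // Succeeds only if the latch is completely free at this instant.
  bool wr_lock_try() {
    uint32_t expected = 0;
    return word.compare_exchange_strong(expected, TOY_WRITER,
                                        std::memory_order_acquire,
                                        std::memory_order_relaxed);
  }
  void wr_unlock() { word.store(0, std::memory_order_release); }
  // Toy shared mode: assumes no writer is present (true in this replay).
  void rd_lock() { word.fetch_add(1, std::memory_order_acquire); }
  void rd_unlock() { word.fetch_sub(1, std::memory_order_release); }
};

struct ToyLock { bool waiting; };
// A plain pointer is enough here because the replay is single-threaded.
struct ToyTrx { ToyLock *wait_lock; };

struct Replay {
  bool granted_without_waiting;
  bool snapshot_is_stale;
};

Replay replay_interleaving() {
  ToyLatch latch;
  ToyLock lock{true};
  ToyTrx trx{&lock};

  // Step 1: the reporting thread reads wait_lock before taking the latch.
  ToyLock *snapshot = trx.wait_lock;

  // Step 2 (assumed): another thread grants the lock under a shared latch
  // and releases the latch again before step 3 runs.
  latch.rd_lock();
  lock.waiting = false;
  trx.wait_lock = nullptr;
  latch.rd_unlock();

  // Step 3: the latch is free, so the try-lock succeeds without waiting.
  bool granted = latch.wr_lock_try();

  // Step 4: the snapshot from step 1 no longer matches the current state.
  bool stale = snapshot != trx.wait_lock;

  if (granted)
    latch.wr_unlock();
  return {granted, stale};
}
```

In this replay the snapshot is non-null while trx.wait_lock is null, which is the same kind of mismatch that the debug assertion in lock_wait() checks for.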

In lock_sys_t::cancel() there is a similar lock_sys.wr_lock_try() pattern on record locks (which can be modified by other threads), but it correctly reloads trx->lock.wait_lock after acquiring the lock_sys.latch.
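The reload pattern can be sketched in isolation as follows. This is an illustration under assumptions, not the InnoDB code: toy_latch (a std::mutex) stands in for lock_sys.latch, and ToyLock, ToyTrx and reload_wait_lock() are hypothetical names. The point is only that the pointer is read again after the latch is held, regardless of whether the try-lock succeeded immediately.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

struct ToyLock {
  bool waiting;
  bool is_waiting() const { return waiting; }
};

struct ToyTrx {
  std::atomic<ToyLock*> wait_lock{nullptr};
};

// Hypothetical stand-in for lock_sys.latch in exclusive mode.
std::mutex toy_latch;

// Returns the lock that trx is still waiting for, or nullptr if there is
// none. Any value of trx.wait_lock read before the latch was acquired is
// deliberately ignored.
ToyLock *reload_wait_lock(ToyTrx &trx) {
  // Try first; fall back to a blocking acquisition if that fails.
  if (!toy_latch.try_lock())
    toy_latch.lock();
  std::lock_guard<std::mutex> guard(toy_latch, std::adopt_lock);

  // Reload under the latch: this is the step the patch adds.
  ToyLock *wait_lock = trx.wait_lock.load(std::memory_order_relaxed);
  if (!wait_lock)
    return nullptr;  // the wait already ended
  if (!wait_lock->is_waiting())
    return nullptr;  // the lock was granted in the meantime
  return wait_lock;
}
```

Reloading unconditionally costs one extra read, and it removes the need to reason about whether a fast try-lock implies that nothing changed.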

Issue Links

causes

MDEV-32728 safe_mutex: Found wrong usage of mutex 'LOCK_thd_data' and 'wait_mutex'

Closed

is caused by

MDEV-32096 Parallel replication lags because innobase_kill_query() may fail to interrupt a lock wait

Closed

Activity

Vladislav Lesin added a comment - 2023-10-24 06:23

Looks good to me.

People

Assignee: Vladislav Lesin

Reporter: Marko Mäkelä

Votes: 0

Watchers: 3

Dates

Created: 2023-10-20 13:13

Updated: 2023-11-08 11:38

Resolved: 2023-10-24 12:15

MariaDB Server