[MDEV-32728] safe_mutex: Found wrong usage of mutex 'LOCK_thd_data' and 'wait_mutex' - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Blocker
Resolution: Fixed
Affects Version/s: 10.6
Fix Version/s: 10.6.16, 10.10.7, 10.11.6, 11.0.4, 11.1.3, 11.2.2
Component/s: Locking
Labels:
- regression
- rr-profile-analyzed

Description

bb-10.6-release 02e97252
safe_mutex: Found wrong usage of mutex 'LOCK_thd_data' and 'wait_mutex'
Mutex currently locked (in reverse order):
wait_mutex /data/src/10.6/storage/innobase/handler/ha_innodb.cc line 5024
LOCK_thd_data /data/src/10.6/sql/sql_class.h line 3851
LOCK_thd_kill /data/src/10.6/sql/sql_class.h line 3850

The failure started happening after this commit in 10.6

commit e039720bf3494a35b34eb0ddc55af170a1807723

Author: Marko Mäkelä

Date:   Mon Sep 11 14:51:02 2023 +0300

    MDEV-32096 Parallel replication lags because innobase_kill_query() may fail to interrupt a lock wait

That's expected, as the mutex lock was only added there.
Since the same test doesn't trigger any other failures for me on a build before the commit, I'll consider it a regression for now. Feel free to demote if the analysis shows otherwise.

Attachments

Issue Links

is caused by

MDEV-32530 Race condition in lock_wait_rpl_report()

Closed

Activity

Ascending order - Click to sort in descending order

View 9 older comments

Kristian Nielsen added a comment - 2023-11-08 12:22 - edited

The patch moves the check of trx_is_interrupted() to before taking the lock_sys.wait_mutex.m_mutex.
But checking for kill always needs to be done while holding the mutex that's being used in pthread_cond_wait:

      err= my_cond_timedwait(&trx->lock.cond, &lock_sys.wait_mutex.m_mutex, &abstime);

Otherwise the kill may occur just after the check and be ignored, and the kill is lost.
This will break parallel replication, as deadlocks remain unhandled and the slave will hang.

It sems necessary to be able to call thd_kill_level() while holding other mutex, as this is a requirement for correctly handling kill.

I understand the convenience of trying to run pending apc from thd_kill_level(), since it is something we can expect to be called frequently. But it doesn't seem safe.

What about using mysql_mutex_trylock() in thd_kill_level(), and only running the apc if the lock can be obtained?
~~The request can then be dequeued under LOCK_thd_kill. Then the apc should be run while temporarily unlocking LOCK_thd_kill.~~ Update: No, the lock cannot be released while running the apc, it protects the lifetime of the request. But that might be ok, just a retriction on which mutexes can be taken in an apc.

Kristian Nielsen added a comment - 2023-11-08 12:22 - edited The patch moves the check of trx_is_interrupted() to before taking the lock_sys.wait_mutex.m_mutex. But checking for kill always needs to be done while holding the mutex that's being used in pthread_cond_wait: err= my_cond_timedwait(&trx->lock.cond, &lock_sys.wait_mutex.m_mutex, &abstime); Otherwise the kill may occur just after the check and be ignored, and the kill is lost. This will break parallel replication, as deadlocks remain unhandled and the slave will hang. It sems necessary to be able to call thd_kill_level() while holding other mutex, as this is a requirement for correctly handling kill. I understand the convenience of trying to run pending apc from thd_kill_level(), since it is something we can expect to be called frequently. But it doesn't seem safe. What about using mysql_mutex_trylock() in thd_kill_level(), and only running the apc if the lock can be obtained? The request can then be dequeued under LOCK_thd_kill. Then the apc should be run while temporarily unlocking LOCK_thd_kill. Update: No, the lock cannot be released while running the apc, it protects the lifetime of the request. But that might be ok, just a retriction on which mutexes can be taken in an apc.

Kristian Nielsen added a comment - 2023-11-08 12:57

Here an RFC patch for the above idea:

https://github.com/MariaDB/server/commit/d35f30214a2a1da0f5dbc6e69d1bd9a3a5e98b06

Kristian Nielsen added a comment - 2023-11-08 12:57 Here an RFC patch for the above idea: https://github.com/MariaDB/server/commit/d35f30214a2a1da0f5dbc6e69d1bd9a3a5e98b06

Sergei Golubchik added a comment - 2023-11-08 13:54

Thanks! Looks good, I'll apply and push it, if you don't mind

Sergei Golubchik added a comment - 2023-11-08 13:54 Thanks! Looks good, I'll apply and push it, if you don't mind

Kristian Nielsen added a comment - 2023-11-08 14:06

Sure, please do.
Meanwhile, I'll see if I can come up with a quick testcase, depends on how tricky it is to get the threads coordinated.

Kristian Nielsen added a comment - 2023-11-08 14:06 Sure, please do. Meanwhile, I'll see if I can come up with a quick testcase, depends on how tricky it is to get the threads coordinated.

Kristian Nielsen added a comment - 2023-11-08 17:24

Pushed a testcase to knielsen_mdev32728:

https://github.com/MariaDB/server/commit/2d0eb0ddfc13d7b5a788450a281a3c5b5854482e

Kristian Nielsen added a comment - 2023-11-08 17:24 Pushed a testcase to knielsen_mdev32728: https://github.com/MariaDB/server/commit/2d0eb0ddfc13d7b5a788450a281a3c5b5854482e

People

Assignee:: Sergei Golubchik

Reporter:: Elena Stepanova

Votes:: 0 Vote for this issue

Watchers:: 4 Start watching this issue

Dates

Created:: 2023-11-07 14:25

Updated:: 2023-11-09 13:57

Resolved:: 2023-11-09 13:57

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server