safe_mutex: Found wrong usage of mutex 'LOCK_thd_data' and 'wait_mutex'
Mutex currently locked (in reverse order):
wait_mutex /data/src/10.6/storage/innobase/handler/ha_innodb.cc line 5024
LOCK_thd_data /data/src/10.6/sql/sql_class.h line 3851
LOCK_thd_kill /data/src/10.6/sql/sql_class.h line 3850
The failure started happening after this commit in 10.6:
commit e039720bf3494a35b34eb0ddc55af170a1807723
Author: Marko Mäkelä
Date: Mon Sep 11 14:51:02 2023 +0300
MDEV-32096 Parallel replication lags because innobase_kill_query() may fail to interrupt a lock wait
That's expected, as the mutex lock was only added there.
Since the same test doesn't trigger any other failures for me on a build before the commit, I'll consider it a regression for now. Feel free to demote if the analysis shows otherwise.
Issue Links
is caused by
MDEV-32530 Race condition in lock_wait_rpl_report()
Kristian Nielsen
added a comment (edited) - The patch moves the check of trx_is_interrupted() to before taking the lock_sys.wait_mutex.m_mutex.
But checking for kill always needs to be done while holding the mutex that's being used in pthread_cond_wait:
err= my_cond_timedwait(&trx->lock.cond, &lock_sys.wait_mutex.m_mutex, &abstime);
Otherwise the kill may occur just after the check and be ignored, and the kill is lost.
This will break parallel replication, as deadlocks remain unhandled and the slave will hang.
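To illustrate the point about checking under the mutex, here is a minimal sketch in standard C++ rather than the server's own primitives. The names Waiter, kill_waiter and wait_for_lock are hypothetical, and the sketch assumes for simplicity that the kill flag itself is protected by the wait mutex, which is not necessarily how the server stores it. Because the predicate is evaluated with the mutex held, a kill that arrives just before the wait is still seen:

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for the lock-wait state of one transaction.
struct Waiter {
  std::mutex wait_mutex;         // plays the role of lock_sys.wait_mutex
  std::condition_variable cond;  // plays the role of trx->lock.cond
  bool killed = false;           // assumed kill flag, guarded by wait_mutex
  bool granted = false;          // assumed "lock was granted" flag
};

// Killer side: set the flag and signal while holding the same mutex.
void kill_waiter(Waiter &w) {
  std::lock_guard<std::mutex> guard(w.wait_mutex);
  w.killed = true;
  w.cond.notify_one();
}

// Waiter side: returns true if the wait ended because of a kill.
bool wait_for_lock(Waiter &w, std::chrono::milliseconds timeout) {
  std::unique_lock<std::mutex> lk(w.wait_mutex);
  // The predicate runs with wait_mutex held, both before the first sleep
  // and after every wakeup, so there is no window between "check for kill"
  // and "start waiting" in which a kill could be lost.
  w.cond.wait_for(lk, timeout, [&w] { return w.killed || w.granted; });
  assert(lk.owns_lock());
  return w.killed;
}
```

If the kill check were done before taking wait_mutex instead, kill_waiter() could run between the check and the wait, and the waiter would sleep until the timeout.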
It seems necessary to be able to call thd_kill_level() while holding other mutexes, as this is a requirement for correctly handling kill.
I understand the convenience of trying to run pending apc from thd_kill_level(), since it is something we can expect to be called frequently. But it doesn't seem safe.
What about using mysql_mutex_trylock() in thd_kill_level(), and only running the apc if the lock can be obtained?
The request can then be dequeued under LOCK_thd_kill. Then the apc should be run while temporarily unlocking LOCK_thd_kill. Update: No, the lock cannot be released while running the apc, because it protects the lifetime of the request. But that might be OK; it would just be a restriction on which mutexes can be taken in an apc.
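A rough sketch of the trylock idea, again in standard C++ and not taken from the RFC patch. Session, enqueue_apc and poll_kill_level are hypothetical names, and the sketch assumes the mutex being tried is the one that protects the request queue (LOCK_thd_kill in the discussion above):

```cpp
#include <atomic>
#include <cassert>
#include <deque>
#include <functional>
#include <mutex>

// Hypothetical stand-in for the per-connection state discussed above.
struct Session {
  std::mutex lock_thd_kill;                     // stand-in for LOCK_thd_kill
  std::deque<std::function<void()>> apc_queue;  // assumed pending requests
  std::atomic<int> kill_level{0};               // assumed kill state
};

// Requestor side: queue a request under the lock.
void enqueue_apc(Session &s, std::function<void()> request) {
  std::lock_guard<std::mutex> guard(s.lock_thd_kill);
  s.apc_queue.push_back(std::move(request));
}

// Kill check that never blocks: pending requests are run only if the lock
// can be taken immediately; otherwise they are left for a later call.
int poll_kill_level(Session &s) {
  std::unique_lock<std::mutex> lk(s.lock_thd_kill, std::try_to_lock);
  if (lk.owns_lock()) {
    while (!s.apc_queue.empty()) {
      assert(static_cast<bool>(s.apc_queue.front()));
      // The lock stays held while the request runs, because it protects
      // the lifetime of the request. The request must therefore not take
      // mutexes that would conflict with the lock order.
      s.apc_queue.front()();
      s.apc_queue.pop_front();
    }
  }
  return s.kill_level.load();
}
```

Since the function never blocks on the mutex, it can be called from a context that already holds other mutexes without adding a new wait edge to the lock order; the cost is that a request may be delayed until the next uncontended call.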
Kristian Nielsen
added a comment - Here is an RFC patch for the above idea:
https://github.com/MariaDB/server/commit/d35f30214a2a1da0f5dbc6e69d1bd9a3a5e98b06
Kristian Nielsen
added a comment - Sure, please do.
Meanwhile, I'll see if I can come up with a quick testcase; it depends on how tricky it is to get the threads coordinated.
Kristian Nielsen
added a comment - Pushed a testcase to knielsen_mdev32728:
https://github.com/MariaDB/server/commit/2d0eb0ddfc13d7b5a788450a281a3c5b5854482e
People
Sergei Golubchik
Elena Stepanova