Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-32728

safe_mutex: Found wrong usage of mutex 'LOCK_thd_data' and 'wait_mutex'

Details

    Description

      bb-10.6-release 02e97252

      safe_mutex: Found wrong usage of mutex 'LOCK_thd_data' and 'wait_mutex'
      Mutex currently locked (in reverse order):
      wait_mutex                        /data/src/10.6/storage/innobase/handler/ha_innodb.cc  line 5024
      LOCK_thd_data                     /data/src/10.6/sql/sql_class.h  line 3851
      LOCK_thd_kill                     /data/src/10.6/sql/sql_class.h  line 3850
      

      The failure started happening after this commit in 10.6

      commit e039720bf3494a35b34eb0ddc55af170a1807723
      Author: Marko Mäkelä
      Date:   Mon Sep 11 14:51:02 2023 +0300
       
          MDEV-32096 Parallel replication lags because innobase_kill_query() may fail to interrupt a lock wait
      

      That's expected, as the mutex lock was only added there.
      Since the same test doesn't trigger any other failures for me on a build before the commit, I'll consider it a regression for now. Feel free to demote if the analysis shows otherwise.

      Attachments

        Issue Links

          Activity

            knielsen Kristian Nielsen added a comment - - edited

            The patch moves the check of trx_is_interrupted() to before taking the lock_sys.wait_mutex.m_mutex.
            But checking for kill always needs to be done while holding the mutex that's being used in pthread_cond_wait:

                  err= my_cond_timedwait(&trx->lock.cond, &lock_sys.wait_mutex.m_mutex, &abstime);
            

            Otherwise the kill may occur just after the check and be ignored, and the kill is lost.
            This will break parallel replication, as deadlocks remain unhandled and the slave will hang.

            It sems necessary to be able to call thd_kill_level() while holding other mutex, as this is a requirement for correctly handling kill.

            I understand the convenience of trying to run pending apc from thd_kill_level(), since it is something we can expect to be called frequently. But it doesn't seem safe.

            What about using mysql_mutex_trylock() in thd_kill_level(), and only running the apc if the lock can be obtained?
            The request can then be dequeued under LOCK_thd_kill. Then the apc should be run while temporarily unlocking LOCK_thd_kill. Update: No, the lock cannot be released while running the apc, it protects the lifetime of the request. But that might be ok, just a retriction on which mutexes can be taken in an apc.

            knielsen Kristian Nielsen added a comment - - edited The patch moves the check of trx_is_interrupted() to before taking the lock_sys.wait_mutex.m_mutex. But checking for kill always needs to be done while holding the mutex that's being used in pthread_cond_wait: err= my_cond_timedwait(&trx->lock.cond, &lock_sys.wait_mutex.m_mutex, &abstime); Otherwise the kill may occur just after the check and be ignored, and the kill is lost. This will break parallel replication, as deadlocks remain unhandled and the slave will hang. It sems necessary to be able to call thd_kill_level() while holding other mutex, as this is a requirement for correctly handling kill. I understand the convenience of trying to run pending apc from thd_kill_level(), since it is something we can expect to be called frequently. But it doesn't seem safe. What about using mysql_mutex_trylock() in thd_kill_level(), and only running the apc if the lock can be obtained? The request can then be dequeued under LOCK_thd_kill. Then the apc should be run while temporarily unlocking LOCK_thd_kill. Update: No, the lock cannot be released while running the apc, it protects the lifetime of the request. But that might be ok, just a retriction on which mutexes can be taken in an apc.
            knielsen Kristian Nielsen added a comment - Here an RFC patch for the above idea: https://github.com/MariaDB/server/commit/d35f30214a2a1da0f5dbc6e69d1bd9a3a5e98b06

            Thanks! Looks good, I'll apply and push it, if you don't mind

            serg Sergei Golubchik added a comment - Thanks! Looks good, I'll apply and push it, if you don't mind

            Sure, please do.
            Meanwhile, I'll see if I can come up with a quick testcase, depends on how tricky it is to get the threads coordinated.

            knielsen Kristian Nielsen added a comment - Sure, please do. Meanwhile, I'll see if I can come up with a quick testcase, depends on how tricky it is to get the threads coordinated.
            knielsen Kristian Nielsen added a comment - Pushed a testcase to knielsen_mdev32728: https://github.com/MariaDB/server/commit/2d0eb0ddfc13d7b5a788450a281a3c5b5854482e

            People

              serg Sergei Golubchik
              elenst Elena Stepanova
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.