Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-6582

DEBUG_SYNC does not reset mysys_var->current_mutex, causes assertion "Trying to unlock mutex that wasn't locked"

Details

    • Bug
    • Status: Closed (View Workflow)
    • Major
    • Resolution: Fixed
    • 10.0.13
    • 10.0.16
    • Tests
    • None

    Description

      https://buildbot.askmonty.org/buildbot/builders/work-amd64-valgrind/builds/5937

      rpl_parallel.test asserts with the following stack trace:

      safe_mutex: Trying to unlock mutex LOCK_wait_commit that wasn't locked at /var/lib/buildbot/maria-slave/work-opensuse-amd64/build/sql/sql_class.cc, line 1817
      Last used at /var/lib/buildbot/maria-slave/work-opensuse-amd64/build/sql/log.cc, line: 6791
       
      mysys/thr_mutex.c:424(safe_mutex_unlock)[0xe1c6f6]
      sql/sql_class.cc:1817(THD::awake(killed_state))[0x5b88af]
      psi/mysql_thread.h:756(inline_mysql_mutex_unlock)[0x5e59e9]
      sql/sql_parse.cc:7453(sql_kill)[0x5e5a52]
      sql/sql_parse.cc:4313(mysql_execute_command(THD*))[0x5e9203]
      sql/sql_parse.cc:6411(mysql_parse(THD*, char*, unsigned int, Parser_state*))[0x5ebf5a]
      sql/sql_parse.cc:1309(dispatch_command(enum_server_command, THD*, char*, unsigned int))[0x5ed96c]
      sql/sql_parse.cc:1005(do_command(THD*))[0x5ef152]
      sql/sql_connect.cc:1379(do_handle_one_connection(THD*))[0x6dd5ff]
      sql/sql_connect.cc:1295(handle_one_connection)[0x6dd725]

      I think I have seen similar failures before. I guess it happens because
      mysys_var->current_mutex changes while THD::awake() is trying to kill a
      thread. IIRC, this is related to incorrect use of ENTER_COND and EXIT_COND
      (eg. missing exit or something like that).

      I also think that in THD::awake, we should take a local copy of
      mysys_var->current_mutex, we really don't want to reload the value when we
      unlock the mutex.

      But we could add a debug assertion that the value does not change, to try and
      catch more of these problems (incorrect usage of ENTER_COND/EXIT_COND).

      Attachments

        Issue Links

          Activity

            knielsen Kristian Nielsen added a comment - I found another instance of this: http://buildbot.askmonty.org/buildbot/builders/kvm-dgcov-jaunty-i386/builds/4408

            Another way this bug can occur is if DEBUG_SYNC is used inside ENTER_COND / EXIT_COND. Because DEBUG_SYNC changes current_mutex, which can race to give the problem seen here.

            knielsen Kristian Nielsen added a comment - Another way this bug can occur is if DEBUG_SYNC is used inside ENTER_COND / EXIT_COND. Because DEBUG_SYNC changes current_mutex, which can race to give the problem seen here.

            I think I found it.

            I managed to reproduce the bug reliably by inserting these sleeps in the code:

            === modified file 'sql/log.cc'
            --- sql/log.cc	2014-11-17 07:53:42 +0000
            +++ sql/log.cc	2014-11-26 08:59:16 +0000
            @@ -6833,6 +6833,7 @@ MYSQL_BIN_LOG::queue_for_group_commit(gr
                   */
                   wfc->opaque_pointer= orig_entry;
                   DEBUG_SYNC(orig_entry->thd, "group_commit_waiting_for_prior");
            +my_sleep(100000);
                   orig_entry->thd->ENTER_COND(&wfc->COND_wait_commit,
                                               &wfc->LOCK_wait_commit,
                                               &stage_waiting_for_prior_transaction_to_commit,
             
            === modified file 'sql/sql_class.cc'
            --- sql/sql_class.cc	2014-11-13 10:01:31 +0000
            +++ sql/sql_class.cc	2014-11-26 08:59:05 +0000
            @@ -1810,6 +1810,7 @@ void THD::awake(killed_state state_to_se
                   for (i= 0; i < WAIT_FOR_KILL_TRY_TIMES * SECONDS_TO_WAIT_FOR_KILL; i++)
                   {
                     int ret= mysql_mutex_trylock(mysys_var->current_mutex);
            +my_sleep(200000);
                     mysql_cond_broadcast(mysys_var->current_cond);
                     if (!ret)
                     {
             

            I think the problem is in debug_sync_execute():

                  if (thd->mysys_var)
                  {
                    old_mutex= thd->mysys_var->current_mutex;
                    thd->mysys_var->current_mutex= &debug_sync_global.ds_mutex;
            ...
                  if (old_mutex)
                    thd->mysys_var->current_mutex= old_mutex;

            Note, that if old_mutex is NULL, then debug_sync_execute() will leave a
            dangling non-NULL value in thd->mysys_var->current_mutex. This allows a race,
            where first THD::awake() fetches and locks the dangling mutex, then ENTER_COND
            changes current_mutex to a new value, then THD::awake() reloads the new value
            and unlocks a different mutex from the one it locked (the dangling one).

            Seems clear enough now.

            knielsen Kristian Nielsen added a comment - I think I found it. I managed to reproduce the bug reliably by inserting these sleeps in the code: === modified file 'sql/log.cc' --- sql/log.cc 2014-11-17 07:53:42 +0000 +++ sql/log.cc 2014-11-26 08:59:16 +0000 @@ -6833,6 +6833,7 @@ MYSQL_BIN_LOG::queue_for_group_commit(gr */ wfc->opaque_pointer= orig_entry; DEBUG_SYNC(orig_entry->thd, "group_commit_waiting_for_prior"); +my_sleep(100000); orig_entry->thd->ENTER_COND(&wfc->COND_wait_commit, &wfc->LOCK_wait_commit, &stage_waiting_for_prior_transaction_to_commit,   === modified file 'sql/sql_class.cc' --- sql/sql_class.cc 2014-11-13 10:01:31 +0000 +++ sql/sql_class.cc 2014-11-26 08:59:05 +0000 @@ -1810,6 +1810,7 @@ void THD::awake(killed_state state_to_se for (i= 0; i < WAIT_FOR_KILL_TRY_TIMES * SECONDS_TO_WAIT_FOR_KILL; i++) { int ret= mysql_mutex_trylock(mysys_var->current_mutex); +my_sleep(200000); mysql_cond_broadcast(mysys_var->current_cond); if (!ret) {   I think the problem is in debug_sync_execute(): if (thd->mysys_var) { old_mutex= thd->mysys_var->current_mutex; thd->mysys_var->current_mutex= &debug_sync_global.ds_mutex; ... if (old_mutex) thd->mysys_var->current_mutex= old_mutex; Note, that if old_mutex is NULL, then debug_sync_execute() will leave a dangling non-NULL value in thd->mysys_var->current_mutex. This allows a race, where first THD::awake() fetches and locks the dangling mutex, then ENTER_COND changes current_mutex to a new value, then THD::awake() reloads the new value and unlocks a different mutex from the one it locked (the dangling one). Seems clear enough now.
            knielsen Kristian Nielsen added a comment - http://lists.askmonty.org/pipermail/commits/2014-November/007071.html

            People

              knielsen Kristian Nielsen
              knielsen Kristian Nielsen
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.