Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-8476

Race condition in slave SQL thread shutdown

    XMLWordPrintable

Details

    • Bug
    • Status: Closed (View Workflow)
    • Critical
    • Resolution: Fixed
    • 10.0(EOL), 10.1(EOL)
    • 10.0.23, 10.1.8
    • Replication
    • None

    Description

      During server shutdown slave SQL thread broadcasts "stopped" too early:

        mysql_cond_broadcast(&rli->stop_cond);
        DBUG_EXECUTE_IF("simulate_slave_delay_at_terminate_bug38694", sleep(5););
        mysql_mutex_unlock(&rli->run_lock);  // tell the world we are done
       
        /*
          Deactivate the parallel replication thread pool, if there are now no more
          SQL threads running. Do this here, when we have released all locks, but
          while our THD (and current_thd) is still valid.
        */
        mysql_mutex_lock(&LOCK_active_mi);
        if (opt_slave_parallel_threads > 0 &&
            !master_info_index->any_slave_sql_running())
          rpl_parallel_inactivate_pool(&global_rpl_thread_pool);
        mysql_mutex_unlock(&LOCK_active_mi);
       
        mysql_mutex_lock(&LOCK_thread_count);
        THD_CHECK_SENTRY(thd);
        delete thd;
        mysql_mutex_unlock(&LOCK_thread_count);

      This causes enormous number of sporadic replication test failures with crashes, hangs and warnings like "Plugin '...' has ref_count=1 after shutdown".

      To reproduce this failure reliably, just add sleep(1) immediately after condition is signaled and run tests as usual:

      diff --git a/sql/slave.cc b/sql/slave.cc
      index 4129c4c..efc75f3 100644
      --- a/sql/slave.cc
      +++ b/sql/slave.cc
      @@ -4837,6 +4837,7 @@ log '%s' at position %s, relay log '%s' position: %s%s", RPL_LOG_NAME,
         mysql_cond_broadcast(&rli->stop_cond);
         DBUG_EXECUTE_IF("simulate_slave_delay_at_terminate_bug38694", sleep(5););
         mysql_mutex_unlock(&rli->run_lock);  // tell the world we are done
      +  sleep(1);
       
         /*
           Deactivate the parallel replication thread pool, if there are now no more

      This seem to be appeared after:

      commit 682ed005c57dabb4615052a79789e44e5c93525c
      Author: Kristian Nielsen <knielsen@knielsen-hq.org>
      Date:   Wed Jun 10 11:57:42 2015 +0200
       
          MDEV-8294: Inconsistent behavior of slave parallel threads at runtime
       
          There were some cases where the slave SQL thread could stop without
          the pool of parallel replication worker threads being correctly
          de-activated.

      Attachments

        Issue Links

          Activity

            People

              monty Michael Widenius
              svoj Sergey Vojtovich
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.