  1. MariaDB Server
  2. MDEV-36287

Server crash in SHOW SLAVE STATUS concurrent with STOP SLAVE

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • 10.5.28
    • 12.0.0
    • Replication
    • None

    Description

      The code for SHOW SLAVE STATUS accesses mi->rli.sql_driver_thd->proc_info without holding mi->run_lock. This means the THD can be freed in the middle of the access (for example by a concurrent STOP SLAVE), causing SHOW SLAVE STATUS to read invalid memory and crash the server.
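
      As a rough illustration of the locking discipline this implies, here is a simplified, self-contained model. It is not the server code and not necessarily how the bug was fixed: Thd, RelayLogInfo, and the three helper functions are hypothetical stand-ins, and std::mutex stands in for the server's own lock type. The point it sketches is that the reader dereferences the THD pointer and copies the state only while holding the same lock under which the pointer is cleared and the THD destroyed.

```cpp
#include <mutex>
#include <string>

// Simplified stand-ins for the server structures (assumptions for
// illustration only; these are not the real MariaDB types).
struct Thd {
  const char *proc_info = "";
};

struct RelayLogInfo {
  std::mutex run_lock;            // models the lock guarding the SQL thread's lifetime
  Thd *sql_driver_thd = nullptr;  // non-null only while the SQL thread exists
};

// Models SQL thread startup: create the THD and publish it under the lock.
void start_sql_thread(RelayLogInfo &rli, const char *state) {
  Thd *thd = new Thd;
  thd->proc_info = state;
  std::lock_guard<std::mutex> guard(rli.run_lock);
  rli.sql_driver_thd = thd;
}

// Models STOP SLAVE: unpublish and destroy the THD under the same lock.
void stop_sql_thread(RelayLogInfo &rli) {
  std::lock_guard<std::mutex> guard(rli.run_lock);
  delete rli.sql_driver_thd;
  rli.sql_driver_thd = nullptr;
}

// Models the SHOW SLAVE STATUS reader: the pointer is dereferenced and the
// state copied while the lock is held, so the THD cannot be freed mid-read.
std::string read_sql_running_state(RelayLogInfo &rli) {
  std::lock_guard<std::mutex> guard(rli.run_lock);
  return rli.sql_driver_thd ? rli.sql_driver_thd->proc_info : "";
}
```

      In this model, a reader that calls read_sql_running_state() concurrently with stop_sql_thread() sees either the old state or an empty string, never a dangling pointer. Returning a copy rather than the raw pointer keeps the result valid after the lock is released.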

      This appeared as an MSAN test failure in Buildbot. The precise test probably doesn't matter, as almost any replication test could hit this rare race:

      https://buildbot.mariadb.org/#/builders/640/builds/9922

      I could reproduce it with a hacked test case that uses sleeps, together with a small code patch that injects sleeps into the server to make the race easier to hit:

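      --source include/master-slave.inc

      # Start STOP SLAVE asynchronously on a second slave connection.
      --connection slave1
      send STOP SLAVE;

      # While STOP SLAVE is still in progress, run SHOW SLAVE STATUS concurrently.
      --connection slave
      --sleep 0.5
      send SHOW SLAVE STATUS;

      # Wait for STOP SLAVE to complete, then restart the slave.
      --connection slave1
      reap;
      --sleep 0.5
      START SLAVE;

      # Collect the result of SHOW SLAVE STATUS.
      --connection slave
      reap;

      --source include/rpl_end.inc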
      

      diff --git a/sql/slave.cc b/sql/slave.cc
      index 6f4176f233d..3e4df0ff4e6 100644
      --- a/sql/slave.cc
      +++ b/sql/slave.cc
      @@ -3343,8 +3343,10 @@ static bool send_show_master_info_data(THD *thd, Master_info *mi, bool full,
           // SQL_Remaining_Delay
           // THD::proc_info is not protected by any lock, so we read it once
           // to ensure that we use the same value throughout this function.
      +    THD *sql_driver= mi->rli.sql_driver_thd;
      +    my_sleep(2000000);
           const char *slave_sql_running_state=
      -      mi->rli.sql_driver_thd ? mi->rli.sql_driver_thd->proc_info : "";
      +      sql_driver ? sql_driver->proc_info : "";
           if (slave_sql_running_state == stage_sql_thd_waiting_until_delay.m_name)
           {
             time_t t= my_time(0), sql_delay_end= mi->rli.get_sql_delay_end();
      @@ -5839,6 +5841,7 @@ pthread_handler_t handle_slave_sql(void *arg)
           could be used by slave through Relay_log_info::save_temporary_tables.
         */
         thd->temporary_tables= 0;
      +my_sleep(1000000);
         rli->sql_driver_thd= 0;
         thd->rgi_fake= thd->rgi_slave= NULL;
       
      

      Setting 10.5 as the target as this is a crashing bug.

          People

            knielsen Kristian Nielsen