Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-36287

Server crash in SHOW SLAVE STATUS concurrent with STOP SLAVE

Details

    • Bug
    • Status: Closed (View Workflow)
    • Major
    • Resolution: Fixed
    • 10.5.28
    • 12.0.0
    • Replication
    • None

    Description

      The code for SHOW SLAVE STATUS accesses mi->rli.sql_driver_thd->proc_info without holding mi->run_lock. This means the THD can go away in the middle (in case of STOP SLAVE) and cause SHOW SLAVE STATUS to access invalid memory and crash the server.

      This appeared as an MSAN test failure in Buildbot, the precise test probably doesn't matter as most any replication test could hit this rare race:

      https://buildbot.mariadb.org/#/builders/640/builds/9922

      I could reproduce with a hacked test case using sleep and a small code patch to inject a sleep to make the race easier to hit:

      --source include/master-slave.inc
       
      --connection slave1
      send STOP SLAVE;
       
      --connection slave
      --sleep 0.5
      send SHOW SLAVE STATUS;
       
      --connection slave1
      reap;
      --sleep 0.5
      START SLAVE;
       
      --connection slave
      reap;
       
      --source include/rpl_end.inc
      

      diff --git a/sql/slave.cc b/sql/slave.cc
      index 6f4176f233d..3e4df0ff4e6 100644
      --- a/sql/slave.cc
      +++ b/sql/slave.cc
      @@ -3343,8 +3343,10 @@ static bool send_show_master_info_data(THD *thd, Master_info *mi, bool full,
           // SQL_Remaining_Delay
           // THD::proc_info is not protected by any lock, so we read it once
           // to ensure that we use the same value throughout this function.
      +    THD *sql_driver= mi->rli.sql_driver_thd;
      +    my_sleep(2000000);
           const char *slave_sql_running_state=
      -      mi->rli.sql_driver_thd ? mi->rli.sql_driver_thd->proc_info : "";
      +      sql_driver ? sql_driver->proc_info : "";
           if (slave_sql_running_state == stage_sql_thd_waiting_until_delay.m_name)
           {
             time_t t= my_time(0), sql_delay_end= mi->rli.get_sql_delay_end();
      @@ -5839,6 +5841,7 @@ pthread_handler_t handle_slave_sql(void *arg)
           could be used by slave through Relay_log_info::save_temporary_tables.
         */
         thd->temporary_tables= 0;
      +my_sleep(1000000);
         rli->sql_driver_thd= 0;
         thd->rgi_fake= thd->rgi_slave= NULL;
       
      

      Setting 10.5 as the target as this is a crashing bug.

      Attachments

        Activity

          Here is a crash from the above test case:

          Thread 1 (Thread 0x7f5f386526c0 (LWP 1632036)):
          #0  __pthread_kill_implementation (threadid=<optimized out>, signo=11, no_tid=<optimized out>) at ./nptl/pthread_kill.c:44
          #1  0x000055ca74f21777 in my_write_core (sig=11) at /kvm/src/my/dev/mariadb2/mysys/stacktrace.c:424
          #2  0x000055ca74594ef5 in handle_fatal_signal (sig=11) at /kvm/src/my/dev/mariadb2/sql/signal_handler.cc:297
          #3  <signal handler called>
          #4  __strlen_evex () at ../sysdeps/x86_64/multiarch/strlen-evex.S:79
          #5  0x000055ca740bb1c0 in Protocol::store (this=0x7f5edc001358, from=0x8f8f8f8f8f8f8f8f <error: Cannot access memory at address 0x8f8f8f8f8f8f8f8f>, cs=0x55ca75adc560 <my_charset_bin>) at /kvm/src/my/dev/mariadb2/sql/protocol.cc:1119
          #6  0x000055ca740dd32a in send_show_master_info_data (thd=0x7f5edc000d58, mi=0x55ca84302a80, full=false, gtid_pos=0x7f5f38650ad0) at /kvm/src/my/dev/mariadb2/sql/slave.cc:3358
          #7  0x000055ca740d9d71 in show_master_info (thd=0x7f5edc000d58, mi=0x55ca84302a80, full=false) at /kvm/src/my/dev/mariadb2/sql/slave.cc:2898
          #8  0x000055ca740d57a8 in Sql_cmd_show_slave_status::execute (this=0x7f5edc012a10, thd=0x7f5edc000d58) at /kvm/src/my/dev/mariadb2/sql/slave.cc:1530
          #9  0x000055ca741ffa7f in mysql_execute_command (thd=0x7f5edc000d58) at /kvm/src/my/dev/mariadb2/sql/sql_parse.cc:6194
          #10 0x000055ca742051cd in mysql_parse (thd=0x7f5edc000d58, rawbuf=0x7f5edc012980 "SHOW SLAVE STATUS", length=17, parser_state=0x7f5f38651300, is_com_multi=false, is_next_command=false) at /kvm/src/my/dev/mariadb2/sql/sql_parse.cc:8252
          #11 0x000055ca741f33e3 in dispatch_command (command=COM_QUERY, thd=0x7f5edc000d58, packet=0x7f5edc00a4c9 "", packet_length=17, is_com_multi=false, is_next_command=false) at /kvm/src/my/dev/mariadb2/sql/sql_parse.cc:1891
          #12 0x000055ca741f2078 in do_command (thd=0x7f5edc000d58) at /kvm/src/my/dev/mariadb2/sql/sql_parse.cc:1375
          #13 0x000055ca743b428e in do_handle_one_connection (connect=0x55ca842cd8f8, put_in_cache=true) at /kvm/src/my/dev/mariadb2/sql/sql_connect.cc:1386
          #14 0x000055ca743b402d in handle_one_connection (arg=0x55ca842cd8f8) at /kvm/src/my/dev/mariadb2/sql/sql_connect.cc:1298
          #15 0x000055ca748f127c in pfs_spawn_thread (arg=0x55ca842acd48) at /kvm/src/my/dev/mariadb2/storage/perfschema/pfs.cc:2201
          #16 0x00007f5f452a81c4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
          #17 0x00007f5f4532885c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
          

          knielsen Kristian Nielsen added a comment - Here is a crash from the above test case: Thread 1 (Thread 0x7f5f386526c0 (LWP 1632036)): #0 __pthread_kill_implementation (threadid=<optimized out>, signo=11, no_tid=<optimized out>) at ./nptl/pthread_kill.c:44 #1 0x000055ca74f21777 in my_write_core (sig=11) at /kvm/src/my/dev/mariadb2/mysys/stacktrace.c:424 #2 0x000055ca74594ef5 in handle_fatal_signal (sig=11) at /kvm/src/my/dev/mariadb2/sql/signal_handler.cc:297 #3 <signal handler called> #4 __strlen_evex () at ../sysdeps/x86_64/multiarch/strlen-evex.S:79 #5 0x000055ca740bb1c0 in Protocol::store (this=0x7f5edc001358, from=0x8f8f8f8f8f8f8f8f <error: Cannot access memory at address 0x8f8f8f8f8f8f8f8f>, cs=0x55ca75adc560 <my_charset_bin>) at /kvm/src/my/dev/mariadb2/sql/protocol.cc:1119 #6 0x000055ca740dd32a in send_show_master_info_data (thd=0x7f5edc000d58, mi=0x55ca84302a80, full=false, gtid_pos=0x7f5f38650ad0) at /kvm/src/my/dev/mariadb2/sql/slave.cc:3358 #7 0x000055ca740d9d71 in show_master_info (thd=0x7f5edc000d58, mi=0x55ca84302a80, full=false) at /kvm/src/my/dev/mariadb2/sql/slave.cc:2898 #8 0x000055ca740d57a8 in Sql_cmd_show_slave_status::execute (this=0x7f5edc012a10, thd=0x7f5edc000d58) at /kvm/src/my/dev/mariadb2/sql/slave.cc:1530 #9 0x000055ca741ffa7f in mysql_execute_command (thd=0x7f5edc000d58) at /kvm/src/my/dev/mariadb2/sql/sql_parse.cc:6194 #10 0x000055ca742051cd in mysql_parse (thd=0x7f5edc000d58, rawbuf=0x7f5edc012980 "SHOW SLAVE STATUS", length=17, parser_state=0x7f5f38651300, is_com_multi=false, is_next_command=false) at /kvm/src/my/dev/mariadb2/sql/sql_parse.cc:8252 #11 0x000055ca741f33e3 in dispatch_command (command=COM_QUERY, thd=0x7f5edc000d58, packet=0x7f5edc00a4c9 "", packet_length=17, is_com_multi=false, is_next_command=false) at /kvm/src/my/dev/mariadb2/sql/sql_parse.cc:1891 #12 0x000055ca741f2078 in do_command (thd=0x7f5edc000d58) at /kvm/src/my/dev/mariadb2/sql/sql_parse.cc:1375 #13 0x000055ca743b428e in do_handle_one_connection (connect=0x55ca842cd8f8, put_in_cache=true) at /kvm/src/my/dev/mariadb2/sql/sql_connect.cc:1386 #14 0x000055ca743b402d in handle_one_connection (arg=0x55ca842cd8f8) at /kvm/src/my/dev/mariadb2/sql/sql_connect.cc:1298 #15 0x000055ca748f127c in pfs_spawn_thread (arg=0x55ca842acd48) at /kvm/src/my/dev/mariadb2/storage/perfschema/pfs.cc:2201 #16 0x00007f5f452a81c4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442 #17 0x00007f5f4532885c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

          bnestere Want to review this small patch as well? Pushed into knielsen_fixes_10.5 along with the others.

          knielsen Kristian Nielsen added a comment - bnestere Want to review this small patch as well? Pushed into knielsen_fixes_10.5 along with the others.

          I have reviewed the code and it looks ok to push

          monty Michael Widenius added a comment - I have reviewed the code and it looks ok to push

          Approved, thanks knielsen! Though I think 10.5 and 10.6 are locked for high-priority / security fixes only. I'd say with this being a crash, and a low-risk fix, it would be worth putting into 10.5. But I'm ok with 10.11 too. monty what do you think?

          bnestere Brandon Nesterenko added a comment - Approved, thanks knielsen ! Though I think 10.5 and 10.6 are locked for high-priority / security fixes only. I'd say with this being a crash, and a low-risk fix, it would be worth putting into 10.5. But I'm ok with 10.11 too. monty what do you think?

          Pushed to 10.5

          knielsen Kristian Nielsen added a comment - Pushed to 10.5

          People

            knielsen Kristian Nielsen
            knielsen Kristian Nielsen
            Votes:
            1 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.