Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-8039

parallel replication slave - indefinate lock wait in rpl_parallel::wait_for_done when terminate_slave_thread with stop slave while show global status/show slave status was running

    XMLWordPrintable

    Details

      Description

      from MDEV-7126 - same symptoms - different case

      reproduced on 10.0.17:

      its quite likely this had an active IO thread and stopped SQL thread (was running pt-slave-delay --delay 24h --interval 1hr)

        mysql  -S /var/lib/mariadb/mysql-10.sock 
      Welcome to the MariaDB monitor.  Commands end with ; or \g.
      Your MariaDB connection id is 1438
      Server version: 10.0.17-MariaDB-log MariaDB Server
       
      Copyright (c) 2000, 2015, Oracle, MariaDB Corporation Ab and others.
       
      Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
       
      MariaDB [(none)]> stop slave; set global slave_domain_parallel_threads=15; start slave;
      

      ]$ mysql -u openquery   -S /var/lib/mariadb/mysql-10.sock 
      Welcome to the MariaDB monitor.  Commands end with ; or \g.
      Your MariaDB connection id is 1443
      Server version: 10.0.17-MariaDB-log MariaDB Server
       
      Copyright (c) 2000, 2015, Oracle, MariaDB Corporation Ab and others.
       
      Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
       
      MariaDB [(none)]> show processlist;
      +------+-------------+-----------+--------------------+---------+--------+--------------------------------------------------------------------------------+--------------------+----------+
      | Id   | User        | Host      | db                 | Command | Time   | State                                                                          | Info               | Progress |
      +------+-------------+-----------+--------------------+---------+--------+--------------------------------------------------------------------------------+--------------------+----------+
      |    3 | system user |           | NULL               | Connect | 264856 | Waiting for work from SQL thread                                               | NULL               |    0.000 |
      |    4 | system user |           | NULL               | Connect | 264856 | Waiting for work from SQL thread                                               | NULL               |    0.000 |
      |    5 | system user |           | NULL               | Connect | 264856 | Waiting for work from SQL thread                                               | NULL               |    0.000 |
      |    6 | system user |           | NULL               | Connect | 264856 | Waiting for work from SQL thread                                               | NULL               |    0.000 |
      |    7 | system user |           | NULL               | Connect | 264856 | Waiting for work from SQL thread                                               | NULL               |    0.000 |
      |    8 | system user |           | NULL               | Connect | 264856 | Waiting for work from SQL thread                                               | NULL               |    0.000 |
      |    9 | system user |           | NULL               | Connect | 264856 | Waiting for work from SQL thread                                               | NULL               |    0.000 |
      |   10 | system user |           | NULL               | Connect | 264856 | Waiting for work from SQL thread                                               | NULL               |    0.000 |
      |   11 | system user |           | NULL               | Connect | 264856 | Waiting for work from SQL thread                                               | NULL               |    0.000 |
      |   12 | system user |           | NULL               | Connect | 264856 | Waiting for work from SQL thread                                               | NULL               |    0.000 |
      |   13 | system user |           | NULL               | Connect | 114397 | Waiting for prior transaction to start commit before starting next transaction | NULL               |    0.000 |
      |   14 | system user |           | NULL               | Connect | 114397 | Waiting for prior transaction to start commit before starting next transaction | NULL               |    0.000 |
      |   15 | system user |           | NULL               | Connect | 114397 | Waiting for prior transaction to start commit before starting next transaction | NULL               |    0.000 |
      |   16 | system user |           | NULL               | Connect | 114396 | Waiting for prior transaction to start commit before starting next transaction | NULL               |    0.000 |
      |   17 | system user |           | NULL               | Connect | 114398 | Waiting for prior transaction to start commit before starting next transaction | NULL               |    0.000 |
      |   18 | system user |           | NULL               | Connect | 114396 | Waiting for prior transaction to start commit before starting next transaction | NULL               |    0.000 |
      |   19 | system user |           | NULL               | Connect | 114398 | Waiting for prior transaction to start commit before starting next transaction | NULL               |    0.000 |
      |   20 | system user |           | NULL               | Connect | 114398 | Waiting for prior transaction to start commit before starting next transaction | NULL               |    0.000 |
      |   21 | system user |           | NULL               | Connect | 114396 | Waiting for prior transaction to start commit before starting next transaction | NULL               |    0.000 |
      |   22 | system user |           | NULL               | Connect | 114397 | Waiting for prior transaction to start commit before starting next transaction | NULL               |    0.000 |
      |   24 | system user |           | NULL               | Connect | 264856 | Waiting for master to send event                                               | NULL               |    0.000 |
      |   39 | openquery   | localhost | NULL               | Sleep   |    907 |                                                                                | NULL               |    0.000 |
      | 1349 | system user |           | NULL               | Connect |  26107 | Reading event from the relay log                                               | NULL               |    0.000 |
      | 1438 | root        | localhost | NULL               | Query   |    147 | Killing slave                                                                  | stop slave         |    0.000 |
      | 1440 | munin       | localhost | information_schema | Query   |     76 | Filling schema table                                                           | SHOW GLOBAL STATUS |    0.000 |
      | 1441 | root        | localhost | NULL               | Query   |     41 | init                                                                           | show slave status  |    0.000 |
      | 1442 | munin       | localhost | information_schema | Query   |     14 | Filling schema table                                                           | SHOW GLOBAL STATUS |    0.000 |
      | 1443 | openquery   | localhost | NULL               | Query   |      0 | init                                                                           | show processlist   |    0.000 |
      +------+-------------+-----------+--------------------+---------+--------+--------------------------------------------------------------------------------+--------------------+----------+
      

      SQL thread was stopped - set password on non-existent user.

      MariaDB [(none)]> set global sql_slave_skip_counter=1; start slave;
      ERROR 1198 (HY000): This operation cannot be performed as you have a running slave ''; run STOP SLAVE '' first
      Query OK, 0 rows affected, 1 warning (0.00 sec)
       
      MariaDB [(none)]> stop slave;set global sql_slave_skip_counter=1; start slave;
      

      erver version: 10.0.17-MariaDB-log MariaDB Server
       
      Copyright (c) 2000, 2015, Oracle, MariaDB Corporation Ab and others.
       
      Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
       
      MariaDB [(none)]> show processlist;
      | Id   | User        | Host      | db                 | Command | Time    | State                                                                          | Info               | Progress |
      |    3 | system user |           | NULL               | Connect | 1010764 | Waiting for work from SQL thread                                               | NULL               |    0.000 |
      |    4 | system user |           | NULL               | Connect | 1010764 | Waiting for work from SQL thread                                               | NULL               |    0.000 |
      |    5 | system user |           | NULL               | Connect | 1010764 | Waiting for work from SQL thread                                               | NULL               |    0.000 |
      |    6 | system user |           | NULL               | Connect | 1010764 | Waiting for work from SQL thread                                               | NULL               |    0.000 |
      |    7 | system user |           | NULL               | Connect | 1010764 | Waiting for work from SQL thread                                               | NULL               |    0.000 |
      |    8 | system user |           | NULL               | Connect |  944326 | Waiting for prior transaction to start commit before starting next transaction | NULL               |    0.000 |
      |    9 | system user |           | NULL               | Connect |  944324 | Waiting for prior transaction to start commit before starting next transaction | NULL               |    0.000 |
      |   10 | system user |           | NULL               | Connect |  944327 | Waiting for prior transaction to start commit before starting next transaction | NULL               |    0.000 |
      |   11 | system user |           | NULL               | Connect |  944326 | Waiting for prior transaction to start commit before starting next transaction | NULL               |    0.000 |
      |   12 | system user |           | NULL               | Connect |  944327 | Waiting for prior transaction to start commit before starting next transaction | NULL               |    0.000 |
      |   13 | system user |           | NULL               | Connect |  944326 | Waiting for prior transaction to start commit before starting next transaction | NULL               |    0.000 |
      |   14 | system user |           | NULL               | Connect |  944325 | Waiting for prior transaction to start commit before starting next transaction | NULL               |    0.000 |
      |   15 | system user |           | NULL               | Connect |  944328 | Waiting for prior transaction to start commit before starting next transaction | NULL               |    0.000 |
      |   16 | system user |           | NULL               | Connect |  944328 | Waiting for prior transaction to start commit before starting next transaction | NULL               |    0.000 |
      |   17 | system user |           | NULL               | Connect |  944324 | Waiting for prior transaction to start commit before starting next transaction | NULL               |    0.000 |
      |   18 | system user |           | NULL               | Connect |  944328 | Waiting for prior transaction to start commit before starting next transaction | NULL               |    0.000 |
      |   19 | system user |           | NULL               | Connect |  944324 | Waiting for prior transaction to start commit before starting next transaction | NULL               |    0.000 |
      |   20 | system user |           | NULL               | Connect |  944326 | Waiting for prior transaction to start commit before starting next transaction | NULL               |    0.000 |
      |   21 | system user |           | NULL               | Connect |  944324 | Waiting for prior transaction to start commit before starting next transaction | NULL               |    0.000 |
      |   22 | system user |           | NULL               | Connect |  944324 | Waiting for prior transaction to start commit before starting next transaction | NULL               |    0.000 |
      |   23 | system user |           | NULL               | Connect | 1010763 | Waiting for master to send event                                               | NULL               |    0.000 |
      |   28 | openquery   | localhost | NULL               | Killed  |  127489 | init                                                                           | SHOW SLAVE STATUS  |    0.000 |
      |  291 | system user |           | NULL               | Connect |  854691 | Reading event from the relay log                                               | NULL               |    0.000 |
      | 2711 | openquery   | localhost | NULL               | Query   |  129024 | Killing slave                                                                  | stop slave         |    0.000 |
      | 2715 | munin       | localhost | information_schema | Killed  |  128765 | Filling schema table                                                           | SHOW GLOBAL STATUS |    0.000 |
      | 2716 | root        | localhost | NULL               | Killed  |  128709 | init                                                                           | show slave status  |    0.000 |
      | 2719 | munin       | localhost | information_schema | Killed  |  128703 | Filling schema table                                                           | SHOW GLOBAL STATUS |    0.000 |
      | 2741 | openquery   | localhost | NULL               | Query   |       0 | init                                                                           | show processlist   |    0.000 |
       
      28 rows in set (0.00 sec)
      

      Threads where well and truly halted before i killed off the global status threads.

      (gdb) where
      #0  0x000000348e40b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
      #1  0x000000000056167f in terminate_slave_thread ()
      #2  0x0000000000564d45 in terminate_slave_threads(Master_info*, int, bool) ()
      #3  0x00000000005efabf in stop_slave(THD*, Master_info*, bool) ()
      #4  0x00000000005d3fe2 in mysql_execute_command(THD*) ()
      #5  0x00000000005d8552 in mysql_parse ()
      #6  0x00000000005da7b0 in dispatch_command(enum_server_command, THD*, char*, unsigned int) ()
      #7  0x00000000006963d3 in do_handle_one_connection(THD*) ()
      #8  0x00000000006964a2 in handle_one_connection ()
      #9  0x000000348e4079d1 in start_thread () from /lib64/libpthread.so.0
      #10 0x000000348e0e8b6d in clone () from /lib64/libc.so.6
      

      full bt in gdb-10.0-terminate_slave_thread.txt

      Analysis

      Positions are from objdump on the executable

      Looking at the show slave status this is at the pthread_mutex_lock position 0x63551a in fill_status though a copy show pthread_mutex lock is from fill_status(635419)->show_status_array(62a94c)->show_heartbeat_period(54a982).

      show_heartbeat_period is blocked acquiring the LOCK_active_mi.

      fill_status(63551a) is waiting on LOCK_show_status just before show_status_array which means its waiting on the fill_status that reached show_heartbeat_period.

      0x5d5953 in mysql_execute_command is waiting on a lock LOCK_active_mi for a show slave status command.

      terminate_slave_thread looking at the disassembled code the following is the only cond_timedwait

      mysql_cond_timedwait(term_cond, term_lock, &abstime)
      

      and its waiting here.

      thd->LOCK_thd_data is unlocked.
      term_lock is sql_lock (in terminate_slave_threads) based on the position in the terminate_slave_threads, which is (mi->rli.run_lock) (which is locked) in stop_slave (by calling lock_slave_thread).
      term_cond is mi->rli.stop_cond

      We can see this is terminating the sql thread.

      In both stack traces have handle_slave_sql(line 4676) calling in rpl_parallel::wait_for_done(6d52ec) at mysql_cond_wait(&e->COND_parallel_entry, &rpt->LOCK_rpl_thread); gdb-10.0-terminate_slave_thread.txt has two of these threads in the same place which strikes me as odd (wasn't multisource).

      So the blocking rpl_parallel::wait_for_done at a lock point and is preventing the mysql_cond_broadcast(&rli->stop_cond); at the bottom of the handle_slave_sql thread.

      So the point of being stuck is at rpl_parallel::wait_for_done(6d52ec) at mysql_cond_wait(&e->COND_parallel_entry, &rpt->LOCK_rpl_thread)

        Attachments

          Issue Links

            Activity

              People

              Assignee:
              knielsen Kristian Nielsen
              Reporter:
              danblack Daniel Black
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved: