Details
-
Bug
-
Status: Closed (View Workflow)
-
Major
-
Resolution: Cannot Reproduce
-
10.0.17
-
rhel6 10.0.17 master and slave
Description
from MDEV-7126 - same symptoms - different case
reproduced on 10.0.17:
its quite likely this had an active IO thread and stopped SQL thread (was running pt-slave-delay --delay 24h --interval 1hr)
mysql -S /var/lib/mariadb/mysql-10.sock
|
Welcome to the MariaDB monitor. Commands end with ; or \g.
|
Your MariaDB connection id is 1438
|
Server version: 10.0.17-MariaDB-log MariaDB Server
|
|
Copyright (c) 2000, 2015, Oracle, MariaDB Corporation Ab and others.
|
|
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
|
|
MariaDB [(none)]> stop slave; set global slave_domain_parallel_threads=15; start slave;
|
]$ mysql -u openquery -S /var/lib/mariadb/mysql-10.sock
|
Welcome to the MariaDB monitor. Commands end with ; or \g.
|
Your MariaDB connection id is 1443
|
Server version: 10.0.17-MariaDB-log MariaDB Server
|
|
Copyright (c) 2000, 2015, Oracle, MariaDB Corporation Ab and others.
|
|
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
|
|
MariaDB [(none)]> show processlist;
|
+------+-------------+-----------+--------------------+---------+--------+--------------------------------------------------------------------------------+--------------------+----------+
|
| Id | User | Host | db | Command | Time | State | Info | Progress |
|
+------+-------------+-----------+--------------------+---------+--------+--------------------------------------------------------------------------------+--------------------+----------+
|
| 3 | system user | | NULL | Connect | 264856 | Waiting for work from SQL thread | NULL | 0.000 |
|
| 4 | system user | | NULL | Connect | 264856 | Waiting for work from SQL thread | NULL | 0.000 |
|
| 5 | system user | | NULL | Connect | 264856 | Waiting for work from SQL thread | NULL | 0.000 |
|
| 6 | system user | | NULL | Connect | 264856 | Waiting for work from SQL thread | NULL | 0.000 |
|
| 7 | system user | | NULL | Connect | 264856 | Waiting for work from SQL thread | NULL | 0.000 |
|
| 8 | system user | | NULL | Connect | 264856 | Waiting for work from SQL thread | NULL | 0.000 |
|
| 9 | system user | | NULL | Connect | 264856 | Waiting for work from SQL thread | NULL | 0.000 |
|
| 10 | system user | | NULL | Connect | 264856 | Waiting for work from SQL thread | NULL | 0.000 |
|
| 11 | system user | | NULL | Connect | 264856 | Waiting for work from SQL thread | NULL | 0.000 |
|
| 12 | system user | | NULL | Connect | 264856 | Waiting for work from SQL thread | NULL | 0.000 |
|
| 13 | system user | | NULL | Connect | 114397 | Waiting for prior transaction to start commit before starting next transaction | NULL | 0.000 |
|
| 14 | system user | | NULL | Connect | 114397 | Waiting for prior transaction to start commit before starting next transaction | NULL | 0.000 |
|
| 15 | system user | | NULL | Connect | 114397 | Waiting for prior transaction to start commit before starting next transaction | NULL | 0.000 |
|
| 16 | system user | | NULL | Connect | 114396 | Waiting for prior transaction to start commit before starting next transaction | NULL | 0.000 |
|
| 17 | system user | | NULL | Connect | 114398 | Waiting for prior transaction to start commit before starting next transaction | NULL | 0.000 |
|
| 18 | system user | | NULL | Connect | 114396 | Waiting for prior transaction to start commit before starting next transaction | NULL | 0.000 |
|
| 19 | system user | | NULL | Connect | 114398 | Waiting for prior transaction to start commit before starting next transaction | NULL | 0.000 |
|
| 20 | system user | | NULL | Connect | 114398 | Waiting for prior transaction to start commit before starting next transaction | NULL | 0.000 |
|
| 21 | system user | | NULL | Connect | 114396 | Waiting for prior transaction to start commit before starting next transaction | NULL | 0.000 |
|
| 22 | system user | | NULL | Connect | 114397 | Waiting for prior transaction to start commit before starting next transaction | NULL | 0.000 |
|
| 24 | system user | | NULL | Connect | 264856 | Waiting for master to send event | NULL | 0.000 |
|
| 39 | openquery | localhost | NULL | Sleep | 907 | | NULL | 0.000 |
|
| 1349 | system user | | NULL | Connect | 26107 | Reading event from the relay log | NULL | 0.000 |
|
| 1438 | root | localhost | NULL | Query | 147 | Killing slave | stop slave | 0.000 |
|
| 1440 | munin | localhost | information_schema | Query | 76 | Filling schema table | SHOW GLOBAL STATUS | 0.000 |
|
| 1441 | root | localhost | NULL | Query | 41 | init | show slave status | 0.000 |
|
| 1442 | munin | localhost | information_schema | Query | 14 | Filling schema table | SHOW GLOBAL STATUS | 0.000 |
|
| 1443 | openquery | localhost | NULL | Query | 0 | init | show processlist | 0.000 |
|
+------+-------------+-----------+--------------------+---------+--------+--------------------------------------------------------------------------------+--------------------+----------+
|
SQL thread was stopped - set password on non-existent user.
MariaDB [(none)]> set global sql_slave_skip_counter=1; start slave;
|
ERROR 1198 (HY000): This operation cannot be performed as you have a running slave ''; run STOP SLAVE '' first
|
Query OK, 0 rows affected, 1 warning (0.00 sec)
|
|
MariaDB [(none)]> stop slave;set global sql_slave_skip_counter=1; start slave;
|
erver version: 10.0.17-MariaDB-log MariaDB Server
|
|
Copyright (c) 2000, 2015, Oracle, MariaDB Corporation Ab and others.
|
|
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
|
|
MariaDB [(none)]> show processlist;
|
| Id | User | Host | db | Command | Time | State | Info | Progress |
|
| 3 | system user | | NULL | Connect | 1010764 | Waiting for work from SQL thread | NULL | 0.000 |
|
| 4 | system user | | NULL | Connect | 1010764 | Waiting for work from SQL thread | NULL | 0.000 |
|
| 5 | system user | | NULL | Connect | 1010764 | Waiting for work from SQL thread | NULL | 0.000 |
|
| 6 | system user | | NULL | Connect | 1010764 | Waiting for work from SQL thread | NULL | 0.000 |
|
| 7 | system user | | NULL | Connect | 1010764 | Waiting for work from SQL thread | NULL | 0.000 |
|
| 8 | system user | | NULL | Connect | 944326 | Waiting for prior transaction to start commit before starting next transaction | NULL | 0.000 |
|
| 9 | system user | | NULL | Connect | 944324 | Waiting for prior transaction to start commit before starting next transaction | NULL | 0.000 |
|
| 10 | system user | | NULL | Connect | 944327 | Waiting for prior transaction to start commit before starting next transaction | NULL | 0.000 |
|
| 11 | system user | | NULL | Connect | 944326 | Waiting for prior transaction to start commit before starting next transaction | NULL | 0.000 |
|
| 12 | system user | | NULL | Connect | 944327 | Waiting for prior transaction to start commit before starting next transaction | NULL | 0.000 |
|
| 13 | system user | | NULL | Connect | 944326 | Waiting for prior transaction to start commit before starting next transaction | NULL | 0.000 |
|
| 14 | system user | | NULL | Connect | 944325 | Waiting for prior transaction to start commit before starting next transaction | NULL | 0.000 |
|
| 15 | system user | | NULL | Connect | 944328 | Waiting for prior transaction to start commit before starting next transaction | NULL | 0.000 |
|
| 16 | system user | | NULL | Connect | 944328 | Waiting for prior transaction to start commit before starting next transaction | NULL | 0.000 |
|
| 17 | system user | | NULL | Connect | 944324 | Waiting for prior transaction to start commit before starting next transaction | NULL | 0.000 |
|
| 18 | system user | | NULL | Connect | 944328 | Waiting for prior transaction to start commit before starting next transaction | NULL | 0.000 |
|
| 19 | system user | | NULL | Connect | 944324 | Waiting for prior transaction to start commit before starting next transaction | NULL | 0.000 |
|
| 20 | system user | | NULL | Connect | 944326 | Waiting for prior transaction to start commit before starting next transaction | NULL | 0.000 |
|
| 21 | system user | | NULL | Connect | 944324 | Waiting for prior transaction to start commit before starting next transaction | NULL | 0.000 |
|
| 22 | system user | | NULL | Connect | 944324 | Waiting for prior transaction to start commit before starting next transaction | NULL | 0.000 |
|
| 23 | system user | | NULL | Connect | 1010763 | Waiting for master to send event | NULL | 0.000 |
|
| 28 | openquery | localhost | NULL | Killed | 127489 | init | SHOW SLAVE STATUS | 0.000 |
|
| 291 | system user | | NULL | Connect | 854691 | Reading event from the relay log | NULL | 0.000 |
|
| 2711 | openquery | localhost | NULL | Query | 129024 | Killing slave | stop slave | 0.000 |
|
| 2715 | munin | localhost | information_schema | Killed | 128765 | Filling schema table | SHOW GLOBAL STATUS | 0.000 |
|
| 2716 | root | localhost | NULL | Killed | 128709 | init | show slave status | 0.000 |
|
| 2719 | munin | localhost | information_schema | Killed | 128703 | Filling schema table | SHOW GLOBAL STATUS | 0.000 |
|
| 2741 | openquery | localhost | NULL | Query | 0 | init | show processlist | 0.000 |
|
|
28 rows in set (0.00 sec)
|
Threads where well and truly halted before i killed off the global status threads.
(gdb) where
|
#0 0x000000348e40b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
|
#1 0x000000000056167f in terminate_slave_thread ()
|
#2 0x0000000000564d45 in terminate_slave_threads(Master_info*, int, bool) ()
|
#3 0x00000000005efabf in stop_slave(THD*, Master_info*, bool) ()
|
#4 0x00000000005d3fe2 in mysql_execute_command(THD*) ()
|
#5 0x00000000005d8552 in mysql_parse ()
|
#6 0x00000000005da7b0 in dispatch_command(enum_server_command, THD*, char*, unsigned int) ()
|
#7 0x00000000006963d3 in do_handle_one_connection(THD*) ()
|
#8 0x00000000006964a2 in handle_one_connection ()
|
#9 0x000000348e4079d1 in start_thread () from /lib64/libpthread.so.0
|
#10 0x000000348e0e8b6d in clone () from /lib64/libc.so.6
|
full bt in gdb-10.0-terminate_slave_thread.txt
Analysis
Positions are from objdump on the executable
Looking at the show slave status this is at the pthread_mutex_lock position 0x63551a in fill_status though a copy show pthread_mutex lock is from fill_status(635419)->show_status_array(62a94c)->show_heartbeat_period(54a982).
show_heartbeat_period is blocked acquiring the LOCK_active_mi.
fill_status(63551a) is waiting on LOCK_show_status just before show_status_array which means its waiting on the fill_status that reached show_heartbeat_period.
0x5d5953 in mysql_execute_command is waiting on a lock LOCK_active_mi for a show slave status command.
terminate_slave_thread looking at the disassembled code the following is the only cond_timedwait
mysql_cond_timedwait(term_cond, term_lock, &abstime)
|
and its waiting here.
thd->LOCK_thd_data is unlocked.
term_lock is sql_lock (in terminate_slave_threads) based on the position in the terminate_slave_threads, which is (mi->rli.run_lock) (which is locked) in stop_slave (by calling lock_slave_thread).
term_cond is mi->rli.stop_cond
We can see this is terminating the sql thread.
In both stack traces have handle_slave_sql(line 4676) calling in rpl_parallel::wait_for_done(6d52ec) at mysql_cond_wait(&e->COND_parallel_entry, &rpt->LOCK_rpl_thread); gdb-10.0-terminate_slave_thread.txt has two of these threads in the same place which strikes me as odd (wasn't multisource).
So the blocking rpl_parallel::wait_for_done at a lock point and is preventing the mysql_cond_broadcast(&rli->stop_cond); at the bottom of the handle_slave_sql thread.
So the point of being stuck is at rpl_parallel::wait_for_done(6d52ec) at mysql_cond_wait(&e->COND_parallel_entry, &rpt->LOCK_rpl_thread)