[MDEV-8039] parallel replication slave - indefinate lock wait in rpl_parallel::wait_for_done when terminate_slave_thread with stop slave while show global status/show slave status was running Created: 2015-04-23  Updated: 2020-08-25  Resolved: 2017-01-26

Status: Closed
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.0.17
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Daniel Black Assignee: Kristian Nielsen
Resolution: Cannot Reproduce Votes: 0
Labels: need_feedback, parallelslave, replication
Environment:

rhel6 10.0.17 master and slave


Attachments: HTML File all_threads_bt     Text File gdb-10.0-terminate_slave_thread.txt     Text File gdb.txt    
Issue Links:
Relates
relates to MDEV-7126 replication slave - deadlock in termi... Closed
relates to MDEV-7667 aborting stop slave doesn't recover Closed

 Description   

from MDEV-7126 - same symptoms - different case

reproduced on 10.0.17:

its quite likely this had an active IO thread and stopped SQL thread (was running pt-slave-delay --delay 24h --interval 1hr)

  mysql  -S /var/lib/mariadb/mysql-10.sock 
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 1438
Server version: 10.0.17-MariaDB-log MariaDB Server
 
Copyright (c) 2000, 2015, Oracle, MariaDB Corporation Ab and others.
 
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
 
MariaDB [(none)]> stop slave; set global slave_domain_parallel_threads=15; start slave;

]$ mysql -u openquery   -S /var/lib/mariadb/mysql-10.sock 
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 1443
Server version: 10.0.17-MariaDB-log MariaDB Server
 
Copyright (c) 2000, 2015, Oracle, MariaDB Corporation Ab and others.
 
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
 
MariaDB [(none)]> show processlist;
+------+-------------+-----------+--------------------+---------+--------+--------------------------------------------------------------------------------+--------------------+----------+
| Id   | User        | Host      | db                 | Command | Time   | State                                                                          | Info               | Progress |
+------+-------------+-----------+--------------------+---------+--------+--------------------------------------------------------------------------------+--------------------+----------+
|    3 | system user |           | NULL               | Connect | 264856 | Waiting for work from SQL thread                                               | NULL               |    0.000 |
|    4 | system user |           | NULL               | Connect | 264856 | Waiting for work from SQL thread                                               | NULL               |    0.000 |
|    5 | system user |           | NULL               | Connect | 264856 | Waiting for work from SQL thread                                               | NULL               |    0.000 |
|    6 | system user |           | NULL               | Connect | 264856 | Waiting for work from SQL thread                                               | NULL               |    0.000 |
|    7 | system user |           | NULL               | Connect | 264856 | Waiting for work from SQL thread                                               | NULL               |    0.000 |
|    8 | system user |           | NULL               | Connect | 264856 | Waiting for work from SQL thread                                               | NULL               |    0.000 |
|    9 | system user |           | NULL               | Connect | 264856 | Waiting for work from SQL thread                                               | NULL               |    0.000 |
|   10 | system user |           | NULL               | Connect | 264856 | Waiting for work from SQL thread                                               | NULL               |    0.000 |
|   11 | system user |           | NULL               | Connect | 264856 | Waiting for work from SQL thread                                               | NULL               |    0.000 |
|   12 | system user |           | NULL               | Connect | 264856 | Waiting for work from SQL thread                                               | NULL               |    0.000 |
|   13 | system user |           | NULL               | Connect | 114397 | Waiting for prior transaction to start commit before starting next transaction | NULL               |    0.000 |
|   14 | system user |           | NULL               | Connect | 114397 | Waiting for prior transaction to start commit before starting next transaction | NULL               |    0.000 |
|   15 | system user |           | NULL               | Connect | 114397 | Waiting for prior transaction to start commit before starting next transaction | NULL               |    0.000 |
|   16 | system user |           | NULL               | Connect | 114396 | Waiting for prior transaction to start commit before starting next transaction | NULL               |    0.000 |
|   17 | system user |           | NULL               | Connect | 114398 | Waiting for prior transaction to start commit before starting next transaction | NULL               |    0.000 |
|   18 | system user |           | NULL               | Connect | 114396 | Waiting for prior transaction to start commit before starting next transaction | NULL               |    0.000 |
|   19 | system user |           | NULL               | Connect | 114398 | Waiting for prior transaction to start commit before starting next transaction | NULL               |    0.000 |
|   20 | system user |           | NULL               | Connect | 114398 | Waiting for prior transaction to start commit before starting next transaction | NULL               |    0.000 |
|   21 | system user |           | NULL               | Connect | 114396 | Waiting for prior transaction to start commit before starting next transaction | NULL               |    0.000 |
|   22 | system user |           | NULL               | Connect | 114397 | Waiting for prior transaction to start commit before starting next transaction | NULL               |    0.000 |
|   24 | system user |           | NULL               | Connect | 264856 | Waiting for master to send event                                               | NULL               |    0.000 |
|   39 | openquery   | localhost | NULL               | Sleep   |    907 |                                                                                | NULL               |    0.000 |
| 1349 | system user |           | NULL               | Connect |  26107 | Reading event from the relay log                                               | NULL               |    0.000 |
| 1438 | root        | localhost | NULL               | Query   |    147 | Killing slave                                                                  | stop slave         |    0.000 |
| 1440 | munin       | localhost | information_schema | Query   |     76 | Filling schema table                                                           | SHOW GLOBAL STATUS |    0.000 |
| 1441 | root        | localhost | NULL               | Query   |     41 | init                                                                           | show slave status  |    0.000 |
| 1442 | munin       | localhost | information_schema | Query   |     14 | Filling schema table                                                           | SHOW GLOBAL STATUS |    0.000 |
| 1443 | openquery   | localhost | NULL               | Query   |      0 | init                                                                           | show processlist   |    0.000 |
+------+-------------+-----------+--------------------+---------+--------+--------------------------------------------------------------------------------+--------------------+----------+

SQL thread was stopped - set password on non-existent user.

MariaDB [(none)]> set global sql_slave_skip_counter=1; start slave;
ERROR 1198 (HY000): This operation cannot be performed as you have a running slave ''; run STOP SLAVE '' first
Query OK, 0 rows affected, 1 warning (0.00 sec)
 
MariaDB [(none)]> stop slave;set global sql_slave_skip_counter=1; start slave;

erver version: 10.0.17-MariaDB-log MariaDB Server
 
Copyright (c) 2000, 2015, Oracle, MariaDB Corporation Ab and others.
 
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
 
MariaDB [(none)]> show processlist;
| Id   | User        | Host      | db                 | Command | Time    | State                                                                          | Info               | Progress |
|    3 | system user |           | NULL               | Connect | 1010764 | Waiting for work from SQL thread                                               | NULL               |    0.000 |
|    4 | system user |           | NULL               | Connect | 1010764 | Waiting for work from SQL thread                                               | NULL               |    0.000 |
|    5 | system user |           | NULL               | Connect | 1010764 | Waiting for work from SQL thread                                               | NULL               |    0.000 |
|    6 | system user |           | NULL               | Connect | 1010764 | Waiting for work from SQL thread                                               | NULL               |    0.000 |
|    7 | system user |           | NULL               | Connect | 1010764 | Waiting for work from SQL thread                                               | NULL               |    0.000 |
|    8 | system user |           | NULL               | Connect |  944326 | Waiting for prior transaction to start commit before starting next transaction | NULL               |    0.000 |
|    9 | system user |           | NULL               | Connect |  944324 | Waiting for prior transaction to start commit before starting next transaction | NULL               |    0.000 |
|   10 | system user |           | NULL               | Connect |  944327 | Waiting for prior transaction to start commit before starting next transaction | NULL               |    0.000 |
|   11 | system user |           | NULL               | Connect |  944326 | Waiting for prior transaction to start commit before starting next transaction | NULL               |    0.000 |
|   12 | system user |           | NULL               | Connect |  944327 | Waiting for prior transaction to start commit before starting next transaction | NULL               |    0.000 |
|   13 | system user |           | NULL               | Connect |  944326 | Waiting for prior transaction to start commit before starting next transaction | NULL               |    0.000 |
|   14 | system user |           | NULL               | Connect |  944325 | Waiting for prior transaction to start commit before starting next transaction | NULL               |    0.000 |
|   15 | system user |           | NULL               | Connect |  944328 | Waiting for prior transaction to start commit before starting next transaction | NULL               |    0.000 |
|   16 | system user |           | NULL               | Connect |  944328 | Waiting for prior transaction to start commit before starting next transaction | NULL               |    0.000 |
|   17 | system user |           | NULL               | Connect |  944324 | Waiting for prior transaction to start commit before starting next transaction | NULL               |    0.000 |
|   18 | system user |           | NULL               | Connect |  944328 | Waiting for prior transaction to start commit before starting next transaction | NULL               |    0.000 |
|   19 | system user |           | NULL               | Connect |  944324 | Waiting for prior transaction to start commit before starting next transaction | NULL               |    0.000 |
|   20 | system user |           | NULL               | Connect |  944326 | Waiting for prior transaction to start commit before starting next transaction | NULL               |    0.000 |
|   21 | system user |           | NULL               | Connect |  944324 | Waiting for prior transaction to start commit before starting next transaction | NULL               |    0.000 |
|   22 | system user |           | NULL               | Connect |  944324 | Waiting for prior transaction to start commit before starting next transaction | NULL               |    0.000 |
|   23 | system user |           | NULL               | Connect | 1010763 | Waiting for master to send event                                               | NULL               |    0.000 |
|   28 | openquery   | localhost | NULL               | Killed  |  127489 | init                                                                           | SHOW SLAVE STATUS  |    0.000 |
|  291 | system user |           | NULL               | Connect |  854691 | Reading event from the relay log                                               | NULL               |    0.000 |
| 2711 | openquery   | localhost | NULL               | Query   |  129024 | Killing slave                                                                  | stop slave         |    0.000 |
| 2715 | munin       | localhost | information_schema | Killed  |  128765 | Filling schema table                                                           | SHOW GLOBAL STATUS |    0.000 |
| 2716 | root        | localhost | NULL               | Killed  |  128709 | init                                                                           | show slave status  |    0.000 |
| 2719 | munin       | localhost | information_schema | Killed  |  128703 | Filling schema table                                                           | SHOW GLOBAL STATUS |    0.000 |
| 2741 | openquery   | localhost | NULL               | Query   |       0 | init                                                                           | show processlist   |    0.000 |
 
28 rows in set (0.00 sec)

Threads where well and truly halted before i killed off the global status threads.

(gdb) where
#0  0x000000348e40b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000056167f in terminate_slave_thread ()
#2  0x0000000000564d45 in terminate_slave_threads(Master_info*, int, bool) ()
#3  0x00000000005efabf in stop_slave(THD*, Master_info*, bool) ()
#4  0x00000000005d3fe2 in mysql_execute_command(THD*) ()
#5  0x00000000005d8552 in mysql_parse ()
#6  0x00000000005da7b0 in dispatch_command(enum_server_command, THD*, char*, unsigned int) ()
#7  0x00000000006963d3 in do_handle_one_connection(THD*) ()
#8  0x00000000006964a2 in handle_one_connection ()
#9  0x000000348e4079d1 in start_thread () from /lib64/libpthread.so.0
#10 0x000000348e0e8b6d in clone () from /lib64/libc.so.6

full bt in gdb-10.0-terminate_slave_thread.txt

Analysis

Positions are from objdump on the executable

Looking at the show slave status this is at the pthread_mutex_lock position 0x63551a in fill_status though a copy show pthread_mutex lock is from fill_status(635419)->show_status_array(62a94c)->show_heartbeat_period(54a982).

show_heartbeat_period is blocked acquiring the LOCK_active_mi.

fill_status(63551a) is waiting on LOCK_show_status just before show_status_array which means its waiting on the fill_status that reached show_heartbeat_period.

0x5d5953 in mysql_execute_command is waiting on a lock LOCK_active_mi for a show slave status command.

terminate_slave_thread looking at the disassembled code the following is the only cond_timedwait

mysql_cond_timedwait(term_cond, term_lock, &abstime)

and its waiting here.

thd->LOCK_thd_data is unlocked.
term_lock is sql_lock (in terminate_slave_threads) based on the position in the terminate_slave_threads, which is (mi->rli.run_lock) (which is locked) in stop_slave (by calling lock_slave_thread).
term_cond is mi->rli.stop_cond

We can see this is terminating the sql thread.

In both stack traces have handle_slave_sql(line 4676) calling in rpl_parallel::wait_for_done(6d52ec) at mysql_cond_wait(&e->COND_parallel_entry, &rpt->LOCK_rpl_thread); gdb-10.0-terminate_slave_thread.txt has two of these threads in the same place which strikes me as odd (wasn't multisource).

So the blocking rpl_parallel::wait_for_done at a lock point and is preventing the mysql_cond_broadcast(&rli->stop_cond); at the bottom of the handle_slave_sql thread.

So the point of being stuck is at rpl_parallel::wait_for_done(6d52ec) at mysql_cond_wait(&e->COND_parallel_entry, &rpt->LOCK_rpl_thread)



 Comments   
Comment by Daniel Black [ 2015-05-03 ]

Slave status locks fixed in MDEV-7126 https://github.com/MariaDB/server/commit/f8320210e78232fed350754faf72d14af63d0c3e

Comment by Elena Stepanova [ 2015-05-04 ]

Got one as well:

Thread 3 (Thread 0x7fc2dd490700 (LWP 20035)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:215
#1  0x0000000000eed6fd in safe_cond_timedwait (cond=0x7fc30c1ebdf8, mp=0x7fc30c1ebd10, abstime=0x7fc2dd48e4b0, file=0xf8bbe0 "10.1/include/mysql/psi/mysql_thread.h", line=1202) at 10.1/mysys/thr_mutex.c:547
#2  0x00000000005dcfca in inline_mysql_cond_timedwait (that=0x7fc30c1ebdf8, mutex=0x7fc30c1ebd10, abstime=0x7fc2dd48e4b0, src_file=0xf8c2f0 "10.1/sql/slave.cc", src_line=791) at 10.1/include/mysql/psi/mysql_thread.h:1202
#3  0x00000000005debb2 in terminate_slave_thread (thd=0x7fc2dc0e3070, term_lock=0x7fc30c1ebd10, term_cond=0x7fc30c1ebdf8, slave_running=0x7fc30c1ec7ec, skip_lock=true) at 10.1/sql/slave.cc:791
#4  0x00000000005de554 in terminate_slave_threads (mi=0x7fc30c1e8000, thread_mask=3, skip_lock=true) at 10.1/sql/slave.cc:636
#5  0x00000000006b8fb7 in stop_slave (thd=0x7fc2eecf5070, mi=0x7fc30c1e8000, net_report=true) at 10.1/sql/sql_repl.cc:3186
#6  0x0000000000686841 in mysql_execute_command (thd=0x7fc2eecf5070) at 10.1/sql/sql_parse.cc:3530
#7  0x000000000069218b in mysql_parse (thd=0x7fc2eecf5070, rawbuf=0x7fc2e6ec5088 "STOP SLAVE SQL_THREAD, IO_THREAD", length=32, parser_state=0x7fc2dd48f110) at 10.1/sql/sql_parse.cc:7165
#8  0x0000000000681199 in dispatch_command (command=COM_QUERY, thd=0x7fc2eecf5070, packet=0x7fc2eec92071 "STOP SLAVE SQL_THREAD, IO_THREAD", packet_length=32) at 10.1/sql/sql_parse.cc:1462
#9  0x000000000067ff63 in do_command (thd=0x7fc2eecf5070) at 10.1/sql/sql_parse.cc:1090
#10 0x00000000007ad7a0 in do_handle_one_connection (thd_arg=0x7fc2eecf5070) at 10.1/sql/sql_connect.cc:1347
#11 0x00000000007ad4e5 in handle_one_connection (arg=0x7fc2eecf5070) at 10.1/sql/sql_connect.cc:1258
#12 0x00007fc30dab6e9a in start_thread (arg=0x7fc2dd490700) at pthread_create.c:308
#13 0x00007fc30d1e78bd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#14 0x0000000000000000 in ?? ()

Thread 5 (Thread 0x7fc2dde8b700 (LWP 20028)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:215
#1  0x0000000000eed6fd in safe_cond_timedwait (cond=0x7fc2eefc8218, mp=0x7fc2eefc8168, abstime=0x7fc2dde87aa0, file=0xfd2760 "10.1/include/mysql/psi/mysql_thread.h", line=1202) at 10.1/mysys/thr_mutex.c:547
#2  0x00000000007b6784 in inline_mysql_cond_timedwait (that=0x7fc2eefc8218, mutex=0x7fc2eefc8168, abstime=0x7fc2dde87aa0, src_file=0xfd29c8 "10.1/sql/mdl.cc", src_line=1066) at 10.1/include/mysql/psi/mysql_thread.h:1202
#3  0x00000000007b7902 in MDL_wait::timed_wait (this=0x7fc2eefc8168, owner=0x7fc2eefc8120, abs_timeout=0x7fc2dde87aa0, set_status_on_timeout=false, wait_state_name=0x170e6f0) at 10.1/sql/mdl.cc:1066
#4  0x00000000007b92ab in MDL_context::acquire_lock (this=0x7fc2eefc8168, mdl_request=0x7fc2dde87b70, lock_wait_timeout=31536000) at 10.1/sql/mdl.cc:2039
#5  0x00000000007b9a2a in MDL_context::upgrade_shared_lock (this=0x7fc2eefc8168, mdl_ticket=0x7fc30c01f2c0, new_type=MDL_EXCLUSIVE, lock_wait_timeout=31536000) at 10.1/sql/mdl.cc:2232
#6  0x000000000073f186 in mysql_inplace_alter_table (thd=0x7fc2eefc8070, table_list=0x7fc2eefce128, table=0x7fc3072dcc70, altered_table=0x7fc2eefd5070, ha_alter_info=0x7fc2dde87eb0, inplace_supported=HA_ALTER_INPLACE_NO_LOCK_AFTER_PREPARE, target_mdl_request=0x7fc2dde89460, alter_ctx=0x7fc2dde87fd0) at 10.1/sql/sql_table.cc:6944
#7  0x0000000000743c9b in mysql_alter_table (thd=0x7fc2eefc8070, new_db=0x7fc2eefce708 "test", new_name=0x0, create_info=0x7fc2dde896a0, table_list=0x7fc2eefce128, alter_info=0x7fc2dde897d0, order_num=0, order=0x0, ignore=false) at 10.1/sql/sql_table.cc:8821
#8  0x00000000007b2325 in Sql_cmd_alter_table::execute (this=0x7fc2eefce8a8, thd=0x7fc2eefc8070) at 10.1/sql/sql_alter.cc:325
#9  0x000000000068de51 in mysql_execute_command (thd=0x7fc2eefc8070) at 10.1/sql/sql_parse.cc:5558
#10 0x000000000069218b in mysql_parse (thd=0x7fc2eefc8070, rawbuf=0x7fc2e6d4d1b9 "ALTER TABLE table11_innodb ADD COLUMN `default` INTEGER DEFAULT 0", length=65, parser_state=0x7fc2dde8a9f0) at 10.1/sql/sql_parse.cc:7165
#11 0x000000000099f402 in Query_log_event::do_apply_event (this=0x7fc2e6c1f230, rgi=0x7fc2e6df0800, query_arg=0x7fc2e6d4d1b9 "ALTER TABLE table11_innodb ADD COLUMN `default` INTEGER DEFAULT 0", q_len_arg=65) at 10.1/sql/log_event.cc:4287
#12 0x000000000099e6ed in Query_log_event::do_apply_event (this=0x7fc2e6c1f230, rgi=0x7fc2e6df0800) at 10.1/sql/log_event.cc:4013
#13 0x00000000005f0654 in Log_event::apply_event (this=0x7fc2e6c1f230, rgi=0x7fc2e6df0800) at 10.1/sql/log_event.h:1347
#14 0x00000000005e64df in apply_event_and_update_pos (ev=0x7fc2e6c1f230, thd=0x7fc2eefc8070, rgi=0x7fc2e6df0800, rpt=0x7fc2dc008380) at 10.1/sql/slave.cc:3274
#15 0x00000000008179e7 in rpt_handle_event (qev=0x7fc2e6db7a70, rpt=0x7fc2dc008380) at 10.1/sql/rpl_parallel.cc:49
#16 0x0000000000819b6c in handle_rpl_parallel_thread (arg=0x7fc2dc008380) at 10.1/sql/rpl_parallel.cc:942
#17 0x00007fc30dab6e9a in start_thread (arg=0x7fc2dde8b700) at pthread_create.c:308
#18 0x00007fc30d1e78bd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#19 0x0000000000000000 in ?? ()

Thread 6 (Thread 0x7fc2ddf1d700 (LWP 20027)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x0000000000eed3f3 in safe_cond_wait (cond=0x7fc2e6df10e8, mp=0x7fc2e6df1038, file=0xf9c780 "10.1/include/mysql/psi/mysql_thread.h", line=1165) at 10.1/mysys/thr_mutex.c:493
#2  0x000000000063f9ce in inline_mysql_cond_wait (that=0x7fc2e6df10e8, mutex=0x7fc2e6df1038, src_file=0xf9cfe8 "10.1/sql/sql_class.cc", src_line=6853) at 10.1/include/mysql/psi/mysql_thread.h:1165
#3  0x000000000065004e in wait_for_commit::wait_for_prior_commit2 (this=0x7fc2e6df1038, thd=0x7fc2e6c95070) at 10.1/sql/sql_class.cc:6853
#4  0x000000000062fbc3 in wait_for_commit::wait_for_prior_commit (this=0x7fc2e6df1038, thd=0x7fc2e6c95070) at 10.1/sql/sql_class.h:1780
#5  0x000000000062fed2 in THD::wait_for_prior_commit (this=0x7fc2e6c95070) at 10.1/sql/sql_class.h:3781
#6  0x00000000008af50c in ha_commit_one_phase (thd=0x7fc2e6c95070, all=true) at 10.1/sql/handler.cc:1536
#7  0x00000000008aee1d in ha_commit_trans (thd=0x7fc2e6c95070, all=true) at 10.1/sql/handler.cc:1406
#8  0x00000000007c14e9 in trans_commit (thd=0x7fc2e6c95070) at 10.1/sql/transaction.cc:236
#9  0x00000000009a699d in Xid_log_event::do_apply_event (this=0x7fc2dc024570, rgi=0x7fc2e6df1000) at 10.1/sql/log_event.cc:7393
#10 0x00000000005f0654 in Log_event::apply_event (this=0x7fc2dc024570, rgi=0x7fc2e6df1000) at 10.1/sql/log_event.h:1347
#11 0x00000000005e64df in apply_event_and_update_pos (ev=0x7fc2dc024570, thd=0x7fc2e6c95070, rgi=0x7fc2e6df1000, rpt=0x7fc2dc0081b0) at 10.1/sql/slave.cc:3274
#12 0x00000000008179e7 in rpt_handle_event (qev=0x7fc2e6db8970, rpt=0x7fc2dc0081b0) at 10.1/sql/rpl_parallel.cc:49
#13 0x0000000000819b6c in handle_rpl_parallel_thread (arg=0x7fc2dc0081b0) at 10.1/sql/rpl_parallel.cc:942
#14 0x00007fc30dab6e9a in start_thread (arg=0x7fc2ddf1d700) at pthread_create.c:308
#15 0x00007fc30d1e78bd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#16 0x0000000000000000 in ?? ()

All threads' bt attached as all_threads_bt.

10.1, git commit 53382ac128.

Slave command line:

00:00:59 /home/elenst/git/10.1/sql/mysqld --no-defaults --basedir=/home/elenst/git/10.1 --datadir=/home/elenst/test_results/analyze_slave/data --lc-messages-dir=/home/elenst/git/10.1/sql/share --character-sets-dir=/home/elenst/git/10.1/sql/share/charsets --tmpdir=/home/elenst/test_results/analyze_slave/tmp --core-file --max-allowed-packet=128Mb --port=10732 --socket=/home/elenst/test_results/analyze_slave/mysql.sock --pid-file=/home/elenst/test_results/analyze_slave/mysql.pid --general-log --general-log-file=/home/elenst/test_results/analyze_slave/mysql.log --server_id=2 --report-host=127.0.0.1 --report_port=10732 --log_output=FILE --slow_query_log --log_bin_trust_function_creators=1 --query_cache_size=64M --slave-skip-errors=1054,1317,1049,1305,1539,1505 --slave-parallel-mode=optimistic --binlog_commit_wait_count=20 --binlog_commit_wait_usec=1000000 --slave-parallel-threads=20 --max-binlog-size=4096 --innodb --sql-mode=no_engine_substitution

End of slave error log:

150502 21:33:31 [Note] Slave I/O thread: connected to master 'root@127.0.0.1:10730',replication starts at GTID position '0-1-65'
150502 21:33:31 [Note] Slave SQL thread initialized, starting replication in log 'mysql-bin.000004' at position 1755, relay log './mysql-relay-bin.000001' position: 4; GTID position '0-1-65'
150502 21:34:02 [Note] Error reading relay log event: slave SQL thread was killed
150502 21:34:02 [ERROR] Error reading packet from server: Lost connection to MySQL server during query ( server_errno=2013)
150502 21:34:02 [Note] Slave I/O thread killed while reading event
150502 21:34:02 [Note] Slave I/O thread exiting, read up to log 'mysql-bin.000012', position 3799; GTID position 0-1-241
150502 21:34:22 [Note] Slave I/O thread: connected to master 'root@127.0.0.1:10730',replication starts at GTID position '0-1-203'
150502 21:34:22 [Note] Slave SQL thread initialized, starting replication in log 'mysql-bin.000011' at position 534, relay log './mysql-relay-bin.000001' position: 4; GTID position '0-1-203'
150502 21:34:32 [Note] Error reading relay log event: slave SQL thread was killed
150503  5:35:08 [ERROR] Error reading packet from server: Lost connection to MySQL server during query ( server_errno=2013)
150503  5:35:08 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mysql-bin.000020' at position 4146; GTID position '0-1-399'
150503  5:35:08 [ERROR] Slave I/O: error reconnecting to master 'root@127.0.0.1:10730' - retry-time: 1  retries: 86400  message: Can't connect to MySQL server on '127.0.0.1' (111 "Connection refused"), Internal MariaDB error code: 2003
150503 15:25:18 [Note] /home/elenst/git/10.1/sql/mysqld: Normal shutdown
 
150503 15:25:18 [Note] Event Scheduler: Purging the queue. 0 events
150503 15:25:19 [Note] Slave I/O thread killed during or after a reconnect done to recover from failed read
150503 15:25:19 [Note] Slave I/O thread exiting, read up to log 'mysql-bin.000020', position 4146; GTID position 0-1-399

Test command line:

perl /home/elenst/bzr/randgen-mariadb-patches/runall-new.pl --no-mask --seed=1430530643 --threads=16 --duration=400 --queries=100M --reporters=QueryTimeout,Backtrace,ErrorLog,Deadlock,ReplicationThreadRestarter --redefine=conf/mariadb/general-workarounds.yy --redefine=conf/mariadb/10.0-features-redefine.yy --mysqld=--log_output=FILE --mysqld=--slow_query_log --mysqld=--log_bin_trust_function_creators=1 --mysqld=--query_cache_size=64M --engine=InnoDB --views --grammar=conf/runtime/metadata_stability.yy --gendata=conf/runtime/metadata_stability.zz --rpl_mode=statement --mysqld=--slave-skip-errors=1054,1317,1049,1305,1539,1505 --mysqld=--slave-parallel-mode=optimistic --mysqld=--binlog_commit_wait_count=20 --mysqld=--binlog_commit_wait_usec=1000000 --mysqld=--slave-parallel-threads=20 --use-gtid=current_pos --mysqld=--max-binlog-size=4096 --mtr-build-thread=73 --basedir1=/home/elenst/git/10.1 --vardir1=/home/elenst/test_results/analyze

Comment by Kristian Nielsen [ 2015-05-07 ]

08:35 < knielsen> elenst: can you supply binlogs of MDEV-8039?
08:38 < knielsen> elenst: binlog from slave (if enabled) would be useful also, and general log of slave
08:39 < knielsen> elenst: and value of @@gtid_slave_pos on slave, if you still have it hanging

Comment by Kristian Nielsen [ 2015-05-07 ]

Daniel, I agree that it is odd that there are two threads in
handle_slave_sql(). This indicates that two SQL threads are running. I do
not see how that is possible without using multi-source replication?
There are also two sets of parallel replication worker threads.

Anyway, this case seems unrelated to show slave status. What we see is 10
worker threads idle, and 10 worker threads in "Waiting for prior transaction
to start commit before starting next transaction".
(In the second case, there is an additional set of 5 idle and 15 waiting
worker threads).

This indicates that the parallel replication worker threads have become
stuck, independent of anything else going on. This causes STOP SLAVE to
block waiting for the stuck worker threads, and in turn SHOW STATUS to
block, but that seems unrelated. All the worker threads are idle, not
executing any transactions, and all waiting (incorrectly).

There were several bugs fixes recently where this could happen. Basically,
some internal lists of group_commit_orderer objects gets corrupt, and a
wakeup is missed, causing all threads to get stuck waiting. These should all
be fixed in 10.0.18, which I believe is due for release shortly?

We can think about trying to track down why it hangs, but I would prefer to
first see if 10.0.18 fixes it. Tracking down these hangs can be very hard,
and would require at least very carefully collected binlog around the
point of hang, SHOW FULL PROCESSLIST, and error log, all painstakingly
correct, complete, and matching each other.

When this condition happens, it is possible to recover by using KILL queries
against the threads stuck in "Waiting for prior transaction to start commit
before starting next transaction". Note that it may be necessary to kill
multiple times, as they have to be killed in the correct order, which is
unknown to the user.

Comment by Kristian Nielsen [ 2015-05-07 ]

So to clarify, Elenas and Daniels conditions seem unrelated.

In Daniel's case, the worker threads have become stuck, all waiting for
another thread to start first.

In Elena's case, there is an ALTER TABLE stuck waiting on a table lock while
three other transactions have completed and are waiting for a prior
transaction to commit first (presumably the ALTER TABLE?). So this looks
like incorrectly executing the ALTER TABLE ddl speculatively, hopefully logs
can help clarify this.

Comment by Daniel Black [ 2015-05-07 ]

thanks knielsen. Happy to see if 10.0.18 fixes in it, especially in light of contradictory information on my part.

Thanks for looking. Once released I'll hammer the stop slave/start slave in parallel replication for a while. If I come across anything I'll be sure the collect those information bits carefully.

Comment by Elena Stepanova [ 2015-05-07 ]

08:35 < knielsen> elenst: can you supply binlogs of MDEV-8039?
08:38 < knielsen> elenst: binlog from slave (if enabled) would be useful also, and general log of slave
08:39 < knielsen> elenst: and value of @@gtid_slave_pos on slave, if you still have it hanging

So to clarify, Elenas and Daniels conditions seem unrelated.

In order to separate unrelated issues, the new JIRA report MDEV-8113 has been created, it contains the requested information.

Comment by Daniel Black [ 2017-01-26 ]

Apologies that this has been waiting for feedback for years. Due to employer changes I cannot test this as promised.

Generated at Thu Feb 08 07:24:10 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.