[MDEV-7568] STOP SLAVE crashes the server Created: 2015-02-10  Updated: 2015-03-02  Resolved: 2015-03-02

Status: Closed
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.0.16
Fix Version/s: 10.0.17

Type: Bug Priority: Critical
Reporter: Hoang Xuan Phu Assignee: Kristian Nielsen
Resolution: Fixed Votes: 0
Labels: multisource, parallelslave, verified
Environment:

NixOS on AWS


Attachments: File master-my.cnf     File slave-my.cnf     Text File slaves-status.txt    

 Description   

I hope "Critical" is the right priority. It's a crash, after all.

The crash only happens with one particular replication source (there are 8 in total). This is our production system, so it's quite a hassle to run a debug build, but let me know if the information here is not enough and I'll see what I can do.

There is no *.err file, neither in the database directory nor anywhere else in the filesystem.

Here's the crash log:

Feb 10 03:53:13 nyancat mysqld[28898]: 150210  3:53:13 [ERROR] mysqld got signal 11 ;
Feb 10 03:53:13 nyancat mysqld[28898]: This could be because you hit a bug. It is also possible that this binary
Feb 10 03:53:13 nyancat mysqld[28898]: or one of the libraries it was linked against is corrupt, improperly built,
Feb 10 03:53:13 nyancat mysqld[28898]: or misconfigured. This error can also be caused by malfunctioning hardware.
Feb 10 03:53:13 nyancat mysqld[28898]: To report this bug, see http://kb.askmonty.org/en/reporting-bugs
Feb 10 03:53:13 nyancat mysqld[28898]: We will try our best to scrape up some info that will hopefully help
Feb 10 03:53:13 nyancat mysqld[28898]: diagnose the problem, but since we have already crashed,
Feb 10 03:53:13 nyancat mysqld[28898]: something is definitely wrong and this may fail.
Feb 10 03:53:13 nyancat mysqld[28898]: Server version: 10.0.16-MariaDB-log
Feb 10 03:53:14 nyancat mysqld[28898]: key_buffer_size=134217728
Feb 10 03:53:14 nyancat mysqld[28898]: read_buffer_size=131072
Feb 10 03:53:14 nyancat mysqld[28898]: max_used_connections=12
Feb 10 03:53:14 nyancat mysqld[28898]: max_threads=1002
Feb 10 03:53:14 nyancat mysqld[28898]: thread_count=6
Feb 10 03:53:14 nyancat mysqld[28898]: It is possible that mysqld could use up to
Feb 10 03:53:14 nyancat mysqld[28898]: key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 4384243 K  bytes of memory
Feb 10 03:53:14 nyancat mysqld[28898]: Hope that's ok; if not, decrease some variables in the equation.
Feb 10 03:53:14 nyancat mysqld[28898]: Thread pointer: 0x0x7f5310001268
Feb 10 03:53:14 nyancat mysqld[28898]: Attempting backtrace. You can use the following information to find out
Feb 10 03:53:14 nyancat mysqld[28898]: where mysqld died. If you see no messages after this, something went
Feb 10 03:53:14 nyancat mysqld[28898]: terribly wrong...
Feb 10 03:53:14 nyancat mysqld[28898]: stack_bottom = 0x7f53600e6970 thread_stack 0x48000
Feb 10 03:53:14 nyancat mysqld[28898]: (my_addr_resolve failure: fork)
Feb 10 03:53:14 nyancat mysqld[28898]: /nix/store/gqs2rkr5p379577px4fhpm5q8dzs4s13-mariadb-10.0.16/bin/mysqld(my_print_stacktrace+0x29) [0xb75919]
Feb 10 03:53:14 nyancat mysqld[28898]: /nix/store/gqs2rkr5p379577px4fhpm5q8dzs4s13-mariadb-10.0.16/bin/mysqld(handle_fatal_signal+0x398) [0x721868]
Feb 10 03:53:14 nyancat mysqld[28898]: /nix/store/gpaawfk6qvb1wdj4vrwvdgp4yqfcqqgf-glibc-2.19/lib/libpthread.so.0(+0xf3a0) [0x7f593f0853a0]
Feb 10 03:53:14 nyancat mysqld[28898]: /nix/store/gpaawfk6qvb1wdj4vrwvdgp4yqfcqqgf-glibc-2.19/lib/libpthread.so.0(pthread_cond_wait+0x8f) [0x7f593f081a8f]
Feb 10 03:53:14 nyancat mysqld[28898]: /nix/store/gqs2rkr5p379577px4fhpm5q8dzs4s13-mariadb-10.0.16/bin/mysqld(rpl_parallel_thread_pool::get_thread(rpl_parallel_thread**, rpl_parallel_entry*)+0xfb) [0x6cdb9b]
Feb 10 03:53:14 nyancat mysqld[28898]: /nix/store/gqs2rkr5p379577px4fhpm5q8dzs4s13-mariadb-10.0.16/bin/mysqld(rpl_parallel_entry::choose_thread(rpl_group_info*, bool*, PSI_stage_info_v1*, bool)+0x22d) [0x6cfb4d]
Feb 10 03:53:14 nyancat mysqld[28898]: /nix/store/gqs2rkr5p379577px4fhpm5q8dzs4s13-mariadb-10.0.16/bin/mysqld(rpl_parallel::do_event(rpl_group_info*, Log_event*, unsigned long long)+0x15a) [0x6d081a]
Feb 10 03:53:14 nyancat mysqld[28898]: /nix/store/gqs2rkr5p379577px4fhpm5q8dzs4s13-mariadb-10.0.16/bin/mysqld(handle_slave_sql+0x18ca) [0x55956a]
Feb 10 03:53:14 nyancat mysqld[28898]: /nix/store/gpaawfk6qvb1wdj4vrwvdgp4yqfcqqgf-glibc-2.19/lib/libpthread.so.0(+0x6f8a) [0x7f593f07cf8a]
Feb 10 03:53:14 nyancat mysqld[28898]: /nix/store/gpaawfk6qvb1wdj4vrwvdgp4yqfcqqgf-glibc-2.19/lib/libc.so.6(clone+0x6d) [0x7f593d8f10bd]
Feb 10 03:53:14 nyancat mysqld[28898]: Trying to get some variables.
Feb 10 03:53:14 nyancat mysqld[28898]: Some pointers may be invalid and cause the dump to abort.
Feb 10 03:53:14 nyancat mysqld[28898]: Query (0x0): is an invalid pointer
Feb 10 03:53:14 nyancat mysqld[28898]: Connection ID (thread ID): 10
Feb 10 03:53:14 nyancat mysqld[28898]: Status: NOT_KILLED
Feb 10 03:53:14 nyancat mysqld[28898]: Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,
Feb 10 03:53:14 nyancat mysqld[28898]: The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
Feb 10 03:53:14 nyancat mysqld[28898]: information that should help you find out what is causing the crash.
Feb 10 03:53:15 nyancat systemd[1]: mysql.service: main process exited, code=exited, status=1/FAILURE
Feb 10 03:53:15 nyancat systemd[1]: mysql.service: control process exited, code=exited status=1
Feb 10 03:53:15 nyancat systemd[1]: Unit mysql.service entered failed state.
Feb 10 03:53:15 nyancat systemd[1]: mysql.service failed.
Feb 10 03:53:15 nyancat mysqladmin[32455]: [116B blob data]
Feb 10 03:53:15 nyancat mysqladmin[32455]: error: 'Can't connect to local MySQL server through socket '/tmp/mysql.sock' (111 "Connection refused")'
Feb 10 03:53:15 nyancat mysqladmin[32455]: Check that mysqld is running and that the socket: '/tmp/mysql.sock' exists!



 Comments   
Comment by Elena Stepanova [ 2015-02-10 ]

Hi,

Your error log apparently goes to syslog/messages or similar, which is where you copied the stack trace from. Can you copy-paste the whole mysqld output, from server startup until the crash?

Can you describe your replication topology in more detail, and say which of its participants crashes? Does the slave which crashes also get updated directly, or does it only receive updates from the master?

Please attach your cnf file(s) from the master and the slave involved in the crash.

After the server restarts following the crash and reconnects to the master, it logs which position it resumes replication from. If you can provide the binary log covering this position and some time before it, that could also help.

Thanks.

Comment by Hoang Xuan Phu [ 2015-02-10 ]

That log is from journalctl (systemd); the system doesn't have a separate log file for mysqld, at least not that I know of.

The slave server replicates from 8 masters running an assortment of MySQL and MariaDB with different versions. One particular master (let's call it "bad-master") is running MariaDB 10.0.11. Every time I issue STOP SLAVE 'bad-master' on the slave server, the MariaDB process on the slave server crashes.

I have attached the config files.

For an unknown reason, the slave was lagging a few days behind "bad-master", but Seconds_Behind_Master still showed 0 (maybe another bug? I don't know). I only found out about this when I saw reports of missing data. The binary log in question has expired and is no longer available.

Thanks for looking into this.
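
For readers unfamiliar with multi-source replication: MariaDB 10.0+ lets each replication connection be named, and slave commands can target one connection. A short sketch of the statements involved here (the connection name 'bad-master' is the alias used in this report):

```sql
-- Per-connection slave control in MariaDB multi-source replication.
-- 'bad-master' is the connection name; the slave has 8 such connections.
STOP SLAVE 'bad-master';          -- the statement that triggered the crash
SHOW SLAVE 'bad-master' STATUS;   -- status for just this connection
SHOW ALL SLAVES STATUS;           -- status for all configured connections
START SLAVE 'bad-master';
```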

Comment by Elena Stepanova [ 2015-02-10 ]

Do you happen to have any monitoring on the crashing slave? SHOW SLAVE STATUS, SHOW PROCESSLIST and such.

Regarding the log, doesn't journalctl have any other records with 'mysqld' in the prefix?

Comment by Hoang Xuan Phu [ 2015-02-11 ]

SHOW SLAVE STATUS said something about waiting for master to send events. Unfortunately I don't have a record of it.

There was no more relevant output in the log.

My coworker found that the crash stopped after parallel replication was disabled:

-        # Should be at least double the number of masters
-        slave_parallel_threads = 20
-        # This limits the damage if one replication source goes wild
-        slave_domain_parallel_threads = 10

Does that ring a bell?
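
For the record, the workaround above amounts to removing those two settings so the variables fall back to their defaults. An explicit my.cnf sketch of the same thing (0 is the default, i.e. parallel replication off):

```ini
[mysqld]
# Workaround until the fix: disable parallel replication entirely.
# The crashing setup had slave_parallel_threads = 20.
slave_parallel_threads = 0
```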

Comment by Elena Stepanova [ 2015-02-11 ]

That's expected, since the stack trace indicates that the problem happens in the parallel replication code. But we don't have any known open crashing bugs in this area, so we'll need to find a way to reproduce it.

Have you found the rest of the error log?

Comment by Hoang Xuan Phu [ 2015-02-12 ]

There are no more logs, unless you mean running a debug binary to get more debug output. Meanwhile, I have disabled parallel replication to avoid the crash and repair the broken replication. I might be able to set up a debug system when things calm down a bit.

Comment by Elena Stepanova [ 2015-02-12 ]

So far I haven't been able to reproduce it; I will keep trying.

If possible, please provide SHOW ALL SLAVES STATUS from the running slave server, maybe we'll see something about configuration that I'm missing. If it's confidential data, you can upload the output to our ftp.askmonty.org/private. Of course, IPs, server names etc. can be obfuscated.

Also, what server package are you using? Are you building from source? If so, please paste your cmake options.

Comment by Hoang Xuan Phu [ 2015-02-13 ]

I have attached slaves-status.txt. The bad master is number 5 (renamed "bad-master"). I'm not sure how useful it is, because the server is currently running without parallel replication. In 2 weeks I expect to be able to duplicate the setup and try parallel replication again.

MariaDB is built from source. Here are the relevant build configuration lines; the rest is NixOS boilerplate.

  cmakeFlags = "-DWITH_SSL=yes -DWITH_READLINE=yes -DWITH_EMBEDDED_SERVER=yes -DWITH_ZLIB=yes -DINSTALL_SCRIPTDIR=bin -DHAVE_IPV6=yes -DWITHOUT_TOKUDB=1";
  NIX_LDFLAGS = stdenv.lib.optionalString stdenv.isLinux "-lgcc_s";
  prePatch = ''
    sed -i -e "s|/usr/bin/libtool|libtool|" cmake/libutils.cmake
  '';
  postInstall = ''
    sed -i -e "s|basedir=\"\"|basedir=\"$out\"|" $out/bin/mysql_install_db
    rm -rf $out/mysql-test $out/sql-bench
  '';

Comment by Elena Stepanova [ 2015-02-19 ]

I was able to reproduce the crash. It happens under slightly different circumstances, at least on my build: not during STOP SLAVE, but during a restart of the slave server. Otherwise it's identical, so it must be the same problem: similar stack trace, multi-source replication setup and parallel replication.

Stack trace from 10.0 revno 4599

150219 14:10:15 [Note] Started replication for 'm1'
150219 14:10:15 [Note] Master 'm1': Slave SQL thread initialized, starting replication in log 'mysqld-bin.000001' at position 313, relay log './mysqld-relay-bin-m1.000002' position: 601
150219 14:10:15 [Note] Master 'm1': Slave I/O thread: connected to master 'root@127.0.0.1:16000',replication started in log 'mysqld-bin.000001' at position 70397
safe_mutex: Trying to lock unitialized mutex at /home/elenst/bzr/10.0/sql/rpl_parallel.cc, line 1400
150219 14:10:15 [Note] Started replication for 'm2'
150219 14:10:15 [ERROR] mysqld got signal 6 ;
 
#3  <signal handler called>
#4  0x00007febfb211165 in *__GI_raise (sig=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#5  0x00007febfb2143e0 in *__GI_abort () at abort.c:92
#6  0x0000000000e52f61 in safe_mutex_lock (mp=0x188f5f8, my_flags=0, file=0xf69e20 "/home/elenst/bzr/10.0/sql/rpl_parallel.cc", line=1400) at /home/elenst/bzr/10.0/mysys/thr_mutex.c:247
#7  0x00000000007f3d97 in inline_mysql_mutex_lock (that=0x188f5f8, src_file=0xf69e20 "/home/elenst/bzr/10.0/sql/rpl_parallel.cc", src_line=1400) at /home/elenst/bzr/10.0/include/mysql/psi/mysql_thread.h:689
#8  0x00000000007f7a30 in rpl_parallel_thread_pool::get_thread (this=0x188f5e0, owner=0x7febe80993c0, entry=0x7febe8099270) at /home/elenst/bzr/10.0/sql/rpl_parallel.cc:1400
#9  0x00000000007f7ef5 in rpl_parallel_entry::choose_thread (this=0x7febe8099270, rgi=0x7febe8016800, did_enter_cond=0x7febfd383737, old_stage=0x7febfd383710, reuse=false) at /home/elenst/bzr/10.0/sql/rpl_parallel.cc:1542
#10 0x00000000007f8d43 in rpl_parallel::do_event (this=0x7febf1425c58, serial_rgi=0x7febe8016800, ev=0x7febe807b270, event_size=38) at /home/elenst/bzr/10.0/sql/rpl_parallel.cc:1971
#11 0x00000000005e805e in exec_relay_log_event (thd=0x7febe803f070, rli=0x7febf1422c48, serial_rgi=0x7febe8016800) at /home/elenst/bzr/10.0/sql/slave.cc:3531
#12 0x00000000005eb07d in handle_slave_sql (arg=0x7febf1421000) at /home/elenst/bzr/10.0/sql/slave.cc:4665
#13 0x0000000000cd0a2a in pfs_spawn_thread (arg=0x7febf4fd9c70) at /home/elenst/bzr/10.0/storage/perfschema/pfs.cc:1860
#14 0x00007febfd005b50 in start_thread (arg=<optimized out>) at pthread_create.c:304
#15 0x00007febfb2bb70d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112

Test case

--enable_connect_log
 
--let $queries = 200
 
--connect (server_1,127.0.0.1,root,,,$SERVER_MYPORT_1)
create database db1;
use db1;
create table t1 (pk int auto_increment primary key) engine=InnoDB;
 
# Generate some binlog events on server 1
--disable_query_log
--disable_warnings
--let $n = $queries
while ($n)
{
  insert into t1 (pk) values (null);
  delete from t1 order by pk limit 1;
  --dec $n
}
--enable_query_log
--enable_warnings
 
--connect (server_2,127.0.0.1,root,,,$SERVER_MYPORT_2)
create database db2;
use db2;
create table t1 (pk int auto_increment primary key) engine=InnoDB;
 
# Generate some binlog events on server 2
--disable_query_log
--disable_warnings
--let $n = $queries
while ($n)
{
  insert into t1 (pk) values (null);
  delete from t1 order by pk limit 1;
  --dec $n
}
--enable_query_log
--enable_warnings
 
--connect (server_3,127.0.0.1,root,,,$SERVER_MYPORT_3)
 
eval change master 'm1' to master_host='127.0.0.1', master_port=$SERVER_MYPORT_1, master_user='root';
eval change master 'm2' to master_host='127.0.0.1', master_port=$SERVER_MYPORT_2, master_user='root';
start all slaves;
let $wait_for_all = 1;
let $show_statement = SHOW ALL SLAVES STATUS;
let $field = Slave_SQL_Running;
let $condition = = 'Yes';
--source include/wait_show_condition.inc
let $field = Slave_IO_Running;
--source include/wait_show_condition.inc
 
 
--echo # Wait a bit...
--sleep 3
 
--echo # Stop the slave...
--exec echo "wait" > $MYSQLTEST_VARDIR/tmp/mysqld.3.expect
--shutdown_server 10
--source include/wait_until_disconnected.inc
 
--echo # Restart the slave...
--exec echo "restart" > $MYSQLTEST_VARDIR/tmp/mysqld.3.expect
--enable_reconnect
--source include/wait_until_connected_again.inc
 
show processlist;

cnf file for the test case

!include include/default_mysqld.cnf
 
[mysqld.1]
server-id=1
log-bin
innodb
 
[mysqld.2]
server-id=2
log-bin
innodb
 
[mysqld.3]
server-id=3
slave_parallel_threads=1
innodb
 
[ENV]
SERVER_MYPORT_1= @mysqld.1.port
SERVER_MYSOCK_1= @mysqld.1.socket
SERVER_MYPORT_2= @mysqld.2.port
SERVER_MYSOCK_2= @mysqld.2.socket
SERVER_MYPORT_3= @mysqld.3.port
SERVER_MYSOCK_3= @mysqld.3.socket

Comment by Kristian Nielsen [ 2015-02-19 ]

Ok, I found it. Here is a patch that fixes it:

http://lists.askmonty.org/pipermail/commits/2015-February/007452.html

I've pushed the patch to 10.0.17.
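
The linked patch is the authoritative fix. As a language-neutral illustration of the general class of bug the stack trace points at (a slave SQL thread blocked in pthread_cond_wait inside rpl_parallel_thread_pool::get_thread while the pool is torn down under it), here is a toy Python sketch; all names and logic are hypothetical, not MariaDB's actual implementation:

```python
import threading
from collections import deque

class ParallelThreadPool:
    """Toy model of a worker pool shared by several slave SQL threads.

    Illustrative only: the key point is that a waiter inside get_thread()
    must be woken and made to bail out when the pool is being shut down,
    instead of sleeping forever on (or touching) a destroyed mutex/cond.
    """

    def __init__(self, size):
        self.lock = threading.Lock()
        self.cond = threading.Condition(self.lock)
        self.free = deque(range(size))   # worker ids stand in for threads
        self.shutdown = False

    def get_thread(self):
        # A slave SQL thread blocks here until a worker is free,
        # but also wakes up and gives up if the pool is going away.
        with self.cond:
            while not self.free and not self.shutdown:
                self.cond.wait()
            if self.shutdown:
                return None              # caller must handle "no worker"
            return self.free.popleft()

    def release_thread(self, worker):
        with self.cond:
            self.free.append(worker)
            self.cond.notify()

    def destroy(self):
        # STOP SLAVE / pool-resize path: signal all waiters first,
        # only then is it safe to tear the pool down.
        with self.cond:
            self.shutdown = True
            self.cond.notify_all()
```

The design choice to note is the shutdown flag checked inside the wait loop: without it, a waiter can be left blocked on a condition variable whose pool has been destroyed, which is the kind of use-after-teardown that safe_mutex reports as locking an uninitialized mutex.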

Comment by Elena Stepanova [ 2015-02-19 ]

phunehehe,

As you can see above, Kristian has fixed the bug which most likely caused the crash that you observed. Still, we would like to make sure it actually solves the problem for you. You said above that you were going to create a new setup. Would it be possible for you to pick up the fixed sources and try them? If you use the bzr tree, it's already in the tree; if you use source tarballs, I will give you the location of the one with the fix.

Comment by Hoang Xuan Phu [ 2015-02-28 ]

This is great news. Please point me to a tarball with the fix and I'll find time to try it out. Sorry for the delayed response, it was a big holiday.

Comment by Kristian Nielsen [ 2015-02-28 ]

The patch was pushed to 10.0.17. Elena just wanted to keep the bug open in order to have time to double-check that the originally reported problem is the same as the problem fixed by the patch.

Comment by Elena Stepanova [ 2015-02-28 ]

phunehehe,
You can find the source tarball here: http://hasky.askmonty.org/archive/pack/10.0/build-7590/kvm-tarbake-jaunty-x86/mariadb-10.0.17.tar.gz

Comment by Hoang Xuan Phu [ 2015-02-28 ]

Is that the same as the one from https://downloads.mariadb.org/mariadb/10.0.17/? I see the fix in the changelog. Will definitely update and test.

Comment by Elena Stepanova [ 2015-02-28 ]

Right, if your mirror already has 10.0.17 package, please use it.

Comment by Hoang Xuan Phu [ 2015-03-02 ]

This might be good or bad news. The crash is no longer reproducible for me. Not with 10.0.16, not with 10.0.17.
Thanks to everyone for taking care of MariaDB!

Comment by Elena Stepanova [ 2015-03-02 ]

Let's hope it's good news. I will consider the bug fixed for now; if you have any new information later, please let us know.

Generated at Thu Feb 08 07:20:35 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.