[MDEV-523] Galera: Server crashes with SIGABRT, SIGSEGV or deadlocks on concurrent DDL/DML flow Created: 2012-09-11  Updated: 2012-09-20  Resolved: 2012-09-18

Status: Closed
Project: MariaDB Server
Component/s: None
Affects Version/s: 5.5.25-galera
Fix Version/s: 5.5.28a-galera

Type: Bug Priority: Major
Reporter: Elena Stepanova Assignee: Seppo Jaakola
Resolution: Fixed Votes: 0
Labels: None

Attachments: File crash_sigabrt_error_log     File crash_sigabrt_threads     File crash_sigsegv_error_log     File crash_sigsegv_threads     File deadlock_error_log     File deadlock_threads    
Issue Links:
Blocks
is blocked by MDEV-541 Galera: A deadlock breaks the whole c... Closed
Relates
relates to MDEV-468 Galera: Server does not honor metadat... Closed

 Description   

The following test, when it's run on a node of a several-node cluster, causes different effects: server crash with signal 11, or crash with signal 6, or deadlock between the test clients (in this case I can still connect to the server and issue SHOW PROCESSLIST or such). There might be even more variations, but I stopped trying to distinguish them after the first 3.

I will assume for now that all those problems have the same root cause, and will put them all together here. The attached files are all threads' stack trace and server error log for each of the failure types, 6 files total.

Here in the description I will put the shell script which needs to be run to reproduce the problems, stack traces of the crashing threads, and my server command line.

Test script:

#!/bin/bash
 
# Provide your port number as the 1st parameter if the port is not 3306.
# Provide full path to MySQL client as the 2nd parameter if there is no MySQL client on the system path.
# Please note that the script is quite rough, so after the server deadlocks you might need to do some cleanup
 
port=${1:-3306}
client=${2:-mysql}
 
echo "CREATE TABLE IF NOT EXISTS test.t1 (f1 tinyint, pk int auto_increment primary key) ENGINE=InnoDB |" > client1
echo "CREATE TABLE IF NOT EXISTS test.t1 (f1 tinyint, pk int auto_increment primary key) ENGINE=InnoDB |" > client2
 
c=1
while [ $c -lt 100 ]
do
  echo "CREATE TRIGGER tr AFTER DELETE ON t1 FOR EACH ROW BEGIN END ; DROP TRIGGER IF EXISTS tr ; SELECT '1 alive' |" >> client1
  echo "INSERT INTO t1 ( f1 ) VALUES ( 1 ) ; DELETE FROM t1 LIMIT 1 ; CREATE TRIGGER tr AFTER DELETE ON t1 FOR EACH ROW BEGIN END ; DROP TRIGGER IF EXISTS tri ; SELECT '2 alive' |" >> client2
  (( c += 1 ))
done
 
$client --no-defaults -uroot --protocol=tcp --port=$port --force --delimiter='|' test < client1 &
$client --no-defaults -uroot --protocol=tcp --port=$port --force --delimiter='|' test < client2
fg
 
$client --no-defaults -uroot --protocol=tcp --port=$port --force --delimiter='|' -e "DROP TABLE test.t1"
 

My server command line (the node where I run the test flow; other node(s) are the same, only wsrep_cluster_address differs).

maria-5.5-galera/sql/mysqld --no-defaults --datadir=maria-5.5-galera/data1 --wsrep_provider=galera-23.2.1-src/libgalera_smm.so --wsrep_sst_method=rsync --core --binlog-format=row --default-storage-engine=InnoDB --innodb_autoinc_lock_mode=2 --innodb_locks_unsafe_for_binlog=1 --innodb_flush_log_at_trx_commit=0 --log-error=log.err --wsrep_sst_auth=galera:galera --basedir=maria-5.5-galera/ --port=8306 --loose-lc-messages-dir=maria-5.5-galera/sql/share --socket=/tmp/elenst-galera-1.sock --tmpdir=maria-5.5-galera/data1/tmp --general-log=1 --wsrep_cluster_address=gcomm:// --core --wsrep_debug=1

Stack trace for the thread crashing with SIGSEGV:

#2  0x00000000007bdd3c in handle_fatal_signal (sig=11) at maria-5.5-galera/sql/signal_handler.cc:262
#3  <signal handler called>
#4  0x0000000000c88e70 in free_root (root=0x7f0ce41296c0, MyFlags=0) at maria-5.5-galera/mysys/my_alloc.c:364
#5  0x00000000006c1021 in TABLE_SHARE::destroy (this=0x7f0cd00146e8) at maria-5.5-galera/sql/table.cc:457
#6  0x00000000006c10d0 in free_table_share (share=0x7f0cd00146e8) at maria-5.5-galera/sql/table.cc:481
#7  0x000000000059fb54 in table_def_free_entry (share=0x7f0cd00146e8) at maria-5.5-galera/sql/sql_base.cc:371
#8  0x0000000000c72951 in my_hash_delete (hash=0x14cb320, record=0x7f0cd00146e8 "\002") at maria-5.5-galera/mysys/hash.c:604
#9  0x00000000005a07e1 in release_table_share (share=0x7f0cd00146e8) at maria-5.5-galera/sql/sql_base.cc:795
#10 0x00000000006c7e48 in closefrm (table=0x7f0cd0015d00, free_share=true) at maria-5.5-galera/sql/table.cc:2726
#11 0x00000000005a0e5e in intern_close_table (table=0x7f0cd0015d00) at maria-5.5-galera/sql/sql_base.cc:932
#12 0x00000000005a0ec7 in free_cache_entry (table=0x7f0cd0015d00) at maria-5.5-galera/sql/sql_base.cc:955
#13 0x00000000005a2505 in close_thread_table (thd=0x5002e50, table_ptr=0x5002f08) at maria-5.5-galera/sql/sql_base.cc:1642
#14 0x00000000005a1e9a in close_all_tables_for_name (thd=0x5002e50, share=0x7f0cd00146e8, extra=HA_EXTRA_NOT_USED) at maria-5.5-galera/sql/sql_base.cc:1437
#15 0x00000000006a7d90 in mysql_create_or_drop_trigger (thd=0x5002e50, tables=0x5087558, create=false) at maria-5.5-galera/sql/sql_trigger.cc:564
#16 0x000000000060904d in mysql_execute_command (thd=0x5002e50) at maria-5.5-galera/sql/sql_parse.cc:4793
#17 0x000000000060c9cc in mysql_parse (thd=0x5002e50, rawbuf=0x5086b36 "DROP TRIGGER IF EXISTS tr ; SELECT '1 alive'", length=44, parser_state=0x7f0ce412a550) at maria-5.5-galera/sql/sql_parse.cc:6235
#18 0x00000000005fec47 in dispatch_command (command=COM_QUERY, thd=0x5002e50, packet=0x505d991 "", packet_length=106) at maria-5.5-galera/sql/sql_parse.cc:1299
#19 0x00000000005fd8b3 in do_command (thd=0x5002e50) at maria-5.5-galera/sql/sql_parse.cc:899
#20 0x00000000006fd7df in do_handle_one_connection (thd_arg=0x5002e50) at maria-5.5-galera/sql/sql_connect.cc:1278
#21 0x00000000006fd1c6 in handle_one_connection (arg=0x5002e50) at maria-5.5-galera/sql/sql_connect.cc:1186
#22 0x00007f0ce341defc in start_thread (arg=0x7f0ce412b700) at pthread_create.c:304
#23 0x00007f0ce2bcd59d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112

Stack trace for the thread crashing with SIGABRT:

#5  0x00007f1b0af79b0b in __GI_abort () at abort.c:92
#6  0x00007f1b0afafd63 in __libc_message (do_abort=2, fmt=0x7f1b0b0a0e58 "*** glibc detected *** %s: %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:189
#7  0x00007f1b0afba6e6 in malloc_printerr (action=3, str=0x7f1b0b0a1048 "free(): corrupted unsorted chunks", ptr=<optimized out>) at malloc.c:6283
#8  0x00007f1b0afbe9cc in __GI___libc_free (mem=<optimized out>) at malloc.c:3738
#9  0x0000000000ca8bbe in free_memory (ptr=0x3ddcf90) at maria-5.5-galera/mysys/safemalloc.c:205
#10 0x0000000000ca8ae2 in sf_free (ptr=0x3ddcf90) at maria-5.5-galera/mysys/safemalloc.c:182
#11 0x0000000000c94eb0 in my_free (ptr=0x3ddcf90) at maria-5.5-galera/mysys/my_malloc.c:115
#12 0x00000000007082a3 in MDL_context::acquire_locks (this=0x7f1af95781a0, mdl_requests=0x7f1b0c57f7c0, lock_wait_timeout=31536000) at maria-5.5-galera/sql/mdl.cc:2207
#13 0x00000000005a833c in lock_table_names (thd=0x7f1af95780b0, tables_start=0x3dd7a70, tables_end=0x0, lock_wait_timeout=31536000, flags=0) at maria-5.5-galera/sql/sql_base.cc:4751
#14 0x00000000005a8714 in open_tables (thd=0x7f1af95780b0, start=0x7f1b0c57fb20, counter=0x7f1b0c57fb6c, flags=0, prelocking_strategy=0x7f1b0c57fc00) at maria-5.5-galera/sql/sql_base.cc:4936
#15 0x00000000005a97d1 in open_and_lock_tables (thd=0x7f1af95780b0, tables=0x3dd7a70, derived=false, flags=0, prelocking_strategy=0x7f1b0c57fc00) at maria-5.5-galera/sql/sql_base.cc:5599
#16 0x00000000005a929a in open_n_lock_single_table (thd=0x7f1af95780b0, table_l=0x3dd7a70, lock_type=TL_READ_NO_INSERT, flags=0, prelocking_strategy=0x7f1b0c57fc00) at maria-5.5-galera/sql/sql_base.cc:5443
#17 0x00000000006a5099 in open_n_lock_single_table (thd=0x7f1af95780b0, table_l=0x3dd7a70, lock_type=TL_READ_NO_INSERT, flags=0) at maria-5.5-galera/sql/sql_base.h:482
#18 0x00000000006a7c03 in mysql_create_or_drop_trigger (thd=0x7f1af95780b0, tables=0x3dd7a70, create=true) at maria-5.5-galera/sql/sql_trigger.cc:530
#19 0x0000000000608fc4 in mysql_execute_command (thd=0x7f1af95780b0) at maria-5.5-galera/sql/sql_parse.cc:4785
#20 0x000000000060c9cc in mysql_parse (thd=0x7f1af95780b0, rawbuf=0x3dd7898 "CREATE TRIGGER tr AFTER DELETE ON t1 FOR EACH ROW BEGIN END ; DROP TRIGGER IF EXISTS tr ; SELECT '1 alive'", length=106, parser_state=0x7f1b0c580550) at maria-5.5-galera/sql/sql_parse.cc:6235
#21 0x00000000005fea28 in dispatch_command (command=COM_QUERY, thd=0x7f1af95780b0, packet=0x7f1af95cb8b1 "CREATE TRIGGER tr AFTER DELETE ON t1 FOR EACH ROW BEGIN END ; DROP TRIGGER IF EXISTS tr ; SELECT '1 alive'", packet_length=106) at maria-5.5-galera/sql/sql_parse.cc:1240
#22 0x00000000005fd8b3 in do_command (thd=0x7f1af95780b0) at maria-5.5-galera/sql/sql_parse.cc:899
#23 0x00000000006fd7df in do_handle_one_connection (thd_arg=0x7f1af95780b0) at maria-5.5-galera/sql/sql_connect.cc:1278
#24 0x00000000006fd1c6 in handle_one_connection (arg=0x7f1af95780b0) at maria-5.5-galera/sql/sql_connect.cc:1186
#25 0x00007f1b0b873efc in start_thread (arg=0x7f1b0c581700) at pthread_create.c:304
#26 0x00007f1b0b02359d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112

*** glibc detected *** maria-5.5-galera/sql/mysqld: corrupted double-linked list: 0x00007fe9596156b0 ***

sergii@pisem.net-20120908075116-u2v35aaixa3nnhrl
date: 2012-09-08 09:51:16 +0200
build-date: 2012-09-11 05:57:26 +0400
revno: 3346

All types of the failures are also reproducible on MariaDB-Galera-5.5.25 release binaries.



 Comments   
Comment by Seppo Jaakola [ 2012-09-11 ]

This test shows a bug when DDL statement is part of a multi statement. Removing delimiters and --delimiter option in the test script makes this test pass.

Comment by Elena Stepanova [ 2012-09-11 ]

This is true (at least for this test), but normally it should not matter whether DDL is provided as a number of single statements or as a multi-statement, right? There should be nothing specific about a multi-statement in the way server processes it? So, my hope was the difference was just caused by different timing for multi- and single statements in the test. If it's not so, and multi statements are essentially different, it's more disturbing and, in addition to fixing the crash, at the very least deserves mentioning somewhere in the documentation (maybe it's already there, but I couldn't see it so far).

Comment by Seppo Jaakola [ 2012-09-18 ]

When tested in upstream build, none of the bad effects (hanging, assert, segfault) happened. However, I'm not quite sure how the test should behave, DDL gets high priority in Galera replication and these client sessions will constantly conflict with each other.

Comment by Elena Stepanova [ 2012-09-18 ]

>> I'm not quite sure how the test should behave, DDL gets high priority in Galera replication and these client sessions will constantly conflict with each other.

For this kind of tests, I would consider the behavior good enough if nodes don't crash, don't deadlock, replication doesn't break, data doesn't get corrupted and remains consistent over the cluster.
The exact result of the flow (what is executed, what is rolled back, etc.) remains non-deterministic due to the nature of a concurrency test.
I'll re-run the unabridged version to check the fix.

Generated at Thu Feb 08 06:29:23 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.