Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-523

Galera: Server crashes with SIGABRT, SIGSEGV or deadlocks on concurrent DDL/DML flow

Details

    • Bug
    • Status: Closed (View Workflow)
    • Major
    • Resolution: Fixed
    • 5.5.25-galera
    • 5.5.28a-galera
    • None
    • None

    Description

      The following test, when it's run on a node of a several-node cluster, causes different effects: server crash with signal 11, or crash with signal 6, or deadlock between the test clients (in this case I can still connect to the server and issue SHOW PROCESSLIST or such). There might be even more variations, but I stopped trying to distinguish them after the first 3.

      I will assume for now that all those problems have the same root cause, and will put them all together here. The attached files are all threads' stack trace and server error log for each of the failure types, 6 files total.

      Here in the description I will put the shell script which needs to be run to reproduce the problems, stack traces of the crashing threads, and my server command line.

      Test script:

      #!/bin/bash
       
      # Provide your port number as the 1st parameter if the port is not 3306.
      # Provide full path to MySQL client as the 2nd parameter if there is no MySQL client on the system path.
      # Please note that the script is quite rough, so after the server deadlocks you might need to do some cleanup
       
      port=${1:-3306}
      client=${2:-mysql}
       
      echo "CREATE TABLE IF NOT EXISTS test.t1 (f1 tinyint, pk int auto_increment primary key) ENGINE=InnoDB |" > client1
      echo "CREATE TABLE IF NOT EXISTS test.t1 (f1 tinyint, pk int auto_increment primary key) ENGINE=InnoDB |" > client2
       
      c=1
      while [ $c -lt 100 ]
      do
        echo "CREATE TRIGGER tr AFTER DELETE ON t1 FOR EACH ROW BEGIN END ; DROP TRIGGER IF EXISTS tr ; SELECT '1 alive' |" >> client1
        echo "INSERT INTO t1 ( f1 ) VALUES ( 1 ) ; DELETE FROM t1 LIMIT 1 ; CREATE TRIGGER tr AFTER DELETE ON t1 FOR EACH ROW BEGIN END ; DROP TRIGGER IF EXISTS tri ; SELECT '2 alive' |" >> client2
        (( c += 1 ))
      done
       
      $client --no-defaults -uroot --protocol=tcp --port=$port --force --delimiter='|' test < client1 &
      $client --no-defaults -uroot --protocol=tcp --port=$port --force --delimiter='|' test < client2
      fg
       
      $client --no-defaults -uroot --protocol=tcp --port=$port --force --delimiter='|' -e "DROP TABLE test.t1"
       

      My server command line (the node where I run the test flow; other node(s) are the same, only wsrep_cluster_address differs).

      maria-5.5-galera/sql/mysqld --no-defaults --datadir=maria-5.5-galera/data1 --wsrep_provider=galera-23.2.1-src/libgalera_smm.so --wsrep_sst_method=rsync --core --binlog-format=row --default-storage-engine=InnoDB --innodb_autoinc_lock_mode=2 --innodb_locks_unsafe_for_binlog=1 --innodb_flush_log_at_trx_commit=0 --log-error=log.err --wsrep_sst_auth=galera:galera --basedir=maria-5.5-galera/ --port=8306 --loose-lc-messages-dir=maria-5.5-galera/sql/share --socket=/tmp/elenst-galera-1.sock --tmpdir=maria-5.5-galera/data1/tmp --general-log=1 --wsrep_cluster_address=gcomm:// --core --wsrep_debug=1

      Stack trace for the thread crashing with SIGSEGV:

      #2  0x00000000007bdd3c in handle_fatal_signal (sig=11) at maria-5.5-galera/sql/signal_handler.cc:262
      #3  <signal handler called>
      #4  0x0000000000c88e70 in free_root (root=0x7f0ce41296c0, MyFlags=0) at maria-5.5-galera/mysys/my_alloc.c:364
      #5  0x00000000006c1021 in TABLE_SHARE::destroy (this=0x7f0cd00146e8) at maria-5.5-galera/sql/table.cc:457
      #6  0x00000000006c10d0 in free_table_share (share=0x7f0cd00146e8) at maria-5.5-galera/sql/table.cc:481
      #7  0x000000000059fb54 in table_def_free_entry (share=0x7f0cd00146e8) at maria-5.5-galera/sql/sql_base.cc:371
      #8  0x0000000000c72951 in my_hash_delete (hash=0x14cb320, record=0x7f0cd00146e8 "\002") at maria-5.5-galera/mysys/hash.c:604
      #9  0x00000000005a07e1 in release_table_share (share=0x7f0cd00146e8) at maria-5.5-galera/sql/sql_base.cc:795
      #10 0x00000000006c7e48 in closefrm (table=0x7f0cd0015d00, free_share=true) at maria-5.5-galera/sql/table.cc:2726
      #11 0x00000000005a0e5e in intern_close_table (table=0x7f0cd0015d00) at maria-5.5-galera/sql/sql_base.cc:932
      #12 0x00000000005a0ec7 in free_cache_entry (table=0x7f0cd0015d00) at maria-5.5-galera/sql/sql_base.cc:955
      #13 0x00000000005a2505 in close_thread_table (thd=0x5002e50, table_ptr=0x5002f08) at maria-5.5-galera/sql/sql_base.cc:1642
      #14 0x00000000005a1e9a in close_all_tables_for_name (thd=0x5002e50, share=0x7f0cd00146e8, extra=HA_EXTRA_NOT_USED) at maria-5.5-galera/sql/sql_base.cc:1437
      #15 0x00000000006a7d90 in mysql_create_or_drop_trigger (thd=0x5002e50, tables=0x5087558, create=false) at maria-5.5-galera/sql/sql_trigger.cc:564
      #16 0x000000000060904d in mysql_execute_command (thd=0x5002e50) at maria-5.5-galera/sql/sql_parse.cc:4793
      #17 0x000000000060c9cc in mysql_parse (thd=0x5002e50, rawbuf=0x5086b36 "DROP TRIGGER IF EXISTS tr ; SELECT '1 alive'", length=44, parser_state=0x7f0ce412a550) at maria-5.5-galera/sql/sql_parse.cc:6235
      #18 0x00000000005fec47 in dispatch_command (command=COM_QUERY, thd=0x5002e50, packet=0x505d991 "", packet_length=106) at maria-5.5-galera/sql/sql_parse.cc:1299
      #19 0x00000000005fd8b3 in do_command (thd=0x5002e50) at maria-5.5-galera/sql/sql_parse.cc:899
      #20 0x00000000006fd7df in do_handle_one_connection (thd_arg=0x5002e50) at maria-5.5-galera/sql/sql_connect.cc:1278
      #21 0x00000000006fd1c6 in handle_one_connection (arg=0x5002e50) at maria-5.5-galera/sql/sql_connect.cc:1186
      #22 0x00007f0ce341defc in start_thread (arg=0x7f0ce412b700) at pthread_create.c:304
      #23 0x00007f0ce2bcd59d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112

      Stack trace for the thread crashing with SIGABRT:

      #5  0x00007f1b0af79b0b in __GI_abort () at abort.c:92
      #6  0x00007f1b0afafd63 in __libc_message (do_abort=2, fmt=0x7f1b0b0a0e58 "*** glibc detected *** %s: %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:189
      #7  0x00007f1b0afba6e6 in malloc_printerr (action=3, str=0x7f1b0b0a1048 "free(): corrupted unsorted chunks", ptr=<optimized out>) at malloc.c:6283
      #8  0x00007f1b0afbe9cc in __GI___libc_free (mem=<optimized out>) at malloc.c:3738
      #9  0x0000000000ca8bbe in free_memory (ptr=0x3ddcf90) at maria-5.5-galera/mysys/safemalloc.c:205
      #10 0x0000000000ca8ae2 in sf_free (ptr=0x3ddcf90) at maria-5.5-galera/mysys/safemalloc.c:182
      #11 0x0000000000c94eb0 in my_free (ptr=0x3ddcf90) at maria-5.5-galera/mysys/my_malloc.c:115
      #12 0x00000000007082a3 in MDL_context::acquire_locks (this=0x7f1af95781a0, mdl_requests=0x7f1b0c57f7c0, lock_wait_timeout=31536000) at maria-5.5-galera/sql/mdl.cc:2207
      #13 0x00000000005a833c in lock_table_names (thd=0x7f1af95780b0, tables_start=0x3dd7a70, tables_end=0x0, lock_wait_timeout=31536000, flags=0) at maria-5.5-galera/sql/sql_base.cc:4751
      #14 0x00000000005a8714 in open_tables (thd=0x7f1af95780b0, start=0x7f1b0c57fb20, counter=0x7f1b0c57fb6c, flags=0, prelocking_strategy=0x7f1b0c57fc00) at maria-5.5-galera/sql/sql_base.cc:4936
      #15 0x00000000005a97d1 in open_and_lock_tables (thd=0x7f1af95780b0, tables=0x3dd7a70, derived=false, flags=0, prelocking_strategy=0x7f1b0c57fc00) at maria-5.5-galera/sql/sql_base.cc:5599
      #16 0x00000000005a929a in open_n_lock_single_table (thd=0x7f1af95780b0, table_l=0x3dd7a70, lock_type=TL_READ_NO_INSERT, flags=0, prelocking_strategy=0x7f1b0c57fc00) at maria-5.5-galera/sql/sql_base.cc:5443
      #17 0x00000000006a5099 in open_n_lock_single_table (thd=0x7f1af95780b0, table_l=0x3dd7a70, lock_type=TL_READ_NO_INSERT, flags=0) at maria-5.5-galera/sql/sql_base.h:482
      #18 0x00000000006a7c03 in mysql_create_or_drop_trigger (thd=0x7f1af95780b0, tables=0x3dd7a70, create=true) at maria-5.5-galera/sql/sql_trigger.cc:530
      #19 0x0000000000608fc4 in mysql_execute_command (thd=0x7f1af95780b0) at maria-5.5-galera/sql/sql_parse.cc:4785
      #20 0x000000000060c9cc in mysql_parse (thd=0x7f1af95780b0, rawbuf=0x3dd7898 "CREATE TRIGGER tr AFTER DELETE ON t1 FOR EACH ROW BEGIN END ; DROP TRIGGER IF EXISTS tr ; SELECT '1 alive'", length=106, parser_state=0x7f1b0c580550) at maria-5.5-galera/sql/sql_parse.cc:6235
      #21 0x00000000005fea28 in dispatch_command (command=COM_QUERY, thd=0x7f1af95780b0, packet=0x7f1af95cb8b1 "CREATE TRIGGER tr AFTER DELETE ON t1 FOR EACH ROW BEGIN END ; DROP TRIGGER IF EXISTS tr ; SELECT '1 alive'", packet_length=106) at maria-5.5-galera/sql/sql_parse.cc:1240
      #22 0x00000000005fd8b3 in do_command (thd=0x7f1af95780b0) at maria-5.5-galera/sql/sql_parse.cc:899
      #23 0x00000000006fd7df in do_handle_one_connection (thd_arg=0x7f1af95780b0) at maria-5.5-galera/sql/sql_connect.cc:1278
      #24 0x00000000006fd1c6 in handle_one_connection (arg=0x7f1af95780b0) at maria-5.5-galera/sql/sql_connect.cc:1186
      #25 0x00007f1b0b873efc in start_thread (arg=0x7f1b0c581700) at pthread_create.c:304
      #26 0x00007f1b0b02359d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112

      *** glibc detected *** maria-5.5-galera/sql/mysqld: corrupted double-linked list: 0x00007fe9596156b0 ***

      sergii@pisem.net-20120908075116-u2v35aaixa3nnhrl
      date: 2012-09-08 09:51:16 +0200
      build-date: 2012-09-11 05:57:26 +0400
      revno: 3346

      All types of the failures are also reproducible on MariaDB-Galera-5.5.25 release binaries.

      Attachments

        1. crash_sigabrt_error_log
          63 kB
        2. crash_sigabrt_threads
          28 kB
        3. crash_sigsegv_error_log
          19 kB
        4. crash_sigsegv_threads
          28 kB
        5. deadlock_error_log
          45 kB
        6. deadlock_threads
          27 kB

        Issue Links

          Activity

            seppo Seppo Jaakola added a comment -

            This test shows a bug when DDL statement is part of a multi statement. Removing delimiters and --delimiter option in the test script makes this test pass.

            seppo Seppo Jaakola added a comment - This test shows a bug when DDL statement is part of a multi statement. Removing delimiters and --delimiter option in the test script makes this test pass.
            elenst Elena Stepanova added a comment - - edited

            This is true (at least for this test), but normally it should not matter whether DDL is provided as a number of single statements or as a multi-statement, right? There should be nothing specific about a multi-statement in the way server processes it? So, my hope was the difference was just caused by different timing for multi- and single statements in the test. If it's not so, and multi statements are essentially different, it's more disturbing and, in addition to fixing the crash, at the very least deserves mentioning somewhere in the documentation (maybe it's already there, but I couldn't see it so far).

            elenst Elena Stepanova added a comment - - edited This is true (at least for this test), but normally it should not matter whether DDL is provided as a number of single statements or as a multi-statement, right? There should be nothing specific about a multi-statement in the way server processes it? So, my hope was the difference was just caused by different timing for multi- and single statements in the test. If it's not so, and multi statements are essentially different, it's more disturbing and, in addition to fixing the crash, at the very least deserves mentioning somewhere in the documentation (maybe it's already there, but I couldn't see it so far).
            seppo Seppo Jaakola added a comment -

            When tested in upstream build, none of the bad effects (hanging, assert, segfault) happened. However, I'm not quite sure how the test should behave, DDL gets high priority in Galera replication and these client sessions will constantly conflict with each other.

            seppo Seppo Jaakola added a comment - When tested in upstream build, none of the bad effects (hanging, assert, segfault) happened. However, I'm not quite sure how the test should behave, DDL gets high priority in Galera replication and these client sessions will constantly conflict with each other.

            >> I'm not quite sure how the test should behave, DDL gets high priority in Galera replication and these client sessions will constantly conflict with each other.

            For this kind of tests, I would consider the behavior good enough if nodes don't crash, don't deadlock, replication doesn't break, data doesn't get corrupted and remains consistent over the cluster.
            The exact result of the flow (what is executed, what is rolled back, etc.) remains non-deterministic due to the nature of a concurrency test.
            I'll re-run the unabridged version to check the fix.

            elenst Elena Stepanova added a comment - >> I'm not quite sure how the test should behave, DDL gets high priority in Galera replication and these client sessions will constantly conflict with each other. For this kind of tests, I would consider the behavior good enough if nodes don't crash, don't deadlock, replication doesn't break, data doesn't get corrupted and remains consistent over the cluster. The exact result of the flow (what is executed, what is rolled back, etc.) remains non-deterministic due to the nature of a concurrency test. I'll re-run the unabridged version to check the fix.

            People

              seppo Seppo Jaakola
              elenst Elena Stepanova
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.