MariaDB Server / MDEV-21577

Galera crash after MDL BF-BF conflict

Details

    Description

      The server started crashing after upgrading to MariaDB 10.4.11. We have a 3-node Galera cluster. The error is probably related to dropping and recreating a table that is in use by another node.
      Logs from the last few crashes:

      led 24 22:00:19 db2 mysqld[18625]: 2020-01-24 22:00:19 16 [Note] WSREP: MDL BF-BF conflict
      led 24 22:00:19 db2 mysqld[18625]: schema:  s_demo
      led 24 22:00:19 db2 mysqld[18625]: [157B blob data]
      led 24 22:00:19 db2 mysqld[18625]: granted: (11         seqno 804791504         wsrep (toi, exec, committed) cmd 0 1         CREATE TABLE IF NOT EXISTS `shopio_vouchers_products` (
      led 24 22:00:19 db2 mysqld[18625]:   `id` int(11) NOT NULL AUTO_INCREMENT,
      led 24 22:00:19 db2 mysqld[18625]:   `vouchers_id` int(11) NOT NULL,
      led 24 22:00:19 db2 mysqld[18625]:   `products_id` int(11) NOT NULL,
      led 24 22:00:19 db2 mysqld[18625]:   PRIMARY KEY (`id`),
      led 24 22:00:19 db2 mysqld[18625]:   UNIQUE KEY `unique_ids` (`vouchers_id`,`products_id`),
      led 24 22:00:19 db2 mysqld[18625]:   KEY `products_id` (`products_id`),
      led 24 22:00:19 db2 mysqld[18625]:   CONSTRAINT FOREIGN KEY (`products_id`) REFERENCES `shopio_products` (`id`) ON DELETE CASCADE ON UPDATE CASCADE,
      led 24 22:00:19 db2 mysqld[18625]:   CONSTRAINT FOREIGN KEY (`vouchers_id`) REFERENCES `shopio_vouchers` (`id`) ON DELETE CASCADE ON UPDATE CASCADE
      led 24 22:00:19 db2 mysqld[18625]: ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4)
      led 24 22:00:19 db2 mysqld[18625]: 2020-01-24 22:00:19 16 [ERROR] Aborting
      led 24 22:08:32 db2 systemd[1]: Stopping MariaDB 10.4.11 database server...
      led 24 22:08:32 db2 mysqld[18625]: 2020-01-24 22:08:32 0 [Warning] WSREP: server: db2 unallowed state transition: disconnecting -> disconnecting
      led 24 22:08:32 db2 mysqld[18625]: mysqld: /home/buildbot/buildbot/build/mariadb-10.4.11/wsrep-lib/src/server_state.cpp:1340: void wsrep::server_state::state(wsrep::unique_lock<wsrep::mutex>&, wsrep::server_state::state): Assertion `0' failed.
      led 24 22:08:32 db2 mysqld[18625]: 200124 22:08:32 [ERROR] mysqld got signal 6 ;
      led 24 22:08:32 db2 mysqld[18625]: This could be because you hit a bug. It is also possible that this binary
      led 24 22:08:32 db2 mysqld[18625]: or one of the libraries it was linked against is corrupt, improperly built,
      led 24 22:08:32 db2 mysqld[18625]: or misconfigured. This error can also be caused by malfunctioning hardware.
      led 24 22:08:32 db2 mysqld[18625]: To report this bug, see https://mariadb.com/kb/en/reporting-bugs
      led 24 22:08:32 db2 mysqld[18625]: We will try our best to scrape up some info that will hopefully help
      led 24 22:08:32 db2 mysqld[18625]: diagnose the problem, but since we have already crashed,
      led 24 22:08:32 db2 mysqld[18625]: something is definitely wrong and this may fail.
      led 24 22:08:32 db2 mysqld[18625]: Server version: 10.4.11-MariaDB-1:10.4.11+maria~stretch-log
      led 24 22:08:32 db2 mysqld[18625]: key_buffer_size=1048576
      led 24 22:08:32 db2 mysqld[18625]: read_buffer_size=262144
      led 24 22:08:32 db2 mysqld[18625]: max_used_connections=44
      led 24 22:08:32 db2 mysqld[18625]: max_threads=2327
      led 24 22:08:32 db2 mysqld[18625]: thread_count=21
      led 24 22:08:32 db2 mysqld[18625]: It is possible that mysqld could use up to
      led 24 22:08:32 db2 mysqld[18625]: key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 5419764 K  bytes of memory
      led 24 22:08:32 db2 mysqld[18625]: Hope that's ok; if not, decrease some variables in the equation.
      led 24 22:08:32 db2 mysqld[18625]: Thread pointer: 0x0
      led 24 22:08:32 db2 mysqld[18625]: Attempting backtrace. You can use the following information to find out
      led 24 22:08:32 db2 mysqld[18625]: where mysqld died. If you see no messages after this, something went
      led 24 22:08:32 db2 mysqld[18625]: terribly wrong...
      led 24 22:08:32 db2 mysqld[18625]: stack_bottom = 0x0 thread_stack 0x31000
      led 24 22:08:32 db2 mysqld[18625]: *** buffer overflow detected ***: /usr/sbin/mysqld terminated
      led 24 22:08:32 db2 mysqld[18625]: ======= Backtrace: =========
      led 24 22:08:32 db2 mysqld[18625]: /lib/x86_64-linux-gnu/libc.so.6(+0x70bfb)[0x7f39ad77cbfb]
      led 24 22:08:32 db2 mysqld[18625]: /lib/x86_64-linux-gnu/libc.so.6(__fortify_fail+0x37)[0x7f39ad805437]
      led 24 22:08:32 db2 mysqld[18625]: /lib/x86_64-linux-gnu/libc.so.6(+0xf7570)[0x7f39ad803570]
      led 24 22:08:32 db2 mysqld[18625]: /lib/x86_64-linux-gnu/libc.so.6(+0xf93aa)[0x7f39ad8053aa]
      led 24 22:08:32 db2 mysqld[18625]: /usr/sbin/mysqld(my_addr_resolve+0xe2)[0x559a16f0f492]
      

      led 25 02:00:21 db2 mysqld[11890]: 2020-01-25  2:00:21 1 [Warning] WSREP: BF applier failed to open_and_lock_tables: 1146, fatal: 0 wsrep = (exec_mode: 2 conflict_state: 0 seqno: 805439671)
      led 25 02:00:21 db2 mysqld[11890]: 2020-01-25  2:00:21 1 [ERROR] Slave SQL: Error executing row event: 'Table 's_demo.shopio_vouchers_products' doesn't exist', Internal MariaDB error code: 1146
      led 25 02:00:21 db2 mysqld[11890]: 2020-01-25  2:00:21 1 [Warning] WSREP: Event 23 Update_rows_v1 apply failed: 1146, seqno 805439671
      led 25 02:00:21 db2 mysqld[11890]: 2020-01-25  2:00:21 1 [ERROR] WSREP: Failed to apply write set: gtid: 54078096-9d91-11e9-8472-a7c937a2aaa8:805439671 server_id: 34aa6dfd-3d7d-11ea-9f02-1e9305220e61 client_id: 1
      led 25 02:00:21 db2 mysqld[11890]: 2020-01-25  2:00:21 1 [Note] WSREP: Closing send monitor...
      led 25 02:00:21 db2 mysqld[11890]: 2020-01-25  2:00:21 1 [Note] WSREP: Closed send monitor.
      led 25 02:00:21 db2 mysqld[11890]: 2020-01-25  2:00:21 1 [Note] WSREP: gcomm: terminating thread
      led 25 02:00:21 db2 mysqld[11890]: 2020-01-25  2:00:21 1 [Note] WSREP: gcomm: joining thread
      led 25 02:00:21 db2 mysqld[11890]: 2020-01-25  2:00:21 1 [Note] WSREP: gcomm: closing backend
      

      led 25 07:00:19 db2 mysqld[17264]: 2020-01-25  7:00:19 16 [Note] WSREP: MDL BF-BF conflict
      led 25 07:00:19 db2 mysqld[17264]: schema:  s_demo
      led 25 07:00:19 db2 mysqld[17264]: request: (16         seqno 806396466         wsrep (toi, exec, committed) cmd 0 3         /*!40000 ALTER TABLE `shopio_vouchers_products` DISABLE KEYS */)
      led 25 07:00:19 db2 mysqld[17264]: [157B blob data]
      led 25 07:00:19 db2 mysqld[17264]: 2020-01-25  7:00:19 16 [ERROR] Aborting
      led 25 07:10:24 db2 systemd[1]: Stopping MariaDB 10.4.11 database server...
      led 25 07:10:24 db2 mysqld[17264]: 2020-01-25  7:10:24 0 [Warning] WSREP: server: db2 unallowed state transition: disconnecting -> disconnecting
      led 25 07:10:24 db2 mysqld[17264]: mysqld: /home/buildbot/buildbot/build/mariadb-10.4.11/wsrep-lib/src/server_state.cpp:1340: void wsrep::server_state::state(wsrep::unique_lock<wsrep::mutex>&, wsrep::server_state
      led 25 07:10:24 db2 mysqld[17264]: 200125  7:10:24 [ERROR] mysqld got signal 6 ;
      led 25 07:10:24 db2 mysqld[17264]: This could be because you hit a bug. It is also possible that this binary
      led 25 07:10:24 db2 mysqld[17264]: or one of the libraries it was linked against is corrupt, improperly built,
      led 25 07:10:24 db2 mysqld[17264]: or misconfigured. This error can also be caused by malfunctioning hardware.
      led 25 07:10:24 db2 mysqld[17264]: To report this bug, see https://mariadb.com/kb/en/reporting-bugs
      led 25 07:10:24 db2 mysqld[17264]: We will try our best to scrape up some info that will hopefully help
      led 25 07:10:24 db2 mysqld[17264]: diagnose the problem, but since we have already crashed,
      led 25 07:10:24 db2 mysqld[17264]: something is definitely wrong and this may fail.
      led 25 07:10:24 db2 mysqld[17264]: Server version: 10.4.11-MariaDB-1:10.4.11+maria~stretch-log
      led 25 07:10:24 db2 mysqld[17264]: key_buffer_size=1048576
      led 25 07:10:24 db2 mysqld[17264]: read_buffer_size=262144
      led 25 07:10:24 db2 mysqld[17264]: max_used_connections=39
      led 25 07:10:24 db2 mysqld[17264]: max_threads=2327
      led 25 07:10:24 db2 mysqld[17264]: thread_count=26
      led 25 07:10:24 db2 mysqld[17264]: It is possible that mysqld could use up to
      led 25 07:10:24 db2 mysqld[17264]: key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 5419764 K  bytes of memory
      

      Attachments

        1. node1.txt
          17 kB
        2. node2.txt
          0.9 kB

        Issue Links

          Activity

            wdoekes Walter Doekes added a comment - - edited

            The followup by Fišer accurately describes the hanging mariadbd 10.4.20 that I ran into just now:

            • mariadb (1:10.4.20+maria~bionic)
            • galera-4 (26.4.8-bionic)
            • statement: ALTER TABLE `stc_16_cblist` ADD `field_column` text
            • the Aborting node went non-primary and hung; the rest of the (3 node) cluster was unaffected (the other two considered it desynced)

            2021-11-17 16:10:06 22 [Note] WSREP: MDL BF-BF conflict
            schema:  redacted_2021
            request: (22 	seqno 31261949 	wsrep (toi, exec, committed) cmd 0 3 	ALTER TABLE `stc_16_cblist` ADD `field_column` text)
            granted: (17 	seqno 31261952 	wsrep (high priority, exec, committing) cmd 0 161 	(null))
            2021-11-17 16:10:06 22 [ERROR] Aborting
            

            SHOW PROCESSLIST shows all threads in the Daemon or Sleep state:

            Id User Host db Cmd Time State Info Progr.
            1 system user   NULL Sleep 490318 wsrep aborter idle NULL 0.000
            2 system user   NULL Sleep 1230 wsrep applier committed NULL 0.000
            4 system user   NULL Daemon NULL InnoDB purge coordinator NULL 0.000
            5 system user   NULL Daemon NULL InnoDB purge worker NULL 0.000
            6 system user   NULL Daemon NULL InnoDB purge worker NULL 0.000
            7 system user   NULL Daemon NULL InnoDB purge worker NULL 0.000
            8 system user   NULL Daemon NULL InnoDB shutdown handler NULL 0.000
            11 system user   NULL Sleep 1229 After apply log event NULL 0.000
            12 system user   NULL Sleep 1229 After apply log event NULL 0.000
            13 system user   NULL Sleep 1229 wsrep applier committed NULL 0.000
            14 system user   NULL Sleep 1229 wsrep applier committed NULL 0.000
            15 system user   NULL Sleep 1230 wsrep applier committed NULL 0.000
            16 system user   NULL Sleep 1229 committing NULL 0.000
            17 system user   NULL Sleep 1229 committing NULL 0.000
            18 system user   NULL Sleep 1229 wsrep applier committed NULL 0.000
            19 system user   NULL Sleep 1229 wsrep applier committed NULL 0.000
            20 system user   NULL Sleep 1230 wsrep applier committed NULL 0.000
            21 system user   NULL Sleep 1229 After apply log event NULL 0.000
            22 system user   xxxxxx_2021 Sleep 1229 After create ALTER TABLE `stc_16_cblist` ADD `field_column` text 0.000
            23 system user   NULL Sleep 1229 After apply log event NULL 0.000
            24 system user   NULL Sleep 1229 committing NULL 0.000
            25 system user   NULL Sleep 1229 wsrep applier committed NULL 0.000
            26 system user   NULL Sleep 1230 wsrep applier committed NULL 0.000
            27 system user   NULL Sleep 1229 wsrep applier committed NULL 0.000
            28 system user   NULL Sleep 1229 wsrep applier committed NULL 0.000
            29 system user   NULL Sleep 1230 wsrep applier committed NULL 0.000
            30 system user   NULL Sleep 1230 wsrep applier committed NULL 0.000
            31 system user   NULL Sleep 1229 After apply log event NULL 0.000
            32 system user   NULL Sleep 1229 wsrep applier committed NULL 0.000
            34 system user   NULL Sleep 1229 wsrep applier committed NULL 0.000
            1324367 root localhost NULL Query 0 Init show full processlist 0.000

            I salvaged a core dump. But this setup has an interesting peculiarity with these dumps, which I haven't been able to explain: certain threads seem to be missing from the core dump, which results in this behaviour:

            # gdb `which mysqld` ./core -batch -ex 'info threads' | tail -n1
            29	../sysdeps/unix/sysv/linux/poll.c: No such file or directory.
            warning: Couldn't find general-purpose registers in core file.
            PC register is not available
              61   Thread 0x7f3a8cd756c0 (LWP 1)
            

            That "PC register is not available" error makes every "thread apply all" lookup fail, but we can still get at some/most of the threads with a workaround:

            for x in `seq 1 61`; do echo -ex; echo "thread apply $x bt full"; done |
              xargs -d\\n gdb `which mysqld` ./core -batch
            
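            The same workaround can be written more readably by generating a gdb batch file first (a sketch, assuming gdb's -x option; the thread count 61 and the file names are taken from the run above):

            ```shell
            # One "thread apply N bt full" per thread: running the threads
            # individually means a single thread with missing registers does
            # not abort the whole dump the way "thread apply all bt full" does.
            for x in $(seq 1 61); do echo "thread apply $x bt full"; done > bt.gdb
            # Then feed the batch file to gdb, e.g.:
            #   gdb "$(which mysqld)" ./core -batch -x bt.gdb > info-threads.txt
            wc -l < bt.gdb    # one command line per thread
            ```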

            Threads are all waiting for something non-obvious:

            # tail -n61 info-threads.txt  | sed -e 's/^[*]/ /' | awk '{print $6}' | sort | uniq -c
                  1 
                 19 __GI___nanosleep
                  1 __GI___poll
                  1 __GI___sigtimedwait
                  8 futex_abstimed_wait_cancelable
                 31 futex_wait_cancelable
            

            The Aborting thread appears to be number 47:

            Thread 47 (LWP 1202077):
            #0  0x00007f3a8b485ad3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x559b8bcd38f0) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
            #1  __pthread_cond_wait_common (abstime=0x0, mutex=0x559b8bcd38a0, cond=0x559b8bcd38c8) at pthread_cond_wait.c:502
            #2  __pthread_cond_wait (cond=cond@entry=0x559b8bcd38c8, mutex=0x559b8bcd38a0) at pthread_cond_wait.c:655
            #3  0x00007f3a5da09845 in gu::Lock::wait (this=0x7f38e01acb00, cond=...) at ./galerautils/src/gu_lock.hpp:46
            #4  galera::ReplicatorSMM::wait_for_CLOSED (this=this@entry=0x559b8bcd3760, lock=...) at ./galera/src/replicator_smm.cpp:273
            #5  0x00007f3a5da098c5 in galera::ReplicatorSMM::close (this=0x559b8bcd3760) at ./galera/src/replicator_smm.cpp:364
            #6  0x00007f3a5d9e681b in galera_disconnect (gh=<optimized out>) at ./galera/src/wsrep_provider.cpp:208
            #7  0x0000559b89905e51 in wsrep::wsrep_provider_v26::disconnect (this=<optimized out>) at ./wsrep-lib/src/wsrep_provider_v26.cpp:759
            #8  0x0000559b898f329b in wsrep::server_state::disconnect (this=0x559b8bc9e790) at ./wsrep-lib/src/server_state.cpp:540
            #9  0x0000559b89011888 in unireg_abort (exit_code=exit_code@entry=1) at ./sql/mysqld.cc:1862
            #10 0x0000559b8924d4d9 in wsrep_handle_mdl_conflict (requestor_ctx=requestor_ctx@entry=0x7f361c000d28, ticket=ticket@entry=0x7f3638020b80, key=key@entry=0x7f38f86299f8) at ./sql/wsrep_mysqld.cc:2324
            #11 0x0000559b891c71f6 in MDL_lock::can_grant_lock (this=this@entry=0x7f38f86299f8, type_arg=<optimized out>, requestor_ctx=requestor_ctx@entry=0x7f361c000d28, ignore_lock_priority=ignore_lock_priority@entry=false) at ./sql/mdl.cc:1769
            #12 0x0000559b891c7915 in MDL_context::try_acquire_lock_impl (this=this@entry=0x7f361c000d28, mdl_request=mdl_request@entry=0x7f38e01ad080, out_ticket=out_ticket@entry=0x7f38e01acf98) at ./sql/mdl.cc:2100
            #13 0x0000559b891c7a4e in MDL_context::acquire_lock (this=this@entry=0x7f361c000d28, mdl_request=mdl_request@entry=0x7f38e01ad080, lock_wait_timeout=lock_wait_timeout@entry=86400) at ./sql/mdl.cc:2247
            #14 0x0000559b891c85c2 in MDL_context::upgrade_shared_lock (this=this@entry=0x7f361c000d28, mdl_ticket=0x7f361c07de10, new_type=new_type@entry=MDL_EXCLUSIVE, lock_wait_timeout=86400) at ./sql/mdl.cc:2523
            #15 0x0000559b8916c826 in mysql_inplace_alter_table (target_mdl_request=0x7f38e01ae300, alter_ctx=0x7f38e01af420, ha_alter_info=0x7f38e01ad3d0, altered_table=0x7f38e01ad470, table=0x7f362c0f19e8, table_list=0x7f361c00be40, thd=0x7f361c000c08) at ./sql/sql_table.cc:7730
            #16 mysql_alter_table (thd=thd@entry=0x7f361c000c08, new_db=new_db@entry=0x7f361c005230, new_name=new_name@entry=0x7f361c005658, create_info=create_info@entry=0x7f38e01b0010, table_list=<optimized out>, table_list@entry=0x7f361c00be40, alter_info=alter_info@entry=0x7f38e01aff50, order_num=0, order=0x0, ignore=false) at ./sql/sql_table.cc:10248
            #17 0x0000559b891c28ce in Sql_cmd_alter_table::execute (this=<optimized out>, thd=0x7f361c000c08) at ./sql/sql_alter.cc:520
            #18 0x0000559b890d58c7 in mysql_execute_command (thd=thd@entry=0x7f361c000c08) at ./sql/sql_parse.cc:6189
            #19 0x0000559b890db70a in mysql_parse (thd=0x7f361c000c08, rawbuf=<optimized out>, length=51, parser_state=parser_state@entry=0x7f38e01b2740, is_com_multi=is_com_multi@entry=false, is_next_command=is_next_command@entry=false) at ./sql/sql_parse.cc:7992
            #20 0x0000559b893e6f66 in Query_log_event::do_apply_event (this=0x7f361c0660d8, rgi=0x7f361c01c4b0, query_arg=0x7f361c0dfaca "ALTER TABLE `stc_16_cblist` ADD `field_column` text", q_len_arg=<optimized out>) at ./sql/log_event.cc:5734
            #21 0x0000559b8925e3b8 in Log_event::apply_event (rgi=0x7f361c01c4b0, this=0x7f361c0660d8) at ./sql/log_event.h:1492
            #22 wsrep_apply_events (thd=thd@entry=0x7f361c000c08, rli=<optimized out>, events_buf=<optimized out>, buf_len=0) at ./sql/wsrep_applier.cc:200
            #23 0x0000559b89244ab9 in Wsrep_high_priority_service::apply_toi (this=0x7f38e01b3cf0, ws_meta=..., data=...) at ./sql/wsrep_high_priority_service.cc:382
            #24 0x0000559b898f1eb3 in apply_toi (data=..., ws_meta=..., ws_handle=..., high_priority_service=..., provider=...) at ./wsrep-lib/src/server_state.cpp:461
            #25 wsrep::server_state::on_apply (this=<optimized out>, high_priority_service=..., ws_handle=..., ws_meta=..., data=...) at ./wsrep-lib/src/server_state.cpp:1127
            #26 0x0000559b899066d4 in wsrep::high_priority_service::apply (data=..., ws_meta=..., ws_handle=..., this=0x7f38e01b3cf0) at ./wsrep-lib/include/wsrep/high_priority_service.hpp:47
            #27 (anonymous namespace)::apply_cb (ctx=0x7f38e01b3cf0, wsh=<optimized out>, flags=<optimized out>, buf=<optimized out>, meta=<optimized out>, exit_loop=0x7f38e01b300f) at ./wsrep-lib/src/wsrep_provider_v26.cpp:504
            #28 0x00007f3a5d9fc265 in galera::TrxHandleSlave::apply (this=this@entry=0x7f36a0097820, recv_ctx=recv_ctx@entry=0x7f38e01b3cf0, apply_cb=0x559b899065a0 <(anonymous namespace)::apply_cb(void*, wsrep_ws_handle_t const*, uint32_t, wsrep_buf_t const*, wsrep_trx_meta_t const*, wsrep_bool_t*)>, meta=..., exit_loop=exit_loop@entry=@0x7f38e01b300f: false) at ./galera/src/trx_handle.cpp:391
            #29 0x00007f3a5da0dc10 in galera::ReplicatorSMM::apply_trx (this=this@entry=0x559b8bcd3760, recv_ctx=recv_ctx@entry=0x7f38e01b3cf0, ts=...) at ./galera/src/replicator_smm.cpp:504
            #30 0x00007f3a5da140f5 in galera::ReplicatorSMM::process_trx (this=0x559b8bcd3760, recv_ctx=0x7f38e01b3cf0, ts_ptr=...) at ./galera/src/replicator_smm.cpp:2127
            #31 0x00007f3a5da466b5 in galera::GcsActionSource::process_writeset (this=0x559b8bce0c60, recv_ctx=0x7f38e01b3cf0, act=..., exit_loop=@0x7f38e01b38ff: false) at ./galera/src/gcs_action_source.cpp:62
            #32 0x00007f3a5da472c2 in galera::GcsActionSource::dispatch (this=this@entry=0x559b8bce0c60, recv_ctx=recv_ctx@entry=0x7f38e01b3cf0, act=..., exit_loop=@0x7f38e01b38ff: false) at ./galera/src/gcs_action_source.cpp:110
            #33 0x00007f3a5da47582 in galera::GcsActionSource::process (this=0x559b8bce0c60, recv_ctx=0x7f38e01b3cf0, exit_loop=@0x7f38e01b38ff: false) at ./galera/src/gcs_action_source.cpp:183
            #34 0x00007f3a5da14b10 in galera::ReplicatorSMM::async_recv (this=0x559b8bcd3760, recv_ctx=0x7f38e01b3cf0) at ./galera/src/replicator_smm.cpp:390
            #35 0x00007f3a5d9e69bb in galera_recv (gh=<optimized out>, recv_ctx=<optimized out>) at ./galera/src/wsrep_provider.cpp:236
            #36 0x0000559b8990731e in wsrep::wsrep_provider_v26::run_applier (this=<optimized out>, applier_ctx=<optimized out>) at ./wsrep-lib/src/wsrep_provider_v26.cpp:796
            #37 0x0000559b8926051f in wsrep_replication_process (thd=0x7f361c000c08, arg=<optimized out>) at ./sql/wsrep_thd.cc:57
            #38 0x0000559b8924f9d3 in start_wsrep_THD (arg=arg@entry=0x559b972686f0) at ./sql/wsrep_mysqld.cc:2892
            #39 0x0000559b89824e8a in pfs_spawn_thread (arg=0x559b97268718) at ./storage/perfschema/pfs.cc:1869
            #40 0x00007f3a8b47f6db in start_thread (arg=0x7f38e01b4700) at pthread_create.c:463
            #41 0x00007f3a8a87d71f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
            

            I'd love to add more useful information, but I don't know where to go from here. I'll keep the core dump around for a while, in case jplindst or seppo or anyone else wants to know the value of something.

            If someone can explain the core dump madness where I have fewer register states than apparent threads, I'd love to hear that too. These broken corefiles are really getting on my nerves:

            # objdump -x core |
                grep -E '^ *[0-9]+ [.](reg|reg2|reg-xstate|note.linuxcore.siginfo)/' |
                awk '{print $2}' | sed -e 's/\/[0-9].*//' | sort | uniq -c 
                 60 .note.linuxcore.siginfo
                 60 .reg
                 60 .reg-xstate
                 60 .reg2
            
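            A cross-check for the same count can be done with readelf instead of objdump (a sketch, assuming binutils is installed; `count_prstatus` is a hypothetical helper, not part of any tool): the kernel writes one NT_PRSTATUS note per thread it managed to dump, so a count below gdb's "info threads" total means some threads lost their register state.

            ```shell
            # Count per-thread register notes in a core file. Each dumped
            # thread contributes exactly one NT_PRSTATUS note.
            count_prstatus() {
              grep -c 'NT_PRSTATUS'    # expects `readelf --notes <core>` on stdin
            }
            # usage: readelf --notes ./core | count_prstatus
            ```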

            wdoekes Walter Doekes added a comment - - edited The followup by FiÅ¡er accurately describes the hanging mariadbd 10.4.20 that I ran into just now: mariadb (1:10.4.20+maria~bionic) galera-4 (26.4.8-bionic) statement: ALTER TABLE `stc_16_cblist` ADD `field_column` text the Aborting node went non-primary and hung; the rest of the (3 node) cluster was unaffected (the other two considered it desynced) 2021-11-17 16:10:06 22 [Note] WSREP: MDL BF-BF conflict schema: redacted_2021 request: (22 seqno 31261949 wsrep (toi, exec, committed) cmd 0 3 ALTER TABLE `stc_16_cblist` ADD `field_column` text) granted: (17 seqno 31261952 wsrep (high priority, exec, committing) cmd 0 161 (null)) 2021-11-17 16:10:06 22 [ERROR] Aborting The SHOW PROCESSLIST shows all threads in Daemon or Sleep state: Id User Host db Cmd Time State Info Progr. 1 system user   NULL Sleep 490318 wsrep aborter idle NULL 0.000 2 system user   NULL Sleep 1230 wsrep applier committed NULL 0.000 4 system user   NULL Daemon NULL InnoDB purge coordinator NULL 0.000 5 system user   NULL Daemon NULL InnoDB purge worker NULL 0.000 6 system user   NULL Daemon NULL InnoDB purge worker NULL 0.000 7 system user   NULL Daemon NULL InnoDB purge worker NULL 0.000 8 system user   NULL Daemon NULL InnoDB shutdown handler NULL 0.000 11 system user   NULL Sleep 1229 After apply log event NULL 0.000 12 system user   NULL Sleep 1229 After apply log event NULL 0.000 13 system user   NULL Sleep 1229 wsrep applier committed NULL 0.000 14 system user   NULL Sleep 1229 wsrep applier committed NULL 0.000 15 system user   NULL Sleep 1230 wsrep applier committed NULL 0.000 16 system user   NULL Sleep 1229 committing NULL 0.000 17 system user   NULL Sleep 1229 committing NULL 0.000 18 system user   NULL Sleep 1229 wsrep applier committed NULL 0.000 19 system user   NULL Sleep 1229 wsrep applier committed NULL 0.000 20 system user   NULL Sleep 1230 wsrep applier committed NULL 0.000 21 system user   NULL Sleep 1229 After apply 
log event NULL 0.000 22 system user   xxxxxx_2021 Sleep 1229 After create ALTER TABLE `stc_16_cblist` ADD `field_column` text 0.000 23 system user   NULL Sleep 1229 After apply log event NULL 0.000 24 system user   NULL Sleep 1229 committing NULL 0.000 25 system user   NULL Sleep 1229 wsrep applier committed NULL 0.000 26 system user   NULL Sleep 1230 wsrep applier committed NULL 0.000 27 system user   NULL Sleep 1229 wsrep applier committed NULL 0.000 28 system user   NULL Sleep 1229 wsrep applier committed NULL 0.000 29 system user   NULL Sleep 1230 wsrep applier committed NULL 0.000 30 system user   NULL Sleep 1230 wsrep applier committed NULL 0.000 31 system user   NULL Sleep 1229 After apply log event NULL 0.000 32 system user   NULL Sleep 1229 wsrep applier committed NULL 0.000 34 system user   NULL Sleep 1229 wsrep applier committed NULL 0.000 1324367 root localhost NULL Query 0 Init show full processlist 0.000 I salvaged a core dump. But this setup has a interesting peculiarity with these dumps, which I haven't been able to explain . There seem to be certain threads missing in the core dump, which results in this behaviour: # gdb `which mysqld` ./core -batch -ex 'info threads' | tail -n1 29 ../sysdeps/unix/sysv/linux/poll.c: No such file or directory. warning: Couldn't find general-purpose registers in core file. 
PC register is not available 61 Thread 0x7f3a8cd756c0 (LWP 1) That "PC register is not available" error fails all "thread apply all" lookups, but we can get at some/most of the threads using a workaround: for x in `seq 1 61`; do echo -ex; echo "thread apply $x bt full"; done | xargs -d\\n gdb `which mysqld` ./core -batch Threads are all waiting for something non-obvious: # tail -n61 info-threads.txt | sed -e 's/^[*]/ /' | awk '{print $6}' | sort | uniq -c 1 19 __GI___nanosleep 1 __GI___poll 1 __GI___sigtimedwait 8 futex_abstimed_wait_cancelable 31 futex_wait_cancelable The Aborting thread appears to be number 47: Thread 47 (LWP 1202077): #0 0x00007f3a8b485ad3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x559b8bcd38f0) at ../sysdeps/unix/sysv/linux/futex-internal.h:88 #1 __pthread_cond_wait_common (abstime=0x0, mutex=0x559b8bcd38a0, cond=0x559b8bcd38c8) at pthread_cond_wait.c:502 #2 __pthread_cond_wait (cond=cond@entry=0x559b8bcd38c8, mutex=0x559b8bcd38a0) at pthread_cond_wait.c:655 #3 0x00007f3a5da09845 in gu::Lock::wait (this=0x7f38e01acb00, cond=...) at ./galerautils/src/gu_lock.hpp:46 #4 galera::ReplicatorSMM::wait_for_CLOSED (this=this@entry=0x559b8bcd3760, lock=...) 
at ./galera/src/replicator_smm.cpp:273 #5 0x00007f3a5da098c5 in galera::ReplicatorSMM::close (this=0x559b8bcd3760) at ./galera/src/replicator_smm.cpp:364 #6 0x00007f3a5d9e681b in galera_disconnect (gh=<optimized out>) at ./galera/src/wsrep_provider.cpp:208 #7 0x0000559b89905e51 in wsrep::wsrep_provider_v26::disconnect (this=<optimized out>) at ./wsrep-lib/src/wsrep_provider_v26.cpp:759 #8 0x0000559b898f329b in wsrep::server_state::disconnect (this=0x559b8bc9e790) at ./wsrep-lib/src/server_state.cpp:540 #9 0x0000559b89011888 in unireg_abort (exit_code=exit_code@entry=1) at ./sql/mysqld.cc:1862 #10 0x0000559b8924d4d9 in wsrep_handle_mdl_conflict (requestor_ctx=requestor_ctx@entry=0x7f361c000d28, ticket=ticket@entry=0x7f3638020b80, key=key@entry=0x7f38f86299f8) at ./sql/wsrep_mysqld.cc:2324 #11 0x0000559b891c71f6 in MDL_lock::can_grant_lock (this=this@entry=0x7f38f86299f8, type_arg=<optimized out>, requestor_ctx=requestor_ctx@entry=0x7f361c000d28, ignore_lock_priority=ignore_lock_priority@entry=false) at ./sql/mdl.cc:1769 #12 0x0000559b891c7915 in MDL_context::try_acquire_lock_impl (this=this@entry=0x7f361c000d28, mdl_request=mdl_request@entry=0x7f38e01ad080, out_ticket=out_ticket@entry=0x7f38e01acf98) at ./sql/mdl.cc:2100 #13 0x0000559b891c7a4e in MDL_context::acquire_lock (this=this@entry=0x7f361c000d28, mdl_request=mdl_request@entry=0x7f38e01ad080, lock_wait_timeout=lock_wait_timeout@entry=86400) at ./sql/mdl.cc:2247 #14 0x0000559b891c85c2 in MDL_context::upgrade_shared_lock (this=this@entry=0x7f361c000d28, mdl_ticket=0x7f361c07de10, new_type=new_type@entry=MDL_EXCLUSIVE, lock_wait_timeout=86400) at ./sql/mdl.cc:2523 #15 0x0000559b8916c826 in mysql_inplace_alter_table (target_mdl_request=0x7f38e01ae300, alter_ctx=0x7f38e01af420, ha_alter_info=0x7f38e01ad3d0, altered_table=0x7f38e01ad470, table=0x7f362c0f19e8, table_list=0x7f361c00be40, thd=0x7f361c000c08) at ./sql/sql_table.cc:7730 #16 mysql_alter_table (thd=thd@entry=0x7f361c000c08, 
new_db=new_db@entry=0x7f361c005230, new_name=new_name@entry=0x7f361c005658, create_info=create_info@entry=0x7f38e01b0010, table_list=<optimized out>, table_list@entry=0x7f361c00be40, alter_info=alter_info@entry=0x7f38e01aff50, order_num=0, order=0x0, ignore=false) at ./sql/sql_table.cc:10248 #17 0x0000559b891c28ce in Sql_cmd_alter_table::execute (this=<optimized out>, thd=0x7f361c000c08) at ./sql/sql_alter.cc:520 #18 0x0000559b890d58c7 in mysql_execute_command (thd=thd@entry=0x7f361c000c08) at ./sql/sql_parse.cc:6189 #19 0x0000559b890db70a in mysql_parse (thd=0x7f361c000c08, rawbuf=<optimized out>, length=51, parser_state=parser_state@entry=0x7f38e01b2740, is_com_multi=is_com_multi@entry=false, is_next_command=is_next_command@entry=false) at ./sql/sql_parse.cc:7992 #20 0x0000559b893e6f66 in Query_log_event::do_apply_event (this=0x7f361c0660d8, rgi=0x7f361c01c4b0, query_arg=0x7f361c0dfaca "ALTER TABLE `stc_16_cblist` ADD `field_column` text", q_len_arg=<optimized out>) at ./sql/log_event.cc:5734 #21 0x0000559b8925e3b8 in Log_event::apply_event (rgi=0x7f361c01c4b0, this=0x7f361c0660d8) at ./sql/log_event.h:1492 #22 wsrep_apply_events (thd=thd@entry=0x7f361c000c08, rli=<optimized out>, events_buf=<optimized out>, buf_len=0) at ./sql/wsrep_applier.cc:200 #23 0x0000559b89244ab9 in Wsrep_high_priority_service::apply_toi (this=0x7f38e01b3cf0, ws_meta=..., data=...) at ./sql/wsrep_high_priority_service.cc:382 #24 0x0000559b898f1eb3 in apply_toi (data=..., ws_meta=..., ws_handle=..., high_priority_service=..., provider=...) at ./wsrep-lib/src/server_state.cpp:461 #25 wsrep::server_state::on_apply (this=<optimized out>, high_priority_service=..., ws_handle=..., ws_meta=..., data=...) 
at ./wsrep-lib/src/server_state.cpp:1127 #26 0x0000559b899066d4 in wsrep::high_priority_service::apply (data=..., ws_meta=..., ws_handle=..., this=0x7f38e01b3cf0) at ./wsrep-lib/include/wsrep/high_priority_service.hpp:47 #27 (anonymous namespace)::apply_cb (ctx=0x7f38e01b3cf0, wsh=<optimized out>, flags=<optimized out>, buf=<optimized out>, meta=<optimized out>, exit_loop=0x7f38e01b300f) at ./wsrep-lib/src/wsrep_provider_v26.cpp:504 #28 0x00007f3a5d9fc265 in galera::TrxHandleSlave::apply (this=this@entry=0x7f36a0097820, recv_ctx=recv_ctx@entry=0x7f38e01b3cf0, apply_cb=0x559b899065a0 <(anonymous namespace)::apply_cb(void*, wsrep_ws_handle_t const*, uint32_t, wsrep_buf_t const*, wsrep_trx_meta_t const*, wsrep_bool_t*)>, meta=..., exit_loop=exit_loop@entry=@0x7f38e01b300f: false) at ./galera/src/trx_handle.cpp:391 #29 0x00007f3a5da0dc10 in galera::ReplicatorSMM::apply_trx (this=this@entry=0x559b8bcd3760, recv_ctx=recv_ctx@entry=0x7f38e01b3cf0, ts=...) at ./galera/src/replicator_smm.cpp:504 #30 0x00007f3a5da140f5 in galera::ReplicatorSMM::process_trx (this=0x559b8bcd3760, recv_ctx=0x7f38e01b3cf0, ts_ptr=...) 
at ./galera/src/replicator_smm.cpp:2127 #31 0x00007f3a5da466b5 in galera::GcsActionSource::process_writeset (this=0x559b8bce0c60, recv_ctx=0x7f38e01b3cf0, act=..., exit_loop=@0x7f38e01b38ff: false) at ./galera/src/gcs_action_source.cpp:62 #32 0x00007f3a5da472c2 in galera::GcsActionSource::dispatch (this=this@entry=0x559b8bce0c60, recv_ctx=recv_ctx@entry=0x7f38e01b3cf0, act=..., exit_loop=@0x7f38e01b38ff: false) at ./galera/src/gcs_action_source.cpp:110 #33 0x00007f3a5da47582 in galera::GcsActionSource::process (this=0x559b8bce0c60, recv_ctx=0x7f38e01b3cf0, exit_loop=@0x7f38e01b38ff: false) at ./galera/src/gcs_action_source.cpp:183 #34 0x00007f3a5da14b10 in galera::ReplicatorSMM::async_recv (this=0x559b8bcd3760, recv_ctx=0x7f38e01b3cf0) at ./galera/src/replicator_smm.cpp:390 #35 0x00007f3a5d9e69bb in galera_recv (gh=<optimized out>, recv_ctx=<optimized out>) at ./galera/src/wsrep_provider.cpp:236 #36 0x0000559b8990731e in wsrep::wsrep_provider_v26::run_applier (this=<optimized out>, applier_ctx=<optimized out>) at ./wsrep-lib/src/wsrep_provider_v26.cpp:796 #37 0x0000559b8926051f in wsrep_replication_process (thd=0x7f361c000c08, arg=<optimized out>) at ./sql/wsrep_thd.cc:57 #38 0x0000559b8924f9d3 in start_wsrep_THD (arg=arg@entry=0x559b972686f0) at ./sql/wsrep_mysqld.cc:2892 #39 0x0000559b89824e8a in pfs_spawn_thread (arg=0x559b97268718) at ./storage/perfschema/pfs.cc:1869 #40 0x00007f3a8b47f6db in start_thread (arg=0x7f38e01b4700) at pthread_create.c:463 #41 0x00007f3a8a87d71f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 I'd love to add more useful information, but I don't know where to go from here. I'll keep the core dump around for a while, in case jplindst or seppo or anyone else wants to know the value of something. If someone can explain the core dump madness where I have fewer register states than apparent threads, I'd love to hear that too. 
These broken corefiles are really getting on my nerves:

# objdump -x core | grep -E '^ *[0-9]+ [.](reg|reg2|reg-xstate|note.linuxcore.siginfo)/' | awk '{print $2}' | sed -e 's/\/[0-9].*//' | sort | uniq -c
     60 .note.linuxcore.siginfo
     60 .reg
     60 .reg-xstate
     60 .reg2
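For what it's worth, the pipeline above just tallies the per-thread note sections that `objdump -x core` lists, one group per thread. A self-contained illustration of the same counting, run over made-up sample lines (the real input is of course the objdump output):

```shell
# Hedged sketch: same grep|awk|sed|sort|uniq -c counting as above,
# fed a fabricated two-thread excerpt instead of a real core file.
sample='  1 .reg/123
  2 .reg2/123
  3 .reg/456
  4 .reg2/456'
printf '%s\n' "$sample" \
  | grep -E '^ *[0-9]+ [.](reg|reg2|reg-xstate|note.linuxcore.siginfo)/' \
  | awk '{print $2}' \
  | sed -e 's/\/[0-9].*//' \
  | sort | uniq -c
# Two threads, so each section name should count to 2.
```

The `/123` and `/456` suffixes stand in for thread IDs; stripping them with `sed` before `uniq -c` is what collapses the per-thread entries into one count per section type.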

            jplindst Jan Lindström (Inactive) added a comment -

            wdoekes I strongly suggest upgrading to MariaDB 10.4.22 and Galera 26.4.9; they fix an important problem around MDL locking (see MDEV-25114).
            wdoekes Walter Doekes added a comment -

            Heh. I had been waiting for that release (and their 10.3.x counterparts). I did not know it was actually out yet. Will schedule some upgrading here and there..

            wdoekes Walter Doekes added a comment -

            And here I was wondering why it takes so long for the Ubuntu repos to get updates...

            Turns out a mirror has stalled:

            # apt-cache policy mariadb-server-10.4
            mariadb-server-10.4:
              Installed: (none)
              Candidate: 1:10.4.20+maria~bionic
              Version table:
                 1:10.4.20+maria~bionic 500
                    500 http://mirror.i3d.net/pub/mariadb/repo/10.4/ubuntu bionic/main amd64 Packages
                 1:10.4.19+maria~bionic 500
                    500 http://mirror.i3d.net/pub/mariadb/repo/10.4/ubuntu bionic/main amd64 Packages
            ...
            

            Versus:

            # apt-cache policy mariadb-server-10.4
            mariadb-server-10.4:
              Installed: (none)
              Candidate: 1:10.4.22+maria~bionic
              Version table:
                 1:10.4.22+maria~bionic 500
                    500 http://ftp.nluug.nl/db/mariadb/repo/10.4/ubuntu bionic/main amd64 Packages
                 1:10.4.21+maria~bionic 500
                    500 http://ftp.nluug.nl/db/mariadb/repo/10.4/ubuntu bionic/main amd64 Packages
            ...
            

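            As an aside, the two candidate versions the mirrors advertise can be compared without apt at all. A rough sketch using `sort -V` (natural version sort; note that dpkg's real comparison algorithm treats epochs and `~` specially, so this is only an approximation):

```shell
# Sketch: which of the two advertised candidates is newer?
# Versions taken from the apt-cache policy output above.
printf '%s\n' '1:10.4.20+maria~bionic' '1:10.4.22+maria~bionic' \
  | sort -V | tail -n1
# The stalled mirror's 1:10.4.20 candidate sorts below 1:10.4.22.
```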
            wdoekes Walter Doekes added a comment -

            And JIRA doesn't accept smiley faces. I had included the sweat_smile emoji ( https://emojipedia.org/grinning-face-with-sweat/ ) in the message, but got an error that the JIRA server could not be contacted during submit. Some database charset/collation limitation on the backend, perhaps?


            People

              jplindst Jan Lindström (Inactive)
              Fišer Dominik Fišer