[MDEV-21577] Galera crash after MDL BF-BF conflict Created: 2020-01-27  Updated: 2021-11-18  Resolved: 2020-11-03

Status: Closed
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.4.11
Fix Version/s: 10.4.16, 10.5.7, 10.6.0

Type: Bug Priority: Critical
Reporter: Dominik Fišer Assignee: Jan Lindström (Inactive)
Resolution: Fixed Votes: 3
Labels: crash
Environment:

Debian 9.11


Attachments: Text File node1.txt     Text File node2.txt    
Issue Links:
Relates
relates to MDEV-24119 MDL BF-BF Conflict caused by TRUNCATE... Closed
relates to MDEV-23997 truncate is crashing mariadb galera c... Closed

 Description   

The server starts crashing after upgrading to MariaDB 10.4.11. We have a 3-node Galera cluster. The error is probably related to dropping and recreating a table which is in use by another node.
Logs from the last few crashes:

led 24 22:00:19 db2 mysqld[18625]: 2020-01-24 22:00:19 16 [Note] WSREP: MDL BF-BF conflict
led 24 22:00:19 db2 mysqld[18625]: schema:  s_demo
led 24 22:00:19 db2 mysqld[18625]: [157B blob data]
led 24 22:00:19 db2 mysqld[18625]: granted: (11         seqno 804791504         wsrep (toi, exec, committed) cmd 0 1         CREATE TABLE IF NOT EXISTS `shopio_vouchers_products` (
led 24 22:00:19 db2 mysqld[18625]:   `id` int(11) NOT NULL AUTO_INCREMENT,
led 24 22:00:19 db2 mysqld[18625]:   `vouchers_id` int(11) NOT NULL,
led 24 22:00:19 db2 mysqld[18625]:   `products_id` int(11) NOT NULL,
led 24 22:00:19 db2 mysqld[18625]:   PRIMARY KEY (`id`),
led 24 22:00:19 db2 mysqld[18625]:   UNIQUE KEY `unique_ids` (`vouchers_id`,`products_id`),
led 24 22:00:19 db2 mysqld[18625]:   KEY `products_id` (`products_id`),
led 24 22:00:19 db2 mysqld[18625]:   CONSTRAINT FOREIGN KEY (`products_id`) REFERENCES `shopio_products` (`id`) ON DELETE CASCADE ON UPDATE CASCADE,
led 24 22:00:19 db2 mysqld[18625]:   CONSTRAINT FOREIGN KEY (`vouchers_id`) REFERENCES `shopio_vouchers` (`id`) ON DELETE CASCADE ON UPDATE CASCADE
led 24 22:00:19 db2 mysqld[18625]: ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4)
led 24 22:00:19 db2 mysqld[18625]: 2020-01-24 22:00:19 16 [ERROR] Aborting
led 24 22:08:32 db2 systemd[1]: Stopping MariaDB 10.4.11 database server...
led 24 22:08:32 db2 mysqld[18625]: 2020-01-24 22:08:32 0 [Warning] WSREP: server: db2 unallowed state transition: disconnecting -> disconnecting
led 24 22:08:32 db2 mysqld[18625]: mysqld: /home/buildbot/buildbot/build/mariadb-10.4.11/wsrep-lib/src/server_state.cpp:1340: void wsrep::server_state::state(wsrep::unique_lock<wsrep::mutex>&, wsrep::server_state::state): Assertion `0' failed.
led 24 22:08:32 db2 mysqld[18625]: 200124 22:08:32 [ERROR] mysqld got signal 6 ;
led 24 22:08:32 db2 mysqld[18625]: This could be because you hit a bug. It is also possible that this binary
led 24 22:08:32 db2 mysqld[18625]: or one of the libraries it was linked against is corrupt, improperly built,
led 24 22:08:32 db2 mysqld[18625]: or misconfigured. This error can also be caused by malfunctioning hardware.
led 24 22:08:32 db2 mysqld[18625]: To report this bug, see https://mariadb.com/kb/en/reporting-bugs
led 24 22:08:32 db2 mysqld[18625]: We will try our best to scrape up some info that will hopefully help
led 24 22:08:32 db2 mysqld[18625]: diagnose the problem, but since we have already crashed,
led 24 22:08:32 db2 mysqld[18625]: something is definitely wrong and this may fail.
led 24 22:08:32 db2 mysqld[18625]: Server version: 10.4.11-MariaDB-1:10.4.11+maria~stretch-log
led 24 22:08:32 db2 mysqld[18625]: key_buffer_size=1048576
led 24 22:08:32 db2 mysqld[18625]: read_buffer_size=262144
led 24 22:08:32 db2 mysqld[18625]: max_used_connections=44
led 24 22:08:32 db2 mysqld[18625]: max_threads=2327
led 24 22:08:32 db2 mysqld[18625]: thread_count=21
led 24 22:08:32 db2 mysqld[18625]: It is possible that mysqld could use up to
led 24 22:08:32 db2 mysqld[18625]: key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 5419764 K  bytes of memory
led 24 22:08:32 db2 mysqld[18625]: Hope that's ok; if not, decrease some variables in the equation.
led 24 22:08:32 db2 mysqld[18625]: Thread pointer: 0x0
led 24 22:08:32 db2 mysqld[18625]: Attempting backtrace. You can use the following information to find out
led 24 22:08:32 db2 mysqld[18625]: where mysqld died. If you see no messages after this, something went
led 24 22:08:32 db2 mysqld[18625]: terribly wrong...
led 24 22:08:32 db2 mysqld[18625]: stack_bottom = 0x0 thread_stack 0x31000
led 24 22:08:32 db2 mysqld[18625]: *** buffer overflow detected ***: /usr/sbin/mysqld terminated
led 24 22:08:32 db2 mysqld[18625]: ======= Backtrace: =========
led 24 22:08:32 db2 mysqld[18625]: /lib/x86_64-linux-gnu/libc.so.6(+0x70bfb)[0x7f39ad77cbfb]
led 24 22:08:32 db2 mysqld[18625]: /lib/x86_64-linux-gnu/libc.so.6(__fortify_fail+0x37)[0x7f39ad805437]
led 24 22:08:32 db2 mysqld[18625]: /lib/x86_64-linux-gnu/libc.so.6(+0xf7570)[0x7f39ad803570]
led 24 22:08:32 db2 mysqld[18625]: /lib/x86_64-linux-gnu/libc.so.6(+0xf93aa)[0x7f39ad8053aa]
led 24 22:08:32 db2 mysqld[18625]: /usr/sbin/mysqld(my_addr_resolve+0xe2)[0x559a16f0f492]

led 25 02:00:21 db2 mysqld[11890]: 2020-01-25  2:00:21 1 [Warning] WSREP: BF applier failed to open_and_lock_tables: 1146, fatal: 0 wsrep = (exec_mode: 2 conflict_state: 0 seqno: 805439671)
led 25 02:00:21 db2 mysqld[11890]: 2020-01-25  2:00:21 1 [ERROR] Slave SQL: Error executing row event: 'Table 's_demo.shopio_vouchers_products' doesn't exist', Internal MariaDB error code: 1146
led 25 02:00:21 db2 mysqld[11890]: 2020-01-25  2:00:21 1 [Warning] WSREP: Event 23 Update_rows_v1 apply failed: 1146, seqno 805439671
led 25 02:00:21 db2 mysqld[11890]: 2020-01-25  2:00:21 1 [ERROR] WSREP: Failed to apply write set: gtid: 54078096-9d91-11e9-8472-a7c937a2aaa8:805439671 server_id: 34aa6dfd-3d7d-11ea-9f02-1e9305220e61 client_id: 1
led 25 02:00:21 db2 mysqld[11890]: 2020-01-25  2:00:21 1 [Note] WSREP: Closing send monitor...
led 25 02:00:21 db2 mysqld[11890]: 2020-01-25  2:00:21 1 [Note] WSREP: Closed send monitor.
led 25 02:00:21 db2 mysqld[11890]: 2020-01-25  2:00:21 1 [Note] WSREP: gcomm: terminating thread
led 25 02:00:21 db2 mysqld[11890]: 2020-01-25  2:00:21 1 [Note] WSREP: gcomm: joining thread
led 25 02:00:21 db2 mysqld[11890]: 2020-01-25  2:00:21 1 [Note] WSREP: gcomm: closing backend

led 25 07:00:19 db2 mysqld[17264]: 2020-01-25  7:00:19 16 [Note] WSREP: MDL BF-BF conflict
led 25 07:00:19 db2 mysqld[17264]: schema:  s_demo
led 25 07:00:19 db2 mysqld[17264]: request: (16         seqno 806396466         wsrep (toi, exec, committed) cmd 0 3         /*!40000 ALTER TABLE `shopio_vouchers_products` DISABLE KEYS */)
led 25 07:00:19 db2 mysqld[17264]: [157B blob data]
led 25 07:00:19 db2 mysqld[17264]: 2020-01-25  7:00:19 16 [ERROR] Aborting
led 25 07:10:24 db2 systemd[1]: Stopping MariaDB 10.4.11 database server...
led 25 07:10:24 db2 mysqld[17264]: 2020-01-25  7:10:24 0 [Warning] WSREP: server: db2 unallowed state transition: disconnecting -> disconnecting
led 25 07:10:24 db2 mysqld[17264]: mysqld: /home/buildbot/buildbot/build/mariadb-10.4.11/wsrep-lib/src/server_state.cpp:1340: void wsrep::server_state::state(wsrep::unique_lock<wsrep::mutex>&, wsrep::server_state
led 25 07:10:24 db2 mysqld[17264]: 200125  7:10:24 [ERROR] mysqld got signal 6 ;
led 25 07:10:24 db2 mysqld[17264]: This could be because you hit a bug. It is also possible that this binary
led 25 07:10:24 db2 mysqld[17264]: or one of the libraries it was linked against is corrupt, improperly built,
led 25 07:10:24 db2 mysqld[17264]: or misconfigured. This error can also be caused by malfunctioning hardware.
led 25 07:10:24 db2 mysqld[17264]: To report this bug, see https://mariadb.com/kb/en/reporting-bugs
led 25 07:10:24 db2 mysqld[17264]: We will try our best to scrape up some info that will hopefully help
led 25 07:10:24 db2 mysqld[17264]: diagnose the problem, but since we have already crashed,
led 25 07:10:24 db2 mysqld[17264]: something is definitely wrong and this may fail.
led 25 07:10:24 db2 mysqld[17264]: Server version: 10.4.11-MariaDB-1:10.4.11+maria~stretch-log
led 25 07:10:24 db2 mysqld[17264]: key_buffer_size=1048576
led 25 07:10:24 db2 mysqld[17264]: read_buffer_size=262144
led 25 07:10:24 db2 mysqld[17264]: max_used_connections=39
led 25 07:10:24 db2 mysqld[17264]: max_threads=2327
led 25 07:10:24 db2 mysqld[17264]: thread_count=26
led 25 07:10:24 db2 mysqld[17264]: It is possible that mysqld could use up to
led 25 07:10:24 db2 mysqld[17264]: key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 5419764 K  bytes of memory



 Comments   
Comment by Dominik Fišer [ 2020-01-28 ]

A similar problem again; this time mysqld crashed while stopping after an MDL BF-BF conflict, but the Galera cluster was already broken after the error:

led 28 09:00:17 db1 mysqld[2780]: 2020-01-28  9:00:17 13 [Note] WSREP: MDL BF-BF conflict
led 28 09:00:17 db1 mysqld[2780]: schema:  s_demo
led 28 09:00:17 db1 mysqld[2780]: [157B blob data]
led 28 09:00:17 db1 mysqld[2780]: granted: (10         seqno 819779075         wsrep (toi, exec, committed) cmd 0 8         --
led 28 09:00:17 db1 mysqld[2780]: -- Dumping data for table `shopio_vouchers_products`
led 28 09:00:17 db1 mysqld[2780]: --
led 28 09:00:17 db1 mysqld[2780]: TRUNCATE TABLE `shopio_vouchers_products`)
led 28 09:00:17 db1 mysqld[2780]: 2020-01-28  9:00:17 13 [ERROR] Aborting
led 28 09:23:24 db1 systemd[1]: Stopping MariaDB 10.4.11 database server...
led 28 09:23:24 db1 mysqld[2780]: 2020-01-28  9:23:24 0 [Warning] WSREP: server: db1 unallowed state transition: disconnecting -> disconnecting
led 28 09:23:24 db1 mysqld[2780]: mysqld: /home/buildbot/buildbot/build/mariadb-10.4.11/wsrep-lib/src/server_state.cpp:1340: void wsrep::server_state::state(wsrep::unique_lock<wsrep::mutex>&, wsrep::server_state:
led 28 09:23:24 db1 mysqld[2780]: 200128  9:23:24 [ERROR] mysqld got signal 6 ;
led 28 09:23:24 db1 mysqld[2780]: This could be because you hit a bug. It is also possible that this binary
led 28 09:23:24 db1 mysqld[2780]: or one of the libraries it was linked against is corrupt, improperly built,
led 28 09:23:24 db1 mysqld[2780]: or misconfigured. This error can also be caused by malfunctioning hardware.
led 28 09:23:24 db1 mysqld[2780]: To report this bug, see https://mariadb.com/kb/en/reporting-bugs
led 28 09:23:24 db1 mysqld[2780]: We will try our best to scrape up some info that will hopefully help
led 28 09:23:24 db1 mysqld[2780]: diagnose the problem, but since we have already crashed,
led 28 09:23:24 db1 mysqld[2780]: something is definitely wrong and this may fail.
led 28 09:23:24 db1 mysqld[2780]: Server version: 10.4.11-MariaDB-1:10.4.11+maria~stretch-log
led 28 09:23:24 db1 mysqld[2780]: key_buffer_size=1048576
led 28 09:23:24 db1 mysqld[2780]: read_buffer_size=262144
led 28 09:23:24 db1 mysqld[2780]: max_used_connections=36
led 28 09:23:24 db1 mysqld[2780]: max_threads=1816
led 28 09:23:24 db1 mysqld[2780]: thread_count=19
led 28 09:23:24 db1 mysqld[2780]: It is possible that mysqld could use up to
led 28 09:23:24 db1 mysqld[2780]: key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 4229824 K  bytes of memory
led 28 09:23:24 db1 mysqld[2780]: Hope that's ok; if not, decrease some variables in the equation.
led 28 09:23:24 db1 mysqld[2780]: Thread pointer: 0x0
led 28 09:23:24 db1 mysqld[2780]: Attempting backtrace. You can use the following information to find out
led 28 09:23:24 db1 mysqld[2780]: where mysqld died. If you see no messages after this, something went
led 28 09:23:24 db1 mysqld[2780]: terribly wrong...
led 28 09:23:24 db1 mysqld[2780]: stack_bottom = 0x0 thread_stack 0x31000
led 28 09:23:24 db1 mysqld[2780]: /usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x5585c0859a6e]
led 28 09:23:24 db1 mysqld[2780]: /usr/sbin/mysqld(handle_fatal_signal+0x3af)[0x5585c02e66bf]
led 28 09:23:24 db1 mysqld[2780]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x110e0)[0x7f7c4ab790e0]
led 28 09:23:24 db1 mysqld[2780]: /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcf)[0x7f7c49217fff]
led 28 09:23:25 db1 mysqld[2780]: /lib/x86_64-linux-gnu/libc.so.6(abort+0x16a)[0x7f7c4921942a]
led 28 09:23:25 db1 mysqld[2780]: /lib/x86_64-linux-gnu/libc.so.6(+0x2be67)[0x7f7c49210e67]
led 28 09:23:25 db1 mysqld[2780]: /lib/x86_64-linux-gnu/libc.so.6(+0x2bf12)[0x7f7c49210f12]
led 28 09:23:25 db1 mysqld[2780]: /usr/sbin/mysqld(_ZN5wsrep12server_state5stateERNS_11unique_lockINS_5mutexEEENS0_5stateE+0x755)[0x5585c08f0ad5]
led 28 09:23:25 db1 mysqld[2780]: /usr/sbin/mysqld(_ZN5wsrep12server_state10disconnectEv+0x57)[0x5585c08f0bc7]
led 28 09:23:25 db1 mysqld[2780]: /usr/sbin/mysqld(_Z26wsrep_shutdown_replicationv+0xba)[0x5585c024a22a]
led 28 09:23:25 db1 mysqld[2780]: /usr/sbin/mysqld(_Z11mysqld_mainiPPc+0x1c0d)[0x5585c001306d]
led 28 09:23:25 db1 mysqld[2780]: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1)[0x7f7c492052e1]
led 28 09:23:25 db1 mysqld[2780]: /usr/sbin/mysqld(_start+0x2a)[0x5585c000539a]
led 28 09:23:26 db1 systemd[1]: mariadb.service: Main process exited, code=killed, status=6/ABRT

MariaDB 10.4.11 with Galera cluster looks very unstable, so it cannot be used in production. IST (with mariabackup as the SST method) doesn't work well either: it fails after a mysqld restart and a full transfer is required. All of this worked well with the previous 10.3.

Comment by Dominik Fišer [ 2020-02-21 ]

I have upgraded the cluster to version 10.4.12 and the problem is still there. The only difference is that after an MDL BF-BF conflict only one node breaks, not the whole cluster. On the broken node mysqld keeps running, but its cluster status is Non-primary and the process has to be killed.

Comment by Seppo Jaakola [ 2020-10-30 ]

There is not much detail in this issue description to troubleshoot with. However, there is earlier work on a related issue where OPTIMIZE TABLE was executed on a table that has a foreign key (FK) constraint while another replication applier thread was applying a transaction on the FK parent table. These two executions (the OPTIMIZE and the apply on the parent table) ended up in a similar MDL BF-BF conflict error.

Later analysis of that OPTIMIZE issue revealed that OPTIMIZE, REPAIR and ALTER TABLE can cause "excessive MDL locking" of the FK parent table, and that the parent table key was not pushed into the replication write set for certification. A pull request to fix this has recently been submitted (https://github.com/MariaDB/server/pull/1693).

If the crashes listed in this Jira ticket all happen on tables with a foreign key constraint relation, it is possible that PR 1693 helps with this case as well.
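As a hypothetical illustration of the workload described above (table and column names are made up, not taken from this report), the conflicting pair of operations would look roughly like:

```sql
-- Node 1 (TOI DDL): takes MDL on the child table, and -- per the
-- analysis above -- also ends up MDL-locking the FK parent table.
-- Assume: FOREIGN KEY (parent_id) REFERENCES parent (id) on `child`.
OPTIMIZE TABLE child;

-- Meanwhile, a high-priority applier thread on node 1 is applying a
-- replicated transaction that touches the parent table:
UPDATE parent SET touched_at = NOW() WHERE id = 42;
```

Because the parent table key was not part of the certification write set, the two brute-force (BF) executions are not ordered against each other, and the MDL subsystem reports the BF-BF conflict seen in the logs.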

Comment by Yanjun Wu [ 2021-02-26 ]

I still have this problem in 10.4.18

2021-02-26 13:46:46 11 [Note] WSREP: MDL BF-BF conflict
schema: keystone
request: (11 seqno 5284036 wsrep (toi, exec, committed) cmd 0 9 DROP TABLE IF EXISTS `password`)
granted: (16 seqno 5284037 wsrep (high priority, exec, committing) cmd 0 161 (null))
2021-02-26 13:46:46 11 [ERROR] Aborting

Comment by Jan Lindström (Inactive) [ 2021-02-26 ]

Can you run with --wsrep-debug=1 and provide error log from all nodes?
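For reference, a minimal configuration sketch (an assumption, not taken from this report) of what the requested option looks like when set persistently; recent 10.4 releases also accept named debug levels for this variable:

```ini
# my.cnf sketch: enable wsrep debug logging on every node, restart,
# then collect the error log from each node after the next conflict.
[mysqld]
wsrep_debug = 1
log_error   = /var/log/mysql/error.log   # hypothetical path
```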

Comment by Yanjun Wu [ 2021-02-27 ]

node1.txt node2.txt
I sourced an SQL file on node1, which may cause node2 to crash. I tried a few times and it occurred once.

Comment by Walter Doekes [ 2021-11-18 ]

The followup by Fišer accurately describes the hanging mariadbd 10.4.20 that I ran into just now:

  • mariadb (1:10.4.20+maria~bionic)
  • galera-4 (26.4.8-bionic)
  • statement: ALTER TABLE `stc_16_cblist` ADD `field_column` text
  • the Aborting node went non-primary and hung; the rest of the (3-node) cluster was unaffected (the other two considered it desynced)

2021-11-17 16:10:06 22 [Note] WSREP: MDL BF-BF conflict
schema:  redacted_2021
request: (22 	seqno 31261949 	wsrep (toi, exec, committed) cmd 0 3 	ALTER TABLE `stc_16_cblist` ADD `field_column` text)
granted: (17 	seqno 31261952 	wsrep (high priority, exec, committing) cmd 0 161 	(null))
2021-11-17 16:10:06 22 [ERROR] Aborting

SHOW PROCESSLIST shows all threads in Daemon or Sleep state:

Id User Host db Cmd Time State Info Progr.
1 system user   NULL Sleep 490318 wsrep aborter idle NULL 0.000
2 system user   NULL Sleep 1230 wsrep applier committed NULL 0.000
4 system user   NULL Daemon NULL InnoDB purge coordinator NULL 0.000
5 system user   NULL Daemon NULL InnoDB purge worker NULL 0.000
6 system user   NULL Daemon NULL InnoDB purge worker NULL 0.000
7 system user   NULL Daemon NULL InnoDB purge worker NULL 0.000
8 system user   NULL Daemon NULL InnoDB shutdown handler NULL 0.000
11 system user   NULL Sleep 1229 After apply log event NULL 0.000
12 system user   NULL Sleep 1229 After apply log event NULL 0.000
13 system user   NULL Sleep 1229 wsrep applier committed NULL 0.000
14 system user   NULL Sleep 1229 wsrep applier committed NULL 0.000
15 system user   NULL Sleep 1230 wsrep applier committed NULL 0.000
16 system user   NULL Sleep 1229 committing NULL 0.000
17 system user   NULL Sleep 1229 committing NULL 0.000
18 system user   NULL Sleep 1229 wsrep applier committed NULL 0.000
19 system user   NULL Sleep 1229 wsrep applier committed NULL 0.000
20 system user   NULL Sleep 1230 wsrep applier committed NULL 0.000
21 system user   NULL Sleep 1229 After apply log event NULL 0.000
22 system user   xxxxxx_2021 Sleep 1229 After create ALTER TABLE `stc_16_cblist` ADD `field_column` text 0.000
23 system user   NULL Sleep 1229 After apply log event NULL 0.000
24 system user   NULL Sleep 1229 committing NULL 0.000
25 system user   NULL Sleep 1229 wsrep applier committed NULL 0.000
26 system user   NULL Sleep 1230 wsrep applier committed NULL 0.000
27 system user   NULL Sleep 1229 wsrep applier committed NULL 0.000
28 system user   NULL Sleep 1229 wsrep applier committed NULL 0.000
29 system user   NULL Sleep 1230 wsrep applier committed NULL 0.000
30 system user   NULL Sleep 1230 wsrep applier committed NULL 0.000
31 system user   NULL Sleep 1229 After apply log event NULL 0.000
32 system user   NULL Sleep 1229 wsrep applier committed NULL 0.000
34 system user   NULL Sleep 1229 wsrep applier committed NULL 0.000
1324367 root localhost NULL Query 0 Init show full processlist 0.000

I salvaged a core dump. But this setup has an interesting peculiarity with these dumps which I haven't been able to explain: certain threads seem to be missing from the core dump, which results in this behaviour:

# gdb `which mysqld` ./core -batch -ex 'info threads' | tail -n1
29	../sysdeps/unix/sysv/linux/poll.c: No such file or directory.
warning: Couldn't find general-purpose registers in core file.
PC register is not available
  61   Thread 0x7f3a8cd756c0 (LWP 1)

That "PC register is not available" error breaks all "thread apply all" lookups, but we can get at some/most of the threads using a workaround:

for x in `seq 1 61`; do echo -ex; echo "thread apply $x bt full"; done |
  xargs -d\\n gdb `which mysqld` ./core -batch
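The same per-thread command list can also be generated with a short script instead of the shell loop (a convenience sketch; the thread count of 61 is taken from the gdb output above):

```python
# Build the argument list that the shell one-liner above feeds to gdb:
# one "-ex 'thread apply N bt full'" pair per thread, so each thread's
# backtrace is requested individually instead of via "thread apply all".
def gdb_backtrace_args(n_threads):
    args = []
    for n in range(1, n_threads + 1):
        args += ["-ex", "thread apply %d bt full" % n]
    return args

args = gdb_backtrace_args(61)
# gdb would then be invoked roughly as:
#   gdb /usr/sbin/mysqld ./core -batch <args...>
```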

Threads are all waiting for something non-obvious:

# tail -n61 info-threads.txt  | sed -e 's/^[*]/ /' | awk '{print $6}' | sort | uniq -c
      1 
     19 __GI___nanosleep
      1 __GI___poll
      1 __GI___sigtimedwait
      8 futex_abstimed_wait_cancelable
     31 futex_wait_cancelable

The Aborting thread appears to be number 47:

Thread 47 (LWP 1202077):
#0  0x00007f3a8b485ad3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x559b8bcd38f0) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x559b8bcd38a0, cond=0x559b8bcd38c8) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=cond@entry=0x559b8bcd38c8, mutex=0x559b8bcd38a0) at pthread_cond_wait.c:655
#3  0x00007f3a5da09845 in gu::Lock::wait (this=0x7f38e01acb00, cond=...) at ./galerautils/src/gu_lock.hpp:46
#4  galera::ReplicatorSMM::wait_for_CLOSED (this=this@entry=0x559b8bcd3760, lock=...) at ./galera/src/replicator_smm.cpp:273
#5  0x00007f3a5da098c5 in galera::ReplicatorSMM::close (this=0x559b8bcd3760) at ./galera/src/replicator_smm.cpp:364
#6  0x00007f3a5d9e681b in galera_disconnect (gh=<optimized out>) at ./galera/src/wsrep_provider.cpp:208
#7  0x0000559b89905e51 in wsrep::wsrep_provider_v26::disconnect (this=<optimized out>) at ./wsrep-lib/src/wsrep_provider_v26.cpp:759
#8  0x0000559b898f329b in wsrep::server_state::disconnect (this=0x559b8bc9e790) at ./wsrep-lib/src/server_state.cpp:540
#9  0x0000559b89011888 in unireg_abort (exit_code=exit_code@entry=1) at ./sql/mysqld.cc:1862
#10 0x0000559b8924d4d9 in wsrep_handle_mdl_conflict (requestor_ctx=requestor_ctx@entry=0x7f361c000d28, ticket=ticket@entry=0x7f3638020b80, key=key@entry=0x7f38f86299f8) at ./sql/wsrep_mysqld.cc:2324
#11 0x0000559b891c71f6 in MDL_lock::can_grant_lock (this=this@entry=0x7f38f86299f8, type_arg=<optimized out>, requestor_ctx=requestor_ctx@entry=0x7f361c000d28, ignore_lock_priority=ignore_lock_priority@entry=false) at ./sql/mdl.cc:1769
#12 0x0000559b891c7915 in MDL_context::try_acquire_lock_impl (this=this@entry=0x7f361c000d28, mdl_request=mdl_request@entry=0x7f38e01ad080, out_ticket=out_ticket@entry=0x7f38e01acf98) at ./sql/mdl.cc:2100
#13 0x0000559b891c7a4e in MDL_context::acquire_lock (this=this@entry=0x7f361c000d28, mdl_request=mdl_request@entry=0x7f38e01ad080, lock_wait_timeout=lock_wait_timeout@entry=86400) at ./sql/mdl.cc:2247
#14 0x0000559b891c85c2 in MDL_context::upgrade_shared_lock (this=this@entry=0x7f361c000d28, mdl_ticket=0x7f361c07de10, new_type=new_type@entry=MDL_EXCLUSIVE, lock_wait_timeout=86400) at ./sql/mdl.cc:2523
#15 0x0000559b8916c826 in mysql_inplace_alter_table (target_mdl_request=0x7f38e01ae300, alter_ctx=0x7f38e01af420, ha_alter_info=0x7f38e01ad3d0, altered_table=0x7f38e01ad470, table=0x7f362c0f19e8, table_list=0x7f361c00be40, thd=0x7f361c000c08) at ./sql/sql_table.cc:7730
#16 mysql_alter_table (thd=thd@entry=0x7f361c000c08, new_db=new_db@entry=0x7f361c005230, new_name=new_name@entry=0x7f361c005658, create_info=create_info@entry=0x7f38e01b0010, table_list=<optimized out>, table_list@entry=0x7f361c00be40, alter_info=alter_info@entry=0x7f38e01aff50, order_num=0, order=0x0, ignore=false) at ./sql/sql_table.cc:10248
#17 0x0000559b891c28ce in Sql_cmd_alter_table::execute (this=<optimized out>, thd=0x7f361c000c08) at ./sql/sql_alter.cc:520
#18 0x0000559b890d58c7 in mysql_execute_command (thd=thd@entry=0x7f361c000c08) at ./sql/sql_parse.cc:6189
#19 0x0000559b890db70a in mysql_parse (thd=0x7f361c000c08, rawbuf=<optimized out>, length=51, parser_state=parser_state@entry=0x7f38e01b2740, is_com_multi=is_com_multi@entry=false, is_next_command=is_next_command@entry=false) at ./sql/sql_parse.cc:7992
#20 0x0000559b893e6f66 in Query_log_event::do_apply_event (this=0x7f361c0660d8, rgi=0x7f361c01c4b0, query_arg=0x7f361c0dfaca "ALTER TABLE `stc_16_cblist` ADD `field_column` text", q_len_arg=<optimized out>) at ./sql/log_event.cc:5734
#21 0x0000559b8925e3b8 in Log_event::apply_event (rgi=0x7f361c01c4b0, this=0x7f361c0660d8) at ./sql/log_event.h:1492
#22 wsrep_apply_events (thd=thd@entry=0x7f361c000c08, rli=<optimized out>, events_buf=<optimized out>, buf_len=0) at ./sql/wsrep_applier.cc:200
#23 0x0000559b89244ab9 in Wsrep_high_priority_service::apply_toi (this=0x7f38e01b3cf0, ws_meta=..., data=...) at ./sql/wsrep_high_priority_service.cc:382
#24 0x0000559b898f1eb3 in apply_toi (data=..., ws_meta=..., ws_handle=..., high_priority_service=..., provider=...) at ./wsrep-lib/src/server_state.cpp:461
#25 wsrep::server_state::on_apply (this=<optimized out>, high_priority_service=..., ws_handle=..., ws_meta=..., data=...) at ./wsrep-lib/src/server_state.cpp:1127
#26 0x0000559b899066d4 in wsrep::high_priority_service::apply (data=..., ws_meta=..., ws_handle=..., this=0x7f38e01b3cf0) at ./wsrep-lib/include/wsrep/high_priority_service.hpp:47
#27 (anonymous namespace)::apply_cb (ctx=0x7f38e01b3cf0, wsh=<optimized out>, flags=<optimized out>, buf=<optimized out>, meta=<optimized out>, exit_loop=0x7f38e01b300f) at ./wsrep-lib/src/wsrep_provider_v26.cpp:504
#28 0x00007f3a5d9fc265 in galera::TrxHandleSlave::apply (this=this@entry=0x7f36a0097820, recv_ctx=recv_ctx@entry=0x7f38e01b3cf0, apply_cb=0x559b899065a0 <(anonymous namespace)::apply_cb(void*, wsrep_ws_handle_t const*, uint32_t, wsrep_buf_t const*, wsrep_trx_meta_t const*, wsrep_bool_t*)>, meta=..., exit_loop=exit_loop@entry=@0x7f38e01b300f: false) at ./galera/src/trx_handle.cpp:391
#29 0x00007f3a5da0dc10 in galera::ReplicatorSMM::apply_trx (this=this@entry=0x559b8bcd3760, recv_ctx=recv_ctx@entry=0x7f38e01b3cf0, ts=...) at ./galera/src/replicator_smm.cpp:504
#30 0x00007f3a5da140f5 in galera::ReplicatorSMM::process_trx (this=0x559b8bcd3760, recv_ctx=0x7f38e01b3cf0, ts_ptr=...) at ./galera/src/replicator_smm.cpp:2127
#31 0x00007f3a5da466b5 in galera::GcsActionSource::process_writeset (this=0x559b8bce0c60, recv_ctx=0x7f38e01b3cf0, act=..., exit_loop=@0x7f38e01b38ff: false) at ./galera/src/gcs_action_source.cpp:62
#32 0x00007f3a5da472c2 in galera::GcsActionSource::dispatch (this=this@entry=0x559b8bce0c60, recv_ctx=recv_ctx@entry=0x7f38e01b3cf0, act=..., exit_loop=@0x7f38e01b38ff: false) at ./galera/src/gcs_action_source.cpp:110
#33 0x00007f3a5da47582 in galera::GcsActionSource::process (this=0x559b8bce0c60, recv_ctx=0x7f38e01b3cf0, exit_loop=@0x7f38e01b38ff: false) at ./galera/src/gcs_action_source.cpp:183
#34 0x00007f3a5da14b10 in galera::ReplicatorSMM::async_recv (this=0x559b8bcd3760, recv_ctx=0x7f38e01b3cf0) at ./galera/src/replicator_smm.cpp:390
#35 0x00007f3a5d9e69bb in galera_recv (gh=<optimized out>, recv_ctx=<optimized out>) at ./galera/src/wsrep_provider.cpp:236
#36 0x0000559b8990731e in wsrep::wsrep_provider_v26::run_applier (this=<optimized out>, applier_ctx=<optimized out>) at ./wsrep-lib/src/wsrep_provider_v26.cpp:796
#37 0x0000559b8926051f in wsrep_replication_process (thd=0x7f361c000c08, arg=<optimized out>) at ./sql/wsrep_thd.cc:57
#38 0x0000559b8924f9d3 in start_wsrep_THD (arg=arg@entry=0x559b972686f0) at ./sql/wsrep_mysqld.cc:2892
#39 0x0000559b89824e8a in pfs_spawn_thread (arg=0x559b97268718) at ./storage/perfschema/pfs.cc:1869
#40 0x00007f3a8b47f6db in start_thread (arg=0x7f38e01b4700) at pthread_create.c:463
#41 0x00007f3a8a87d71f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

I'd love to add more useful information, but I don't know where to go from here. I'll keep the core dump around for a while, in case jplindst or seppo or anyone else wants to know the value of something.

If someone can explain the core dump madness where I have fewer register states than apparent threads, I'd love to hear that too. These broken corefiles are really getting on my nerves:

# objdump -x core |
    grep -E '^ *[0-9]+ [.](reg|reg2|reg-xstate|note.linuxcore.siginfo)/' |
    awk '{print $2}' | sed -e 's/\/[0-9].*//' | sort | uniq -c 
     60 .note.linuxcore.siginfo
     60 .reg
     60 .reg-xstate
     60 .reg2

Comment by Jan Lindström (Inactive) [ 2021-11-18 ]

wdoekes, I strongly suggest upgrading to MariaDB 10.4.22 and Galera 26.4.9; they fix an important problem around MDL locking (see MDEV-25114).

Comment by Walter Doekes [ 2021-11-18 ]

Heh. I had been waiting for that release (and their 10.3.x counterparts). I did not know it was actually out yet. Will schedule some upgrading here and there..

Comment by Walter Doekes [ 2021-11-18 ]

And here I was wondering why it takes so long for the Ubuntu repos to get updates...

Turns out a mirror has stalled:

# apt-cache policy mariadb-server-10.4
mariadb-server-10.4:
  Installed: (none)
  Candidate: 1:10.4.20+maria~bionic
  Version table:
     1:10.4.20+maria~bionic 500
        500 http://mirror.i3d.net/pub/mariadb/repo/10.4/ubuntu bionic/main amd64 Packages
     1:10.4.19+maria~bionic 500
        500 http://mirror.i3d.net/pub/mariadb/repo/10.4/ubuntu bionic/main amd64 Packages
...

Versus:

# apt-cache policy mariadb-server-10.4
mariadb-server-10.4:
  Installed: (none)
  Candidate: 1:10.4.22+maria~bionic
  Version table:
     1:10.4.22+maria~bionic 500
        500 http://ftp.nluug.nl/db/mariadb/repo/10.4/ubuntu bionic/main amd64 Packages
     1:10.4.21+maria~bionic 500
        500 http://ftp.nluug.nl/db/mariadb/repo/10.4/ubuntu bionic/main amd64 Packages
...

Comment by Walter Doekes [ 2021-11-18 ]

And JIRA doesn't accept smiley faces. I had included the sweat_smile emoji ( https://emojipedia.org/grinning-face-with-sweat/ ) in the message, but got an error that the JIRA server could not be contacted during submit. Some database/charset/collation issue on the backend, perhaps?

Generated at Thu Feb 08 09:08:12 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.