|
Similar problem again, now mysql crash while stoping after MDL BF-BF conflict, but galera cluster was broken after error:
led 28 09:00:17 db1 mysqld[2780]: 2020-01-28 9:00:17 13 [Note] WSREP: MDL BF-BF conflict
|
led 28 09:00:17 db1 mysqld[2780]: schema: s_demo
|
led 28 09:00:17 db1 mysqld[2780]: [157B blob data]
|
led 28 09:00:17 db1 mysqld[2780]: granted: (10 seqno 819779075 wsrep (toi, exec, committed) cmd 0 8 --
|
led 28 09:00:17 db1 mysqld[2780]: -- Dumping data for table `shopio_vouchers_products`
|
led 28 09:00:17 db1 mysqld[2780]: --
|
led 28 09:00:17 db1 mysqld[2780]: TRUNCATE TABLE `shopio_vouchers_products`)
|
led 28 09:00:17 db1 mysqld[2780]: 2020-01-28 9:00:17 13 [ERROR] Aborting
|
led 28 09:23:24 db1 systemd[1]: Stopping MariaDB 10.4.11 database server...
|
led 28 09:23:24 db1 mysqld[2780]: 2020-01-28 9:23:24 0 [Warning] WSREP: server: db1 unallowed state transition: disconnecting -> disconnecting
|
led 28 09:23:24 db1 mysqld[2780]: mysqld: /home/buildbot/buildbot/build/mariadb-10.4.11/wsrep-lib/src/server_state.cpp:1340: void wsrep::server_state::state(wsrep::unique_lock<wsrep::mutex>&, wsrep::server_state:
|
led 28 09:23:24 db1 mysqld[2780]: 200128 9:23:24 [ERROR] mysqld got signal 6 ;
|
led 28 09:23:24 db1 mysqld[2780]: This could be because you hit a bug. It is also possible that this binary
|
led 28 09:23:24 db1 mysqld[2780]: or one of the libraries it was linked against is corrupt, improperly built,
|
led 28 09:23:24 db1 mysqld[2780]: or misconfigured. This error can also be caused by malfunctioning hardware.
|
led 28 09:23:24 db1 mysqld[2780]: To report this bug, see https://mariadb.com/kb/en/reporting-bugs
|
led 28 09:23:24 db1 mysqld[2780]: We will try our best to scrape up some info that will hopefully help
|
led 28 09:23:24 db1 mysqld[2780]: diagnose the problem, but since we have already crashed,
|
led 28 09:23:24 db1 mysqld[2780]: something is definitely wrong and this may fail.
|
led 28 09:23:24 db1 mysqld[2780]: Server version: 10.4.11-MariaDB-1:10.4.11+maria~stretch-log
|
led 28 09:23:24 db1 mysqld[2780]: key_buffer_size=1048576
|
led 28 09:23:24 db1 mysqld[2780]: read_buffer_size=262144
|
led 28 09:23:24 db1 mysqld[2780]: max_used_connections=36
|
led 28 09:23:24 db1 mysqld[2780]: max_threads=1816
|
led 28 09:23:24 db1 mysqld[2780]: thread_count=19
|
led 28 09:23:24 db1 mysqld[2780]: It is possible that mysqld could use up to
|
led 28 09:23:24 db1 mysqld[2780]: key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 4229824 K bytes of memory
|
led 28 09:23:24 db1 mysqld[2780]: Hope that's ok; if not, decrease some variables in the equation.
|
led 28 09:23:24 db1 mysqld[2780]: Thread pointer: 0x0
|
led 28 09:23:24 db1 mysqld[2780]: Attempting backtrace. You can use the following information to find out
|
led 28 09:23:24 db1 mysqld[2780]: where mysqld died. If you see no messages after this, something went
|
led 28 09:23:24 db1 mysqld[2780]: terribly wrong...
|
led 28 09:23:24 db1 mysqld[2780]: stack_bottom = 0x0 thread_stack 0x31000
|
led 28 09:23:24 db1 mysqld[2780]: /usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x5585c0859a6e]
|
led 28 09:23:24 db1 mysqld[2780]: /usr/sbin/mysqld(handle_fatal_signal+0x3af)[0x5585c02e66bf]
|
led 28 09:23:24 db1 mysqld[2780]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x110e0)[0x7f7c4ab790e0]
|
led 28 09:23:24 db1 mysqld[2780]: /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcf)[0x7f7c49217fff]
|
led 28 09:23:25 db1 mysqld[2780]: /lib/x86_64-linux-gnu/libc.so.6(abort+0x16a)[0x7f7c4921942a]
|
led 28 09:23:25 db1 mysqld[2780]: /lib/x86_64-linux-gnu/libc.so.6(+0x2be67)[0x7f7c49210e67]
|
led 28 09:23:25 db1 mysqld[2780]: /lib/x86_64-linux-gnu/libc.so.6(+0x2bf12)[0x7f7c49210f12]
|
led 28 09:23:25 db1 mysqld[2780]: /usr/sbin/mysqld(_ZN5wsrep12server_state5stateERNS_11unique_lockINS_5mutexEEENS0_5stateE+0x755)[0x5585c08f0ad5]
|
led 28 09:23:25 db1 mysqld[2780]: /usr/sbin/mysqld(_ZN5wsrep12server_state10disconnectEv+0x57)[0x5585c08f0bc7]
|
led 28 09:23:25 db1 mysqld[2780]: /usr/sbin/mysqld(_Z26wsrep_shutdown_replicationv+0xba)[0x5585c024a22a]
|
led 28 09:23:25 db1 mysqld[2780]: /usr/sbin/mysqld(_Z11mysqld_mainiPPc+0x1c0d)[0x5585c001306d]
|
led 28 09:23:25 db1 mysqld[2780]: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1)[0x7f7c492052e1]
|
led 28 09:23:25 db1 mysqld[2780]: /usr/sbin/mysqld(_start+0x2a)[0x5585c000539a]
|
led 28 09:23:26 db1 systemd[1]: mariadb.service: Main process exited, code=killed, status=6/ABRT
|
MariaDB 10.4.11 with galera cluster looks very unstable so could not be used for production use. IST replication (mariabackup) doesn't work well too, it fails after mysql restart and full transfer is required. This works well with previous 10.3
|
|
The followup by Fišer accurately describes the hanging mariadbd 10.4.20 that I ran into just now:
- mariadb (1:10.4.20+maria~bionic)
- galera-4 (26.4.8-bionic)
- statement: ALTER TABLE `stc_16_cblist` ADD `field_column` text
- the Aborting node went non-primary and hung; the rest of the (3 node) cluster was unaffected (the other two considered it desynced)
2021-11-17 16:10:06 22 [Note] WSREP: MDL BF-BF conflict
|
schema: redacted_2021
|
request: (22 seqno 31261949 wsrep (toi, exec, committed) cmd 0 3 ALTER TABLE `stc_16_cblist` ADD `field_column` text)
|
granted: (17 seqno 31261952 wsrep (high priority, exec, committing) cmd 0 161 (null))
|
2021-11-17 16:10:06 22 [ERROR] Aborting
|
The SHOW PROCESSLIST shows all threads in Daemon or Sleep state:
| Id |
User |
Host |
db |
Cmd |
Time |
State |
Info |
Progr. |
| 1 |
system user |
|
NULL |
Sleep |
490318 |
wsrep aborter idle |
NULL |
0.000 |
| 2 |
system user |
|
NULL |
Sleep |
1230 |
wsrep applier committed |
NULL |
0.000 |
| 4 |
system user |
|
NULL |
Daemon |
NULL |
InnoDB purge coordinator |
NULL |
0.000 |
| 5 |
system user |
|
NULL |
Daemon |
NULL |
InnoDB purge worker |
NULL |
0.000 |
| 6 |
system user |
|
NULL |
Daemon |
NULL |
InnoDB purge worker |
NULL |
0.000 |
| 7 |
system user |
|
NULL |
Daemon |
NULL |
InnoDB purge worker |
NULL |
0.000 |
| 8 |
system user |
|
NULL |
Daemon |
NULL |
InnoDB shutdown handler |
NULL |
0.000 |
| 11 |
system user |
|
NULL |
Sleep |
1229 |
After apply log event |
NULL |
0.000 |
| 12 |
system user |
|
NULL |
Sleep |
1229 |
After apply log event |
NULL |
0.000 |
| 13 |
system user |
|
NULL |
Sleep |
1229 |
wsrep applier committed |
NULL |
0.000 |
| 14 |
system user |
|
NULL |
Sleep |
1229 |
wsrep applier committed |
NULL |
0.000 |
| 15 |
system user |
|
NULL |
Sleep |
1230 |
wsrep applier committed |
NULL |
0.000 |
| 16 |
system user |
|
NULL |
Sleep |
1229 |
committing |
NULL |
0.000 |
| 17 |
system user |
|
NULL |
Sleep |
1229 |
committing |
NULL |
0.000 |
| 18 |
system user |
|
NULL |
Sleep |
1229 |
wsrep applier committed |
NULL |
0.000 |
| 19 |
system user |
|
NULL |
Sleep |
1229 |
wsrep applier committed |
NULL |
0.000 |
| 20 |
system user |
|
NULL |
Sleep |
1230 |
wsrep applier committed |
NULL |
0.000 |
| 21 |
system user |
|
NULL |
Sleep |
1229 |
After apply log event |
NULL |
0.000 |
| 22 |
system user |
|
xxxxxx_2021 |
Sleep |
1229 |
After create |
ALTER TABLE `stc_16_cblist` ADD `field_column` text |
0.000 |
| 23 |
system user |
|
NULL |
Sleep |
1229 |
After apply log event |
NULL |
0.000 |
| 24 |
system user |
|
NULL |
Sleep |
1229 |
committing |
NULL |
0.000 |
| 25 |
system user |
|
NULL |
Sleep |
1229 |
wsrep applier committed |
NULL |
0.000 |
| 26 |
system user |
|
NULL |
Sleep |
1230 |
wsrep applier committed |
NULL |
0.000 |
| 27 |
system user |
|
NULL |
Sleep |
1229 |
wsrep applier committed |
NULL |
0.000 |
| 28 |
system user |
|
NULL |
Sleep |
1229 |
wsrep applier committed |
NULL |
0.000 |
| 29 |
system user |
|
NULL |
Sleep |
1230 |
wsrep applier committed |
NULL |
0.000 |
| 30 |
system user |
|
NULL |
Sleep |
1230 |
wsrep applier committed |
NULL |
0.000 |
| 31 |
system user |
|
NULL |
Sleep |
1229 |
After apply log event |
NULL |
0.000 |
| 32 |
system user |
|
NULL |
Sleep |
1229 |
wsrep applier committed |
NULL |
0.000 |
| 34 |
system user |
|
NULL |
Sleep |
1229 |
wsrep applier committed |
NULL |
0.000 |
| 1324367 |
root |
localhost |
NULL |
Query |
0 |
Init |
show full processlist |
0.000 |
I salvaged a core dump. But this setup has a interesting peculiarity with these dumps, which I haven't been able to explain . There seem to be certain threads missing in the core dump, which results in this behaviour:
# gdb `which mysqld` ./core -batch -ex 'info threads' | tail -n1
|
29 ../sysdeps/unix/sysv/linux/poll.c: No such file or directory.
|
warning: Couldn't find general-purpose registers in core file.
|
PC register is not available
|
61 Thread 0x7f3a8cd756c0 (LWP 1)
|
That "PC register is not available" error fails all "thread apply all" lookups, but we can get at some/most of the threads using a workaround:
for x in `seq 1 61`; do echo -ex; echo "thread apply $x bt full"; done |
|
xargs -d\\n gdb `which mysqld` ./core -batch
|
Threads are all waiting for something non-obvious:
# tail -n61 info-threads.txt | sed -e 's/^[*]/ /' | awk '{print $6}' | sort | uniq -c
|
1
|
19 __GI___nanosleep
|
1 __GI___poll
|
1 __GI___sigtimedwait
|
8 futex_abstimed_wait_cancelable
|
31 futex_wait_cancelable
|
The Aborting thread appears to be number 47:
Thread 47 (LWP 1202077):
|
#0 0x00007f3a8b485ad3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x559b8bcd38f0) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
|
#1 __pthread_cond_wait_common (abstime=0x0, mutex=0x559b8bcd38a0, cond=0x559b8bcd38c8) at pthread_cond_wait.c:502
|
#2 __pthread_cond_wait (cond=cond@entry=0x559b8bcd38c8, mutex=0x559b8bcd38a0) at pthread_cond_wait.c:655
|
#3 0x00007f3a5da09845 in gu::Lock::wait (this=0x7f38e01acb00, cond=...) at ./galerautils/src/gu_lock.hpp:46
|
#4 galera::ReplicatorSMM::wait_for_CLOSED (this=this@entry=0x559b8bcd3760, lock=...) at ./galera/src/replicator_smm.cpp:273
|
#5 0x00007f3a5da098c5 in galera::ReplicatorSMM::close (this=0x559b8bcd3760) at ./galera/src/replicator_smm.cpp:364
|
#6 0x00007f3a5d9e681b in galera_disconnect (gh=<optimized out>) at ./galera/src/wsrep_provider.cpp:208
|
#7 0x0000559b89905e51 in wsrep::wsrep_provider_v26::disconnect (this=<optimized out>) at ./wsrep-lib/src/wsrep_provider_v26.cpp:759
|
#8 0x0000559b898f329b in wsrep::server_state::disconnect (this=0x559b8bc9e790) at ./wsrep-lib/src/server_state.cpp:540
|
#9 0x0000559b89011888 in unireg_abort (exit_code=exit_code@entry=1) at ./sql/mysqld.cc:1862
|
#10 0x0000559b8924d4d9 in wsrep_handle_mdl_conflict (requestor_ctx=requestor_ctx@entry=0x7f361c000d28, ticket=ticket@entry=0x7f3638020b80, key=key@entry=0x7f38f86299f8) at ./sql/wsrep_mysqld.cc:2324
|
#11 0x0000559b891c71f6 in MDL_lock::can_grant_lock (this=this@entry=0x7f38f86299f8, type_arg=<optimized out>, requestor_ctx=requestor_ctx@entry=0x7f361c000d28, ignore_lock_priority=ignore_lock_priority@entry=false) at ./sql/mdl.cc:1769
|
#12 0x0000559b891c7915 in MDL_context::try_acquire_lock_impl (this=this@entry=0x7f361c000d28, mdl_request=mdl_request@entry=0x7f38e01ad080, out_ticket=out_ticket@entry=0x7f38e01acf98) at ./sql/mdl.cc:2100
|
#13 0x0000559b891c7a4e in MDL_context::acquire_lock (this=this@entry=0x7f361c000d28, mdl_request=mdl_request@entry=0x7f38e01ad080, lock_wait_timeout=lock_wait_timeout@entry=86400) at ./sql/mdl.cc:2247
|
#14 0x0000559b891c85c2 in MDL_context::upgrade_shared_lock (this=this@entry=0x7f361c000d28, mdl_ticket=0x7f361c07de10, new_type=new_type@entry=MDL_EXCLUSIVE, lock_wait_timeout=86400) at ./sql/mdl.cc:2523
|
#15 0x0000559b8916c826 in mysql_inplace_alter_table (target_mdl_request=0x7f38e01ae300, alter_ctx=0x7f38e01af420, ha_alter_info=0x7f38e01ad3d0, altered_table=0x7f38e01ad470, table=0x7f362c0f19e8, table_list=0x7f361c00be40, thd=0x7f361c000c08) at ./sql/sql_table.cc:7730
|
#16 mysql_alter_table (thd=thd@entry=0x7f361c000c08, new_db=new_db@entry=0x7f361c005230, new_name=new_name@entry=0x7f361c005658, create_info=create_info@entry=0x7f38e01b0010, table_list=<optimized out>, table_list@entry=0x7f361c00be40, alter_info=alter_info@entry=0x7f38e01aff50, order_num=0, order=0x0, ignore=false) at ./sql/sql_table.cc:10248
|
#17 0x0000559b891c28ce in Sql_cmd_alter_table::execute (this=<optimized out>, thd=0x7f361c000c08) at ./sql/sql_alter.cc:520
|
#18 0x0000559b890d58c7 in mysql_execute_command (thd=thd@entry=0x7f361c000c08) at ./sql/sql_parse.cc:6189
|
#19 0x0000559b890db70a in mysql_parse (thd=0x7f361c000c08, rawbuf=<optimized out>, length=51, parser_state=parser_state@entry=0x7f38e01b2740, is_com_multi=is_com_multi@entry=false, is_next_command=is_next_command@entry=false) at ./sql/sql_parse.cc:7992
|
#20 0x0000559b893e6f66 in Query_log_event::do_apply_event (this=0x7f361c0660d8, rgi=0x7f361c01c4b0, query_arg=0x7f361c0dfaca "ALTER TABLE `stc_16_cblist` ADD `field_column` text", q_len_arg=<optimized out>) at ./sql/log_event.cc:5734
|
#21 0x0000559b8925e3b8 in Log_event::apply_event (rgi=0x7f361c01c4b0, this=0x7f361c0660d8) at ./sql/log_event.h:1492
|
#22 wsrep_apply_events (thd=thd@entry=0x7f361c000c08, rli=<optimized out>, events_buf=<optimized out>, buf_len=0) at ./sql/wsrep_applier.cc:200
|
#23 0x0000559b89244ab9 in Wsrep_high_priority_service::apply_toi (this=0x7f38e01b3cf0, ws_meta=..., data=...) at ./sql/wsrep_high_priority_service.cc:382
|
#24 0x0000559b898f1eb3 in apply_toi (data=..., ws_meta=..., ws_handle=..., high_priority_service=..., provider=...) at ./wsrep-lib/src/server_state.cpp:461
|
#25 wsrep::server_state::on_apply (this=<optimized out>, high_priority_service=..., ws_handle=..., ws_meta=..., data=...) at ./wsrep-lib/src/server_state.cpp:1127
|
#26 0x0000559b899066d4 in wsrep::high_priority_service::apply (data=..., ws_meta=..., ws_handle=..., this=0x7f38e01b3cf0) at ./wsrep-lib/include/wsrep/high_priority_service.hpp:47
|
#27 (anonymous namespace)::apply_cb (ctx=0x7f38e01b3cf0, wsh=<optimized out>, flags=<optimized out>, buf=<optimized out>, meta=<optimized out>, exit_loop=0x7f38e01b300f) at ./wsrep-lib/src/wsrep_provider_v26.cpp:504
|
#28 0x00007f3a5d9fc265 in galera::TrxHandleSlave::apply (this=this@entry=0x7f36a0097820, recv_ctx=recv_ctx@entry=0x7f38e01b3cf0, apply_cb=0x559b899065a0 <(anonymous namespace)::apply_cb(void*, wsrep_ws_handle_t const*, uint32_t, wsrep_buf_t const*, wsrep_trx_meta_t const*, wsrep_bool_t*)>, meta=..., exit_loop=exit_loop@entry=@0x7f38e01b300f: false) at ./galera/src/trx_handle.cpp:391
|
#29 0x00007f3a5da0dc10 in galera::ReplicatorSMM::apply_trx (this=this@entry=0x559b8bcd3760, recv_ctx=recv_ctx@entry=0x7f38e01b3cf0, ts=...) at ./galera/src/replicator_smm.cpp:504
|
#30 0x00007f3a5da140f5 in galera::ReplicatorSMM::process_trx (this=0x559b8bcd3760, recv_ctx=0x7f38e01b3cf0, ts_ptr=...) at ./galera/src/replicator_smm.cpp:2127
|
#31 0x00007f3a5da466b5 in galera::GcsActionSource::process_writeset (this=0x559b8bce0c60, recv_ctx=0x7f38e01b3cf0, act=..., exit_loop=@0x7f38e01b38ff: false) at ./galera/src/gcs_action_source.cpp:62
|
#32 0x00007f3a5da472c2 in galera::GcsActionSource::dispatch (this=this@entry=0x559b8bce0c60, recv_ctx=recv_ctx@entry=0x7f38e01b3cf0, act=..., exit_loop=@0x7f38e01b38ff: false) at ./galera/src/gcs_action_source.cpp:110
|
#33 0x00007f3a5da47582 in galera::GcsActionSource::process (this=0x559b8bce0c60, recv_ctx=0x7f38e01b3cf0, exit_loop=@0x7f38e01b38ff: false) at ./galera/src/gcs_action_source.cpp:183
|
#34 0x00007f3a5da14b10 in galera::ReplicatorSMM::async_recv (this=0x559b8bcd3760, recv_ctx=0x7f38e01b3cf0) at ./galera/src/replicator_smm.cpp:390
|
#35 0x00007f3a5d9e69bb in galera_recv (gh=<optimized out>, recv_ctx=<optimized out>) at ./galera/src/wsrep_provider.cpp:236
|
#36 0x0000559b8990731e in wsrep::wsrep_provider_v26::run_applier (this=<optimized out>, applier_ctx=<optimized out>) at ./wsrep-lib/src/wsrep_provider_v26.cpp:796
|
#37 0x0000559b8926051f in wsrep_replication_process (thd=0x7f361c000c08, arg=<optimized out>) at ./sql/wsrep_thd.cc:57
|
#38 0x0000559b8924f9d3 in start_wsrep_THD (arg=arg@entry=0x559b972686f0) at ./sql/wsrep_mysqld.cc:2892
|
#39 0x0000559b89824e8a in pfs_spawn_thread (arg=0x559b97268718) at ./storage/perfschema/pfs.cc:1869
|
#40 0x00007f3a8b47f6db in start_thread (arg=0x7f38e01b4700) at pthread_create.c:463
|
#41 0x00007f3a8a87d71f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
|
I'd love to add more useful information, but I don't know where to go from here. I'll keep the core dump around for a while, in case jplindst or seppo or anyone else wants to know the value of something.
If someone can explain the core dump madness where I have fewer register states than apparent threads, I'd love to hear that too. These broken corefiles are really getting on my nerves:
# objdump -x core |
|
grep -E '^ *[0-9]+ [.](reg|reg2|reg-xstate|note.linuxcore.siginfo)/' |
|
awk '{print $2}' | sed -e 's/\/[0-9].*//' | sort | uniq -c
|
60 .note.linuxcore.siginfo
|
60 .reg
|
60 .reg-xstate
|
60 .reg2
|
|