[MDEV-29512] WSREP: cluster conflict causes cluster deadlock Created: 2022-09-12  Updated: 2023-01-16  Resolved: 2022-11-29

Status: Closed
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.5.15
Fix Version/s: 10.4.28, 10.5.19, 10.6.12, 10.7.8, 10.8.7, 10.9.5, 10.10.3

Type: Bug Priority: Critical
Reporter: Jean-Louis Dupond Assignee: Seppo Jaakola
Resolution: Fixed Votes: 1
Labels: galera

Issue Links:
Relates
relates to MDEV-25368 Galera cluster hangs on Freeing items Closed

 Description   

On a Galera cluster with 3 nodes we have an event running the query below:

2022-09-12  6:02:11 1 [Note] WSREP: cluster conflict due to high priority abort for threads:
2022-09-12  6:02:11 1 [Note] WSREP: Winning thread: 
   THD: 1, mode: high priority, state: exec, conflict: executing, seqno: 76697477
   SQL: delete entry from entry inner join entry_file on entry_file.id=entry.file where entry_file.date_upload < now() -interval 366 day and  entry_file.date_upload > now() - interval 370 day and entry.id not in (select jm.entryid from entrymail_records jm) and entry.country_code <> 'BE'^V<AF>^^c^S^A
2022-09-12  6:02:11 1 [Note] WSREP: Victim thread: 
   THD: 2401338, mode: local, state: exec, conflict: certifying, seqno: -1
   SQL: delete entry from entry inner join entry_file on entry_file.id=entry.file where entry_file.date_upload < now() -interval 366 day and  entry_file.date_upload > now() - interval 370 day and entry.id not in (select jm.entryid from entrymail_records jm) and entry.country_code <> 'BE'
2022-09-12  6:02:11 1 [Note] WSREP: context: /home/buildbot/buildbot/build/mariadb-10.5.15/storage/innobase/handler/ha_innodb.cc:18802

Now this caused a conflict, and WSREP aborted a thread.

But this abort caused a deadlock on the node, blocking all writes on the whole MariaDB cluster.
All writes on other nodes were hanging in the 'Commit' state.

MariaDB also could not be stopped cleanly; only a kill -9 could stop the instance.
I tried to stop the instance and it failed, after which I created a gdb coredump.

Some relevant traces:

Thread 5 (Thread 0x7f47f82db700 (LWP 78621)):
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103
#1  0x00007f480139a87a in __GI___pthread_mutex_lock (mutex=mutex@entry=0x7f44a0665340) at ../nptl/pthread_mutex_lock.c:135
#2  0x000055c808b3ecc3 in inline_mysql_mutex_lock (src_line=591, src_file=0x55c809370458 "/home/buildbot/buildbot/build/mariadb-10.5.15/sql/sql_repl.cc", that=0x7f44a0665340) at ./include/mysql/psi/mysql_thread.h:752
#3  log_in_use_callback (log_name=0x7f47f82d8890 "/var/lib/mysql/cust-db002-bin.000145", thd=0x7f44a06636a8) at ./sql/sql_repl.cc:591
#4  THD_list_iterator::iterate<char const> (arg=<optimized out>, action=<optimized out>, this=<optimized out>) at ./sql/sql_class.h:1077
#5  log_in_use (log_name=log_name@entry=0x7f47f82d8890 "/var/lib/mysql/cust-db002-bin.000145") at ./sql/sql_repl.cc:601
#6  0x000055c808dfbfa7 in MYSQL_BIN_LOG::can_purge_log (this=0x55c809b76dc0 <mysql_bin_log>, log_file_name_arg=0x7f47f82d8890 "/var/lib/mysql/cust-db002-bin.000145") at ./sql/log.cc:5149
#7  MYSQL_BIN_LOG::can_purge_log (this=0x55c809b76dc0 <mysql_bin_log>, log_file_name_arg=0x7f47f82d8890 "/var/lib/mysql/cust-db002-bin.000145") at ./sql/log.cc:5132
#8  0x000055c808e04075 in MYSQL_BIN_LOG::purge_logs_before_date (this=0x55c809b76dc0 <mysql_bin_log>, purge_time=1662091333) at ./sql/log.cc:5075
#9  0x000055c808e09774 in MYSQL_BIN_LOG::trx_group_commit_leader (this=this@entry=0x55c809b76dc0 <mysql_bin_log>, leader=leader@entry=0x7f47f82d8c00) at ./sql/log.cc:8486
#10 0x000055c808e09a14 in MYSQL_BIN_LOG::write_transaction_to_binlog_events (this=0x55c809b76dc0 <mysql_bin_log>, entry=entry@entry=0x7f47f82d8c00) at ./sql/log.cc:8046
#11 0x000055c808e09f07 in MYSQL_BIN_LOG::write_transaction_to_binlog (this=0x55c809b76dc0 <mysql_bin_log>, thd=<optimized out>, cache_mngr=0x7f479c030658, end_ev=<optimized out>, all=<optimized out>, using_stmt_cache=<optimized out>, using_trx_cache=true) at ./sql/log.cc:7641
#12 0x000055c808e09ff7 in binlog_flush_cache (thd=thd@entry=0x7f479c000c18, cache_mngr=cache_mngr@entry=0x7f479c030658, end_ev=end_ev@entry=0x7f47f82d8d60, all=all@entry=true, using_stmt=using_stmt@entry=true, using_trx=using_trx@entry=true) at ./sql/log.cc:1780
#13 0x000055c808e0b953 in binlog_commit_flush_xid_caches (xid=76697477, all=true, cache_mngr=0x7f479c030658, thd=0x7f479c000c18) at ./sql/log.cc:1928
#14 MYSQL_BIN_LOG::log_and_order (this=<optimized out>, thd=0x7f479c000c18, xid=76697477, all=<optimized out>, need_prepare_ordered=<optimized out>, need_commit_ordered=<optimized out>) at ./sql/log.cc:10019
#15 0x000055c808d23637 in ha_commit_trans (thd=thd@entry=0x7f479c000c18, all=all@entry=true) at ./sql/handler.cc:1789
#16 0x000055c808c26611 in trans_commit (thd=thd@entry=0x7f479c000c18) at ./sql/transaction.cc:266
#17 0x000055c808fb69f8 in Wsrep_high_priority_service::commit (this=0x7f47f82dacc0, ws_handle=..., ws_meta=...) at ./sql/wsrep_high_priority_service.cc:329
#18 0x000055c80932fe66 in apply_write_set (server_state=..., high_priority_service=..., ws_handle=..., ws_meta=..., data=...) at ./wsrep-lib/src/server_state.cpp:332
#19 0x000055c809330aed in wsrep::server_state::on_apply (this=<optimized out>, high_priority_service=..., ws_handle=..., ws_meta=..., data=...) at ./wsrep-lib/src/server_state.cpp:1114
#20 0x000055c809344314 in wsrep::high_priority_service::apply (data=..., ws_meta=..., ws_handle=..., this=0x7f47f82dacc0) at ./wsrep-lib/include/wsrep/high_priority_service.hpp:47
#21 (anonymous namespace)::apply_cb (ctx=0x7f47f82dacc0, wsh=<optimized out>, flags=<optimized out>, buf=<optimized out>, meta=<optimized out>, exit_loop=0x7f47f82d9f4f) at ./wsrep-lib/src/wsrep_provider_v26.cpp:507
#22 0x00007f47f946698b in galera::TrxHandleSlave::apply (this=this@entry=0x7f431cbe7c70, recv_ctx=recv_ctx@entry=0x7f47f82dacc0, apply_cb=0x55c8093441e0 <(anonymous namespace)::apply_cb(void*, wsrep_ws_handle_t const*, uint32_t, wsrep_buf_t const*, wsrep_trx_meta_t const*, wsrep_bool_t*)>, meta=..., exit_loop=exit_loop@entry=@0x7f47f82d9f4f: false) at ./galera/src/trx_handle.cpp:396
#23 0x00007f47f9478290 in galera::ReplicatorSMM::apply_trx (this=0x55c80c1e2790, recv_ctx=0x7f47f82dacc0, ts=...) at ./galera/src/replicator_smm.cpp:516
#24 0x00007f47f947a351 in galera::ReplicatorSMM::process_trx (this=0x55c80c1e2790, recv_ctx=0x7f47f82dacc0, ts_ptr=...) at ./galera/src/replicator_smm.cpp:2136
#25 0x00007f47f94a9d0b in galera::GcsActionSource::process_writeset (this=0x55c80c2dbf70, recv_ctx=0x7f47f82dacc0, act=..., exit_loop=@0x7f47f82da8cf: false) at ./galera/src/gcs_action_source.cpp:62
#26 0x00007f47f94aa69a in galera::GcsActionSource::dispatch (this=0x55c80c2dbf70, recv_ctx=0x7f47f82dacc0, act=..., exit_loop=<optimized out>) at ./galera/src/gcs_action_source.cpp:110
#27 0x00007f47f94ab1c1 in galera::GcsActionSource::process (this=0x55c80c2dbf70, recv_ctx=0x7f47f82dacc0, exit_loop=@0x7f47f82da8cf: false) at ./galera/src/gcs_action_source.cpp:186
#28 0x00007f47f947b420 in galera::ReplicatorSMM::async_recv (this=0x55c80c1e2790, recv_ctx=0x7f47f82dacc0) at ./galera/src/replicator_smm.cpp:402
#29 0x00007f47f945551b in galera_recv (gh=<optimized out>, recv_ctx=<optimized out>) at ./galera/src/wsrep_provider.cpp:263
#30 0x000055c8093450ee in wsrep::wsrep_provider_v26::run_applier (this=<optimized out>, applier_ctx=<optimized out>) at ./wsrep-lib/src/wsrep_provider_v26.cpp:858
#31 0x000055c808fd11c3 in wsrep_replication_process (thd=0x7f479c000c18, arg=<optimized out>) at ./wsrep-lib/include/wsrep/server_state.hpp:320
#32 0x000055c808fc1f7f in start_wsrep_THD (arg=arg@entry=0x55c80c42a580) at ./sql/wsrep_mysqld.h:560
#33 0x000055c808f4fa04 in pfs_spawn_thread (arg=0x55c80c415458) at ./storage/perfschema/pfs.cc:2201
#34 0x00007f4801397fa3 in start_thread (arg=<optimized out>) at pthread_create.c:486
#35 0x00007f4800fbaeff in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

The lock seems to be owned by thread 99:

(gdb) print *(pthread_mutex_t*)0x7f44a0665340
$3 = {__data = {__lock = 2, __count = 0, __owner = 105061, __nusers = 1, __kind = 3, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\002\000\000\000\000\000\000\000e\232\001\000\001\000\000\000\003", '\000' <repeats 22 times>, __align = 2}
(gdb) thread find 105061
Thread 99 has target id 'Thread 0x7f44c7687700 (LWP 105061)'

Then there are a lot of threads in the following state:

Thread 14 (Thread 0x7f4790f63700 (LWP 78879)):
#0  futex_abstimed_wait (private=0, abstime=0x0, expected=3, futex_word=0x55c809b4216c <server_threads+44>) at ../sysdeps/unix/sysv/linux/futex-internal.h:172
#1  __pthread_rwlock_wrlock_full (abstime=0x0, rwlock=0x55c809b42160 <server_threads+32>, rwlock@entry=0x20) at pthread_rwlock_common.c:706
#2  __GI___pthread_rwlock_wrlock (rwlock=rwlock@entry=0x55c809b42160 <server_threads+32>) at pthread_rwlock_wrlock.c:27
#3  0x000055c808a5a139 in inline_mysql_rwlock_wrlock (src_line=7501, src_file=0x55c80935a140 "/home/buildbot/buildbot/build/mariadb-10.5.15/sql/sql_class.h", that=0x55c809b42160 <server_threads+32>) at ./include/mysql/psi/mysql_thread.h:1008
#4  THD_list::erase (thd=0x7f478c0113a8, this=0x55c809b42140 <server_threads>) at ./sql/sql_class.h:7501
#5  unlink_thd (thd=0x7f478c0113a8) at ./sql/mysqld.cc:2578
#6  0x000055c808c1757f in do_handle_one_connection (connect=<optimized out>, connect@entry=0x55c817dd3828, put_in_cache=put_in_cache@entry=true) at ./sql/sql_connect.cc:1429
#7  0x000055c808c17c0d in handle_one_connection (arg=arg@entry=0x55c817dd3828) at ./sql/sql_connect.cc:1312
#8  0x000055c808f4fa04 in pfs_spawn_thread (arg=0x55c817dddc68) at ./storage/perfschema/pfs.cc:2201
#9  0x00007f4801397fa3 in start_thread (arg=<optimized out>) at pthread_create.c:486
#10 0x00007f4800fbaeff in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

And one Galera thread that seems to hang:

Thread 99 (Thread 0x7f44c7687700 (LWP 105061)):
#0  futex_abstimed_wait_cancelable (private=0, abstime=0x7f44c7685490, expected=0, futex_word=0x7f44a17fb248) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
#1  __pthread_cond_wait_common (abstime=0x7f44c7685490, mutex=0x55c80c1e3a38, cond=0x7f44a17fb220) at pthread_cond_wait.c:539
#2  __pthread_cond_timedwait (cond=cond@entry=0x7f44a17fb220, mutex=mutex@entry=0x55c80c1e3a38, abstime=abstime@entry=0x7f44c7685490) at pthread_cond_wait.c:667
#3  0x00007f47f9472346 in gu_cond_timedwait_SYS (ts=0x7f44c7685490, mutex=0x55c80c1e3a38, cond=<optimized out>) at ./galerautils/src/gu_threads.h:264
#4  gu::Lock::wait (this=<synthetic pointer>, date=..., cond=...) at ./galerautils/src/gu_lock.hpp:64
#5  galera::Monitor<galera::ReplicatorSMM::ApplyOrder>::wait (wait_until=..., gtid=..., this=0x55c80c1e3a38) at ./galera/src/monitor.hpp:399
#6  galera::ReplicatorSMM::sync_wait (this=0x55c80c1e2790, upto=<optimized out>, tout=<optimized out>, gtid=0x0) at ./galera/src/replicator_smm.cpp:1603
#7  0x00007f47f945557e in galera_sync_wait (wsrep=<optimized out>, upto=<optimized out>, tout=<optimized out>, gtid=<optimized out>) at ./galera/src/wsrep_provider.cpp:502
#8  0x000055c80934560a in wsrep::wsrep_provider_v26::wait_for_gtid (this=<optimized out>, gtid=..., timeout=<optimized out>) at ./wsrep-lib/src/wsrep_provider_v26.cpp:1056
#9  0x000055c808fd85a9 in Wsrep_server_service::set_position (this=<optimized out>, c=..., gtid=...) at ./wsrep-lib/include/wsrep/server_state.hpp:320
#10 0x000055c809335d36 in wsrep::transaction::release_commit_order (this=0x7f44a0669958, lock=...) at ./wsrep-lib/include/wsrep/provider.hpp:143
#11 0x000055c809340b4a in wsrep::transaction::after_statement (this=this@entry=0x7f44a0669958) at ./wsrep-lib/src/transaction.cpp:890
#12 0x000055c809322c79 in wsrep::client_state::after_statement (this=this@entry=0x7f44a06698f0) at ./wsrep-lib/src/client_state.cpp:278
#13 0x000055c808a885d9 in wsrep_after_statement (thd=<optimized out>) at ./sql/sql_class.h:5102
#14 sp_head::execute (this=0x7f44a18757e0, thd=0x7f44a06636a8, merge_da_on_success=true) at ./sql/sp_head.cc:1476
#15 0x000055c808a896a7 in sp_head::execute_procedure (this=this@entry=0x7f44a18757e0, thd=thd@entry=0x7f44a06636a8, args=args@entry=0x7f44c7686380) at ./sql/sp_head.cc:2449
#16 0x000055c808bfcb82 in Event_job_data::execute (this=this@entry=0x7f44c7686b40, thd=thd@entry=0x7f44a06636a8, drop=<optimized out>) at ./sql/event_data_objects.cc:1485
#17 0x000055c808e88d3d in Event_worker_thread::run (this=0x7f44c7686ddf, thd=0x7f44a06636a8, event=0x7f44a0fe8380) at ./sql/event_scheduler.cc:323
#18 0x000055c808e88dca in event_worker_thread (arg=arg@entry=0x7f44a0fe8380) at ./sql/event_scheduler.cc:268
#19 0x000055c808f4fa04 in pfs_spawn_thread (arg=0x7f44a0e31198) at ./storage/perfschema/pfs.cc:2201
#20 0x00007f4801397fa3 in start_thread (arg=<optimized out>) at pthread_create.c:486
#21 0x00007f4800fbaeff in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

It seems very similar to MDEV-25368, but that should already have been fixed in 10.5.15?

Corefile & libs have been uploaded to the MariaDB FTP!
This is a Debian 10 system with MariaDB 1:10.5.15+maria~buster installed.



 Comments   
Comment by Seppo Jaakola [ 2022-11-02 ]

dupondje thanks for the stack trace analysis.
Are these conflicting delete statements due to the same event fired in separate cluster nodes? Galera replication is supposed to enable event execution only in one of the nodes, to avoid conflicts.
Does the event body contain more statements, or just this single delete?

Comment by Jean-Louis Dupond [ 2022-11-04 ]

@Seppo Jaakola
This is indeed due to events being fired from the 3 nodes at the same time. After this we disabled events on 2 nodes to make sure they only get executed on 1 node.
The event contains 4 delete statements, but on different tables.
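Disabling events on all but one node can be done either globally or per event, along these lines (a sketch; `cleanup_entries` is a hypothetical event name, not the reporter's actual event):

```sql
-- Option 1: on the nodes that should not fire events,
-- turn off the event scheduler entirely
SET GLOBAL event_scheduler = OFF;

-- Option 2: disable a single event on this node only,
-- putting it into the slave-side-disabled state
ALTER EVENT cleanup_entries DISABLE ON SLAVE;
```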

Comment by Seppo Jaakola [ 2022-11-04 ]

Galera replication should manage events so that they are enabled only on one node, while events are in the "slave side disabled" state on the other nodes. But this is only best effort; it is possible to manually override event states.
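The effective per-node state can be inspected via information_schema (a sketch; on nodes where the event is slave-side disabled, the STATUS column should read SLAVESIDE_DISABLED):

```sql
SELECT EVENT_SCHEMA, EVENT_NAME, STATUS
  FROM information_schema.EVENTS;
```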

If the event body contains any DML writes, and the event is enabled on several nodes, it will easily lead to write-write conflicts, which the cluster has to resolve by aborting all but one of the event executions.
However, the cluster should be able to resolve all such conflicts successfully, and this MDEV apparently shows that there is a bug in conflict resolution.

dupondje, you have uploaded the core file to the MariaDB FTP site? Is that the public folder on ftp.mariadb.org? I could not find it there...

I'm curious about the THD class contents of threads 5 and 99. If you still have the core available, here is what I'm looking for:
gdb> thread 5
gdb> frame 14
gdb> print *thd
gdb> thread 99
gdb> frame 14
gdb> print *thd
That should give plenty of material to analyze.

Comment by Jean-Louis Dupond [ 2022-11-04 ]

I did upload them to the /private/ folder:
MDEV-29512_coredump.gz and MDEV-29512_libs.gz

I still have them available if they need to be re-uploaded.

Comment by Seppo Jaakola [ 2022-11-07 ]

dupondje thanks for the gdb info; the reason for the hang appears to be a deadlock between the replication commit monitor and the LOCK_thd_data mutex protecting the victim thread's data members.
We have a potential fix for this issue under testing now.

Comment by Jean-Louis Dupond [ 2022-11-07 ]

Thanks for fixing this!
Does this affect only 10.5 or also newer versions?

Comment by Seppo Jaakola [ 2022-11-08 ]

I would expect this issue to affect all versions 10.4 and later. I now have a test for reproducing the hang, and it should be simple to verify the affected versions. The fix looks effective in my testing.

Comment by Seppo Jaakola [ 2022-11-09 ]

The fix is pushed in wsrep-lib.
The PR contains a new mtr test for reproducing the issue; all versions 10.4 and up are vulnerable.

Comment by Seppo Jaakola [ 2022-11-09 ]

The PR containing the fix (in wsrep-lib) and the mtr test was submitted today.
The issue is not limited to event processing, it can happen with regular transactions as well (e.g. the mtr test does not use events).
The conditions for the issue to happen are:

  • binlog purging must happen (i.e. binlog expiration is configured)
  • a cluster-wide conflict must happen, and the BF abort victim must be ordered later than the binlog purge processor
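A minimal configuration matching these conditions might look like the following (a sketch, assuming the standard binlog expiration variables; the concrete values are illustrative):

```sql
-- binlog purging enabled, so purge_logs_before_date()
-- runs as part of binlog rotation / group commit
SET GLOBAL expire_logs_days = 7;

-- equivalently on MariaDB 10.6+:
-- SET GLOBAL binlog_expire_logs_seconds = 604800;
```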
Comment by Jan Lindström (Inactive) [ 2022-11-29 ]
Generated at Thu Feb 08 10:09:10 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.