[MDEV-21910] Killing thread on Galera could cause mutex deadlock Created: 2020-03-11  Updated: 2022-12-16  Resolved: 2020-07-29

Status: Closed
Project: MariaDB Server
Component/s: Galera, Storage Engine - InnoDB
Affects Version/s: 10.1, 10.2, 10.3, 10.4, 10.5
Fix Version/s: 10.4.14, 10.5.5

Type: Bug Priority: Blocker
Reporter: Jan Lindström (Inactive) Assignee: Jan Lindström (Inactive)
Resolution: Fixed Votes: 1
Labels: None

Attachments: File wsrep_kill.diff    
Issue Links:
Problem/Incident
causes MDEV-23328 Server hang due to Galera lock confli... Closed
Relates
relates to MDEV-18464 Port kill_one_trx fixes from 10.4 to ... Closed
relates to MDEV-23101 SIGSEGV in lock_rec_unlock() when Gal... Closed
relates to MDEV-22666 galera.MW-328A MTR failed: "Semaphore... Closed
relates to MDEV-23851 Galera assertion at lock0lock.cc line... Closed

 Description   

There are the following issues here:

  • Whenever a Galera BF (brute force) transaction decides to abort a conflicting transaction, it kills that thread using THD::awake()
  • A user KILL [QUERY|CONNECTION] ... for a thread will also call THD::awake()
    Whenever one of these actions is executed, we hold a number of InnoDB internal mutexes and THD mutexes.
    Sometimes these mutexes are taken in different order, causing a mutex deadlock.


 Comments   
Comment by Sergey Vojtovich [ 2020-04-14 ]

It is unfortunate that we have to take this approach, but apparently it is the easiest thing we can do.

Speaking of d1578f7ad548aa5ecd80d313601a347641e5037f, it seems to be in line with this idea. I didn't dive much into the wsrep code; I'll leave it up to you and the Galera team.

One thing that doesn't look completely right is passing a THD to a background thread. Neither bf_thd nor victim_thd seems to be protected against destruction. Accessing the sync point action of bf_thd from wsrep_kill() looks even worse.

Solution: I don't think you need to pass bf_thd, and victim_thd should be found by thread id instead (like KILL does it).

Another thing is that you both enqueue and kill under LOCK_wsrep_kill, which introduces another "reverse" mutex locking pattern:
enqueue: lock_sys->mutex -> trx->mutex -> LOCK_wsrep_kill
kill: LOCK_wsrep_kill -> LOCK_thd_data -> lock_sys->mutex -> trx->mutex

And I guess it may eventually lead to another deadlock.

I also wonder whether wsrep really has to issue THD::awake() at all. Why is lock_cancel_waiting_and_release() not enough? If that's not enough, can the victim commit suicide itself thereafter?

Looks alright otherwise.

Comment by Jan Lindström (Inactive) [ 2020-04-28 ]

seppo Can you please review the patch for 10.2 on branch bb-10.2-galera-21910, same on git https://github.com/MariaDB/server/commit/2471c1fb88bdb58cae20121b2eb15c2a1ed606cd

As far as I know, the 10.2 and 10.3 versions of the above fix do not crash in tests or in randgen.

Comment by Jan Lindström (Inactive) [ 2020-04-29 ]

Analysis from 10.4

  • BF abort:

    wsrep_kill_victim (lock_sys->mutex, trx->mutex) => wsrep_kill_one_trx (thd->LOCK_thd_data)

  • KILL QUERY:

    find_thread_by_id(thd->LOCK_thd_data , thd->LOCK_thd_kill) => thd->awake_no_mutex() => innobase_kill_query() => lock_trx_handle_wait( lock_sys->mutex, trx->mutex if not wsrep victim)

From these, only lock_sys->mutex is a global mutex. All others are either thread internal (thd) or transaction internal (trx). Therefore, a mutex deadlock is possible if and only if the thread we have selected as victim for BF abort and the thread the user is trying to kill are exactly the same. While BF abort also calls thd->awake(), it will have trx->lock.was_chosen_as_wsrep_victim=true, and then we do not take lock_sys->mutex or trx->mutex in lock_trx_handle_wait(). A second problem is possible when, in wsrep_kill_one_trx(), we set trx->lock.was_chosen_as_wsrep_victim=true and release the thd->LOCK_thd_data mutex. Now, if we have a schedule where KILL is executed, there is a possibility that we either try to take lock_sys->mutex and have to wait, or, if this is the same thread as the victim, no InnoDB mutexes are taken. Furthermore, in bf_abort() we might take thd->LOCK_thd_data again in wsrep-lib.

Comment by Jan Lindström (Inactive) [ 2020-04-29 ]

For 10.1-10.3 it seems that I can fix the problem by killing the victim thread from a background thread. However, for 10.4 and upwards that does not work, based on regression testing done so far (this version is on branch bb-10.4-MDEV-21910).

For 10.4 I'm now suggesting a much simpler approach where we release InnoDB locks before entering wsrep_kill_one_trx(). This way we can safely request the thd->LOCK_thd_data mutex and later, in wsrep_thd_bf_abort, call thd->awake(). Replacing thd->awake() with e.g. thd->set_killed(KILL_QUERY) did not work for some tests already in 10.4.

Comment by Jan Lindström (Inactive) [ 2020-04-29 ]

marko Can you have a look at https://github.com/MariaDB/server/commit/06629988a22ac022d517dcc6d03c7f4f257de43a

Comment by Marko Mäkelä [ 2020-04-30 ]

jplindst, here are some quick comments or questions:

  • Why is wsrep_innobase_kill_one_trx() using the parameter types void* and ibool instead of THD* and bool? Shouldn’t 3 parameters be enough: (const trx_t *bf_trx, trx_t *victim_trx, bool signal)?
  • What does BF mean? It is not explained in the function comment.
  • Can wsrep_abort_transaction() ever be called with bf_thd=nullptr? Or with a transaction that is not holding any locks? It is missing debug assertions.
  • What prevents the victim_trx from being aborted or terminated while wsrep_innobase_kill_one_trx() is executing?
  • The is_referenced() test (since MariaDB 10.3) in trx_commit_in_memory() or trx_t::commit_in_memory() could protect us while we are not holding a mutex. But where are the calls to trx_t::reference() and trx_t::release_reference()? Please show some stack traces.
  • Can you avoid duplicating so many trx_mutex_exit() calls in wsrep_kill_victim()?

I think that your proposed change could be a step in the right direction, but it needs some more work. Studying MDEV-15326 could be useful, because it addresses a scenario that is somewhat similar to one of our problems here: how to ensure that we are killing the intended transaction, and not an unrelated transaction that is reusing the same trx_t object after our victim transaction terminated before we got the chance to kill it?

Comment by Jan Lindström (Inactive) [ 2020-04-30 ]

Thank you for your detailed review.

Comment by Jan Lindström (Inactive) [ 2020-05-06 ]

Current status:

  • 10.1 branch bb-10.1-MDEV-21910 commit e7071a039c8ca1a94b6f43c9b42c1e262991e5f9
  • 10.2 branch bb-10.2-MDEV-21910 commit c0f7d3783254240fa6eeb59248dafc8b55518de2
  • 10.3 branch bb-10.3-MDEV-21910 commit 1af58dfb1b959c2f7ae3747b078e1c31f6169701
  • 10.4 branch bb-10.4-MDEV-21910-v2 commit 21fa04ecf92570a5cbcd9d2cddda1e1ad790172c
  • 10.5 would be exactly the same as 10.4 once a correct fix is found.

Of these, the 10.1-10.3 versions look good in both regression testing and randgen. However, the 10.4 version does not look good in regression testing.

Answers to review questions:

  • Why is wsrep_innobase_kill_one_trx() using the parameter types void* and ibool instead of THD* and bool? Shouldn’t 3 parameters be enough: (const trx_t *bf_trx, trx_t *victim_trx, bool signal)?
    • Replaced the parameters with the minimal set we need: THD *bf_thd, victim_trx, my_bool signal
  • What does BF mean? It is not explained in the function comment.
    • Brute force; added clearer comments
  • Can wsrep_abort_transaction() ever be called with bf_thd=nullptr? Or with a transaction that is not holding any locks? It is missing debug assertions.
    • Yes, it can be, e.g. on an MDL conflict
  • What prevents the victim_trx from being aborted or terminated while wsrep_innobase_kill_one_trx() is executing?
    The is_referenced() test (since MariaDB 10.3) in trx_commit_in_memory() or trx_t::commit_in_memory() could protect us while we are not holding a mutex. But where are the calls to trx_t::reference() and trx_t::release_reference()?
    • Nothing. Added trx_t::reference() and trx_t::release_reference(), but it could be that I used them incorrectly and/or there is another problem: as we release lock_sys->mutex and trx->mutex, there is nothing restricting the victim thread from continuing execution.
  • Can you avoid duplicating so many trx_mutex_exit() calls in wsrep_kill_victim()?
    • Fixed

In my opinion the biggest open question is: if we release lock_sys->mutex and trx->mutex, how do we make sure that the victim does not continue executing operations (possibly up to commit/rollback)?

In local regression testing, 10.4 has at least the following problem (test case: galera.galera_bf_lock_wait):

mysqld: /home/jan/mysql/10.4-kill/storage/innobase/trx/trx0trx.cc:1515: void trx_t::commit_in_memory(const mtr_t*): Assertion `!(this)->is_referenced()' failed.
200506 11:52:55 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.
 
Server version: 10.4.13-MariaDB-debug-log
key_buffer_size=1048576
read_buffer_size=131072
max_used_connections=4
max_threads=153
thread_count=12
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 63636 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x7faf8c001fe0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7fafebfb3dd0 thread_stack 0x49000
/home/jan/mysql/10.4-kill/sql/mysqld(my_print_stacktrace+0x44)[0x5590aebc897d]
/home/jan/mysql/10.4-kill/sql/mysqld(handle_fatal_signal+0x3b9)[0x5590ae319695]
sigaction.c:0(__restore_rt)[0x7fb00f7e63c0]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7fb00eebc18b]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7fb00ee9b859]
/lib/x86_64-linux-gnu/libc.so.6(+0x25729)[0x7fb00ee9b729]
/lib/x86_64-linux-gnu/libc.so.6(+0x36f36)[0x7fb00eeacf36]
/home/jan/mysql/10.4-kill/sql/mysqld(+0x1055165)[0x5590ae7f6165]
/home/jan/mysql/10.4-kill/sql/mysqld(+0x105049b)[0x5590ae7f149b]
trx/trx0trx.cc:1515(trx_t::commit_in_memory(mtr_t const*))[0x5590ae7f152d]
trx/trx0trx.cc:1578(trx_t::commit_low(mtr_t*))[0x5590ae7f1a13]
trx/trx0trx.cc:1724(trx_commit_for_mysql(trx_t*))[0x5590ae5708ab]
handler/ha_innodb.cc:4375(innobase_commit_low(trx_t*))[0x5590ae570cc5]
handler/ha_innodb.cc:4506(innobase_commit_ordered_2(trx_t*, THD*))[0x5590ae571309]
handler/ha_innodb.cc:4630(innobase_commit(handlerton*, THD*, bool))[0x5590ae31fe1e]
sql/handler.cc:1776(commit_one_phase_2(THD*, bool, THD_TRANS*, bool))[0x5590ae31fd0d]
sql/handler.cc:1756(ha_commit_one_phase(THD*, bool))[0x5590ae31ef5e]
sql/handler.cc:1564(ha_commit_trans(THD*, bool))[0x5590ae158a0e]
sql/transaction.cc:236(trans_commit(THD*))[0x5590adfc2aa5]
sql/sql_parse.cc:5620(mysql_execute_command(THD*))[0x5590adeca428]
sql/sp_head.cc:3683(sp_instr_stmt::exec_core(THD*, unsigned int*))[0x5590adec97b1]
sql/sp_head.cc:3415(sp_lex_keeper::reset_lex_and_exec_core(THD*, unsigned int*, bool, sp_instr*))[0x5590adec9fd0]
sql/sp_head.cc:3589(sp_instr_stmt::execute(THD*, unsigned int*))[0x5590adec36c1]
sql/sp_head.cc:1361(sp_head::execute(THD*, bool))[0x5590adec6464]
sql/sp_head.cc:2366(sp_head::execute_procedure(THD*, List<Item>*))[0x5590adfb97e2]
sql/sql_parse.cc:3256(Sql_cmd_call::execute(THD*))[0x5590adfba495]
sql/sql_parse.cc:6101(mysql_execute_command(THD*))[0x5590adfc470f]
sql/sql_parse.cc:7900(mysql_parse(THD*, char*, unsigned int, Parser_state*, bool, bool))[0x5590adfc9c89]
sql/sql_parse.cc:7715(wsrep_mysql_parse(THD*, char*, unsigned int, Parser_state*, bool, bool))[0x5590adfc937b]
sql/sql_parse.cc:1827(dispatch_command(enum_server_command, THD*, char*, unsigned int, bool, bool))[0x5590adfb610f]
sql/sql_parse.cc:1359(do_command(THD*))[0x5590adfb4a7c]
sql/sql_connect.cc:1412(do_handle_one_connection(CONNECT*))[0x5590ae13feab]
sql/sql_connect.cc:1317(handle_one_connection)[0x5590ae13fbf4]
perfschema/pfs.cc:1871(pfs_spawn_thread)[0x5590aeb54ab8]
nptl/pthread_create.c:478(start_thread)[0x7fb00f7da609]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7fb00ef98103]
 
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x7faf8c02eea8): commit
Connection ID (thread ID): 19
Status: KILL_QUERY
 
Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on
 
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
Writing a core file...
Working directory at /home/jan/mysql/10.4-kill/mysql-test/var/3/mysqld.1/d...
Resource Limits:
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        unlimited            unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             127512               127512               processes 
Max open files            1024                 1024                 files     
Max locked memory         67108864             67108864             bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       127512               127512               signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us        
Core pattern: |/usr/share/apport/apport %p %s %c %d %P ...
----------SERVER LOG END-------------
 
 
 - found 'core' (0/5)
 
Trying 'dbx' to get a backtrace
 
Trying 'gdb' to get a backtrace from coredump /home/jan/mysql/10.4-kill/mysql-test/var/3/log/galera.galera_bf_lock_wait-innodb/mysqld.1/data/core
Core generated by '/home/jan/mysql/10.4-kill/sql/mysqld'
Output from gdb follows. The first stack trace is from the failing thread.
The following stack traces are from all threads (so the failing one is
duplicated).
--------------------------
[New LWP 2013105]
[New LWP 2012112]
[New LWP 2012080]
[New LWP 2012190]
[New LWP 2012069]
[New LWP 2012083]
[New LWP 2012113]
[New LWP 2012168]
[New LWP 2012085]
[New LWP 2012167]
[New LWP 2012086]
[New LWP 2012181]
[New LWP 2012087]
[New LWP 2012184]
[New LWP 2012088]
[New LWP 2012091]
[New LWP 2012107]
[New LWP 2012108]
[New LWP 2012109]
[New LWP 2012110]
[New LWP 2012151]
[New LWP 2012152]
[New LWP 2012153]
[New LWP 2012154]
[New LWP 2012155]
[New LWP 2012169]
[New LWP 2012111]
[New LWP 2012170]
[New LWP 2012171]
[New LWP 2012182]
[New LWP 2012183]
[New LWP 2012197]
[New LWP 2012199]
[New LWP 2013072]
[New LWP 2013106]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/home/jan/mysql/10.4-kill/sql/mysqld --defaults-group-suffix=.1 --defaults-file'.
Program terminated with signal SIGABRT, Aborted.
#0  __pthread_kill (threadid=<optimized out>, signo=6) at ../sysdeps/unix/sysv/linux/pthread_kill.c:56
56	../sysdeps/unix/sysv/linux/pthread_kill.c: No such file or directory.
[Current thread is 1 (Thread 0x7fafebfb4700 (LWP 2013105))]
#0  __pthread_kill (threadid=<optimized out>, signo=6) at ../sysdeps/unix/sysv/linux/pthread_kill.c:56
#1  0x00005590aebc8a73 in my_write_core (sig=6) at /home/jan/mysql/10.4-kill/mysys/stacktrace.c:481
#2  0x00005590ae319a1f in handle_fatal_signal (sig=6) at /home/jan/mysql/10.4-kill/sql/signal_handler.cc:343
#3  <signal handler called>
#4  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#5  0x00007fb00ee9b859 in __GI_abort () at abort.c:79
#6  0x00007fb00ee9b729 in __assert_fail_base (fmt=0x7fb00f031588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x5590aef69896 "!(this)->is_referenced()", file=0x5590aef683d0 "/home/jan/mysql/10.4-kill/storage/innobase/trx/trx0trx.cc", line=1515, function=<optimized out>) at assert.c:92
#7  0x00007fb00eeacf36 in __GI___assert_fail (assertion=0x5590aef69896 "!(this)->is_referenced()", file=0x5590aef683d0 "/home/jan/mysql/10.4-kill/storage/innobase/trx/trx0trx.cc", line=1515, function=0x5590aef696a0 "void trx_t::commit_in_memory(const mtr_t*)") at assert.c:101
#8  0x00005590ae7f6165 in trx_t::commit_in_memory (this=0x7fb000a71378, mtr=0x0) at /home/jan/mysql/10.4-kill/storage/innobase/trx/trx0trx.cc:1515
#9  0x00005590ae7f149b in trx_t::commit_low (this=0x7fb000a71378, mtr=0x0) at /home/jan/mysql/10.4-kill/storage/innobase/trx/trx0trx.cc:1577
#10 0x00005590ae7f152d in trx_t::commit (this=0x7fb000a71378) at /home/jan/mysql/10.4-kill/storage/innobase/trx/trx0trx.cc:1591
#11 0x00005590ae7f1a13 in trx_commit_for_mysql (trx=0x7fb000a71378) at /home/jan/mysql/10.4-kill/storage/innobase/trx/trx0trx.cc:1723
#12 0x00005590ae5708ab in innobase_commit_low (trx=0x7fb000a71378) at /home/jan/mysql/10.4-kill/storage/innobase/handler/ha_innodb.cc:4375
#13 0x00005590ae570cc5 in innobase_commit_ordered_2 (trx=0x7fb000a71378, thd=0x7faf8c001fe0) at /home/jan/mysql/10.4-kill/storage/innobase/handler/ha_innodb.cc:4504
#14 0x00005590ae571309 in innobase_commit (hton=0x5590b0794080, thd=0x7faf8c001fe0, commit_trx=true) at /home/jan/mysql/10.4-kill/storage/innobase/handler/ha_innodb.cc:4620
#15 0x00005590ae31fe1e in commit_one_phase_2 (thd=0x7faf8c001fe0, all=true, trans=0x7faf8c005578, is_real_trans=true) at /home/jan/mysql/10.4-kill/sql/handler.cc:1776
#16 0x00005590ae31fd0d in ha_commit_one_phase (thd=0x7faf8c001fe0, all=true) at /home/jan/mysql/10.4-kill/sql/handler.cc:1756
#17 0x00005590ae31ef5e in ha_commit_trans (thd=0x7faf8c001fe0, all=true) at /home/jan/mysql/10.4-kill/sql/handler.cc:1564
#18 0x00005590ae158a0e in trans_commit (thd=0x7faf8c001fe0) at /home/jan/mysql/10.4-kill/sql/transaction.cc:236
#19 0x00005590adfc2aa5 in mysql_execute_command (thd=0x7faf8c001fe0) at /home/jan/mysql/10.4-kill/sql/sql_parse.cc:5620
#20 0x00005590adeca428 in sp_instr_stmt::exec_core (this=0x7faf8c021eb0, thd=0x7faf8c001fe0, nextp=0x7fafebfb1f24) at /home/jan/mysql/10.4-kill/sql/sp_head.cc:3683
#21 0x00005590adec97b1 in sp_lex_keeper::reset_lex_and_exec_core (this=0x7faf8c021ef8, thd=0x7faf8c001fe0, nextp=0x7fafebfb1f24, open_tables=false, instr=0x7faf8c021eb0) at /home/jan/mysql/10.4-kill/sql/sp_head.cc:3415
#22 0x00005590adec9fd0 in sp_instr_stmt::execute (this=0x7faf8c021eb0, thd=0x7faf8c001fe0, nextp=0x7fafebfb1f24) at /home/jan/mysql/10.4-kill/sql/sp_head.cc:3589
#23 0x00005590adec36c1 in sp_head::execute (this=0x7faf8c01aba8, thd=0x7faf8c001fe0, merge_da_on_success=true) at /home/jan/mysql/10.4-kill/sql/sp_head.cc:1361
#24 0x00005590adec6464 in sp_head::execute_procedure (this=0x7faf8c01aba8, thd=0x7faf8c001fe0, args=0x7faf8c006dd0) at /home/jan/mysql/10.4-kill/sql/sp_head.cc:2366
#25 0x00005590adfb97e2 in do_execute_sp (thd=0x7faf8c001fe0, sp=0x7faf8c01aba8) at /home/jan/mysql/10.4-kill/sql/sql_parse.cc:3014
#26 0x00005590adfba495 in Sql_cmd_call::execute (this=0x7faf8c013400, thd=0x7faf8c001fe0) at /home/jan/mysql/10.4-kill/sql/sql_parse.cc:3256
#27 0x00005590adfc470f in mysql_execute_command (thd=0x7faf8c001fe0) at /home/jan/mysql/10.4-kill/sql/sql_parse.cc:6101
#28 0x00005590adfc9c89 in mysql_parse (thd=0x7faf8c001fe0, rawbuf=0x7faf8c013358 "call p1", length=7, parser_state=0x7fafebfb34d0, is_com_multi=false, is_next_command=false) at /home/jan/mysql/10.4-kill/sql/sql_parse.cc:7900
#29 0x00005590adfc937b in wsrep_mysql_parse (thd=0x7faf8c001fe0, rawbuf=0x7faf8c013358 "call p1", length=7, parser_state=0x7fafebfb34d0, is_com_multi=false, is_next_command=false) at /home/jan/mysql/10.4-kill/sql/sql_parse.cc:7704
#30 0x00005590adfb610f in dispatch_command (command=COM_QUERY, thd=0x7faf8c001fe0, packet=0x7faf8c0099e1 "call p1", packet_length=7, is_com_multi=false, is_next_command=false) at /home/jan/mysql/10.4-kill/sql/sql_parse.cc:1827
#31 0x00005590adfb4a7c in do_command (thd=0x7faf8c001fe0) at /home/jan/mysql/10.4-kill/sql/sql_parse.cc:1359
#32 0x00005590ae13feab in do_handle_one_connection (connect=0x5590b0b8ac60) at /home/jan/mysql/10.4-kill/sql/sql_connect.cc:1412
#33 0x00005590ae13fbf4 in handle_one_connection (arg=0x5590b0b8ac60) at /home/jan/mysql/10.4-kill/sql/sql_connect.cc:1316
#34 0x00005590aeb54ab8 in pfs_spawn_thread (arg=0x5590b0b64410) at /home/jan/mysql/10.4-kill/storage/perfschema/pfs.cc:1869
#35 0x00007fb00f7da609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#36 0x00007fb00ef98103 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Comment by Jan Lindström (Inactive) [ 2020-05-06 ]

seppo marko Based on the above comment, I'm stuck again and do not know how to proceed from here.

Comment by Jan Lindström (Inactive) [ 2020-05-07 ]

Additional discussion: there was a question whether we need to call THD::awake() at all. Firstly, avoiding that call in 10.4 will not help, as wsrep_thd_bf_abort() and the wsrep-lib code below it will take the thd->LOCK_thd_data mutex anyway. And in my understanding we can't avoid taking the thd->LOCK_thd_data mutex on KILL, nor lock_sys->mutex and trx->mutex when canceling the victim's locks.

Comment by Marko Mäkelä [ 2020-05-08 ]

jplindst, in the 10.2 fix of MDEV-15326 you should have noticed the following comment:

state must be rechecked after trx_mutex_enter(), and trx->release_reference() must be invoked

I do not see any reference-count operations in the proposed 10.2 or 10.3 fixes (c0f7d3783254240fa6eeb59248dafc8b55518de2, 1af58dfb1b959c2f7ae3747b078e1c31f6169701).
I did not look at the proposed 10.1 fix.

In the proposed 10.4 fix, I do not see any check of victim_trx->state in wsrep_innobase_kill_one_trx() after acquiring the mutex:

			WSREP_DEBUG("canceling wait lock");
			lock_mutex_enter();
			trx_mutex_enter(victim_trx);
			victim_trx->lock.was_chosen_as_deadlock_victim= TRUE;
			lock_cancel_waiting_and_release(victim_trx->lock.wait_lock);
			trx_mutex_exit(victim_trx);
			lock_mutex_exit();

The victim_trx may already have been committed at that point! We must not pollute trx->lock.was_chosen_as_deadlock_victim for the next user who would allocate the transaction object from the pool (see MDEV-22494).
The reference-counting in the caller wsrep_kill_victim() might be fine:

@@ -1133,9 +1133,15 @@ wsrep_kill_victim(
 					   << wsrep_thd_query(lock->trx->mysql_thd);
 			}
 
-			wsrep_innobase_kill_one_trx(trx->mysql_thd,
-						    trx, lock->trx, TRUE);
+			lock->trx->reference();
+			trx_mutex_exit(lock->trx);
+			lock_mutex_exit();
+			wsrep_innobase_kill_one_trx(trx->mysql_thd, lock->trx, true);
+			lock_mutex_enter();
+			lock->trx->release_reference();
 		}
+	} else {
+		trx_mutex_exit(lock->trx);
 	}
 }
 #endif /* WITH_WSREP */

Note: wsrep_kill_victim() is missing an assertion or a check of lock->trx->is_wsrep(). Are internal (non-Galera) transactions supposed to be killed by replicated transactions? If yes, then the clearing of that flag in my proposed patch for MDEV-22494 needs to be moved outside the if (wsrep).

General remarks:

  • I wonder why you prefer to use the nonstandard C data type my_bool instead of bool in C++ functions, such as wsrep_innobase_kill_one_trx(), and then proceed to assign bool literals to it, instead of the C-style FALSE and TRUE.
  • Please ensure that the output of git show will not exceed 80 columns.

Comment by Seppo Jaakola [ 2020-05-22 ]

I am able to reproduce a hang with KILL QUERY and a victim transaction that is in committing state, by running the galera.MW-328A test repeatedly (~50 rounds should make it surface).
However, BF aborting is not involved in the hang, so it does not match this Jira issue's description.

Comment by Seppo Jaakola [ 2020-05-27 ]

I was able to create an mtr test reproducing a debug assert when both a BF abort (issued by the wsrep applier thread) and a manual KILL command try to kill the same victim. The symptom is a debug-level assert, not a deadlock, so it is somewhat different from what this issue's description has.
I have a fix for the mtr test scenario, under jenkins testing at the moment. It may be that the fix is effective for the deadlock issue as well, although that has a different scenario.

Comment by Jan Lindström (Inactive) [ 2020-06-02 ]

seppo I have attached the necessary changes to the code and the idea of a test case that should repeat the mutex deadlock, if the test case is correct. The idea is to start a transaction on node_2 so that it takes conflicting row locks, and then execute a normal update on node_1 that will be executed by the applier on node_2 and will conflict with the open transaction on node_2. We execute these so that the BF transaction runs up to the point just before it takes LOCK_thd_data(victim), while it still owns lock_sys->mutex and trx->mutex(victim). Then, from another connection to node_2, we try to kill the open update query on node_2 up to the point where the KILL thread already owns LOCK_thd_data(victim). Then we let BF try to take LOCK_thd_data(victim), which it can't get as it is already taken, and continue KILL, which will try to take lock_sys->mutex, held by the BF thread, so both BF and KILL are waiting for each other.

My code changes and test case are against 10.2 commit 50641db2d11ad8a2228f7938d851e52decb71a9b, debug build.

Comment by Seppo Jaakola [ 2020-06-04 ]

Submitted a PR for MariaDB 10.4. It contains an mtr test reproducing an issue in a scenario where an idle client connection is the victim of a simultaneous KILL command and BF abort, and a potential fix for the issue.

The PR is against 10.4 and had a good test response in our jenkins testing. The PR is here: https://github.com/MariaDB/server/pull/1577

Comment by Jan Lindström (Inactive) [ 2020-06-04 ]

https://github.com/MariaDB/server/pull/1577

Comment by Marko Mäkelä [ 2020-06-04 ]

I would like to see a detailed explanation of the failure scenario, and root cause analysis. Concurrency is hard.

Comment by Jan Lindström (Inactive) [ 2020-06-23 ]

Consider the following table and data:

create table t1(a int not null primary key, b int) engine=innodb;
insert into t1 values (1,0);

We start an open transaction on node_2 from connection node_2a; let's call this thread_1.

#
# connection node_2a runs a local transaction, that is victim of BF abort
# and victim of KILL command by connection node_2
#
--connect node_2a, 127.0.0.1, root, , test, $NODE_MYPORT_2
--connection node_2a
# start a transaction that will conflict with later applier
begin;
update t1 set b=2 where a=1;

Now we start a transaction on node_1. This transaction will be certified successfully on node_1, and the write-set is sent to node_2 to be executed by the applier using brute force (BF); let's call this thread_2. However, the applier will find out that this write-set conflicts with thread_1.

#
# replicate an update, which will BF abort the victim node_2a
# however, while applier in node 2 is handling the abort,
# it will pause in sync point set by node_2b
#
--connection node_1
select * from t1;
update t1 set b= 1 where a=1;

In applier execution we notice that there is a conflicting lock request where a=1, in lock0lock.cc: lock_rec_other_has_conflicting(). The BF thread_2 has already locked lock_sys->mutex, and when it finds the conflicting record lock it will lock trx->mutex(victim), where the victim is the same as thread_1. Then we go to wsrep_kill_victim(), which calls ha_innodb.cc:wsrep_kill_one_trx(). Let's assume that the CPU executing thread_2 does a context switch to execute other threads; in particular, in the test case we want it to execute KILL to a certain point:

--connection node_2
--let $k_thread = `SELECT ID FROM INFORMATION_SCHEMA.PROCESSLIST WHERE USER = 'root' AND COMMAND = 'Sleep' LIMIT 1`
--send_eval KILL $k_thread

Execution of this thread (let's call it thread_3) starts from sql_parse.cc:sql_kill(), which calls kill_one_thread() and then find_thread_by_id(). In all current versions this function will find the victim thread (thread_1) and lock victim->LOCK_thd_data(). At this point we let the CPU do a context switch again and continue BF thread (thread_2) execution in ha_innodb.cc:wsrep_kill_one_trx(). There the BF thread will try to lock victim->LOCK_thd_data(), but it can't, as this victim thread is the same thread that we try to KILL in thread_3, so victim->LOCK_thd_data() is already held. The BF thread is thus holding the following mutexes:

lock_sys->mutex(global) -> trx->mutex(victim), and it is waiting for victim->LOCK_thd_data (here -> means the order of mutex lock operations).

As the BF thread can't continue, we let the CPU do a context switch again and continue KILL execution (thread_3) from kill_one_thread(). This thread will call victim->awake(), which calls ha_innodb.cc:innobase_kill_query(). There, if the thread is found, we try to lock lock_sys->mutex. However, that is not possible for this KILL thread (thread_3), as the BF thread (thread_2) has already locked it. Therefore, the KILL thread (thread_3) holds the following mutexes:

victim->LOCK_thd_data, and it is waiting for lock_sys->mutex(global).

This means that thread_2 has a mutex ordering of:

 lock_sys->mutex(global) -> trx->mutex(victim) -> victim->LOCK_thd_data

and thread_3 has a mutex ordering of:

victim->LOCK_thd_data -> lock_sys->mutex(global). 

As both threads are waiting for a mutex that is held by the other thread, neither can continue execution. The real problem is that these two execution paths take the lock_sys->mutex and victim->LOCK_thd_data mutexes in different, in fact exactly opposite, order. Thus we have thread_2(BF) waiting for thread_3(KILL) waiting for thread_2(BF). By iteration this means thread_2(BF) -> thread_2(BF), and that is impossible: we have a mutex deadlock.

The idea of the fix is to have thd->wsrep_killed=true set by the first thread that executes either the KILL or the BF victim kill. We let only one thread continue further; the other one just exits. We protect thd->wsrep_killed with thd->LOCK_thd_data, so when KILL has executed find_thread_by_id(), where it locks thd->LOCK_thd_data, we first check whether thd->wsrep_killed == true. If it is, someone has already started the kill operation, and we release thd->LOCK_thd_data and exit from KILL; if not, we set thd->wsrep_killed=true and continue. Similarly for the BF kill: after we have locked victim->LOCK_thd_data, we perform the same check.

Comment by Marko Mäkelä [ 2020-07-13 ]

How exactly is the THD::wsrep_killed protected? I suppose that it would be zero-initialized at THD object instantiation. After the field has been set, when exactly will it be reset? Keep in mind that KILL QUERY would retain the connection, and we would probably want to reset the flag before starting to execute the next statement.

Could we allocate the flag in trx_t instead of THD? It is specific to InnoDB, after all.

Comment by Jan Lindström (Inactive) [ 2020-07-14 ]

THD::wsrep_killed is protected by the THD::LOCK_thd_data mutex in my bb-10-4-MDEV-21910 branch, where I have taken the review comments into account. Correct, it is zero-initialized in the THD constructor and it is reset in THD::reset_killed(). Moving it to trx_t does not work, as we could do the kill at an MDL conflict before any transaction is started.

Comment by Marko Mäkelä [ 2020-07-16 ]

I see. I am also glad to see that the InnoDB changes are limited to the function wsrep_innobase_kill_one_trx(). I only have a minor nitpick about mysql_mutex_assert_not_owner(), which I believe to be redundant:

@@ -2121,11 +2123,19 @@ void THD::reset_killed()
   DBUG_ENTER("reset_killed");
   if (killed != NOT_KILLED)
   {
+    mysql_mutex_assert_not_owner(&LOCK_thd_kill);
     mysql_mutex_lock(&LOCK_thd_kill);
     killed= NOT_KILLED;
     killed_err= 0;
     mysql_mutex_unlock(&LOCK_thd_kill);
   }
+#ifdef WITH_WSREP
+  mysql_mutex_assert_not_owner(&LOCK_thd_data);
+  mysql_mutex_lock(&LOCK_thd_data);
+  wsrep_aborter= 0;
+  mysql_mutex_unlock(&LOCK_thd_data);
+#endif /* WITH_WSREP */
+
   DBUG_VOID_RETURN;
 }
 

The function safe_mutex_lock() would already guard against recursive acquisition:

    /*
      Check that we are not trying to lock mutex twice. This is an error
      even if we are using 'try_lock' as it's not portably what happens
      if you lock the mutex many times and this is in any case bad
      behaviour that should not be encouraged
    */
    if (pthread_equal(pthread_self(),mp->thread))

Comment by Marko Mäkelä [ 2020-07-20 ]

A fix for this was pushed to the 10.4 branch, and there are conflicts when merging to the 10.5 branch. Because I needed to merge something else, I performed a null merge of this to 10.5, and this change will have to be applied to 10.5 (as well as 10.1) separately.

Comment by Jan Lindström (Inactive) [ 2020-07-22 ]

10.5 version pushed as commit 7bffe468b239645d2f27d1d5625cb9c914ae994d

Comment by Jan Lindström (Inactive) [ 2020-07-28 ]

Attached patch wsrep_kill.diff creates a test case that reproduces the problem with the current implementation in 10.4.

Comment by Jan Lindström (Inactive) [ 2020-07-29 ]

The problem with the current implementation can be reproduced with the following very high-level test case, which requires at least a 2-node Galera cluster:

  1. Start a victim transaction on node_2, and leave it pending while holding a row lock
  2. From another connection to node_2, set a sync point to pause the applier in GAP1
  3. Send a conflicting write on node_1; the applier will run it and pause at the node_2 sync point in GAP1
  4. Through another connection to node_2, kill the local transaction; it will pause on sync point GAP2
  5. Release first the kill transaction and then the applier transaction

The problem is that both the kill transaction and the applier transaction try to kill the same victim transaction while holding the necessary mutexes. However, the mutexes are taken in a different order in the kill transaction than in the applier transaction, leading to a mutex deadlock.

To further analyze the bug, let's use 10.4 and start from the kill transaction. Here the user has issued KILL <thread>, where the thread id is the same as the victim transaction's thread id. The interesting part starts in sql_parse.cc in the function kill_one_thread().

  // Find the requested thread
  if (id && (tmp= find_thread_by_id_with_thd_data_lock(id, type == KILL_TYPE_QUERY)))
  {
    // At this point we will be holding tmp->LOCK_thd_kill and tmp->LOCK_thd_data
    // Check permissions and make sure that the user does not try to kill BF (brute force) transactions
    // ...
    // We have not set victim_thd->wsrep_aborter, so we continue the kill
    // <<< GAP1 >>>
    tmp->awake_no_mutex(kill_signal);

If we did find the thread we are trying to kill, it will be protected from a concurrent kill by the LOCK_thd_kill mutex, and the THD's data members are protected from concurrent change by the LOCK_thd_data mutex. Thus, the user can't issue another concurrent KILL for this thread: that KILL statement would have to wait and would no longer find the thread.

However, there is another method that can kill the same thread. In the problematic scheduling we assume that the user kill is executed up to the GAP1 point and then pauses.

Let's now consider the other method. For this we need a conflicting transaction that is paused at node_2.

--connection node_2
CREATE TABLE t1(a int not null primary key auto_increment,b int) engine=InnoDB;
insert into t1 values (1,0);
begin;
update t1 set b=2 where a=1;

This paused update is the victim transaction that both the user and the other method will try to kill. Let's now move to node_1 in the Galera cluster.

--connection node_1
select * from t1;
update t1 set b= 1 where a=1;

We can assume that on node_1 there are no conflicting transactions accessing t1. Thus, this transaction will pass certification and its write set is sent to node_2 to be executed by the applier. On node_2, the update to row (1,0) has a conflicting lock request, i.e. an exclusive lock. This is found in the lock0lock.cc function lock_rec_other_has_conflicting() when we try to create the lock request for the applier transaction.

lock_rec_other_has_conflicting()
{
	ut_ad(lock_mutex_own());

	// If we find a conflicting lock request
	trx_mutex_enter(lock->trx);
	wsrep_kill_victim(const_cast<trx_t*>(trx), lock);

At the point where we have found the conflicting lock request, we are holding lock_sys->mutex, which is a global mutex, and the trx mutex for the victim transaction. The lock_sys mutex ensures that no new lock request can be created and that the lock request we found can't be released. The trx mutex does not allow the victim to commit or abort. The applier transaction is a BF (brute force) transaction and we would not like to wait long, so we try to kill the victim transaction to get this record lock released. The killing is done in ha_innodb.cc in the function wsrep_innobase_kill_one_trx().

wsrep_innobase_kill_one_trx()
{
	ut_ad(lock_mutex_own());
	ut_ad(trx_mutex_own(victim_trx));
	// <<< GAP2 >>>
	// Below we lock the thd->LOCK_thd_data mutex
	wsrep_thd_LOCK(thd);
	// <<< GAP3 >>>

At the GAP2 point we are still holding the global lock_sys->mutex and the trx mutex for the victim transaction, but these do not restrict the other method of killing the victim thread, i.e. the user KILL statement.

An unlucky CPU schedule on node_2 can first execute the applier transaction (BF) up to GAP2 and pause. Then it could execute the user kill transaction up to GAP1 and pause. Finally, we can try to execute the applier transaction up to GAP3, but we can't: the kill transaction is holding victim_thd->LOCK_thd_data, which the applier is also requesting, so the applier must wait.
The kill transaction may then continue to thd->awake_no_mutex(), which calls the ha_innodb.cc function innobase_kill_query():

static void innobase_kill_query(handlerton*, THD *thd, enum thd_kill_levels)
{
    // ...
    lock_mutex_enter();

Execution of the user KILL statement will pause when the kill transaction requests lock_sys->mutex, as that mutex is already locked by the applier transaction. Now both the user KILL statement transaction and the applier transaction are paused on a mutex wait, each waiting for the other.

  • User KILL statement transaction is holding victim_thd->LOCK_thd_data and waits for lock_sys->mutex
  • Applier transaction is holding lock_sys->mutex and waiting for victim_thd->LOCK_thd_data mutex

This is a mutex deadlock.

Comment by Jan Lindström (Inactive) [ 2020-07-29 ]

Design for avoiding the mutex deadlock: we will still use thd->wsrep_aborter, but with atomic operations, so that

  • wsrep_aborter == 0 : the kill lock for this thd is free, i.e. there is no concurrent kill operation under execution
  • wsrep_aborter == bf_thd->thread_id : the kill lock is reserved by the BF thread and there is a concurrent kill operation in execution

We need to change both the kill_one_thread() function and the wsrep_innobase_kill_one_trx() function.

  // Find the requested thread
  if (id && (tmp= find_thread_by_id_with_thd_data_lock(id, type == KILL_TYPE_QUERY)))
  {
    // At this point we will be holding tmp->LOCK_thd_kill and tmp->LOCK_thd_data
    // Check permissions and make sure that the user does not try to kill BF (brute force) transactions
    // ...
    auto free= 0;
    if (!compare_and_exchange_strong(victim_thd->wsrep_aborter, free, bf_thd->thread_id) &&
        free != bf_thd->thread_id) {
      unlock(tmp->LOCK_thd_kill);
      unlock(tmp->LOCK_thd_data);
      return e.g. THREAD_NOT_FOUND;
    }
    // GAP1: We should be able to get here only if wsrep_aborter was free or
    // this thread has already reserved it
    tmp->awake_no_mutex(kill_signal);

The check of wsrep_aborter must be done with an atomic operation with strong consistency: if it is 0, we set our current bf_thd->thread_id as a marker that execution of the kill operation has started. If it is not 0, i.e. not free, the executing thread's bf_thd->thread_id must be the same as the wsrep_aborter value.

wsrep_innobase_kill_one_trx()
{
	ut_ad(lock_mutex_own());
	ut_ad(trx_mutex_own(victim_trx));
	// <<< GAP2a >>>
	auto free= 0;
	if (!compare_and_exchange_strong(victim_thd->wsrep_aborter, free, bf_thd->thread_id) &&
	    free != bf_thd->thread_id) {
	  return;
	}
	// GAP2: we should be here only if wsrep_aborter == 0 or
	// this thread has already reserved it
	// Below we lock the thd->LOCK_thd_data mutex
	wsrep_thd_LOCK(thd);
	// <<< GAP3 >>>

Here again the check of wsrep_aborter must be done with an atomic operation with strong consistency: if it is 0, we set our current bf_thd->thread_id as a marker that execution of the kill operation has started. If it is not 0, i.e. not free, the executing thread's bf_thd->thread_id must be the same as the wsrep_aborter value.

How does this high-level description of the algorithm avoid the mutex deadlock presented earlier? Remember that the reason for the mutex deadlock was that both the kill transaction and the applier transaction were trying to lock victim_thd->LOCK_thd_data and lock_sys->mutex. For user KILL statement execution we can't avoid locking victim_thd->LOCK_thd_data, BUT we can avoid the request for lock_sys->mutex in innobase_kill_query(). This is done by the atomic operation on wsrep_aborter before we even enter victim_thd->awake_no_mutex():

  • if wsrep_aborter == 0, no other thread has started executing the kill; we set the current bf_thd->thread_id in wsrep_aborter to mark that we have started the kill. After this it is safe to continue.
  • if wsrep_aborter != 0 and wsrep_aborter != bf_thd->thread_id, some other thread has already started executing the kill; we must release all mutexes we have locked and return
  • if wsrep_aborter != 0 and wsrep_aborter == bf_thd->thread_id, we ourselves have started executing the kill and it is safe to continue

Similarly, the applier transaction performs the same check in the wsrep_innobase_kill_one_trx() function before it even requests the victim_thd->LOCK_thd_data mutex. The idea is to control which thread can continue the kill operation so that only one thread will be able to do it: the first thread to set the wsrep_aborter variable wins.

  1. If the user KILL statement wins the wsrep_aborter atomic compare_exchange_strong, it will naturally still wait for lock_sys->mutex in innobase_kill_query(). But the applier transaction, doing the same atomic compare_exchange_strong, will see that wsrep_aborter != 0 and wsrep_aborter != bf_thd->thread_id; the applier will return to lock_rec_other_has_conflicting(), its lock request will be inserted at the head of the lock queue, the mutexes will be released, and it will wait for the victim transaction to be killed, at which point the victim's locks are released.
  2. If the applier transaction wins the wsrep_aborter atomic compare_exchange_strong, it will naturally still wait for the victim_thd->LOCK_thd_data mutex in wsrep_innobase_kill_one_trx(). But the user KILL statement, doing the same atomic compare_exchange_strong, will see that wsrep_aborter != 0 and wsrep_aborter != bf_thd->thread_id; it will release victim_thd->LOCK_thd_data and return, so the applier can continue.

Comment by Marko Mäkelä [ 2020-07-29 ]

An incomplete fix for this has been pushed to 10.4.14 and 10.5.5, and it does not look like we would be able to revise that fix before the next releases.
I filed MDEV-23328 to actually fix the server hang.

Comment by Rick Pizzi [ 2022-11-22 ]

jplindst I may have a case for this issue on 10.6

Generated at Thu Feb 08 09:10:46 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.