[MDEV-23328] Server hang due to Galera lock conflict resolution Created: 2020-07-29  Updated: 2022-12-16  Resolved: 2021-01-24

Status: Closed
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.1, 10.2, 10.3, 10.4, 10.5
Fix Version/s: 10.2.37, 10.3.28, 10.4.18, 10.5.9, 10.6.0

Type: Bug Priority: Blocker
Reporter: Marko Mäkelä Assignee: Sergei Golubchik
Resolution: Fixed Votes: 3
Labels: hang, not-10.6

Attachments: File atomics.cc     File kill_test.diff     Text File mdev-23328-spin.txt     File mdev-23328.pl     Text File mdev-pre-21010-spin.txt     File mdev-pre-21910.pl    
Issue Links:
Blocks
blocks MDEV-18874 Galera test MW-286 causes Mutex = TTA... Closed
is blocked by MDEV-23536 Race condition between KILL and trans... Closed
Problem/Incident
causes MDEV-24704 Galera test failure on galera.galera_... Closed
causes MDEV-24829 10.5.8 fails to startup on approx 10%... Closed
causes MDEV-24915 Galera conflict resolution is unneces... Closed
causes MDEV-25114 Crash: WSREP: invalid state ROLLED_BA... Closed
is caused by MDEV-21910 KIlling thread on Galera could cause ... Closed
Relates
relates to MDEV-18534 galera_sr.GCF-1018B: Test failure: my... Closed
relates to MDEV-23536 Race condition between KILL and trans... Closed
relates to MDEV-24966 Galera multi-master regression Closed
relates to MDEV-25518 mariadb crash with mysqld got signal 11 Closed
relates to MDEV-17092 ERROR: AddressSanitizer: use-after-po... Closed
relates to MDEV-18464 Port kill_one_trx fixes from 10.4 to ... Closed
relates to MDEV-23851 Galera assertion at lock0lock.cc line... Closed
relates to MDEV-24923 Port selected Galera conflict resolut... Closed
relates to MDEV-25368 Galera cluster hangs on Freeing items Closed

 Description   

When a SQL KILL statement requests a transaction to be aborted at the same time as that transaction is chosen as a victim in the Galera transaction certification process, the server can hang.

There have been attempts to fix this problem before. A suggested fix for MDEV-18464 was pushed and soon thereafter reverted because of issues. Another fix (which adds another field to THD, expanding the potential state space) was pushed to 10.4 and 10.5 in MDEV-21910, but it fails to prevent such hangs.

It seems possible that something related to this caused MDEV-17092, which I had worked around by changing the code in InnoDB.



 Comments   
Comment by Jan Lindström (Inactive) [ 2020-07-29 ]

The problem with the current implementation can be reproduced with the following very high-level test case, which requires at least a 2-node Galera cluster:

  1. Start a victim transaction on node_2 and leave it pending while holding a row locked
  2. From another connection to node_2, set a sync point to pause the applier in GAP1
  3. Send a conflicting write on node_1; the applier will run it and pause at the node_2 sync point in GAP1
  4. Through another connection to node_2, kill the local transaction; it will pause on sync point GAP2
  5. Release first the kill transaction and then the applier transaction

The problem is that both the kill transaction and the applier transaction try to kill the same victim transaction while holding the necessary mutexes. However, the mutexes are taken in a different order in the kill transaction than in the applier transaction, leading to a mutex deadlock.

To further analyze the bug, let's use 10.4 and start from the kill transaction. Here the user has issued KILL <thread>, where the thread id is the same as the victim transaction's thread id. The interesting part starts in sql_parse.cc, in the function kill_one_thread().

  // Find the requested thread
  if (id && (tmp= find_thread_by_id_with_thd_data_lock(id, type == KILL_TYPE_QUERY)))
  {
    // At this point we are holding tmp->LOCK_thd_kill and tmp->LOCK_thd_data.
    // Check permissions and make sure that the user does not try to kill
    // BF (brute force) transactions.
    // ...
    // We have not set victim_thd->wsrep_aborter, so we continue the kill.
    // <<< GAP1 >>>
    tmp->awake_no_mutex(kill_signal);

If we did find the thread we are trying to kill, it is protected from a concurrent kill by the LOCK_thd_kill mutex, and the THD's data members are protected from concurrent change by the LOCK_thd_data mutex. Thus, the user can't issue another concurrent KILL for this thread: that KILL statement would have to wait and would no longer find the thread.

However, there is another method that can kill the same thread that the user is trying to kill. In the problematic scheduling we assume that the user KILL is executed up to the GAP1 point and then pauses.

Let's now consider the other method. For this we need a conflicting transaction that is paused on node_2.

--connection node_2
CREATE TABLE t1(a int not null primary key auto_increment,b int) engine=InnoDB;
insert into t1 values (1,0);
begin;
update t1 set b=2 where a=1;

This paused UPDATE is the victim transaction that both the user and the other method will try to kill. Let's now move to node_1 in the Galera cluster.

--connection node_1
select * from t1;
update t1 set b= 1 where a=1;

We can assume that on node_1 there are no conflicting transactions that access t1. Thus, this transaction will pass certification, and the write set is sent to node_2 to be executed by the applier. On node_2 the update to row (1,0) meets a conflicting lock request, i.e. an exclusive lock. This is found in the lock0lock.cc function lock_rec_other_has_conflicting() when we try to create the lock request for the applier transaction.

lock_rec_other_has_conflicting()
{
	ut_ad(lock_mutex_own());

	// If we find a conflicting lock request:
	trx_mutex_enter(lock->trx);
	wsrep_kill_victim(const_cast<trx_t*>(trx), lock);

At the point where we have found the conflicting lock request, we are holding lock_sys->mutex, which is a global mutex, and the trx mutex of the victim transaction. The lock_sys mutex ensures that no new lock request may be created and that the lock request we have found can't be released. The trx mutex does not allow the victim to commit or abort. The applier transaction is a BF (brute force) transaction and we do not want to wait long, so we try to kill the victim transaction to get this record lock released. The killing is done in ha_innodb.cc, in the function wsrep_innobase_kill_one_trx.

wsrep_innobase_kill_one_trx()
{
	ut_ad(lock_mutex_own());
	ut_ad(trx_mutex_own(victim_trx));
       // <<< GAP2 >>>
       // Below we lock thd->LOCK_thd_data mutex
	wsrep_thd_LOCK(thd);
        // <<< GAP3 >>>
  

At the GAP2 point we are still holding the global lock_sys->mutex and the trx mutex of the victim transaction, but these do not restrict the other method of killing victim threads, i.e. the user KILL statement, from being executed.

An unlucky CPU schedule on node_2 can first execute the applier (BF) transaction up to GAP2 and pause it. It can then execute the user kill transaction up to GAP1 and pause it. Finally, when the applier transaction tries to proceed to GAP3, it cannot: the kill transaction is holding victim_thd->LOCK_thd_data, which the applier is requesting, so the applier must wait.
The kill transaction may then continue to thd->awake_no_mutex(), which calls the ha_innodb.cc function innobase_kill_query():

static void innobase_kill_query(handlerton*, THD *thd, enum thd_kill_levels)
{
    // ...
    lock_mutex_enter();

Execution of the user KILL statement pauses when it requests lock_sys->mutex, because that mutex is already held by the applier transaction. Now both the user KILL transaction and the applier transaction are paused on a mutex wait, each waiting for the other:

  • User KILL statement transaction is holding victim_thd->LOCK_thd_data and waits for lock_sys->mutex
  • Applier transaction is holding lock_sys->mutex and waiting for victim_thd->LOCK_thd_data mutex

This is a mutex deadlock.

Comment by Jan Lindström (Inactive) [ 2020-07-29 ]

Here is a design for avoiding the mutex deadlock. We will still use thd->wsrep_aborter, but with atomic operations, so that:

  • wsrep_aborter == 0 : the kill lock for this thd is free, i.e. there is no concurrent kill operation in execution
  • wsrep_aborter == bf_thd->thread_id : the kill lock is reserved for the BF thread, and a concurrent kill operation is in execution

We need to change both the kill_one_thread() function and the wsrep_innobase_kill_one_trx() function.

// Find the requested thread
  if (id && (tmp= find_thread_by_id_with_thd_data_lock(id, type == KILL_TYPE_QUERY)))
  {
    // At this point we are holding tmp->LOCK_thd_kill and tmp->LOCK_thd_data.
    // Check permissions and make sure that the user does not try to kill
    // BF (brute force) transactions.
    // ...
    auto free= 0;
    if (!compare_and_exchange_strong(victim_thd->wsrep_aborter, free, bf_thd->thread_id) &&
        free != bf_thd->thread_id) {
      unlock(tmp->LOCK_thd_kill);
      unlock(tmp->LOCK_thd_data);
      return /* e.g. */ THREAD_NOT_FOUND;
    }
    // GAP1: We should be able to get here only if wsrep_aborter was free or
    // this thread had already reserved it.
    tmp->awake_no_mutex(kill_signal);

The wsrep_aborter check must be done with an atomic operation with strong consistency: if the value is 0, we set our current bf_thd->thread_id as a marker that execution of the kill operation has started. If it is not 0, i.e. not free, the kill may continue only if the executing thread's bf_thd->thread_id is the same as the stored wsrep_aborter value.

wsrep_innobase_kill_one_trx()
{
	ut_ad(lock_mutex_own());
	ut_ad(trx_mutex_own(victim_trx));
	// <<< GAP2a >>>
	auto free= 0;
	if (!compare_and_exchange_strong(victim_thd->wsrep_aborter, free, bf_thd->thread_id) &&
	    free != bf_thd->thread_id) {
		return;
	}
	// GAP2: we should be here only if wsrep_aborter == 0 or
	// this thread has already reserved it.
	// Below we lock the thd->LOCK_thd_data mutex.
	wsrep_thd_LOCK(thd);
	// <<< GAP3 >>>

Here again the wsrep_aborter check must be done with an atomic operation with strong consistency: if the value is 0, we set our current bf_thd->thread_id as a marker that execution of the kill operation has started. If it is not 0, the kill may continue only if bf_thd->thread_id is the same as the stored wsrep_aborter value.

How does this high-level description of the algorithm avoid the mutex deadlock presented earlier? Remember that the reason for the mutex deadlock was that both the kill transaction and the applier transaction were trying to lock victim_thd->LOCK_thd_data and lock_sys->mutex. For the user KILL statement we can't avoid locking victim_thd->LOCK_thd_data, BUT we can avoid the request for lock_sys->mutex in innobase_kill_query(). This is done by the atomic operation on wsrep_aborter before we even enter victim_thd->awake_no_mutex():

  • if wsrep_aborter == 0, no other thread has started executing a kill; we set the current bf_thd->thread_id into wsrep_aborter to mark that we have started executing the kill. After this it is safe to continue.
  • if wsrep_aborter != 0 and wsrep_aborter != bf_thd->thread_id, some other thread has already started executing a kill; we must release all mutexes we have locked and return.
  • if wsrep_aborter != 0 and wsrep_aborter == bf_thd->thread_id, we had already started executing the kill ourselves, and it is safe to continue.

The same check is done in the applier transaction, in the wsrep_innobase_kill_one_trx function, before we even request the victim_thd->LOCK_thd_data mutex. The idea is to control which thread may continue the kill operation, so that only one thread can do it: the first thread that manages to set the wsrep_aborter variable.

  1. If the user KILL statement is scheduled first on the wsrep_aborter atomic compare_exchange_strong, it will naturally still wait for lock_sys->mutex in innobase_kill_query(). But the applier transaction, running the same atomic compare_exchange_strong, will see that wsrep_aborter != 0 and wsrep_aborter != bf_thd->thread_id, so it returns to lock_rec_other_has_conflicting(); its lock request is inserted at the head of the lock queue, the mutexes are released, and the applier waits for the victim transaction to be killed and its locks released.
  2. If the applier is scheduled first on the wsrep_aborter atomic compare_exchange_strong, it will naturally still wait for the victim_thd->LOCK_thd_data mutex in the wsrep_innobase_kill_one_trx function. But the user KILL statement, running the same atomic compare_exchange_strong, will see that wsrep_aborter != 0 and wsrep_aborter != bf_thd->thread_id, so it releases victim_thd->LOCK_thd_data and returns, and the applier can continue.
Comment by Jan Lindström (Inactive) [ 2020-07-29 ]

serg I need some guidance. Firstly, I hope my description of the problem is understandable and detailed enough. The open questions are:

  • Does the fix plan sound feasible and correct?
  • Should this bug be assigned to Codership to fix, or to whom should it be assigned?
  • Who can review the fix candidate? I must say my knowledge of atomic operations is not deep enough.
  • The current idea requires C++11; I am not sure what we have if that is not available (10.2?)
  • http://www.cplusplus.com/reference/atomic/atomic/compare_exchange_strong/
Comment by Marko Mäkelä [ 2020-07-30 ]

jplindst, how are you addressing the comment that svoj made in MDEV-17092?

you can conclude that thd->free_connection() can be called concurrently with KILL/thd->awake().

He is suggesting to revert the changes that were made in MDEV-17092 and to fix the root cause in the server layer. I think that it should be addressed as part of this ticket.

I have a theoretical background on model checking and verification of parallel and distributed systems. The common line of thought is that the more global state is being accessed, the bigger the potential state space is. I believe that it was a mistake to explode the reachable state space further by introducing THD::wsrep_aborter in MDEV-21910.

Concurrency is hard, and there should be no excuse against using freely available tools such as Spin. It has been successfully applied on telecommunication protocols and in space technology. It was even applied on the InnoDB rw_lock_t implementation at one time in the past.

I do not think that anyone can meaningfully assess correctness without carefully checking that the implementation conforms to an unambiguous specification, and without checking that the specification itself is correct. When it comes to checking the specification, I think that we must employ exhaustive state space enumeration. Model checking in terms of temporal logic formulae is not needed; we are only interested in the absence of deadlocks or violations of safety invariants.

Comment by Jan Lindström (Inactive) [ 2020-07-30 ]

I can't address the problem that thd->free_connection() can be called concurrently with KILL/thd->awake(). In my opinion that is technical debt that requires its own ticket; that part does not belong to this ticket. I do agree that this technical debt should be fixed, but I'm not sure I would be the correct assignee, and the fix might be so big that we should target it only for 10.6. This technical debt belongs to MariaDB to fix. I'll let serg decide this part; ralf.gebhardt@mariadb.com FYI.

I agree that introducing THD::wsrep_aborter will expand the reachable state space further, and that one could model a simplified version of this problem using formal modeling tools. However, I'm not familiar with Spin or any other formal-method tool, and I do not really have a theoretical background or experience in model checking and verification of parallel and distributed systems.

On this ticket I do understand the current problem and its cause, and I have a test case that reproduces the problem every time. With code this complex there is a possibility of other bugs. I agree that it may not be possible to meaningfully assess the correctness of my suggested fix by just looking at the changes, but at the moment we do not have anything better.
The shown problem is caused by Galera, so serg should first decide whether we just let Codership fix this problem (the fix for MDEV-21910 does fix one possible scheduling of KILL/BF kill, but not all of them), given that there are customer support requests, or whether we fix it ourselves.

Comment by Jan Lindström (Inactive) [ 2020-07-30 ]

Attached a patch (kill_test.diff) that creates a test case reproducing the currently known problem, and a simple experimental test program that uses the compare_exchange_strong method (atomics.cc).

Comment by Jan Lindström (Inactive) [ 2020-08-19 ]

Attached the Spin models and verification results. mdev-pre-21910.pl is a very simplified model of the original problem, and as expected Spin finds the mutex deadlock; see mdev-pre-21010-spin.txt. mdev-23328.pl is a very simplified model of the proposed fix, and its verification result is in mdev-23328-spin.txt: no deadlocks are found. I know these are very simplified models, but they do contain the same mutexes as the actual problem. What is missing is the communication between the thread starting a victim kill and the victim thread, but in my understanding that is not the problem in this bug. From InnoDB's point of view, Galera lock conflict resolution is similar to a lock deadlock situation: both involve at least two threads, one of which is selected as a victim and then aborted.

Comment by Marko Mäkelä [ 2020-08-22 ]

I filed MDEV-23536 for fixing MDEV-17092 in a better way, outside the storage engine layer. The surrounding code does include some Galera specific changes.

Comment by Jan Lindström (Inactive) [ 2020-09-03 ]

Current status:

  • 10.2 : branch: bb-10.2-MDEV-21910 commit: 1b6bd701b30f8297db38e605692aa91aff92a06d one test failure
  • 10.3 : branch: bb-10.3-MDEV-21910 commit: a8d75cd0885707be1791f9dd61723cc5ac0013a6
Comment by Marko Mäkelä [ 2020-10-08 ]

Having as little shared data as possible should make things easier to understand. Likewise, a one-shot flag (which is reset at object creation and cannot be reset once it has been set) is much easier to reason about than a flag that can be reset and set multiple times.

Is there an alternative solution that would allow us to remove the field THD::wsrep_aborter that was added in MDEV-21910? If not, can we simplify the data flow so that THD::reset_killed() would not be involved, or would never reset that flag once it has been set? If that is possible, then I think that we could employ Release-Consume ordering between loads and stores (or compare-and-exchange in wsrep_thd_set_wsrep_aborter()).

Comment by Marko Mäkelä [ 2020-10-08 ]

Sorry, I missed the note in the page that I linked to:

The specification of release-consume ordering is being revised, and the use of memory_order_consume is temporarily discouraged.

It seems to refer to P0371R1, which states:

All current compilers essentially map it to memory_order_acquire.

Comment by Seppo Jaakola [ 2020-10-20 ]

The scenario Jan posted is a valid one. With the right scheduling, a manual KILL command and BF aborting may still end up in a mutex deadlock.

This anomaly can be avoided by changing the manual KILL processing to do the victim marking up front, before the actual THD::awake call. This victim marking must happen without holding THD::LOCK_thd_data and while holding the victim's InnoDB trx mutex. In this model, both manual KILL and BF aborting go through the same mutex locking protocol.
I will implement a patch for this model and append it to the PR for further testing.

Comment by Jan Lindström (Inactive) [ 2020-11-09 ]
Comment by Jan Lindström (Inactive) [ 2020-11-10 ]

This review request is for changes on server code.

Comment by Jan Lindström (Inactive) [ 2020-11-12 ]

marko Can you review the InnoDB changes? 10.2 and 10.3 are similar, as are 10.4 and 10.5.

Comment by MikaH [ 2020-12-02 ]

Is this proceeding?

Comment by Marko Mäkelä [ 2020-12-08 ]

I would like to see the SAFE_MUTEX warnings that would demonstrate the problem and the effectiveness of the solution. In the bb-10.6-MDEV-21452 branch, all InnoDB mutexes have been replaced with generic ones. With the exception of trx_t::mutex, all of them are covered by SAFE_MUTEX.

It seems to me that both MDEV-23536 and MDEV-23468 are somewhat related to this.

Comment by Jan Lindström (Inactive) [ 2020-12-14 ]

Using bb-10.6-MDEV-21452 and the attached test case, the applier thread (i.e. the brute force thread) hangs here:

Thread 8 (Thread 0x7fb99467e640 (LWP 48069)):
#0  __lll_lock_wait (futex=futex@entry=0x7fb96c0042b0, private=0) at lowlevellock.c:52
#1  0x00007fb99944aa95 in __GI___pthread_mutex_lock (mutex=0x7fb96c0042b0) at ../nptl/pthread_mutex_lock.c:135
#2  0x000055ca9fb04d30 in safe_mutex_lock (mp=0x7fb96c004288, my_flags=0, file=0x55ca9ff00878 "/home/jan/mysql/10.6/sql/service_wsrep.cc", line=32) at /home/jan/mysql/10.6/mysys/thr_mutex.c:290
#3  0x000055ca9f62ed07 in inline_mysql_mutex_lock (that=0x7fb96c004288, src_file=0x55ca9ff00878 "/home/jan/mysql/10.6/sql/service_wsrep.cc", src_line=32) at /home/jan/mysql/10.6/include/mysql/psi/mysql_thread.h:750
#4  0x000055ca9f62f00e in wsrep_thd_LOCK (thd=0x7fb96c002528) at /home/jan/mysql/10.6/sql/service_wsrep.cc:32
#5  0x000055ca9f6701b3 in wsrep_innobase_kill_one_trx (bf_thd=0x7fb9780022b8, victim_trx=0x7fb97e3fe260, signal=true) at /home/jan/mysql/10.6/storage/innobase/handler/ha_innodb.cc:18017
#6  0x000055ca9f70101b in wsrep_kill_victim (trx=0x7fb97e3ff378, lock=0x7fb97e3fe3c0) at /home/jan/mysql/10.6/storage/innobase/lock/lock0lock.cc:1047
#7  0x000055ca9f7011ce in lock_rec_other_has_conflicting (mode=1027, block=0x7fb97d40d520, heap_no=2, trx=0x7fb97e3ff378) at /home/jan/mysql/10.6/storage/innobase/lock/lock0lock.cc:1088
#8  0x000055ca9f70336b in lock_rec_lock (impl=false, mode=1027, block=0x7fb97d40d520, heap_no=2, index=0x7fb9640602b8, thr=0x7fb96c03a438) at /home/jan/mysql/10.6/storage/innobase/lock/lock0lock.cc:1717
#9  0x000055ca9f70f45f in lock_clust_rec_read_check_and_lock (flags=0, block=0x7fb97d40d520, rec=0x7fb97d95407e "\200", index=0x7fb9640602b8, offsets=0x7fb99467b9a0, mode=LOCK_X, gap_mode=1024, thr=0x7fb96c03a438) at /home/jan/mysql/10.6/storage/innobase/lock/lock0lock.cc:5348
#10 0x000055ca9f843376 in sel_set_rec_lock (pcur=0x7fb96c039c30, rec=0x7fb97d95407e "\200", index=0x7fb9640602b8, offsets=0x7fb99467b9a0, mode=3, type=1024, thr=0x7fb96c03a438, mtr=0x7fb99467bd80) at /home/jan/mysql/10.6/storage/innobase/row/row0sel.cc:1238
#11 0x000055ca9f84dafe in row_search_mvcc (buf=0x7fb96c038da8 "\375\001", mode=PAGE_CUR_GE, prebuilt=0x7fb96c039a68, match_mode=1, direction=0) at /home/jan/mysql/10.6/storage/innobase/row/row0sel.cc:5049
#12 0x000055ca9f65c759 in ha_innobase::index_read (this=0x7fb96c039210, buf=0x7fb96c038da8 "\375\001", key_ptr=0x7fb96c0390b8 "\001", key_len=4, find_flag=HA_READ_KEY_EXACT) at /home/jan/mysql/10.6/storage/innobase/handler/ha_innodb.cc:8565
#13 0x000055ca9f65dabd in ha_innobase::rnd_pos (this=0x7fb96c039210, buf=0x7fb96c038da8 "\375\001", pos=0x7fb96c0390b8 "\001") at /home/jan/mysql/10.6/storage/innobase/handler/ha_innodb.cc:9054
#14 0x000055ca9f1e3da5 in handler::ha_rnd_pos (this=0x7fb96c039210, buf=0x7fb96c038da8 "\375\001", pos=0x7fb96c0390b8 "\001") at /home/jan/mysql/10.6/sql/handler.cc:3098
#15 0x000055ca9f1f6de2 in handler::rnd_pos_by_record (this=0x7fb96c039210, record=0x7fb96c038da8 "\375\001") at /home/jan/mysql/10.6/sql/handler.h:3929
#16 0x000055ca9f0a30ee in handler::ha_rnd_pos_by_record (this=0x7fb96c039210, buf=0x7fb96c038da8 "\375\001") at /home/jan/mysql/10.6/sql/sql_class.h:7055
#17 0x000055ca9f3634a4 in Rows_log_event::find_row (this=0x7fb978029858, rgi=0x7fb978023230) at /home/jan/mysql/10.6/sql/log_event_server.cc:7820
#18 0x000055ca9f36498a in Update_rows_log_event::do_exec_row (this=0x7fb978029858, rgi=0x7fb978023230) at /home/jan/mysql/10.6/sql/log_event_server.cc:8271
#19 0x000055ca9f35c571 in Rows_log_event::do_apply_event (this=0x7fb978029858, rgi=0x7fb978023230) at /home/jan/mysql/10.6/sql/log_event_server.cc:5668
#20 0x000055ca9ed6a599 in Log_event::apply_event (this=0x7fb978029858, rgi=0x7fb978023230) at /home/jan/mysql/10.6/sql/log_event.h:1498
#21 0x000055ca9f622a6b in wsrep_apply_events (thd=0x7fb9780022b8, rli=0x7fb9780176b0, events_buf=0x7fb988000a00, buf_len=0) at /home/jan/mysql/10.6/sql/wsrep_applier.cc:208
#22 0x000055ca9f6013e2 in apply_events (thd=0x7fb9780022b8, rli=0x7fb9780176b0, data=..., err=...) at /home/jan/mysql/10.6/sql/wsrep_high_priority_service.cc:127
#23 0x000055ca9f60326b in Wsrep_applier_service::apply_write_set (this=0x7fb99467dc60, ws_meta=..., data=..., err=...) at /home/jan/mysql/10.6/sql/wsrep_high_priority_service.cc:541
#24 0x000055ca9fba315d in apply_write_set (server_state=..., high_priority_service=..., ws_handle=..., ws_meta=..., data=...) at /home/jan/mysql/10.6/wsrep-lib/src/server_state.cpp:327
#25 0x000055ca9fba70f2 in wsrep::server_state::on_apply (this=0x55caa14432a0, high_priority_service=..., ws_handle=..., ws_meta=..., data=...) at /home/jan/mysql/10.6/wsrep-lib/src/server_state.cpp:1136
#26 0x000055ca9fbc1393 in wsrep::high_priority_service::apply (this=0x7fb99467dc60, ws_handle=..., ws_meta=..., data=...) at /home/jan/mysql/10.6/wsrep-lib/include/wsrep/high_priority_service.hpp:47
#27 0x000055ca9fbbe40f in (anonymous namespace)::apply_cb (ctx=0x7fb99467dc60, wsh=0x7fb99467ce80, flags=65, buf=0x7fb99467ce90, meta=0x7fb99467d150, exit_loop=0x7fb99467d10f) at /home/jan/mysql/10.6/wsrep-lib/src/wsrep_provider_v26.cpp:502
#28 0x00007fb9949c92b5 in galera::TrxHandleSlave::apply (this=this@entry=0x7fb9780376c0, recv_ctx=recv_ctx@entry=0x7fb99467dc60, apply_cb=0x55ca9fbbe1dc <(anonymous namespace)::apply_cb(void*, wsrep_ws_handle_t const*, uint32_t, wsrep_buf_t const*, wsrep_trx_meta_t const*, wsrep_bool_t*)>, meta=..., exit_loop=exit_loop@entry=@0x7fb99467d10f: false) at galera/src/trx_handle.cpp:391
#29 0x00007fb994a0d256 in galera::ReplicatorSMM::apply_trx (this=0x55caa1476750, recv_ctx=0x7fb99467dc60, ts=...) at galera/src/replicator_smm.cpp:504
#30 0x00007fb994a10e60 in galera::ReplicatorSMM::process_trx (this=0x55caa1476750, recv_ctx=0x7fb99467dc60, ts_ptr=...) at galera/src/replicator_smm.cpp:2127
#31 0x00007fb9949e7e30 in galera::GcsActionSource::process_writeset (this=0x55caa14a8780, recv_ctx=0x7fb99467dc60, act=..., exit_loop=@0x7fb99467d83f: false) at galera/src/gcs_action_source.cpp:62
#32 0x00007fb9949e8ac2 in galera::GcsActionSource::process (this=0x55caa14a8780, recv_ctx=0x7fb99467dc60, exit_loop=@0x7fb99467d83f: false) at galera/src/gcs_action_source.cpp:183
#33 0x00007fb994a11350 in galera::ReplicatorSMM::async_recv (this=0x55caa1476750, recv_ctx=0x7fb99467dc60) at galera/src/replicator_smm.cpp:390
#34 0x00007fb994a2d6e1 in galera_recv (gh=<optimized out>, recv_ctx=<optimized out>) at galera/src/wsrep_provider.cpp:236
#35 0x000055ca9fbbf4be in wsrep::wsrep_provider_v26::run_applier (this=0x55caa13710e0, applier_ctx=0x7fb99467dc60) at /home/jan/mysql/10.6/wsrep-lib/src/wsrep_provider_v26.cpp:740
#36 0x000055ca9f6238c4 in wsrep_replication_process (thd=0x7fb9780022b8, arg=0x55caa14107d0) at /home/jan/mysql/10.6/sql/wsrep_thd.cc:58
#37 0x000055ca9f613ff9 in start_wsrep_THD (arg=0x55caa14107d0) at /home/jan/mysql/10.6/sql/wsrep_mysqld.cc:3033

User KILL command hangs here:

Thread 12 (Thread 0x7fb97d3af640 (LWP 48283)):
#0  __lll_lock_wait (futex=futex@entry=0x55caa104bba8 <lock_sys+104>, private=0) at lowlevellock.c:52
#1  0x00007fb99944a903 in __GI___pthread_mutex_lock (mutex=0x55caa104bba8 <lock_sys+104>) at ../nptl/pthread_mutex_lock.c:80
#2  0x000055ca9fb04d30 in safe_mutex_lock (mp=0x55caa104bb80 <lock_sys+64>, my_flags=0, file=0x55ca9ff0a040 "/home/jan/mysql/10.6/storage/innobase/include/lock0lock.h", line=747) at /home/jan/mysql/10.6/mysys/thr_mutex.c:290
#3  0x000055ca9f6476ae in inline_mysql_mutex_lock (that=0x55caa104bb80 <lock_sys+64>, src_file=0x55ca9ff0a040 "/home/jan/mysql/10.6/storage/innobase/include/lock0lock.h", src_line=747) at /home/jan/mysql/10.6/include/mysql/psi/mysql_thread.h:750
#4  0x000055ca9f67bd22 in lock_sys_t::mutex_lock (this=0x55caa104bb40 <lock_sys>) at /home/jan/mysql/10.6/storage/innobase/include/lock0lock.h:747
#5  0x000055ca9f652021 in innobase_kill_query (thd=0x7fb96c002528) at /home/jan/mysql/10.6/storage/innobase/handler/ha_innodb.cc:4465
#6  0x000055ca9f1dd8e4 in kill_handlerton (thd=0x7fb96c002528, plugin=0x7fb964002468, level=0x7fb97d3ada44) at /home/jan/mysql/10.6/sql/handler.cc:917
#7  0x000055ca9ee9087d in plugin_foreach_with_mask (thd=0x7fb96c002528, func=0x55ca9f1dd868 <kill_handlerton(THD*, plugin_ref, void*)>, type=1, state_mask=8, arg=0x7fb97d3ada44) at /home/jan/mysql/10.6/sql/sql_plugin.cc:2505
#8  0x000055ca9f1dd942 in ha_kill_query (thd=0x7fb96c002528, level=THD_ABORT_ASAP) at /home/jan/mysql/10.6/sql/handler.cc:924
#9  0x000055ca9edf49b6 in THD::awake_no_mutex (this=0x7fb96c002528, state_to_set=KILL_CONNECTION_HARD) at /home/jan/mysql/10.6/sql/sql_class.cc:1932
#10 0x000055ca9ee83fe1 in kill_one_thread (thd=0x7fb96400b6f8, id=14, kill_signal=KILL_CONNECTION_HARD, type=KILL_TYPE_ID) at /home/jan/mysql/10.6/sql/sql_parse.cc:9110
#11 0x000055ca9ee8455a in sql_kill (thd=0x7fb96400b6f8, id=14, state=KILL_CONNECTION_HARD, type=KILL_TYPE_ID) at /home/jan/mysql/10.6/sql/sql_parse.cc:9234
#12 0x000055ca9ee792ad in mysql_execute_command (thd=0x7fb96400b6f8) at /home/jan/mysql/10.6/sql/sql_parse.cc:5403
#13 0x000055ca9ee812ca in mysql_parse (thd=0x7fb96400b6f8, rawbuf=0x7fb964015250 "KILL 14", length=7, parser_state=0x7fb97d3ae3f0) at /home/jan/mysql/10.6/sql/sql_parse.cc:7881
#14 0x000055ca9ee80a14 in wsrep_mysql_parse (thd=0x7fb96400b6f8, rawbuf=0x7fb964015250 "KILL 14", length=7, parser_state=0x7fb97d3ae3f0) at /home/jan/mysql/10.6/sql/sql_parse.cc:7695
#15 0x000055ca9ee6d7f8 in dispatch_command (command=COM_QUERY, thd=0x7fb96400b6f8, packet=0x7fb964002729 "KILL 14", packet_length=7) at /home/jan/mysql/10.6/sql/sql_parse.cc:1803
#16 0x000055ca9ee6c2f4 in do_command (thd=0x7fb96400b6f8) at /home/jan/mysql/10.6/sql/sql_parse.cc:1348
#17 0x000055ca9f019461 in do_handle_one_connection (connect=0x55caa1aab228, put_in_cache=true) at /home/jan/mysql/10.6/sql/sql_connect.cc:1410
#18 0x000055ca9f0191c4 in handle_one_connection (arg=0x55caa1aa53a8) at /home/jan/mysql/10.6/sql/sql_connect.cc:1312
#19 0x000055ca9f57430b in pfs_spawn_thread (arg=0x55caa149a638) at /home/jan/mysql/10.6/storage/perfschema/pfs.cc:2201
#20 0x00007fb999448590 in start_thread (arg=0x7fb97d3af640) at pthread_create.c:463
#21 0x00007fb99901b223 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

The actual SAFE_MUTEX warning looks as follows (it would be good if it showed a more meaningful name, not just 'mutex'):

safe_mutex: Found wrong usage of mutex 'LOCK_thd_kill' and 'mutex'
Mutex currently locked (in reverse order):
mutex                             /home/jan/mysql/10.6/storage/innobase/include/lock0lock.h  line 747
LOCK_thd_kill                     /home/jan/mysql/10.6/sql/sql_parse.cc  line 9030
LOCK_thd_data                     /home/jan/mysql/10.6/sql/sql_parse.cc  line 9029

Comment by Jan Lindström (Inactive) [ 2020-12-14 ]

Rebased versions 14.12.2020:

Comment by Jan Lindström (Inactive) [ 2020-12-14 ]

monty Can you please review server changes.

Comment by Marko Mäkelä [ 2020-12-14 ]

The function wsrep_thd_set_wsrep_aborter(), which seems to be at the core of the fix, is not documented by any source code comment and is not even mentioned in the commit message. It appears to compare and copy data members of up to two THD objects. How the data members are protected is not documented, either in source code comments or in the form of mysql_mutex_assert_owner() assertions.

Introducing more complex data structures might fix the hang, but it could also shift or transform the problem and make it harder to diagnose.

Do we really have to extend the KILL interface with a Galera specific mode? It seems to me that the changes to innobase_kill_query() will thwart the MDEV-17092 work-around of MDEV-23536 when that mode is being used.

Comment by Jan Lindström (Inactive) [ 2020-12-14 ]

Added the requested source code comments and assertions, and improved the commit message. Currently I do not know of any other working solution for Galera 4. For Galera 3 we could use a background killing method similar to what asynchronous replication uses, but that change is significantly bigger.

Comment by Jan Lindström (Inactive) [ 2020-12-14 ]

I updated and rebased the commits in the list posted today.

Comment by Marko Mäkelä [ 2020-12-14 ]

Most of my comments were addressed. Unfortunately, no effort to fix the race condition in innobase_kill_query() was made. I think that the proper course of action would be to fix MDEV-23536 first.

Comment by Jan Lindström (Inactive) [ 2021-01-10 ]

serg Can you review https://github.com/MariaDB/server/commit/59b71d9cfa05f6b1ad3ff5f90433d141a1947582 ? It addresses Marko's comment above, now that MDEV-23536 is fixed. However, it still extends thd_kill_levels. If that is a no-go, what are my options here? Galera has introduced a few functions to the Handler API (see fake_trx_id ); I could use a new function instead of extending thd_kill_levels. We are talking about GA releases here, so some effort is needed to keep this patch short.

Comment by Sergei Golubchik [ 2021-01-13 ]

Suggestion: introduce THD::LOCK_wsrep_data to protect all THD::wsrep* members. That would make it possible to fix the mutex order violation.

Comment by Jan Lindström (Inactive) [ 2021-01-14 ]

serg I would not go with that solution; we already had one, but it was refactored away during the 10.4 Galera 4 review, at Svoj's request. Furthermore, using THD::LOCK_thd_data to protect THD::wsrep* variables prevents a concurrent thread from issuing a manual kill (KILL QUERY|CONNECTION) while another thread accesses the THD::wsrep* variables. I agree that the current implementation prevents the mutex deadlock, but it does not solve the mutex ordering violation. Thus, a better solution is required.

I looked at 10.2 and at what asynchronous replication does when it needs to kill a victim. In the lock0lock.cc function DeadlockChecker::search() we hold lock_sys->mutex and call thd_rpl_deadlock_check(). That function is in sql_class.cc, and there we take LOCK_slave_background, which looks like a global mutex.

A background thread then handles these victims in handle_slave_background: it takes that same LOCK_slave_background and victim_thread->LOCK_thd_data and calls victim_thd->awake(). To me it seems that we do not hold LOCK_slave_background during THD::awake().

I had a similar approach for 10.2 and it worked fine, but I had problems with 10.4. Do you see some better solution, or should I try this background kill again?

Comment by Jan Lindström (Inactive) [ 2021-01-17 ]

Something like https://github.com/MariaDB/server/commit/1b6bd701b30f8297db38e605692aa91aff92a06d

Comment by Sergei Golubchik [ 2021-01-19 ]

Background kill alone doesn't help unless you move the complete wsrep_innobase_kill_one_trx and the complete wsrep_abort_transaction into the background.

Comment by Jan Lindström (Inactive) [ 2021-01-21 ]

OK to push from me; there is just a minor formatting error in wsrep_innobase_kill_one_trx.

Comment by Sergei Golubchik [ 2021-01-24 ]

Approved and pushed. The commit, for the record: https://github.com/MariaDB/server/commit/29bbcac0ee

with two preparatory commits

Comment by Marko Mäkelä [ 2021-02-17 ]

In 10.6, this conflicted with MDEV-24671 and MDEV-20612. I had 21 Galera tests busy-looping (with a galera-4 library version 26.4.7-2 from Debian Sid) because bg_wsrep_kill_trx() invoked lock_cancel_waiting_and_release(lock) and therefore let the brute-force continue execution prematurely. Removing that call fixed the trouble and let the brute-force nicely wait in lock_wait() for the kill of the victim transaction. I wonder if we really need that separate thread in 10.6. We could just handle any buffered kills at the start of lock_wait().

In my merge, I also added a condition on trx->state == TRX_STATE_COMMITTED_IN_MEMORY in order to avoid unnecessary work. If a transaction has already been committed or rolled back, it will release its locks in lock_release() and let the waiting thread(s) continue execution.

I think that it could make sense to backport these changes to earlier versions as well. In 10.5, I noticed some "long semaphore wait" messages in one of the tests. I believe that this is due to an unnecessary race between the killer thread and the brute-force thread (which keeps requesting the same victim transaction to be killed over and over again, while creating more and more lock requests for something that could have been handled with implicit locking).

The CPU usage that I experienced was probably explained by this:

#2  __pthread_cond_timedwait (cond=0x7fb136cfd2d8, mutex=0x56204b2acac0, abstime=0x7fb114f949d0) at pthread_cond_wait.c:656
#3  0x00007fb144a2c74a in ?? () from /usr/lib/galera/libgalera_smm.so
#4  0x00007fb144a1323a in ?? () from /usr/lib/galera/libgalera_smm.so
#5  0x0000562049cf7d15 in wsrep::wsrep_provider_v26::causal_read (this=<optimized out>, timeout=<optimized out>) at /mariadb/10.6/wsrep-lib/src/wsrep_provider_v26.cpp:924
#6  0x0000562049cdfebd in wsrep::server_state::causal_read (this=<optimized out>, timeout=timeout@entry=-1) at /mariadb/10.6/wsrep-lib/src/server_state.cpp:850
#7  0x0000562049cd8918 in wsrep::client_state::sync_wait (this=this@entry=0x7fb118008418, timeout=timeout@entry=-1) at /mariadb/10.6/wsrep-lib/src/client_state.cpp:744
#8  0x0000562049800ec5 in wsrep_sync_wait (thd=thd@entry=0x7fb118001e88, mask=mask@entry=1) at /mariadb/10.6/sql/sql_class.h:5293
#9  0x00005620491b2ec8 in mysql_execute_command (thd=thd@entry=0x7fb118001e88) at /mariadb/10.6/sql/sql_parse.cc:3871
#10 0x00005620491b9eaf in mysql_parse (thd=thd@entry=0x7fb118001e88, rawbuf=rawbuf@entry=0x7fb118014050 "SELECT COUNT(*) = 1 FROM t1 WHERE f2 = 'node_1'", length=length@entry=47, parser_state=parser_state@entry=0x7fb114f954c0) at /mariadb/10.6/sql/sql_parse.cc:7972
#11 0x00005620491babba in wsrep_mysql_parse (thd=thd@entry=0x7fb118001e88, rawbuf=0x7fb118014050 "SELECT COUNT(*) = 1 FROM t1 WHERE f2 = 'node_1'", length=47, parser_state=parser_state@entry=0x7fb114f954c0) at /mariadb/10.6/sql/sql_parse.cc:7786
#12 0x00005620491bc59d in dispatch_command (command=command@entry=COM_QUERY, thd=thd@entry=0x7fb118001e88, packet=0x7fb114f954c0 "\377\377\377\377", packet@entry=0x7fb11800bc49 "SELECT COUNT(*) = 1 FROM t1 WHERE f2 = 'node_1'", packet_length=packet_length@entry=47, blocking=blocking@entry=true) at /mariadb/10.6/sql/sql_parse.cc:1873

One might imagine that timeout=-1 means ‘infinite’. But to this version of the galera-4 library, it apparently meant ‘1 second in the past’, causing an immediate return from pthread_cond_timedwait().

Comment by Marko Mäkelä [ 2021-02-18 ]

My observation about wsrep_sync_wait() turned out to be incorrect. There is a short timeout. The -1 refers to repl.causal_read_timeout from wsrep_provider_options. I checked with perf top that the wsrep_sync_wait() thread was generating very few samples, possibly corresponding to waking up once every few seconds.

The CPU was consumed by high-priority ("brute-force") applier thread until I removed the offending lock_cancel_waiting_and_release(lock) call, as noted in my previous comment.

I submitted MDEV-24915 for cleaning up the conflict resolution further in 10.6. We can simply let the high-priority applier thread kill the conflicting transactions, and there does not seem any need to have any Galera-specific code in the lock creation.

Comment by Iosif Peterfi [ 2021-03-22 ]

This fix in 10.4.18 made the cluster freeze every few days, with one node lagging behind. When the lagging node is restarted, cluster activity resumes.

https://jira.mariadb.org/browse/MDEV-24294

Comment by Walter Doekes [ 2021-03-22 ]

Interesting. I appear to have an issue with the new bg_wsrep_kill_trx (in 10.3.28) as well, in MDEV-25190, but the symptoms are different.

Comment by Jan Lindström (Inactive) [ 2021-03-23 ]

Hi, let's get the facts first. Did you use KILL [QUERY|CONNECTION] from the command line or from an application? If not, then the problem can't be similar to this MDEV. If your cluster freezes, can you please provide a stack dump in a separate MDEV?

Comment by Florian Bezdeka [ 2021-03-26 ]

After updating from 10.4.17 to 10.4.18 I have run into the "completely frozen cluster" issue twice now. After reading the changelog I assume the fix for this MDEV introduces a regression. The follow-up MDEV might be MDEV-24294, which has already been mentioned.

Comment by Florian Bezdeka [ 2021-08-28 ]

Why was this issue closed? I can't find a fix for 10.4.x...

Comment by Marko Mäkelä [ 2021-12-03 ]

In 10.6.0, this was fixed in a simpler way by MDEV-24915.

Comment by Stephan Vos [ 2022-09-04 ]

OK, so to check: has this issue been fixed correctly in 10.5.9, or is it still a problem?
I'm planning to upgrade from 10.5.6 to 10.5.17 and want to make sure this fix won't cause issues.

Comment by Khai Ping [ 2022-09-18 ]

I am still seeing this in 10.6.9. Reported in MDEV-29346.

Generated at Thu Feb 08 09:21:34 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.