[MDEV-25190] Semaphore wait has lasted > 600 seconds; stuck on bg_wsrep_kill_trx Created: 2021-03-18  Updated: 2021-05-15  Resolved: 2021-04-07

Status: Closed
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.3.28
Fix Version/s: 10.3.29

Type: Bug Priority: Major
Reporter: Walter Doekes Assignee: Jan Lindström (Inactive)
Resolution: Fixed Votes: 1
Labels: None

Attachments: Text File dump1.txt     Text File dump2.txt    
Issue Links:
Duplicate
duplicates MDEV-25111 Long semaphore wait (> 800 secs), ser... Closed
PartOf
is part of MDEV-24872 galera.galera_insert_multi MTR failed... Closed
Relates
relates to MDEV-24704 Galera test failure on galera.galera_... Closed

 Description   

Hi,

we've now had a number of deadlocks (followed by SIGABRT) that resulted in these logs:

2021-03-18  1:06:37 0 [Warning] InnoDB: A long semaphore wait:
--Thread 140349926676224 has waited at lock0lock.cc line 3882 for 241.00 seconds the semaphore:
Mutex at 0x5587b08404c0, Mutex LOCK_SYS created lock0lock.cc:461, lock var 2
...
 
2021-03-18  1:18:29 0 [ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.
210318  1:18:29 [ERROR] mysqld got signal 6 ;

Relevant versions:

  • mariadb 1:10.3.28+maria~bionic
  • galera-3 25.3.32-bionic

I've compared two core dumps:

  • dump1: threads: 432
  • dump2: threads: 418
  • dump1: 0 locks at LOCK_show_status
  • dump2: 3 locks at LOCK_show_status
  • dump1: 1 lock in DeadlockChecker::search waiting for thread 68
  • dump2: 1 lock in trx_commit waiting for thread 97
  • dump1: thread 68 has lock, but is waiting for condition in bg_wsrep_kill_trx->TTASEventMutex->sync_array_wait_event
  • dump2: thread 97 has lock, but is waiting for condition in bg_wsrep_kill_trx->TTASEventMutex->sync_array_wait_event

See the attached dump1.txt and dump2.txt for closer inspection.

The thread that appears to be unjustly holding the lock (68 and 97 respectively) has this backtrace:

  (gdb) bt
  #0  0x00007fc58cdd3ad3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x55881a038ec4) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  #1  __pthread_cond_wait_common (abstime=0x0, mutex=0x55881a038e70, cond=0x55881a038e98) at pthread_cond_wait.c:502
  #2  __pthread_cond_wait (cond=cond@entry=0x55881a038e98, mutex=mutex@entry=0x55881a038e70) at pthread_cond_wait.c:655
  #3  0x00005587afb3f230 in os_event::wait (this=0x55881a038e60) at ./storage/innobase/os/os0event.cc:158
  #4  os_event::wait_low (reset_sig_count=8, this=0x55881a038e60) at ./storage/innobase/os/os0event.cc:325
  #5  os_event_wait_low (event=0x55881a038e60, reset_sig_count=<optimized out>) at ./storage/innobase/os/os0event.cc:502
  #6  0x00005587afbdb82c in sync_array_wait_event (arr=0x5587b1ad5430, cell=@0x7fa5ea7fbcd8: 0x5587b1ad56b0) at ./storage/innobase/sync/sync0arr.cc:471
  #7  0x00005587afadccb7 in TTASEventMutex<GenericPolicy>::enter (line=18772, 
      filename=0x5587b0044130 "/home/buildbot/buildbot/build/mariadb-10.3.28/storage/innobase/handler/ha_innodb.cc", max_delay=4, max_spins=<optimized out>, 
      this=0x5587b08404c0 <lock_sys+64>) at ./storage/innobase/include/ib0mutex.h:471
  #8  PolicyMutex<TTASEventMutex<GenericPolicy> >::enter (this=0x5587b08404c0 <lock_sys+64>, n_spins=30, n_delay=4, 
      name=name@entry=0x5587b0044130 "/home/buildbot/buildbot/build/mariadb-10.3.28/storage/innobase/handler/ha_innodb.cc", line=line@entry=18772)
      at ./storage/innobase/include/ib0mutex.h:592
  #9  0x00005587afad0798 in bg_wsrep_kill_trx (void_arg=0x7fa530046ea0) at ./storage/innobase/handler/ha_innodb.cc:18772
  #10 0x00005587af7565d3 in handle_manager (arg=arg@entry=0x0) at ./sql/sql_manager.cc:112
  #11 0x00005587afe5612a in pfs_spawn_thread (arg=0x55881a187138) at ./storage/perfschema/pfs.cc:1869
  #12 0x00007fc58cdcd6db in start_thread (arg=0x7fa5ea7fc700) at pthread_create.c:463
  #13 0x00007fc58c3cf71f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Is this a known issue? Is there any additional info I can provide?
(I have the complete core dumps, but I cannot share them in their entirety obviously.)

Cheers,
Walter Doekes
OSSO B.V.



 Comments   
Comment by Walter Doekes [ 2021-03-19 ]

Also of interest:

  • on production, we're running 10.3.25, and that runs smoothly (we've skipped 26 and 27 because of other breaking bugs).

I poked around in the dumps a bit. Thread 68 in dump1 is here:

static void bg_wsrep_kill_trx(
        void *void_arg)
{
        bg_wsrep_kill_trx_arg *arg = (bg_wsrep_kill_trx_arg*)void_arg;
        THD *thd                   = find_thread_by_id(arg->thd_id, false);    
        trx_t *victim_trx          = NULL;
        bool awake                 = false;
        DBUG_ENTER("bg_wsrep_kill_trx");
 
        if (thd) {
                wsrep_thd_LOCK(thd);  // <-- locked; thread 0x7fa368019568
                victim_trx = thd_to_trx(thd);
                lock_mutex_enter();  // <-- mutex_enter(&lock_sys.mutex) WAITING
                trx_mutex_enter(victim_trx);
                wsrep_thd_UNLOCK(thd);
                if (victim_trx->id != arg->trx_id)
                {
                        trx_mutex_exit(victim_trx);
                        lock_mutex_exit();
                        victim_trx = NULL;
                        wsrep_thd_kill_UNLOCK(thd);
                }
        }

(gdb) print &lock_sys
$22 = (lock_sys_t *) 0x5587b0840480 <lock_sys>
(gdb) print lock_sys.mutex.m_impl.m_event
$28 = (os_event_t) 0x55881a038e60
(gdb) print &((os_event_t) 0x55881a038e60).mutex
$32 = (EventMutex *) 0x55881a038e70
(gdb) print *(EventMutex*)0x55881a038e70
$33 = {m_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 55, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 12 times>, "\067", '\000' <repeats 26 times>, __align = 0}}

That thread is what the deadlock checker (dump1: thread 170) was waiting for:

Thread 170 (Thread 0x7fa5e8ca5700 (LWP 19772)):
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007fc58cdd019c in __GI___pthread_mutex_lock (mutex=mutex@entry=0x7fa36801afe8) at ../nptl/pthread_mutex_lock.c:135
#2  0x00005587af8c3600 in inline_mysql_mutex_lock (src_file=0x5587aff40518 "/home/buildbot/buildbot/build/mariadb-10.3.28/sql/wsrep_thd.cc", src_line=761, that=0x7fa36801afe8) at ./include/mysql/psi/mysql_thread.h:719
#3  wsrep_thd_is_BF (thd=thd@entry=0x7fa368019568, sync=sync@entry=1 '\001') at ./sql/wsrep_thd.cc:761
...

If I check the changes between 10.3.25 and 10.3.28, I notice that commit 29bbcac adds the bg_wsrep_kill_trx code, where there is only one lock, not two:

commit 29bbcac0ee841faaa68eeb09c86ff825eabbe6b6
Author: Sergei Golubchik <serg@mariadb.org>
Date:   Wed Jan 20 18:22:38 2021 +0100
 
    MDEV-23328 Server hang due to Galera lock conflict resolution
...
    mutex order violation here.
...
    When a normal KILL statement is executed, [...]
      victim_thread->LOCK_thd_data -> lock_sys->mutex -> victim_trx->mutex
...
+       if (thd) {
+               victim_trx = thd_to_trx(thd);
+               lock_mutex_enter();
+               trx_mutex_enter(victim_trx);

But then this commit, with the tersely named message "merge", adds the second lock, the very lock that we now appear to block on:

commit 37e24970cbbcca4102094d177eee570b3338262a
Author: Sergei Golubchik <serg@mariadb.org>
Date:   Tue Feb 2 10:02:48 2021 +0100
 
    merge
 
...
--- a/storage/innobase/handler/ha_innodb.cc
+++ b/storage/innobase/handler/ha_innodb.cc
@@ -18767,15 +18767,17 @@ static void bg_wsrep_kill_trx(
        DBUG_ENTER("bg_wsrep_kill_trx");
 
        if (thd) {
+               wsrep_thd_LOCK(thd);
                victim_trx = thd_to_trx(thd);
                lock_mutex_enter();
                trx_mutex_enter(victim_trx);
+               wsrep_thd_UNLOCK(thd);

.. although it does agree with the "correct lock order" comment in the first commit:

victim_thread->LOCK_thd_data -> lock_sys->mutex -> victim_trx->mutex

So that would still be the correct order.

I'm probably completely off track by now. Why is lock_mutex_enter() ending up in a condition wait? And who is supposed to wake it up?

Comment by Jan Lindström (Inactive) [ 2021-03-23 ]

Thank you for your analysis. There is clearly a bug here. In my opinion it should be:

	if (thd) {
		wsrep_thd_LOCK(thd);
		victim_trx = thd_to_trx(thd);
		/* Victim trx might not exists e.g. on MDL-conflict. */
		if (victim_trx) {
			lock_mutex_enter();
			trx_mutex_enter(victim_trx);
			wsrep_thd_UNLOCK(thd);
			if (victim_trx->id != arg->trx_id ||
			    victim_trx->state == TRX_STATE_COMMITTED_IN_MEMORY)
			{
				/* Victim was meanwhile rolled back or
				committed */
				trx_mutex_exit(victim_trx);
				lock_mutex_exit();
				victim_trx = NULL;
				wsrep_thd_kill_UNLOCK(thd);
			}
		} else {
			/* find_thread_by_id locked
			THD::LOCK_thd_kill */
			wsrep_thd_UNLOCK(thd);
			wsrep_thd_kill_UNLOCK(thd);
		}
	}

Now we have victim_thread->LOCK_thd_kill -> victim_thread->LOCK_thd_data -> lock_sys->mutex -> victim_trx->mutex. But we run into problems if victim_trx is not even there, or if it has already been committed/aborted.

Comment by Jan Lindström (Inactive) [ 2021-04-07 ]

There was a clear problem in sql_class.cc in the function thd_need_ordering_with, as it had

 wsrep_thd_is_BF(const_cast<THD *>(other_thd), true)

Here the last parameter means that we would lock the other_thd->LOCK_thd_data mutex. This could lead to a mutex deadlock, as we are already holding lock_sys->mutex. The correct ordering of these mutexes is THD::LOCK_thd_data before lock_sys->mutex, not the other way around. This has now been fixed.

Comment by Walter Doekes [ 2021-04-07 ]

Excellent!

We're anxiously awaiting the commit and the new release to show up.

Tack!
Walter

Comment by Walter Doekes [ 2021-05-14 ]

Dear jplindst,

I didn't notice any commits in this area at the time, and now I still don't see any changes to sql_class.cc, nor do I see anything related in any changelog.

Is this really fixed?

Comment by Walter Doekes [ 2021-05-14 ]

Oh, so this was already fixed by 45e33e05e2529e456fc4ce28f9f32fbe1a546526 for MDEV-24872.

That was not obvious to me.

Never mind. All clear

Comment by Jan Lindström (Inactive) [ 2021-05-15 ]

wdoekes: based on extensive QA, not all possible problematic cases are fixed in 10.2 and 10.3; see https://jira.mariadb.org/browse/MDEV-25609

Generated at Thu Feb 08 09:35:50 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.