Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-25190

Semaphore wait has lasted > 600 seconds; stuck on bg_wsrep_kill_trx

Details

    • Bug
    • Status: Closed (View Workflow)
    • Major
    • Resolution: Fixed
    • 10.3.28
    • 10.3.29
    • Galera
    • None

    Description

      Hi,

      we've had a bunch of deadlocks (+sigabrt) now that resulted in these logs:

      2021-03-18  1:06:37 0 [Warning] InnoDB: A long semaphore wait:
      --Thread 140349926676224 has waited at lock0lock.cc line 3882 for 241.00 seconds the semaphore:
      Mutex at 0x5587b08404c0, Mutex LOCK_SYS created lock0lock.cc:461, lock var 2
      ...
       
      2021-03-18  1:18:29 0 [ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.
      210318  1:18:29 [ERROR] mysqld got signal 6 ;
      

      Relevant versions:

      • mariadb 1:10.3.28+maria~bionic
      • galera-3 25.3.32-bionic

      I've compared two core dumps:

      • dump1: threads: 432
      • dump2 : threads: 418
      • dump1: 0 locks at LOCK_show_status
      • dump2: 3 locks at LOCK_show_status
      • dump1: 1 lock in DeadlockChecker::search waiting for thread 68
      • dump2: 1 lock in trx_commit waiting for thread 97
      • dump1: thread 68 has lock, but is waiting for condition in bg_wsrep_kill_trx->TTASEventMutex->sync_array_wait_event
      • dump2: thread 97 has lock, but is waiting for condition in bg_wsrep_kill_trx->TTASEventMutex->sync_array_wait_event

      See the attached dump1.txt and dump2.txt for closer inspection.

      The thread that appears to unjustly be holding the lock (68 and 97 respectively) has this BT:

        (gdb) bt
        #0  0x00007fc58cdd3ad3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x55881a038ec4) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
        #1  __pthread_cond_wait_common (abstime=0x0, mutex=0x55881a038e70, cond=0x55881a038e98) at pthread_cond_wait.c:502
        #2  __pthread_cond_wait (cond=cond@entry=0x55881a038e98, mutex=mutex@entry=0x55881a038e70) at pthread_cond_wait.c:655
        #3  0x00005587afb3f230 in os_event::wait (this=0x55881a038e60) at ./storage/innobase/os/os0event.cc:158
        #4  os_event::wait_low (reset_sig_count=8, this=0x55881a038e60) at ./storage/innobase/os/os0event.cc:325
        #5  os_event_wait_low (event=0x55881a038e60, reset_sig_count=<optimized out>) at ./storage/innobase/os/os0event.cc:502
        #6  0x00005587afbdb82c in sync_array_wait_event (arr=0x5587b1ad5430, cell=@0x7fa5ea7fbcd8: 0x5587b1ad56b0) at ./storage/innobase/sync/sync0arr.cc:471
        #7  0x00005587afadccb7 in TTASEventMutex<GenericPolicy>::enter (line=18772, 
            filename=0x5587b0044130 "/home/buildbot/buildbot/build/mariadb-10.3.28/storage/innobase/handler/ha_innodb.cc", max_delay=4, max_spins=<optimized out>, 
            this=0x5587b08404c0 <lock_sys+64>) at ./storage/innobase/include/ib0mutex.h:471
        #8  PolicyMutex<TTASEventMutex<GenericPolicy> >::enter (this=0x5587b08404c0 <lock_sys+64>, n_spins=30, n_delay=4, 
            name=name@entry=0x5587b0044130 "/home/buildbot/buildbot/build/mariadb-10.3.28/storage/innobase/handler/ha_innodb.cc", line=line@entry=18772)
            at ./storage/innobase/include/ib0mutex.h:592
        #9  0x00005587afad0798 in bg_wsrep_kill_trx (void_arg=0x7fa530046ea0) at ./storage/innobase/handler/ha_innodb.cc:18772
        #10 0x00005587af7565d3 in handle_manager (arg=arg@entry=0x0) at ./sql/sql_manager.cc:112
        #11 0x00005587afe5612a in pfs_spawn_thread (arg=0x55881a187138) at ./storage/perfschema/pfs.cc:1869
        #12 0x00007fc58cdcd6db in start_thread (arg=0x7fa5ea7fc700) at pthread_create.c:463
        #13 0x00007fc58c3cf71f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
      

      Is this a known issue? Is there any additional info I can provide?
      (I have the complete core dumps, but I cannot share them in their entirety obviously.)

      Cheers,
      Walter Doekes
      OSSO B.V.

      Attachments

        1. dump1.txt
          19 kB
        2. dump2.txt
          19 kB

        Issue Links

          Activity

            wdoekes Walter Doekes added a comment -

            Also of interest:

            • on production, we're running 10.3.25, and that runs smoothly (we've skipped 26 and 27 because of other breaking bugs);

            I poked around some in the dump. Thread 68 in dump 1 is here:

            static void bg_wsrep_kill_trx(
                    void *void_arg)
            {
                    bg_wsrep_kill_trx_arg *arg = (bg_wsrep_kill_trx_arg*)void_arg;
                    THD *thd                   = find_thread_by_id(arg->thd_id, false);    
                    trx_t *victim_trx          = NULL;
                    bool awake                 = false;
                    DBUG_ENTER("bg_wsrep_kill_trx");
             
                    if (thd) {
                            wsrep_thd_LOCK(thd);  // <-- locked; thread 0x7fa368019568
                            victim_trx = thd_to_trx(thd);
                            lock_mutex_enter();  // <-- mutex_enter(&lock_sys.mutex) WAITING
                            trx_mutex_enter(victim_trx);
                            wsrep_thd_UNLOCK(thd);
                            if (victim_trx->id != arg->trx_id)
                            {
                                    trx_mutex_exit(victim_trx);
                                    lock_mutex_exit();
                                    victim_trx = NULL;
                                    wsrep_thd_kill_UNLOCK(thd);
                            }
                    }
            

            (gdb) print &lock_sys
            $22 = (lock_sys_t *) 0x5587b0840480 <lock_sys>
            (gdb) print lock_sys.mutex.m_impl.m_event
            $28 = (os_event_t) 0x55881a038e60
            (gdb) print &((os_event_t) 0x55881a038e60).mutex
            $32 = (EventMutex *) 0x55881a038e70
            (gdb) print *(EventMutex*)0x55881a038e70
            $33 = {m_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 55, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 12 times>, "\067", '\000' <repeats 26 times>, __align = 0}}
            

            That thread is what the deadlock checker (dump1:thread170) was waiting for:

            Thread 170 (Thread 0x7fa5e8ca5700 (LWP 19772)):
            #0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
            #1  0x00007fc58cdd019c in __GI___pthread_mutex_lock (mutex=mutex@entry=0x7fa36801afe8) at ../nptl/pthread_mutex_lock.c:135
            #2  0x00005587af8c3600 in inline_mysql_mutex_lock (src_file=0x5587aff40518 "/home/buildbot/buildbot/build/mariadb-10.3.28/sql/wsrep_thd.cc", src_line=761, that=0x7fa36801afe8) at ./include/mysql/psi/mysql_thread.h:719
            #3  wsrep_thd_is_BF (thd=thd@entry=0x7fa368019568, sync=sync@entry=1 '\001') at ./sql/wsrep_thd.cc:761
            ...
            

            If I check the changes between 10.3.25 and 10.3.28, I notice that commit 29bbcac adds the bg_wsrep_kill_trx code, where there is only one, not two locks:

            commit 29bbcac0ee841faaa68eeb09c86ff825eabbe6b6
            Author: Sergei Golubchik <serg@mariadb.org>
            Date:   Wed Jan 20 18:22:38 2021 +0100
             
                MDEV-23328 Server hang due to Galera lock conflict resolution
            ...
                mutex order violation here.
            ...
                When a normal KILL statement is executed, [...]
                  victim_thread->LOCK_thd_data -> lock_sys->mutex -> victim_trx->mutex
            ...
            +       if (thd) {
            +               victim_trx = thd_to_trx(thd);
            +               lock_mutex_enter();
            +               trx_mutex_enter(victim_trx);
            

            But then this commit with the obscurely named message "merge" adds the second lock – adding a lock that we now appear to block on:

            commit 37e24970cbbcca4102094d177eee570b3338262a
            Author: Sergei Golubchik <serg@mariadb.org>
            Date:   Tue Feb 2 10:02:48 2021 +0100
             
                merge
             
            ...
            --- a/storage/innobase/handler/ha_innodb.cc
            +++ b/storage/innobase/handler/ha_innodb.cc
            @@ -18767,15 +18767,17 @@ static void bg_wsrep_kill_trx(
                    DBUG_ENTER("bg_wsrep_kill_trx");
             
                    if (thd) {
            +               wsrep_thd_LOCK(thd);
                            victim_trx = thd_to_trx(thd);
                            lock_mutex_enter();
                            trx_mutex_enter(victim_trx);
            +               wsrep_thd_UNLOCK(thd);
            

            .. although it does rhyme with "correct lock order" comment in the first commit:

            victim_thread->LOCK_thd_data -> lock_sys->mutex -> victim_trx->mutex
            

            So.. that would still be the correct order.

            I'm probably completely off track about now. Why is lock_mutex_enter() causing a conditional wait? And who is supposed to kick it to life?

            wdoekes Walter Doekes added a comment - Also of interest: on production, we're running 10.3.25, and that runs smoothly (we've skipped 26 and 27 because of other breaking bugs); I poked around some in the dump. Thread 68 in dump 1 is here: static void bg_wsrep_kill_trx( void *void_arg) { bg_wsrep_kill_trx_arg *arg = (bg_wsrep_kill_trx_arg*)void_arg; THD *thd = find_thread_by_id(arg->thd_id, false); trx_t *victim_trx = NULL; bool awake = false; DBUG_ENTER("bg_wsrep_kill_trx");   if (thd) { wsrep_thd_LOCK(thd); // <-- locked; thread 0x7fa368019568 victim_trx = thd_to_trx(thd); lock_mutex_enter(); // <-- mutex_enter(&lock_sys.mutex) WAITING trx_mutex_enter(victim_trx); wsrep_thd_UNLOCK(thd); if (victim_trx->id != arg->trx_id) { trx_mutex_exit(victim_trx); lock_mutex_exit(); victim_trx = NULL; wsrep_thd_kill_UNLOCK(thd); } } (gdb) print &lock_sys $22 = (lock_sys_t *) 0x5587b0840480 <lock_sys> (gdb) print lock_sys.mutex.m_impl.m_event $28 = (os_event_t) 0x55881a038e60 (gdb) print &((os_event_t) 0x55881a038e60).mutex $32 = (EventMutex *) 0x55881a038e70 (gdb) print *(EventMutex*)0x55881a038e70 $33 = {m_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 55, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 12 times>, "\067", '\000' <repeats 26 times>, __align = 0}} That thread is what the deadlock checker (dump1:thread170) was waiting for: Thread 170 (Thread 0x7fa5e8ca5700 (LWP 19772)): #0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135 #1 0x00007fc58cdd019c in __GI___pthread_mutex_lock (mutex=mutex@entry=0x7fa36801afe8) at ../nptl/pthread_mutex_lock.c:135 #2 0x00005587af8c3600 in inline_mysql_mutex_lock (src_file=0x5587aff40518 "/home/buildbot/buildbot/build/mariadb-10.3.28/sql/wsrep_thd.cc", src_line=761, that=0x7fa36801afe8) at ./include/mysql/psi/mysql_thread.h:719 #3 wsrep_thd_is_BF (thd=thd@entry=0x7fa368019568, sync=sync@entry=1 '\001') at ./sql/wsrep_thd.cc:761 ... If I check the changes between 10.3.25 and 10.3.28, I notice that commit 29bbcac adds the bg_wsrep_kill_trx code, where there is only one, not two locks: commit 29bbcac0ee841faaa68eeb09c86ff825eabbe6b6 Author: Sergei Golubchik <serg@mariadb.org> Date: Wed Jan 20 18:22:38 2021 +0100   MDEV-23328 Server hang due to Galera lock conflict resolution ... mutex order violation here. ... When a normal KILL statement is executed, [...] victim_thread->LOCK_thd_data -> lock_sys->mutex -> victim_trx->mutex ... + if (thd) { + victim_trx = thd_to_trx(thd); + lock_mutex_enter(); + trx_mutex_enter(victim_trx); But then this commit with the obscurely named message "merge" adds the second lock – adding a lock that we now appear to block on: commit 37e24970cbbcca4102094d177eee570b3338262a Author: Sergei Golubchik <serg@mariadb.org> Date: Tue Feb 2 10:02:48 2021 +0100   merge   ... --- a/storage/innobase/handler/ha_innodb.cc +++ b/storage/innobase/handler/ha_innodb.cc @@ -18767,15 +18767,17 @@ static void bg_wsrep_kill_trx( DBUG_ENTER("bg_wsrep_kill_trx"); if (thd) { + wsrep_thd_LOCK(thd); victim_trx = thd_to_trx(thd); lock_mutex_enter(); trx_mutex_enter(victim_trx); + wsrep_thd_UNLOCK(thd); .. although it does rhyme with "correct lock order" comment in the first commit: victim_thread->LOCK_thd_data -> lock_sys->mutex -> victim_trx->mutex So.. that would still be the correct order. I'm probably completely off track about now. Why is lock_mutex_enter() causing a conditional wait? And who is supposed to kick it to life?

            Thank you for you anlysis. There is clearly a bug here. In my opinion it should be :

            	if (thd) {
            		wsrep_thd_LOCK(thd);
            		victim_trx = thd_to_trx(thd);
            		/* Victim trx might not exists e.g. on MDL-conflict. */
            		if (victim_trx) {
            			lock_mutex_enter();
            			trx_mutex_enter(victim_trx);
            			wsrep_thd_UNLOCK(thd);
            			if (victim_trx->id != arg->trx_id ||
            			    victim_trx->state == TRX_STATE_COMMITTED_IN_MEMORY)
            			{
            				/* Victim was meanwhile rolled back or
            				committed */
            				trx_mutex_exit(victim_trx);
            				lock_mutex_exit();
            				victim_trx = NULL;
            				wsrep_thd_kill_UNLOCK(thd);
            			}
            		} else {
            			/* find_thread_by_id locked
            			THD::LOCK_thd_kill */
            			wsrep_thd_UNLOCK(thd);
            			wsrep_thd_kill_UNLOCK(thd);
            		}
            	}
            

            Now we have victim_thread->LOCK_thd_kill > victim_thread>LOCK_thd_data->lock_sys->mutex > victim_trx>mutex. But we run in problems if victim_trx is not even there or it is already committed/aborted.

            jplindst Jan Lindström (Inactive) added a comment - Thank you for you anlysis. There is clearly a bug here. In my opinion it should be : if (thd) { wsrep_thd_LOCK(thd); victim_trx = thd_to_trx(thd); /* Victim trx might not exists e.g. on MDL-conflict. */ if (victim_trx) { lock_mutex_enter(); trx_mutex_enter(victim_trx); wsrep_thd_UNLOCK(thd); if (victim_trx->id != arg->trx_id || victim_trx->state == TRX_STATE_COMMITTED_IN_MEMORY) { /* Victim was meanwhile rolled back or committed */ trx_mutex_exit(victim_trx); lock_mutex_exit(); victim_trx = NULL; wsrep_thd_kill_UNLOCK(thd); } } else { /* find_thread_by_id locked THD::LOCK_thd_kill */ wsrep_thd_UNLOCK(thd); wsrep_thd_kill_UNLOCK(thd); } } Now we have victim_thread->LOCK_thd_kill > victim_thread >LOCK_thd_data->lock_sys->mutex > victim_trx >mutex. But we run in problems if victim_trx is not even there or it is already committed/aborted.

            There was a clear problem in sql_class.cc in function thd_need_ordering_with as it had

             wsrep_thd_is_BF(const_cast<THD *>(other_thd), true)

            Here last parameter means that we would lock other_thd->LOCK_thd_data mutex. This could lead mutex deadlock as we are already holding lock_sys->mutex. Correct ordering of these mutexes is THD::LOCK_thd_data before lock_sys->mutex not other way around. This has been now fixed.

            jplindst Jan Lindström (Inactive) added a comment - There was a clear problem in sql_class.cc in function thd_need_ordering_with as it had wsrep_thd_is_BF(const_cast<THD *>(other_thd), true) Here last parameter means that we would lock other_thd->LOCK_thd_data mutex. This could lead mutex deadlock as we are already holding lock_sys->mutex. Correct ordering of these mutexes is THD::LOCK_thd_data before lock_sys->mutex not other way around. This has been now fixed.
            wdoekes Walter Doekes added a comment -

            Excellent!

            We're anxiously awaiting the commit and the new release to show up

            Tack!
            Walter

            wdoekes Walter Doekes added a comment - Excellent! We're anxiously awaiting the commit and the new release to show up Tack! Walter
            wdoekes Walter Doekes added a comment -

            Dear jplindst,

            I didn't notice any commits in this area at the time, and now I still don't see any changes to sql_class.cc, nor do I see anything related in any changelog.

            Is this really fixed?

            wdoekes Walter Doekes added a comment - Dear jplindst , I didn't notice any commits in this area at the time, and now I still don't see any changes to sql_class.cc, nor do I see anything related in any changelog. Is this really fixed?
            wdoekes Walter Doekes added a comment -

            Oh, so this was already fixed by 45e33e05e2529e456fc4ce28f9f32fbe1a546526 for MDEV-24872.

            That was not obvious to me.

            Never mind. All clear

            wdoekes Walter Doekes added a comment - Oh, so this was already fixed by 45e33e05e2529e456fc4ce28f9f32fbe1a546526 for MDEV-24872 . That was not obvious to me. Never mind. All clear

            wdoekes Based on extensive QA all possible problematic cases are not fixed in 10.2 and 10.3 see https://jira.mariadb.org/browse/MDEV-25609

            jplindst Jan Lindström (Inactive) added a comment - - edited wdoekes Based on extensive QA all possible problematic cases are not fixed in 10.2 and 10.3 see https://jira.mariadb.org/browse/MDEV-25609

            People

              jplindst Jan Lindström (Inactive)
              wdoekes Walter Doekes
              Votes:
              1 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.