  MariaDB Server / MDEV-25190

Semaphore wait has lasted > 600 seconds; stuck on bg_wsrep_kill_trx


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 10.3.28
    • Fix Version/s: 10.3.29
    • Component/s: Galera
    • Labels: None

      Description

      Hi,

      we've now had several deadlocks (followed by a SIGABRT) that resulted in these logs:

      2021-03-18  1:06:37 0 [Warning] InnoDB: A long semaphore wait:
      --Thread 140349926676224 has waited at lock0lock.cc line 3882 for 241.00 seconds the semaphore:
      Mutex at 0x5587b08404c0, Mutex LOCK_SYS created lock0lock.cc:461, lock var 2
      ...
       
      2021-03-18  1:18:29 0 [ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.
      210318  1:18:29 [ERROR] mysqld got signal 6 ;
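
      For context: my understanding is that InnoDB runs a background error monitor that scans the sync wait array and deliberately aborts the server once any single semaphore wait exceeds the fatal wait threshold (600 seconds by default; I believe the MariaDB setting is innodb_fatal_semaphore_wait_threshold). The snippet below is only a minimal sketch of that kind of watchdog under those assumptions (WaitCell and watchdog_loop are invented names), not the actual InnoDB monitor code:

        // Minimal, hypothetical sketch of a "fatal semaphore wait" watchdog.
        // Not the actual InnoDB error-monitor code; names are made up.
        #include <atomic>
        #include <chrono>
        #include <cstddef>
        #include <cstdio>
        #include <cstdlib>
        #include <thread>

        // Stand-in for the fatal wait threshold (600 seconds by default).
        constexpr std::chrono::seconds kFatalWaitThreshold{600};

        struct WaitCell {                                  // one registered semaphore wait
            std::atomic<bool> waiting{false};
            std::chrono::steady_clock::time_point since{};
        };

        // Scan all registered waits once per second; if any single wait has
        // lasted longer than the threshold, log and abort on purpose, on the
        // theory that the server is hung.
        void watchdog_loop(WaitCell* cells, std::size_t n) {
            for (;;) {
                const auto now = std::chrono::steady_clock::now();
                for (std::size_t i = 0; i < n; i++) {
                    if (cells[i].waiting.load() &&
                        now - cells[i].since > kFatalWaitThreshold) {
                        std::fprintf(stderr,
                            "[FATAL] Semaphore wait has lasted > %lld seconds. "
                            "Intentionally crashing: server appears to be hung.\n",
                            static_cast<long long>(kFatalWaitThreshold.count()));
                        std::abort();                      // the SIGABRT seen in the log
                    }
                }
                std::this_thread::sleep_for(std::chrono::seconds(1));
            }
        }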
      

      Relevant versions:

      • mariadb 1:10.3.28+maria~bionic
      • galera-3 25.3.32-bionic

      I've compared two core dumps:

      • dump1: threads: 432
      • dump2: threads: 418
      • dump1: 0 locks at LOCK_show_status
      • dump2: 3 locks at LOCK_show_status
      • dump1: 1 lock in DeadlockChecker::search waiting for thread 68
      • dump2: 1 lock in trx_commit waiting for thread 97
      • dump1: thread 68 holds the lock according to the sync array, but is itself waiting for a condition in bg_wsrep_kill_trx -> TTASEventMutex -> sync_array_wait_event
      • dump2: thread 97 holds the lock according to the sync array, but is itself waiting for a condition in bg_wsrep_kill_trx -> TTASEventMutex -> sync_array_wait_event

      See the attached dump1.txt and dump2.txt for closer inspection.
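
      For anyone reading the dumps: as far as I understand it, TTASEventMutex spins a bounded number of times and, if it still cannot get the mutex, registers a cell in the sync array and blocks on an event until the holder signals it on release; that is why every thread queued behind LOCK_SYS ends up parked in sync_array_wait_event. Below is only a very rough, simplified model of that spin-then-park behaviour (EventMutexModel is an invented name; this is not the real ib0mutex.h code):

        // Very simplified model of a test-and-test-and-set "event" mutex:
        // spin briefly, then block on a condition variable (standing in for
        // the sync-array cell / os_event) until the holder releases it.
        // EventMutexModel is a made-up name, not the real TTASEventMutex.
        #include <atomic>
        #include <condition_variable>
        #include <mutex>

        class EventMutexModel {
            std::atomic<bool> locked_{false};
            std::mutex m_;                    // protects the wait "cell"
            std::condition_variable event_;   // stands in for sync_array_wait_event()

        public:
            void enter(int max_spins = 30) {
                for (int i = 0; i < max_spins; i++) {       // TTAS spin phase
                    bool expected = false;
                    if (!locked_.load() &&
                        locked_.compare_exchange_strong(expected, true))
                        return;                             // acquired while spinning
                }
                // Spin budget exhausted: park until the holder signals release.
                std::unique_lock<std::mutex> lk(m_);
                event_.wait(lk, [this] {
                    bool expected = false;
                    return locked_.compare_exchange_strong(expected, true);
                });
            }

            void exit() {
                {
                    std::lock_guard<std::mutex> lk(m_);     // avoid a lost wakeup
                    locked_.store(false);
                }
                event_.notify_one();                        // wake one parked waiter
            }
        };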

      The thread that appears to be unjustly holding the lock (thread 68 and 97 respectively) has this backtrace:

        (gdb) bt
        #0  0x00007fc58cdd3ad3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x55881a038ec4) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
        #1  __pthread_cond_wait_common (abstime=0x0, mutex=0x55881a038e70, cond=0x55881a038e98) at pthread_cond_wait.c:502
        #2  __pthread_cond_wait (cond=cond@entry=0x55881a038e98, mutex=mutex@entry=0x55881a038e70) at pthread_cond_wait.c:655
        #3  0x00005587afb3f230 in os_event::wait (this=0x55881a038e60) at ./storage/innobase/os/os0event.cc:158
        #4  os_event::wait_low (reset_sig_count=8, this=0x55881a038e60) at ./storage/innobase/os/os0event.cc:325
        #5  os_event_wait_low (event=0x55881a038e60, reset_sig_count=<optimized out>) at ./storage/innobase/os/os0event.cc:502
        #6  0x00005587afbdb82c in sync_array_wait_event (arr=0x5587b1ad5430, cell=@0x7fa5ea7fbcd8: 0x5587b1ad56b0) at ./storage/innobase/sync/sync0arr.cc:471
        #7  0x00005587afadccb7 in TTASEventMutex<GenericPolicy>::enter (line=18772, 
            filename=0x5587b0044130 "/home/buildbot/buildbot/build/mariadb-10.3.28/storage/innobase/handler/ha_innodb.cc", max_delay=4, max_spins=<optimized out>, 
            this=0x5587b08404c0 <lock_sys+64>) at ./storage/innobase/include/ib0mutex.h:471
        #8  PolicyMutex<TTASEventMutex<GenericPolicy> >::enter (this=0x5587b08404c0 <lock_sys+64>, n_spins=30, n_delay=4, 
            name=name@entry=0x5587b0044130 "/home/buildbot/buildbot/build/mariadb-10.3.28/storage/innobase/handler/ha_innodb.cc", line=line@entry=18772)
            at ./storage/innobase/include/ib0mutex.h:592
        #9  0x00005587afad0798 in bg_wsrep_kill_trx (void_arg=0x7fa530046ea0) at ./storage/innobase/handler/ha_innodb.cc:18772
        #10 0x00005587af7565d3 in handle_manager (arg=arg@entry=0x0) at ./sql/sql_manager.cc:112
        #11 0x00005587afe5612a in pfs_spawn_thread (arg=0x55881a187138) at ./storage/perfschema/pfs.cc:1869
        #12 0x00007fc58cdcd6db in start_thread (arg=0x7fa5ea7fc700) at pthread_create.c:463
        #13 0x00007fc58c3cf71f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
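
      I don't know the actual root cause, but the overall picture (the background kill thread blocked on LOCK_SYS while the sync array reports that same thread as the one being waited for) looks to me like either a lock-order inversion between the kill path and the commit/deadlock-check path, or a missed wakeup. Purely to illustrate the inversion pattern I mean, here is a hypothetical, self-contained example (background_kill_path, commit_path and the mutex names are invented, not MariaDB code):

        // Hypothetical illustration of a lock-order inversion; the names and
        // ordering are made up and this is not MariaDB code. With unlucky
        // timing each thread acquires its first mutex and then waits forever
        // for the other's, so the program may deliberately hang.
        #include <mutex>
        #include <thread>

        std::mutex lock_sys_mutex;    // stands in for the global LOCK_SYS mutex
        std::mutex victim_trx_mutex;  // stands in for a per-transaction mutex

        // "Kill" path: per-transaction mutex first, global lock system second.
        void background_kill_path() {
            std::lock_guard<std::mutex> trx_lock(victim_trx_mutex);
            std::lock_guard<std::mutex> sys_lock(lock_sys_mutex);   // may block here
            // ... mark the victim transaction for rollback ...
        }

        // Commit/deadlock-check path: global lock system first, per-transaction
        // mutex second -- the opposite order.
        void commit_path() {
            std::lock_guard<std::mutex> sys_lock(lock_sys_mutex);
            std::lock_guard<std::mutex> trx_lock(victim_trx_mutex); // may block here
            // ... release the record locks held by the transaction ...
        }

        int main() {
            std::thread a(background_kill_path);
            std::thread b(commit_path);
            a.join();
            b.join();
        }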
      

      Is this a known issue? Is there any additional info I can provide?
      (I have the complete core dumps, but obviously I cannot share them in their entirety.)

      Cheers,
      Walter Doekes
      OSSO B.V.

        Attachments

        1. dump1.txt
          19 kB
        2. dump2.txt
          19 kB

              People

              Assignee:
              jplindst Jan Lindström
              Reporter:
              wdoekes Walter Doekes
               Votes:
               1
               Watchers:
               6

                Dates

                Created:
                Updated:
                Resolved:
