MariaDB Server: MDEV-25190

Semaphore wait has lasted > 600 seconds; stuck on bg_wsrep_kill_trx


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 10.3.28
    • Fix Version/s: 10.3.29
    • Component/s: Galera
    • Labels: None

    Description

      Hi,

      We've had a number of deadlocks (each ending in a SIGABRT) that produced logs like these:

        2021-03-18  1:06:37 0 [Warning] InnoDB: A long semaphore wait:
        --Thread 140349926676224 has waited at lock0lock.cc line 3882 for 241.00 seconds the semaphore:
        Mutex at 0x5587b08404c0, Mutex LOCK_SYS created lock0lock.cc:461, lock var 2
        ...

        2021-03-18  1:18:29 0 [ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.
        210318  1:18:29 [ERROR] mysqld got signal 6 ;
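
      The two messages above follow a watchdog pattern: a monitor periodically checks how long threads have been waiting on a semaphore, warns once a wait gets long, and deliberately aborts the server once the wait passes a fatal limit. The following is a simplified sketch under assumptions, not InnoDB's code: WaitRegistry and check are hypothetical names, the 600-second limit is taken from the fatal message, and the 240-second warning limit is only inferred from the 241-second warning in the log.

```cpp
#include <map>

// Hypothetical, simplified hang watchdog; NOT InnoDB's implementation.
// Times are plain seconds supplied by the caller so the logic is testable.
enum class WatchdogVerdict { Ok, Warn, Fatal };

struct WaitRegistry {
    // blocked thread id -> time (in seconds) at which it started waiting
    std::map<unsigned long, double> wait_start;

    void begin_wait(unsigned long tid, double now) { wait_start[tid] = now; }
    void end_wait(unsigned long tid) { wait_start.erase(tid); }

    // Longest wait still in progress at `now`, or 0 if nobody is waiting.
    double longest_wait(double now) const {
        double longest = 0.0;
        for (const auto& entry : wait_start) {
            const double waited = now - entry.second;
            if (waited > longest) longest = waited;
        }
        return longest;
    }
};

// warn_after is an assumption inferred from the log; fatal_after matches
// the "> 600 seconds" in the fatal message.
WatchdogVerdict check(const WaitRegistry& reg, double now,
                      double warn_after = 240.0, double fatal_after = 600.0) {
    const double waited = reg.longest_wait(now);
    if (waited > fatal_after) return WatchdogVerdict::Fatal;  // caller would abort(): SIGABRT, signal 6 on Linux
    if (waited > warn_after) return WatchdogVerdict::Warn;    // cf. "A long semaphore wait"
    return WatchdogVerdict::Ok;
}
```

      For example, a thread that began waiting at t = 0 yields Warn when checked at t = 241 and Fatal at t = 601.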
      

      Relevant versions:

      • mariadb 1:10.3.28+maria~bionic
      • galera-3 25.3.32-bionic

      I've compared two core dumps:

      • dump1: 432 threads
      • dump2: 418 threads
      • dump1: 0 lock waits on LOCK_show_status
      • dump2: 3 lock waits on LOCK_show_status
      • dump1: 1 lock wait in DeadlockChecker::search, waiting for thread 68
      • dump2: 1 lock wait in trx_commit, waiting for thread 97
      • dump1: thread 68 holds the lock, but is itself waiting on a condition in bg_wsrep_kill_trx -> TTASEventMutex -> sync_array_wait_event
      • dump2: thread 97 holds the lock, but is itself waiting on a condition in bg_wsrep_kill_trx -> TTASEventMutex -> sync_array_wait_event
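
      These observations can be read as a wait-for graph: each blocked thread points at the thread it is waiting on. Following the chain from a blocked thread either loops back to itself (a classic deadlock) or ends at a thread that is not known to be waiting on anyone else, which is then the place to look next. This is a minimal sketch, assuming the waits have already been extracted from the dumps by hand; follow_chain and ChainResult are hypothetical names, and thread id 1 in the example is a placeholder, since the report does not give the id of the thread in DeadlockChecker::search.

```cpp
#include <map>
#include <set>
#include <vector>

// Result of following a wait-for chain starting at one thread.
struct ChainResult {
    std::vector<int> path;  // threads visited, in order
    bool cycle = false;     // true if the chain loops back on itself
};

// waits_for maps a blocked thread to the thread it is waiting on.
// Threads absent from the map are not known to be waiting on anyone.
ChainResult follow_chain(const std::map<int, int>& waits_for, int start) {
    ChainResult result;
    std::set<int> seen;
    int current = start;
    while (true) {
        if (!seen.insert(current).second) {  // revisiting a thread: cycle found
            result.cycle = true;
            result.path.push_back(current);
            break;
        }
        result.path.push_back(current);
        auto it = waits_for.find(current);
        if (it == waits_for.end()) break;  // chain ends at a thread with no recorded wait
        current = it->second;
    }
    return result;
}
```

      With only the edge 1 -> 68 recorded for dump1, the chain from thread 1 is {1, 68} with no cycle; the sketch alone cannot tell who holds the mutex that thread 68 is waiting for.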

      See the attached dump1.txt and dump2.txt for closer inspection.

      The thread that appears to be wrongly holding the lock (68 and 97, respectively) has this backtrace:

        (gdb) bt
        #0  0x00007fc58cdd3ad3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x55881a038ec4) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
        #1  __pthread_cond_wait_common (abstime=0x0, mutex=0x55881a038e70, cond=0x55881a038e98) at pthread_cond_wait.c:502
        #2  __pthread_cond_wait (cond=cond@entry=0x55881a038e98, mutex=mutex@entry=0x55881a038e70) at pthread_cond_wait.c:655
        #3  0x00005587afb3f230 in os_event::wait (this=0x55881a038e60) at ./storage/innobase/os/os0event.cc:158
        #4  os_event::wait_low (reset_sig_count=8, this=0x55881a038e60) at ./storage/innobase/os/os0event.cc:325
        #5  os_event_wait_low (event=0x55881a038e60, reset_sig_count=<optimized out>) at ./storage/innobase/os/os0event.cc:502
        #6  0x00005587afbdb82c in sync_array_wait_event (arr=0x5587b1ad5430, cell=@0x7fa5ea7fbcd8: 0x5587b1ad56b0) at ./storage/innobase/sync/sync0arr.cc:471
        #7  0x00005587afadccb7 in TTASEventMutex<GenericPolicy>::enter (line=18772, 
            filename=0x5587b0044130 "/home/buildbot/buildbot/build/mariadb-10.3.28/storage/innobase/handler/ha_innodb.cc", max_delay=4, max_spins=<optimized out>, 
            this=0x5587b08404c0 <lock_sys+64>) at ./storage/innobase/include/ib0mutex.h:471
        #8  PolicyMutex<TTASEventMutex<GenericPolicy> >::enter (this=0x5587b08404c0 <lock_sys+64>, n_spins=30, n_delay=4, 
            name=name@entry=0x5587b0044130 "/home/buildbot/buildbot/build/mariadb-10.3.28/storage/innobase/handler/ha_innodb.cc", line=line@entry=18772)
            at ./storage/innobase/include/ib0mutex.h:592
        #9  0x00005587afad0798 in bg_wsrep_kill_trx (void_arg=0x7fa530046ea0) at ./storage/innobase/handler/ha_innodb.cc:18772
        #10 0x00005587af7565d3 in handle_manager (arg=arg@entry=0x0) at ./sql/sql_manager.cc:112
        #11 0x00005587afe5612a in pfs_spawn_thread (arg=0x55881a187138) at ./storage/perfschema/pfs.cc:1869
        #12 0x00007fc58cdcd6db in start_thread (arg=0x7fa5ea7fc700) at pthread_create.c:463
        #13 0x00007fc58c3cf71f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
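
      Frames #0 to #8 show how the thread ended up asleep: TTASEventMutex first spins on the mutex (frame #8 shows n_spins=30, n_delay=4) and, when that fails, registers in the sync array and blocks on an OS event, which ultimately is a pthread_cond_wait. Once on that blocking path, the thread only wakes when the holder releases the mutex. The class below is a hypothetical spin-then-block mutex written with standard C++ primitives to illustrate that general test-and-test-and-set pattern; it is a sketch under that assumption, not InnoDB's TTASEventMutex, and its names and spin count are illustrative.

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical spin-then-block mutex; NOT InnoDB's TTASEventMutex.
class SpinThenBlockMutex {
public:
    // max_spins = 30 merely mirrors n_spins=30 in frame #8.
    void lock(int max_spins = 30) {
        for (int i = 0; i < max_spins; ++i) {
            // "Test" with a cheap load first, then "test-and-set" with exchange.
            if (!locked_.load(std::memory_order_relaxed) &&
                !locked_.exchange(true, std::memory_order_acquire))
                return;
            std::this_thread::yield();  // back off briefly between attempts
        }
        // Spinning failed: block on the event until the holder releases.
        std::unique_lock<std::mutex> guard(event_mutex_);
        event_.wait(guard, [this] {
            return !locked_.exchange(true, std::memory_order_acquire);
        });
    }

    void unlock() {
        locked_.store(false, std::memory_order_release);
        // Taking event_mutex_ before notifying avoids a lost wakeup for a
        // waiter that checked the flag but has not yet gone to sleep.
        std::lock_guard<std::mutex> guard(event_mutex_);
        event_.notify_all();
    }

private:
    std::atomic<bool> locked_{false};
    std::mutex event_mutex_;
    std::condition_variable event_;
};
```

      If the holder never calls unlock(), every thread that fell through to the blocking path stays in the condition-variable wait indefinitely, which matches the shape of frames #0 to #7.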
      

      Is this a known issue? Is there any additional info I can provide?
      (I have the complete core dumps, but obviously I cannot share them in their entirety.)

      Cheers,
      Walter Doekes
      OSSO B.V.

      Attachments

        1. dump1.txt (19 kB)
        2. dump2.txt (19 kB)


      People

        Assignee: Jan Lindström (jplindst) (Inactive)
        Reporter: Walter Doekes (wdoekes)
        Votes: 1
        Watchers: 6

