[MDEV-25190] Semaphore wait has lasted > 600 seconds; stuck on bg_wsrep_kill_trx

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 10.3.28
Fix Version/s: 10.3.29
Component/s: Galera
Labels: None

Description

Hi,

We've now had several deadlocks, each followed by a SIGABRT, that resulted in these logs:

2021-03-18  1:06:37 0 [Warning] InnoDB: A long semaphore wait:
--Thread 140349926676224 has waited at lock0lock.cc line 3882 for 241.00 seconds the semaphore:
Mutex at 0x5587b08404c0, Mutex LOCK_SYS created lock0lock.cc:461, lock var 2
...
2021-03-18  1:18:29 0 [ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.
210318  1:18:29 [ERROR] mysqld got signal 6 ;
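
To compare waits across several error logs, the "--Thread ... has waited at ..." lines can be pulled out mechanically. The following is a minimal Python sketch, assuming the line format is exactly the one in the excerpt above; the function name `parse_semaphore_waits` and the `SAMPLE` string are illustrative and not part of MariaDB.

```python
import re

# Matches lines such as:
# --Thread 140349926676224 has waited at lock0lock.cc line 3882 for 241.00 seconds the semaphore:
# The format is assumed from the log excerpt in this report.
WAIT_RE = re.compile(
    r"--Thread (?P<thread>\d+) has waited at (?P<file>\S+) line (?P<line>\d+) "
    r"for (?P<seconds>\d+(?:\.\d+)?) seconds"
)


def parse_semaphore_waits(log_text):
    """Return one dict per long-semaphore-wait line found in log_text."""
    waits = []
    for match in WAIT_RE.finditer(log_text):
        waits.append(
            {
                "thread": int(match.group("thread")),
                "file": match.group("file"),
                "line": int(match.group("line")),
                "seconds": float(match.group("seconds")),
            }
        )
    return waits


# Sample input copied from the log excerpt in this report.
SAMPLE = (
    "2021-03-18  1:06:37 0 [Warning] InnoDB: A long semaphore wait:\n"
    "--Thread 140349926676224 has waited at lock0lock.cc line 3882 "
    "for 241.00 seconds the semaphore:\n"
    "Mutex at 0x5587b08404c0, Mutex LOCK_SYS created lock0lock.cc:461, lock var 2\n"
)

if __name__ == "__main__":
    for wait in parse_semaphore_waits(SAMPLE):
        print(wait)
```

Sorting the result by `seconds` shows which source location has been blocked longest; here that is the LOCK_SYS mutex wait at lock0lock.cc line 3882.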

Relevant versions:

mariadb 1:10.3.28+maria~bionic
galera-3 25.3.32-bionic

I've compared two core dumps:

dump1: threads: 432
dump2: threads: 418
dump1: 0 locks at LOCK_show_status
dump2: 3 locks at LOCK_show_status
dump1: 1 lock in DeadlockChecker::search waiting for thread 68
dump2: 1 lock in trx_commit waiting for thread 97
dump1: thread 68 has lock, but is waiting for condition in bg_wsrep_kill_trx->TTASEventMutex->sync_array_wait_event
dump2: thread 97 has lock, but is waiting for condition in bg_wsrep_kill_trx->TTASEventMutex->sync_array_wait_event

See the attached dump1.txt and dump2.txt for closer inspection.

The thread that appears to be holding the lock unjustly (68 in dump1, 97 in dump2) has this backtrace:

  (gdb) bt
  #0  0x00007fc58cdd3ad3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x55881a038ec4) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  #1  __pthread_cond_wait_common (abstime=0x0, mutex=0x55881a038e70, cond=0x55881a038e98) at pthread_cond_wait.c:502
  #2  __pthread_cond_wait (cond=cond@entry=0x55881a038e98, mutex=mutex@entry=0x55881a038e70) at pthread_cond_wait.c:655
  #3  0x00005587afb3f230 in os_event::wait (this=0x55881a038e60) at ./storage/innobase/os/os0event.cc:158
  #4  os_event::wait_low (reset_sig_count=8, this=0x55881a038e60) at ./storage/innobase/os/os0event.cc:325
  #5  os_event_wait_low (event=0x55881a038e60, reset_sig_count=<optimized out>) at ./storage/innobase/os/os0event.cc:502
  #6  0x00005587afbdb82c in sync_array_wait_event (arr=0x5587b1ad5430, cell=@0x7fa5ea7fbcd8: 0x5587b1ad56b0) at ./storage/innobase/sync/sync0arr.cc:471
  #7  0x00005587afadccb7 in TTASEventMutex<GenericPolicy>::enter (line=18772,
      filename=0x5587b0044130 "/home/buildbot/buildbot/build/mariadb-10.3.28/storage/innobase/handler/ha_innodb.cc", max_delay=4, max_spins=<optimized out>,
      this=0x5587b08404c0 <lock_sys+64>) at ./storage/innobase/include/ib0mutex.h:471
  #8  PolicyMutex<TTASEventMutex<GenericPolicy> >::enter (this=0x5587b08404c0 <lock_sys+64>, n_spins=30, n_delay=4,
      name=name@entry=0x5587b0044130 "/home/buildbot/buildbot/build/mariadb-10.3.28/storage/innobase/handler/ha_innodb.cc", line=line@entry=18772)
      at ./storage/innobase/include/ib0mutex.h:592
  #9  0x00005587afad0798 in bg_wsrep_kill_trx (void_arg=0x7fa530046ea0) at ./storage/innobase/handler/ha_innodb.cc:18772
  #10 0x00005587af7565d3 in handle_manager (arg=arg@entry=0x0) at ./sql/sql_manager.cc:112
  #11 0x00005587afe5612a in pfs_spawn_thread (arg=0x55881a187138) at ./storage/perfschema/pfs.cc:1869
  #12 0x00007fc58cdcd6db in start_thread (arg=0x7fa5ea7fc700) at pthread_create.c:463
  #13 0x00007fc58c3cf71f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
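
For anyone triaging a similar core dump, threads sitting in a given function can be found by scanning the text output of gdb's `thread apply all bt`. The following is a rough Python sketch, assuming the usual gdb layout of a "Thread N (...)" header followed by "#k" frame lines; the helper name `threads_with_frame` and the sample text (including the LWP numbers and thread 12) are made up for illustration and are not taken from the attached dumps.

```python
import re

# "Thread 68 (Thread 0x... (LWP ...)):" header printed by gdb for each thread.
THREAD_RE = re.compile(r"^Thread (\d+) \(")
# "#9  0x... in func (args) at file:line" or "#1  func (args) at file:line".
FRAME_RE = re.compile(r"^#\d+\s+(?:0x[0-9a-f]+ in )?([^\s(]+)")


def threads_with_frame(bt_text, function_name):
    """Return gdb thread numbers whose backtrace contains function_name."""
    hits = []
    current = None
    for raw_line in bt_text.splitlines():
        line = raw_line.strip()
        header = THREAD_RE.match(line)
        if header:
            current = int(header.group(1))
            continue
        frame = FRAME_RE.match(line)
        if frame and current is not None and frame.group(1) == function_name:
            if current not in hits:
                hits.append(current)
    return hits


# Hypothetical, abbreviated sample in the style of `thread apply all bt`.
SAMPLE_BT = """\
Thread 68 (Thread 0x7fa5ea7fc700 (LWP 1001)):
#0  0x00007fc58cdd3ad3 in futex_wait_cancelable () at futex-internal.h:88
#6  0x00005587afbdb82c in sync_array_wait_event () at sync0arr.cc:471
#9  0x00005587afad0798 in bg_wsrep_kill_trx (void_arg=0x7fa530046ea0) at ha_innodb.cc:18772

Thread 12 (Thread 0x7fa5eb7fe700 (LWP 1002)):
#0  0x00007fc58c3c2bf9 in poll () at syscall-template.S:84
"""

if __name__ == "__main__":
    print(threads_with_frame(SAMPLE_BT, "bg_wsrep_kill_trx"))
```

The match is on the exact frame name, so templated frames such as `TTASEventMutex<GenericPolicy>::enter` must be passed in full; names that contain spaces are not handled by this sketch.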

Is this a known issue? Is there any additional info I can provide?
(I have the complete core dumps, but obviously I cannot share them in their entirety.)

Cheers,
Walter Doekes
OSSO B.V.

Attachments

dump1.txt (19 kB, 2021-03-18 16:54)
dump2.txt (19 kB, 2021-03-18 16:54)

Issue Links

duplicates: MDEV-25111 Long semaphore wait (> 800 secs), server stops responding (Closed)
is part of: MDEV-24872 galera.galera_insert_multi MTR failed: crash with SIGABRT (Closed)
relates to: MDEV-24704 Galera test failure on galera.galera_nopk_unicode (Closed)

Activity

Transitions:
Open -> In Progress: Jan Lindström (Inactive), 2021-04-06 16:22 (time in source status: 18d 23h 26m)
In Progress -> Closed: Jan Lindström (Inactive), 2021-04-07 05:32 (time in source status: 13h 9m)

People

Assignee: Jan Lindström (Inactive)

Reporter: Walter Doekes

Votes: 1

Watchers: 6

Dates

Created: 2021-03-18 16:56

Updated: 2021-05-15 10:23

Resolved: 2021-04-07 05:32
