[MDEV-31293] Threads stuck on semaphore wait causing MariaDB to crash - Jira

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Incomplete
Affects Version/s: 10.5.17, 10.5.18, 10.5.19
Fix Version/s: N/A
Component/s: OTHER, Storage Engine - InnoDB
Labels:
- crash
- deadlock
- semaphore
Environment:
OS: CloudLinux 8 like RHEL - Kernel: 4.18.0-348.20.1.lve.1.el8.x86_64

Description

We run rather large production servers with hundreds of databases, ranging in size from a few MB to many GB.

For over a year we have had issues where all MariaDB threads suddenly get stuck on a semaphore wait.

The only way to resolve this is to kill MariaDB or to wait for the intentional crash that InnoDB triggers after a long semaphore wait.

In every crash the threads are stuck on the same lock:

2023-05-12  8:30:03 0 [Note] InnoDB: A semaphore wait:
--Thread 140122331952896 has waited at ha_innodb.cc line 14402 for 237.00 seconds the semaphore:
Mutex at 0x5563d14c8bc0, Mutex DICT_SYS created /builddir/build/BUILD/mariadb-10.5.19/storage/innobase/dict/dict0dict.cc:1038, lock var 2
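To check whether every reported wait really points at the same lock, the error log can be tallied with a small script. The following is a minimal sketch, not part of the original report: it assumes the log lines look exactly like the excerpt above, it only recognises `Mutex at ...` entries (other latch types are ignored), and the function names `parse_semaphore_waits` and `summarize` are hypothetical helpers introduced for illustration.

```python
import re
from collections import Counter

# Patterns modelled on the log excerpt in this report (an assumption:
# other MariaDB versions may format these lines differently).
WAIT_RE = re.compile(
    r"--Thread (?P<thread>\d+) has waited at (?P<file>\S+) line (?P<line>\d+) "
    r"for (?P<seconds>[\d.]+) seconds the semaphore:"
)
MUTEX_RE = re.compile(
    r"Mutex at (?P<addr>0x[0-9a-fA-F]+), Mutex (?P<name>\S+) created (?P<created>\S+),"
)

# Sample input copied from the report.
SAMPLE = """\
2023-05-12  8:30:03 0 [Note] InnoDB: A semaphore wait:
--Thread 140122331952896 has waited at ha_innodb.cc line 14402 for 237.00 seconds the semaphore:
Mutex at 0x5563d14c8bc0, Mutex DICT_SYS created /builddir/build/BUILD/mariadb-10.5.19/storage/innobase/dict/dict0dict.cc:1038, lock var 2
"""


def parse_semaphore_waits(lines):
    """Pair each '--Thread ... has waited' line with the next 'Mutex at' line."""
    waits = []
    pending = None  # the most recent wait line that has no mutex yet
    for raw in lines:
        line = raw.strip()
        wait = WAIT_RE.search(line)
        if wait:
            pending = {
                "thread": int(wait["thread"]),
                "site": f"{wait['file']}:{wait['line']}",
                "seconds": float(wait["seconds"]),
            }
            continue
        mutex = MUTEX_RE.search(line)
        if mutex and pending is not None:
            pending["mutex"] = mutex["name"]
            pending["created"] = mutex["created"]
            waits.append(pending)
            pending = None
    return waits


def summarize(waits):
    """Count waits per (mutex name, waiting call site)."""
    return Counter((w["mutex"], w["site"]) for w in waits)


waits = parse_semaphore_waits(SAMPLE.splitlines())
summary = summarize(waits)
```

Running the same two functions over the lines of the attached mariadb.log would show whether DICT_SYS at ha_innodb.cc line 14402 dominates, or whether other wait sites also appear.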

Previously, with the help of Sergei Golubchik (MDEV-30390), we presumed that this might be related to the jemalloc memory allocator.

After switching to tcmalloc the behaviour became less frequent, but it still happens.

It is not reproducible, but it happens mostly on our busiest production servers, which run hundreds or even thousands of databases.

Also, the chance seems higher if a server has bigger InnoDB databases (1 GB or larger) and when there is more memory pressure on the system (e.g. 20 GB of RAM still free out of 128 GB in total).

We use ZFS, which requires a lot of 128K memory segments. This can cause memory pressure and might influence MariaDB's behaviour.

We do ensure, however, that servers have enough CPU and RAM available, and we try to prevent performance degradation and swapping. When this behaviour occurs, the load is no higher than normal and well below what the system and MariaDB should be able to handle.

Attached are a redacted set of backtraces for all threads from a core file and the MariaDB log from a crash.
SHOW ENGINE INNODB STATUS.txt was captured after the crash.

Attachments

- mariadb.log (27 kB, 2023-05-16 13:55)
- mariadbd_full_bt_all_threads.txt (2.42 MB, 2023-05-16 13:56)
- SHOW ENGINE INNODB STATUS.txt (46 kB, 2023-05-16 13:58)

Issue Links

relates to

- MDEV-29843 Server hang in thd_decrement_pending_ops/pthread_cond_signal (Closed)
- MDEV-30390 MariaDB 10.5 gets stuck on "Too many connections" (Closed)

People

Assignee: Unassigned

Reporter: Joris de Leeuw

Votes: 0

Watchers: 7

Dates

Created: 2023-05-16 14:31

Updated: 2023-09-18 12:36

Resolved: 2023-09-18 12:36
