We run rather large production servers, each hosting hundreds of databases that range in size from a few MB to many GB.
For over a year we have had issues where all MariaDB threads suddenly get stuck on a semaphore wait.
The only way to resolve this is to kill MariaDB or to wait for the intentional semaphore-wait crash.
In all crashes the threads are stuck on the same lock:
Previously, with the help of Sergei Golubchik (MDEV-30390), we presumed that this might be related to the jemalloc memory allocator.
After switching to tcmalloc the behaviour became less frequent, but it still occurs in the same form.
It is not reproducible, but it happens mostly on our busiest production servers, which run hundreds or even thousands of databases.
The chance also seems higher if a server has larger InnoDB databases (1 GB or bigger) and when there is more memory pressure on the system (e.g. 20 GB of RAM still free out of 128 GB in total).
We use ZFS, which requires a lot of 128K memory segments. This can cause memory pressure and might influence MariaDB's behaviour.
We do ensure, however, that the servers have enough CPU and RAM available, and we try to prevent performance degradation and swapping. When this behaviour occurs, the load is no higher than normal and is well below what the system and MariaDB should be able to handle.
Attached are redacted backtraces for all threads from a core file and the MariaDB log written during a crash.
SHOW ENGINE INNODB STATUS.txt was made after the crash.