[MDEV-31293] Threads stuck on semaphore wait causing MariaDB to crash Created: 2023-05-16  Updated: 2023-09-18  Resolved: 2023-09-18

Status: Closed
Project: MariaDB Server
Component/s: OTHER, Storage Engine - InnoDB
Affects Version/s: 10.5.17, 10.5.18, 10.5.19
Fix Version/s: N/A

Type: Bug Priority: Critical
Reporter: Joris de Leeuw Assignee: Unassigned
Resolution: Incomplete Votes: 0
Labels: crash, deadlock, semaphore
Environment:

OS: CloudLinux 8 like RHEL - Kernel: 4.18.0-348.20.1.lve.1.el8.x86_64


Attachments: Text File SHOW ENGINE INNODB STATUS.txt     Text File mariadb.log     Text File mariadbd_full_bt_all_threads.txt    
Issue Links:
Relates
relates to MDEV-29843 Server hang in thd_decrement_pending_... Closed
relates to MDEV-30390 MariaDB 10.5 gets stuck on "Too many ... Closed

 Description   

We run rather large production servers, each hosting hundreds of databases ranging in size from a few MB to many GB.

For over a year we have had issues where all MariaDB threads suddenly get stuck on a semaphore wait.

The only way to resolve this is to kill MariaDB or to wait for InnoDB's intentional long-semaphore-wait crash.

In every crash the threads get stuck on the same lock:

2023-05-12  8:30:03 0 [Note] InnoDB: A semaphore wait:
 
--Thread 140122331952896 has waited at ha_innodb.cc line 14402 for 237.00 seconds the semaphore:
 
Mutex at 0x5563d14c8bc0, Mutex DICT_SYS created /builddir/build/BUILD/mariadb-10.5.19/storage/innobase/dict/dict0dict.cc:1038, lock var 2

Previously, together with Sergei Golubchik ( MDEV-30390 ), we presumed that this might be related to the jemalloc memory allocator.

After switching to tcmalloc this behaviour became less frequent. It still happens, however, just less often; the behaviour itself is unchanged.

It is not reliably reproducible, but it happens mostly on our busiest production servers, which run hundreds or even thousands of databases.

The chance also seems higher when a server has bigger InnoDB databases ( 1 GB or larger ), and when there is more memory pressure on the system (e.g. only 20 GB RAM still free out of 128 GB in total).

We use ZFS, which requires a lot of 128K memory segments. This can cause memory pressure and might influence MariaDB's behaviour.

We do, however, ensure that servers have enough CPU and RAM available, and we try to prevent performance degradation and swapping. So when this behaviour happens, the load is no higher than normal and well below what the system and MariaDB should be able to handle.

Attached are redacted backtraces for all threads from a core file, and the MariaDB log from a crash.
SHOW ENGINE INNODB STATUS.txt was captured after the crash.



 Comments   
Comment by Daniel Black [ 2023-05-16 ]

Thanks for the detailed backtrace. Most threads are spinning in base::internal::SpinLockDelay from tcmalloc, and at a quick look it appears the others are waiting on locks held by the threads stuck in base::internal::SpinLockDelay while attempting to free memory.

I'd suggest starting by looking at the tcmalloc version and checking a) whether the current version has known bugs, and b) whether any tuning is required to handle as many threads as MariaDB uses in your environment.

Comment by Daniel Black [ 2023-08-14 ]

Could it be like tcmalloc issue 111, where vm.max_map_count is too low? Check wc -l /proc/$(pidof mariadbd)/maps and see how it compares to the sysctl vm.max_map_count. Is THP (transparent hugepages) configured?

Comment by Marko Mäkelä [ 2023-08-14 ]

In mariadbd_full_bt_all_threads.txt I see that at least Threads 22, 117, 125, 209 and 251 are waiting for dict_sys.mutex. Most of them are waiting in ha_innobase::info_low().

The holder of dict_sys.mutex ought to be Thread 140, executing the following:

Thread 140 (Thread 0x7f70c4fbb700 (LWP 2262712)):
#0  0x00007f7165f863d0 in base::internal::SpinLockDelay(int volatile*, int, int) () from /usr/lib64/libtcmalloc.so
No symbol table info available.
#1  0x00007f7165f862d3 in SpinLock::SlowLock() () from /usr/lib64/libtcmalloc.so
No symbol table info available.
#2  0x00007f7165f755a5 in tcmalloc::CentralFreeList::InsertRange(void*, void*, int) () from /usr/lib64/libtcmalloc.so
No symbol table info available.
#3  0x00007f7165f795a5 in tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned int, int) () from /usr/lib64/libtcmalloc.so
No symbol table info available.
#4  0x00007f7165f798e0 in tcmalloc::ThreadCache::ListTooLong(tcmalloc::ThreadCache::FreeList*, unsigned int) () from /usr/lib64/libtcmalloc.so
No symbol table info available.
#5  0x00005563d0865a90 in ut_allocator<os_event, true>::deallocate (this=<synthetic pointer>, n_elements=0, ptr=0x5563f30dd798) at /usr/src/debug/MariaDB-/src_0/storage/innobase/include/ut0new.h:765
        pfx = 0x5563f30dd780
        pfx = <optimized out>
#6  ut_delete<os_event> (ptr=0x5563f30dd798) at /usr/src/debug/MariaDB-/src_0/storage/innobase/include/ut0new.h:983
        allocator = <optimized out>
        allocator = <optimized out>
#7  os_event_destroy (event=@0x5563ff1404a0: 0x5563f30dd798) at /usr/src/debug/MariaDB-/src_0/storage/innobase/os/os0event.cc:513
No locals.
#8  0x00005563d08f756c in rw_lock_free_func (lock=lock@entry=0x5563ff140470) at /usr/src/debug/MariaDB-/src_0/storage/innobase/sync/sync0rw.cc:262
No locals.
#9  0x00005563d099944f in pfs_rw_lock_free_func (lock=0x5563ff140470) at /usr/src/debug/MariaDB-/src_0/storage/innobase/include/sync0rw.inl:611
No locals.
#10 dict_index_remove_from_cache_low (table=0x5563de000380, index=0x5563ff140300, lru_evict=<optimized out>) at /usr/src/debug/MariaDB-/src_0/storage/innobase/dict/dict0dict.cc:2210
No locals.
#11 0x00005563d09999a3 in dict_sys_t::remove (this=0x5563d14c8bc0 <dict_sys>, table=0x5563de000380, lru=true, keep=<optimized out>) at /usr/src/debug/MariaDB-/src_0/storage/innobase/dict/dict0dict.cc:1914
        foreign = <optimized out>
        index = <optimized out>
        id_hash = <optimized out>
        id_fold = <optimized out>
        freed = <optimized out>
#12 0x00005563d099a096 in dict_make_room_in_cache (max_tables=4096, pct_check=pct_check@entry=100) at /usr/src/debug/MariaDB-/src_0/storage/innobase/dict/dict0dict.cc:1406
        prev_table = <optimized out>
        i = <optimized out>
        len = 4546
        table = 0x5563de000380
        check_up_to = 0
        n_evicted = 3
#13 0x00005563d08ec853 in srv_master_evict_from_table_cache (pct_check=pct_check@entry=100) at /usr/src/debug/MariaDB-/src_0/storage/innobase/srv/srv0srv.cc:1480
        n_tables_evicted = 0

I double-checked that everything below srv_master_evict_from_table_cache() should be holding dict_sys.mutex continuously.

Hypothetically speaking, we could suspect a bug in MariaDB Server if:

  1. we made use of some of https://www.gnu.org/software/libc/manual/html_node/Hooks-for-Malloc.html (and these were compatible with tcmalloc), and
  2. the malloc library implementation invoked one of these hooks while holding some synchronization primitive inside the malloc library, and
  3. the hook implementation waited for some MariaDB synchronization primitive

Based on the tcmalloc function names, this looks like a hang in tcmalloc itself. Any synchronization primitives inside a memory allocation library are conceptually at a lower level than anything that MariaDB Server would use. I am not aware of any intentional or designed ‘callbacks’ from a memory allocator to MariaDB Server. Thread 196 seems to trigger SIGFPE somewhere in tcmalloc:

#10 0x00005563d051b105 in handle_fatal_signal (sig=8) at /usr/src/debug/MariaDB-/src_0/sql/signal_handler.cc:234
        curr_time = 1683872756
        tm = {tm_sec = 56, tm_min = 25, tm_hour = 8, tm_mday = 12, tm_mon = 4, tm_year = 123, tm_wday = 5, tm_yday = 131, tm_isdst = 1, tm_gmtoff = 7200, tm_zone = 0x5563d270e170 "CEST"}
        thd = 0x5563eb638018
        print_invalid_query_pointer = false
#11 <signal handler called>
No locals.
#12 0x00007f7165f75163 in tcmalloc::CentralFreeList::ReleaseToSpans(void*) () from /usr/lib64/libtcmalloc.so
No symbol table info available.

Could it simply be that the invocation of my_print_stacktrace() in a signal handler is violating man 7 signal-safety? At the top of this stack, the signal handler is invoking tcmalloc again:

Thread 196 (Thread 0x7f70cfb4a700 (LWP 4098200)):
#0  0x00007f7165f863d0 in base::internal::SpinLockDelay(int volatile*, int, int) () from /usr/lib64/libtcmalloc.so
No symbol table info available.
#1  0x00007f7165f862d3 in SpinLock::SlowLock() () from /usr/lib64/libtcmalloc.so
No symbol table info available.
#2  0x00007f7165f784f5 in tcmalloc::CentralCacheLockAll() () from /usr/lib64/libtcmalloc.so
No symbol table info available.
#3  0x00007f7163b2cb98 in __run_prefork_handlers (do_locking=do_locking@entry=true) at register-atfork.c:144
        runp = <optimized out>
        id = 2
        i = <optimized out>
        lastrun = 3
        sl = <optimized out>
#4  0x00007f7163b8fa35 in __libc_fork () at ../sysdeps/nptl/fork.c:59
        pid = <optimized out>
        multiple_threads = true
        lastrun = <optimized out>
#5  0x00005563d0abc8e0 in start_addr2line_fork (binary_path=0x7ffdd4383eb4 "/usr/sbin/mariadbd") at /usr/src/debug/MariaDB-/src_0/mysys/my_addr_resolve.c:192
No locals.

A section from man 7 signal-safety on my system discourages invoking fork(2) in a signal handler:

POSIX.1‐2001 TC1 clarified that if an application calls fork(2) from a signal handler and any of the fork handlers registered by pthread_atfork(3) calls a function that is not async‐signal‐safe, the behavior is undefined. A future revision of the standard is likely to remove fork(2) from the list of async‐signal‐safe functions.

For what it is worth, I do not see any reference to pthread_atfork in MariaDB Server either.

It would seem plausible to me that the hang was caused by the SIGFPE handler invoking my_print_stacktrace().

Do these hangs go away if the server is invoked with --skip-stack-trace or the option skip_stack_trace is present in the configuration files?
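For reference, the suggested experiment would look like this in a configuration file (the file path depends on the distribution):

```ini
[mariadbd]
skip_stack_trace
```

or equivalently --skip-stack-trace on the server command line.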

Generated at Thu Feb 08 10:22:45 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.