Details
- Type: Bug
- Status: In Progress
- Priority: Major
- Resolution: Unresolved
- Affects Version: 11.4.10
- Fix Version: None
- Sprint: Q2/2026 Galera Maintenance
Description
Summary
MariaDB 11.4.10 hangs indefinitely during systemctl stop mariadb on Galera cluster nodes that also run async replication. The shutdown sequence gets stuck in an infinite loop in close_connections() at mysqld.cc:1837 because THD_count::connection_thd_count() never reaches zero. The functions that signal background threads to exit (stop_background_thread(), stop_handle_manager()) are only called from clean_up(), which runs only AFTER close_connections() returns. Any leaked THD count therefore turns the loop into a permanent hang.
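To see the ordering problem in isolation, here is a minimal, self-contained C++ sketch. This is NOT MariaDB code; all names are invented stand-ins. It reproduces the same shape of hang: the main thread waits on a counter before reaching the step that would let the counter drain.

// Schematic of the ordering bug; compile with: g++ -std=c++17 -pthread hang.cc
// WARNING: this program hangs by design, mirroring the reported shutdown.
#include <atomic>
#include <chrono>
#include <thread>

std::atomic<int>  thd_count{1};     // stands in for the leaked "ghost" THD count
std::atomic<bool> stop_flag{false}; // stands in for binlog_background_thread_stop

void background_thread()
{
  while (!stop_flag)                // waits for a signal that is sent too late
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
  thd_count--;                      // the only code that drains the count
}

int main()
{
  std::thread bg(background_thread);
  while (thd_count > 0)             // close_connections() analogue: unbounded wait
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  stop_flag = true;                 // clean_up() analogue: never reached
  bg.join();
}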
Version / Platform
- MariaDB 11.4.10 (mariadb.org binary distribution)
- Ubuntu 22.04 (jammy), amd64
- Galera cluster (3-node) with bi-directional async replication via named channel
Steps to Reproduce
- Set up two 3-node Galera clusters with bi-directional async replication using a named channel (e.g., repl_channel)
- Run moderate write load (e.g., INSERT workload on either cluster)
- Stop MariaDB with systemctl stop mariadb on a node that has been running as an async slave
- Approximately 50% of the time, the shutdown hangs indefinitely
The bug has been reproduced 5 times across different conditions:
- With and without innodb_use_native_aio=0 (rules out io_uring as the cause)
- With and without STOP ALL SLAVES before shutdown (rules out a running slave as a requirement)
- On single-node bootstrap (--wsrep-new-cluster) and multi-node cluster
- All 5 incidents produce identical GDB thread state
Expected Behavior
Server shuts down cleanly within TimeoutStopSec.
Actual Behavior
Server hangs in the "Shutdown in progress" state. InnoDB shutdown is never initiated ("InnoDB: Starting shutdown..." never appears in error.log). The systemd SIGTERM timeout is reached, and because SendSIGKILL=no is set, the process remains running indefinitely.
Root Cause Analysis
The infinite loop (mysqld.cc:1837)
Thread 1 is stuck in close_connections() in an infinite while loop with no timeout:
// mysqld.cc:1837-1842
while (THD_count::connection_thd_count())          // infinite: no timeout, no break
{
  if (DBUG_IF("only_kill_system_threads_no_loop"))
    break;                                         // debug-only, never active in release builds
  my_sleep(1000);                                  // line 1841, where Thread 1 is stuck
}
Compare with the preceding loop at line 1816, which is bounded by i < 1000 (a 20-second safety limit):
for (int i= 0; THD_count::connection_thd_count() && i < 1000; i++)  // bounded
  my_sleep(20000);
The unbounded loop at 1837 will hang forever if connection_thd_count() is non-zero.
What keeps connection_thd_count() > 0
uint THD_count::connection_thd_count()
{
  return value() -
         binlog_dump_thread_count -
         local_connection_thread_count;
}
After the async slave threads exit (via WSREP "Node has dropped from cluster" error path), the THD count appears to remain elevated — a "ghost" count that prevents connection_thd_count() from reaching zero. No actual connection threads are visible in GDB, yet the count remains > 0.
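To make the "ghost count" hypothesis concrete, here is an illustrative sketch of how a scope-based counter leaks when an exit path skips the decrement. The names are hypothetical simplifications; the real counter lives behind THD_count in sql_class.h.

// Illustration only: a scoped counter in the style of THD_count.
#include <atomic>
#include <cstdint>

static std::atomic<uint32_t> g_thd_count{0};

struct thd_count_guard                // hypothetical stand-in for THD_count
{
  thd_count_guard()  { g_thd_count.fetch_add(1); }
  ~thd_count_guard() { g_thd_count.fetch_sub(1); }
};

// If the slave error path hands its THD to a "restart later" mechanism
// (as the wsrep_restart_slave log line suggests) without ever destroying
// it, the increment is never paired with a decrement:
void leaky_slave_exit()
{
  auto *guard= new thd_count_guard(); // count: 0 -> 1
  // ... "Node has dropped from cluster" error path returns here, parking
  // the THD for a restart that never happens during shutdown ...
  (void) guard;                       // destructor never runs; count stays 1
}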
Why background threads are never signaled
The shutdown sequence in mysqld_main() is:
close_connections();   // mysqld.cc:6097, stuck at line 1837 forever
ha_pre_shutdown();     // never reached
clean_up(1);           // never reached
clean_up() calls:
- stop_handle_manager() -> signals handle_manager thread to exit
- mysql_bin_log.cleanup() -> calls stop_background_thread() -> sets binlog_background_thread_stop = true
Since close_connections() never returns, these functions are never called. GDB confirms: stop = false in binlog_background_thread() (the shutdown flag was never set).
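For context, the signaling that never happens looks roughly like this. The flag and sync-object names match the GDB output below; the body is a paraphrase of the log.cc logic, not the verbatim source.

// Approximate shape of MYSQL_BIN_LOG::stop_background_thread() in log.cc
// (paraphrase; exact body may differ):
void MYSQL_BIN_LOG::stop_background_thread()
{
  mysql_mutex_lock(&LOCK_binlog_background_thread);
  binlog_background_thread_stop= true;               // the flag GDB shows as false
  mysql_cond_signal(&COND_binlog_background_thread); // wakes the waiting thread
  mysql_mutex_unlock(&LOCK_binlog_background_thread);
}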
The two bugs
- Primary bug: the while (THD_count::connection_thd_count()) loop at line 1837 has no timeout. Even a modest timeout (similar to the preceding loop's 20 seconds) would prevent indefinite hangs; see the sketch after this list.
- Secondary bug: Something in the async slave exit path (likely via WSREP: Slave error due to node going non-primary / wsrep_restart_slave logic) leaks a THD count. This leaves connection_thd_count() > 0 despite no connection threads being alive.
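A minimal sketch of a mitigation for the primary bug, reusing the bounded pattern from line 1816. This is untested and illustrative only, not a reviewed patch; it assumes the surrounding code at mysqld.cc:1837.

// Sketch: bound the loop at mysqld.cc:1837 the way the loop at 1816 is
// bounded (~20 s), then log the leaked count instead of waiting forever.
for (int i= 0; THD_count::connection_thd_count() && i < 20000; i++)
{
  if (DBUG_IF("only_kill_system_threads_no_loop"))
    break;
  my_sleep(1000);                     // 20000 iterations x 1 ms ~= 20 s cap
}
if (uint leaked= THD_count::connection_thd_count())
  sql_print_warning("close_connections: %u THD count(s) still held after "
                    "timeout; continuing shutdown", leaked);

This does not fix the secondary bug (the leaked count itself), but it converts an indefinite hang into a bounded delay plus a diagnostic.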
GDB Evidence (with debug symbols)
Eight threads were present during the hang, captured with mariadb-server-core-dbgsym installed:
Thread 1 — Main thread (the blocked loop)
#1 my_sleep(m_seconds=1000) at my_sleep.c:29
#2 close_connections() at mysqld.cc:1841
#3 mysqld_main() at mysqld.cc:6097
Thread 7 — binlog_background_thread (never received shutdown signal)
#4 pthread_cond_wait(cond=COND_binlog_background_thread <mysql_bin_log+3248>,
                     mutex=LOCK_binlog_background_thread <mysql_bin_log+3200>)
#5 inline_mysql_cond_wait() at mysql_thread.h:1072
#6 binlog_background_thread(arg=0x0) at log.cc:11433
   stop = false   <-- SHUTDOWN FLAG NEVER SET
   queue = 0x0
abstime=0x0 means an untimed, infinite wait; stop = false proves stop_background_thread() was never called.
Thread 6 — handle_manager (never received shutdown signal)
#4 pthread_cond_wait(cond=COND_manager, mutex=LOCK_manager)
#5 inline_mysql_cond_wait() at mysql_thread.h:1072
#6 handle_manager(arg=0x0) at sql_manager.cc:109
   reset_flush_time = true
abstime=0x0 means an infinite wait; abort_manager is still false because stop_handle_manager() (called from clean_up()) was never reached.
Thread 5 — buf_flush_page_cleaner (InnoDB never told to shut down)
#4 pthread_cond_wait(cond=<buf_pool+768>, mutex=<buf_pool+640>)
#5 buf_flush_page_cleaner() at buf0flu.cc:2573
   lsn_limit = 0
wseq=2 shows the thread entered the condvar exactly once since startup and was never woken. InnoDB shutdown was never initiated because ha_pre_shutdown() runs only after close_connections() returns.
Thread 4 — io_uring AIO (bystander)
#5 aio_uring::thread_routine(aio=...) at aio_liburing.cc:159
Thread 3 — Aria checkpoint (normal timed sleep)
#6 my_service_thread_sleep(sleep_time=30000000000) at ma_servicethread.c:115
#7 ma_checkpoint_background() at ma_checkpoint.c:725
Thread 2 — timer_handler (normal timed wait)
#6 timer_handler() at thr_timer.c:322
Thread 8 — tpool worker (normal timed wait)
#10 tpool::thread_pool_generic::get_task() at tpool_generic.cc:521
#12 tpool::thread_pool_generic::worker_main() at tpool_generic.cc:566
Error Log Sequence (typical reproduction)
10:51:42 [Note] /usr/sbin/mariadbd (initiated by: unknown): Normal shutdown
10:51:42 [Note] WSREP: Shutdown replication
10:51:42 [Note] WSREP: Server status change synced -> disconnecting
10:51:42 [ERROR] Master 'repl_channel': Slave SQL: Node has dropped from cluster
10:51:42 [Note] Master 'repl_channel': WSREP: wsrep_restart_slave was set and therefore
                slave will be automatically restarted when node joins back to cluster
...
10:52:03 [Note] WSREP: Deinitializing allowlist service v1
<-- silence. InnoDB "Starting shutdown..." NEVER appears.
Prior Reports
Related to MDEV-21120 ("Server hangs on shutdown in MYSQL_BIN_LOG::stop_background_thread") from 2019, which was filed against 10.4 and is currently Stalled.
MDEV-21120 is a different manifestation of a failure in the same subsystem:
- MDEV-21120 (10.4): Thread 1 reaches stop_background_thread() but the condvar signal is missed (race)
- This bug (11.4): Thread 1 never reaches stop_background_thread() at all — stuck earlier in close_connections() unbounded wait
Both bugs result in binlog_background_thread never exiting, but the 11.4 version has an additional root cause: the unbounded while loop at line 1837.
Reproduction Details
Environment:
- Two 3-node Galera clusters
- Bi-directional async replication via floating IPs (keepalived)
- Named replication channel (custom name)
- wsrep_restart_slave=ON (default)
- systemd config: SendSIGKILL=no, TimeoutStopSec=900
Workarounds attempted (none fully effective):
- innodb_use_native_aio=0 -> Same hang (eliminates io_uring)
- STOP ALL SLAVES before shutdown -> Sometimes works, sometimes doesn't
- wsrep_restart_slave=OFF -> Not yet tested; expected to reduce the trigger probability, but it would not fix the unbounded loop