[MDEV-39011] mariadbd hangs indefinitely on shutdown — leaked THD_count blocks close_connections() infinite loop - Jira

Details

Type: Bug
Status: In Progress
Priority: Major
Resolution: Unresolved
Affects Version/s: 11.4.10
Fix Version/s: 11.4
Component/s: Galera
Labels: None
Sprint: Q2/2026 Galera Maintenance

Description

Summary

MariaDB 11.4.10 hangs indefinitely during systemctl stop mariadb on Galera cluster nodes that also run async replication. The shutdown sequence is stuck in an unbounded loop in close_connections() at mysqld.cc:1837 because THD_count::connection_thd_count() never reaches zero. The functions that signal background threads to exit (stop_background_thread(), stop_handle_manager()) are called only from clean_up(), which runs after close_connections() returns, so a single leaked THD count is enough to block shutdown permanently.

Version / Platform

MariaDB 11.4.10 (mariadb.org binary distribution)
Ubuntu 22.04 (jammy), amd64
Galera cluster (3-node) with bi-directional async replication via named channel

Steps to Reproduce

1. Set up two 3-node Galera clusters with bi-directional async replication using a named channel (e.g., repl_channel)
2. Run moderate write load (e.g., INSERT workload on either cluster)
3. Stop MariaDB with systemctl stop mariadb on a node that has been running as async slave
Result: in approximately 50% of attempts, the shutdown hangs indefinitely

The bug has been reproduced 5 times across different conditions:

With and without innodb_use_native_aio=0 (eliminates io_uring as cause)
With and without STOP ALL SLAVES before shutdown (eliminates running slave as requirement)
On single-node bootstrap (--wsrep-new-cluster) and multi-node cluster
All 5 incidents produce identical GDB thread state

Expected Behavior

Server shuts down cleanly within TimeoutStopSec.

Actual Behavior

Server hangs in "Shutdown in progress" state. InnoDB shutdown is never initiated (InnoDB: Starting shutdown... never appears in error.log). The systemd stop timeout (TimeoutStopSec) expires, and because SendSIGKILL=no is set, the process is not killed and remains running indefinitely.

Root Cause Analysis

The infinite loop (mysqld.cc:1837)

Thread 1 is stuck in close_connections() at an infinite while loop with no timeout:

// mysqld.cc:1837-1842
while (THD_count::connection_thd_count())   // infinite, no timeout, no break
{
  if (DBUG_IF("only_kill_system_threads_no_loop"))
    break;                                   // debug-only, never active in release
  my_sleep(1000);                            // line 1841, where Thread 1 is stuck
}

Compare with the preceding loop at line 1816 which has i < 1000 (20-second safety limit):

for (int i= 0; THD_count::connection_thd_count() && i < 1000; i++)  // bounded
  my_sleep(20000);

The unbounded loop at 1837 will hang forever if connection_thd_count() is non-zero.
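As an illustration of the kind of bound the report asks for, the following is a minimal standalone sketch, not MariaDB code. The helper name wait_for_zero, its parameters, and the use of std::chrono are assumptions made for this example; the real loop uses my_sleep() and THD_count::connection_thd_count().

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper (not part of MariaDB): polls `remaining` until it
// returns 0 or `timeout` elapses. Returns true if the count drained to zero,
// false if the deadline passed first.
bool wait_for_zero(const std::function<unsigned()> &remaining,
                   std::chrono::milliseconds timeout,
                   std::chrono::milliseconds poll_interval)
{
  assert(poll_interval.count() > 0);
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (remaining() != 0)
  {
    if (std::chrono::steady_clock::now() >= deadline)
      return false;                       // give up instead of looping forever
    std::this_thread::sleep_for(poll_interval);
  }
  return true;
}
```

With a count that never drains, such a helper returns false once the deadline passes, so the caller can log the leaked count and continue with the rest of shutdown. Whether continuing is safe in the server is a separate question that this sketch does not answer.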

What keeps connection_thd_count() > 0

uint THD_count::connection_thd_count()
{
  return value() -
    binlog_dump_thread_count -
    local_connection_thread_count;
}

After the async slave threads exit (via WSREP "Node has dropped from cluster" error path), the THD count appears to remain elevated — a "ghost" count that prevents connection_thd_count() from reaching zero. No actual connection threads are visible in GDB, yet the count remains > 0.
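The report does not identify where the count leaks. Purely as an illustration of the general failure mode, the toy model below shows how a counter incremented on entry can stay elevated when an error path skips the matching decrement, and how scope-bound accounting avoids that. All names here (toy_thd_count, run_manual, ToyCountGuard, run_guarded) are hypothetical and do not correspond to real server code.

```cpp
#include <atomic>
#include <cassert>

// Toy stand-in for a server-wide thread counter; not the real THD_count.
static std::atomic<int> toy_thd_count{0};

// Manual accounting: the early return skips the decrement and leaks one count.
bool run_manual(bool fail_early)
{
  toy_thd_count++;
  if (fail_early)
    return false;            // BUG: the decrement below is never executed
  toy_thd_count--;
  return true;
}

// Scope-bound accounting: the destructor runs on every exit path.
struct ToyCountGuard
{
  ToyCountGuard() { toy_thd_count++; }
  ~ToyCountGuard() { toy_thd_count--; }
  ToyCountGuard(const ToyCountGuard &) = delete;
  ToyCountGuard &operator=(const ToyCountGuard &) = delete;
};

bool run_guarded(bool fail_early)
{
  ToyCountGuard guard;       // decremented automatically when the scope ends
  if (fail_early)
    return false;
  return true;
}
```

After run_manual(true) the toy counter stays at 1 although no work is in flight, which matches the "ghost" count symptom described above. After run_guarded(true) it is back at 0.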

Why background threads are never signaled

The shutdown sequence in mysqld_main() is:

close_connections();    // mysqld.cc:6097, stuck at line 1837 forever
ha_pre_shutdown();      // never reached
clean_up(1);            // never reached

clean_up() calls:

stop_handle_manager() -> signals handle_manager thread to exit
mysql_bin_log.cleanup() -> calls stop_background_thread() -> sets binlog_background_thread_stop = true

Since close_connections() never returns, these functions are never called. GDB confirms: stop = false in binlog_background_thread() (the shutdown flag was never set).

The two bugs

Primary bug: The while (THD_count::connection_thd_count()) loop at line 1837 has no timeout. Even a modest timeout (similar to the preceding loop's 20 seconds) would prevent indefinite hangs.
Secondary bug: Something in the async slave exit path (likely via WSREP: Slave error due to node going non-primary / wsrep_restart_slave logic) leaks a THD count. This leaves connection_thd_count() > 0 despite no connection threads being alive.

GDB Evidence (with debug symbols)

8 threads during hang, captured with mariadb-server-core-dbgsym installed:

Thread 1 — Main thread (the blocked loop)

#1  my_sleep(m_seconds=1000) at my_sleep.c:29
#2  close_connections() at mysqld.cc:1841
#3  mysqld_main() at mysqld.cc:6097

Thread 7 — binlog_background_thread (never received shutdown signal)

#4  pthread_cond_wait(cond=COND_binlog_background_thread <mysql_bin_log+3248>,
                      mutex=LOCK_binlog_background_thread <mysql_bin_log+3200>)
#5  inline_mysql_cond_wait() at mysql_thread.h:1072
#6  binlog_background_thread(arg=0x0) at log.cc:11433
      stop = false          <-- SHUTDOWN FLAG NEVER SET
      queue = 0x0

abstime=0x0 means the wait has no timeout. stop = false shows that stop_background_thread() was never called.

Thread 6 — handle_manager (never received shutdown signal)

#4  pthread_cond_wait(cond=COND_manager, mutex=LOCK_manager)
#5  inline_mysql_cond_wait() at mysql_thread.h:1072
#6  handle_manager(arg=0x0) at sql_manager.cc:109
      reset_flush_time = true

abstime=0x0 means the wait has no timeout. abort_manager is still false because stop_handle_manager() (in clean_up()) was never reached.

Thread 5 — buf_flush_page_cleaner (InnoDB never told to shut down)

#4  pthread_cond_wait(cond=<buf_pool+768>, mutex=<buf_pool+640>)
#5  buf_flush_page_cleaner() at buf0flu.cc:2573
      lsn_limit = 0

wseq=2 — entered condvar exactly once since startup, never woken. InnoDB shutdown was never initiated because ha_pre_shutdown() comes after close_connections().

Thread 4 — io_uring AIO (bystander)

#5  aio_uring::thread_routine(aio=...) at aio_liburing.cc:159

Thread 3 — Aria checkpoint (normal timed sleep)

#6  my_service_thread_sleep(sleep_time=30000000000) at ma_servicethread.c:115
#7  ma_checkpoint_background() at ma_checkpoint.c:725

Thread 2 — timer_handler (normal timed wait)

#6  timer_handler() at thr_timer.c:322

Thread 8 — tpool worker (normal timed wait)

#10 tpool::thread_pool_generic::get_task() at tpool_generic.cc:521
#12 tpool::thread_pool_generic::worker_main() at tpool_generic.cc:566

Error Log Sequence (typical reproduction)

10:51:42 [Note] /usr/sbin/mariadbd (initiated by: unknown): Normal shutdown
10:51:42 [Note] WSREP: Shutdown replication
10:51:42 [Note] WSREP: Server status change synced -> disconnecting
10:51:42 [ERROR] Master 'repl_channel': Slave SQL: Node has dropped from cluster
10:51:42 [Note] Master 'repl_channel': WSREP: wsrep_restart_slave was set and therefore
                slave will be automatically restarted when node joins back to cluster
...
10:52:03 [Note] WSREP: Deinitializing allowlist service v1
         <-- silence. InnoDB "Starting shutdown..." NEVER appears.
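The log signature above (a "Normal shutdown" line that is never followed by "InnoDB: Starting shutdown") can be checked mechanically. The following is a rough sketch for triage, not an existing tool. The function name looks_like_stuck_shutdown is hypothetical, and the two substrings are taken from the log excerpts in this report.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical triage helper: returns true if the most recent shutdown
// request in the log was not followed by the start of InnoDB shutdown.
bool looks_like_stuck_shutdown(const std::vector<std::string> &log_lines)
{
  bool shutdown_requested = false;
  bool innodb_shutdown_started = false;
  for (const std::string &line : log_lines)
  {
    if (line.find("Normal shutdown") != std::string::npos)
    {
      shutdown_requested = true;          // a new shutdown attempt begins here
      innodb_shutdown_started = false;
    }
    else if (shutdown_requested &&
             line.find("InnoDB: Starting shutdown") != std::string::npos)
    {
      innodb_shutdown_started = true;
    }
  }
  return shutdown_requested && !innodb_shutdown_started;
}
```

A shutdown that is merely still in progress produces the same signature, so a check like this is only meaningful once enough time has passed since the "Normal shutdown" line, for example after the first 20-second bounded wait.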

Prior Reports

Related to MDEV-21120 ("Server hangs on shutdown in MYSQL_BIN_LOG::stop_background_thread") from 2019, which was filed against 10.4 and is currently Stalled.

MDEV-21120 is a different manifestation of the same subsystem:

MDEV-21120 (10.4): Thread 1 reaches stop_background_thread() but the condvar signal is missed (race)
This bug (11.4): Thread 1 never reaches stop_background_thread() at all — stuck earlier in close_connections() unbounded wait

Both bugs result in binlog_background_thread never exiting, but the 11.4 version has an additional root cause: the unbounded while loop at line 1837.

Reproduction Details

Environment:

Two 3-node Galera clusters
Bi-directional async replication via floating IPs (keepalived)
Named replication channel (custom name)
wsrep_restart_slave=ON (the MariaDB documentation lists OFF as the default, so this appears to be set explicitly)
systemd config: SendSIGKILL=no, TimeoutStopSec=900

Workarounds attempted (none fully effective):

innodb_use_native_aio=0 -> Same hang (eliminates io_uring)
STOP ALL SLAVES before shutdown -> Sometimes works, sometimes doesn't
wsrep_restart_slave=OFF -> Not yet tested; expected to reduce the trigger probability, but it would not fix the unbounded loop

People

Assignee: Denis Protivensky
Reporter: Claudio Nanni
Votes: 1
Watchers: 7

Dates

Created: 1 week ago 15:57
Updated: 5 hours ago
