  1. MariaDB Server
  2. MDEV-39011

mariadbd hangs indefinitely on shutdown — leaked THD_count blocks close_connections() infinite loop

Details

    • Bug
    • Status: In Progress
    • Major
    • Resolution: Unresolved
    • 11.4.10
    • 11.4
    • Galera
    • None
    • Q2/2026 Galera Maintenance

    Description

      Summary

      MariaDB 11.4.10 hangs indefinitely during systemctl stop mariadb on Galera cluster nodes that also run async replication. Shutdown gets stuck in an unbounded loop in close_connections() at mysqld.cc:1837 because THD_count::connection_thd_count() never reaches zero. The functions that signal background threads to exit (stop_handle_manager(), and stop_background_thread() via mysql_bin_log.cleanup()) are called only from clean_up(), which runs after close_connections() returns. A single leaked THD count therefore stalls shutdown permanently: the main thread waits for a count that never drops, and the background threads wait for a signal that is never sent.

      Version / Platform

      • MariaDB 11.4.10 (mariadb.org binary distribution)
      • Ubuntu 22.04 (jammy), amd64
      • Galera cluster (3-node) with bi-directional async replication via named channel

      Steps to Reproduce

      1. Set up two 3-node Galera clusters with bi-directional async replication using a named channel (e.g., repl_channel)
      2. Run moderate write load (e.g., INSERT workload on either cluster)
      3. Stop MariaDB with systemctl stop mariadb on a node that has been running as async slave
      4. Approximately 50% of the time, the shutdown hangs indefinitely

      The bug has been reproduced 5 times across different conditions:

      • With and without innodb_use_native_aio=0 (eliminates io_uring as cause)
      • With and without STOP ALL SLAVES before shutdown (eliminates running slave as requirement)
      • On single-node bootstrap (--wsrep-new-cluster) and multi-node cluster
      • All 5 incidents produce identical GDB thread state

      Expected Behavior

      Server shuts down cleanly within TimeoutStopSec.

      Actual Behavior

      Server hangs in "Shutdown in progress" state. InnoDB shutdown is never initiated (InnoDB: Starting shutdown... never appears in error.log). The systemd stop timeout (TimeoutStopSec) expires, and because SendSIGKILL=no is set, the process keeps running indefinitely.

      Root Cause Analysis

      The infinite loop (mysqld.cc:1837)

      Thread 1 is stuck in close_connections() at an infinite while loop with no timeout:

      // mysqld.cc:1837-1842
      while (THD_count::connection_thd_count())   // infinite, no timeout, no break
      {
          if (DBUG_IF("only_kill_system_threads_no_loop"))
            break;                                 // debug-only, never active in release
          my_sleep(1000);                          // line 1841, where Thread 1 is stuck
      }
      

      Compare with the preceding loop at line 1816, which is bounded by i < 1000 (1000 iterations × 20,000 µs per my_sleep() call, a 20-second safety limit):

      for (int i= 0; THD_count::connection_thd_count() && i < 1000; i++)  // bounded
          my_sleep(20000);
      

      The unbounded loop at 1837 will hang forever if connection_thd_count() is non-zero.
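
      As an illustration of the bounded alternative, here is a minimal standalone sketch of a deadline-based drain loop. It is not MariaDB code: wait_for_drain, its parameters, and the use of std::chrono are assumptions made for the example, and a real patch would have to fit the server's own sleep and logging facilities.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper (not part of MariaDB): poll `remaining` until it
// returns 0 or `timeout` elapses.
// Returns true if the count drained, false if the deadline was hit first.
bool wait_for_drain(const std::function<unsigned()>& remaining,
                    std::chrono::milliseconds timeout,
                    std::chrono::milliseconds poll = std::chrono::milliseconds(1))
{
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (remaining() != 0)
  {
    if (std::chrono::steady_clock::now() >= deadline)
      return false;                      // give up instead of spinning forever
    std::this_thread::sleep_for(poll);   // same role as my_sleep() in the loop
  }
  return true;
}
```

      A caller could log a warning and continue with the rest of shutdown when the function returns false, which is how the bounded loop at line 1816 already behaves once its 1000 iterations are used up.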

      What keeps connection_thd_count() > 0

      uint THD_count::connection_thd_count()
      {
        return value() -
          binlog_dump_thread_count -
          local_connection_thread_count;
      }
      

      After the async slave threads exit (via WSREP "Node has dropped from cluster" error path), the THD count appears to remain elevated — a "ghost" count that prevents connection_thd_count() from reaching zero. No actual connection threads are visible in GDB, yet the count remains > 0.
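
      The leak site has not been identified in this report. Purely to illustrate how a count can stay elevated with no thread left alive, the toy model below contrasts a scope-bound counter with a manual increment whose matching decrement is skipped on an error path. ToyThdCount, balanced_exit and leaky_exit are hypothetical names, not MariaDB classes or functions, and the early return is an assumed mechanism, not a confirmed one.

```cpp
#include <atomic>

// Toy stand-in for a server-wide THD counter (hypothetical, not MariaDB's).
struct ToyThdCount
{
  static std::atomic<unsigned> count;
  ToyThdCount()  { ++count; }
  ~ToyThdCount() { --count; }
  ToyThdCount(const ToyThdCount&) = delete;
  ToyThdCount& operator=(const ToyThdCount&) = delete;
};
std::atomic<unsigned> ToyThdCount::count{0};

// Balanced path: the destructor always undoes the increment.
unsigned balanced_exit()
{
  { ToyThdCount thd; }
  return ToyThdCount::count.load();
}

// Leaky path: a manual increment whose decrement is skipped when the
// (assumed) error path returns early.
unsigned leaky_exit(bool error_path)
{
  ++ToyThdCount::count;
  if (error_path)
    return ToyThdCount::count.load();   // early return: decrement never runs
  --ToyThdCount::count;
  return ToyThdCount::count.load();
}
```

      After one call to leaky_exit(true) the counter stays at 1 for the rest of the process lifetime, which matches the "ghost" count symptom: nothing is running, yet the value never returns to zero.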

      Why background threads are never signaled

      The shutdown sequence in mysqld_main() is:

      close_connections();    // mysqld.cc:6097, stuck at line 1837 forever
      ha_pre_shutdown();      // never reached
      clean_up(1);            // never reached
      

      clean_up() calls:

      • stop_handle_manager() -> signals handle_manager thread to exit
      • mysql_bin_log.cleanup() -> calls stop_background_thread() -> sets binlog_background_thread_stop = true

      Since close_connections() never returns, these functions are never called. GDB confirms: stop = false in binlog_background_thread() (the shutdown flag was never set).
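
      The ordering problem can be reproduced in miniature. In the sketch below, a background thread waits on a condition variable for a stop flag, as binlog_background_thread and handle_manager do, while the shutdown routine drains a counter first. Because the drain is bounded, the stop signal is still delivered when the counter is leaked. All names (ToyBackgroundThread, toy_shutdown) are hypothetical, and this models only the control flow described above, not the server's actual code.

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Toy background thread that sleeps until told to stop (hypothetical).
struct ToyBackgroundThread
{
  std::mutex mtx;
  std::condition_variable cond;
  bool stop = false;
  std::thread worker;

  void start()
  {
    worker = std::thread([this] {
      std::unique_lock<std::mutex> guard(mtx);
      cond.wait(guard, [this] { return stop; });  // untimed wait on the flag
    });
  }

  void stop_and_join()
  {
    {
      std::lock_guard<std::mutex> guard(mtx);
      stop = true;                // set under the mutex so it cannot be missed
    }
    cond.notify_all();
    worker.join();
  }
};

// Toy shutdown: bounded drain first, then signal the background thread.
// Returns true if the count drained, false if the limit expired.
bool toy_shutdown(const std::atomic<unsigned>& thd_count,
                  std::chrono::milliseconds limit)
{
  ToyBackgroundThread background;
  background.start();
  const auto deadline = std::chrono::steady_clock::now() + limit;
  while (thd_count.load() != 0 &&
         std::chrono::steady_clock::now() < deadline)
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  const bool drained = thd_count.load() == 0;
  background.stop_and_join();     // reached even when the count is leaked
  return drained;
}
```

      With an unbounded drain in place of the deadline check, a leaked count of 1 would keep toy_shutdown() from ever reaching stop_and_join(), which is the state GDB shows for the real server.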

      The two bugs

      1. Primary bug: The while (THD_count::connection_thd_count()) loop at line 1837 has no timeout. Even a modest timeout (similar to the preceding loop's 20 seconds) would prevent indefinite hangs.
      2. Secondary bug: Something in the async slave exit path (likely via WSREP: Slave error due to node going non-primary / wsrep_restart_slave logic) leaks a THD count. This leaves connection_thd_count() > 0 despite no connection threads being alive.

      GDB Evidence (with debug symbols)

      8 threads during hang, captured with mariadb-server-core-dbgsym installed:

      Thread 1 — Main thread (the blocked loop)

      #1  my_sleep(m_seconds=1000) at my_sleep.c:29
      #2  close_connections() at mysqld.cc:1841
      #3  mysqld_main() at mysqld.cc:6097
      

      Thread 7 — binlog_background_thread (never received shutdown signal)

      #4  pthread_cond_wait(cond=COND_binlog_background_thread <mysql_bin_log+3248>,
                            mutex=LOCK_binlog_background_thread <mysql_bin_log+3200>)
      #5  inline_mysql_cond_wait() at mysql_thread.h:1072
      #6  binlog_background_thread(arg=0x0) at log.cc:11433
            stop = false          <-- SHUTDOWN FLAG NEVER SET
            queue = 0x0
      

      abstime=0x0 means the wait has no timeout. stop = false shows that stop_background_thread() was never called.

      Thread 6 — handle_manager (never received shutdown signal)

      #4  pthread_cond_wait(cond=COND_manager, mutex=LOCK_manager)
      #5  inline_mysql_cond_wait() at mysql_thread.h:1072
      #6  handle_manager(arg=0x0) at sql_manager.cc:109
            reset_flush_time = true
      

      abstime=0x0 means the wait has no timeout. abort_manager is still false because stop_handle_manager() (called from clean_up()) was never reached.

      Thread 5 — buf_flush_page_cleaner (InnoDB never told to shut down)

      #4  pthread_cond_wait(cond=<buf_pool+768>, mutex=<buf_pool+640>)
      #5  buf_flush_page_cleaner() at buf0flu.cc:2573
            lsn_limit = 0
      

      wseq=2 — entered condvar exactly once since startup, never woken. InnoDB shutdown was never initiated because ha_pre_shutdown() comes after close_connections().

      Thread 4 — io_uring AIO (bystander)

      #5  aio_uring::thread_routine(aio=...) at aio_liburing.cc:159
      

      Thread 3 — Aria checkpoint (normal timed sleep)

      #6  my_service_thread_sleep(sleep_time=30000000000) at ma_servicethread.c:115
      #7  ma_checkpoint_background() at ma_checkpoint.c:725
      

      Thread 2 — timer_handler (normal timed wait)

      #6  timer_handler() at thr_timer.c:322
      

      Thread 8 — tpool worker (normal timed wait)

      #10 tpool::thread_pool_generic::get_task() at tpool_generic.cc:521
      #12 tpool::thread_pool_generic::worker_main() at tpool_generic.cc:566
      

      Error Log Sequence (typical reproduction)

      10:51:42 [Note] /usr/sbin/mariadbd (initiated by: unknown): Normal shutdown
      10:51:42 [Note] WSREP: Shutdown replication
      10:51:42 [Note] WSREP: Server status change synced -> disconnecting
      10:51:42 [ERROR] Master 'repl_channel': Slave SQL: Node has dropped from cluster
      10:51:42 [Note] Master 'repl_channel': WSREP: wsrep_restart_slave was set and therefore
                      slave will be automatically restarted when node joins back to cluster
      ...
      10:52:03 [Note] WSREP: Deinitializing allowlist service v1
               <-- silence. InnoDB "Starting shutdown..." NEVER appears.
      

      Prior Reports

      Related to MDEV-21120 ("Server hangs on shutdown in MYSQL_BIN_LOG::stop_background_thread") from 2019, which was filed against 10.4 and is currently Stalled.

      MDEV-21120 is a different manifestation of the same subsystem:

      • MDEV-21120 (10.4): Thread 1 reaches stop_background_thread() but the condvar signal is missed (race)
      • This bug (11.4): Thread 1 never reaches stop_background_thread() at all — stuck earlier in close_connections() unbounded wait

      Both bugs result in binlog_background_thread never exiting, but the 11.4 version has an additional root cause: the unbounded while loop at line 1837.

      Reproduction Details

      Environment:

      • Two 3-node Galera clusters
      • Bi-directional async replication via floating IPs (keepalived)
      • Named replication channel (custom name)
      • wsrep_restart_slave=ON
      • systemd config: SendSIGKILL=no, TimeoutStopSec=900

      Workarounds attempted (none fully effective):

      • innodb_use_native_aio=0 -> Same hang (eliminates io_uring)
      • STOP ALL SLAVES before shutdown -> Sometimes works, sometimes doesn't
      • wsrep_restart_slave=OFF -> Not yet tested; expected to reduce the trigger probability, but it would not fix the unbounded loop

          People

            denis.protivensky Denis Protivensky
            claudio.nanni Claudio Nanni