  1. MariaDB Server
  2. MDEV-39243

Galera joiner node segfaults in libgalera_smm.so during abort after IST donor fails DNS resolution


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 11.4.10
    • Fix Version/s: None
    • Component/s: Galera SST
    • Labels: None
    • Environment: Linux 6.6.32, x86-64, Docker containers (Podman 5.4.1), Ubuntu Noble 24.04 base image

    Description

      Summary
      A Galera joiner node (db02) segfaults with signal 11 inside libgalera_smm.so
      during its shutdown/abort sequence after IST fails due to a transient DNS
      resolution failure on the donor side. The crash occurs in the error-handling
      path, not during normal operation. The node should exit cleanly with a non-zero
      error code and allow the container runtime to restart and retry — instead it
      segfaults with exit code 139.


      Environment

      Field               Value
      MariaDB             11.4.10-MariaDB-ubu2404
      Source revision     054a893f1645b77e52a329a7fc8cf614eebd1fad
      Galera              26.4.25 (r7387a566) by Codership Oy
      OS (container)      Ubuntu Noble 24.04
      Kernel (host)       Linux 6.6.32 x86-64
      Container runtime   Podman 5.4.1 via Docker Compose
      Cluster size        3-node Galera cluster (db00, db01, db02)
      SST method          mariabackup
      IST port            4568 (default, no custom configuration applied)

      Steps to Reproduce

      1. Set up a 3-node Galera cluster using the mariadb:11.4 Docker image with
        --wsrep-sst-method=mariabackup (see attached docker-compose.yaml).
      2. Bootstrap db00 with --wsrep-cluster-address=gcomm:// as the single seed node.
      3. Start db01. It joins via SST from db00 — this succeeds normally.
      4. Start db02 while DNS resolution for the hostname db02 is transiently
        unavailable from db01's perspective. This occurs naturally in containerized
        environments where DNS propagation has a slight delay after container creation.
        No special workload or manual network manipulation is required.
      5. db01 is selected as IST donor for db02.
      6. IST fails: db01 cannot resolve tcp://db02:4568.
      7. db02 receives the failure notification and enters its abort/shutdown path.
      8. db02 segfaults (signal 11). Exit code 139.

      Reproducibility: Consistent in containerized environments with sequential
      container startup. No special workload is required; the crash occurs during
      cluster formation only.


      Expected Result

      db02 logs the IST failure, shuts down cleanly with a non-zero exit code, and
      is restarted by the container runtime to retry joining (at which point DNS has
      typically propagated and the join succeeds).

      Actual Result

      db02 crashes with signal 11 (SIGSEGV). Exit code 139. The container runtime
      restarts the process but the crash itself represents incorrect behavior — an
      abort in an error-handling path should never produce a segfault.
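
Exit code 139 follows the usual shell and container-runtime convention of reporting 128 plus the signal number for a process killed by a signal, so 139 corresponds to signal 11. The sketch below decodes such a code; the helper name `describe_exit_code` is hypothetical and is not part of MariaDB, Galera, or Podman.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Describe an exit code using the 128 + signal-number convention."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            return f"killed by unknown signal {signum}"
        return f"killed by {name} (signal {signum})"
    return f"exited with status {code}"


print(describe_exit_code(139))  # killed by SIGSEGV (signal 11)
```

A clean failure on the joiner would instead show up as a small non-zero status (for example 1), which is what the expected result above describes.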


      Log Sequence

      db02 — begins join:

      2026-03-07 12:27:55 2 [Note] WSREP: State transfer required:
          Group state: fe29eb94-1a20-11f1-8cce-db564dd310fd:3
          Local state: 00000000-0000-0000-0000-000000000000:-1
      2026-03-07 12:27:55 2 [Note] WSREP: Server status change connected -> joiner
      2026-03-07 12:27:55 0 [Note] WSREP: Running: 'wsrep_sst_mariabackup --role joiner ...'
      

      db01 — IST sender fails DNS resolution for db02:

      2026-03-07 12:28:15 2 [Warning] WSREP: IST failed: IST sender, failed to connect 'tcp://db02:4568':
          Failed to connect 'tcp://db02:4568': resolve: Host not found (non-authoritative), try again later:
          System error: 2 (No such file or directory)
          at ./galerautils/src/gu_asio_stream_react.cpp:connect():229
          at ./galera/src/ist.cpp:Sender():618
      2026-03-07 12:28:15 0 [Warning] WSREP: 0.0 (db01): State transfer to 1.0 (db02) failed: 
          No such file or directory
      

      db02 — receives failure notification and aborts:

      2026-03-07 12:28:15 0 [Warning] WSREP: 0.0 (db01): State transfer to 1.0 (db02) failed: 
          No such file or directory
      2026-03-07 12:28:15 0 [ERROR] WSREP: ./gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():1285: 
          Will never receive state. Need to abort.
      2026-03-07 12:28:15 0 [Note] WSREP: gcomm: terminating thread
      2026-03-07 12:28:15 0 [Note] WSREP: gcomm: closing backend
      2026-03-07 12:28:15 0 [Note] WSREP: gcomm: closed
      2026-03-07 12:28:15 0 [Note] WSREP: mariadbd: Terminated.
      

      db02 — segfault:

      260307 12:28:15 [ERROR] mariadbd got signal 11 ;
       
      Server version: 11.4.10-MariaDB-ubu2404
      Source revision: 054a893f1645b77e52a329a7fc8cf614eebd1fad
       
      Thread pointer: 0x0
      stack_bottom = 0x0 thread_stack 0x49000
      mariadbd(my_print_stacktrace+0x30)[0x5592ed49f520]
      mariadbd(handle_fatal_signal+0x2a1)[0x5592ed023291]
      /lib/x86_64-linux-gnu/libc.so.6(+0x45330)[0x7f59805eb330]
      /lib/x86_64-linux-gnu/libc.so.6(abort+0x182)[0x7f59805ce9a2]
      /usr/lib/galera/libgalera_smm.so(+0x16a097)[0x7f597fd5b097]
      /usr/lib/galera/libgalera_smm.so(+0xab528)[0x7f597fc9c528]
      /lib/x86_64-linux-gnu/libc.so.6(+0x9caa4)[0x7f5980642aa4]
      /lib/x86_64-linux-gnu/libc.so.6(__clone+0x44)[0x7f59806cfa64]
      


      Analysis

      The call chain leading to the crash is:

      1. gcs_group_handle_join_msg() at line 1285 determines the joiner will never
        receive state and calls abort().
      2. The stack trace shows abort() being entered from code inside
        libgalera_smm.so (return address at offset +0x16a097). Signal 11 is
        reported from within abort() itself (abort+0x182) and is caught by
        mariadbd's handle_fatal_signal.
      3. The crash output shows Thread pointer: 0x0, indicating the abort is
        initiated before a THD (thread) context has been fully established on the
        joiner side.
      4. The signal handler or abort cleanup path reached from libgalera_smm.so
        appears not to handle the null-THD case, causing a second fault during the
        abort itself.

      This is a hypothesis based on the crash output. Confirming the exact
      dereference site would require a debug build and a gdb session. The primary
      assertion is: Galera's abort path in the IST failure case does not degrade
      gracefully when thread context is uninitialized.
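
As a starting point for that debugging session, the frames printed in the crash output can be split into module, symbol, offset, and address. The following is a minimal sketch that assumes the `module(symbol+offset)[address]` frame format seen in this report; `FRAME_RE` and `parse_frames` are illustrative names, not part of any MariaDB tooling.

```python
import re

# Matches frames such as "mariadbd(my_print_stacktrace+0x30)[0x5592ed49f520]"
# and "/usr/lib/galera/libgalera_smm.so(+0x16a097)[0x7f597fd5b097]".
FRAME_RE = re.compile(
    r"^(?P<module>[^\s(]+)\((?P<symbol>[^)+]*)\+(?P<offset>0x[0-9a-fA-F]+)\)"
    r"\[(?P<address>0x[0-9a-fA-F]+)\]$"
)


def parse_frames(trace: str):
    """Return one dict per recognisable stack frame, innermost first."""
    frames = []
    for line in trace.splitlines():
        m = FRAME_RE.match(line.strip())
        if m:
            frames.append({
                "module": m["module"],
                "symbol": m["symbol"] or None,  # None when no symbol was printed
                "offset": int(m["offset"], 16),
                "address": int(m["address"], 16),
            })
    return frames


# Frames copied from the crash output in this report.
TRACE = """\
mariadbd(my_print_stacktrace+0x30)[0x5592ed49f520]
mariadbd(handle_fatal_signal+0x2a1)[0x5592ed023291]
/lib/x86_64-linux-gnu/libc.so.6(+0x45330)[0x7f59805eb330]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x182)[0x7f59805ce9a2]
/usr/lib/galera/libgalera_smm.so(+0x16a097)[0x7f597fd5b097]
/usr/lib/galera/libgalera_smm.so(+0xab528)[0x7f597fc9c528]
/lib/x86_64-linux-gnu/libc.so.6(+0x9caa4)[0x7f5980642aa4]
/lib/x86_64-linux-gnu/libc.so.6(__clone+0x44)[0x7f59806cfa64]
"""

galera_frames = [
    f for f in parse_frames(TRACE) if f["module"].endswith("libgalera_smm.so")
]
for f in galera_frames:
    print(f"{f['module']} +{f['offset']:#x}")
```

The two libgalera_smm.so offsets are the ones that would need resolving against debug symbols for the exact Galera 26.4.25 build; whether matching symbols are available for that package is an assumption here.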


      Fault Injection & Reproducibility (Antithesis)

      This issue was discovered using Antithesis, a
      deterministic simulation testing platform. Antithesis is able to:

      • Reproduce this fault deterministically — the exact network conditions
        leading to the segfault can be replayed on demand, without relying on
        natural DNS propagation delays in a container environment.
      • Identify the escalation point precisely — by injecting a network fault
        (ping delay) at specific points in the join sequence, Antithesis pinpoints
        where bug probability escalates: the window between
        gcs_group_handle_join_msg() determining state will never arrive and
        the joiner's THD context being fully initialized.
      • Provide a reproducible test case to the MariaDB/Codership team — if the
        team has access to Antithesis or would like to collaborate, the fault
        injection scenario can be shared directly to assist in debugging and
        validating a fix.

      The specific fault injected was a ping delay on the network path between
      db01 and db02 during the IST handshake window, which causes db01's DNS
      resolution of tcp://db02:4568 to fail non-authoritatively, exactly
      the condition described in this report.


      Workaround

      None confirmed. Restarting db02 (which the container runtime does automatically
      on exit code 139) typically succeeds once DNS has propagated — usually within
      a few seconds of the first attempt. The crash itself is the bug; the retry
      behavior is coincidentally correct.
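
Because the retry succeeds once DNS has propagated, one untested mitigation would be to delay the join until the joiner's hostname resolves. The sketch below is hypothetical and is not a confirmed workaround: `wait_for_dns` is an illustrative helper, and a check run in one container only approximates what the donor (db01) sees when it resolves db02. It does not address the segfault itself.

```python
import socket
import time


def wait_for_dns(host, port, attempts=10, delay=1.0,
                 resolve=socket.getaddrinfo, sleep=time.sleep):
    """Return True once `host` resolves, or False after `attempts` failures.

    `resolve` and `sleep` are injectable so the retry logic can be exercised
    without real DNS lookups or real waiting.
    """
    for attempt in range(1, attempts + 1):
        try:
            resolve(host, port, type=socket.SOCK_STREAM)
            return True
        except socket.gaierror:
            # Resolution failed (for example, a transient "try again" error).
            if attempt < attempts:
                sleep(delay)
    return False
```

A wrapper script could call `wait_for_dns("db02", 4568)` and only start the joiner when it returns True, with the hostname and port taken from the cluster described in this report.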


      Related Issues

      • MDEV-33349 — relates to (crash when a new node attempts to join the Galera cluster)
      • MDEV-26295 — relates to (instance crash, signal 11 segfault)

      Attachments

      • docker-compose.yaml — minimal 3-node reproducer (attached)


          People

            seppo Seppo Jaakola
            mangeshc3 Mangesh Chaudhari