[MDEV-39243] Galera joiner node segfaults in libgalera_smm.so during abort after IST donor fails DNS resolution - Jira

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 11.4.10
Fix Version/s: 11.4, 11.8
Component/s: Galera SST
Labels: None
Environment: Linux 6.6.32, x86-64, Docker containers (Podman 5.4.1), Ubuntu Noble 24.04 base image

Description

Summary
A Galera joiner node (db02) segfaults with signal 11 inside libgalera_smm.so
during its shutdown/abort sequence after IST fails because of a transient DNS
resolution failure on the donor side. The crash occurs in the error-handling
path, not during normal operation. The node should exit cleanly with a non-zero
exit code so that the container runtime can restart it and retry. Instead, it
segfaults with exit code 139.
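
For readers less familiar with the exit-code convention: shells and most container runtimes report a process killed by signal N as status 128 + N, so 139 corresponds to signal 11. The snippet below is only an illustrative sketch of that arithmetic; the helper name `describe_exit_code` is made up for this example, and the signal name lookup assumes a Linux host.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Describe an exit status, decoding the 128 + N signal convention."""
    if code > 128:
        signum = code - 128
        try:
            # signal.Signals maps a signal number to its symbolic name.
            name = signal.Signals(signum).name
        except ValueError:
            return f"killed by unknown signal {signum}"
        return f"killed by {name} (signal {signum})"
    return f"exited with status {code}"


# Exit code observed for db02 in this report.
print(describe_exit_code(139))
```

A clean abort in the error path would be expected to yield a small non-zero status instead of a value above 128.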

Environment

Field	Value
MariaDB	11.4.10-MariaDB-ubu2404
Source revision	054a893f1645b77e52a329a7fc8cf614eebd1fad
Galera	26.4.25 (r7387a566) by Codership Oy
OS (container)	Ubuntu Noble 24.04
Kernel (host)	Linux 6.6.32 x86-64
Container runtime	Podman 5.4.1 via Docker Compose
Cluster size	3-node Galera cluster (db00, db01, db02)
SST method	mariabackup
IST port	4568 (default, no custom configuration applied)

Steps to Reproduce

1. Set up a 3-node Galera cluster using the mariadb:11.4 Docker image with
   --wsrep-sst-method=mariabackup (see attached docker-compose.yaml).
2. Bootstrap db00 with --wsrep-cluster-address=gcomm:// as the single seed node.
3. Start db01. It joins via SST from db00; this succeeds normally.
4. Start db02 while DNS resolution for the hostname db02 is transiently
   unavailable from db01's perspective. This occurs naturally in containerized
   environments where DNS propagation has a slight delay after container
   creation. No special workload or manual network manipulation is required.
5. db01 is selected as IST donor for db02.
6. IST fails: db01 cannot resolve tcp://db02:4568.
7. db02 receives the failure notification and enters its abort/shutdown path.
8. db02 segfaults (signal 11). Exit code 139.

Reproducibility: Consistent in containerized environments with sequential
container startup. No special workload is required; the crash occurs during
cluster formation only.

Expected Result

db02 logs the IST failure, shuts down cleanly with a non-zero exit code, and
is restarted by the container runtime to retry joining (at which point DNS has
typically propagated and the join succeeds).

Actual Result

db02 crashes with signal 11 (SIGSEGV). Exit code 139. The container runtime
restarts the process but the crash itself represents incorrect behavior — an
abort in an error-handling path should never produce a segfault.

Log Sequence

db02 — begins join:

2026-03-07 12:27:55 2 [Note] WSREP: State transfer required:
    Group state: fe29eb94-1a20-11f1-8cce-db564dd310fd:3
    Local state: 00000000-0000-0000-0000-000000000000:-1
2026-03-07 12:27:55 2 [Note] WSREP: Server status change connected -> joiner
2026-03-07 12:27:55 0 [Note] WSREP: Running: 'wsrep_sst_mariabackup --role joiner ...'

db01 — IST sender fails DNS resolution for db02:

2026-03-07 12:28:15 2 [Warning] WSREP: IST failed: IST sender, failed to connect 'tcp://db02:4568':
    Failed to connect 'tcp://db02:4568': resolve: Host not found (non-authoritative), try again later:
    System error: 2 (No such file or directory)
    at ./galerautils/src/gu_asio_stream_react.cpp:connect():229
    at ./galera/src/ist.cpp:Sender():618
2026-03-07 12:28:15 0 [Warning] WSREP: 0.0 (db01): State transfer to 1.0 (db02) failed:
    No such file or directory

db02 — receives failure notification and aborts:

2026-03-07 12:28:15 0 [Warning] WSREP: 0.0 (db01): State transfer to 1.0 (db02) failed:
    No such file or directory
2026-03-07 12:28:15 0 [ERROR] WSREP: ./gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():1285:
    Will never receive state. Need to abort.
2026-03-07 12:28:15 0 [Note] WSREP: gcomm: terminating thread
2026-03-07 12:28:15 0 [Note] WSREP: gcomm: closing backend
2026-03-07 12:28:15 0 [Note] WSREP: gcomm: closed
2026-03-07 12:28:15 0 [Note] WSREP: mariadbd: Terminated.

db02 — segfault:

260307 12:28:15 [ERROR] mariadbd got signal 11 ;
Server version: 11.4.10-MariaDB-ubu2404
Source revision: 054a893f1645b77e52a329a7fc8cf614eebd1fad
Thread pointer: 0x0
stack_bottom = 0x0 thread_stack 0x49000
mariadbd(my_print_stacktrace+0x30)[0x5592ed49f520]
mariadbd(handle_fatal_signal+0x2a1)[0x5592ed023291]
/lib/x86_64-linux-gnu/libc.so.6(+0x45330)[0x7f59805eb330]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x182)[0x7f59805ce9a2]
/usr/lib/galera/libgalera_smm.so(+0x16a097)[0x7f597fd5b097]
/usr/lib/galera/libgalera_smm.so(+0xab528)[0x7f597fc9c528]
/lib/x86_64-linux-gnu/libc.so.6(+0x9caa4)[0x7f5980642aa4]
/lib/x86_64-linux-gnu/libc.so.6(__clone+0x44)[0x7f59806cfa64]

Analysis

The call chain leading to the crash is:

1. gcs_group_handle_join_msg() (gcs_group.cpp, line 1285) determines that the
   joiner will never receive state and calls abort().
2. In the backtrace, abort() is called from inside libgalera_smm.so (return
   address at offset +0x16a097), and signal 11 is raised while execution is
   still inside libc's abort() (abort+0x182).
3. The crash output shows Thread pointer: 0x0, which suggests the abort is
   initiated before a THD (thread) context has been fully established on the
   joiner side.
4. The signal handler or abort cleanup path reached from libgalera_smm.so
   appears not to handle the null-THD case, causing a second fault during the
   abort itself.

This is a hypothesis based on the crash output. Confirming the exact
dereference site would require a debug build and a gdb session. The primary
assertion is: Galera's abort path in the IST failure case does not degrade
gracefully when thread context is uninitialized.
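
As a first step short of a gdb session, the module-relative offsets in the backtrace can be extracted and cross-checked. The sketch below parses frames in the format printed above; the names `parse_frame` and `load_base` are hypothetical helpers written for this illustration, not part of MariaDB or Galera tooling.

```python
import re

# Matches frames such as:
#   /usr/lib/galera/libgalera_smm.so(+0x16a097)[0x7f597fd5b097]
#   mariadbd(handle_fatal_signal+0x2a1)[0x5592ed023291]
FRAME_RE = re.compile(
    r"^(?P<module>[^()\[\]]+)"
    r"\((?P<symbol>[^+()]*)\+(?P<offset>0x[0-9a-fA-F]+)\)"
    r"\[(?P<addr>0x[0-9a-fA-F]+)\]$"
)


def parse_frame(line):
    """Split one backtrace line into module, symbol, offset and address."""
    m = FRAME_RE.match(line.strip())
    if m is None:
        return None
    return {
        "module": m["module"],
        "symbol": m["symbol"] or None,  # None when only an offset is printed
        "offset": int(m["offset"], 16),
        "addr": int(m["addr"], 16),
    }


def load_base(frame):
    """Module load address; only defined for symbol-less frames, whose
    offset is relative to the module rather than to a function."""
    if frame["symbol"] is None:
        return frame["addr"] - frame["offset"]
    return None


galera_frames = [
    "/usr/lib/galera/libgalera_smm.so(+0x16a097)[0x7f597fd5b097]",
    "/usr/lib/galera/libgalera_smm.so(+0xab528)[0x7f597fc9c528]",
]
for raw in galera_frames:
    frame = parse_frame(raw)
    print(hex(frame["offset"]), hex(load_base(frame)))
```

Both libgalera_smm.so frames resolve to the same load base, so the offsets are internally consistent. Assuming debug symbols matching Galera 26.4.25 are installed, the offsets could then be passed to a symbolizer, for example `addr2line -f -C -e /usr/lib/galera/libgalera_smm.so 0x16a097`.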

Fault Injection & Reproducibility (Antithesis)

This issue was discovered using Antithesis, a deterministic simulation testing
platform. Antithesis is able to:

- Reproduce this fault deterministically: the exact network conditions
  leading to the segfault can be replayed on demand, without relying on
  natural DNS propagation delays in a container environment.
- Identify the escalation point precisely: by injecting a network fault
  (ping delay) at specific points in the join sequence, Antithesis pinpoints
  where bug probability escalates, namely the window between
  gcs_group_handle_join_msg() determining state will never arrive and
  the joiner's THD context being fully initialized.
- Provide a reproducible test case to the MariaDB/Codership team: if the
  team has access to Antithesis or would like to collaborate, the fault
  injection scenario can be shared directly to assist in debugging and
  validating a fix.

The specific fault injected was a ping delay on the network path between
db01 and db02 during the IST handshake window, which causes db01's DNS
resolution of tcp://db02:4568 to fail non-authoritatively. This is exactly
the condition described in this report.

Workaround

None confirmed. Restarting db02 (which the container runtime does automatically
on exit code 139) typically succeeds once DNS has propagated — usually within
a few seconds of the first attempt. The crash itself is the bug; the retry
behavior is coincidentally correct.
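
One possible mitigation, not confirmed against this bug, would be to delay the join until the hostname resolves. The sketch below is a hypothetical pre-start check (the function `wait_for_dns` and its parameters are invented for illustration). Note the assumption it rests on: the failure in this report occurs in db01's resolver, so a check run from another container only helps if it shares db01's view of DNS.

```python
import socket
import time


def wait_for_dns(host, port, attempts=10, delay=1.0,
                 resolver=socket.getaddrinfo, sleep=time.sleep):
    """Return True once `host` resolves, False if every attempt fails.

    `resolver` and `sleep` are injectable so the retry logic can be
    exercised without real DNS or real waiting.
    """
    for attempt in range(1, attempts + 1):
        try:
            resolver(host, port)
            return True
        except socket.gaierror:
            # Transient failures such as EAI_AGAIN surface as gaierror.
            if attempt < attempts:
                sleep(delay)
    return False


if __name__ == "__main__":
    # Hostname and port taken from the failing IST address in this report.
    ok = wait_for_dns("db02", 4568)
    raise SystemExit(0 if ok else 1)
```

Even if such a check avoids the trigger, it does not address the crash in the abort path, which remains the subject of this report.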

Related Issues

~~MDEV-33349~~ — relates to (crash when a new node attempts to join the Galera cluster)
~~MDEV-26295~~ — relates to (instance crash, signal 11 segfault)

Attachments

docker-compose.yaml — minimal 3-node reproducer (attached)

docker-compose.yaml
5 kB
2026-04-02 06:09
