Details
-
Bug
-
Status: Open
-
Major
-
Resolution: Unresolved
-
11.4.10
-
None
-
None
-
Linux 6.6.32, x86-64, Docker containers (Podman 5.4.1), Ubuntu Noble 24.04 base image
Description
Summary
A Galera joiner node (db02) segfaults with signal 11 inside libgalera_smm.so
during its shutdown/abort sequence after IST fails due to a transient DNS
resolution failure on the donor side. The crash occurs in the error-handling
path, not during normal operation. The node should exit cleanly with a non-zero
error code and allow the container runtime to restart and retry — instead it
segfaults with exit code 139.
Environment
| Field | Value |
|---|---|
| MariaDB | 11.4.10-MariaDB-ubu2404 |
| Source revision | 054a893f1645b77e52a329a7fc8cf614eebd1fad |
| Galera | 26.4.25 (r7387a566) by Codership Oy |
| OS (container) | Ubuntu Noble 24.04 |
| Kernel (host) | Linux 6.6.32 x86-64 |
| Container runtime | Podman 5.4.1 via Docker Compose |
| Cluster size | 3-node Galera cluster (db00, db01, db02) |
| SST method | mariabackup |
| IST port | 4568 (default, no custom configuration applied) |
Steps to Reproduce
- Set up a 3-node Galera cluster using the mariadb:11.4 Docker image with
--wsrep-sst-method=mariabackup (see attached docker-compose.yaml).
- Bootstrap db00 with --wsrep-cluster-address=gcomm:// as the single seed node.
- Start db01. It joins via SST from db00; this succeeds normally.
- Start db02 while DNS resolution for the hostname db02 is transiently
unavailable from db01's perspective. This occurs naturally in containerized
environments where DNS propagation has a slight delay after container creation.
No special workload or manual network manipulation is required.
- db01 is selected as IST donor for db02.
- IST fails: db01 cannot resolve tcp://db02:4568.
- db02 receives the failure notification and enters its abort/shutdown path.
- db02 segfaults (signal 11). Exit code 139.
*Reproducibility:* Consistent in containerized environments with sequential
container startup. No special workload is required; the crash occurs during
cluster formation only.
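To confirm that the DNS precondition from the steps above actually holds at the moment db02 is started, a small check along the following lines could be run from inside the db01 container. This is a minimal sketch, not part of the attached reproducer: the helper name `can_resolve` and its `resolver` parameter are hypothetical, and the port defaults to the IST port 4568 listed in the Environment table.

```python
import socket


def can_resolve(host, port=4568, resolver=socket.getaddrinfo):
    """Return True if `host` resolves, False if name resolution fails.

    `resolver` is a hypothetical injection point so the check can be
    exercised without real DNS; by default it is socket.getaddrinfo.
    """
    try:
        resolver(host, port, proto=socket.IPPROTO_TCP)
        return True
    except socket.gaierror:
        # Covers "Host not found" as well as temporary failures
        # ("try again later"), which is the case seen in the donor log.
        return False
```

Calling `can_resolve("db02")` on db01 immediately before and a few seconds after db02 is created should show the transition from `False` to `True` if the transient resolution failure described here is present.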
Expected Result
db02 logs the IST failure, shuts down cleanly with a non-zero exit code, and
is restarted by the container runtime to retry joining (at which point DNS has
typically propagated and the join succeeds).
Actual Result
db02 crashes with signal 11 (SIGSEGV). Exit code 139. The container runtime
restarts the process but the crash itself represents incorrect behavior — an
abort in an error-handling path should never produce a segfault.
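The link between exit code 139 and signal 11 follows the common shell convention that a process killed by signal N is reported with status 128 + N. The sketch below only illustrates that arithmetic; the function name `describe_exit_status` is made up for this example, and the signal name lookup assumes a Linux host such as the one in this report.

```python
import signal


def describe_exit_status(code):
    """Explain a shell/container-style exit status in the range 0-255."""
    if code > 128:
        signum = code - 128  # 128 + N convention for "killed by signal N"
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    if code == 0:
        return "clean exit"
    return f"exited with error code {code}"


print(describe_exit_status(139))
```

A clean abort of the joiner would be expected to yield a plain non-zero error code rather than a status above 128.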
Log Sequence
db02 — begins join:
2026-03-07 12:27:55 2 [Note] WSREP: State transfer required:
Group state: fe29eb94-1a20-11f1-8cce-db564dd310fd:3
Local state: 00000000-0000-0000-0000-000000000000:-1
2026-03-07 12:27:55 2 [Note] WSREP: Server status change connected -> joiner
2026-03-07 12:27:55 0 [Note] WSREP: Running: 'wsrep_sst_mariabackup --role joiner ...'
db01 — IST sender fails DNS resolution for db02:
2026-03-07 12:28:15 2 [Warning] WSREP: IST failed: IST sender, failed to connect 'tcp://db02:4568':
Failed to connect 'tcp://db02:4568': resolve: Host not found (non-authoritative), try again later:
System error: 2 (No such file or directory)
at ./galerautils/src/gu_asio_stream_react.cpp:connect():229
at ./galera/src/ist.cpp:Sender():618
2026-03-07 12:28:15 0 [Warning] WSREP: 0.0 (db01): State transfer to 1.0 (db02) failed:
No such file or directory
db02 — receives failure notification and aborts:
2026-03-07 12:28:15 0 [Warning] WSREP: 0.0 (db01): State transfer to 1.0 (db02) failed:
No such file or directory
2026-03-07 12:28:15 0 [ERROR] WSREP: ./gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():1285:
Will never receive state. Need to abort.
2026-03-07 12:28:15 0 [Note] WSREP: gcomm: terminating thread
2026-03-07 12:28:15 0 [Note] WSREP: gcomm: closing backend
2026-03-07 12:28:15 0 [Note] WSREP: gcomm: closed
2026-03-07 12:28:15 0 [Note] WSREP: mariadbd: Terminated.
db02 — segfault:
260307 12:28:15 [ERROR] mariadbd got signal 11 ;

Server version: 11.4.10-MariaDB-ubu2404
Source revision: 054a893f1645b77e52a329a7fc8cf614eebd1fad

Thread pointer: 0x0
stack_bottom = 0x0 thread_stack 0x49000
mariadbd(my_print_stacktrace+0x30)[0x5592ed49f520]
mariadbd(handle_fatal_signal+0x2a1)[0x5592ed023291]
/lib/x86_64-linux-gnu/libc.so.6(+0x45330)[0x7f59805eb330]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x182)[0x7f59805ce9a2]
/usr/lib/galera/libgalera_smm.so(+0x16a097)[0x7f597fd5b097]
/usr/lib/galera/libgalera_smm.so(+0xab528)[0x7f597fc9c528]
/lib/x86_64-linux-gnu/libc.so.6(+0x9caa4)[0x7f5980642aa4]
/lib/x86_64-linux-gnu/libc.so.6(__clone+0x44)[0x7f59806cfa64]
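The libgalera_smm.so frames in the backtrace carry only module-relative offsets, so they need to be symbolized before the exact call site can be named. The sketch below is one possible way to extract those offsets and print matching `addr2line` invocations. It assumes the `module(symbol+offset)[address]` frame layout shown in the trace and that debug symbols for the exact Galera build are available where the commands are run; the helper names `parse_frames` and `addr2line_commands` are hypothetical.

```python
import re

# Matches frames such as "/usr/lib/galera/libgalera_smm.so(+0x16a097)[0x7f597fd5b097]"
FRAME_RE = re.compile(
    r"^(?P<module>\S+?)\((?P<symbol>[^)]*)\)\[(?P<addr>0x[0-9a-fA-F]+)\]$"
)

TRACE = """\
/lib/x86_64-linux-gnu/libc.so.6(abort+0x182)[0x7f59805ce9a2]
/usr/lib/galera/libgalera_smm.so(+0x16a097)[0x7f597fd5b097]
/usr/lib/galera/libgalera_smm.so(+0xab528)[0x7f597fc9c528]
"""


def parse_frames(trace):
    """Return (module, module_offset_or_None, absolute_address) per frame."""
    frames = []
    for line in trace.splitlines():
        match = FRAME_RE.match(line.strip())
        if not match:
            continue
        symbol = match.group("symbol")
        # "+0x..." is an offset from the module base; "name+0x..." is
        # relative to a function and is left unresolved here.
        offset = int(symbol[1:], 16) if symbol.startswith("+0x") else None
        frames.append((match.group("module"), offset, int(match.group("addr"), 16)))
    return frames


def addr2line_commands(frames, module_suffix="libgalera_smm.so"):
    """Build addr2line command lines for frames in the given module."""
    return [
        f"addr2line -f -C -e {module} {offset:#x}"
        for module, offset, _ in frames
        if module.endswith(module_suffix) and offset is not None
    ]


for command in addr2line_commands(parse_frames(TRACE)):
    print(command)
```

Subtracting each offset from its absolute address gives the same load base for both Galera frames, which is a quick sanity check that the offsets were read correctly.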
Analysis
The call chain leading to the crash is:
- gcs_group_handle_join_msg() at line 1285 determines the joiner will never
receive state and calls abort().
- The abort triggers Galera's signal handler inside libgalera_smm.so
(offset +0x16a097).
- The crash output shows Thread pointer: 0x0, indicating the abort is
initiated before a THD (thread) context has been fully established on the
joiner side.
- The signal handler or abort cleanup path inside libgalera_smm.so appears
to not handle the null-THD case, causing a second fault inside the handler itself.
This is a hypothesis based on the crash output. Confirming the exact
dereference site would require a debug build and a gdb session. The primary
assertion is: Galera's abort path in the IST failure case does not degrade
gracefully when thread context is uninitialized.
Fault Injection & Reproducibility (Antithesis)
This issue was discovered using Antithesis, a
deterministic simulation testing platform. Antithesis is able to:
- Reproduce this fault deterministically — the exact network conditions
leading to the segfault can be replayed on demand, without relying on
natural DNS propagation delays in a container environment.
- Identify the escalation point precisely — by injecting a network fault
(ping delay) at specific points in the join sequence, Antithesis pinpoints
where bug probability escalates: the window between
gcs_group_handle_join_msg() determining state will never arrive and
the joiner's THD context being fully initialized.
- Provide a reproducible test case to the MariaDB/Codership team — if the
team has access to Antithesis or would like to collaborate, the fault
injection scenario can be shared directly to assist in debugging and
validating a fix.
The specific fault injected was a *ping delay on the network path between
db01 and db02* during the IST handshake window, which causes db01's DNS
resolution of tcp://db02:4568 to fail non-authoritatively — exactly
the condition described in this report.
Workaround
None confirmed. Restarting db02 (which the container runtime does automatically
on exit code 139) typically succeeds once DNS has propagated — usually within
a few seconds of the first attempt. The crash itself is the bug; the retry
behavior is coincidentally correct.
Related Issues
- MDEV-33349: relates to (crash when a new node attempts to join the Galera cluster)
- MDEV-26295: relates to (instance crash, signal 11 segfault)
Attachments
- docker-compose.yaml — minimal 3-node reproducer (attached)