Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version: 10.6.24
- Fix Version: None
- Component: None
- Epic: Q3/2026 Galera Maintenance
Description
In a multi-segment Galera cluster, when a single node fails ungracefully, all remaining nodes enter NON_PRIM state despite having quorum. The cluster cannot self-heal and requires manual
intervention (pc.bootstrap=YES) or the failed node to return.
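For reference, the manual intervention referred to above is the standard Galera primary-component override, issued on exactly one surviving node:

```sql
-- Run on ONE surviving node only: forces that node's component to become
-- primary regardless of quorum; the other nodes then rejoin it.
SET GLOBAL wsrep_provider_options = 'pc.bootstrap=YES';
```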
Environment
2022 incident:
- 5-node cluster (4 data nodes + 1 arbitrator)
- 2 segments (2 datacenters)
- MariaDB 10.4.22, Galera 26.4.x
2026 incident:
- 7-node cluster (6 data nodes + 1 active arbitrator)
- 3 segments (3 datacenters)
- MariaDB 10.6.24, Galera 26.4.18
Configuration
wsrep_provider_options="evs.suspect_timeout=PT1M; evs.inactive_timeout=PT1M; evs.install_timeout=PT7.5S; evs.max_install_timeouts=3; gmcast.segment=1|2|3"
(evs.install_timeout is at its default value; gmcast.segment is 1, 2, or 3 depending on the node's datacenter)
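Note that gmcast.segment is a per-node setting; the 1|2|3 above denotes the per-datacenter value. For example, on a dc2 node the effective line would be:

```ini
# On each dc2 node (dc1 nodes use gmcast.segment=1, dc3 nodes use gmcast.segment=3)
wsrep_provider_options="evs.suspect_timeout=PT1M; evs.inactive_timeout=PT1M; gmcast.segment=2"
```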
What led to the issue:
- Configure multi-segment Galera cluster with nodes across datacenters
- Ungracefully shut down one node (power off or network isolation)
- Observe all remaining nodes enter NON_PRIM state
Actual Behavior
From error logs (2022 incident, node1-dc2):
2022-09-05 12:16:26 [Note] WSREP: connection to peer xxxxxxxx-xxxx with addr tcp://10.x.x.x:4567 timed out
2022-09-05 12:16:57 [Warning] WSREP: evs::proto(xxxxxxxx-xxxx, GATHER, view_id(REG,xxxxxxxx-xxxx,309)) install timer expired
2022-09-05 12:16:57 [Note] WSREP: no install message received
2022-09-05 12:16:57 [Note] WSREP: view(view_id(NON_PRIM,xxxxxxxx-xxxx,309) memb
2022-09-05 12:16:57 [Note] WSREP: New COMPONENT: primary = no
All 4 surviving nodes went NON_PRIM simultaneously at 12:16:57:
┌─────────────┬──────────┬──────────────┐
│ Node │ State │ Members seen │
├─────────────┼──────────┼──────────────┤
│ node1-dc2 │ NON_PRIM │ 2 │
├─────────────┼──────────┼──────────────┤
│ node2-dc2 │ NON_PRIM │ 2 │
├─────────────┼──────────┼──────────────┤
│ node2-dc1 │ NON_PRIM │ 1 │
├─────────────┼──────────┼──────────────┤
│ Arbitrator │ NON_PRIM │ 1 │
└─────────────┴──────────┴──────────────┘
At 12:16:59, nodes re-discovered each other forming a 4-node NON_PRIM group, but could not transition to PRIM despite having 4 of 5 votes (80% quorum).
Cluster only recovered at 12:20:14 when the failed node rejoined.
—
From error logs (2026 incident, primary node):
Network link to one datacenter (segment 3) was lost. 5 of 7 nodes remained reachable but cluster went NON_PRIM.
2026-02-12 21:37:47 [Note] WSREP: connection to peer xxxxxxxx-xxxx timed out
2026-02-12 21:37:47 [Note] WSREP: turning message relay requesting on, nonlive peers: tcp://10.x.x.x:4567
2026-02-12 21:37:48 [Note] WSREP: connection to peer xxxxxxxx-xxxx timed out
2026-02-12 21:37:49 [Note] WSREP: reconnecting to xxxxxxxx-xxxx
Cluster remained in partitioned state for ~17 minutes until automatic recovery:
2026-02-12 21:54:25 [Note] WSREP: re-bootstrapping prim from partitioned components
2026-02-12 21:54:25 [Note] WSREP: view(view_id(PRIM,...,409) memb
2026-02-12 21:54:25 [Note] WSREP: New COMPONENT: primary = yes
Timeline: 21:37:47 (link lost) → 21:54:25 (auto-recovery) = ~17 minutes of unnecessary downtime
In this case, 5 of 7 nodes (71% quorum) should have maintained PRIMARY status immediately.
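A quick sanity check of the quorum arithmetic for both incidents, using the simplified PC-layer rule that a partition is primary when its weight strictly exceeds half of the last primary component's weight (all nodes at the default pc.weight=1):

```python
from datetime import datetime

def has_quorum(surviving: int, last_prim: int) -> bool:
    """Simplified Galera PC rule: strict majority of the last primary view."""
    return 2 * surviving > last_prim

print(has_quorum(4, 5))   # 2022 incident: 80% -> True, PRIM expected
print(has_quorum(5, 7))   # 2026 incident: ~71% -> True, PRIM expected

# Downtime in the 2026 incident: link lost -> automatic recovery
down = datetime(2026, 2, 12, 21, 54, 25) - datetime(2026, 2, 12, 21, 37, 47)
print(down)               # 0:16:38, i.e. ~17 minutes
```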
Expected Behavior
With 4 of 5 nodes surviving (80% > 50% quorum), the cluster should:
- Detect the single node failure
- Form a new PRIMARY view with the 4 surviving nodes
- Continue operating without manual intervention
Source code that appears to be involved, and the supposed behaviour (needs engineering review):
The issue occurs in gcomm/src/evs_proto.cpp in handle_install_timer() (line 677+):
void gcomm::evs::Proto::handle_install_timer()
{
    log_info << self_string() << " install timer expired";
    // ...
    if (install_timeout_count_ < max_install_timeouts_)
    {
        // Before reaching max_install_timeouts, declare nodes with a
        // missing or inconsistent JOIN message inactive.
        for (NodeMap::iterator i = known_.begin(); i != known_.end(); ++i)
        {
            const UUID& node_uuid(NodeMap::key(i));
            const Node& node(NodeMap::value(i));
            if (node_uuid != uuid() &&
                (node.join_message() == 0 ||
                 consensus_.is_consistent(*node.join_message()) == false))
            {
                set_inactive(node_uuid);
            }
        }
    }
    // ...
}
When is_consistent() returns false on the first install timeout, nodes are immediately marked inactive. The max_install_timeouts=3 retry mechanism only applies when consensus succeeds but
install message delivery fails.
In multi-segment deployments, inter-segment JOIN message latency during GATHER phase causes is_consistent() to fail, triggering immediate partition before quorum calculation occurs at the PC
layer.
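The suspected mechanism can be sketched as follows. This is a hypothetical simplification for illustration, not the actual Galera implementation:

```python
# Models the reported behaviour of handle_install_timer(): peers whose JOIN
# message is missing or inconsistent are dropped on the FIRST timeout,
# before the PC layer ever computes quorum.
MAX_INSTALL_TIMEOUTS = 3

def on_install_timer(timeout_count, peers):
    """Return the peers still considered active after an install timeout.

    peers is a list of (name, join_consistent) tuples.
    """
    if timeout_count < MAX_INSTALL_TIMEOUTS:
        # Inconsistent/missing JOIN -> marked inactive immediately; the
        # retry budget never applies to this path.
        return [name for name, join_ok in peers if join_ok]
    return [name for name, _ in peers]

# Cross-segment JOINs arrive late, so remote peers fail the consistency
# check on the very first timeout (timeout_count == 0):
peers = [("node2-dc2", True), ("node2-dc1", False), ("arbitrator", False)]
print(on_install_timer(0, peers))  # ['node2-dc2'] -> view shrinks below quorum
```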
Related Issues
- Codership/Galera GitHub Issue #638: Same EVS consensus failure pattern, different outcome (node FATAL exit vs NON_PRIM)
- Both issues show nodes marking each other as operational=false, suspected=true during GATHER despite being reachable