MariaDB Server / MDEV-38920

EVS consensus failure causes all surviving nodes to enter NON_PRIM state after single node failure in multi-segment cluster


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 10.6.24
    • Fix Version/s: None
    • Component/s: Galera
    • Labels: None
    • Sprint: Q3/2026 Galera Maintenance

    Description

      In a multi-segment Galera cluster, when a single node fails ungracefully, all remaining nodes enter NON_PRIM state despite having quorum. The cluster cannot self-heal and requires manual
      intervention (pc.bootstrap=YES) or the failed node to return.
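For reference, the manual intervention mentioned above is the documented Galera primary-component bootstrap, issued on exactly one of the surviving nodes (invocation details such as credentials are omitted here):

```shell
# Run on ONE surviving node only: forces that node's component to
# re-form the primary component without waiting for the failed node.
mysql -e "SET GLOBAL wsrep_provider_options='pc.bootstrap=YES';"
```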

      Environment

      2022 incident:

      • 5-node cluster (4 data nodes + 1 arbitrator)
      • 2 segments (2 datacenters)
      • MariaDB 10.4.22, Galera 26.4.x

      2026 incident:

      • 7-node cluster (6 data nodes + 1 active arbitrator)
      • 3 segments (3 datacenters)
      • MariaDB 10.6.24, Galera 26.4.18

      Configuration

      wsrep_provider_options="evs.suspect_timeout=PT1M; evs.inactive_timeout=PT1M; evs.install_timeout=PT7.5S (default); evs.max_install_timeouts=3; gmcast.segment=1|2|3;"

      (gmcast.segment is set per node: nodes in the first datacenter use gmcast.segment=1, and so on.)

      What led to the issue:

      • Configure a multi-segment Galera cluster with nodes spread across datacenters
      • Ungracefully shut down one node (power off or network isolation)
      • Observe that all remaining nodes enter NON_PRIM state

      Actual Behavior

      From error logs (2022 incident, node1-dc2):

      2022-09-05 12:16:26 [Note] WSREP: connection to peer xxxxxxxx-xxxx with addr tcp://10.x.x.x:4567 timed out
      2022-09-05 12:16:57 [Warning] WSREP: evs::proto(xxxxxxxx-xxxx, GATHER, view_id(REG,xxxxxxxx-xxxx,309)) install timer expired
      2022-09-05 12:16:57 [Note] WSREP: no install message received
      2022-09-05 12:16:57 [Note] WSREP: view(view_id(NON_PRIM,xxxxxxxx-xxxx,309) memb

      { ... }

      2022-09-05 12:16:57 [Note] WSREP: New COMPONENT: primary = no

      All 4 surviving nodes went NON_PRIM simultaneously at 12:16:57:

      ┌─────────────┬──────────┬──────────────┐
      │ Node        │ State    │ Members seen │
      ├─────────────┼──────────┼──────────────┤
      │ node1-dc2   │ NON_PRIM │ 2            │
      │ node2-dc2   │ NON_PRIM │ 2            │
      │ node2-dc1   │ NON_PRIM │ 1            │
      │ Arbitrator  │ NON_PRIM │ 1            │
      └─────────────┴──────────┴──────────────┘

      At 12:16:59, nodes re-discovered each other forming a 4-node NON_PRIM group, but could not transition to PRIM despite having 4 of 5 votes (80% quorum).

      The cluster recovered only at 12:20:14, when the failed node rejoined.

      From error logs (2026 incident, primary node):

      The network link to one datacenter (segment 3) was lost. 5 of 7 nodes remained reachable, but the cluster went NON_PRIM.

      2026-02-12 21:37:47 [Note] WSREP: connection to peer xxxxxxxx-xxxx timed out
      2026-02-12 21:37:47 [Note] WSREP: turning message relay requesting on, nonlive peers: tcp://10.x.x.x:4567
      2026-02-12 21:37:48 [Note] WSREP: connection to peer xxxxxxxx-xxxx timed out
      2026-02-12 21:37:49 [Note] WSREP: reconnecting to xxxxxxxx-xxxx

      Cluster remained in partitioned state for ~17 minutes until automatic recovery:

      2026-02-12 21:54:25 [Note] WSREP: re-bootstrapping prim from partitioned components
      2026-02-12 21:54:25 [Note] WSREP: view(view_id(PRIM,...,409) memb

      { 7 nodes }

      2026-02-12 21:54:25 [Note] WSREP: New COMPONENT: primary = yes

      Timeline: 21:37:47 (link lost) → 21:54:25 (auto-recovery) = ~17 minutes of unnecessary downtime

      In this case, 5 of 7 nodes (71% quorum) should have maintained PRIMARY status immediately.

      Expected Behavior

      With a strict majority surviving (4 of 5 nodes = 80% in 2022; 5 of 7 = ~71% in 2026), the cluster should:

      • Detect the single node failure
      • Form a new PRIMARY view with the 4 surviving nodes
      • Continue operating without manual intervention

      Source code that appears to be involved, and the supposed behaviour (needs engineering review):

      The issue occurs in gcomm/src/evs_proto.cpp in handle_install_timer() (line 677+):

      void gcomm::evs::Proto::handle_install_timer()
      {
          log_info << self_string() << " install timer expired";

          if (install_timeout_count_ < max_install_timeouts_)
          {
              for (NodeMap::iterator i = known_.begin(); i != known_.end(); ++i)
              {
                  const Node& node(NodeMap::value(i)); // binding elided in the original excerpt
                  if (node.join_message() == 0 ||
                      consensus_.is_consistent(*node.join_message()) == false)
                  {
                      set_inactive(NodeMap::key(i));
                  }
              }
          }
      }

      When is_consistent() returns false on the first install timeout, nodes are immediately marked inactive. The max_install_timeouts=3 retry mechanism only applies when consensus succeeds but
      install message delivery fails.

      In multi-segment deployments, inter-segment JOIN message latency during GATHER phase causes is_consistent() to fail, triggering immediate partition before quorum calculation occurs at the PC
      layer.

      Related Issues

      • Codership/Galera GitHub Issue #638: Same EVS consensus failure pattern, different outcome (node FATAL exit vs NON_PRIM)
      • Both issues show nodes marking each other as operational=false, suspected=true during GATHER despite being reachable


          People

            Seppo Jaakola (seppo)
            Claudio Nanni (claudio.nanni)
            Votes: 1
            Watchers: 3
