MariaDB Server / MDEV-38920

EVS consensus failure causes all surviving nodes to enter NON_PRIM state after single node failure in multi-segment cluster


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 10.6.24
    • Fix Version/s: None
    • Component/s: Galera
    • Labels: None
    • Sprint: Q3/2026 Galera Maintenance

    Description

      In a multi-segment Galera cluster, when a single node fails ungracefully, all remaining nodes enter NON_PRIM state despite having quorum. The cluster cannot self-heal and requires manual
      intervention (pc.bootstrap=YES) or the failed node to return.
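For reference, the manual intervention mentioned above is the documented Galera primary-component bootstrap, issued on exactly one of the surviving nodes (invocation details such as credentials are omitted here):

```shell
# Run on ONE surviving node only: forces that node's component to
# re-form the primary component without waiting for the failed node.
mysql -e "SET GLOBAL wsrep_provider_options='pc.bootstrap=YES';"
```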

      Environment

      2022 incident:

      • 5-node cluster (4 data nodes + 1 arbitrator)
      • 2 segments (2 datacenters)
      • MariaDB 10.4.22, Galera 26.4.x

      2026 incident:

      • 7-node cluster (6 data nodes + 1 active arbitrator)
      • 3 segments (3 datacenters)
      • MariaDB 10.6.24, Galera 26.4.18

      Configuration

      wsrep_provider_options="evs.suspect_timeout=PT1M; evs.inactive_timeout=PT1M; evs.install_timeout=PT7.5S (default); evs.max_install_timeouts=3; gmcast.segment=1|2|3;"

      (gmcast.segment is set per node: nodes in the first datacenter use gmcast.segment=1, and so on.)

      What led to the issue:

      • Configure a multi-segment Galera cluster with nodes spread across datacenters
      • Ungracefully shut down one node (power off or network isolation)
      • Observe that all remaining nodes enter NON_PRIM state

      Actual Behavior

      From error logs (2022 incident, node1-dc2):

      2022-09-05 12:16:26 [Note] WSREP: connection to peer xxxxxxxx-xxxx with addr tcp://10.x.x.x:4567 timed out
      2022-09-05 12:16:57 [Warning] WSREP: evs::proto(xxxxxxxx-xxxx, GATHER, view_id(REG,xxxxxxxx-xxxx,309)) install timer expired
      2022-09-05 12:16:57 [Note] WSREP: no install message received
      2022-09-05 12:16:57 [Note] WSREP: view(view_id(NON_PRIM,xxxxxxxx-xxxx,309) memb

      { ... }

      2022-09-05 12:16:57 [Note] WSREP: New COMPONENT: primary = no

      All 4 surviving nodes went NON_PRIM simultaneously at 12:16:57:

      ┌─────────────┬──────────┬──────────────┐
      │ Node        │ State    │ Members seen │
      ├─────────────┼──────────┼──────────────┤
      │ node1-dc2   │ NON_PRIM │ 2            │
      │ node2-dc2   │ NON_PRIM │ 2            │
      │ node2-dc1   │ NON_PRIM │ 1            │
      │ Arbitrator  │ NON_PRIM │ 1            │
      └─────────────┴──────────┴──────────────┘

      At 12:16:59, nodes re-discovered each other forming a 4-node NON_PRIM group, but could not transition to PRIM despite having 4 of 5 votes (80% quorum).

      The cluster recovered only at 12:20:14, when the failed node rejoined.

      From error logs (2026 incident, primary node):

      The network link to one datacenter (segment 3) was lost. 5 of 7 nodes remained reachable, but the cluster went NON_PRIM.

      2026-02-12 21:37:47 [Note] WSREP: connection to peer xxxxxxxx-xxxx timed out
      2026-02-12 21:37:47 [Note] WSREP: turning message relay requesting on, nonlive peers: tcp://10.x.x.x:4567
      2026-02-12 21:37:48 [Note] WSREP: connection to peer xxxxxxxx-xxxx timed out
      2026-02-12 21:37:49 [Note] WSREP: reconnecting to xxxxxxxx-xxxx

      Cluster remained in partitioned state for ~17 minutes until automatic recovery:

      2026-02-12 21:54:25 [Note] WSREP: re-bootstrapping prim from partitioned components
      2026-02-12 21:54:25 [Note] WSREP: view(view_id(PRIM,...,409) memb

      { 7 nodes }

      2026-02-12 21:54:25 [Note] WSREP: New COMPONENT: primary = yes

      Timeline: 21:37:47 (link lost) → 21:54:25 (auto-recovery) = ~17 minutes of unnecessary downtime

      In this case, 5 of 7 nodes (71% quorum) should have maintained PRIMARY status immediately.

      Expected Behavior

      With a strict majority surviving (4 of 5 nodes = 80% in 2022; 5 of 7 = ~71% in 2026), the cluster should:

      • Detect the single node failure
      • Form a new PRIMARY view with the 4 surviving nodes
      • Continue operating without manual intervention

      Source code that appears to be involved, and the supposed behaviour (needs engineering review):

      The issue occurs in gcomm/src/evs_proto.cpp in handle_install_timer() (line 677+):

      void gcomm::evs::Proto::handle_install_timer()
      {
          log_info << self_string() << " install timer expired";

          if (install_timeout_count_ < max_install_timeouts_)
          {
              for (NodeMap::iterator i = known_.begin(); i != known_.end(); ++i)
              {
                  const Node& node(NodeMap::value(i)); // binding elided in the original excerpt
                  if (node.join_message() == 0 ||
                      consensus_.is_consistent(*node.join_message()) == false)
                  {
                      set_inactive(NodeMap::key(i));
                  }
              }
          }
      }

      When is_consistent() returns false on the first install timeout, nodes are immediately marked inactive. The max_install_timeouts=3 retry mechanism only applies when consensus succeeds but
      install message delivery fails.

      In multi-segment deployments, inter-segment JOIN message latency during GATHER phase causes is_consistent() to fail, triggering immediate partition before quorum calculation occurs at the PC
      layer.

      Related Issues

      • Codership/Galera GitHub Issue #638: Same EVS consensus failure pattern, different outcome (node FATAL exit vs NON_PRIM)
      • Both issues show nodes marking each other as operational=false, suspected=true during GATHER despite being reachable


          People

            Seppo Jaakola (seppo)
            Claudio Nanni (claudio.nanni)
            Votes: 1
            Watchers: 3
