[MDEV-36140] MariaDB Galera node unable to join the primary component if it loses connectivity with one of the nodes in the primary component - Jira

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 11.4.4
Fix Version/s: 11.4
Component/s: Galera
Labels: None

Description

We are running MariaDB Galera 11.4.4, but the same behavior is also seen in the 10.6.4 release.

We have a 3-node cluster and ran into a situation where connectivity was lost between only two of the three nodes. Call the nodes N1, N2 and N3: the N1 - N3 link went down, while the N1 - N2 and N2 - N3 links remained intact.

While the link was down, N1 was restarted. It now fails to rejoin the cluster and just keeps restarting.

N1 - 11.127.4.37
N2 - 11.127.5.37
N3 - 11.127.6.37
Here is the cluster address parameter on 11.127.4.37:
wsrep_cluster_address = gcomm://11.127.4.37,11.127.5.37,11.127.6.37
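
To confirm which peers in that address list are reachable from a given node, a quick TCP probe can help. The following is a minimal sketch, not part of Galera or MariaDB: the helper names `parse_cluster_address` and `probe` are hypothetical, the default port 4567 is taken from the logs in this report, and IPv6 literals are not handled.

```python
import socket

DEFAULT_GALERA_PORT = 4567  # assumption: the port that appears in the logs of this report


def parse_cluster_address(address):
    """Split a gcomm:// address into a list of (host, port) pairs."""
    prefix = "gcomm://"
    if not address.startswith(prefix):
        raise ValueError(f"not a gcomm address: {address!r}")
    # Ignore any '?option=value' suffix after the host list.
    body = address[len(prefix):].split("?", 1)[0]
    peers = []
    for item in body.split(","):
        item = item.strip()
        if not item:
            continue
        host, sep, port = item.partition(":")
        peers.append((host, int(port) if sep else DEFAULT_GALERA_PORT))
    return peers


def probe(peers, timeout=3.0):
    """Try a plain TCP connect to each peer; return {(host, port): reachable}."""
    results = {}
    for host, port in peers:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[(host, port)] = True
        except OSError:
            results[(host, port)] = False
    return results
```

Running `probe(parse_cluster_address("gcomm://11.127.4.37,11.127.5.37,11.127.6.37"))` on each node in turn gives a per-node picture of which links are up. A successful TCP connect only shows that the port is reachable, not that the peer is in the primary component.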

N2 and N3 remain connected to each other, and the cluster membership shows a 2-node cluster. When N1 restarts, it tries to connect to both N2 and N3. Because the N1 - N3 link is down, that connection fails, but N1 is able to connect to N2.
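
The partial partition described above can be written down as a small model. This is only an illustration of the topology using plain counting; it is not Galera's group communication or quorum protocol, and the names `NODES`, `BROKEN_LINKS`, `directly_reachable` and `has_majority` are hypothetical.

```python
# Node labels and addresses as given in this report.
NODES = {"N1": "11.127.4.37", "N2": "11.127.5.37", "N3": "11.127.6.37"}

# The single broken link: N1 <-> N3.
BROKEN_LINKS = {frozenset({"N1", "N3"})}


def directly_reachable(node):
    """Return the set of nodes (including itself) that `node` can reach directly."""
    return {
        other
        for other in NODES
        if other == node or frozenset({node, other}) not in BROKEN_LINKS
    }


def has_majority(group, cluster_size):
    """Simple count-based majority check (more than half of the cluster)."""
    return len(group) > cluster_size / 2


for name in NODES:
    group = directly_reachable(name)
    print(name, sorted(group), "majority:", has_majority(group, len(NODES)))
```

By simple counting, every node still reaches a majority directly (N1 reaches itself and N2), yet N1 ends up in a non-primary view. That gap is what this report asks about.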

N1 then fails with this error: failed to open gcomm backend connection: 110: failed to reach primary view

Here are a few related logs:
*******************************************************************************************
2025-02-20 11:48:26 0 [Note] WSREP: (d4c47019-bc8c, 'tcp://0.0.0.0:4567') connection established to bdf18832-92a5 tcp://11.127.5.37:4567
2025-02-20 11:48:29 0 [Note] WSREP: (d4c47019-bc8c, 'tcp://0.0.0.0:4567') turning message relay requesting off
2025-02-20 11:48:30 0 [Note] WSREP: (d4c47019-bc8c, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://11.127.6.37:4567 timed out, no messages seen in PT3S, socket stats: rtt: 0 rttvar: 250000 rto: 2000000 lost: 1 last_data_recv: 165364883 cwnd: 1 last_queued_since: 165664882973365 last_delivered_since: 165664882973365 send_queue_length: 0 send_queue_bytes: 0
2025-02-20 11:48:30 0 [Note] WSREP: Failed to establish connection: Operation aborted.
2025-02-20 11:48:30 0 [Note] WSREP: view(view_id(NON_PRIM,d4c47019-bc8c,27) memb {
    d4c47019-bc8c,0
} joined {
} left {
} partitioned {
    252ce51f-aee4,0
    bdf18832-92a5,0
})
2025-02-20 11:48:33 0 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view
at /bitnami/blacksmith-sandox/libgalera-26.4.21/gcomm/src/pc.cpp:connect():160
2025-02-20 11:48:33 0 [ERROR] WSREP: /bitnami/blacksmith-sandox/libgalera-26.4.21/gcs/src/gcs_core.cpp:gcs_core_open():256: Failed to open backend connection: -110 (Connection timed out)
2025-02-20 11:48:33 0 [Note] WSREP: Failed to establish connection: Operation aborted.
2025-02-20 11:48:54 0 [ERROR] WSREP: /bitnami/blacksmith-sandox/libgalera-26.4.21/gcs/src/gcs.cpp:gcs_open():1701: Failed to open channel 'nrdGalera' at 'gcomm://11.127.4.37,11.127.5.37,11.127.6.37': -110 (Connection timed out)
2025-02-20 11:48:54 0 [ERROR] WSREP: gcs connect failed: Operation timed out
2025-02-20 11:48:54 0 [ERROR] WSREP: wsrep::connect(gcomm://11.127.4.37,11.127.5.37,11.127.6.37) failed: 7
2025-02-20 11:48:54 0 [ERROR] Aborting
*******************************************************************************************
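
When comparing logs from several nodes, it can help to pull the view information out of the `view(...)` message mechanically. The following is a minimal sketch based only on the log format shown above; `parse_view` is a hypothetical helper, not a Galera tool, and it keeps only the short node identifier before the comma in each entry.

```python
import re

# Header: view(view_id(<TYPE>,<short-uuid>,<sequence>)
VIEW_RE = re.compile(
    r"view\(view_id\((?P<type>\w+),(?P<uuid>[0-9a-f-]+),(?P<seq>\d+)\)"
)
# Sections: memb { ... } joined { ... } left { ... } partitioned { ... }
SECTION_RE = re.compile(r"(memb|joined|left|partitioned)\s*\{([^}]*)\}")


def parse_view(text):
    """Extract view type, id, sequence number and node lists from a view message."""
    header = VIEW_RE.search(text)
    if header is None:
        raise ValueError("no view(...) message found")
    view = {
        "type": header["type"],
        "id": header["uuid"],
        "seq": int(header["seq"]),
    }
    for name, body in SECTION_RE.findall(text[header.end():]):
        # Each entry looks like "<short-uuid>,<number>"; keep the identifier only.
        view[name] = [entry.split(",")[0] for entry in body.split()]
    return view
```

Applied to the view message in the log above, this reports a `NON_PRIM` view whose only member is `d4c47019-bc8c`, with `252ce51f-aee4` and `bdf18832-92a5` listed as partitioned.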

Is this expected behavior? If so, please provide a link to the documentation for the Galera arbitration process. Thanks!

People

Assignee: Julius Goryavsky

Reporter: Har Gagan

Votes: 0

Watchers: 2

Dates

Created: 2025-02-21 16:48

Updated: 2025-03-19 15:25

MariaDB Server