Details
Bug
Status: Open
Major
Resolution: Unresolved
11.4.4
None
Description
We are using MariaDB Galera 11.4.4, but the same behavior is also seen in the 10.6.4 release.
We have a 3-node cluster. We ran into a situation where connectivity was lost between only two of the three nodes. Call the nodes N1, N2 and N3: N1 - N3 connectivity was lost, but N1 - N2 and N2 - N3 remained intact.
While this was going on, N1 was restarted. It now fails to rejoin the cluster and just keeps restarting.
N1 - 11.127.4.37
N2 - 11.127.5.37
N3 - 11.127.6.37
Here is the cluster address parameter on 11.127.4.37:
wsrep_cluster_address = gcomm://11.127.4.37,11.127.5.37,11.127.6.37
N2 and N3 are connected to each other, and the cluster membership shows a 2-node cluster. When N1 restarts, it tries to connect to both N2 and N3. Because the N1 - N3 link is down, N1 cannot reach N3, but it does connect to N2.
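To make the topology concrete, here is a minimal Python sketch of the link state described above. It only models which node pairs have a direct link; it is not Galera code and makes no claim about how Galera forms a primary component. The names `NODES`, `LINKS_UP`, `direct_peers` and `broken_links` are made up for this illustration.

```python
from itertools import combinations

# Node names and addresses as listed in this report.
NODES = {"N1": "11.127.4.37", "N2": "11.127.5.37", "N3": "11.127.6.37"}

# Links that are still up; the N1 - N3 link is the one that was lost.
LINKS_UP = {frozenset(("N1", "N2")), frozenset(("N2", "N3"))}


def direct_peers(node, links=LINKS_UP):
    """Return the sorted list of nodes that `node` can reach directly."""
    return sorted(
        other for other in NODES
        if other != node and frozenset((node, other)) in links
    )


def broken_links(links=LINKS_UP):
    """Return every node pair that has no direct link."""
    return [
        pair for pair in combinations(sorted(NODES), 2)
        if frozenset(pair) not in links
    ]


if __name__ == "__main__":
    for name in sorted(NODES):
        print(name, NODES[name], "->", direct_peers(name))
    print("broken:", broken_links())
```

In this model N2 is the only node with a direct link to both of the others, which is the situation N1 is in when it restarts.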
N1 then fails with this error: failed to open gcomm backend connection: 110: failed to reach primary view
Here are a few related logs:
*******************************************************************************************
2025-02-20 11:48:26 0 [Note] WSREP: (d4c47019-bc8c, 'tcp://0.0.0.0:4567') connection established to bdf18832-92a5 tcp://11.127.5.37:4567
2025-02-20 11:48:29 0 [Note] WSREP: (d4c47019-bc8c, 'tcp://0.0.0.0:4567') turning message relay requesting off
2025-02-20 11:48:30 0 [Note] WSREP: (d4c47019-bc8c, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://11.127.6.37:4567 timed out, no messages seen in PT3S, socket stats: rtt: 0 rttvar: 250000 rto: 2000000 lost: 1 last_data_recv: 165364883 cwnd: 1 last_queued_since: 165664882973365 last_delivered_since: 165664882973365 send_queue_length: 0 send_queue_bytes: 0
2025-02-20 11:48:30 0 [Note] WSREP: Failed to establish connection: Operation aborted.
2025-02-20 11:48:30 0 [Note] WSREP: view(view_id(NON_PRIM,d4c47019-bc8c,27) memb
joined {
} left {
} partitioned
)
2025-02-20 11:48:33 0 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view
at /bitnami/blacksmith-sandox/libgalera-26.4.21/gcomm/src/pc.cpp:connect():160
2025-02-20 11:48:33 0 [ERROR] WSREP: /bitnami/blacksmith-sandox/libgalera-26.4.21/gcs/src/gcs_core.cpp:gcs_core_open():256: Failed to open backend connection: -110 (Connection timed out)
2025-02-20 11:48:33 0 [Note] WSREP: Failed to establish connection: Operation aborted.
2025-02-20 11:48:54 0 [ERROR] WSREP: /bitnami/blacksmith-sandox/libgalera-26.4.21/gcs/src/gcs.cpp:gcs_open():1701: Failed to open channel 'nrdGalera' at 'gcomm://11.127.4.37,11.127.5.37,11.127.6.37': -110 (Connection timed out)
2025-02-20 11:48:54 0 [ERROR] WSREP: gcs connect failed: Operation timed out
2025-02-20 11:48:54 0 [ERROR] WSREP: wsrep::connect(gcomm://11.127.4.37,11.127.5.37,11.127.6.37) failed: 7
2025-02-20 11:48:54 0 [ERROR] Aborting
*******************************************************************************************
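In case it helps with triage, the `[ERROR]` entries can be pulled out of a longer log with a small script. This is a rough sketch that assumes every entry starts with the `YYYY-MM-DD HH:MM:SS <id> [Level]` prefix seen in the excerpt above; continuation lines such as `joined {` are skipped. The helper names `parse_line` and `errors` are hypothetical, and the sample lines are copied from the excerpt.

```python
import re

# Matches the prefix seen in the excerpt: timestamp, a numeric id, [Level], message.
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \d+ "
    r"\[(?P<level>\w+)\] (?P<msg>.*)$"
)


def parse_line(line):
    """Return a dict with ts, level and msg, or None for non-matching lines."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def errors(lines):
    """Return the parsed entries whose level is ERROR, in input order."""
    parsed = (parse_line(line) for line in lines)
    return [entry for entry in parsed if entry and entry["level"] == "ERROR"]


SAMPLE = [
    "2025-02-20 11:48:30 0 [Note] WSREP: Failed to establish connection: Operation aborted.",
    "joined {",
    "2025-02-20 11:48:33 0 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view",
    "2025-02-20 11:48:54 0 [ERROR] Aborting",
]

if __name__ == "__main__":
    for entry in errors(SAMPLE):
        print(entry["ts"], entry["msg"])
```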
Is this expected behavior? If so, please provide a link to the documentation for the Galera arbitration process. Thanks!
Any update on this is appreciated. Thanks!