[MDEV-23965] MariaDB Galera lose nodes frequently (less than 1 hour) and they can't rejoin the 3 node cluster Created: 2020-10-15  Updated: 2021-04-15  Resolved: 2021-04-15

Status: Closed
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.5.4
Fix Version/s: N/A

Type: Bug Priority: Critical
Reporter: Stephane Labelle Assignee: Jan Lindström (Inactive)
Resolution: Incomplete Votes: 1
Labels: None
Environment:

Kubernetes Client: v1.15.11
Kubernetes Server: v1.15.6
OS: Ubuntu 16.04.6


Attachments: Zip Archive Archive.zip     Zip Archive show_variables.txt.zip    

 Description   

We are encountering issues where nodes leave the cluster and are unable to rejoin, which in most cases crashes the cluster. These nodes leave the cluster rapidly (in less than 1 hour) while running a single fast-query application (read-only, no writes) against a table called MeasurementThreshold.

The delay varies, but everything looks fine in the log output of the 3 nodes (log level 2) until suddenly 2 of the nodes see the third leave the cluster. The log of the leaving node shows no errors at that time; after a while we see it restart in Kubernetes, and it does not get back into the cluster.

These are the principal lines for that restart failure; the complete logs are included in the attachments.

2020-10-15 11:00:21 0 [Warning] WSREP: access file(/bitnami/mariadb/data//gvwstate.dat) failed(No such file or directory)
2020-10-15 11:00:21 0 [Note] WSREP: restore pc from disk failed

2020-10-15 11:00:21 0 [Note] WSREP: gcomm: connecting to group 'galera', peer 'pstn-mariadb-ha-headless.pstn-dev.svc.cluster.local:'
2020-10-15 11:00:21 0 [Note] WSREP: (9e65909b-880f, 'tcp://0.0.0.0:4567') connection established to fb790655-b3ae tcp://10.42.99.196:4567
2020-10-15 11:00:21 0 [Note] WSREP: (9e65909b-880f, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://10.42.76.132:4567
2020-10-15 11:00:21 0 [Note] WSREP: (9e65909b-880f, 'tcp://0.0.0.0:4567') connection established to 26a892d3-841b tcp://10.42.76.132:4567
2020-10-15 11:00:22 0 [Note] WSREP: EVS version upgrade 0 -> 1
2020-10-15 11:00:22 0 [Note] WSREP: declaring 26a892d3-841b at tcp://10.42.76.132:4567 stable
2020-10-15 11:00:22 0 [Note] WSREP: declaring fb790655-b3ae at tcp://10.42.99.196:4567 stable
2020-10-15 11:00:22 0 [Note] WSREP: PC protocol upgrade 0 -> 1
2020-10-15 11:00:22 0 [Note] WSREP: view(view_id(NON_PRIM,26a892d3-841b,19) memb {
	26a892d3-841b,0
	9e65909b-880f,0
	fb790655-b3ae,0
} joined {
} left {
} partitioned {
	2606729a-b745,0
	26a892d3-8419,0
	26a892d3-841a,0
}
)
2020-10-15 11:00:24 0 [Note] WSREP: (9e65909b-880f, 'tcp://0.0.0.0:4567') turning message relay requesting off
2020-10-15 11:00:51 0 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
at gcomm/src/pc.cpp:connect():160
2020-10-15 11:00:51 0 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():220: Failed to open backend connection: -110 (Connection timed out)
2020-10-15 11:00:51 0 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1632: Failed to open channel 'galera' at 'gcomm://pstn-mariadb-ha-headless.pstn-dev.svc.cluster.local': -110 (Connection timed out)
2020-10-15 11:00:51 0 [ERROR] WSREP: gcs connect failed: Connection timed out
2020-10-15 11:00:51 0 [ERROR] WSREP: wsrep::connect(gcomm://pstn-mariadb-ha-headless.pstn-dev.svc.cluster.local) failed: 7
2020-10-15 11:00:51 0 [ERROR] Aborting
Warning: Memory not freed: 72

We understand that we can lose nodes and that recovery exists for that case, but we are finding that this is a recurring issue. After a week of diagnosis, we keep losing the database in 2 environments as soon as we put query load on it. We need to find out WHY, how to solve this, and how recovery can be made to work as well.

In the included trace, we lost node 1 first, then node 2 followed later, and finally all 3 nodes were in CrashLoopBackOff.

As a note, we were previously on single-node MariaDB 10.1.11, and the same query program runs without any issues in the same environments.
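Since the failure above is a plain "Connection timed out" while reaching the group, a first sanity check is whether the Galera ports are reachable between pods at all. Below is a minimal, hypothetical sketch (not part of the original report) for probing those ports; the peer addresses in the comment are taken from the log excerpt and are only illustrative.

```python
import socket

# Galera needs these TCP ports reachable between every pair of nodes:
# 4567 (group communication), 4568 (IST), 4444 (SST).
GALERA_PORTS = (4567, 4568, 4444)

def check_port(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_peer(host):
    """Probe all Galera ports on one peer, returning {port: reachable}."""
    return {port: check_port(host, port) for port in GALERA_PORTS}

# Example usage (hypothetical peer IPs from the log excerpt):
# for peer in ("10.42.99.196", "10.42.76.132"):
#     print(peer, check_peer(peer))
```

Running this from inside each pod, both during normal operation and right after a node drops out, would help separate a network-layer problem from a Galera-layer one.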



 Comments   
Comment by Jan Lindström (Inactive) [ 2020-10-20 ]

There seems to be some mysterious node partitioning, e.g.:

2020-10-15 11:00:22 0 [Note] WSREP: view(view_id(NON_PRIM,26a892d3-841b,19) memb {
	26a892d3-841b,0
	9e65909b-880f,0
	fb790655-b3ae,0
} joined {
} left {
} partitioned {
	2606729a-b745,0
	26a892d3-8419,0
	26a892d3-841a,0
}

So you have a 6-node cluster? What Galera library version do you have?
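For reference, view entries like the one quoted above can be examined programmatically. The sketch below (a hypothetical helper, not part of this report) splits a WSREP view line into its memb/joined/left/partitioned sections, which makes it easier to compare the three current members against the three partitioned UUIDs.

```python
import re

def parse_view(text):
    """Extract the truncated member UUIDs from each section of a
    WSREP view log entry (memb, joined, left, partitioned)."""
    sections = {}
    for name, body in re.findall(
            r"(memb|joined|left|partitioned)\s*\{([^}]*)\}", text):
        # Truncated Galera UUIDs look like 26a892d3-841b.
        sections[name] = re.findall(r"([0-9a-f]{8}-[0-9a-f]{4})", body)
    return sections
```

Applied to the view above, `memb` and `partitioned` each contain three UUIDs; whether the partitioned entries are three extra nodes or stale incarnations of the same pods is exactly the open question here.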

Comment by Stephane Labelle [ 2020-10-20 ]

The library, from the startup trace, seems to be Galera 4.5(r0), and we have a 3-node cluster. We also tried 4 nodes and had the same issues.

Comment by Diane Strachan [ 2020-10-22 ]

Hi Jan. Do you need more information from us for this one? Thx

Comment by Diane Strachan [ 2020-10-27 ]

Hi Jan. This issue is very critical for us and is blocking deployment to production. We are also experiencing crashes on our standalone DB so it is critical we achieve some stability with our database. Appreciate any support you can provide in diagnosing the issue. Thx

Comment by Stephane Labelle [ 2020-11-24 ]

Hi

Is somebody actively working on this issue? It is critical for us. If you need more information, we can try to provide it.

Comment by Stephane Labelle [ 2020-12-11 ]

Hi, any progress on this? We are stuck waiting...

Comment by Jan Lindström (Inactive) [ 2020-12-19 ]

Hi,

I would need more information to understand this situation. Can you provide some network status information when all 3 nodes are up and running, and again when one node decides to leave the cluster? Is the leaving node able to rejoin the cluster? What kind of network is there between these 3 nodes? Have you monitored network operation while the cluster is up and running, and do you see anything when one of the nodes decides to leave the cluster?

If a node crashes, I would need a stack trace with symbols resolved.

Comment by dani g [ 2021-01-03 ]

Hi,

Exactly the same issue here.
Version 10.5.5.
Cluster size is 3.

The cluster is installed fresh, and after a few hours node 0 goes down into state 0/2 CrashLoopBackOff with the above error, while node 1 is stuck in the 1/2 state.

Comment by Stephane Labelle [ 2021-01-18 ]

Hi
For the network status information, what kind of command do you want me to execute?

As for coming back into the cluster: no, it is normally not able to rejoin.

It's a local network, not multi-site, if that's what you're wondering.

I have not done any network monitoring per se, so I don't know what the state of the network is when this occurs. But from what I have seen, when there is not much activity the cluster will keep going for a long time; as soon as you start putting even just a lot of read activity on it, it will soon crash.
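Since no network monitoring is in place yet, a lightweight option is to sample TCP connect latency between the nodes over time; spikes or dropouts that coincide with a node leaving would point at the network rather than at Galera. This is a minimal sketch under that assumption (hypothetical helper, standard library only):

```python
import socket
import time

def probe_latency(host, port, timeout=2.0):
    """Time a single TCP connect to host:port.
    Returns latency in seconds, or None if unreachable."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None

def monitor(peers, port=4567, samples=5, interval=1.0):
    """Collect a few connect-latency samples per peer.
    None entries mark moments where the peer was unreachable."""
    results = {peer: [] for peer in peers}
    for _ in range(samples):
        for peer in peers:
            results[peer].append(probe_latency(peer, port))
        time.sleep(interval)
    return results

# Example usage (hypothetical peer IPs from the log excerpt):
# print(monitor(["10.42.99.196", "10.42.76.132"], samples=10))
```

Running this in a sidecar or cron-like loop on each pod, with the load test active, would capture the network's state at the moment a node is declared gone.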

As for the stack trace: we are running on Kubernetes, and when a node crashes it is automatically restarted. We normally don't get to see the stack trace, and the pods do not have persistent storage.

Comment by Stephane Labelle [ 2021-01-20 ]

Closing this ticket since we are moving to Oracle DB for our HA solution.

Generated at Thu Feb 08 09:26:25 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.