[MDEV-23965] MariaDB Galera lose nodes frequently (less than 1 hour) and they can't rejoin the 3 node cluster Created: 2020-10-15 Updated: 2021-04-15 Resolved: 2021-04-15 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Galera |
| Affects Version/s: | 10.5.4 |
| Fix Version/s: | N/A |
| Type: | Bug | Priority: | Critical |
| Reporter: | Stephane Labelle | Assignee: | Jan Lindström (Inactive) |
| Resolution: | Incomplete | Votes: | 1 |
| Labels: | None | ||
| Environment: |
Kubernetes Client: v1.15.11 |
||
| Attachments: |
|
| Description |
|
We are encountering nodes leaving the cluster and, most of the time, being unable to rejoin, which eventually crashes the cluster. Nodes leave rapidly (within less than an hour) of running a single fast-query application (reads only, no writes) against a table called MeasurementThreshold. The delay varies, but the log output of all 3 nodes (log level 2) looks fine until, suddenly, 2 of the nodes see the third leaving the cluster. The log of the leaving node shows no errors at that time; after a while we see Kubernetes restart it, and it does not get back into the cluster. These are the principal lines for that restart failure (the full logs are included in the attachments):

2020-10-15 11:00:21 0 [Warning] WSREP: access file(/bitnami/mariadb/data//gvwstate.dat) failed(No such file or directory)
2020-10-15 11:00:21 0 [Note] WSREP: gcomm: connecting to group 'galera', peer 'pstn-mariadb-ha-headless.pstn-dev.svc.cluster.local:' joined { )

We understand that nodes can be lost and that recovery exists for that, but we are finding that it is a recurring issue. After a week of diagnosis, we keep losing the database in 2 environments as soon as we put query load on it. We need to find WHY, how to solve this, and how to make recovery work. In the trace included, we lost node 1 first, then node 2 followed later, and finally all 3 nodes were in CrashLoopBackOff.

As a note, we were previously on single-node MariaDB 10.1.11, and the same query program works without any issues in the same environments. |
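When all nodes end up down (as in the CrashLoopBackOff scenario above), the usual Galera recovery is to bootstrap from the node with the most advanced `grastate.dat`. Below is a minimal sketch of that step, demonstrated on a throwaway copy of the file so nothing real is touched; the path and the file contents here are hypothetical (in the reporter's setup the real file would live in the datadir, e.g. under /bitnami/mariadb/data/):

```shell
# Work on a disposable copy of grastate.dat (hypothetical contents).
dir=$(mktemp -d)
cat > "$dir/grastate.dat" <<'EOF'
# GALERA saved state
version: 2.1
uuid:    00000000-0000-0000-0000-000000000000
seqno:   1234
safe_to_bootstrap: 0
EOF

# Recovery idea: compare the "seqno:" line across all nodes, and on the
# node with the HIGHEST seqno only, flip safe_to_bootstrap to 1, then
# start that node with --wsrep-new-cluster (or galera_new_cluster) and
# start the remaining nodes normally so they SST/IST from it.
sed -i 's/safe_to_bootstrap: 0/safe_to_bootstrap: 1/' "$dir/grastate.dat"
grep 'safe_to_bootstrap' "$dir/grastate.dat"
```

Note that on ephemeral Kubernetes pods without persistent volumes, `grastate.dat` (and `gvwstate.dat`, whose absence the warning above reports) is lost on restart, which defeats this recovery path.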
| Comments |
| Comment by Jan Lindström (Inactive) [ 2020-10-20 ] | ||||||||||
|
There seems to be some mysterious node partitioning, e.g.
So you have a 6-node cluster? What Galera library version do you have? | ||||||||||
| Comment by Stephane Labelle [ 2020-10-20 ] | ||||||||||
|
The library from the startup trace seems to be Galera 4.5(r0), and we have a 3-node cluster. We also tried 4 nodes and had the same issues. | ||||||||||
| Comment by Diane Strachan [ 2020-10-22 ] | ||||||||||
|
Hi Jan. Do you need more information from us for this one? Thx | ||||||||||
| Comment by Diane Strachan [ 2020-10-27 ] | ||||||||||
|
Hi Jan. This issue is very critical for us and is blocking deployment to production. We are also experiencing crashes on our standalone DB so it is critical we achieve some stability with our database. Appreciate any support you can provide in diagnosing the issue. Thx | ||||||||||
| Comment by Stephane Labelle [ 2020-11-24 ] | ||||||||||
|
Hi. Is somebody actively working on this issue? It is critical for us. If you need more information, we can try to provide it. | ||||||||||
| Comment by Stephane Labelle [ 2020-12-11 ] | ||||||||||
|
Hi, any progress on this? We are stuck waiting... | ||||||||||
| Comment by Jan Lindström (Inactive) [ 2020-12-19 ] | ||||||||||
|
Hi, I would need more information to understand this situation. Can you provide some network status information when all 3 nodes are up and running, and again when one node decides to leave the cluster? Is the leaving node able to rejoin the cluster? What kind of network is there between these 3 nodes? Have you monitored network operation while the cluster is up and running, and do you see anything when one of the nodes decides to leave the cluster? If a node crashes, I would need a stack trace with symbols resolved. | ||||||||||
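The cluster-state information requested above is exposed through the `wsrep_%` status variables on each node. A minimal sketch follows; the node address and credentials in the commented commands are hypothetical, and the parsing is shown against a captured sample (built with `printf`) so the snippet runs standalone:

```shell
# Live values would come from the mysql client on each node, e.g.:
#   mysql -h <node> -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster%';"
#   mysql -h <node> -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';"
# Captured sample output (tab-separated), used here so this runs anywhere:
sample=$(printf 'wsrep_cluster_size\t3\nwsrep_cluster_status\tPrimary\nwsrep_local_state_comment\tSynced\n')

# A healthy 3-node cluster shows size 3, status Primary, state Synced.
# When a node "leaves", wsrep_cluster_size drops on the survivors and the
# leaving side typically reports a non-Primary status.
printf '%s\n' "$sample" | awk -F'\t' '{print $1 "=" $2}'
```

Collecting these values on all 3 nodes at the moment a node leaves, alongside basic network checks between the pods, would narrow down whether this is a network partition or a crash.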
| Comment by dani g [ 2021-01-03 ] | ||||||||||
|
Hi, exactly the same issue here. The cluster is installed fresh, and after a few hours node 0 goes down in state 0/2 CrashLoopBackOff with the above error, and node 1 is stuck in state 1/2. | ||||||||||
| Comment by Stephane Labelle [ 2021-01-18 ] | ||||||||||
|
Hi. As for coming back into the cluster: no, it is normally not able to rejoin. It's a local network, not multiple sites, if that's what you are wondering. I have not done any network monitoring per se, so I don't know the state of the network when this occurs. From what I have seen, when there is not much activity the cluster will keep going a long time, but as soon as we start putting even just a lot of read activity on it, it will soon crash. As for the stack trace: we are running on Kubernetes, and when a node crashes it is automatically restarted. We normally don't get to see the stack trace, and the pods do not have persistent storage. | ||||||||||
| Comment by Stephane Labelle [ 2021-01-20 ] | ||||||||||
|
Closing this ticket since we are moving to Oracle DB for our HA solution |