Details
Type: Bug
Status: Closed
Priority: Critical
Resolution: Incomplete
Affects Version/s: 10.5.4
Fix Version/s: None
Environment:
Kubernetes Client: v1.15.11
Kubernetes Server: v1.15.6
OS: Ubuntu 16.04.6
Description
We are encountering issues where nodes leave the cluster and, most of the time, fail to rejoin, crashing the cluster. The nodes leave the cluster quickly (in under an hour) while running a single fast query application (read-only, no writes) against a table called MeasurementThreshold.
The delay varies, but everything looks fine in the log output of the 3 nodes (log level 2) until suddenly 2 of the nodes see the third one leaving the cluster. The log of the departing node shows no errors at that time; after a while we see it restart in Kubernetes, and it does not rejoin the cluster.
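When chasing this kind of flapping, it can help to pull only the membership transitions out of the WSREP log noise. A minimal grep sketch, assuming the error log is in a file (the name mariadb.log is a placeholder; the sample lines below are copied from the log excerpt further down):

```shell
# Extract WSREP membership/view transitions from a MariaDB error log.
# "mariadb.log" is a stand-in for wherever the container writes its log.
log=mariadb.log

# Sample content copied from this report, for illustration only:
cat > "$log" <<'EOF'
2020-10-15 11:00:22 0 [Note] WSREP: declaring 26a892d3-841b at tcp://10.42.76.132:4567 stable
2020-10-15 11:00:22 0 [Note] WSREP: view(view_id(NON_PRIM,26a892d3-841b,19) memb
2020-10-15 11:00:24 0 [Note] WSREP: (9e65909b-880f, 'tcp://0.0.0.0:4567') turning message relay requesting off
2020-10-15 11:00:51 0 [ERROR] WSREP: gcs connect failed: Connection timed out
EOF

# Any NON_PRIM view or gcs connect failure marks the moment a node fell
# out of (or failed to reach) the primary component.
grep -E 'NON_PRIM|gcs connect failed|declaring .* stable' "$log"
```

Correlating these lines across the three pods (by timestamp) shows which node dropped first and how the others reacted.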
These are the principal lines for that restart failure; the full logs are included as attachments.
2020-10-15 11:00:21 0 [Warning] WSREP: access file(/bitnami/mariadb/data//gvwstate.dat) failed(No such file or directory)
2020-10-15 11:00:21 0 [Note] WSREP: restore pc from disk failed
2020-10-15 11:00:21 0 [Note] WSREP: gcomm: connecting to group 'galera', peer 'pstn-mariadb-ha-headless.pstn-dev.svc.cluster.local:'
2020-10-15 11:00:21 0 [Note] WSREP: (9e65909b-880f, 'tcp://0.0.0.0:4567') connection established to fb790655-b3ae tcp://10.42.99.196:4567
2020-10-15 11:00:21 0 [Note] WSREP: (9e65909b-880f, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://10.42.76.132:4567
2020-10-15 11:00:21 0 [Note] WSREP: (9e65909b-880f, 'tcp://0.0.0.0:4567') connection established to 26a892d3-841b tcp://10.42.76.132:4567
2020-10-15 11:00:22 0 [Note] WSREP: EVS version upgrade 0 -> 1
2020-10-15 11:00:22 0 [Note] WSREP: declaring 26a892d3-841b at tcp://10.42.76.132:4567 stable
2020-10-15 11:00:22 0 [Note] WSREP: declaring fb790655-b3ae at tcp://10.42.99.196:4567 stable
2020-10-15 11:00:22 0 [Note] WSREP: PC protocol upgrade 0 -> 1
2020-10-15 11:00:22 0 [Note] WSREP: view(view_id(NON_PRIM,26a892d3-841b,19) memb {
} joined {
} left {
} partitioned {
})
2020-10-15 11:00:24 0 [Note] WSREP: (9e65909b-880f, 'tcp://0.0.0.0:4567') turning message relay requesting off
2020-10-15 11:00:51 0 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
at gcomm/src/pc.cpp:connect():160
2020-10-15 11:00:51 0 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():220: Failed to open backend connection: -110 (Connection timed out)
2020-10-15 11:00:51 0 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1632: Failed to open channel 'galera' at 'gcomm://pstn-mariadb-ha-headless.pstn-dev.svc.cluster.local': -110 (Connection timed out)
2020-10-15 11:00:51 0 [ERROR] WSREP: gcs connect failed: Connection timed out
2020-10-15 11:00:51 0 [ERROR] WSREP: wsrep::connect(gcomm://pstn-mariadb-ha-headless.pstn-dev.svc.cluster.local) failed: 7
2020-10-15 11:00:51 0 [ERROR] Aborting
Warning: Memory not freed: 72
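The abort above is a joiner timing out while waiting for a primary component: with gvwstate.dat missing, the node cannot restore the previous primary-component view, and since the only view it reaches is NON_PRIM, the gcomm connect gives up with error 110. A quick way to see what state each node thinks it is in is to read its grastate.dat. A sketch, using a local stand-in for the /bitnami/mariadb/data path from the log above and illustrative file contents (not from the actual cluster):

```shell
# Inspect a node's Galera saved state. "./bitnami-mariadb-data" stands in
# for /bitnami/mariadb/data; the contents below are an illustrative example.
datadir=./bitnami-mariadb-data
mkdir -p "$datadir"
cat > "$datadir/grastate.dat" <<'EOF'
# GALERA saved state
version: 2.1
uuid:    9e65909b-0000-0000-0000-000000000000
seqno:   -1
safe_to_bootstrap: 0
EOF

# seqno -1 means the node did not shut down cleanly; safe_to_bootstrap 0
# means Galera will refuse to bootstrap a new cluster from this node.
awk '/^(uuid|seqno|safe_to_bootstrap):/ {print $1, $2}' "$datadir/grastate.dat"
```

Checking this file on each pod (e.g. via kubectl exec) shows whether any node still holds a usable position to bootstrap from.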
We understand that we can lose nodes and that recovery exists for that, but we are finding this to be a recurring issue. After a week of diagnosis we keep losing the database in 2 environments as soon as we put query load on it. We need to find out WHY, how to solve this, and how to make recovery work as well.
In the trace included, we lost node 1 first, then node 2 followed later, and finally all 3 nodes were in CrashLoopBackOff.
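When all 3 pods end up in CrashLoopBackOff, the usual Galera full-cluster recovery is to bootstrap from the node with the most advanced committed state. A sketch of the seqno comparison, with illustrative file contents standing in for each pod's grastate.dat (on the real cluster you would read /bitnami/mariadb/data/grastate.dat in each pod, then mark the winner safe_to_bootstrap and restart it as the bootstrap node):

```shell
# Pick the bootstrap candidate: the node with the highest committed seqno
# among the saved states. The files below stand in for each pod's
# grastate.dat; the values are illustrative.
mkdir -p states
printf 'seqno:   1523\n' > states/node-0
printf 'seqno:   1531\n' > states/node-1
printf 'seqno:   -1\n'   > states/node-2   # crashed; real position unknown

# Highest seqno wins. A node at -1 would need to be started with
# wsrep-recover first to learn its actual position before comparing.
best=$(awk '/^seqno:/ {print $2, FILENAME}' states/node-* | sort -n | tail -1)
echo "bootstrap candidate: $best"
```

Only after the chosen node is up and Primary should the other two pods be restarted so they rejoin via IST/SST rather than all racing to form a cluster at once.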
As a note, we were on single-node MariaDB 10.1.11 before, and the same query program works without any issues in the same environments.