Details
Type: Bug
Status: Closed
Priority: Critical
Resolution: Incomplete
Affects Version/s: 10.5.4
Fix Version/s: None
Environment:
Kubernetes Client: v1.15.11
Kubernetes Server: v1.15.6
OS: Ubuntu 16.04.6
Description
We are encountering issues where nodes leave the cluster and, most of the time, fail to rejoin, crashing the cluster. The nodes leave the cluster quickly (in under an hour) while running a single fast query application (read-only, no writes) against a table called MeasurementThreshold.
The delay varies, but everything looks fine in the log output of the 3 nodes (log level 2) until suddenly 2 of the nodes see the third one leaving the cluster. The log of the departing node shows no errors at that time; after a while we see it restart in Kubernetes, and it does not rejoin the cluster.
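When chasing this kind of flapping, it can help to pull only the membership transitions out of the WSREP log noise. A minimal grep sketch, assuming the error log is in a file (the name mariadb.log is a placeholder; the sample lines below are copied from the log excerpt further down):

```shell
# Extract WSREP membership/view transitions from a MariaDB error log.
# "mariadb.log" is a stand-in for wherever the container writes its log.
log=mariadb.log

# Sample content copied from this report, for illustration only:
cat > "$log" <<'EOF'
2020-10-15 11:00:22 0 [Note] WSREP: declaring 26a892d3-841b at tcp://10.42.76.132:4567 stable
2020-10-15 11:00:22 0 [Note] WSREP: view(view_id(NON_PRIM,26a892d3-841b,19) memb
2020-10-15 11:00:24 0 [Note] WSREP: (9e65909b-880f, 'tcp://0.0.0.0:4567') turning message relay requesting off
2020-10-15 11:00:51 0 [ERROR] WSREP: gcs connect failed: Connection timed out
EOF

# Any NON_PRIM view or gcs connect failure marks the moment a node fell
# out of (or failed to reach) the primary component.
grep -E 'NON_PRIM|gcs connect failed|declaring .* stable' "$log"
```

Correlating these lines across the three pods (by timestamp) shows which node dropped first and how the others reacted.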
These are the principal lines for that restart failure; the full logs are included as attachments.
2020-10-15 11:00:21 0 [Warning] WSREP: access file(/bitnami/mariadb/data//gvwstate.dat) failed(No such file or directory)
2020-10-15 11:00:21 0 [Note] WSREP: restore pc from disk failed
2020-10-15 11:00:21 0 [Note] WSREP: gcomm: connecting to group 'galera', peer 'pstn-mariadb-ha-headless.pstn-dev.svc.cluster.local:'
2020-10-15 11:00:21 0 [Note] WSREP: (9e65909b-880f, 'tcp://0.0.0.0:4567') connection established to fb790655-b3ae tcp://10.42.99.196:4567
2020-10-15 11:00:21 0 [Note] WSREP: (9e65909b-880f, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://10.42.76.132:4567
2020-10-15 11:00:21 0 [Note] WSREP: (9e65909b-880f, 'tcp://0.0.0.0:4567') connection established to 26a892d3-841b tcp://10.42.76.132:4567
2020-10-15 11:00:22 0 [Note] WSREP: EVS version upgrade 0 -> 1
2020-10-15 11:00:22 0 [Note] WSREP: declaring 26a892d3-841b at tcp://10.42.76.132:4567 stable
2020-10-15 11:00:22 0 [Note] WSREP: declaring fb790655-b3ae at tcp://10.42.99.196:4567 stable
2020-10-15 11:00:22 0 [Note] WSREP: PC protocol upgrade 0 -> 1
2020-10-15 11:00:22 0 [Note] WSREP: view(view_id(NON_PRIM,26a892d3-841b,19) memb {
} joined {
} left {
} partitioned {
})
2020-10-15 11:00:24 0 [Note] WSREP: (9e65909b-880f, 'tcp://0.0.0.0:4567') turning message relay requesting off
2020-10-15 11:00:51 0 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
at gcomm/src/pc.cpp:connect():160
2020-10-15 11:00:51 0 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():220: Failed to open backend connection: -110 (Connection timed out)
2020-10-15 11:00:51 0 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1632: Failed to open channel 'galera' at 'gcomm://pstn-mariadb-ha-headless.pstn-dev.svc.cluster.local': -110 (Connection timed out)
2020-10-15 11:00:51 0 [ERROR] WSREP: gcs connect failed: Connection timed out
2020-10-15 11:00:51 0 [ERROR] WSREP: wsrep::connect(gcomm://pstn-mariadb-ha-headless.pstn-dev.svc.cluster.local) failed: 7
2020-10-15 11:00:51 0 [ERROR] Aborting
Warning: Memory not freed: 72
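The abort above is a joiner timing out while waiting for a primary component: with gvwstate.dat missing, the node cannot restore the previous primary-component view, and since the only view it reaches is NON_PRIM, the gcomm connect gives up with error 110. A quick way to see what state each node thinks it is in is to read its grastate.dat. A sketch, using a local stand-in for the /bitnami/mariadb/data path from the log above and illustrative file contents (not from the actual cluster):

```shell
# Inspect a node's Galera saved state. "./bitnami-mariadb-data" stands in
# for /bitnami/mariadb/data; the contents below are an illustrative example.
datadir=./bitnami-mariadb-data
mkdir -p "$datadir"
cat > "$datadir/grastate.dat" <<'EOF'
# GALERA saved state
version: 2.1
uuid:    9e65909b-0000-0000-0000-000000000000
seqno:   -1
safe_to_bootstrap: 0
EOF

# seqno -1 means the node did not shut down cleanly; safe_to_bootstrap 0
# means Galera will refuse to bootstrap a new cluster from this node.
awk '/^(uuid|seqno|safe_to_bootstrap):/ {print $1, $2}' "$datadir/grastate.dat"
```

Checking this file on each pod (e.g. via kubectl exec) shows whether any node still holds a usable position to bootstrap from.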
We understand that we can lose nodes and that recovery exists for that, but we are finding this to be a recurring issue. After a week of diagnosis we keep losing the database in 2 environments as soon as we put query load on it. We need to find out WHY, how to solve this, and how to make recovery work as well.
In the trace included, we lost node 1 first, then node 2 followed later, and finally all 3 nodes were in CrashLoopBackOff.
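When all 3 pods end up in CrashLoopBackOff, the usual Galera full-cluster recovery is to bootstrap from the node with the most advanced committed state. A sketch of the seqno comparison, with illustrative file contents standing in for each pod's grastate.dat (on the real cluster you would read /bitnami/mariadb/data/grastate.dat in each pod, then mark the winner safe_to_bootstrap and restart it as the bootstrap node):

```shell
# Pick the bootstrap candidate: the node with the highest committed seqno
# among the saved states. The files below stand in for each pod's
# grastate.dat; the values are illustrative.
mkdir -p states
printf 'seqno:   1523\n' > states/node-0
printf 'seqno:   1531\n' > states/node-1
printf 'seqno:   -1\n'   > states/node-2   # crashed; real position unknown

# Highest seqno wins. A node at -1 would need to be started with
# wsrep-recover first to learn its actual position before comparing.
best=$(awk '/^seqno:/ {print $2, FILENAME}' states/node-* | sort -n | tail -1)
echo "bootstrap candidate: $best"
```

Only after the chosen node is up and Primary should the other two pods be restarted so they rejoin via IST/SST rather than all racing to form a cluster at once.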
As a note, we were on single-node MariaDB 10.1.11 before, and the same query program works without any issues in the same environments.