MariaDB Server / MDEV-23965

MariaDB Galera loses nodes frequently (in less than 1 hour) and they can't rejoin the 3-node cluster

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Incomplete
    • Affects Version/s: 10.5.4
    • Fix Version/s: N/A
    • Component/s: Galera
    • Labels: None
    • Environment: Kubernetes Client: v1.15.11
      Kubernetes Server: v1.15.6
      OS: Ubuntu 16.04.6

    Description

      Nodes are leaving the cluster and are unable to rejoin, which most of the time crashes the whole cluster. The nodes leave quickly (in less than 1 hour) while we run a single application that issues fast queries (reads only, no writes) against a table called MeasurementThreshold.

      The delay varies, but the log output of all 3 nodes (log level 2) looks normal until 2 of the nodes suddenly see the third one leave the cluster. The log of the departing node shows no errors at that time. After a while we see it restart in Kubernetes, and it does not rejoin the cluster.
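
Because the departing node logs nothing at the moment it leaves, sampling the wsrep status variables on each node at a short interval could help pin down when its state changes. The following is a minimal sketch, not part of the original report. It assumes the output of `SHOW GLOBAL STATUS LIKE 'wsrep_%'` has already been loaded into a dictionary by whatever client is in use; the function name `wsrep_problems` and the `expected_size` parameter are hypothetical.

```python
def wsrep_problems(status, expected_size=3):
    """Return a list of human-readable problems found in a node's wsrep status.

    `status` maps variable names from SHOW GLOBAL STATUS LIKE 'wsrep_%'
    to their string values. An empty list means nothing suspicious was seen.
    """
    problems = []

    cluster_status = status.get("wsrep_cluster_status")
    if cluster_status != "Primary":
        problems.append(f"cluster status is {cluster_status!r}, not 'Primary'")

    size = int(status.get("wsrep_cluster_size", 0))
    if size < expected_size:
        problems.append(f"cluster size is {size}, expected {expected_size}")

    # A node acting as a state transfer donor also reports a non-Synced
    # state, so this entry is an observation rather than proof of a fault.
    local_state = status.get("wsrep_local_state_comment")
    if local_state != "Synced":
        problems.append(f"local state is {local_state!r}, not 'Synced'")

    if status.get("wsrep_ready") != "ON":
        problems.append("wsrep_ready is not ON")

    return problems


healthy = {
    "wsrep_cluster_status": "Primary",
    "wsrep_cluster_size": "3",
    "wsrep_local_state_comment": "Synced",
    "wsrep_ready": "ON",
}
degraded = dict(healthy, wsrep_cluster_size="2")
```

Logging the result with a timestamp for each of the 3 nodes would show which node first reports a smaller cluster size or a non-Primary status, which can then be lined up against the server logs.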

      These are the key log lines for that restart failure; the complete logs are included as attachments.

      2020-10-15 11:00:21 0 [Warning] WSREP: access file(/bitnami/mariadb/data//gvwstate.dat) failed(No such file or directory)
      2020-10-15 11:00:21 0 [Note] WSREP: restore pc from disk failed

      2020-10-15 11:00:21 0 [Note] WSREP: gcomm: connecting to group 'galera', peer 'pstn-mariadb-ha-headless.pstn-dev.svc.cluster.local:'
      2020-10-15 11:00:21 0 [Note] WSREP: (9e65909b-880f, 'tcp://0.0.0.0:4567') connection established to fb790655-b3ae tcp://10.42.99.196:4567
      2020-10-15 11:00:21 0 [Note] WSREP: (9e65909b-880f, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://10.42.76.132:4567
      2020-10-15 11:00:21 0 [Note] WSREP: (9e65909b-880f, 'tcp://0.0.0.0:4567') connection established to 26a892d3-841b tcp://10.42.76.132:4567
      2020-10-15 11:00:22 0 [Note] WSREP: EVS version upgrade 0 -> 1
      2020-10-15 11:00:22 0 [Note] WSREP: declaring 26a892d3-841b at tcp://10.42.76.132:4567 stable
      2020-10-15 11:00:22 0 [Note] WSREP: declaring fb790655-b3ae at tcp://10.42.99.196:4567 stable
      2020-10-15 11:00:22 0 [Note] WSREP: PC protocol upgrade 0 -> 1
      2020-10-15 11:00:22 0 [Note] WSREP: view(view_id(NON_PRIM,26a892d3-841b,19) memb {
              26a892d3-841b,0
              9e65909b-880f,0
              fb790655-b3ae,0
      } joined {
      } left {
      } partitioned {
              2606729a-b745,0
              26a892d3-8419,0
              26a892d3-841a,0
      })
      2020-10-15 11:00:24 0 [Note] WSREP: (9e65909b-880f, 'tcp://0.0.0.0:4567') turning message relay requesting off
      2020-10-15 11:00:51 0 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
      at gcomm/src/pc.cpp:connect():160
      2020-10-15 11:00:51 0 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():220: Failed to open backend connection: -110 (Connection timed out)
      2020-10-15 11:00:51 0 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1632: Failed to open channel 'galera' at 'gcomm://pstn-mariadb-ha-headless.pstn-dev.svc.cluster.local': -110 (Connection timed out)
      2020-10-15 11:00:51 0 [ERROR] WSREP: gcs connect failed: Connection timed out
      2020-10-15 11:00:51 0 [ERROR] WSREP: wsrep::connect(gcomm://pstn-mariadb-ha-headless.pstn-dev.svc.cluster.local) failed: 7
      2020-10-15 11:00:51 0 [ERROR] Aborting
      Warning: Memory not freed: 72
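
The view message in the excerpt above carries the key facts of the failed rejoin: the view type, the current members, and the partitioned node ids. The following is a minimal sketch for pulling those fields out of a log so they can be compared across restarts. The regular expression is written against the shape of the single message quoted above and is an assumption, not a documented log grammar; the names `VIEW_RE` and `parse_view` are hypothetical.

```python
import re

# Matches one Galera "view(...)" log message. Whitespace and line breaks
# between the brace-delimited blocks are tolerated.
VIEW_RE = re.compile(
    r"view\(view_id\((?P<type>\w+),(?P<leader>[0-9a-f-]+),(?P<seq>\d+)\)\s*"
    r"memb\s*\{(?P<memb>[^}]*)\}\s*"
    r"joined\s*\{(?P<joined>[^}]*)\}\s*"
    r"left\s*\{(?P<left>[^}]*)\}\s*"
    r"partitioned\s*\{(?P<part>[^}]*)\}\s*\)"
)


def _ids(block):
    # Each entry looks like "26a892d3-841b,0"; keep only the node id.
    return [entry.split(",")[0] for entry in block.split()]


def parse_view(text):
    """Return the first view message in `text` as a dict, or None if absent."""
    match = VIEW_RE.search(text)
    if match is None:
        return None
    return {
        "type": match.group("type"),
        "seq": int(match.group("seq")),
        "members": _ids(match.group("memb")),
        "joined": _ids(match.group("joined")),
        "left": _ids(match.group("left")),
        "partitioned": _ids(match.group("part")),
    }


# The view message from the log excerpt in this report.
sample = """WSREP: view(view_id(NON_PRIM,26a892d3-841b,19) memb {
        26a892d3-841b,0
        9e65909b-880f,0
        fb790655-b3ae,0
} joined {
} left {
} partitioned {
        2606729a-b745,0
        26a892d3-8419,0
        26a892d3-841a,0
})"""

view = parse_view(sample)
```

For the message in this report, the parsed view has type `NON_PRIM`, three members, and three partitioned ids that do not appear in the member list. The later error, "failed to reach primary view", is consistent with the node only ever seeing this non-primary view before the connection attempt timed out.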

      We understand that nodes can be lost and that recovery exists for that purpose, but this has become a recurring issue. After a week of diagnosis, we keep losing the database in 2 environments as soon as we put query load on it. We need to find out WHY this happens and how to solve it, and we need recovery to work as well.

      In the included trace, we lost node 1 first, node 2 followed later, and finally all 3 nodes were in CrashLoopBackOff.
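
When every node is down, as in the CrashLoopBackOff state described above, the usual Galera approach is to compare the saved state of each node and bootstrap from the most advanced one. The following is a minimal sketch of that comparison, not something taken from the report. It assumes each node's `grastate.dat` text has been collected by some other means (the file would presumably sit in the same data directory as the `gvwstate.dat` path in the log, which is an assumption). The function names and the example values are hypothetical.

```python
def parse_grastate(text):
    """Parse the key/value lines of a grastate.dat file into a dict of strings."""
    state = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the "# GALERA saved state" header
        key, _, value = line.partition(":")
        state[key.strip()] = value.strip()
    return state


def most_advanced_node(states):
    """Return the node name with the highest seqno, or None if undecidable.

    A seqno of -1 means the position was not saved (for example after a
    crash), so no safe choice can be made from these files alone.
    """
    seqnos = {node: int(state["seqno"]) for node, state in states.items()}
    if not seqnos or any(seqno < 0 for seqno in seqnos.values()):
        return None
    return max(seqnos, key=seqnos.get)


# Made-up file content, for illustration only.
EXAMPLE = """# GALERA saved state
version: 2.1
uuid:    00000000-0000-0000-0000-000000000001
seqno:   1234
safe_to_bootstrap: 0
"""

example_state = parse_grastate(EXAMPLE)
```

If the function returns `None`, the positions would first have to be recovered on each node (for example with the server's `--wsrep-recover` option) before a bootstrap node can be chosen.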

      As a note, we were previously running single-node MariaDB 10.1.11, and the same program runs the same queries without any issues in the same environments.

          People

            Assignee: jplindst Jan Lindström (Inactive)
            Reporter: SLabelle Stephane Labelle
            Votes: 1
            Watchers: 5
