MariaDB Server / MDEV-21002

Galera Cluster Node During IST Goes from "Synced" to "Joining: receiving State transfer" (stuck, requires kill -9)

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 10.4.8, 10.4.9, 10.4.10, 10.4.11
    • Fix Version/s: 10.4.13, 10.5.2
    • Component/s: Galera
    • Labels: None
    • Environment: RHEL 7 on x64 VM. MariaDB from MariaDB repo via Artifactory (not RHEL repo). No docker. 5-node cluster.

    Description

      Summary: Galera appears to have difficulty switching the value of wsrep_local_state_comment from "Joining: receiving State transfer" back to "Synced" during network slowdowns (and the subsequent IST). mysqld then becomes unstable and cannot be stopped gracefully.

      Solution that works most of the time: kill -9 the process, delete the entire datastore on the affected cluster node, and re-join the cluster.

      Background: We built a new cluster from scratch using a fresh install of 10.4.8 (and 10.4.9). Data and grants were imported fresh from SQL (no carry-over of any data files). During some brief network outages, a random node switches its value for wsrep_local_state_comment from "Synced" to "Joining: receiving State transfer" and stays there. There are no errors in the logs of either the donor or the affected node, and it looks like the sync completed successfully (no SST rsync processes or any other evidence of a transfer still in motion). Debug logging was unhelpful. Stopping MariaDB gracefully on the affected node is not possible; it requires kill -9.
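The stuck state described above can be detected by polling wsrep_local_state_comment and flagging a node that stays in the joining state for too long. The following is a minimal sketch, not part of the original report. It assumes the status row is obtained with something like `mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment'"` (tab-separated output); the function names and the threshold are hypothetical.

```python
from typing import Iterable, Optional, Tuple

# State string quoted in the report for a node that never returns to "Synced".
JOINING = "Joining: receiving State transfer"


def parse_status_line(line: str) -> str:
    """Extract the value from one tab-separated status row.

    Assumes the row looks like "wsrep_local_state_comment<TAB>Synced".
    """
    name, _, value = line.strip().partition("\t")
    if name != "wsrep_local_state_comment":
        raise ValueError(f"unexpected status row: {line!r}")
    return value


def stuck_since(samples: Iterable[Tuple[float, str]],
                max_joining_seconds: float) -> Optional[float]:
    """Return the timestamp at which the node entered the joining state
    if it then stayed there longer than max_joining_seconds, else None.

    samples is an iterable of (timestamp_in_seconds, state) pairs in
    chronological order. The threshold is a hypothetical tuning knob.
    """
    entered = None
    for ts, state in samples:
        if state == JOINING:
            if entered is None:
                entered = ts  # first sample of this joining episode
            if ts - entered > max_joining_seconds:
                return entered
        else:
            entered = None  # node left the joining state; reset
    return None
```

A monitoring loop could feed periodic samples into `stuck_since` and raise an alert when it returns a timestamp, since in this bug the node logs no error of its own.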

      Troubleshooting: Mixing and matching MariaDB 10.4.8 (older) and Galera 26.4.3 (newer) seemed to reduce how often the problem occurs, but it still happens. SST does not seem to be affected.

      Problem trigger: The problem is provoked by temporary network loss and is generally reproducible by using iptables to block cluster replication for a short period of time and then allowing the node to re-sync via IST.
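The reproduction described above could be scripted roughly as follows. This is a sketch under assumptions, not the reporter's actual procedure: it assumes the default Galera ports (4567 for group communication over TCP and UDP, 4568 for IST), blocks inbound traffic only, and needs root to run for real. The function names are hypothetical.

```python
import subprocess
import time
from typing import Callable, List

# Assumed default Galera ports: 4567 (group communication), 4568 (IST).
GALERA_PORTS = {"tcp": [4567, 4568], "udp": [4567]}


def iptables_rules(action: str) -> List[List[str]]:
    """Build iptables commands that insert ("-I") or delete ("-D")
    DROP rules for inbound Galera replication traffic."""
    if action not in ("-I", "-D"):
        raise ValueError("action must be '-I' or '-D'")
    rules = []
    for proto, ports in GALERA_PORTS.items():
        for port in ports:
            rules.append(["iptables", action, "INPUT", "-p", proto,
                          "--dport", str(port), "-j", "DROP"])
    return rules


def simulate_outage(seconds: float,
                    run: Callable = subprocess.run,
                    sleep: Callable = time.sleep) -> None:
    """Block replication traffic for a short period, then unblock it so
    the node has to catch up (ideally via IST)."""
    for cmd in iptables_rules("-I"):
        run(cmd, check=True)
    try:
        sleep(seconds)
    finally:
        # Always remove the rules, even if interrupted.
        for cmd in iptables_rules("-D"):
            run(cmd, check=True)
```

Calling `simulate_outage(30)` on one node and then watching wsrep_local_state_comment on that node would show whether it returns to "Synced". The outage length that forces IST rather than SST depends on the cluster's gcache size and write load.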

      Attachments

        1. mariadb.txt (73 kB)
        2. my.cnf (0.4 kB)
        3. screenshot-1.png (76 kB)

          Activity

            ratzpo Rasmus Johansson (Inactive) added a comment -

            Hi jyusb! 10.4.13 comes with Galera library 26.4.4, which includes the fix for MDEV-21002. The fix is in the Galera library.

            I verified that it's in the download from https://mariadb.com/downloads/#mariadb_platform-mariadb_server

            If you use downloads.mariadb.org, it's also there, first in the list: https://downloads.mariadb.org/mariadb/10.4.13/

            We have an ongoing discussion about including release notes from 3rd party components.
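Because the fix ships in the Galera library rather than in the server, it can be useful to check the provider version a node actually loaded. The sketch below is illustrative only. It assumes the version is read from the wsrep_provider_version status variable and that its value starts with a dotted triple such as "26.4.4" (possibly followed by a revision suffix). The comparison is only meaningful for the 26.4 series mentioned in the comment above, and the function names are hypothetical.

```python
import re
from typing import Tuple

# Galera library version that, per the comment above, contains the fix.
FIXED_IN = (26, 4, 4)


def parse_provider_version(value: str) -> Tuple[int, int, int]:
    """Parse the leading "major.minor.patch" from a provider version
    string, ignoring any trailing suffix such as a revision tag."""
    match = re.match(r"\s*(\d+)\.(\d+)\.(\d+)", value)
    if match is None:
        raise ValueError(f"unrecognized provider version: {value!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor, patch)


def has_fix(value: str) -> bool:
    """True if the provider version is at least FIXED_IN."""
    return parse_provider_version(value) >= FIXED_IN
```

For example, `has_fix("26.4.3")` is false and `has_fix("26.4.4")` is true, which matches the reporter's observation that the problem still occurred with Galera 26.4.3.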
            dbart Daniel Bartholomew added a comment - I've updated the MariaDB 10.4.13 release notes
            mihaQ MikaH added a comment -

            Justin, I share your pain. We have had serious issues with 10.4 and were forced to go back to 10.3 (10.3.22). MariaDB needs to put more effort into testing, including performance testing.

            Sorry, but I had to comment...

            rgpublic Ranjan Ghosh added a comment -

            Sorry, I also have to comment, because I think this bug shows quite well a deficiency of the current release model: 10.4.13 has contained the fix for quite some time, but binary packages became available only very recently (apart from the somewhat eerie workaround of mixing in packages from 10.5, which I was a bit reluctant to do on a production server). This means we had to live for months with a bug that brought the whole cluster down and caused me (and, it seems, many others) a lot of gray hairs.

            Perhaps you might want to reevaluate how you ship such fixes in similar cases. I have the feeling (I might be wrong) that if this weren't a cluster/Galera problem, but a bug also affecting normal DB users, it would have justified an emergency fix. Cluster is important and shouldn't be put on the back burner. Perhaps it would be advisable to release some kind of 10.4.12.1 in such urgent cases. Just my 2 cents - sorry for the spam.

            imrejonk Imre Jonk added a comment -

            I experienced this issue with Galera 25.3.32 and MariaDB 10.3: https://bugzilla.redhat.com/show_bug.cgi?id=2072442


            People

              Assignee: jplindst Jan Lindström (Inactive)
              Reporter: jyusb Justin Y
              Votes: 6
              Watchers: 23
