MariaDB Server / MDEV-21002

Galera Cluster Node During IST Goes from "Synced" to "Joining: receiving State transfer" (stuck, requires kill -9)

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 10.4.8, 10.4.9, 10.4.10, 10.4.11
    • Fix Version/s: 10.4.13, 10.5.2
    • Component/s: Galera
    • Labels: None
    • Environment: RHEL 7 on x64 VM. MariaDB from MariaDB repo via Artifactory (not RHEL repo). No docker. 5-node cluster.

    Description

      Summary: Galera appears to have difficulty switching wsrep_local_state_comment from "Joining: receiving State transfer" back to "Synced" after a network slowdown and the subsequent IST. mysqld then becomes unstable and cannot be stopped gracefully.

      Solution that works most of the time: kill -9 the mysqld process on the affected node, delete that node's entire datastore, and re-join the node to the cluster.

      Background: We built a new cluster from scratch using a fresh install of 10.4.8 (and 10.4.9), and imported the data and grants fresh from SQL (no data files were carried over). During brief network outages, a random node switches its wsrep_local_state_comment value from "Synced" to "Joining: receiving State transfer" and stays there. Neither the donor nor the affected node logs any errors, and the sync appears to have completed successfully (no SST rsync processes or any other evidence of a transfer still in progress). Debug logging was unhelpful. MariaDB on the affected node cannot be stopped cleanly; only kill -9 works.
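Because the affected node logs no errors, the only visible symptom is the status value itself. The following is a minimal sketch of how a monitor might flag the condition, assuming the state is sampled periodically (for example with SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment'). The Sample type, the stuck_since helper, and the threshold are hypothetical names chosen for illustration; they are not part of MariaDB or Galera. A legitimately long state transfer would also trip this check, so the threshold has to be chosen per cluster.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# State string quoted in the report.
JOINING = "Joining: receiving State transfer"


@dataclass(frozen=True)
class Sample:
    ts: float   # sample time, in seconds
    state: str  # value of wsrep_local_state_comment at that time


def stuck_since(samples: Iterable[Sample], max_joining_seconds: float) -> Optional[float]:
    """Return the start time of the current uninterrupted JOINING run if it has
    lasted at least max_joining_seconds, otherwise None.

    Samples are assumed to be in chronological order.
    """
    run_start = None
    last_ts = None
    for s in samples:
        if s.state == JOINING:
            if run_start is None:
                run_start = s.ts  # first sample of a new JOINING run
        else:
            run_start = None      # any other state ends the run
        last_ts = s.ts
    if run_start is not None and last_ts - run_start >= max_joining_seconds:
        return run_start
    return None
```

For example, stuck_since([Sample(0, "Synced"), Sample(10, JOINING), Sample(400, JOINING)], 300) reports the run that began at time 10, while a node that returned to "Synced" in its latest sample is not flagged.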

      Troubleshooting: Mixing and matching MariaDB 10.4.8 (older) and Galera 26.4.3 (newer) seemed to reduce how often the problem occurs, but it still happens. SST does not seem to be affected.

      Problem trigger: The problem is triggered by temporary network loss. It is generally reproducible by using iptables to block cluster replication for a short period and then letting the node re-sync via IST.
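The report does not list the exact iptables rules used. As a rough sketch of the reproduction idea (block replication traffic, wait, unblock so the node re-syncs), the helper below only builds the command lines and executes nothing. It assumes Galera's default group-communication port 4567; the port, the function names, and the outage length are illustrative assumptions, not details from the report.

```python
from typing import List, Sequence

# Assumption: 4567 is the Galera default replication port; adjust if the
# cluster is configured differently.
REPLICATION_PORTS = (4567,)


def iptables_rules(action: str, ports: Sequence[int] = REPLICATION_PORTS) -> List[List[str]]:
    """Build iptables argv lists that insert ("-I") or delete ("-D") DROP rules
    for inbound and outbound TCP traffic to the given destination ports."""
    if action not in ("-I", "-D"):
        raise ValueError("action must be '-I' (insert) or '-D' (delete)")
    rules = []
    for port in ports:
        for chain in ("INPUT", "OUTPUT"):
            rules.append(
                ["iptables", action, chain, "-p", "tcp", "--dport", str(port), "-j", "DROP"]
            )
    return rules


def outage_plan(seconds: int) -> List[str]:
    """Return shell lines for: block, wait, unblock. Nothing is executed here."""
    block = [" ".join(rule) for rule in iptables_rules("-I")]
    unblock = [" ".join(rule) for rule in iptables_rules("-D")]
    return block + [f"sleep {seconds}"] + unblock
```

Printing the result of outage_plan(30) gives a short script that can be reviewed before running it as root on a single test node. Whether the node then needs IST or SST depends on how far it fell behind during the outage.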

      Attachments

        1. mariadb.txt (73 kB)
        2. my.cnf (0.4 kB)
        3. screenshot-1.png (76 kB)

          Activity

            jyusb Justin Y created issue -
            jyusb Justin Y made changes -
            Field Original Value New Value
            Description: restored the missing opening quotation mark before "Joining: receiving State transfer" in the last sentence; the text was otherwise unchanged.
            jyusb Justin Y made changes -
            Description: rewritten into labeled Summary, Solution, Background, Troubleshooting, and Problem trigger paragraphs.
            Environment: Original Value: RHEL 7 on x64 VM. MariaDB from MariaDB repo via Artifactory (not RHEL repo). No docker. New Value: RHEL 7 on x64 VM. MariaDB from MariaDB repo via Artifactory (not RHEL repo). No docker. 5-node cluster.
            jyusb Justin Y made changes -
            Description: no change to the wording.
            jyusb Justin Y made changes -
            Description: added "(and subsequent IST)" to the Summary paragraph, added "SST doesn't seem to have an issue." to Troubleshooting, and changed "re-sync" to "IST re-sync" in Problem trigger.
            Summary: Original Value: Galera Cluster Node Goes from "Synced" to "Joining: receiving State transfer" (stuck, requires kill -9). New Value: Galera Cluster Node During IST Goes from "Synced" to "Joining: receiving State transfer" (stuck, requires kill -9).
            jyusb Justin Y made changes -
            elenst Elena Stepanova made changes -
            Fix Version/s 10.4 [ 22408 ]
            Assignee Jan Lindström [ jplindst ]
            jyusb Justin Y made changes -
            Affects Version/s 10.4.10 [ 23907 ]
            jplindst Jan Lindström (Inactive) made changes -
            Labels need_feedback
            bpatterson Brandon Patterson made changes -
            Attachment mariadb.txt [ 50247 ]
            bpatterson Brandon Patterson made changes -
            Attachment my.cnf [ 50248 ]
            jyusb Justin Y made changes -
            Labels need_feedback
            Priority Major [ 3 ] Critical [ 2 ]
            jyusb Justin Y made changes -
            Affects Version/s 10.4.11 [ 24013 ]
            jplindst Jan Lindström (Inactive) made changes -
            issue.field.resolutiondate 2020-02-05 07:51:55.0 2020-02-05 07:51:55.942
            jplindst Jan Lindström (Inactive) made changes -
            Fix Version/s 10.4.13 [ 24223 ]
            Fix Version/s 10.4 [ 22408 ]
            Resolution Fixed [ 1 ]
            Status Open [ 1 ] Closed [ 6 ]
            ngavali Nilesh made changes -
            Attachment screenshot-1.png [ 50746 ]
            dbart Daniel Bartholomew made changes -
            Fix Version/s 10.5.2 [ 24030 ]
            serg Sergei Golubchik made changes -
            Workflow MariaDB v3 [ 100918 ] MariaDB v4 [ 156955 ]
            Shalev Ben Shalev made changes -

            People

              Assignee: jplindst Jan Lindström (Inactive)
              Reporter: jyusb Justin Y
              Votes: 6
              Watchers: 23

              Dates

                Resolved: 2020-02-05 07:51:55
