Details
- Type: Bug
- Status: Closed
- Priority: Critical
- Resolution: Fixed
- Affects Version/s: 10.4.8, 10.4.9, 10.4.10, 10.4.11
- None
- Environment: RHEL 7 on x64 VM. MariaDB from MariaDB repo via Artifactory (not RHEL repo). No docker. 5-node cluster.
Description
Summary: Galera appears to have difficulty switching the value of wsrep_local_state_comment from "Joining: receiving State transfer" back to "Synced" during network slowdowns (and the subsequent IST), after which mysqld becomes unstable (it cannot be stopped gracefully).
Workaround that works most of the time: kill -9 the process, delete the entire datastore on the affected cluster node, and re-join the cluster.
Background: We built a new cluster from scratch using a fresh install of 10.4.8 (and 10.4.9). Data and grants were imported fresh from SQL (no data files were carried over). During some brief network outages, a random node switches its value of wsrep_local_state_comment from "Synced" to "Joining: receiving State transfer" and stays there. There are no errors in the logs of either the donor or the affected node, and the sync appears to have completed successfully (no SST rsync processes or any other evidence of a transfer still in progress). Debug logging was unhelpful. MariaDB on the affected node cannot be stopped cleanly; only kill -9 works.
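The stuck state described above can be detected with a small watchdog. The following is a minimal sketch, not part of the original report: it assumes the status is obtained as tab-separated text (for example from `mysql -N -B -e "SHOW STATUS LIKE 'wsrep_local_state_comment'"`), and the names `parse_state_comment` and `is_stuck_joining` as well as the timeout values are hypothetical.

```python
import time
from typing import Callable

# State strings as quoted in this report.
STUCK_STATE = "Joining: receiving State transfer"
SYNCED_STATE = "Synced"


def parse_state_comment(show_status_output: str) -> str:
    """Extract the value from tab-separated SHOW STATUS output (assumed format)."""
    for line in show_status_output.splitlines():
        name, _, value = line.partition("\t")
        if name.strip() == "wsrep_local_state_comment":
            return value.strip()
    raise ValueError("wsrep_local_state_comment not found in output")


def is_stuck_joining(
    fetch_state: Callable[[], str],
    max_wait_s: float,
    poll_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if the node reports the joining state for all of max_wait_s.

    fetch_state is a caller-supplied function (hypothetical) that returns the
    current wsrep_local_state_comment value of the node being watched.
    """
    deadline = clock() + max_wait_s
    while True:
        if fetch_state() != STUCK_STATE:
            return False  # node left the joining state (e.g. back to "Synced")
        if clock() >= deadline:
            return True  # still joining after the whole waiting period
        sleep(poll_s)
```

A suitable `max_wait_s` depends on how long a normal IST takes on the cluster in question; the report does not give a figure, so any threshold is a local judgment call.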
Troubleshooting: Combining MariaDB 10.4.8 (older) with Galera 26.4.3 (newer) seemed to reduce how often the problem occurs, but it still happens. SST does not seem to be affected.
Problem trigger: The problem is triggered by temporary network loss. It is generally reproducible by using iptables to block cluster replication for a short period of time and then allowing the system to re-sync via IST.
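The exact iptables rules used are not given in this report. As a hedged sketch of the reproduction step, the helper below only builds (and does not execute) the commands to drop and later restore traffic on port 4567, which is Galera's default replication port; the function name `iptables_rules` and the choice of chains are assumptions for illustration, and the port must be adjusted if the cluster uses a non-default one.

```python
from typing import List

# Assumption: the cluster uses Galera's default replication port 4567 (TCP and UDP).
GALERA_REPLICATION_PORT = 4567


def iptables_rules(action: str) -> List[List[str]]:
    """Build iptables commands that block or unblock Galera replication traffic.

    action is "block" (append DROP rules) or "unblock" (delete the same rules).
    The commands are returned as argument lists and are NOT executed here.
    """
    flag = {"block": "-A", "unblock": "-D"}[action]
    rules = []
    for proto in ("tcp", "udp"):
        # INPUT covers connections from peers to this node,
        # OUTPUT covers connections from this node to its peers.
        for chain in ("INPUT", "OUTPUT"):
            rules.append([
                "iptables", flag, chain,
                "-p", proto,
                "--dport", str(GALERA_REPLICATION_PORT),
                "-j", "DROP",
            ])
    return rules


if __name__ == "__main__":
    # Print the commands for review; run them manually as root on a test node.
    for cmd in iptables_rules("block") + iptables_rules("unblock"):
        print(" ".join(cmd))
```

Applying the "block" rules on one node for a short period and then the "unblock" rules should force that node to fall behind and request an IST, which is the situation in which the reporter saw the node stay in "Joining: receiving State transfer".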
Attachments
Issue Links
- relates to MDEV-21008 Node Stuck in joining State (Closed)
Activity
Field | Original Value | New Value |
---|---|---|
Environment | RHEL 7 on x64 VM. MariaDB from MariaDB repo via Artifactory (not RHEL repo). No docker. | RHEL 7 on x64 VM. MariaDB from MariaDB repo via Artifactory (not RHEL repo). No docker. 5-node cluster. |
Summary | Galera Cluster Node Goes from "Synced" to "Joining: receiving State transfer" (stuck, requires kill -9) | Galera Cluster Node During IST Goes from "Synced" to "Joining: receiving State transfer" (stuck, requires kill -9) |
Link | This issue relates to |
Fix Version/s | 10.4 [ 22408 ] | |
Assignee | Jan Lindström [ jplindst ] |
Affects Version/s | 10.4.10 [ 23907 ] |
Labels | need_feedback |
Attachment | mariadb.txt [ 50247 ] |
Attachment | my.cnf [ 50248 ] |
Labels | need_feedback | |
Priority | Major [ 3 ] | Critical [ 2 ] |
Affects Version/s | 10.4.11 [ 24013 ] |
issue.field.resolutiondate | 2020-02-05 07:51:55.0 | 2020-02-05 07:51:55.942 |
Fix Version/s | 10.4.13 [ 24223 ] | |
Fix Version/s | 10.4 [ 22408 ] | |
Resolution | Fixed [ 1 ] | |
Status | Open [ 1 ] | Closed [ 6 ] |
Attachment | screenshot-1.png [ 50746 ] |
Fix Version/s | 10.5.2 [ 24030 ] |
Workflow | MariaDB v3 [ 100918 ] | MariaDB v4 [ 156955 ] |
Link | This issue includes |
Link | This issue includes MDEV-30888 [ MDEV-30888 ] |
Link | This issue includes |