[MDEV-24294] MariaDB - Cluster freezes if node hangs Created: 2020-11-26 Updated: 2022-12-16 Resolved: 2021-10-30
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Galera |
| Affects Version/s: | 10.5.8 |
| Fix Version/s: | 10.5.13 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Malte Bastian | Assignee: | Seppo Jaakola |
| Resolution: | Fixed | Votes: | 11 |
| Labels: | crash, failover_issues, galera, galera_4, hang |
| Environment: | Ubuntu 20.04 LTS |
| Issue Links: | |
| Description |
Currently I have a recurring problem. Our database cluster, consisting of three nodes, fails almost daily. The reason is, repeatedly, that one of the three nodes hangs and thereby somehow hangs the whole cluster. But we run the cluster precisely to protect us against failures.

The problem behaves in such a way that every connection attempt times out. I connect via ssh to each of the nodes and execute the command "mariadb" or "mysql". So far it has always been the case that the command worked on 2 of 3 nodes; one node (the hanging one) does not respond. If I now restart the hanging node via "reboot -f", the cluster is healthy again after a few seconds. A reboot without "-f" does not work because the MariaDB service cannot be stopped. Even after several hours the frozen node is not removed from the cluster. So far the first node has hung once and the third node twice. Each time the whole cluster was no longer usable.

The command "mysqlcheck -A -e" displays "OK" for all tables, so I hope that none of them are corrupted. Before we upgraded to version 10.5.8, we did not have this problem. I don't know whether this problem is related to the new version, so I'm reporting it here.

We have two tables with 3 to 5 million records. The other tables (about 10 more) have 1 to 60,000 records. The database is accessed about 20-100 times a second. I'm desperate about this, because the database has always been very stable. Does anyone have an idea?

Following is the configuration: the innodb_buffer_pool_size is set to 22G and max_connections to 800 (up to now, a maximum of 120 were used simultaneously).
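The per-node check described above (ssh to each node and see whether the client answers) can be scripted. This is a hypothetical sketch, not from the ticket: the host argument and the 5-second budget are assumptions, and `timeout(1)` does the actual hang detection by killing a client that does not return in time.

```shell
#!/bin/sh
# Hypothetical health probe (not from the ticket): treat a node as hung
# when a trivial query does not complete within 5 seconds.
probe_node() {
    # timeout(1) kills the client and exits with status 124 on timeout
    timeout 5 mysql -h "$1" -e "SELECT 1;" >/dev/null 2>&1
}

# The detection relies only on timeout(1)'s exit status; demonstrated
# here with sleep standing in for a frozen mysql client:
timeout 1 sleep 5
echo "exit=$?"   # 124 signals "did not answer in time"
```

A cron job looping `probe_node` over the three nodes would at least turn the daily manual hunt for the hanging node into an alert.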
| Comments |
| Comment by Renaud Keradac [ 2020-12-23 ] |
We have exactly the same issue since updating to MariaDB 10.3.27 (Ubuntu 18.04). For now, the only workaround we have found is to restart the MariaDB daemon every week... Here is the output of the log while it is happening: LOG
There was no issue on version 10.3.23.
| Comment by Renaud Keradac [ 2020-12-23 ] |
Maybe linked to #
| Comment by Renaud Keradac [ 2020-12-30 ] |
I've just rolled back to v10.3.23 in order to confirm there is no issue there and that the bug is only related to the MariaDB update; will keep you posted.
| Comment by Renaud Keradac [ 2021-01-26 ] |
I confirm the issue is not reproduced in version 10.3.23. There is a regression between 10.3.23 and 10.3.27.
| Comment by Elena Stepanova [ 2021-01-31 ] |
Which part have you determined to be a regression: the initial hang / long semaphore wait on one node, or the eventual lock-up of the whole cluster due to a single node hanging?
| Comment by Malte Bastian [ 2021-02-02 ] |
As a regression, I would say that the entire cluster hangs for several hours until the causing node is identified and rebooted. My wish here is that the Galera cluster, which is used for high availability, would remove this node from the cluster by itself to restore the availability of the database.
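Galera does ship an auto-eviction mechanism aimed at exactly this wish: a node that repeatedly responds late can be evicted by the rest of the cluster without operator intervention. Whether it would fire for this particular hang is unclear. A hypothetical my.cnf fragment, with illustrative values that are not from the ticket:

```ini
# Illustrative tuning only, not from the ticket.
# evs.auto_evict=N evicts a node after it has been registered as
# delayed N times; evs.delayed_margin controls how much lag counts
# as "delayed". Values below are assumptions for the sketch.
[mysqld]
wsrep_provider_options="evs.auto_evict=5;evs.delayed_margin=PT1S;evs.suspect_timeout=PT5S"
```

Note that auto-eviction protects against a slow or flapping node at the group-communication level; as later comments suggest, this particular hang appears to sit in conflict resolution, which these settings may not reach.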
| Comment by Iosif Peterfi [ 2021-03-22 ] |
I can confirm the same is happening with 10.4.18.
| Comment by Renaud Keradac [ 2021-03-22 ] |
There is no "long semaphore wait" on version 10.3.23. We can probably consider that the two issues (long semaphore wait & cluster hang) are linked, but we cannot be sure since both appeared together.
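The "long semaphore wait" referenced here is an InnoDB warning printed to the error log, so checking whether a given version exhibits it can be done with a grep over the log. A generic diagnostic sketch, not something the commenters ran; the sample log line below is synthetic and the log path is an assumption:

```shell
#!/bin/sh
# Generic diagnostic (not from the ticket): count InnoDB
# "long semaphore wait" warnings in an error log.
# A synthetic sample line stands in for a real error log here:
cat > /tmp/mariadb-error.sample <<'EOF'
2020-12-23 10:00:00 0 [Warning] InnoDB: A long semaphore wait: ...
EOF
grep -c "long semaphore wait" /tmp/mariadb-error.sample
```

On a real node the same grep would be pointed at the server's error log (location depends on the distribution's configuration).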
| Comment by Florian Bezdeka [ 2021-03-26 ] |
After updating from 10.4.17 to 10.4.18 last week, I have run into this problem twice now: a complete cluster hang. My guess is that the fix for
| Comment by Kóczán Ákos [ 2021-04-14 ] |
We have the same issue with 10.4.18, almost daily, sometimes twice a day. Does anyone have a solution or workaround (downgrade, upgrade)?
| Comment by Matt Le Fevre [ 2021-04-14 ] |
Possibly related to https://jira.mariadb.org/browse/MDEV-25368
| Comment by Daan van Gorkum [ 2021-05-21 ] |
Is this problem resolved in 10.4.19? I do see mentions of lock-related bugs that are already resolved, but this ticket is still open with no recent updates. We're planning our upgrade from 10.4.13, but we currently do not see a version that seems stable enough for production.
| Comment by Seppo Jaakola [ 2021-09-22 ] |
As pointed out in earlier comments,
BC-M reports this issue with 10.5.8, so it should not be affected by
| Comment by Luke Cousins [ 2021-09-22 ] |
Thanks Seppo. Does this mean that you're confident that 10.6.x is not affected, or might it also be affected?
| Comment by Ers Sein [ 2021-09-23 ] |
Just wanted to add a +1, as we have seen multiple environments in production where the cluster completely hangs. It doesn't happen often, at most every other week or so, and not in all environments, but regardless it causes a lot of frustration to have to reboot machines and bootstrap to get the cluster going again. This started after we upgraded from 10.4.17 (to 10.4.18 and 10.4.20 so far).
| Comment by Seppo Jaakola [ 2021-10-25 ] |
violuke erwin_se: 10.6 has refactored high-priority transaction conflict resolution and is not affected by
| Comment by Rob [ 2021-10-29 ] |
Is it recommended to downgrade to 10.4.17?
| Comment by Jan Lindström (Inactive) [ 2021-10-30 ] |
commit ef2dbb8dbc3ee42b59adcd2ee4b9967ff55867a1