[MDEV-24294] MariaDB - Cluster freezes if node hangs Created: 2020-11-26  Updated: 2022-12-16  Resolved: 2021-10-30

Status: Closed
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.5.8
Fix Version/s: 10.5.13

Type: Bug Priority: Critical
Reporter: Malte Bastian Assignee: Seppo Jaakola
Resolution: Fixed Votes: 11
Labels: crash, failover_issues, galera, galera_4, hang
Environment:

Ubuntu 20.04 LTS,
MariaDB 10.5.8,
3 nodes, each with 8 CPU cores, 32 GB RAM and SSD


Issue Links:
Problem/Incident
is caused by MDEV-25114 Crash: WSREP: invalid state ROLLED_BA... Closed
Relates
relates to MDEV-25048 semaphore has too many locks Closed
relates to MDEV-25368 Galera cluster hangs on Freeing items Closed

 Description   

Currently I have a recurring problem. Our database cluster, consisting of three nodes, fails almost daily. The cause is the same each time: one of the three nodes hangs and thereby somehow hangs the whole cluster. But we run the cluster precisely to protect ourselves against failures.

The problem manifests as every connection attempt timing out. I connect via SSH to each of the nodes and run the command "mariadb" or "mysql". So far the command has always worked on two of the three nodes, while one node (the hanging one) does not respond. If I then restart the hanging node via "reboot -f", the cluster is healthy again after a few seconds.

A reboot without "-f" does not work because the MariaDB service cannot be stopped. Even after several hours the frozen node is not removed from the cluster.
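
For reference, a quick way to see the cluster state from the two responsive nodes is to query the wsrep status variables (a minimal sketch; the expected values assume a healthy three-node cluster, and on the hung node these queries block as well, which is a signal in its own right):

SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';       -- expect 'Primary'
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';         -- expect 3 while all nodes are joined
SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';  -- expect 'Synced'
SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_paused';  -- values near 1.0 mean replication is stalled by flow control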

So far the first node has hung once and the third node twice. Each time the whole cluster was no longer usable.

The command "mysqlcheck -A -e" displays "OK" for all tables. So i hope that no one is corrupted.

Before we upgraded to version 10.5.8, we did not have this problem. I don't know whether the problem is related to the new version, so I'm reporting it here.

We have two tables with 3 to 5 million records each. The other tables (about 10 more) have between 1 and 60,000 records. The database is accessed about 20-100 times per second.

I'm desperate about this, because the database has always been very stable.

Does anyone have an idea?

The configuration follows:

The innodb_buffer_pool_size is set to 22G and max_connections to 800 (so far, at most 120 have been in use simultaneously).
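
These values can be double-checked on a running node, together with the actual peak usage (standard variables; the expected values simply mirror this configuration):

SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_size';  -- expect 23622320128 (22 GiB)
SHOW GLOBAL VARIABLES LIKE 'max_connections';          -- expect 800
SHOW GLOBAL STATUS LIKE 'Max_used_connections';        -- peak simultaneous connections since startup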

# /etc/mysql/mariadb.conf.d/60-galera.cnf
#
# * Galera-related settings
#
# See the examples of server wsrep.cnf files in /usr/share/mysql
 
[galera]
# Mandatory settings
wsrep_on=ON
wsrep_provider=/usr/lib/galera/libgalera_smm.so
wsrep_cluster_address="gcomm://10.0.0.3,10.0.0.4,10.0.0.2"
binlog_format=row
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2
 
# Allow server to accept connections on all interfaces.
bind-address=0.0.0.0
 
# Optional settings
#wsrep_slave_threads=1
#innodb_flush_log_at_trx_commit=0
 
wsrep_cluster_name="mariadb-galera-cluster"
wsrep_sst_method=rsync
 
# Cluster node configuration
wsrep_node_address="10.0.0.3"
wsrep_node_name="db-1"



 Comments   
Comment by Renaud Keradac [ 2020-12-23 ]

We have exactly the same issue since updating to MariaDB 10.3.27 (Ubuntu 18.04).
Nodes hang after about 10 days; a single node locks up the whole cluster. Incoming connections never connect while the cluster is down.
If we force-kill (kill -9) MariaDB on the blocked node, the cluster comes back immediately.

For now, the only workaround we have found is to restart the MariaDB daemon every week...

Here is the log output from while it was happening:

LOG

2020-12-22  6:25:57 52775238 [Warning] Aborted connection 52775238 to db: '*' user: '*' host: '192.168.*' (Got an error reading communication packets)
2020-12-22  6:29:10 52784139 [Warning] Aborted connection 52784139 to db: '*' user: '*' host: '192.168.*' (Got an error reading communication packets)
2020-12-22  6:41:46 52792616 [Warning] Aborted connection 52792616 to db: '*' user: '*' host: '192.168.*' (Got timeout reading communication packets)
2020-12-22  6:52:40 52855824 [Warning] Aborted connection 52855824 to db: '*' user: '*' host: '192.168.*' (Got an error reading communication packets)
2020-12-22  7:18:29 0 [Warning] InnoDB: A long semaphore wait:
--Thread 140389075212032 has waited at btr0cur.cc line 1357 for 241.00 seconds the semaphore:
SX-lock on RW-latch at 0x7fabd0010a20 created in file dict0dict.cc line 2130
a writer (thread id 140377167730432) has reserved it in mode  SX
number of readers 0, waiters flag 1, lock_word: 10000000
Last time write locked in file dict0stats.cc line 1969
2020-12-22  7:18:29 0 [Note] InnoDB: A semaphore wait:
--Thread 140389075212032 has waited at btr0cur.cc line 1357 for 241.00 seconds the semaphore:
SX-lock on RW-latch at 0x7fabd0010a20 created in file dict0dict.cc line 2130
a writer (thread id 140377167730432) has reserved it in mode  SX
number of readers 0, waiters flag 1, lock_word: 10000000
Last time write locked in file dict0stats.cc line 1969
InnoDB: ###### Starts InnoDB Monitor for 30 secs to print diagnostic info:
InnoDB: Pending reads 1, writes 0
 
=====================================
2020-12-22 07:18:40 0x7fac1bb7a700 INNODB MONITOR OUTPUT
=====================================
Per second averages calculated from the last 65 seconds
-----------------
BACKGROUND THREAD
-----------------
srv_master_thread loops: 940644 srv_active, 0 srv_shutdown, 218 srv_idle
srv_master_thread log flush and writes: 940861
----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 11449434
--Thread 140389075212032 has waited at btr0cur.cc line 1357 for 252.00 seconds the semaphore:
SX-lock on RW-latch at 0x7fabd0010a20 created in file dict0dict.cc line 2130
a writer (thread id 140377167730432) has reserved it in mode  SX
number of readers 0, waiters flag 1, lock_word: 10000000
Last time write locked in file dict0stats.cc line 1969
OS WAIT ARRAY INFO: signal count 62871945
RW-shared spins 108130280, rounds 230713571, OS waits 2501362
RW-excl spins 54211076, rounds 272419071, OS waits 6357534
RW-sx spins 6559785, rounds 30314820, OS waits 464584
Spin rounds per wait: 2.13 RW-shared, 5.03 RW-excl, 4.62 RW-sx
------------------------
LATEST FOREIGN KEY ERROR

There was no issue on version 10.3.23
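
The latch in the trace above was last write-locked in dict0stats.cc, which is InnoDB's persistent-statistics code. If the wait reappears, a sketch of what could be captured from a still-responsive session (standard statements, though on a fully hung node they may block too):

SHOW ENGINE INNODB STATUS\G                    -- full SEMAPHORES and TRANSACTIONS sections
SELECT * FROM information_schema.INNODB_TRX;   -- long-running transactions holding locks
SHOW FULL PROCESSLIST;                         -- threads stuck in the same state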

Comment by Renaud Keradac [ 2020-12-23 ]

Maybe linked to #MDEV-22890

Comment by Renaud Keradac [ 2020-12-30 ]

I've just rolled back to v10.3.23 in order to confirm there is no issue there and that the bug is related only to the MariaDB update; I will keep you posted.

Comment by Renaud Keradac [ 2021-01-26 ]

I confirm the issue is not reproduced in version 10.3.23. There is a regression between 10.3.23 and 10.3.27.

Comment by Elena Stepanova [ 2021-01-31 ]

Which part have you determined to be a regression: the initial hang / long semaphore wait on one node, or the eventual lock-up of the whole cluster due to a single node hanging?

Comment by Malte Bastian [ 2021-02-02 ]

The regression, I would say, is that the entire cluster hangs for several hours until the causing node is identified and rebooted. I would expect the Galera cluster, which we use for high availability, to remove such a node from the cluster by itself in order to restore the availability of the database.
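
For what it's worth, Galera's group communication layer does expose eviction tunables; the sketch below uses illustrative values, and whether each option can be changed at runtime (rather than only via the config file and a restart) depends on the Galera version. Note also that a node whose MariaDB layer hangs but whose Galera layer still answers may never be suspected at all, which would match the behavior described here:

SET GLOBAL wsrep_provider_options =
  'evs.suspect_timeout=PT10S; evs.inactive_timeout=PT30S; evs.auto_evict=5';
-- evs.suspect_timeout:  how long a silent node is tolerated before being suspected
-- evs.inactive_timeout: hard limit after which a silent node is declared inactive
-- evs.auto_evict:       delayed-list entries after which a node is evicted permanently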

Comment by Iosif Peterfi [ 2021-03-22 ]

I can confirm the same is happening with 10.4.18

Comment by Renaud Keradac [ 2021-03-22 ]

Quoting Malte Bastian: "As a regression, I would say that the entire cluster hangs for several hours until the causing node is determined and rebooted. I wish here that the Galera cluster, which is used for high availability, removes this node from the cluster by itself to restore the availability of the database."

There is no "long semaphore wait" on version 10.3.23. We can probably consider that the two issues (long semaphore & cluster hang) are linked, but we cannot be sure, since both appeared together.

Comment by Florian Bezdeka [ 2021-03-26 ]

After updating from 10.4.17 to 10.4.18 last week, I have run into this problem twice now: a complete cluster hang. My guess is that the fix for MDEV-23328 introduced a regression.

Comment by Kóczán Ákos [ 2021-04-14 ]

We have the same issue with 10.4.18, almost daily, sometimes twice a day. Does anyone have a solution or workaround, whether by downgrading or upgrading?

Comment by Matt Le Fevre [ 2021-04-14 ]

Possibly related to https://jira.mariadb.org/browse/MDEV-25368

Comment by Daan van Gorkum [ 2021-05-21 ]

Is this problem resolved in 10.4.19? I do see mentions of lock-related bugs that have already been resolved, but this ticket is still open and has no recent updates. We're planning our upgrade from 10.4.13, but we currently do not see a version that seems stable enough for production.

Comment by Seppo Jaakola [ 2021-09-22 ]

As pointed out in earlier comments, MDEV-23328 has caused a regression with various symptoms, but it should only affect the following versions:

  • 10.3.28 and later
  • 10.4.18 and later
  • 10.5.9 and later

BC-M reports this issue with 10.5.8, so it should not be affected by MDEV-23328. Please attach error logs from the time of the problem for analysis.
rkc's issue could be the same as what Malte reports; please attach logs for this issue as well.
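
For anyone collecting the requested logs, the error-log path can be read from the server itself (a standard variable; an empty value means errors go to stderr, typically captured by the systemd journal):

SHOW GLOBAL VARIABLES LIKE 'log_error';  -- path of the error log to attach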

Comment by Luke Cousins [ 2021-09-22 ]

Thanks Seppo. Does this mean that you're confident that 10.6.x is not affected, or might it also be affected?

Comment by Ers Sein [ 2021-09-23 ]

Just wanted to add a +1, as we have seen multiple production environments where the cluster completely hangs. It doesn't happen often, at most every other week or so, and not in all environments, but it is nonetheless very frustrating to have to reboot machines and bootstrap to get the cluster going again. This started after we upgraded from 10.4.17 (to 10.4.18 and 10.4.20 so far).

Comment by Seppo Jaakola [ 2021-10-25 ]

violuke, erwin_se: 10.6 has refactored high-priority transaction conflict resolution and is not affected by MDEV-23328.

Comment by Rob [ 2021-10-29 ]

Is it recommended to downgrade to 10.4.17?

Comment by Jan Lindström (Inactive) [ 2021-10-30 ]

commit ef2dbb8dbc3ee42b59adcd2ee4b9967ff55867a1
