[MDEV-27689] Node hangs and complete galera cluster freezes Created: 2022-01-31  Updated: 2023-03-27

Status: Open
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.5.8, 10.5.12, 10.6.5
Fix Version/s: 10.5, 10.6

Type: Bug
Priority: Major
Reporter: Ulrich Abelmann
Assignee: Julius Goryavsky
Resolution: Unresolved
Votes: 2
Labels: None
Environment:

3 nodes, each with: Debian 10.11, 128 GB RAM, 32 CPUs, ~500 tables, ~420 GB data, largest table with ~500,000,000 rows


Issue Links:
Relates
relates to MDEV-29346 update_rows_log_event hung causing ga... Stalled
relates to MDEV-30418 Setting wsrep_slave_threads causes th... Stalled

 Description   

Every few days or weeks, one of our 3 nodes hangs and freezes the entire Galera cluster. It is somewhat like MDEV-24294, but with some differences.

When one of the nodes hangs like this, I can still connect via SSH, but I cannot use the mysql client to get information about the server and the wsrep variables: I run the client command and nothing happens, and the MariaDB console never opens. No log lines are written to /var/log/syslog (although it is usually chatty, for example when a node connects). I cannot stop the service (service mariadb stop). When I kill the service, the cluster comes back online and works fine. When I start the service again, the node synchronizes immediately and we have no more problems for one to three weeks.

It is not always the same node; each of the nodes has this problem once in a while. It usually happens at night, but not at the same time (sometimes 9pm, sometimes 4am), and load, memory usage and network traffic are all low when it happens. The MariaDB server process also shows very little load while the client hangs. Today it was about 0.7%. No other process shows a higher load.

We have been running this cluster for a couple of years now and have gone through quite a few MariaDB versions. The performance is very good. We have had this particular problem since 10.5.8 and still have it now with 10.6.5.

I have no clue how to investigate this further or how to solve it. I would expect the cluster to exclude the hanging node and resume normal operations until that node rejoins. Having no hanging node at all would be even better.
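Since the hung node gives no sign of trouble in the logs, one way to notice the state early might be an external probe. The following is only a minimal sketch under assumptions: it relies on a healthy MariaDB server sending its greeting packet right after a TCP connection is established, and it assumes that a hung node still lets the connection be established but stays silent (not verified for this bug). The function name `node_responds` and the default timeout are hypothetical.

```python
import socket


def node_responds(host: str, port: int = 3306, timeout: float = 5.0) -> bool:
    """Return True if the server sends any greeting bytes within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            # A healthy server speaks first; a hung one may accept the
            # connection at the TCP level and then never send anything.
            return len(conn.recv(4)) > 0
    except OSError:
        # Covers timeouts, refused connections and unreachable hosts.
        return False
```

Such a probe could be run from cron against each node address, for example `node_responds("10.200.0.6")`, so that a silent node is noticed before the whole cluster has been frozen for long.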

This is our configuration:

[client]
port            = 3306
socket          = /var/run/mysqld/mysqld.sock
 
[mysqld_safe]
socket          = /var/run/mysqld/mysqld.sock
nice            = 0
 
[mysqld]
user            = mysql
pid-file        = /var/run/mysqld/mysqld.pid
socket          = /var/run/mysqld/mysqld.sock
port            = 3306
basedir         = /usr
datadir         = /var/lib/mysql
tmpdir=/data/tmp
lc_messages_dir = /usr/share/mysql
lc_messages     = en_US
skip-external-locking
 
character-set-server=utf8
collation-server=utf8_general_ci
 
bind-address=0.0.0.0
 
max_connections         = 750
connect_timeout         = 5
wait_timeout            = 10000
interactive_timeout     = 10000
max_allowed_packet      = 1073741824
thread_cache_size       = 128
sort_buffer_size        = 4M
bulk_insert_buffer_size = 16M
tmp_table_size          = 64M
max_heap_table_size     = 64M
 
# MyIsam
myisam_recover_options = BACKUP
key_buffer_size         = 128M
#open-files-limit       = 2000
table_open_cache        = 2000
myisam_sort_buffer_size = 512M
concurrent_insert       = 2
read_buffer_size        = 2M
read_rnd_buffer_size    = 1M
 
# Query Cache
query_cache_limit               = 256K
query_cache_size=0
query_cache_type=0
 
# Logging
log_warnings            = 2
slow_query_log          = 1
slow_query_log_file     = /var/log/mysql/mariadb-slow.log
long_query_time         = 2
log_slow_verbosity      = query_plan
log_queries_not_using_indexes = 0
 
log_bin                 = /data/log/mysql/mariadb-bin
log_bin_index           = /data/log/mysql/mariadb-bin.index
binlog_expire_logs_seconds        = 10000
max_binlog_size         = 100M
binlog_format=ROW
 
#Engine
sql_mode                = NO_ENGINE_SUBSTITUTION
 
#InnoDB
default-storage-engine=innodb
innodb_autoinc_lock_mode=2
innodb_log_file_size    = 5G
innodb_buffer_pool_size = 56G
innodb_flush_log_at_trx_commit = 0
innodb_log_buffer_size  = 512M
innodb_file_per_table   = 1
innodb_open_files       = 10000
innodb_flush_method     = O_DIRECT
innodb_table_locks      = 0
innodb_lock_wait_timeout= 300
skip-innodb-doublewrite
 
[galera]
bind-address=0.0.0.0
wsrep_on=ON
wsrep_provider=/usr/lib/galera/libgalera_smm.so
wsrep_cluster_name="my_wsrep_cluster"
wsrep_cluster_address=gcomm://10.200.0.7,10.200.0.6,10.200.0.9
wsrep_node_address="10.200.0.6"
wsrep_node_name="node_3"
wsrep_sst_donor="node_2,node_3,"
wsrep_sst_method=mariabackup
wsrep_sst_auth=XXXXX
wsrep_slave_threads=4
wsrep_provider_options="gcache.size=2G"
 
[mysqldump]
quick
quote-names
max_allowed_packet      = 16M
 
[mysql]
 
 
[isamchk]
key_buffer              = 16M
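
A quick consistency check of the [galera] section above can be sketched as follows. This is an illustration only, not a diagnosis of the hang: the function name `check_galera_section` and both warning texts are hypothetical, and the two checks are simply plausible sanity checks chosen for this sketch.

```python
import configparser


def check_galera_section(cnf_text: str) -> list[str]:
    """Return human-readable warnings about the [galera] section of a my.cnf."""
    parser = configparser.ConfigParser(
        allow_no_value=True,  # bare flags such as skip-external-locking
        strict=False,         # tolerate repeated options or sections
        interpolation=None,   # treat '%' literally
    )
    parser.read_string(cnf_text)
    galera = parser["galera"]

    def value(key: str) -> str:
        # Strip whitespace and the optional double quotes around values.
        return (galera.get(key) or "").strip().strip('"')

    address = value("wsrep_cluster_address").removeprefix("gcomm://")
    members = [m for m in address.split(",") if m]
    donors = [d for d in value("wsrep_sst_donor").split(",") if d]

    warnings = []
    if value("wsrep_node_address") not in members:
        warnings.append("wsrep_node_address is not listed in wsrep_cluster_address")
    if value("wsrep_node_name") in donors:
        warnings.append("node lists itself in wsrep_sst_donor")
    return warnings
```

Applied to the configuration above, the second check would fire, because wsrep_node_name is "node_3" and wsrep_sst_donor also contains node_3. Whether that has any bearing on the hang is unknown.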



 Comments   
Comment by Khai Ping [ 2022-09-05 ]

@ulrich, are you seeing something similar to what we reported in MDEV-29346?

Generated at Thu Feb 08 09:54:49 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.