Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Incomplete
-
10.5.8, 10.5.12, 10.6.5
-
None
-
3 nodes, all of them: Debian 10.11, 128 GB RAM, 32 CPUs, ~500 tables, ~420 GB data, biggest table with ~500,000,000 rows
Description
Every few days or weeks, one of our 3 nodes hangs and freezes the entire Galera cluster. It is somewhat similar to MDEV-24294, but with some differences.
When one of the nodes hangs like this, I can still connect via SSH, but I cannot use the mysql client to get information about the server or the wsrep variables: I run the client command and nothing happens, and the MariaDB console never opens. No log lines are written to /var/log/syslog (which is usually chatty, for example when a node connects). I cannot stop the service (service mariadb stop). When I kill the service, the cluster comes back online and works fine. When I start the service again, the node synchronizes immediately and we have no more problems for one to three weeks.
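Because the client itself blocks in this state, any automated check needs its own deadline. The following is a minimal sketch, not part of the original report: it runs a status query through the `mysql` command-line client under a hard timeout and classifies the outcome. The function name `probe_node`, the 10-second deadline, and the choice of query are assumptions made for illustration.

```python
import subprocess

# Hypothetical default check: ask the local node for its Galera state.
DEFAULT_CMD = ["mysql", "--connect-timeout=5", "-N", "-B", "-e",
               "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment'"]


def probe_node(cmd=DEFAULT_CMD, timeout_s=10.0):
    """Run cmd with a deadline; return ("ok" | "error" | "hung", detail)."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # The client neither answered nor failed within the deadline,
        # which matches the symptom described above.
        return "hung", ""
    except FileNotFoundError:
        return "error", "command not found"
    if result.returncode != 0:
        return "error", result.stderr.strip()
    return "ok", result.stdout.strip()
```

Run periodically on each node, a check like this could at least record when a node stops answering, since the server writes nothing to syslog in that state.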
It is not always the same node. Each of the nodes has this problem once in a while. It usually happens at night but not at the same time (sometimes 9pm, sometimes 4am), and there is little load, memory use, or network traffic when it happens. The MariaDB server process also does not have a lot of load when I try to run the client and nothing happens. Today it was about 0.7%. And there are no other processes with more load.
We have been running this cluster for a couple of years now and have gone through quite a few MariaDB versions. The performance is very good. We have had this particular problem since 10.5.8 and still have it now with 10.6.5.
I have no clue how to further investigate or solve this problem. I would expect the cluster to exclude the hanging node and resume normal operations until that node joins again. Having no hanging node at all would be even better.
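One common way to investigate a hang like this is to capture a stack trace of every server thread before killing the process, so that the trace can be attached to the report. The following is a minimal sketch, assuming `gdb` is installed on the node, that the script runs with sufficient privileges to attach to the server, and that the PID file path from the configuration in this report applies. The helper names and the output file name are hypothetical.

```python
import subprocess
from pathlib import Path


def read_pid(pid_file="/var/run/mysqld/mysqld.pid"):
    """Return the server PID stored in the PID file."""
    return int(Path(pid_file).read_text().strip())


def backtrace_command(pid):
    """Build a gdb invocation that dumps all thread stacks and exits."""
    return ["gdb", "--batch", "-p", str(pid),
            "-ex", "set pagination off",
            "-ex", "thread apply all backtrace"]


def dump_backtraces(pid, out_path="mariadbd_backtrace.txt", timeout_s=120):
    """Write the stack traces of the running process to out_path."""
    with open(out_path, "w") as out:
        subprocess.run(backtrace_command(pid), stdout=out,
                       stderr=subprocess.STDOUT, timeout=timeout_s,
                       check=False)
    return out_path
```

Calling `dump_backtraces(read_pid())` on the hung node before killing the service would leave a text file showing where each thread is waiting. The traces are much more readable when debug symbols for the server are installed.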
This is our configuration:
[client]
port = 3306
socket = /var/run/mysqld/mysqld.sock

[mysqld_safe]
socket = /var/run/mysqld/mysqld.sock
nice = 0

[mysqld]
user = mysql
pid-file = /var/run/mysqld/mysqld.pid
socket = /var/run/mysqld/mysqld.sock
port = 3306
basedir = /usr
datadir = /var/lib/mysql
tmpdir=/data/tmp
lc_messages_dir = /usr/share/mysql
lc_messages = en_US
skip-external-locking

character-set-server=utf8
collation-server=utf8_general_ci

bind-address=0.0.0.0

max_connections = 750
connect_timeout = 5
wait_timeout = 10000
interactive_timeout = 10000
max_allowed_packet = 1073741824
thread_cache_size = 128
sort_buffer_size = 4M
bulk_insert_buffer_size = 16M
tmp_table_size = 64M
max_heap_table_size = 64M

# MyIsam
myisam_recover_options = BACKUP
key_buffer_size = 128M
#open-files-limit = 2000
table_open_cache = 2000
myisam_sort_buffer_size = 512M
concurrent_insert = 2
read_buffer_size = 2M
read_rnd_buffer_size = 1M

# Query Cache
query_cache_limit = 256K
query_cache_size=0
query_cache_type=0

# Logging
log_warnings = 2
slow_query_log = 1
slow_query_log_file = /var/log/mysql/mariadb-slow.log
long_query_time = 2
log_slow_verbosity = query_plan
log_queries_not_using_indexes = 0

log_bin = /data/log/mysql/mariadb-bin
log_bin_index = /data/log/mysql/mariadb-bin.index
binlog_expire_logs_seconds = 10000
max_binlog_size = 100M
binlog_format=ROW

#Engine
sql_mode = NO_ENGINE_SUBSTITUTION

#InnoDB
default-storage-engine=innodb
innodb_autoinc_lock_mode=2
innodb_log_file_size = 5G
innodb_buffer_pool_size = 56G
innodb_flush_log_at_trx_commit = 0
innodb_log_buffer_size = 512M
innodb_file_per_table = 1
innodb_open_files = 10000
innodb_flush_method = O_DIRECT
innodb_table_locks = 0
innodb_lock_wait_timeout= 300
skip-innodb-doublewrite

[galera]
bind-address=0.0.0.0
wsrep_on=ON
wsrep_provider=/usr/lib/galera/libgalera_smm.so
wsrep_cluster_name="my_wsrep_cluster"
wsrep_cluster_address=gcomm://10.200.0.7,10.200.0.6,10.200.0.9
wsrep_node_address="10.200.0.6"
wsrep_node_name="node_3"
wsrep_sst_donor="node_2,node_3,"
wsrep_sst_method=mariabackup
wsrep_sst_auth=XXXXX
wsrep_slave_threads=4
wsrep_provider_options="gcache.size=2G"

[mysqldump]
quick
quote-names
max_allowed_packet = 16M

[mysql]

[isamchk]
key_buffer = 16M
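When comparing the three nodes, it can help to extract just the Galera settings from each node's option file and diff them. The sketch below is an illustration only: it reads an option file in the format shown above with Python's `configparser`. The function name `galera_options` is an assumption, and the sketch does not handle `!include` or `!includedir` directives.

```python
import configparser


def galera_options(text, section="galera"):
    """Return the wsrep_* options of one section as a plain dict."""
    parser = configparser.ConfigParser(
        allow_no_value=True,   # flags such as skip-external-locking have no value
        interpolation=None,    # treat '%' literally
        strict=False,          # tolerate repeated options
        delimiters=("=",),     # keep ':' inside values such as gcomm://...
    )
    parser.read_string(text)
    if not parser.has_section(section):
        return {}
    return {
        key: (value or "").strip('"')   # drop the surrounding double quotes
        for key, value in parser.items(section)
        if key.startswith("wsrep_")
    }
```

Running `galera_options` on the contents of each node's option file gives three dictionaries that would be expected to differ mainly in node-specific values such as `wsrep_node_address` and `wsrep_node_name`. Any other difference is worth a closer look.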
Issue Links
relates to
- MDEV-29346 update_rows_log_event hung causing galera cluster failure (Closed)
- MDEV-30418 Setting wsrep_slave_threads causes thread hang (In Review)