Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 10.3.14
Environment: CentOS 7.6.1810, packages from http://yum.mariadb.org/10.3/centos; the server has ample CPU (load is always low) and SSD disks
Description
We have been running a Galera cluster for well over a year. After user training and switching to SST via mariabackup we have had a very stable environment.
The night before last, one of the nodes fell out of the cluster. We could not find any explanation for that; the only thing we found was that an SST had started and was retried in a loop several times during the night. The other nodes (one MariaDB node, one garbd arbitrator) were happy and working well.
By morning the node had stopped attempting the SST. Not understanding what had gone wrong, we restarted it and looked around. We finally looked in the mariabackup.backup.log file and found this at the end:
{{
[00] 2020-06-08 15:54:44 >> log scanned up to (2778837446166)
[01] 2020-06-08 15:54:44 ...done
[00] 2020-06-08 15:54:44 Executing FLUSH NO_WRITE_TO_BINLOG TABLES...
[00] 2020-06-08 15:54:45 >> log scanned up to (2778837458765)
[00] 2020-06-08 15:54:46 >> log scanned up to (2778837501724)
[00] 2020-06-08 15:54:47 >> log scanned up to (2778837521685)
[00] 2020-06-08 15:54:47 Executing FLUSH TABLES WITH READ LOCK...
[00] 2020-06-08 15:54:48 >> log scanned up to (2778837561200)
[00] FATAL ERROR: 2020-06-08 15:54:48 failed to execute query FLUSH TABLES WITH READ LOCK: Lock wait timeout exceeded; try restarting transaction
}}
I experimented with running "FLUSH TABLES WITH READ LOCK" on the running node. The command either completed in 1-6 seconds, hung indefinitely, or failed at once with "Lock wait timeout exceeded; try restarting transaction".
As far as we could gather from the mariabackup documentation, mariabackup will attempt to acquire the lock and fail if it is not able to get it immediately. My colleague approached this from a different direction and made one change to our /etc/my.cnf.d/galera.cnf file, inserting "innodb_lock_wait_timeout=100", before I could propose adding {{--ftwrl-wait-timeout=#}} in the script.
After this the SST completed without issue and we have lived happily ever after.
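For reference, the change described above would look roughly like the excerpt below. This is a sketch, not a copy of our actual file: the [mysqld] group header is an assumption about where the line was placed, and only the setting and its value are taken from this report.

{{
# /etc/my.cnf.d/galera.cnf (excerpt)
# Assumption: the setting lives in the [mysqld] group.
[mysqld]
# Raised so that the lock request issued during SST has time to succeed
# instead of failing with "Lock wait timeout exceeded".
innodb_lock_wait_timeout=100
}}

We did not try the {{--ftwrl-wait-timeout}} alternative, so we cannot say whether it would have had the same effect.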
We think the possible need to wait for the lock should be mentioned in the Galera documentation. Also, is not waiting at all perhaps a bit harsh?