[MDEV-22845] wsrep_sst_mariabackup fails with locking timeout at the end

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 10.3.14
Fix Version/s: 10.3(EOL)
Component/s: Galera SST, mariabackup
Labels:
- galera
- innodb
- lock
Environment:
CentOS 7.6.1810, packages from http://yum.mariadb.org/10.3/centos. The server has ample CPU (load is always low) and SSD disks.

Description

We have been running a Galera cluster for well over a year. After user training and switching to SST via mariabackup, we have had a very stable environment.

Overnight yesterday one of the nodes fell out of the cluster. We could not find any explanation for that; the only thing we found was that an SST had started and retried in a loop several times during the night. The other nodes (one MariaDB, one garbd) were happy and working well.

In the morning it had stopped trying to SST. Not understanding what had gone wrong, we restarted it and looked around. We finally looked in the mariabackup.backup.log file and found this at the end:

{{
[00] 2020-06-08 15:54:44 >> log scanned up to (2778837446166)
[01] 2020-06-08 15:54:44 ...done
[00] 2020-06-08 15:54:44 Executing FLUSH NO_WRITE_TO_BINLOG TABLES...
[00] 2020-06-08 15:54:45 >> log scanned up to (2778837458765)
[00] 2020-06-08 15:54:46 >> log scanned up to (2778837501724)
[00] 2020-06-08 15:54:47 >> log scanned up to (2778837521685)
[00] 2020-06-08 15:54:47 Executing FLUSH TABLES WITH READ LOCK...
[00] 2020-06-08 15:54:48 >> log scanned up to (2778837561200)
[00] FATAL ERROR: 2020-06-08 15:54:48 failed to execute query FLUSH TABLES WITH READ LOCK: Lock wait timeout exceeded; try restarting transaction
}}

I experimented with running "FLUSH TABLES WITH READ LOCK" on the running node. The command either completed in 1-6 seconds, hung indefinitely, or failed at once with "Lock wait timeout exceeded; try restarting transaction".

As far as we could gather from the mariabackup documentation, mariabackup will attempt to acquire the lock and fail if it is not immediately able to get it. My colleague approached this from a different direction and made one change to our /etc/my.cnf.d/galera.cnf file, inserting "innodb_lock_wait_timeout=100", before I could propose adding "--ftwrl-wait-timeout=#" in the script.

After this the SST completed without issue and we have lived happily ever after.
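
For reference, the change described above would look roughly like the fragment below. This is a sketch under stated assumptions, not a copy of the actual file: the [mysqld] section name is assumed (the report only names the file and the setting). The commented-out alternative assumes that the wsrep_sst_mariabackup script passes the inno-backup-opts value from the [sst] section on to mariabackup; the value 100 there is purely illustrative, and this variant was not tested in this report.

{{
# /etc/my.cnf.d/galera.cnf
[mysqld]
# The change that was actually applied (section name assumed).
innodb_lock_wait_timeout=100

# Untested alternative: forward a mariabackup option through the SST script.
# Assumes [sst] inno-backup-opts is honoured; the timeout value is hypothetical.
#[sst]
#inno-backup-opts="--ftwrl-wait-timeout=100"
}}

Only the first setting is known to have resolved the failure here; the second is the option that was considered but never tried.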

We think that the possible need to wait for the lock should perhaps be part of the Galera documentation. Also, not waiting at all is maybe a bit harsh?

Activity

There are no comments yet on this issue.

People

Assignee: Vladislav Lesin

Reporter: Nicolai Langfeldt

Votes: 1

Watchers: 2

Dates

Created: 2020-06-09 13:26

Updated: 2020-07-08 12:33

MariaDB Server