[MDEV-15606] Galera can't perform SST in 10.2.13 if systemd in use due to timeout at startup Created: 2018-03-20  Updated: 2018-12-06  Resolved: 2018-09-12

Status: Closed
Project: MariaDB Server
Component/s: Configuration
Affects Version/s: 10.1, 10.2.13, 10.2.14, 10.3.6
Fix Version/s: N/A

Type: Bug Priority: Critical
Reporter: Rick Pizzi Assignee: Jan Lindström (Inactive)
Resolution: Duplicate Votes: 4
Labels: None
Environment:

CentOS Linux release 7.4.1708 (Core)


Issue Links:
Relates
relates to MDEV-14705 systemd: EXTEND_TIMEOUT_USEC= to avoi... Closed
relates to MDEV-17571 Make systemd timeout behavior more co... Closed

 Description   

The second node can't join the first node because SST will get killed by systemd after the default timeout hits.

systemctl show mariadb.service | grep Timeout shows a startup timeout of 1m 30s, but an SST can last hours with a large dataset, slow disks, or a slow network.
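To read only the startup timeout rather than every Timeout property, the output of systemctl show can be filtered on the TimeoutStartUSec key. This is a minimal sketch, assuming the unit is named mariadb.service as in this report; the helper name start_timeout_of is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `systemctl show` output on stdin and print the
# value of TimeoutStartUSec (the startup timeout systemd enforces).
start_timeout_of() {
    awk -F= '$1 == "TimeoutStartUSec" { print $2 }'
}

# Usage on a real host (unit name assumed from this report):
#   systemctl show mariadb.service | start_timeout_of
```

The helper only parses text, so it can be tried against saved output from another machine.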

In fact, it is common for an SST to take several hours in production.

Setting TimeoutSec=0 in the [Service] section of the mariadb.service unit file under systemd works around the problem.
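The same setting can also be applied through a systemd drop-in override instead of editing the packaged unit file, which keeps the change across package upgrades. The sketch below is one possible way to do that, not the fix shipped by MariaDB; the helper name write_timeout_dropin and the file name timeout.conf are assumptions chosen for illustration.

```shell
#!/bin/sh
# Hypothetical helper: write a systemd drop-in that overrides the service
# timeout. Arguments:
#   $1  drop-in directory
#       (on a real host: /etc/systemd/system/mariadb.service.d)
#   $2  timeout value (defaults to 0, the value used in this report)
write_timeout_dropin() {
    dir=$1
    value=${2:-0}
    mkdir -p "$dir" || return 1
    # TimeoutSec= is only honored inside the [Service] section.
    # The file name timeout.conf is an arbitrary choice for this sketch.
    printf '[Service]\nTimeoutSec=%s\n' "$value" > "$dir/timeout.conf"
}

# Usage on a real host (run as root), then reload systemd:
#   write_timeout_dropin /etc/systemd/system/mariadb.service.d 0
#   systemctl daemon-reload
```

Taking the timeout value as an argument makes it easy to switch to another value if the systemd version in use does not treat 0 as "no timeout".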

Right now, it is impossible to deploy Galera Cluster under 10.2.13 and CentOS 7 unless the above workaround is in place.



 Comments   
Comment by Zdravelina Sokolovska (Inactive) [ 2018-03-20 ]

The same issue was observed with a data set of ~12G when the 3rd node was joining.
SST failed with wsrep_sst_method=mariabackup, and also when set to rsync.
joiner: => Rate:[ 39MiB/s] Avg:[32.9MiB/s] Elapsed:0:01:20
WSREP_SST: [ERROR] Removing /var/lib/mysql//.sst/xtrabackup_galera_info file due to signal (20180320 16:13:50.761)
WSREP_SST: [ERROR] Cleanup after exit with status:143 (20180320 16:13:50.765)
2018-03-20 16:13:50 140406339643136 [ERROR] WSREP: Process completed with error: wsrep_sst_mariabackup --role 'joiner' --address '192.168.104.193' --datadir '/var/lib/mysql/' --parent '13420' '' : 4 (Interrupted system call)

Comment by Aurélien LEQUOY [ 2018-03-20 ]

read this : https://jira.mariadb.org/browse/MDEV-15383?focusedCommentId=108624&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-108624

Comment by Aurélien LEQUOY [ 2018-03-20 ]

I am not sure you can perform an SST even without that, unless you kept your previous client version; this version, "libmariadbclient18 10.2.13", will break your IST and SST.

Comment by Aurélien LEQUOY [ 2018-03-22 ]

I can confirm this bug on Debian 9.4 as well: I ran an SST with a 1 TB node.

[....] Starting mysql (via systemctl): mysql.service
Job for mariadb.service failed because a timeout was exceeded.

I added

TimeoutSec=0
in /etc/systemd/system/mysqld.service

echo 'TimeoutSec=0' >> /etc/systemd/system/mysqld.service
systemctl daemon-reload

Comment by Rick Pizzi [ 2018-04-11 ]

Guys, this needs a fix. I was just bitten by this on a newly installed 10.2.14... please...

Comment by Alex Vorona [ 2018-04-12 ]

The same problem affects version 10.1.

Comment by Wayne Workman [ 2018-06-12 ]

These are the same:

Comment by Jan Lindström (Inactive) [ 2018-09-12 ]

MDEV-15607 should fix this issue.

Comment by brianr [ 2018-10-22 ]

This will not work any longer:

echo 'TimeoutSec=0' >> /etc/systemd/system/mysqld.service
systemctl daemon-reload

systemd apparently now silently ignores "TimeoutSec=0" and only reacts to "TimeoutSec=infinity".

DAHMIKT
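Because a timeout value that systemd does not honor fails silently, it may help to check the effective value after systemctl daemon-reload rather than trusting the unit file. The sketch below is one way to do that; the helper name start_timeout_disabled is hypothetical, and it assumes a disabled startup timeout is reported as "infinity" by systemctl show, which should be verified against the systemd version in use.

```shell
#!/bin/sh
# Hypothetical helper: read `systemctl show` output on stdin and succeed
# (exit status 0) only if the startup timeout is reported as disabled.
# Assumption: a disabled timeout is displayed as "infinity".
start_timeout_disabled() {
    awk -F= '
        $1 == "TimeoutStartUSec" { v = $2 }
        END { exit (v == "infinity" ? 0 : 1) }
    '
}

# Usage on a real host (unit name assumed from this report):
#   systemctl show mariadb.service | start_timeout_disabled \
#       || echo "startup timeout is still active" >&2
```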
