[MDEV-16425] New node in Galera can't fully sync - systemd timeout Created: 2018-06-07  Updated: 2018-09-12  Resolved: 2018-09-12

Status: Closed
Project: MariaDB Server
Component/s: Galera, Galera SST
Affects Version/s: 10.1.31
Fix Version/s: N/A

Type: Bug
Priority: Minor
Reporter: Wayne Workman
Assignee: Jan Lindström (Inactive)
Resolution: Duplicate
Votes: 0
Labels: None
Environment: RHEL 7



 Description   

We have a three-node Galera cluster on MariaDB 10.1.31. The nodes are running RHEL 7.

We have about 10 databases hosted in that cluster - one of them is about 50GB.

We lost a node due to a mishap, which is another story. We cleaned up the lost node and tried to restart MariaDB with: systemctl restart mariadb
We observed that /var/lib/mysql never grew beyond about 11GB and that the rsync processes never completed. Looking further into journalctl, we found that mariadb.service was being restarted about every 90 seconds.

After some digging, I figured out that systemd has a default service start timeout of 90 seconds (at least on RHEL 7). Because mariadb.service remains in the 'Activating' state while syncing, and there was so much data to sync, the service would hit the timeout before the sync could finish.

The way I fixed this was to edit this file:
/usr/lib/systemd/system/mariadb.service

And add these lines below the [Service] line:
RestartSec=86400
TimeoutSec=86400

Then ran:
systemctl daemon-reload
systemctl restart mariadb

After about 5 minutes, the node was fully sync'd and operational - I then removed these timeouts.
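
The same timeout change can probably be made without editing the packaged unit file, by using a systemd drop-in override. This is a minimal sketch, not something I tested on the cluster: it assumes the standard drop-in directory /etc/systemd/system/mariadb.service.d, and the function name and the sst-timeout.conf file name are made up for illustration. It sets only TimeoutStartSec (the start timeout), not the TimeoutSec and RestartSec values I used above.

```shell
#!/bin/sh
# Sketch: write a systemd drop-in that raises the start timeout for a unit.
# write_timeout_dropin and sst-timeout.conf are hypothetical names.
write_timeout_dropin() {
    dropin_dir="$1"   # e.g. /etc/systemd/system/mariadb.service.d
    timeout="$2"      # start timeout in seconds, e.g. 86400

    # Create the drop-in directory if it does not exist yet.
    mkdir -p "$dropin_dir" || return 1

    # The [Service] header is required so systemd merges the setting
    # into the [Service] section of the original unit.
    printf '[Service]\nTimeoutStartSec=%s\n' "$timeout" \
        > "$dropin_dir/sst-timeout.conf"
}

# Usage (as root):
#   write_timeout_dropin /etc/systemd/system/mariadb.service.d 86400
#   systemctl daemon-reload
#   systemctl restart mariadb
```

Removing the sst-timeout.conf file and running systemctl daemon-reload again should restore the default timeout, and the override is not overwritten by a package update the way an edited file under /usr/lib/systemd/system can be.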

This raises a concern though - a default installation of Galera should not time out during the initial sync of medium-sized databases.

I'm not sure what the best way to handle this is - I'm concerned about making the increased timeout a permanent part of the mariadb.service file for all systemd users, because this would have negative outcomes if there were in fact some kind of funk going on with the service.

Maybe systemd has other states that could be used for the synchronization phase that a new Galera node goes through? Something we can set a higher timeout for?

Thanks,
Wayne



 Comments   
Comment by Wayne Workman [ 2018-06-12 ]

These are the same:

Comment by Jan Lindström (Inactive) [ 2018-09-12 ]

MDEV-15607 should fix this issue.
