We have a three-node Galera 10.1.31 cluster running on RHEL 7.
We have about 10 databases hosted in that cluster - one of them is about 50GB.
We lost a node due to a mishap, which is another story. After cleaning up the lost node, we tried to restart MariaDB with: systemctl restart mariadb
We observed that /var/lib/mysql never grew larger than about 11GB and that the rsync processes never completed. Looking through journalctl, we found that mariadb.service was being restarted about every 90 seconds.
After some digging, I figured out that systemd has a default service start timeout of 90 seconds (at least on RHEL 7). Because mariadb.service remains in the 'activating' state while it syncs, and there was so much data to sync, the service kept hitting the timeout before the sync could finish.
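To confirm this kind of diagnosis on another node, something like the following sketch may help. The helper name check_start_timeout is hypothetical; it only wraps two standard commands, one to print the effective start timeout for a unit and one to search the last hour of the unit's journal for timeout messages.

```shell
# Hypothetical helper: show the effective start timeout for a unit and
# list recent journal lines that mention a timeout.
check_start_timeout() {
    unit="$1"   # e.g. mariadb.service

    # Effective start timeout as systemd sees it (defaults and overrides applied).
    systemctl show "$unit" -p TimeoutStartUSec

    # Journal entries for the unit from the last hour that mention a timeout.
    journalctl -u "$unit" --since "-1h" --no-pager | grep -i 'timed out'
}
```

For example: check_start_timeout mariadb.service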
The way I fixed this was to edit this file:
And add these lines below the [Service] line:
systemctl restart mariadb
After about 5 minutes, the node was fully synced and operational. I then removed these timeouts.
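The exact file and lines are not shown above, so the following is only a sketch of one way to apply a temporary timeout like the one described. It rests on several assumptions: it uses a systemd drop-in directory instead of editing the unit file directly, the helper name write_timeout_dropin and the file name timeout.conf are hypothetical, and the timeout value is whatever the caller passes in. TimeoutStartSec is the systemd setting that controls the start timeout.

```shell
# Hypothetical helper: write a drop-in that overrides the start timeout.
# The drop-in approach and the file name timeout.conf are assumptions,
# not necessarily what was done in the post above.
write_timeout_dropin() {
    dropin_dir="$1"   # e.g. /etc/systemd/system/mariadb.service.d
    timeout="$2"      # e.g. 30min (any systemd time span)

    mkdir -p "$dropin_dir" || return 1

    # A drop-in only needs the section header and the setting being overridden.
    printf '[Service]\nTimeoutStartSec=%s\n' "$timeout" > "$dropin_dir/timeout.conf"
}
```

As root, this could be used as: write_timeout_dropin /etc/systemd/system/mariadb.service.d 30min, followed by systemctl daemon-reload and systemctl restart mariadb. Deleting timeout.conf and running systemctl daemon-reload again restores the default.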
This raises a concern, though: a default installation of Galera should not time out during the initial sync of medium-sized databases.
I'm not sure what the best way to handle this is. I'm concerned about making the increased timeout a permanent part of the mariadb.service file for all systemd users, because that would have negative consequences if something were in fact wrong with the service.
Maybe systemd has other states that could be used for the synchronization phase that a new Galera node goes through? Something we can set a higher timeout for?