Note: the most recent affected Galera version is 1.0.4/26.4.4, but the same applies to earlier versions, at least 26.4.3.
We have a cluster of 3 nodes working with a large (above 1 TB) data volume. All nodes have the same hardware and software. The nodes sometimes run SST and IST to transfer data.
We noticed that SST frequently fails and has to be restarted due to the error:
We tried increasing the Galera gcache size, but in some cases the error occurred again and in others it did not. Moreover, simply restarting mysqld on the receiving node (and thus restarting SST once the donor node had returned to the synced state) sometimes led to a successful SST, with the joiner managing to join the cluster, and sometimes failed.
- The gcache size and the number of transactions running on the cluster nodes have no effect on the issue.
- Disabling or enabling compression of the state transfer data, as well as attempts to flush logs, had no effect either.
- Whenever IST failed, the same error message was always logged 20 (+/- 1) minutes after mysqld was started on the joining node (that is, 20 minutes after the state transfer request). This error was:
Whenever these last two lines (error + note) appeared in the mysqld log file, the state transfer failed with the following errors logged:
So, the open questions are:
- How can such situations be avoided? Nodes require a manual restart after failed transfers!
- Why is this asio error always logged 20 minutes after the state transfer starts?
- The reported failure is 'Connection timed out', although the connection is stable and no service or monitoring tool reports connection issues.
- The issue is intermittent: it appears on some restarts and not on others. This was also the case with the previous version of the Galera library. No configuration change seems to cause or solve it.
- It was also noted that the asio library used by Galera is version 1.10.8 and this version cannot be changed, although version 1.18 has already been released.
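
For anyone who wants to check the 20-minute interval on their own joiner, here is a minimal Python sketch that measures the time between two marker lines in a mysqld log. It is only an illustration under assumptions: it assumes every relevant log line starts with a `YYYY-MM-DD HH:MM:SS` timestamp, the function names `parse_timestamp` and `elapsed_between` are made up for this example, and the sample lines are hypothetical placeholders, not real Galera output. Replace the marker strings with fragments of the messages that actually appear in your log.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional

# Assumption: each relevant log line begins with a timestamp in this layout.
TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_timestamp(line: str) -> Optional[datetime]:
    """Return the leading timestamp of a log line, or None if it has none."""
    try:
        return datetime.strptime(line[:19], TS_FORMAT)
    except ValueError:
        return None


def elapsed_between(
    lines: Iterable[str], start_marker: str, error_marker: str
) -> Optional[timedelta]:
    """Time from the first line containing start_marker to the first later
    line containing error_marker; None if either marker is never found."""
    start = None
    for line in lines:
        ts = parse_timestamp(line)
        if ts is None:
            continue  # skip lines without a timestamp (e.g. continuation lines)
        if start is None:
            if start_marker in line:
                start = ts
        elif error_marker in line:
            return ts - start
    return None


# Hypothetical sample lines, NOT real Galera log messages.
sample = [
    "2020-05-20 10:00:00 0 [Note] HYPOTHETICAL state transfer requested",
    "2020-05-20 10:05:00 0 [Note] HYPOTHETICAL unrelated message",
    "2020-05-20 10:20:03 0 [ERROR] HYPOTHETICAL asio error",
]

delta = elapsed_between(sample, "state transfer requested", "asio error")
print(delta)
```

Running this over the logs of several failed transfers would show whether the gap is a fixed value (which would point to some timeout) or only roughly 20 minutes.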