Details
- Bug
- Status: Closed
- Major
- Resolution: Fixed
- 6.2.4
- None
- RHEL, possibly all other
Description
MaxScale runs under systemd. Its unit file comes configured like this:
[Unit]
After = network.service
[Service]
Restart = on-abort
However, on RHEL at least (and possibly on Debian-like systems too), network.service does not guarantee that the network is actually available. At that point a bind to 0.0.0.0 may succeed, but a bind to a specific IP address (like 1.2.3.4) will fail, and so will MaxScale. Full availability of the network is only guaranteed once another unit, network-online.target, has been reached.
Because systemd starts units asynchronously, network-online.target may already have been reached by the time MaxScale tries to bind to 1.2.3.4, in which case MaxScale starts up properly, but this is not guaranteed. As a result, one sees intermittent start-up failures of MaxScale with TCP port bind error messages. Worse, in these cases MaxScale exits with a failure status rather than aborting (the two produce different exit codes). Because of the "on-abort" restart policy in its unit file, which only covers termination by an uncaught signal, systemd will not restart it after "RestartSec" seconds, and the result is a complete outage of MaxScale until a manual restart.
Solutions:
- Set network-online.target instead of network.service in the unit file. This will delay the start of MaxScale by a few seconds during boot, but will completely prevent the mentioned failures.
- Or, change the "Restart" setting from "on-abort" to "on-failure", which will make systemd restart MaxScale after "RestartSec" seconds (30 sec by default on RHEL), by which time the network should be fully up.
- Or, change the exit code of MaxScale when a TCP port bind fails from failure to abort.
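The first two options can be tried without editing the packaged unit file, for example through a systemd drop-in. The fragment below is only a sketch: the drop-in path assumes the unit is installed as maxscale.service, the file name override.conf is arbitrary, and the RestartSec value simply repeats the 30 seconds mentioned above. The Wants= line is an addition not mentioned in this report: After= on its own only orders units and does not pull network-online.target into the boot transaction.

```ini
# /etc/systemd/system/maxscale.service.d/override.conf
# (path assumes the unit is named maxscale.service; file name is arbitrary)

[Unit]
# Start only after the network is reported as online.
# After= only orders the units; Wants= is what actually activates the target.
After=network-online.target
Wants=network-online.target

[Service]
# Restart on any unclean exit (non-zero exit code, signal, timeout),
# not only on termination by an uncaught signal as with on-abort.
Restart=on-failure
# Assumed value, taken from the 30 seconds mentioned in this report.
RestartSec=30
```

After adding the file, run `systemctl daemon-reload` so systemd picks it up. Whether network-online.target really waits for addresses to be configured depends on the distribution's wait-online service (for example NetworkManager-wait-online.service) being enabled, so keeping Restart=on-failure as a fallback seems prudent.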