[MXS-4135] MaxScale fails to start when listeners bound to specific IP address Created: 2022-05-16  Updated: 2022-06-27  Resolved: 2022-06-27

Status: Closed
Project: MariaDB MaxScale
Component/s: Core
Affects Version/s: 6.2.4
Fix Version/s: 22.08.0

Type: Bug Priority: Major
Reporter: Assen Totin (Inactive) Assignee: markus makela
Resolution: Fixed Votes: 0
Labels: None
Environment: RHEL, possibly all others

 Description   

MaxScale runs under systemd. Its unit file comes configured like this:

[Unit]
After=network.target
[Service]
Restart=on-abort

However, on RHEL at least (and possibly on Debian-like systems too), network.target does not guarantee that the network is actually available. A bind to 0.0.0.0 may succeed at this point, but a bind to a specific IP address (like 1.2.3.4) will fail, and so will MaxScale. The network is only fully available once another unit, network-online.target, has been reached.

Because systemd starts units asynchronously, network-online.target may already have been reached by the time MaxScale binds to 1.2.3.4, in which case MaxScale starts properly, but this is not guaranteed. As a result, one sees intermittent start-up failures of MaxScale with TCP port bind error messages. Worse, in this case MaxScale exits with a failure status rather than aborting (the exit codes differ), so with the "Restart" setting of "on-abort" in its unit file systemd does not restart it after "RestartSec" seconds. The result is a complete outage of MaxScale until it is restarted manually.

Solutions:

  • Set network-online.target instead of network.target in the unit file. This will delay the start of MaxScale by a few seconds during boot, but will completely prevent the mentioned failures.
  • Or, change the "Restart" setting from "on-abort" to "on-failure", which will make systemd restart MaxScale after "RestartSec" seconds (30 sec by default on RHEL) - by which time the network should be fully up.
  • Or, change the exit code of MaxScale when a TCP port bind fails from failure to abort.


 Comments   
Comment by markus makela [ 2022-06-07 ]

The MariaDB server seems to also use After=network.target. Is this a problem with the server as well or does it deal with the situation differently?

Using Restart=on-failure causes MaxScale to restart repeatedly on configuration errors as well as on any other errors that need manual intervention. This isn't really a realistic option in my opinion.

MaxScale does already define a special exit code with which it will be restarted:

# MaxScale should be restarted if it exits with 75 (BSD's EX_TEMPFAIL)
RestartForceExitStatus=75

We could use this to trigger a restart of the process but it wouldn't help if the ports are used by something other than MaxScale and would again result in a permanent restart loop. This could be refined so that only if the networks do not exist, a restart is triggered.
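
The refinement described above could be sketched roughly as follows. This is only an illustration, not MaxScale's actual code: the function name exit_status_for_bind_error is hypothetical, and treating EADDRNOTAVAIL as the sign that "the network does not exist yet" is an assumption. A mistyped listener address produces the same errno, so this alone would not rule out a restart loop in that case.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdlib>      // EXIT_FAILURE
#include <sysexits.h>   // EX_TEMPFAIL

// The unit file's RestartForceExitStatus=75 relies on this value.
static_assert(EX_TEMPFAIL == 75, "EX_TEMPFAIL is expected to be 75");

// Hypothetical helper (not part of MaxScale): pick the process exit status
// after a listener bind() has failed with the given errno value.
int exit_status_for_bind_error(int bind_errno)
{
    assert(bind_errno != 0);    // only meaningful after bind() has failed

    switch (bind_errno)
    {
    case EADDRNOTAVAIL:
        // Assumption: the address is not configured on any interface yet,
        // so ask systemd to restart the process via the forced exit status.
        return EX_TEMPFAIL;

    default:
        // EADDRINUSE, EACCES and similar need manual intervention;
        // a plain failure is not restarted under Restart=on-abort.
        return EXIT_FAILURE;
    }
}
```

With this split, a port already held by another process (EADDRINUSE) still ends in a plain failure and does not trigger the permanent restart loop mentioned above.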

An alternative would be to use IP_FREEBIND (as mentioned here) to allow binding to interfaces that do not exist yet.

Comment by Assen Totin (Inactive) [ 2022-06-07 ]

It's a matter of policy. I discussed this with RHEL support a few years ago and their answer was "we do it this way, fix your software to bind to non-existing IP addresses or create a systemd extension to include network-online.target" - and I still do the latter whenever I install MariaDB somewhere.

As for the Server, it is much rarer for it to bind to a specific IP - it usually binds to 0.0.0.0, as the DB server would usually have only one network interface and one IP address.

Since MaxScale is a back-to-back agent, it is much more natural for it to bind to only one of the (usually two) IP addresses on the machine (which would have a front-facing and a rear-facing interface).

I believe we should ship software that "just works" and not pay too much attention to vendors' conventions.

Comment by markus makela [ 2022-06-23 ]

I decided to add a second bind attempt but with IP_FREEBIND enabled. If the second bind succeeds, a warning is logged to explain that the network might not be up yet. This should help with the case where a typo causes the initial bind to fail.
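
A minimal sketch of the two-step bind described above, assuming Linux and an IPv4 TCP listener. It is not the code that went into MaxScale: the names BindResult and bind_with_freebind_fallback are hypothetical, the warning text is made up, and retrying unconditionally after any bind failure is a simplification.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>   // IP_FREEBIND (Linux-specific)
#include <sys/socket.h>
#include <unistd.h>
#include <cassert>
#include <cstdint>
#include <cstdio>

// Hypothetical result type for this sketch.
struct BindResult
{
    int  fd;             // bound socket, or -1 on failure
    bool used_freebind;  // true if only the IP_FREEBIND attempt succeeded
};

// Try a normal bind first; if it fails, retry once with IP_FREEBIND set.
BindResult bind_with_freebind_fallback(const char* address, uint16_t port)
{
    assert(address != nullptr);

    sockaddr_in addr {};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);

    if (inet_pton(AF_INET, address, &addr.sin_addr) != 1)
    {
        return {-1, false};     // not a valid IPv4 address string
    }

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
    {
        return {-1, false};
    }

    // First attempt: ordinary bind, succeeds if the address is local.
    if (bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) == 0)
    {
        return {fd, false};
    }

    // Second attempt: allow binding to an address that is not (yet) local.
    int one = 1;
    if (setsockopt(fd, IPPROTO_IP, IP_FREEBIND, &one, sizeof(one)) == 0
        && bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) == 0)
    {
        std::fprintf(stderr,
                     "warning: bound to %s only with IP_FREEBIND; "
                     "the network might not be up yet\n", address);
        return {fd, true};
    }

    close(fd);
    return {-1, false};
}
```

A caller would go on to call listen() on the returned descriptor; a socket bound this way simply receives no connections until the address is actually configured on an interface. The warning matters because a mistyped address also succeeds on the second attempt.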
