[MXS-4135] MaxScale fails to start when listeners bound to specific IP address Created: 2022-05-16 Updated: 2022-06-27 Resolved: 2022-06-27 |
|
| Status: | Closed |
| Project: | MariaDB MaxScale |
| Component/s: | Core |
| Affects Version/s: | 6.2.4 |
| Fix Version/s: | 22.08.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Assen Totin (Inactive) | Assignee: | markus makela |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
RHEL, possibly all others |
||
| Description |
|
MaxScale runs under systemd. Its unit file comes configured like this:

[Unit]

However, on RHEL at least (and possibly on Debian-like systems too), network.target does not guarantee that the network is actually available. At that point a bind to 0.0.0.0 may be permitted, but a bind to a specific IP address (such as 1.2.3.4) will fail, and MaxScale will fail with it. Full network availability is only reached once another unit, network-online.target, has completed.

Because systemd starts units asynchronously, network-online.target may already have completed by the time MaxScale reaches its bind to 1.2.3.4, in which case MaxScale starts properly, but this is not guaranteed. The result is intermittent MaxScale start-up failures with TCP port bind error messages.

Worse, in this case MaxScale fails rather than aborts (the exit codes differ), so given the "Restart" setting in its unit file, systemd will not restart it after "RestartSec" seconds. The result is a complete outage of MaxScale until it is restarted manually.

Solutions:
|
| Comments |
| Comment by markus makela [ 2022-06-07 ] | ||
|
The MariaDB server also seems to use After=network.target. Is this a problem with the server as well, or does it deal with the situation differently?

Using Restart=on-failure would cause MaxScale to restart repeatedly on configuration errors and on any other error that needs manual intervention. In my opinion this isn't a realistic option.

MaxScale does already define a special exit code with which it will be restarted:

We could use this to trigger a restart of the process, but it wouldn't help if the ports are in use by something other than MaxScale, and it would again result in a permanent restart loop. This could be refined so that a restart is triggered only if the networks do not exist.

An alternative would be to use IP_FREEBIND (as mentioned here) to allow binding to interfaces that do not exist yet. | ||
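The refinement described above (restart only when the address does not exist, not when the port is taken) could be sketched as follows. This is an illustration in Python, not MaxScale code: the function name and both exit code values are hypothetical placeholders, and the real restart exit code is whatever MaxScale and its unit file define. The sketch assumes that a missing address shows up as EADDRNOTAVAIL and a busy port as EADDRINUSE, which is the usual behaviour on Linux.

```python
import errno

# Hypothetical placeholder values, NOT MaxScale's real exit codes.
RESTARTABLE_EXIT_CODE = 75  # stands in for "systemd should restart us"
FATAL_EXIT_CODE = 1         # stands in for "needs manual intervention"


def exit_code_for_bind_error(err: OSError) -> int:
    """Map a failed bind() to an exit code (hypothetical helper).

    EADDRNOTAVAIL means the requested IP address is not configured on any
    interface yet, which is the case that can resolve itself once the
    network is fully up. Every other error, including EADDRINUSE (port
    held by another process), is treated as permanent to avoid a restart
    loop.
    """
    if err.errno == errno.EADDRNOTAVAIL:
        return RESTARTABLE_EXIT_CODE
    return FATAL_EXIT_CODE
```

With a mapping like this, a port conflict or a configuration error would still stop the service for good, while a not-yet-configured address would lead to a retry.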
| Comment by Assen Totin (Inactive) [ 2022-06-07 ] | ||
|
It's a matter of policy. I discussed this with RHEL support a few years ago and their answer was "we do it this way, fix your software to bind to non-existing IP addresses or create a systemd extension to include network-online.target", and I still do the latter whenever I install MariaDB somewhere. As for the Server, it binds to a specific IP much more rarely: it usually binds to 0.0.0.0, since a DB server would usually have only one network interface and one IP address. Since MaxScale is a back-to-back agent, it is much more natural for it to bind to only one of the (usually two) IP addresses on the machine (which would have front-facing and rear-facing interfaces). I believe we should ship software that "just works" and not pay too much attention to vendors' conventions. | ||
| Comment by markus makela [ 2022-06-23 ] | ||
|
I decided to add a second bind attempt, this time with IP_FREEBIND enabled. If the second bind succeeds, a warning is logged to explain that the network might not be up yet. This should help with the case where a typo causes the initial bind to fail. |
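The two-step bind described in this comment can be illustrated with a minimal Python sketch. It is not the MaxScale implementation: the function name, the warning text, and the choice to retry only on EADDRNOTAVAIL are assumptions made for the example. The numeric fallback 15 is the Linux value of IP_FREEBIND, used in case the Python build does not export the constant; the option is Linux-specific.

```python
import errno
import logging
import socket

# Linux value of IP_FREEBIND; fallback for Python builds that do not
# export socket.IP_FREEBIND. Linux-only.
IP_FREEBIND = getattr(socket, "IP_FREEBIND", 15)


def bind_listener(host, port):
    """Bind a TCP socket, retrying once with IP_FREEBIND (hypothetical helper).

    Returns (sock, used_freebind). The caller is expected to call
    sock.listen() afterwards.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            # First attempt: a normal bind.
            sock.bind((host, port))
            return sock, False
        except OSError as err:
            # Assumption of this sketch: only retry when the address is
            # not configured locally; other errors propagate unchanged.
            if err.errno != errno.EADDRNOTAVAIL:
                raise
        # Second attempt: allow binding to an address that does not
        # exist (yet) on this host.
        sock.setsockopt(socket.IPPROTO_IP, IP_FREEBIND, 1)
        sock.bind((host, port))
        logging.warning(
            "Bound to %s:%d with IP_FREEBIND; the address may be mistyped "
            "or the network may not be up yet", host, port)
        return sock, True
    except OSError:
        sock.close()
        raise
```

Logging a warning on the second path keeps a mistyped address visible in the logs, since with IP_FREEBIND the bind itself no longer fails for an address that will never exist.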