Details
- Bug
- Status: Stalled
- Major
- Resolution: Unresolved
- 10.5, 10.4(EOL)
- None
Description
systemd uses a start timeout to catch misbehaving services. If a service has not notified systemd that it reached the "READY" state within that timeout, systemd assumes something went wrong and kills the service. The timeout is 90 seconds (systemd default), but it can be set in the unit file (900 seconds in recent MariaDB Server releases) or in a user configuration file (as discussed in this knowledge base article).
When a Galera node recovers or simply joins a running cluster, the SST (state snapshot transfer) can take much longer than 90 (or 900) seconds, so systemd diagnoses a service start timeout. From the user's point of view, the Galera node is simply unable to start.
MDEV-15607 implemented a solution by notifying systemd with EXTEND_TIMEOUT_USEC messages that the service startup is still ongoing and that systemd should just continue waiting. Early in MariaDB 10.4 development, MariaDB switched to Galera 4. When that happened, the work from MDEV-15607 was removed from the code base.
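For illustration only, here is a minimal Python sketch of what such a notification looks like on the wire. It is not the MariaDB implementation (the server is written in C++ and would normally call `sd_notify()` from libsystemd). The helper names `build_extend_timeout_message` and `sd_notify` are hypothetical. The sketch assumes the service runs with `Type=notify`, so systemd exports the `NOTIFY_SOCKET` environment variable, and that the installed systemd is recent enough to understand `EXTEND_TIMEOUT_USEC`.

```python
import os
import socket


def build_extend_timeout_message(extra_seconds: float) -> bytes:
    """Build the payload asking systemd to wait extra_seconds longer."""
    usec = int(extra_seconds * 1_000_000)  # systemd expects microseconds
    return f"EXTEND_TIMEOUT_USEC={usec}".encode("ascii")


def sd_notify(message: bytes) -> bool:
    """Send message to the socket named by NOTIFY_SOCKET.

    Returns False when NOTIFY_SOCKET is unset, i.e. the process was not
    started by systemd as a notify-type service.
    """
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        # A leading "@" denotes a Linux abstract-namespace socket.
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message, path)
    return True
```

Calling `sd_notify(build_extend_timeout_message(120))` would ask systemd to allow another 120 seconds, counted from the moment the message is received, before the next notification is due.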
This issue is about re-adding the logic from MDEV-15607 to the SST code in MariaDB Server 10.4 and up. If the SST takes longer than 90 seconds, systemd must be notified that the service start is delayed. Such messages must be sent repeatedly until the SST has finished, and each one should be sent with some safety margin before the 90-second timeout expires.
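The requested behaviour (repeat the notification until the SST finishes, leaving a safety margin) could be sketched roughly as follows. This is a hypothetical Python illustration, not the server code. `keep_start_alive`, `sst_done`, `notify`, and the default values of 90 and 30 seconds are all assumptions chosen for the example. `notify` stands for whatever function actually delivers the message to systemd.

```python
import threading
from typing import Callable


def keep_start_alive(
    sst_done: threading.Event,
    notify: Callable[[bytes], object],
    extend_seconds: float = 90.0,
    safety_margin: float = 30.0,
) -> int:
    """Send EXTEND_TIMEOUT_USEC until sst_done is set; return messages sent."""
    if not 0 <= safety_margin < extend_seconds:
        raise ValueError("safety_margin must be in [0, extend_seconds)")
    # Re-notify before the previously requested extension runs out.
    interval = extend_seconds - safety_margin
    usec = int(extend_seconds * 1_000_000)
    payload = f"EXTEND_TIMEOUT_USEC={usec}".encode("ascii")
    sent = 0
    while not sst_done.is_set():
        notify(payload)
        sent += 1
        # Sleep for the interval, but wake up early once the SST completes.
        sst_done.wait(interval)
    return sent
```

With the assumed defaults, each message asks for 90 more seconds and the next one goes out after 60 seconds, so a single delayed iteration does not immediately trip the timeout. The loop would typically run in a helper thread while the SST proceeds, with the SST code setting `sst_done` when it completes.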
Issue Links
relates to
- MDEV-15607 mysqld crashed few after node is being joined with sst (Closed)
- MDEV-17571 Make systemd timeout behavior more compatible with long Galera SSTs (Closed)