[MDEV-21231] notify systemd of long running SST to avoid timeout Created: 2019-12-05  Updated: 2023-12-05

Status: Stalled
Project: MariaDB Server
Component/s: Galera SST
Affects Version/s: 10.4, 10.5
Fix Version/s: 10.4, 10.5

Type: Bug
Priority: Major
Reporter: Axel Schwenke
Assignee: Seppo Jaakola
Resolution: Unresolved
Votes: 4
Labels: None

Issue Links:
Relates
relates to MDEV-15607 mysqld crashed few after node is bein... Closed
relates to MDEV-17571 Make systemd timeout behavior more co... Closed

 Description   

Systemd uses a timeout counter to catch misbehaving services. If a service hasn't notified systemd that it reached the "READY" state within that timeout, systemd will assume something went wrong and kill the service. The timeout is 90 seconds (systemd default), but can be set in the unit file (900 seconds in recent MariaDB Server releases) or in a user configuration file (as discussed in this knowledge base article).

When a Galera node recovers or simply joins a running cluster, the SST can take much longer than 90 (or 900) seconds, so systemd diagnoses a service start timeout and kills the service. From the user's point of view, the Galera node is simply unable to start.

MDEV-15607 implemented a solution by notifying systemd with EXTEND_TIMEOUT_USEC messages that the service startup is still ongoing and that systemd should just continue waiting. Early in MariaDB 10.4 development, MariaDB switched to Galera 4. When that happened, the work from MDEV-15607 was removed from the code base.
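For illustration, the notification mechanism described above can be sketched in a few lines. This is not the code from MDEV-15607; it is a minimal sketch of the systemd notify protocol, assuming the service runs with Type=notify so that systemd exports the NOTIFY_SOCKET environment variable. The function names `sd_notify` and `extend_start_timeout` are hypothetical helpers chosen for this example.

```python
import os
import socket


def sd_notify(message: str) -> bool:
    """Send one datagram to systemd's notify socket.

    Returns False when NOTIFY_SOCKET is unset, i.e. the process was
    not started by systemd as a notify-type service.
    """
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    # A leading "@" denotes a Linux abstract socket, which is addressed
    # with a leading NUL byte instead of a filesystem path.
    if path.startswith("@"):
        address = b"\0" + path[1:].encode("utf-8")
    else:
        address = path
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode("utf-8"), address)
    return True


def extend_start_timeout(seconds: int) -> bool:
    """Hypothetical helper: ask systemd to keep waiting for `seconds` more."""
    # systemd expects the value in microseconds.
    return sd_notify(f"EXTEND_TIMEOUT_USEC={seconds * 1_000_000}")
```

A single call such as `extend_start_timeout(120)` only covers the next 120 seconds; the service has to send another message within that window or report READY=1.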

This issue is about re-adding the logic from MDEV-15607 to the SST code in MariaDB Server 10.4 and up. If the SST takes longer than 90 seconds, systemd must be notified that the service start is delayed. Such messages must be sent continuously until the SST has finished, and each one should be sent with some safety margin before the 90-second timeout expires.
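The requested behaviour (repeat the message until the SST is done, and resend with a safety margin) could look roughly like the loop below. It is a sketch under stated assumptions, not the proposed patch: `keep_start_alive`, `sst_finished`, `notify`, `extend_by` and `safety_factor` are hypothetical names, and the 60-second extension with a 50% margin is an arbitrary example value. The `notify` callable is expected to deliver the string to systemd's notify socket.

```python
import time
from typing import Callable


def keep_start_alive(
    sst_finished: Callable[[], bool],
    notify: Callable[[str], object],
    extend_by: float = 60.0,
    safety_factor: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send EXTEND_TIMEOUT_USEC until sst_finished() returns True.

    Each message asks for `extend_by` more seconds; the next message is
    sent after only `extend_by * safety_factor` seconds, so an extension
    is always renewed before it can expire. Returns the number of
    messages sent.
    """
    if not 0 < safety_factor < 1:
        raise ValueError("safety_factor must be between 0 and 1")
    interval = extend_by * safety_factor
    message = f"EXTEND_TIMEOUT_USEC={int(extend_by * 1_000_000)}"
    sent = 0
    while not sst_finished():
        notify(message)
        sent += 1
        sleep(interval)
    return sent
```

Passing `sleep` and `notify` as parameters keeps the loop testable without a running systemd; in a real service the loop would run alongside the SST and stop as soon as the transfer completes.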



 Comments   
Comment by Julien Fritsch [ 2023-12-05 ]

Automated message:
----------------------------
Since this issue has not been updated for 6 weeks, it's time to move it back to Stalled.
