SSTs can take several hours in many cases, but the current default value of TimeoutStartSec causes systemd to time out the joiner node's startup in less than a couple of minutes. It might make sense to disable the systemd service's start timeout by default instead.
Depending on the systemd version, disabling this means setting either:
TimeoutStartSec=infinity (if systemd version >=229)
TimeoutStartSec=0 (if systemd version <=228)
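As a rough illustration of the version rule above, the following sketch picks the value and renders a drop-in snippet. The function names `disabled_start_timeout` and `render_dropin` are hypothetical and not part of MariaDB or systemd; only the `TimeoutStartSec=` setting and the 229 version boundary come from the description above.

```python
def disabled_start_timeout(systemd_version: int) -> str:
    """Return the TimeoutStartSec= value that disables the start timeout."""
    # Rule from the description: "infinity" for systemd >= 229, "0" for <= 228.
    return "infinity" if systemd_version >= 229 else "0"


def render_dropin(systemd_version: int) -> str:
    """Build the text of a [Service] drop-in that disables the start timeout."""
    value = disabled_start_timeout(systemd_version)
    return f"[Service]\nTimeoutStartSec={value}\n"
```

The rendered text could be placed in a drop-in file for the service unit, followed by a `systemctl daemon-reload`.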
The following documentation section describes the current behavior:
If we don't want to disable the systemd timeout by default, then it might make more sense to extend the timeout during SSTs by doing the following:
If a service of Type=notify sends "EXTEND_TIMEOUT_USEC=…", this may cause the start time to be extended beyond TimeoutStartSec=. The first receipt of this message must occur before TimeoutStartSec= is exceeded, and once the start time has extended beyond TimeoutStartSec=, the service manager will allow the service to continue to start, provided the service repeats "EXTEND_TIMEOUT_USEC=…" within the interval specified until the service startup status is finished by "READY=1". (see sd_notify(3)).
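To make the mechanism in the quoted passage concrete, here is a minimal sketch of the notification protocol in Python. It is not the MariaDB implementation. It assumes the service manager exposes a datagram socket through the `NOTIFY_SOCKET` environment variable, as sd_notify(3) describes, and the helper names `notify` and `extend_timeout` are hypothetical.

```python
import os
import socket


def notify(message: str) -> bool:
    """Send one sd_notify-style datagram; return False if no socket is set."""
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False  # not started by a service manager with notification
    # A leading "@" denotes a Linux abstract-namespace socket.
    if path.startswith("@"):
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode(), path)
    return True


def extend_timeout(seconds: int, status: str) -> bool:
    """Ask the service manager for `seconds` more startup time."""
    usec = seconds * 1_000_000  # the field is expressed in microseconds
    return notify(f"STATUS={status}\nEXTEND_TIMEOUT_USEC={usec}")
```

Under these assumptions, a joiner would call `extend_timeout(...)` repeatedly while the SST runs, each time before the previously requested interval elapses, and finish with `notify("READY=1")`. If any call arrives too late, the service manager is expected to fail the start, which would match the timeouts users report.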
It looks like this approach was attempted while fixing MDEV-15607, but users are still seeing timeouts during SSTs, so it may not be working properly. It looks like this is the relevant commit:
The relevant service_manager_extend_timeout function seems to be defined here:
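It appears to send the EXTEND_TIMEOUT_USEC notification mentioned in the systemd manual. Note that this is a message passed to the service manager via sd_notify rather than an environment variable.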