[MDEV-25233] Review shutdown patterns in systemd service units

Details

Type: Task
Status: Open
Priority: Major
Resolution: Unresolved
Fix Version/s: None
Component/s: Packaging
Labels:
- foundation
- systemd

Description

When reviewing the units in the pull request for systemd socket activation, I suggested configuring the services using KillMode=mixed to gain more control over shutdown for anything that has a subprocess.

The units I reviewed use the default, KillMode=control-group, which sends SIGTERM to all processes in the control group simultaneously. Unless the daemon and its children are designed to have child processes stopped this way, this is often disruptive to orderly shutdown. For example, a child might shut down and disappear before its parent attempts to send it a signal, leaving the parent signaling a now-missing PID.

So, I usually encourage projects to consider KillMode=mixed if there's any chance of having child processes and the behavior of KillMode=control-group isn't explicitly desired.
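The missing-PID hazard described above can be reproduced in isolation. The following is only a minimal Python sketch, not anything taken from the MariaDB units: the helper process and the function name `signal_after_exit` are hypothetical, and it assumes the child has already exited and been reaped by the time the parent tries to signal it.

```python
import os
import signal
import subprocess
import sys


def signal_after_exit() -> bool:
    """Return True if signalling an already-reaped child fails with ESRCH."""
    # Hypothetical stand-in for a helper process that exits on its own,
    # for example because something else already told it to shut down.
    child = subprocess.Popen([sys.executable, "-c", "pass"])
    pid = child.pid
    child.wait()  # the child has exited and been reaped; its PID is gone

    try:
        # The parent still believes it has to stop the child itself.
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        return True  # ESRCH: no such process
    return False


if __name__ == "__main__":
    print("signal failed with ESRCH:", signal_after_exit())
```

A parent that does not expect this error path may treat it as a failed shutdown, which is the kind of disruption that sending SIGTERM to the whole control group at once can trigger.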

I've seen some remarks on using SendSIGKILL=no, which I would discourage. If you need an unbounded shutdown time, I would instead use TimeoutStopSec=infinity, though that is still undesirable. I don't see much of a use case for SendSIGKILL=no these days, as it's better to control the conditions that trigger forcible cleanup than to disable the mechanism for it.

Issue Links

relates to

MDEV-5536 Support systemd socket activation

Closed

Activity

Daniel Black added a comment - 2021-03-23 23:40

Notes so far:

KillMode=mixed

I'm almost OK with changing to KillMode=mixed; I just need to look at the details below to really justify it.

Seems compatible with f9179b36d313ef50240407fcb2737ac3a0aa3b9e and `SendSIGKILL=no` (https://github.com/systemd/systemd/commit/5bcffb4b549c0d115d8e40137ea885b7568ec6cb).

TODO, see how killing mariadbd would propagate to the shutdown of:

- Running galera SST scripts
- PAM helper
- minor: galera notify script (though it shouldn't do anything long running)
- Anything else I missed?

TODO: with the existing KillMode=control-group, how is the termination of the above scripts handled?

SendSIGKILL=no

The case behind this was mainly a service that was slow to start up due to recovery. With SendSIGKILL=no, systemd can consider the service terminated while the process continues to roll back from the undo log.

There is also the case of a service shutdown that is just a bit slow to do all the necessary cleanups. Having it linger a little longer to continue seems like a reasonable constraint.

There are obviously consequences to letting the mariadbd process continue after the service has been shut down. One is that a restart of the service won't see the existing process still running and would hit the mariadbd mechanisms in Aria/InnoDB that lock files for exclusive access to prevent duplicate processes. https://github.com/systemd/systemd/commit/5bcffb4b549c0d115d8e40137ea885b7568ec6cb was written to ensure that protection was applied a little earlier (and obviously only applies to systemd v242 and later).

So it's largely here because the cleanup from a hard kill is quite expensive.

MariaDB does use the timeout-extension feature of systemd's Type=notify API (EXTEND_TIMEOUT_USEC=) to try to avoid getting into this state (MDEV-14705).

MDEV-17571, https://github.com/MariaDB/server/commit/d78f02d73d5b2f962c0ea6a1198e932c7355adc2#diff-8c0a9bb1f023e03364e3310d3a385ac726a3274d634eba036e64bcf4984555c4, changed this to 15 minutes.

Both the extend timeout and SendSIGKILL=no were trying to avoid Timeout(Start,Stop)Sec=infinity and avoid areas of cleanup code (that can take considerable time).

So the competing requirements, as I see it, that have resulted in this balance are:

- User experience: make the service interoperate and behave within the systemd framework as expected, and try not to impose excessive constraints like an infinite timeout.
- Avoid the time-consuming cleanup code that runs after a process has been hard terminated (this code is still being improved, though the focus is more on productive running code), and hence improve service startup/recovery time.

So these are the tradeoffs made. I'm happy to have reflections on problems this has/might cause or better ways of balancing this.

Daniel Black added a comment - 2021-04-13 03:44

Any comments on these tradeoffs?

David Strauss added a comment - 2021-05-24 13:53 - edited

Hi Daniel. Unfortunately, notifications from this Jira instance don't seem to be reaching the top of my inbox. I saw activity on GitHub, though, and noticed your reply here.

tl;dr: If MariaDB is using Type=notify (and I think it is, IIRC) then you should probably set SendSIGKILL=yes and leverage EXTEND_TIMEOUT_USEC= to avoid timeouts when performing a long but orderly shutdown. KillMode=mixed remains my recommendation unless KillMode=control-group is known to be correct for all processes.

My Detailed Thoughts

Regarding KillMode=mixed vs. KillMode=control-group, I tend to feel that, unless the daemon's process set is designed for KillMode=control-group, "mixed" is more appropriate because it gives the main PID time to shut down children before systemd blankets all children in shutdown signals. Another way to look at this is to ask what happens when there's a mismatch between daemon expectations and the systemd unit configuration; I'll cover each scenario and the cost of being wrong.

If KillMode=control-group, but that's wrong for any process...

If the daemon expects to reap its own children, then "control-group" mode will cause broken behavior because systemd will send signals to the main and child processes at once. This can break internal dependency expectations around signal propagation. That is, if a child process expects the shutdown signal to come from the main PID once the main PID is ready for children to shut down, then systemd signaling the main and child processes at the same time will cause premature shutdown in child processes, possibly confusing the main process. Daemons like MariaDB that are portable across systems tend to expect to reap their own children because KillMode=control-group is a systemd-specific thing AFAIK; that's why I'm suspicious whether "control-group" is the right choice (or even a safe one) for MariaDB.

So, KillMode=control-group can introduce hard-to-notice shutdown race conditions when the signal propagation order is important but processes usually happen to shut down in an acceptable order due to other effects, like the duration spent on various phases of shutdown. For example, let's say a parent expects to send the shutdown signal to its child. If that child process runs some housekeeping at the beginning of shutdown (which typically keeps it around a bit longer) while the parent interacts with it during the parent's own shutdown, then the parent's expectations might be broken if the child happens to have a particularly quick housekeeping run and disappears because systemd told it to shut down.

This hazard could exist in the other direction, too. Let's say a parent process wants to open a pipe for child processes to communicate their final status during shutdown. It expects to open this pipe and only then send shutdown signals to children. If those children receive a premature shutdown signal from systemd, they might try to use the pipe before the parent process has created it.

If KillMode=mixed, but that's wrong for any process...

If the daemon expects systemd to reap its children, then using "mixed" will cause shutdown to hang, usually pending an eventual SIGKILL. However, if MariaDB has SendSIGKILL=no, then I can see why it might cause shutdown/restart to hang indefinitely. An indefinite hang on shutdown is a risk whenever SendSIGKILL=no is configured, though.

Concluding Thoughts and How to Extend Shutdown

Given that you don't want SIGKILL sweeping in prematurely to stop a shutdown in progress, it seems brazen to me to have systemd send all processes shutdown signals, bypassing any process shutdown topology you might otherwise expect to manage internally using signals. The risk of "mixed" seems to be hanging shutdowns, while the risk of "control-group" seems to be undefined behavior (unless KillMode=control-group is known to be well-defined). For a database, I'd choose the risk around "mixed" every time (again, unless everything is known to manage process dependencies correctly through other means).
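As a rough illustration of managing the shutdown topology internally, here is a minimal Python sketch of a main process that stops its children one at a time, waiting for each before moving on. It is not how mariadbd is implemented: `spawn_helper` and `stop_children_in_order` are hypothetical names, and the sleeping helper stands in for any child process.

```python
import signal
import subprocess
import sys
from typing import List


def spawn_helper() -> subprocess.Popen:
    # Hypothetical long-running helper; stands in for any child process.
    return subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )


def stop_children_in_order(
    children: List[subprocess.Popen], timeout: float = 10.0
) -> List[int]:
    """Send SIGTERM to each child in turn and wait for it before continuing."""
    exit_codes = []
    for child in children:
        if child.poll() is None:  # still running
            child.send_signal(signal.SIGTERM)
        # Wait for this child to finish before touching the next one, so
        # the main process decides the order in which children go away.
        exit_codes.append(child.wait(timeout=timeout))
    return exit_codes


if __name__ == "__main__":
    helpers = [spawn_helper(), spawn_helper()]
    print(stop_children_in_order(helpers))
```

In a real daemon, a routine like this would be called from the main process's SIGTERM handling. With KillMode=mixed the initial SIGTERM reaches only the main process, so that ordering is preserved; with KillMode=control-group every child receives SIGTERM at the same moment regardless.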

Finally, and perhaps most importantly, I see that you're worried about SIGKILL happening when a shutdown is orderly but long. Is MariaDB using Type=notify? If so, I'd advise using EXTEND_TIMEOUT_USEC= so that orderly shutdowns don't get interrupted, but you can still have the guarantee of systemd reaping a failed shutdown.
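For reference, the notification protocol behind EXTEND_TIMEOUT_USEC= is a datagram sent to the Unix socket named in the NOTIFY_SOCKET environment variable. The following is a minimal Python sketch of that mechanism, not MariaDB's implementation; the function names, the 30-second interval, and the three-step loop are illustrative assumptions.

```python
import os
import socket
from typing import Optional


def sd_notify(message: str, address: Optional[str] = None) -> bool:
    """Send one notification datagram to the service manager.

    Returns False when no notify socket is configured, for example when
    the process is not running under a Type=notify unit.
    """
    address = address or os.environ.get("NOTIFY_SOCKET")
    if not address:
        return False
    target = address
    if address.startswith("@"):
        # A leading "@" denotes a Linux abstract-namespace socket.
        target = b"\0" + address[1:].encode("utf-8")
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode("utf-8"), target)
    return True


def extend_stop_timeout(seconds: int, address: Optional[str] = None) -> bool:
    """Ask for `seconds` more time; must be sent again before it elapses."""
    return sd_notify(f"EXTEND_TIMEOUT_USEC={seconds * 1_000_000}", address)


if __name__ == "__main__":
    # Hypothetical shutdown loop: announce that the service is stopping,
    # then keep extending the deadline while cleanup is still progressing.
    sd_notify("STOPPING=1")
    for _step in range(3):  # stand-in for real cleanup phases
        extend_stop_timeout(30)
```

Because each extension only lasts for the requested interval, a shutdown that stops making progress also stops sending extensions, and systemd's normal timeout and SIGKILL handling can then take over.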
