[MDEV-17571] Make systemd timeout behavior more compatible with long Galera SSTs Created: 2018-10-30 Updated: 2020-03-25 Resolved: 2020-01-22 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Galera, Galera SST, Packaging, wsrep |
| Affects Version/s: | 10.1, 10.1.36, 10.2.18, 10.3.9, 10.3.11, 10.1.38, 10.2, 10.3 |
| Fix Version/s: | 10.1.44, 10.2.31, 10.3.22, 10.4.12, 10.5.1 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Claudio Nanni | Assignee: | Jan Lindström (Inactive) |
| Resolution: | Fixed | Votes: | 9 |
| Labels: | systemd | ||
| Issue Links: |
|
| Description |
|
SSTs can take several hours in many cases, but the current default value of TimeoutStartSec causes systemd to time out the joiner node's startup after about 90 seconds. It might make sense to disable the systemd service's startup timeout by default instead. Depending on the systemd version, disabling the startup timeout means setting either TimeoutStartSec=0 (if systemd version <=228) or TimeoutStartSec=infinity (if systemd version >=229). In systemd 236 and later, the startup timeout can be extended by sending EXTEND_TIMEOUT_USEC:
https://www.freedesktop.org/software/systemd/man/systemd.service.html
It looks like this approach was used to extend the startup timeout during SSTs while fixing https://github.com/mariadb/server/commit/be5698265a4195586142d1a34fdd1cce9d95d8a1
The relevant service_manager_extend_timeout function seems to be defined here: It sends the EXTEND_TIMEOUT_USEC notification mentioned in the systemd manual.
However, a lot of users are still seeing startup timeouts during SSTs. The cause seems to be that most systemd installations are not yet using version 236 or later.
The following documentation sections describe the current behavior:
https://mariadb.com/kb/en/library/introduction-to-state-snapshot-transfers-ssts/#ssts-and-systemd
https://mariadb.com/kb/en/library/systemd/#configuring-the-systemd-service-timeout |
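The version-dependent rule in the description can be sketched as a small helper. This is only an illustration, not part of MariaDB's packaging: the function names are hypothetical, and the drop-in location mentioned in the comment is an assumption about a typical systemd setup.

```python
def disabled_start_timeout(systemd_version: int) -> str:
    """Return the TimeoutStartSec value that disables the startup timeout.

    Per the description: 0 for systemd <= 228, infinity for systemd >= 229.
    """
    return "infinity" if systemd_version >= 229 else "0"


def render_dropin(systemd_version: int) -> str:
    """Build the text of a drop-in that overrides the startup timeout.

    Assumption: the text would be saved as a file such as
    /etc/systemd/system/mariadb.service.d/timeout.conf and applied
    with `systemctl daemon-reload`.
    """
    value = disabled_start_timeout(systemd_version)
    return "[Service]\nTimeoutStartSec={}\n".format(value)


if __name__ == "__main__":
    # RHEL 7 ships systemd 219, so this selects the older spelling.
    print(render_dropin(219))
```

On a host with systemd 219 this selects `TimeoutStartSec=0`; on 229 or later it selects `TimeoutStartSec=infinity`.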
| Comments |
| Comment by Elena Stepanova [ 2018-10-30 ] |
|
I would expect that instances where an SST indeed takes several hours are a minority compared to all MariaDB server instances, so should we really adjust the configuration for something that can be considered a corner case? I don't have a strong opinion on the subject. serg, jplindst, what do you think? |
| Comment by Sergei Golubchik [ 2018-10-31 ] |
|
1. I agree with Elena that the default configuration should be optimized for the default case |
| Comment by Valerii Kravchuk [ 2018-10-31 ] |
|
The default timeout is 90 seconds; there are just different ways to disable it depending on the systemd version. I think that even for a plain non-Galera MySQL or MariaDB instance the timeout should be larger than that (maybe even 600 or 900 seconds), if not infinite. Hence this request is to add an explicit setting (and a comment on how to disable the timeout) to the configuration file we include in our packages targeting systemd-based Linux distributions. |
| Comment by Richard Stracke [ 2018-10-31 ] |
|
Databases tend to get bigger. Even with a 10GB network, a high transfer speed is 40MB/s, which over 90 seconds amounts to 3.6 GB without any overhead. A 90-second limit restricts the database size of MariaDB with Galera unnecessarily. In addition, a failed SST is not very easy to spot if you are not a skilled DBA. |
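The arithmetic in this comment can be checked with a short sketch. The function names are hypothetical, the calculation ignores all protocol and disk overhead, and the 500 GB dataset in the usage line is an invented example, not a figure from this issue.

```python
def max_transfer_gb(rate_mb_per_s: float, timeout_s: float) -> float:
    """Upper bound on the data an SST can move before the timeout fires.

    Uses decimal units (1 GB = 1000 MB) and assumes a constant rate.
    """
    return rate_mb_per_s * timeout_s / 1000.0


def required_timeout_s(size_gb: float, rate_mb_per_s: float) -> float:
    """Minimum startup timeout needed to transfer size_gb at the given rate."""
    return size_gb * 1000.0 / rate_mb_per_s


if __name__ == "__main__":
    # 40 MB/s for the default 90 seconds, as in the comment above.
    print(max_transfer_gb(40, 90))
    # Hypothetical 500 GB dataset at the same rate.
    print(required_timeout_s(500, 40))
```

At 40 MB/s the default 90 seconds covers about 3.6 GB, while the hypothetical 500 GB dataset would need 12500 seconds, roughly three and a half hours.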
| Comment by Hartmut Holzgraefe [ 2018-10-31 ] |
|
While I think that "infinity" is too long, the default of 90 seconds is definitely too short for Galera. But it seems that the systemd developer(s) actually thought of such scenarios, allowing services that take longer to start up to extend the startup timeout dynamically: "If a service of Type=notify sends "EXTEND_TIMEOUT_USEC=…", this may cause the start time to be extended beyond TimeoutStartSec=..." <https://www.freedesktop.org/software/systemd/man/systemd.service.html> So that actually looks like the correct way to go: having mysqld extend the systemd timeout while a still healthy SST is ongoing. |
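The mechanism quoted here can be sketched as follows. This is not MariaDB's implementation, which is written in C++; it is a minimal illustration of the systemd notification protocol for a Type=notify service, with hypothetical function names. It assumes a Linux host and systemd 236 or later on the receiving side.

```python
import os
import socket


def extend_timeout_message(extra_seconds: int) -> bytes:
    """Format the notification that asks systemd for more startup time."""
    usec = extra_seconds * 1_000_000  # systemd expects microseconds
    return "EXTEND_TIMEOUT_USEC={}".format(usec).encode("ascii")


def notify(message: bytes) -> bool:
    """Send a message to the socket named in NOTIFY_SOCKET.

    Returns False when the variable is unset, i.e. the process was not
    started by systemd as a Type=notify service.
    """
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        # A leading "@" denotes a Linux abstract-namespace socket.
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message, path)
    return True


if __name__ == "__main__":
    # Ask for another 30 seconds; harmless no-op outside systemd.
    notify(extend_timeout_message(30))
```

A joiner would repeat the `notify` call at intervals shorter than the requested extension for as long as the SST is still making progress, so that a stalled transfer still times out eventually.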
| Comment by Geoff Montee (Inactive) [ 2018-12-06 ] |
|
jplindst tried to implement a fix using hholzgra's approach as part of https://github.com/mariadb/server/commit/be5698265a4195586142d1a34fdd1cce9d95d8a1
The relevant service_manager_extend_timeout function seems to be defined here: It sends the EXTEND_TIMEOUT_USEC notification that hholzgra mentioned. |
| Comment by Geoff Montee (Inactive) [ 2018-12-07 ] |
|
Loosely related: |
| Comment by Geoff Montee (Inactive) [ 2018-12-07 ] |
|
I just noticed that EXTEND_TIMEOUT_USEC was added in systemd version 236: https://lists.freedesktop.org/archives/systemd-devel/2017-December/039996.html The most common OS that we tend to see for MariaDB with Galera is RHEL 7, and that still has systemd version 219:
So even if |
| Comment by Geoff Montee (Inactive) [ 2019-02-16 ] |
|
This Percona blog post is relevant: Somehow, they're under the impression that this timeout used to be 900 seconds in MariaDB:
But as far as I can tell from the git commit history, MariaDB's systemd unit file has never explicitly defined TimeoutSec or TimeoutStartSec, and it has used the systemd default value of 90 seconds for TimeoutStartSec since systemd support was added in MariaDB 10.1. I guess the 900 seconds is a reference to the service startup timeout in mysql.server, which is used as the init script on distributions that don't support systemd.
It looks like Percona XtraDB Cluster sets an infinite timeout by default in its systemd unit file:
Should we set TimeoutStartSec to a higher value than 90 seconds by default for systemd versions that don't support EXTEND_TIMEOUT_USEC? It probably wouldn't hurt to at least set it to the old 900-second value from mysql.server that some users are used to. |
| Comment by Jan Lindström (Inactive) [ 2019-05-06 ] |
|
axel, is there some way to get the required settings for TimeoutStartSec and TimeoutSec into the mariadb.service file? |
| Comment by Axel Schwenke [ 2019-12-02 ] |
|
I've read a lot of code and documentation lately. Let me try to summarize things:
From the above I conclude the following steps:
|