[MDEV-17934] Make systemd timeout behavior more compatible with longer Galera recovery times Created: 2018-12-07  Updated: 2019-12-05  Resolved: 2019-12-05

Status: Closed
Project: MariaDB Server
Component/s: Galera, Packaging, wsrep
Affects Version/s: 10.1, 10.2.16, 10.2.18, 10.2, 10.3
Fix Version/s: N/A

Type: Bug
Priority: Major
Reporter: Geoff Montee (Inactive)
Assignee: Rasmus Johansson (Inactive)
Resolution: Duplicate
Votes: 0
Labels: galera, systemd, wsrep

Issue Links:
Relates
relates to MDEV-9202 Systemd timeout is not sufficient for... Closed
relates to MDEV-9520 xtrabackup-v2 to support systemd node... Closed
relates to MDEV-15607 mysqld crashed few after node is bein... Closed
relates to MDEV-17003 service_manager_extend_timeout() bein... Closed
relates to MDEV-14705 systemd: EXTEND_TIMEOUT_USEC= to avoi... Closed
relates to MDEV-17571 Make systemd timeout behavior more co... Closed

 Description   

When Galera is enabled, MariaDB's systemd service executes the "galera_recovery" script as an ExecStartPre operation. See the following:

https://github.com/MariaDB/server/blob/ce8716a1ed786ff971b5e15c88385d50b649ec7f/support-files/mariadb.service.in#L71

The MariaDB systemd service has a default TimeoutStartSec value of 90 seconds, so startup fails if this ExecStartPre step takes longer than that. For example, see the following failure from a syslog:

Sep 13 15:48:28 server1 systemd[1]: Starting MariaDB 10.2.16 database server...
Sep 13 15:49:58 server1 systemd[1]: mariadb.service: Start-pre operation timed out. Terminating.
Sep 13 15:49:58 server1 systemd[1]: Failed to start MariaDB 10.2.16 database server.
Sep 13 15:49:58 server1 systemd[1]: mariadb.service: Unit entered failed state.
Sep 13 15:49:58 server1 systemd[1]: mariadb.service: Failed with result 'timeout'.
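
One workaround that is not proposed in this report itself, but follows from the timeout described above, is to raise TimeoutStartSec with a systemd drop-in file. The fragment below is only a sketch: the unit name mariadb.service is taken from the log above, while the drop-in file name and the 900-second value are illustrative assumptions.

```ini
# /etc/systemd/system/mariadb.service.d/timeout.conf  (illustrative file name)
[Service]
# Assumed value; size it to the longest crash recovery you expect.
TimeoutStartSec=900
```

After creating the file, `systemctl daemon-reload` makes systemd pick up the new value.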

galera_recovery has to start the server, so this step can take a while, especially if the server previously crashed and has to perform crash recovery. MDEV-14705 was supposed to extend the systemd timeout during server startup. Despite that, server versions with the fix for MDEV-14705 still see timeouts during ExecStartPre. Were some important long-running startup functions missed?
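
For readers unfamiliar with the mechanism behind MDEV-14705, the following is a minimal Python sketch of how a process can ask systemd for more start-up time through the notification socket. It is not MariaDB's implementation (the server is written in C++); the function name `extend_start_timeout` is hypothetical, and the sketch assumes the process was started by systemd with a `NOTIFY_SOCKET` environment variable set.

```python
import os
import socket


def extend_start_timeout(usec: int) -> bool:
    """Ask systemd for `usec` more microseconds before a timeout fires.

    Returns False when no notification socket is available, for example
    when the process was not started by systemd.
    """
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    # A leading "@" denotes a Linux abstract-namespace socket.
    if path.startswith("@"):
        path = "\0" + path[1:]
    message = f"EXTEND_TIMEOUT_USEC={usec}".encode("ascii")
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message, path)
    return True
```

A long-running startup phase would call something like `extend_start_timeout(30_000_000)` periodically while it is still making progress. Whether systemd honors the message depends on the systemd version and on the unit's notification settings.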

See also MDEV-17571 as another case where systemd timeout extensions didn't seem to work as intended.



 Comments   
Comment by Geoff Montee (Inactive) [ 2018-12-07 ]

I just noticed that EXTEND_TIMEOUT_USEC was added in systemd version 236:

https://lists.freedesktop.org/archives/systemd-devel/2017-December/039996.html

The most common OS that we tend to see for MariaDB with Galera is RHEL 7, and that still has systemd version 219:

[ec2-user@ip-172-30-0-249 ~]$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.2 (Maipo)
[ec2-user@ip-172-30-0-249 ~]$ sudo yum info systemd
Loaded plugins: amazon-id, rhui-lb, search-disabled-repos
Repodata is over 2 weeks old. Install yum-cron? Or run: yum makecache fast
Installed Packages
Name        : systemd
Arch        : x86_64
Version     : 219
Release     : 19.el7
Size        : 21 M
Repo        : installed
From repo   : anaconda
Summary     : A System and Service Manager
URL         : http://www.freedesktop.org/wiki/Software/systemd
License     : LGPLv2+ and MIT and GPLv2+
Description : systemd is a system and service manager for Linux, compatible with
            : SysV and LSB init scripts. systemd provides aggressive parallelization
            : capabilities, uses socket and D-Bus activation for starting services,
            : offers on-demand starting of daemons, keeps track of processes using
            : Linux cgroups, supports snapshotting and restoring of the system
            : state, maintains mount and automount points and implements an
            : elaborate transactional dependency-based service control logic. It can
            : work as a drop-in replacement for sysvinit.
 
Available Packages
Name        : systemd
Arch        : x86_64
Version     : 219
Release     : 62.el7
Size        : 5.1 M
Repo        : rhui-REGION-rhel-server-releases/7Server/x86_64
Summary     : A System and Service Manager
URL         : http://www.freedesktop.org/wiki/Software/systemd
License     : LGPLv2+ and MIT and GPLv2+
Description : systemd is a system and service manager for Linux, compatible with
            : SysV and LSB init scripts. systemd provides aggressive parallelization
            : capabilities, uses socket and D-Bus activation for starting services,
            : offers on-demand starting of daemons, keeps track of processes using
            : Linux cgroups, supports snapshotting and restoring of the system
            : state, maintains mount and automount points and implements an
            : elaborate transactional dependency-based service control logic. It can
            : work as a drop-in replacement for sysvinit.

So even if MDEV-14705 fixed this problem for systemd versions 236 and above, a lot of users are using systemd versions that are much older, so they would not benefit from that functionality. That explains a lot.
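
The version comparison above can be sketched as a small check on the first line of `systemctl --version` output. This is an illustration only: the helper names are hypothetical, the 236 threshold comes from the mailing-list link in this comment, and the version number is just a heuristic because a distribution could backport the feature.

```python
import re

# Threshold taken from the comment above: EXTEND_TIMEOUT_USEC arrived in 236.
EXTEND_TIMEOUT_MIN_VERSION = 236


def systemd_version(version_output: str) -> int:
    """Parse output of `systemctl --version`, e.g. 'systemd 219'."""
    match = re.match(r"systemd\s+(\d+)", version_output)
    if match is None:
        raise ValueError(f"unrecognized version output: {version_output!r}")
    return int(match.group(1))


def supports_extend_timeout(version_output: str) -> bool:
    """True if the reported systemd version is at least 236."""
    return systemd_version(version_output) >= EXTEND_TIMEOUT_MIN_VERSION
```

For the RHEL 7 host shown above, `supports_extend_timeout("systemd 219")` evaluates to False.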

Comment by Elena Stepanova [ 2018-12-07 ]

Since ratzpo previously assigned MDEV-17571 to himself, I'm assigning this one to him as well.

Comment by Axel Schwenke [ 2019-12-05 ]

This is a duplicate of MDEV-17571
