[MDEV-10382] Using systemd, mariadb doesn't restart on crashes
Created: 2016-07-16  Updated: 2016-12-07  Resolved: 2016-12-06

Status: Closed
Project: MariaDB Server
Component/s: Platform Debian, Scripts & Clients
Affects Version/s: 10.1
Fix Version/s: 10.1.20

Type: Bug
Priority: Critical
Reporter: jocelyn fournier
Assignee: Sergei Golubchik
Resolution: Fixed
Votes: 0
Labels: None

Issue Links:
PartOf
includes MDEV-9282 Debian: the Lintian complains about "... Closed

 Description   

Hi,

After switching to Debian Jessie, which uses /etc/systemd/system/mariadb.service instead of mysqld_safe, MariaDB doesn't restart after a crash (official apt package, MariaDB 10.1.14).
In the default /etc/systemd/system/mariadb.service, I had to change the default
Restart=on-abort
to
Restart=on-failure
to get the correct behaviour (a restart after a crash).

Thanks,
Jocelyn



 Comments   
Comment by jocelyn fournier [ 2016-07-18 ]

It would also be great to add EnvironmentFile=/etc/sysconfig/mysql to this file and to create an empty /etc/sysconfig/mysql file.
(In my case, I use it to set LD_PRELOAD=/usr/lib/mysql/libHotBackup.so so that hot backup works with TokuDB.)

Comment by Andrii Nikitin (Inactive) [ 2016-11-03 ]

I am not a systemd expert, but this request doesn't look valid to me.
According to Table 1 at https://www.freedesktop.org/software/systemd/man/systemd.service.html, there is no difference between "on-abort" and "on-failure" in how they handle termination by a signal (i.e. a crash).

Please confirm whether the conclusions in this report are based on some source or solely on personal experience. Could the restart delay have caused a misunderstanding? (It is 5 seconds by default on my Ubuntu 16.04.)
(The ticket will be closed as invalid if no further clarification is provided.)
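
To make Table 1 concrete, here is a minimal C sketch of the restart decision for the Restart= values discussed in this ticket. It is a simplified model, not systemd's code: the enum, struct and function names are hypothetical, SuccessExitStatus= is ignored, and timeouts, the watchdog and an explicit "systemctl stop" are not modeled. The set of clean signals (SIGHUP, SIGINT, SIGTERM, SIGPIPE) follows the systemd.service man page. It shows that on-abort and on-failure agree when the process is killed by a signal, but differ when it exits with a non-zero status.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>

/* Subset of systemd Restart= values discussed in this ticket (hypothetical names). */
enum restart_policy { RESTART_NO, RESTART_ALWAYS, RESTART_ON_FAILURE, RESTART_ON_ABORT };

/* How the main process ended, as seen by the service manager. */
struct termination {
    bool killed_by_signal; /* true: terminated by a signal; false: called exit() */
    int code;              /* exit status, or signal number if killed_by_signal */
};

/* Clean terminations: exit status 0, or SIGHUP, SIGINT, SIGTERM, SIGPIPE.
 * SuccessExitStatus= is not modeled in this sketch. */
static bool is_clean(const struct termination *t)
{
    if (!t->killed_by_signal)
        return t->code == 0;
    return t->code == SIGHUP || t->code == SIGINT ||
           t->code == SIGTERM || t->code == SIGPIPE;
}

/* Restart decision for the "exit code" and "signal" rows of Table 1. */
static bool should_restart(enum restart_policy p, const struct termination *t)
{
    switch (p) {
    case RESTART_ALWAYS:
        return true;
    case RESTART_ON_FAILURE:
        return !is_clean(t);                       /* unclean exit code or signal */
    case RESTART_ON_ABORT:
        return t->killed_by_signal && !is_clean(t); /* unclean signal only */
    case RESTART_NO:
    default:
        return false;
    }
}
```

For example, a process that catches SIGSEGV and then calls exit(1) is seen as { false, 1 }: should_restart() returns true for RESTART_ON_FAILURE but false for RESTART_ON_ABORT.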

Comment by jocelyn fournier [ 2016-11-03 ]

Hi Andrii,

It's based only on personal experience: MariaDB was running properly, then it hit a crash and didn't restart (I reproduced this behaviour several times).
After switching from on-abort to on-failure, MariaDB restarted properly after a crash.

BTW, on-failure is the recommended setting:

"Setting this to on-failure is the recommended choice for long-running services, in order to increase reliability by attempting automatic recovery from errors."

Comment by Sergey Vojtovich [ 2016-11-03 ]

We used to have on-failure but had to switch to on-abort for a reason; see the comment in mariadb.service:

# Restart crashed server only, on-failure would also restart, for example, when
# my.cnf contains unknown option

Comment by Sergey Vojtovich [ 2016-11-03 ]

If all you are worried about is crashes not caught by systemd, then I'd say it is a systemd bug, and it would be better to have it fixed there.

OTOH, I seem to remember that InnoDB may use exit() instead of abort(). In that case the crash may not be handled properly, and we need to fix this in MariaDB. See MDEV-9282.

Comment by jocelyn fournier [ 2016-11-03 ]

Not having mysqld restarted after a crash causes downtime, which is not good. The switch from mysqld_safe (which handled crashes properly) to systemd is fairly recent in the Debian Jessie package, AFAIK?

Comment by Sergey Vojtovich [ 2016-11-03 ]

Not having mysqld restarted in case of a crash is certainly no good. But I'm afraid switching from on-abort to on-failure is no good either.

mysqld_safe distinguishes crashes from a normal shutdown by checking whether the pid file still exists. I doubt systemd is flexible enough to perform such a check. Even if it were, I'd do my best to avoid relying on it.
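
As an illustration of the pid-file heuristic described above, here is a minimal C sketch with hypothetical names; mysqld_safe itself is a shell script, so this is not its code. It assumes that mysqld deletes its pid file during a clean shutdown: if the file is still present after the server process has exited, the wrapper treats the exit as a crash and restarts the server.

```c
#include <assert.h>
#include <unistd.h>

/* What a mysqld_safe-style wrapper does after the server process exits
 * (hypothetical names). */
enum wrapper_action { STOP_SUPERVISING, RESTART_SERVER };

/* A clean shutdown removes the pid file, so a pid file that is still
 * present after the server has exited is taken as a sign of a crash. */
static enum wrapper_action after_server_exit(const char *pid_file)
{
    if (access(pid_file, F_OK) == 0)
        return RESTART_SERVER;   /* pid file left behind: treat as a crash */
    return STOP_SUPERVISING;     /* pid file gone: normal shutdown */
}
```

The decision depends on state outside the exit status itself, which is the kind of check the comment above doubts systemd can express.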

If you could provide the mysqld error log from one of these crashes, we could probably come up with a better solution.

Comment by jocelyn fournier [ 2016-11-03 ]

I think the crash was caused by https://jira.mariadb.org/browse/MDEV-10410

Comment by Sergey Vojtovich [ 2016-11-03 ]

According to man systemd.service:

If set to on-abort, the service will be restarted only if the service process exits due to an uncaught signal not specified as a clean exit status.

I'm not sure how systemd determines an "uncaught signal", but we definitely catch all signals, and after handling them we call exit().
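
The effect described above can be checked from the parent's side with a small POSIX C sketch. A forked child installs a SIGSEGV handler that exits with status 1 (a hypothetical stand-in for mysqld's fatal-signal handler, not its actual code) and then raises SIGSEGV. The handler uses _exit() rather than exit() because only _exit() is async-signal-safe. The parent, playing the role of systemd, collects the status with waitpid(): it sees a normal exit with status 1, not death by a signal, so Restart=on-abort never applies.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for a fatal-signal handler that handles the
 * signal and then exits with status 1. */
static void catch_and_exit(int sig)
{
    (void)sig;
    _exit(1);   /* async-signal-safe, unlike exit() */
}

/* Fork a child that installs catch_and_exit for SIGSEGV and then "crashes".
 * Returns the wait status collected by the parent, or -1 on error. */
static int crash_child_that_catches(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = catch_and_exit;
        sigemptyset(&sa.sa_mask);
        if (sigaction(SIGSEGV, &sa, NULL) != 0)
            _exit(2);
        raise(SIGSEGV);  /* simulated crash */
        _exit(0);        /* not reached: the handler exits first */
    }
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return status;
}
```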

Comment by Andrii Nikitin (Inactive) [ 2016-11-04 ]

OK, if mysqld suppresses all signals (in signal_handler.cc) and calls exit(1), then Restart=on-abort is definitely the wrong configuration here (because technically mysqld suppresses every signal that can be suppressed, i.e. all except SIGKILL and SIGSTOP, which cannot be caught).

The behavior closest to mysqld_safe would be Restart=always.

To prevent mysqld from being restarted when an incorrect option is used, we should return a special exit code for an incorrect configuration (e.g. 203) and set RestartPreventExitStatus=203.

So I set the status to 'Confirmed' and increased the priority to 'Critical'.

Comment by Daniel Black [ 2016-11-28 ]

Or make the signal handler raise the signal again after removing the handlers, instead of calling exit(). systemd will then see the terminating signal through waitid(2).
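
The re-raise approach suggested above can be sketched the same way, assuming POSIX and using hypothetical names. The handler restores the default disposition with signal(sig, SIG_DFL) and calls raise(sig). Because a signal is blocked while its own handler runs, the re-raised signal stays pending and is delivered, with its default action, as soon as the handler returns. The parent then sees WIFSIGNALED, which systemd treats as an unclean signal, so Restart=on-abort would apply. Any real crash reporting would go where the placeholder comment is and must itself use only async-signal-safe calls.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical fatal-signal handler that re-raises instead of exiting. */
static void report_and_reraise(int sig)
{
    /* Crash reporting would go here (async-signal-safe calls only). */
    signal(sig, SIG_DFL);  /* remove our handler */
    raise(sig);            /* stays pending until this handler returns */
}

/* Fork a child that installs report_and_reraise for sig and then raises sig.
 * Returns the wait status collected by the parent, or -1 on error. */
static int crash_child_that_reraises(int sig)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = report_and_reraise;
        sigemptyset(&sa.sa_mask);
        if (sigaction(sig, &sa, NULL) != 0)
            _exit(2);
        raise(sig);  /* simulated crash */
        _exit(0);    /* not reached: the default action ends the process */
    }
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return status;
}
```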

Comment by Sergei Golubchik [ 2016-11-30 ]

Agreed, it's probably safest to use Restart=on-failure and RestartPreventExitStatus to fine-tune which errors should cause a restart.
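
The combination agreed on here can be modeled in a few lines of C. This is a hypothetical sketch, not systemd code: it covers only processes that exit with a status code (RestartPreventExitStatus= also accepts signal names), and HYPOTHETICAL_CONFIG_ERROR_EXIT stands for whatever dedicated code mysqld might return for a bad option. It deliberately avoids 203, the value suggested earlier, because systemd already reports its own failure to execute the binary as status 203 (EXEC).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical exit code mysqld could use for "invalid configuration". */
#define HYPOTHETICAL_CONFIG_ERROR_EXIT 2

/* True if exit_status appears in the RestartPreventExitStatus= list. */
static bool restart_prevented(int exit_status, const int *prevent, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (prevent[i] == exit_status)
            return true;
    return false;
}

/* Restart=on-failure for a process that exited (was not killed by a signal):
 * restart on any non-zero status unless that status is in the prevent list. */
static bool on_failure_restarts(int exit_status, const int *prevent, size_t n)
{
    if (exit_status == 0)
        return false;  /* clean exit: on-failure does not restart */
    return !restart_prevented(exit_status, prevent, n);
}
```

With prevent = { HYPOTHETICAL_CONFIG_ERROR_EXIT }, a crash that ends in exit(1) is restarted, while a start-up failure caused by an unknown option in my.cnf is not.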

Comment by Marko Mäkelä [ 2016-12-07 ]

InnoDB calls exit() from I/O threads in certain situations. It would probably be better to call abort() to be consistent with other cases where InnoDB aborts when encountering a corrupted page.
That said, if InnoDB background tasks are trying to access a corrupted page, the server could get into a loop where it crashes and is restarted repeatedly. I wonder whether there is, or should be, some mechanism to prevent the error log from filling the file system, or to prevent ‘hopeless’ restart attempts.
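
One possible shape for such a mechanism is a sliding-window limiter that refuses further restarts once too many have happened within a given interval, similar in spirit to systemd's StartLimitIntervalSec= and StartLimitBurst= settings. The C sketch below is hypothetical (names, the MAX_BURST cap and the example numbers are assumptions); the caller supplies timestamps in seconds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_BURST 16  /* hypothetical upper bound on the stored history */

/* Hypothetical crash-loop guard: at most `burst` restarts in any window
 * of `interval` seconds. */
struct restart_limiter {
    double interval;          /* window length in seconds */
    size_t burst;             /* restarts allowed per window */
    double stamps[MAX_BURST]; /* times of recent restarts, oldest first */
    size_t count;
};

/* Returns true and records the restart if it is allowed at time `now`. */
static bool limiter_allow(struct restart_limiter *l, double now)
{
    size_t cap = l->burst < MAX_BURST ? l->burst : MAX_BURST;

    /* Forget restarts that have fallen out of the window. */
    size_t keep = 0;
    for (size_t i = 0; i < l->count; i++)
        if (now - l->stamps[i] < l->interval)
            l->stamps[keep++] = l->stamps[i];
    l->count = keep;

    if (l->count >= cap)
        return false;  /* too many recent restarts: treat as hopeless */
    l->stamps[l->count++] = now;
    return true;
}
```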

Generated at Thu Feb 08 07:41:48 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.