[MDEV-9202] Systemd timeout is not sufficient for larger servers Created: 2015-11-27 Updated: 2018-12-07 Resolved: 2016-03-18 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Documentation |
| Affects Version/s: | 10.1.8, 10.1.9, 10.1 |
| Fix Version/s: | N/A |
| Type: | Bug | Priority: | Major |
| Reporter: | Michiel Hazelhof | Assignee: | Sergey Vojtovich |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | None | ||
| Environment: |
Debian Jessie |
||
| Issue Links: |
|
||||||||||||||||||||||||||||
| Sprint: | 10.1.13 | ||||||||||||||||||||||||||||
| Description |
|
On larger servers (thousands of tables) the start process can take very long (about 10 minutes on our server). With previous init systems this wasn't much of a problem. With systemd it is:
Adding a large TimeoutStartSec fixes the problem but, as noted, it gets overwritten on every upgrade. |
| Comments |
| Comment by Elena Stepanova [ 2015-11-27 ] | |||
|
ATTN danblack | |||
| Comment by Daniel Black [ 2015-11-27 ] | |||
|
The correct way to adjust systemd settings so they don't get overwritten is to create a drop-in directory and file, like so: /etc/systemd/system/mariadb.service.d/timeout.conf
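A minimal sketch of creating such a drop-in from the shell. The 15 minute value is an illustrative assumption (the file contents from the original comment are not preserved here), and the `write_timeout_dropin` helper name is hypothetical. The target directory is a parameter so the function can be tried against a scratch directory first.

```shell
#!/bin/sh
# Hypothetical helper: write a systemd drop-in that raises the start timeout.
# $1 = drop-in directory (on a real host: /etc/systemd/system/mariadb.service.d)
# $2 = value for TimeoutStartSec (illustrative default, not from the thread)
write_timeout_dropin() {
    dropin_dir=$1
    timeout=${2:-15min}
    mkdir -p "$dropin_dir" || return 1
    # Settings in a drop-in override the packaged unit file and are not
    # replaced when the package is upgraded.
    cat > "$dropin_dir/timeout.conf" <<EOF
[Service]
TimeoutStartSec=$timeout
EOF
}

# Try it against a scratch directory; on a real host pass the /etc path above.
scratch=$(mktemp -d)
write_timeout_dropin "$scratch/mariadb.service.d"
cat "$scratch/mariadb.service.d/timeout.conf"
```

Once the file is written to the real path, `systemctl daemon-reload` makes systemd re-read its unit files so the new value applies to the next start.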
| |||
| Comment by Daniel Black [ 2015-11-28 ] | |||
|
Created the documentation at https://mariadb.com/kb/en/mariadb/systemd/; it is still a bit rough. As I read this bug, the major problem is that the configuration of TimeoutStartSec gets overwritten. Hopefully the correct configuration of user systemd settings is now adequately described. The default value for TimeoutStartSec appears to be 90 seconds. I don't think we should increase this by default. Justifications for an appropriate value are welcome. As an aside, 10 minutes for startup seems extraordinarily long. If innodb_buffer_pool_load_at_startup is the cause perhaps | |||
| Comment by Michiel Hazelhof [ 2015-11-29 ] | |||
|
Nice work on the documentation; it should help administrators (like me) who are quite new to systemd. I guess it should include "systemctl daemon-reload" to make sure changes take effect after manually editing files. Also, our Debian install does not honour the TimeoutStartSec setting when invoked using the old init.d commands (e.g. by anyone coming from older Debian versions). The default of 90 seconds should suffice for most users, I guess we shall have to wait patiently for 10.2 as described in | |||
| Comment by Daniel Black [ 2015-11-29 ] | |||
|
We're all a bit new to systemd I think. Good idea with daemon-reload - added. Debian init scripts seem to respect MYSQLD_STARTUP_TIMEOUT as a mechanism for achieving the same thing. https://github.com/MariaDB/server/commit/7f19330c595e3183d079fe2c18eecc74740e8f83#diff-f0eed6025f4cb6b214beea14d2d68cd4 seems to have prevented the redirect (which I thought was going to be for 10.0 only, anyway, need to work out what the direction is there). I can't tell if | |||
| Comment by Karl E. Jørgensen [ 2015-12-08 ] | |||
|
Please note that for MariaDB Galera cluster, a node startup may require a full state transfer - which can take a considerable amount of time, while all of the databases are copied... | |||
| Comment by Sergey Vojtovich [ 2015-12-15 ] | |||
|
Closing the issue per danblack's comment. Fixed by documenting systemd settings. Feel free to reopen if you disagree. | |||
| Comment by Kolbe Kegel (Inactive) [ 2016-01-12 ] | |||
|
We should disable the timeout for server startup. This is especially important in the case of Galera. SST can take a very long time. Killing mysqld because SST is working is completely nonsensical. | |||
| Comment by Daniel Black [ 2016-01-12 ] | |||
|
or calling
earlier in wsrep_init_startup(true)/wsrep_sst_wait | |||
| Comment by Kolbe Kegel (Inactive) [ 2016-01-12 ] | |||
|
danblack, I don't think telling systemd that the service is ready before that's true is a good idea. There are a great many opportunities for SST to fail beyond the default 90s timeout. The only safe thing to do IMO is to remove the timeout entirely, since SST of a large dataset can take hours. | |||
| Comment by Daniel Black [ 2016-01-12 ] | |||
|
Yeah, I wasn't convinced by the earlier sd_notify fix either. TimeoutStartSec=0 appears to disable the timeout according to the documentation: http://www.freedesktop.org/software/systemd/man/systemd.service.html | |||
| Comment by Kolbe Kegel (Inactive) [ 2016-01-12 ] | |||
|
I'd recommend using TimeoutSec=0 instead. It can also take >90s to stop the server if it is flushing the buffer pool or completing some other long-running operation. | |||
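A sketch of what this suggestion would look like as a local override rather than a change to the packaged unit. `TimeoutSec=` is shorthand for setting both `TimeoutStartSec=` and `TimeoutStopSec=`. The `render_no_timeout_dropin` helper name is hypothetical, and the value `0` is taken from this thread; newer systemd releases also accept `infinity`, so the systemd.service man page for the installed version is the authority here.

```shell
#!/bin/sh
# Hypothetical helper: print a drop-in that disables both the start and the
# stop timeout for the service it is attached to.
render_no_timeout_dropin() {
    printf '%s\n' '[Service]' 'TimeoutSec=0'
}

# Print to stdout for review; redirect the output into a .conf file under
# /etc/systemd/system/mariadb.service.d/ to apply it on a real host.
render_no_timeout_dropin
```

Printing the fragment before installing it keeps the override easy to review, which matters because this setting removes the safety net against a hung start or stop.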
| Comment by Kolbe Kegel (Inactive) [ 2016-03-10 ] | |||
|
svoj, this issue is assigned to you. Do you object to setting TimeoutSec=0 in the mariadb.service unit file? This is causing serious problems. The fix is very simple. | |||
| Comment by Sergey Vojtovich [ 2016-03-11 ] | |||
|
No objections from my side, though it's a pity that we can't use this nice systemd feature. I added this task to the next 10.1 sprint backlog. | |||
| Comment by Kolbe Kegel (Inactive) [ 2016-03-11 ] | |||
|
What "nice systemd feature" do you wish you could use? | |||
| Comment by Sergey Vojtovich [ 2016-03-11 ] | |||
|
start/stop timeouts | |||
| Comment by Kolbe Kegel (Inactive) [ 2016-03-11 ] | |||
|
What makes you think that start/stop timeouts are such a nice feature? What are the scenarios where MariaDB hasn't started after 1, 2, 5, 10, or 30 minutes and you think the OS should kill the process? SST, InnoDB recovery, etc., can all take a basically undefined amount of time, and timing out after some period of time just seems like it doesn't solve anything. I'm really missing the scenarios where this would be desirable/helpful behavior. | |||
| Comment by Sergey Vojtovich [ 2016-03-12 ] | |||
|
IMHO the fact that MariaDB isn't bug free makes this feature nice. I was able to find at least the following shutdown deadlocks almost instantly: | |||
| Comment by Sergey Vojtovich [ 2016-03-16 ] | |||
|
serg, please review fix for this bug. | |||
| Comment by Sergei Golubchik [ 2016-03-16 ] | |||
|
I'd prefer to keep the default timeout. The default is supposed to be good for most cases. And users should be able to adjust it as needed. See, for example | |||
| Comment by VAROQUI Stephane [ 2016-09-14 ] | |||
|
Hi, if a node deadlock is caused by a bug, please kill -9 and open a bug. /svar | |||
| Comment by Michiel Hazelhof [ 2016-09-14 ] | |||
|
It looks like Personally I'd recommend setting it to 3 minutes for 10.2+; the default behaviour appears to be to wait for a full MariaDB boot instead of a hard kill. Given this behaviour, we only have to choose a sensible timeout before warning the user that something might be wrong (the server will continue to boot and exit after it has done so; in a recovery situation a simple restart should suffice, and for larger servers the override can be used). | |||
| Comment by Daniel Black [ 2017-05-01 ] | |||
|
RFE filed upstream in an attempt to get a workable solution: https://github.com/systemd/systemd/issues/5868 |