[MDEV-14705] systemd: EXTEND_TIMEOUT_USEC= to avoid startup and shutdown timeouts Created: 2017-12-19  Updated: 2020-03-25  Resolved: 2018-04-06

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Fix Version/s: 10.1.33, 10.2.15, 10.3.6

Type: Task Priority: Critical
Reporter: Daniel Black Assignee: Marko Mäkelä
Resolution: Fixed Votes: 1
Labels: contribution, foundation, patch

Issue Links:
Problem/Incident
causes MDEV-16149 Failing assertion: node->modificatio... Closed
causes MDEV-16150 Mariadb 10.3.6 Failing assertion on D... Closed
causes MDEV-17003 service_manager_extend_timeout() bein... Closed
Relates
relates to MDEV-11027 InnoDB log recovery is too noisy Closed
relates to MDEV-12323 Rollback progress log messages during... Closed
relates to MDEV-12352 InnoDB shutdown should not be blocked... Closed
relates to MDEV-15554 InnoDB page_cleaner shutdown sometime... Closed
relates to MDEV-15832 With innodb_fast_shutdown=3, skip the... Closed
relates to MDEV-17571 Make systemd timeout behavior more co... Closed
relates to MDEV-17934 Make systemd timeout behavior more co... Closed
relates to MDEV-18224 MTR's internal check of the test case... Closed
relates to MDEV-9202 Systemd timeout is not sufficient for... Closed
relates to MDEV-11035 Restore removed disallow-writes for G... Closed
relates to MDEV-15606 Galera can't perform SST in 10.2.13 i... Closed
relates to MDEV-15607 mysqld crashed few after node is bein... Closed
relates to MDEV-17571 Make systemd timeout behavior more co... Closed
Sprint: 10.3.6-1

 Description   

MDEV-9202 and MDEV-8509 show cases where the systemd timeout isn't sufficient to perform initialization/shutdown.

Since https://github.com/systemd/systemd/commit/a327431bd168b2f327f3cd422379e213c643f2a5, released in systemd v236, Type=notify services can advise the systemd service manager that they are still working, to avoid the service timing out.

The use of EXTEND_TIMEOUT_USEC= on older systemd versions has no effect and is therefore backward compatible.

This needs to be included in (feel free to correct/extend):

  • buffer pool dump - buf_dump (storage/innobase/buf/buf0dump.cc)
  • redo log recovery - log_group_read_log_seg (storage/innobase/log/log0log.cc)
  • undo recovery - recv_recover_page_func (storage/innobase/log/log0recv.cc)
  • change buffer?
  • merge buffer?

I was planning to generalise the 15-second interval of recv_sys_t->report() and use a define INNODB_REPORT_INTERVAL (in include/univ.i?) as the basis for this form of watchdog. I'd send notify messages with EXTEND_TIMEOUT_USEC set to INNODB_REPORT_INTERVAL * 2 as an acceptable margin.

Anywhere else or other suggestions marko, jplindst?

  • galera SST scripts - donor and recipient

Any other server/engine slow points to account for?

Target 10.3 and then look at a backport?



 Comments   
Comment by Sergey Vojtovich [ 2018-03-13 ]

Overdue PR.

Comment by Marko Mäkelä [ 2018-03-14 ]

Startup and shutdown are quite different. I would consider them separately.

Startup

For startup, the biggest time consumers should be redo log apply and the buffer pool load. For the redo log apply, we cannot know in advance how much work there is to be done. MDEV-11027 introduced progress reporting for this, including a call to sd_notifyf().

Undo recovery (rolling back incomplete transactions) takes place in a background thread. Starting with MDEV-12323, we do report progress for that as well via sd_notifyf(). If the binlog is used, then the tc_heuristic_recover option can come into play, internally initiating XA COMMIT or XA ROLLBACK operations in some thread. I do not know if that would block user connections from starting at all. Commit is fast in InnoDB, while the rollback speed is proportional to the number of modified rows and the number of indexes.

Shutdown

In shutdown, there have been multiple hangs. Most recently, I fixed MDEV-15554.

There are different innodb_fast_shutdown modes. By default (innodb_fast_shutdown=1), a fast shutdown is done: all dirty pages are flushed from the buffer pool to the data files and a log checkpoint is created.

Some progress reporting for the flush phase would be nice. We know exactly how many dirty pages there are, so the progress measure would be linear.

There is also the crash-like innodb_fast_shutdown=2 which is supposed to complete instantly.

The slowest shutdown is innodb_fast_shutdown=0. It will:

  1. Wait for all transactions to complete (starting with MDEV-12352, normal shutdown would abort the rollback of recovered transactions)
  2. Purge the history of all completed transactions (well, except when it fails to do so due to MDEV-11802)
  3. Empty the change buffer (formerly known as the insert buffer) by merging changes to secondary index leaf pages.

Only after these tasks does it become feasible to flush the dirty pages into the data files and to create the final redo log checkpoint.

Comment by Daniel Black [ 2018-03-14 ]

quick note - for the purposes of systemd timing, startup starts at process creation and ends just before accepting the first connection.

On startup I haven't included the buffer pool load because, as I understand it, it continues in the background while the server is processing queries.

I have included the buffer pool save on shutdown too.

The exact time need not be known, only that a loop iteration can progress within an estimated maximum time.

Comment by Rick Pizzi [ 2018-03-20 ]

@marko the startup has to consider Galera SST time, which can be hours on larger datasets

Comment by Daniel Black [ 2018-03-20 ]

rpizzi, SST scripts will be able to communicate just as easily as the server. The tricky bit is identifying the slower parts of the SST and acting on them. For rsync, I was thinking of enabling progress monitoring and adding a subfunction that takes any output on stdout and issues EXTEND_TIMEOUT_USEC=... at that point.

something like:

if systemd-notify --booted && [ -n "${NOTIFY_SOCKET}" ]
then
  rsync_output_extendtimeout()
  {
    # some rate limiting / variable timeout progress
    while read -r input
    do
      systemd-notify EXTEND_TIMEOUT_USEC=100000
    done
  }
else
  # not under systemd: just drain stdin so rsync is not blocked on a full pipe
  rsync_output_extendtimeout() { cat > /dev/null; }
fi
...
     rsync --progress | rsync_output_extendtimeout
...

Edit: $NOTIFY_SOCKET is a unix socket or an abstract socket. Probably best just to use systemd-notify and check if $NOTIFY_SOCKET is set

Assistance in identifying other slow SST components and how to identify progress would be much appreciated.

Comment by Rick Pizzi [ 2018-03-21 ]

I believe the script wsrep_sst_xtrabackup-v2 (or the mariabackup equivalent) is just spawned via fork/exec; in that case, the parent process could run the loop you mentioned until the child process completes.

Comment by Rick Pizzi [ 2018-03-21 ]

Looks like this SST issue isn't new, just overlooked.
See item #8 of https://mariadb.com/kb/en/library/upgrading-from-mariadb-galera-cluster-100-to-mariadb-101/

Comment by Marko Mäkelä [ 2018-04-06 ]

Some cleanup, and testing

Comment by Marko Mäkelä [ 2018-04-06 ]

rpizzi, I added the timeout extension to InnoDB and XtraDB. Please file a separate ticket for the Galera snapshot transfer, maybe linked to MDEV-11035 or some other ticket on innodb_disallow_writes.

Comment by Marko Mäkelä [ 2018-04-08 ]

rpizzi, fraggeln in MDEV-11035 mentioned systemd timeout on startup, related to Galera rsync snapshot transfer failure. I wonder if this fix would address those problems, or if some further change is needed. (If yes, please file a separate ticket, as I suggested earlier.)

Comment by Rick Pizzi [ 2018-04-11 ]

"Please file a separate ticket for the Galera snapshot transfer"
already there.... https://jira.mariadb.org/browse/MDEV-15606

Comment by Marko Mäkelä [ 2020-03-25 ]

It looks like the Galera issues were ultimately fixed in MDEV-17571.

Generated at Thu Feb 08 08:15:38 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.