Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-14705

systemd: EXTEND_TIMEOUT_USEC= to avoid startup and shutdown timeouts

Details

    • 10.3.6-1

    Description

      MDEV-9202, MDEV-8509 shows cases where the systemd timeout isn't sufficient to preform initialization/shut-down.

      Since https://github.com/systemd/systemd/commit/a327431bd168b2f327f3cd422379e213c643f2a5 released in system v236 Type=notify service can now advice to the systemd service manager they are still working to avoid the service timing out.

      The use of EXTEND_TIMEOUT_USEC= on older services has no effect and is therefore compatible.

      This needs to be included in (feel free to correct/extend):

      • buffer pool dump - buf_dump (storage/innobase/buf/buf0dump.cc)
      • redo log recovery - log_group_read_log_seg (storage/innobase/log/log0log.cc)
      • undo recovery - recv_recover_page_func (storage/innobase/log/log0recv.cc)
      • change buffer?
      • merge buffer?

      I was planning on making the 15 seconds of recv_sys_t->report() more general with respect to interval, and use a define INNODB_REPORT_INTERVAL (include/univ.i?) as the basis for this form of watchdog. I'd send notify messages of INNODB_REPORT_INTERVAL * 2 as an acceptable margin.

      Anywhere else or other suggestions marko, jplindst?

      • galera SST scripts - donor and recipient

      Any other server/engine slow points to account for?

      Target 10.3 and then look at a backport?

      Attachments

        Issue Links

          Activity

            danblack Daniel Black created issue -
            danblack Daniel Black made changes -
            Field Original Value New Value
            danblack Daniel Black made changes -
            Description MDEV-9202, MDEV-8509 shows cases where the systemd timeout isn't sufficient to preform initialization/shut-down.

            Since https://github.com/systemd/systemd/commit/a327431bd168b2f327f3cd422379e213c643f2a5 released in system v236 Type=notify service can now advice to the systemd service manager they are still working to avoid the service timing out.

            The use of EXTEND_TIMEOUT_USEC= on older services has no effect and is therefore compatible.

            This needs to be included in (feel free to correct/extend):
            * buffer pool dump - buf_dump (storage/innobase/buf/buf0dump.cc)
            * redo log recovery - log_group_read_log_seg (storage/innobase/log/log0log.cc)
            * undo(?) recovery - recv_recover_page_func (storage/innobase/log/log0recv.cc)
            * change buffer?
            * merge buffer?

            I was planning on making the 15 seconds of recv_sys_t->report() more general with respect to interval, and use a define INNODB_REPORT_INTERVAL (include/univ.i?) as the basis for this form of watchdog. I'd send notify messages of INNODB_REPORT_INTERVAL * 2 as an acceptable margin.

            Anywhere else or other suggestions [~marko], [~jplindst]?

            Any other server/engine slow points to account for?

            Target 10.3 and then look at a backport?
            MDEV-9202, MDEV-8509 shows cases where the systemd timeout isn't sufficient to preform initialization/shut-down.

            Since https://github.com/systemd/systemd/commit/a327431bd168b2f327f3cd422379e213c643f2a5 released in system v236 Type=notify service can now advice to the systemd service manager they are still working to avoid the service timing out.

            The use of EXTEND_TIMEOUT_USEC= on older services has no effect and is therefore compatible.

            This needs to be included in (feel free to correct/extend):
            * buffer pool dump - buf_dump (storage/innobase/buf/buf0dump.cc)
            * redo log recovery - log_group_read_log_seg (storage/innobase/log/log0log.cc)
            * undo(?) recovery - recv_recover_page_func (storage/innobase/log/log0recv.cc)
            * change buffer?
            * merge buffer?

            I was planning on making the 15 seconds of recv_sys_t->report() more general with respect to interval, and use a define INNODB_REPORT_INTERVAL (include/univ.i?) as the basis for this form of watchdog. I'd send notify messages of INNODB_REPORT_INTERVAL * 2 as an acceptable margin.

            Anywhere else or other suggestions [~marko], [~jplindst]?
            * galera SST scripts - donor and recipient

            Any other server/engine slow points to account for?

            Target 10.3 and then look at a backport?
            danblack Daniel Black made changes -
            Fix Version/s 10.2 [ 14601 ]
            Fix Version/s 10.1 [ 16100 ]
            danblack Daniel Black made changes -
            Labels contribution foundation patch
            svoj Sergey Vojtovich made changes -
            Assignee Marko Mäkelä [ marko ]
            svoj Sergey Vojtovich made changes -
            Priority Major [ 3 ] Critical [ 2 ]

            Overdue PR.

            svoj Sergey Vojtovich added a comment - Overdue PR.
            marko Marko Mäkelä made changes -

            Startup and shutdown are quite different. I would consider them separately.

            Startup

            For startup, the biggest time consumers should be redo log apply and the buffer pool load. For the redo log apply, we cannot know in advance how much work there is to be done. MDEV-11027 introduced progress reporting for this, including a call to sd_notifyf().

            Undo recovery (rolling back incomplete transactions) takes place in a background thread. Starting with MDEV-12323, we do report progress for that as well via sd_notifyf(). If the binlog is used, then the tc_heuristic_recover option can come into play, internally initiating XA COMMIT or XA ROLLBACK operations in some thread. I do not know if that would block user connections from starting at all. Commit is fast in InnoDB, while the rollback speed is proportional to the number of modified rows and the number of indexes.

            Shutdown

            In shutdown, there have been multiple hangs. Most recently, I fixed MDEV-15554.

            There are different innodb_fast_shutdown modes. By default (innodb_fast_shutdown=1, a fast shutdown is done: all dirty pages are flushed to the buffer pool and a log checkpoint is created.

            Some progress reporting for the flush phase would be nice. We know exactly how many dirty pages there are, so the progress measure would be linear.

            There is also the crash-like innodb_fast_shutdown=2 which is supposed to complete instantly.

            The slowest shutdown is innodb_fast_shutdown=0. It will:

            1. Wait for all transactions to complete (starting with MDEV-12352, normal shutdown would abort the rollback of recovered transactions)
            2. Purge the history of all completed transactions (well, except when it fails to do so due to MDEV-11802)
            3. Empty the change buffer (formerly known as the insert buffer) by merging changes to secondary index leaf pages.

            Only after these tasks, it becomes feasible to flush the dirty pages into the data files and to create the final redo log checkpoint.

            marko Marko Mäkelä added a comment - Startup and shutdown are quite different. I would consider them separately. Startup For startup, the biggest time consumers should be redo log apply and the buffer pool load. For the redo log apply, we cannot know in advance how much work there is to be done. MDEV-11027 introduced progress reporting for this, including a call to sd_notifyf() . Undo recovery (rolling back incomplete transactions) takes place in a background thread. Starting with MDEV-12323 , we do report progress for that as well via sd_notifyf() . If the binlog is used, then the tc_heuristic_recover option can come into play, internally initiating XA COMMIT or XA ROLLBACK operations in some thread. I do not know if that would block user connections from starting at all. Commit is fast in InnoDB, while the rollback speed is proportional to the number of modified rows and the number of indexes. Shutdown In shutdown, there have been multiple hangs. Most recently, I fixed MDEV-15554 . There are different innodb_fast_shutdown modes. By default ( innodb_fast_shutdown=1 , a fast shutdown is done: all dirty pages are flushed to the buffer pool and a log checkpoint is created. Some progress reporting for the flush phase would be nice. We know exactly how many dirty pages there are, so the progress measure would be linear. There is also the crash-like innodb_fast_shutdown=2 which is supposed to complete instantly. The slowest shutdown is innodb_fast_shutdown=0 . It will: Wait for all transactions to complete (starting with MDEV-12352 , normal shutdown would abort the rollback of recovered transactions) Purge the history of all completed transactions (well, except when it fails to do so due to MDEV-11802 ) Empty the change buffer (formerly known as the insert buffer) by merging changes to secondary index leaf pages. Only after these tasks, it becomes feasible to flush the dirty pages into the data files and to create the final redo log checkpoint.
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            Sprint 10.1.32 [ 235 ]
            marko Marko Mäkelä made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            danblack Daniel Black added a comment -

            quick note - for the purpose of systemd timing startup start from process creation and ends just before accepting the first connection.

            On startup I haven't included the buffer pool load because as I understand it it continues in the background while the server is processing queries.

            I have included the buffer pool save on shutdown too.

            Exact time need not be known. Only that a loop iteration can progress within a estimated maximium time.

            danblack Daniel Black added a comment - quick note - for the purpose of systemd timing startup start from process creation and ends just before accepting the first connection. startup time ends: https://github.com/MariaDB/server/blob/10.3/sql/mysqld.cc#L6680 The end shutdown time starts: https://github.com/MariaDB/server/blob/10.3/sql/mysqld.cc#L6882 until the final exit. On startup I haven't included the buffer pool load because as I understand it it continues in the background while the server is processing queries. I have included the buffer pool save on shutdown too. Exact time need not be known. Only that a loop iteration can progress within a estimated maximium time.

            @marko the startup has to consider Galera SST time - can be hours on larger datasets

            rpizzi Rick Pizzi (Inactive) added a comment - @marko the startup has to consider Galera SST time - can be hours on larger datasets
            marko Marko Mäkelä made changes -
            Sprint 10.1.32 [ 235 ] 10.3.6 [ 237 ]
            marko Marko Mäkelä made changes -
            danblack Daniel Black added a comment - - edited

            rpizzi, SST scripts will be able to communicate just as easily as the server. The tricky bit is identifying the lower bits of the SST and enacting on it. Like for rsync I was thinking of enabling a progress monitoring and getting a subfunction that takes any output on stdout and issues EXTEND_TIMEOUT=... at that point.

            something like:

            if systemd-notify --booted && [ -n "${NOTIFY_SOCKET}" ]
            then
              rsync_output_extendtimeout()
              {
                # some rate limiting / variable timeout progress
                while read input
               do
                   systemd-notify EXTEND_TIMEMOUT=100000 
               done
              }
            else
              rsync_output_extendtimeout() { [ 1 ]; }  > /dev/null
            fi
            ...
                 rsync --progress   |  rsync_output_extendtimeout
            ...
            

            Edit: $NOTIFY_SOCKET is a unix socket or an abstract socket. Probably best just to use systemd-notify and check if $NOTIFY_SOCKET is set

            Assistance in identifying other slow SST components and how to identify progress would be much appreciated.

            danblack Daniel Black added a comment - - edited rpizzi , SST scripts will be able to communicate just as easily as the server. The tricky bit is identifying the lower bits of the SST and enacting on it. Like for rsync I was thinking of enabling a progress monitoring and getting a subfunction that takes any output on stdout and issues EXTEND_TIMEOUT=... at that point. something like: if systemd-notify --booted && [ -n "${NOTIFY_SOCKET}" ] then rsync_output_extendtimeout() { # some rate limiting / variable timeout progress while read input do systemd-notify EXTEND_TIMEMOUT=100000 done } else rsync_output_extendtimeout() { [ 1 ]; } > /dev/null fi ... rsync --progress | rsync_output_extendtimeout ... Edit: $NOTIFY_SOCKET is a unix socket or an abstract socket. Probably best just to use systemd-notify and check if $NOTIFY_SOCKET is set Assistance in identifying other slow SST components and how to identify progress would be much appreciated.

            I believe the script wsrep_sst_xtrabackup-v2 (or mariabackup equivalent) is just spawned via fork/exec, in this case in the parent process, run the loop you have mentioned until the child process completes.

            rpizzi Rick Pizzi (Inactive) added a comment - I believe the script wsrep_sst_xtrabackup-v2 (or mariabackup equivalent) is just spawned via fork/exec, in this case in the parent process, run the loop you have mentioned until the child process completes.

            Looks like this SST issue isn't new, just overlooked.
            See item #8 of https://mariadb.com/kb/en/library/upgrading-from-mariadb-galera-cluster-100-to-mariadb-101/

            rpizzi Rick Pizzi (Inactive) added a comment - Looks like this SST issue isn't new, just overlooked. See item #8 of https://mariadb.com/kb/en/library/upgrading-from-mariadb-galera-cluster-100-to-mariadb-101/
            danblack Daniel Black made changes -

            Some cleanup, and testing

            marko Marko Mäkelä added a comment - Some cleanup, and testing
            marko Marko Mäkelä made changes -
            issue.field.resolutiondate 2018-04-06 07:27:36.0 2018-04-06 07:27:36.08
            marko Marko Mäkelä made changes -
            Fix Version/s 10.1.33 [ 22909 ]
            Fix Version/s 10.2.15 [ 23006 ]
            Fix Version/s 10.3.6 [ 23003 ]
            Fix Version/s 10.2 [ 14601 ]
            Fix Version/s 10.1 [ 16100 ]
            Fix Version/s 10.3 [ 22126 ]
            Resolution Fixed [ 1 ]
            Status In Progress [ 3 ] Closed [ 6 ]

            rpizzi, I added the timeout extension to InnoDB and XtraDB. Please file a separate ticket for the Galera snapshot transfer, maybe linked to MDEV-11035 or some other ticket on innodb_disallow_writes.

            marko Marko Mäkelä added a comment - rpizzi , I added the timeout extension to InnoDB and XtraDB. Please file a separate ticket for the Galera snapshot transfer, maybe linked to MDEV-11035 or some other ticket on innodb_disallow_writes .
            marko Marko Mäkelä made changes -

            rpizzi, fraggeln in MDEV-11035 mentioned systemd timeout on startup, related to Galera rsync snapshot transfer failure. I wonder if this fix would address those problems, or if some further change is needed. (If yes, please file a separate ticket, as I suggested earlier.)

            marko Marko Mäkelä added a comment - rpizzi , fraggeln in MDEV-11035 mentioned systemd timeout on startup, related to Galera rsync snapshot transfer failure. I wonder if this fix would address those problems, or if some further change is needed. (If yes, please file a separate ticket, as I suggested earlier.)
            marko Marko Mäkelä made changes -

            "Please file a separate ticket for the Galera snapshot transfer"
            already there.... https://jira.mariadb.org/browse/MDEV-15606

            rpizzi Rick Pizzi (Inactive) added a comment - "Please file a separate ticket for the Galera snapshot transfer" already there.... https://jira.mariadb.org/browse/MDEV-15606
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            GeoffMontee Geoff Montee (Inactive) made changes -
            GeoffMontee Geoff Montee (Inactive) made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -

            It looks like the Galera issues were ultimately fixed in MDEV-17571.

            marko Marko Mäkelä added a comment - It looks like the Galera issues were ultimately fixed in MDEV-17571 .
            serg Sergei Golubchik made changes -
            Workflow MariaDB v3 [ 84505 ] MariaDB v4 [ 133422 ]

            People

              marko Marko Mäkelä
              danblack Daniel Black
              Votes:
              1 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.