  1. MariaDB Server
  2. MDEV-17934

Make systemd timeout behavior more compatible with longer Galera recovery times

Details

    Description

      When Galera is enabled, MariaDB's systemd service executes the "galera_recovery" script as an ExecStartPre operation. See the following:

      https://github.com/MariaDB/server/blob/ce8716a1ed786ff971b5e15c88385d50b649ec7f/support-files/mariadb.service.in#L71

      The MariaDB systemd service has a default TimeoutStartSec value of 90 seconds, so startup fails if this ExecStartPre step takes longer than that. For example, see the following failure from a syslog:

      Sep 13 15:48:28 server1 systemd[1]: Starting MariaDB 10.2.16 database server...
      Sep 13 15:49:58 server1 systemd[1]: mariadb.service: Start-pre operation timed out. Terminating.
      Sep 13 15:49:58 server1 systemd[1]: Failed to start MariaDB 10.2.16 database server.
      Sep 13 15:49:58 server1 systemd[1]: mariadb.service: Unit entered failed state.
      Sep 13 15:49:58 server1 systemd[1]: mariadb.service: Failed with result 'timeout'.
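
As a quick check of the excerpt above, the gap between the "Starting" line and the "Start-pre operation timed out" line can be computed directly. This is only an illustrative sketch: the helper names are hypothetical, and the year is a placeholder because syslog timestamps omit it.

```python
from datetime import datetime

# Two lines copied verbatim from the syslog excerpt above.
LOG_LINES = [
    "Sep 13 15:48:28 server1 systemd[1]: Starting MariaDB 10.2.16 database server...",
    "Sep 13 15:49:58 server1 systemd[1]: mariadb.service: Start-pre operation timed out. Terminating.",
]

# Assumption: syslog stamps carry no year, so an arbitrary one is supplied.
PLACEHOLDER_YEAR = 2000


def syslog_time(line: str) -> datetime:
    """Parse the leading 'Mon DD HH:MM:SS' stamp of a syslog line."""
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{PLACEHOLDER_YEAR} {stamp}", "%Y %b %d %H:%M:%S")


def elapsed_seconds(start_line: str, end_line: str) -> float:
    """Return the number of seconds between two syslog lines."""
    return (syslog_time(end_line) - syslog_time(start_line)).total_seconds()


print(elapsed_seconds(*LOG_LINES))  # prints 90.0
```

The result matches the 90-second TimeoutStartSec value mentioned above, which suggests that the timeout was not extended at any point during this start attempt.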
      

      galera_recovery has to perform a server startup, so this step can take a while, especially if the server previously crashed and has to perform crash recovery. However, it looks like systemd timeouts should have been extended during server startup as part of MDEV-14705. Despite that, server versions with the fix for MDEV-14705 still see timeouts during ExecStartPre. Is it likely that important long-running startup functions were missed?
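
For context, here is a minimal sketch of how a process can ask systemd for more start-up time. It assumes that the extension mentioned for MDEV-14705 uses systemd's notification protocol with the EXTEND_TIMEOUT_USEC field; this report does not say so explicitly. The function names are hypothetical and this is not the server's actual code. Whether systemd honours such messages while the unit is still in its ExecStartPre phase is the question this report leaves open.

```python
import os
import socket


def extend_timeout_message(extra_seconds: int) -> bytes:
    """Build the datagram that asks systemd for extra_seconds more time."""
    return f"EXTEND_TIMEOUT_USEC={extra_seconds * 1_000_000}".encode("ascii")


def notify_systemd(message: bytes) -> bool:
    """Send message to $NOTIFY_SOCKET.

    Returns False when the variable is unset, i.e. the process is not
    running under a service manager that accepts notifications.
    """
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        # A leading '@' denotes a Linux abstract-namespace socket.
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message, path)
    return True


if __name__ == "__main__":
    # Hypothetical use: request 30 more seconds before a long recovery step.
    sent = notify_systemd(extend_timeout_message(30))
    print("notification sent" if sent else "NOTIFY_SOCKET not set, nothing sent")
```

A long-running step would have to repeat such a message before each extension expires; if no process in the start-pre phase sends one, or systemd does not accept it from that process, the original 90-second limit still applies.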

      See also MDEV-17571 as another case where systemd timeout extensions didn't seem to work as intended.

          Activity

            GeoffMontee Geoff Montee (Inactive) created issue -
            Field Original Value New Value
            elenst Elena Stepanova made changes -
            Fix Version/s 10.1 [ 16100 ]
            Fix Version/s 10.2 [ 14601 ]
            Fix Version/s 10.3 [ 22126 ]
            Fix Version/s 10.4 [ 22408 ]
            Assignee Rasmus Johansson [ ratzpo ]
            GeoffMontee Geoff Montee (Inactive) made changes -
            Description Typo corrected in the final question ("mixed" changed to "missed"); the text is otherwise identical to the Description above.
            ratzpo Rasmus Johansson (Inactive) made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            ratzpo Rasmus Johansson (Inactive) made changes -
            Status In Progress [ 3 ] Stalled [ 10000 ]
            ratzpo Rasmus Johansson (Inactive) made changes -
            Assignee Rasmus Johansson [ ratzpo ] Jan Lindström [ jplindst ]
            jplindst Jan Lindström (Inactive) made changes -
            Assignee Jan Lindström [ jplindst ] Rasmus Johansson [ ratzpo ]
            axel Axel Schwenke made changes -
            Fix Version/s N/A [ 14700 ]
            Fix Version/s 10.2 [ 14601 ]
            Fix Version/s 10.1 [ 16100 ]
            Fix Version/s 10.3 [ 22126 ]
            Fix Version/s 10.4 [ 22408 ]
            Resolution Duplicate [ 3 ]
            Status Stalled [ 10000 ] Closed [ 6 ]
            serg Sergei Golubchik made changes -
            Workflow MariaDB v3 [ 91110 ] MariaDB v4 [ 155321 ]
            mariadb-jira-automation Jira Automation (IT) made changes -
            Zendesk Related Tickets 127027 156845 114054

            People

              ratzpo Rasmus Johansson (Inactive)
              GeoffMontee Geoff Montee (Inactive)
              Votes: 0
              Watchers: 4
