  1. MariaDB Server
  2. MDEV-16425

New node in Galera can't fully sync - systemd timeout

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 10.1.31
    • Fix Version/s: N/A
    • Component/s: Galera, Galera SST
    • Labels:
      None
    • Environment:
      RHEL 7

      Description

      We have a three-node Galera cluster on MariaDB 10.1.31. The nodes run RHEL 7.

      We have about 10 databases hosted in that cluster - one of them is about 50GB.

      We lost a node due to an unrelated mishap. After cleaning up the lost node, we tried to rejoin it by restarting MariaDB with: systemctl restart mariadb
      The /var/lib/mysql directory on that node never grew beyond about 11GB, and the rsync processes never completed. journalctl showed that mariadb.service was being restarted about every 90 seconds.

      After some digging, I found that systemd has a default service start timeout of 90 seconds (at least on RHEL 7). mariadb.service remains in the 'Activating' state while the node syncs, and with this much data to transfer the sync could not finish before the service hit the timeout.
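
A rough back-of-the-envelope check makes the mismatch concrete. The sketch below is illustrative only: the helper name sst_seconds and the 100 MB/s sustained transfer rate are assumptions, not measurements from this cluster. Only the 50GB size and the 90-second timeout come from the report.

```shell
# Estimate how long a state transfer needs: data size / sustained rate.
sst_seconds() {
    # $1: data size in GB
    # $2: sustained transfer rate in MB/s (assumed, not measured)
    echo $(( $1 * 1024 / $2 ))
}

sst_seconds 50 100   # prints 512
```

Under those assumptions the largest database alone needs several minutes, well beyond the 90-second start timeout.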

      The way I fixed this was to edit this file:
      /usr/lib/systemd/system/mariadb.service

      And add these lines below the [Service] line:
      RestartSec=86400
      TimeoutSec=86400

      Then ran:
      systemctl daemon-reload
      systemctl restart mariadb

      After about 5 minutes, the node was fully synced and operational. I then removed these timeout settings.
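
A less invasive variant of the same workaround is a systemd drop-in, which leaves the packaged unit file untouched, so a package update should not overwrite the change. The sketch below is an assumption about how one might do it, not what was actually run here: the helper name write_timeout_dropin, the DROPIN_DIR variable, and the file name timeout.conf are made up for illustration. It sets only TimeoutStartSec, since TimeoutSec also changes the stop timeout and RestartSec only controls the delay before a restart.

```shell
# Directory systemd reads drop-ins for mariadb.service from.
# Overridable so the helper can be tried against a scratch directory.
DROPIN_DIR="${DROPIN_DIR:-/etc/systemd/system/mariadb.service.d}"

write_timeout_dropin() {
    # $1: start timeout in seconds (the report used 86400, i.e. 24 hours)
    mkdir -p "$DROPIN_DIR" || return 1
    printf '[Service]\nTimeoutStartSec=%s\n' "$1" > "$DROPIN_DIR/timeout.conf"
}

# Usage (as root):
#   write_timeout_dropin 86400
#   systemctl daemon-reload
#   systemctl restart mariadb
```

To undo it once the node has synced, delete timeout.conf and run systemctl daemon-reload again.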

      This raises a concern though - a default installation of Galera should not time out during the initial sync of medium-sized databases.

      I'm not sure what the best way to handle this is. I'm wary of making the increased timeout a permanent part of the mariadb.service file for all systemd users, because it would have negative outcomes if something were in fact wrong with the service.

      Maybe systemd has other states that could be used for the synchronization phase that a new Galera node goes through? Something we can set a higher timeout for?

      Thanks,
      Wayne

            People

            • Assignee:
              jplindst Jan Lindström
            • Reporter:
              wworkman@neces.com Wayne Workman