MariaDB Server / MDEV-17527

Galera cluster "slowly deteriorating" (statistics?) after latest upgrade to Ubuntu 16.04 (only patches) / MariaDB 10.2.18 (from 10.2.9)



    Description

      I have a 3-node Galera cluster: DB-4, DB-5 and DB-6.
      It worked fine up until the recent upgrade on Oct 05 (apt update / apt upgrade, which all went through with no errors).
      I rebooted the servers and checked that the cluster formed again with 3 nodes. All fine.
      A few days later, users started reporting that their sessions hung.
      We searched high and low for clues and ended up shutting down all but the only node still able to respond to a certain query (a select from 3-4 tables). That query still returns in 0.00 seconds on DB-4.
      So, only DB-4 is running now.
      Then, when we wanted to restart DB-5/DB-6, they failed miserably (during SST, no matter whether rsync or the newer mariabackup method I recently installed), and nobody could find the reason. We checked the network, AppArmor and all kinds of other possibilities. We found out yesterday that it is because the systemd timeout was set to 1m30s, despite me setting TimeoutSec=0 in galera.cnf. Whichever it is, I have now set it to "infinity" and the remaining two nodes were able to join the cluster again using SST (mariabackup).
      (I think it's because all the recent smart changes to TimeoutSec for mariadb/galera doing SST have NOT gone into the systemd version 229 we still use on this 16.04??)
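      For reference, a minimal sketch of the override I mean, assuming the standard mariadb.service unit name and the usual systemd drop-in mechanism (the exact path and unit name on our 16.04 packages are an assumption on my part):

        # /etc/systemd/system/mariadb.service.d/timeout.conf
        # Give SST as much time as it needs; systemd would otherwise kill the
        # joining node after the default 1m30s start timeout.
        [Service]
        TimeoutStartSec=infinity

        # reload systemd so the drop-in takes effect, then start the joiner
        sudo systemctl daemon-reload
        sudo systemctl restart mariadb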

      Yesterday, right after the SST finished successfully, I ran the troubled query just as I did after the upgrade. All three nodes responded immediately, like DB-4 used to, so after running it multiple times over some time I decided to head home with a smile on my face.

      When I got in this morning, I saw that DB-5 and DB-6 are now back to the spot where they cannot answer the query. At first they took so long that I hit Ctrl+C on it.
      Now, after some hours, they CAN respond, but instead of 0.00 seconds they take ~1.3 seconds on it.

      I can see that the EXPLAIN for that same query is identical on the two slow nodes, but different on the working DB-4 node.
      Over time, the EXPLAIN plan seems to have changed on DB-5/DB-6 so that it is now "almost" the same as on DB-4, but with the tables in another order!?
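      For concreteness, this is the kind of per-node comparison I can run and post here; 'db1' and 't1' are placeholder names, not our real schema:

        -- Run on each node and diff the output: the InnoDB persistent statistics
        -- the optimizer bases its plan on, per table and per index.
        SELECT database_name, table_name, n_rows, clustered_index_size, last_update
          FROM mysql.innodb_table_stats
         WHERE database_name = 'db1' AND table_name = 't1';

        SELECT index_name, stat_name, stat_value, sample_size, last_update
          FROM mysql.innodb_index_stats
         WHERE database_name = 'db1' AND table_name = 't1';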

      Now my questions, in all my bewilderment, are:

      Can anyone suggest/clarify what I might have done to my cluster during the upgrade (upgrading Ubuntu 16.04 with the newest patches, and MariaDB from 10.2.9 to 10.2.18)?
      I would say it went OK, but we did have some "you need to run mysql_upgrade" messages in our syslog. As I recall, they were related to some system table having a column that was "too short".
      mysql_upgrade was then run on all three servers with no problems.

      Is there anything we need to do to make sure the statistics are the same, as good on DB-5/DB-6 as on DB-4, so the poor optimizer makes the same decisions on all three servers?
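      A minimal sketch of what I had in mind, with 't1', 't2', 't3' as placeholders for the tables in the slow query (I am not sure whether ANALYZE is replicated by Galera, so I would run it on every node):

        -- Refresh the InnoDB persistent statistics for the tables involved.
        ANALYZE TABLE t1, t2, t3;

        -- MariaDB can also collect engine-independent statistics:
        ANALYZE TABLE t1 PERSISTENT FOR ALL;

        -- Check how statistics are sampled/updated on each node:
        SHOW GLOBAL VARIABLES LIKE 'innodb_stats%';
        SHOW GLOBAL VARIABLES LIKE 'use_stat_tables';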

      Can anyone spot where I dropped the ball, please?

      Anything useful to check, to see if something inside Galera is actually not working as it should? The log files (syslog) indicate nothing even remotely resembling a problem.
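      In case it helps, these are the kinds of Galera/wsrep status checks I can run on each node and attach here if needed:

        -- Cluster membership and node state
        SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';
        SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';
        SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';
        SHOW GLOBAL STATUS LIKE 'wsrep_ready';

        -- Replication health: queues and flow control
        SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue%';
        SHOW GLOBAL STATUS LIKE 'wsrep_local_send_queue%';
        SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_paused';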


          People

            Assignee: Jan Lindström (jplindst) (Inactive)
            Reporter: brianr (brianryberg)

