  1. MariaDB MaxScale
  2. MXS-5407

Dataloss due to maxscale failover of invalid replica

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 23.02.2
    • Fix Version/s: 23.02.7
    • Component/s: mariadbmon
    • Labels: None
    • Environment: Kubernetes SkySQL deployment, server version 10.6
    • Sprint: MXS-SPRINT-225, MXS-SPRINT-226

    Description

      We ran into a situation where a severely lagging slave was promoted to master, causing a full working day of data loss.

      From our investigation:

      MaxScale has this code that selects the best master:
      https://github.com/mariadb-corporation/MaxScale/blob/2e4a53baaa2b7099f5f2452f04116227527ebd03/server/modules/monitor/mariadbmon/cluster_discovery.cc#L908

      In our case there are two servers, and because one of them is broken, neither server has any slaves. pod1 is technically considered the master, but it has no slaves.

      The code eventually gets here:
      https://github.com/mariadb-corporation/MaxScale/blob/23.02.2/server/modules/monitor/mariadbmon/cluster_discovery.cc#L414
      https://github.com/mariadb-corporation/MaxScale/blob/23.02.2/server/modules/monitor/mariadbmon/cluster_discovery.cc#L298

      The last function returns 0 for every one of our servers, so they all tie. Whichever server is tested first therefore becomes best_reach. In our case that was pod0 (most likely because the names are sorted alphabetically).
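To illustrate the tie described above, here is a minimal sketch, not the actual MaxScale code. The `Server` struct, its `reach` field and the `pick_best_reach` function are hypothetical names made up for this example; the only assumption carried over from the report is that every server scores 0 and the first one tested wins.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a monitored server; not MaxScale's real type.
struct Server {
    std::string name;
    int reach;  // number of replicas reachable from this server (assumed metric)
};

// Returns the first server with the highest reach value.
// Because the comparison is a strict '>', a later server with an equal
// score never replaces the current best, so ties go to list order.
const Server& pick_best_reach(const std::vector<Server>& servers) {
    assert(!servers.empty());
    const Server* best = &servers.front();
    for (const Server& s : servers) {
        if (s.reach > best->reach) {
            best = &s;
        }
    }
    return *best;
}
```

With two servers named pod0 and pod1 that both score 0, this sketch picks pod0 purely because it comes first, which matches what we observed.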

      Later, when pod1 failed (we do not know why; possibly a crash or something else), pod0 was promoted to master even though it was not even a valid slave.

      The check in MaxScale for whether a server is a valid candidate considers any RUNNING server that is not read_only and not in maintenance to be valid:
      https://github.com/mariadb-corporation/MaxScale/blob/23.02.2/server/modules/monitor/mariadbmon/cluster_discovery.cc#L1031
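The validity check as we understand it can be sketched roughly as follows. This is a simplified illustration under our reading of the linked code, not a copy of it; `ServerState`, its fields and `is_candidate_valid` are hypothetical names. The `replicating` field is included only to show that replication state plays no part in the decision.

```cpp
// Hypothetical summary of what the monitor knows about one server.
struct ServerState {
    bool running;
    bool read_only;
    bool in_maintenance;
    bool replicating;  // deliberately not consulted by the check below
};

// Mirrors the condition described in this report: a server counts as a
// valid candidate if it is running, writable and not in maintenance.
bool is_candidate_valid(const ServerState& s) {
    return s.running && !s.read_only && !s.in_maintenance;
}
```

A server in pod0's state (running, writable, not in maintenance, replication broken) passes this check, which is how it could be promoted.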

          People

            Assignee: Esa Korhonen (esa.korhonen)
            Reporter: Bryan (bryan-skysql)