Details
Type: Bug
Status: Closed
Priority: Minor
Resolution: Duplicate
Affects Version: 23.02.2
Fix Version: None
Environment: kubernetes skysql deployment, server version 10.6
Sprint: MXS-SPRINT-225, MXS-SPRINT-226
Description
We ran into a situation where a severely lagging slave was promoted to master, causing a full working day of data loss.
From our investigation:
There is this code in MaxScale that checks for the best master:
https://github.com/mariadb-corporation/MaxScale/blob/2e4a53baaa2b7099f5f2452f04116227527ebd03/server/modules/monitor/mariadbmon/cluster_discovery.cc#L908
In our case we have 2 servers, and since one is broken, neither server has slaves. So: 2 servers, no slaves. Pod 1 is technically considered the master, but it has no slaves.
The code eventually gets here
https://github.com/mariadb-corporation/MaxScale/blob/23.02.2/server/modules/monitor/mariadbmon/cluster_discovery.cc#L414
https://github.com/mariadb-corporation/MaxScale/blob/23.02.2/server/modules/monitor/mariadbmon/cluster_discovery.cc#L298
The last one would return 0 for all of our servers, so they are all equal. This means whichever server is tested first ends up as best_reach. In our case that was pod0 (most likely because the pod names get sorted alphabetically).
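To illustrate the effect (this is a hypothetical sketch, not the actual mariadbmon code): when the reach metric is 0 for every server, a "keep the first strictly better candidate" loop makes iteration order the tie-breaker, so pod0 wins simply because it is compared first. The Server struct, reach field and pick_best() function below are made up for illustration.
{code:cpp}
#include <iostream>
#include <string>
#include <vector>

struct Server
{
    std::string name;
    int reach;   // e.g. number of reachable slaves; 0 for both pods in our case
};

const Server* pick_best(const std::vector<Server>& servers)
{
    const Server* best = nullptr;
    for (const auto& srv : servers)
    {
        // Only a strictly better reach replaces the current best, so on a
        // tie (0 vs 0) the server examined earlier keeps the slot.
        if (!best || srv.reach > best->reach)
        {
            best = &srv;
        }
    }
    return best;
}

int main()
{
    // Alphabetical order: pod0 is tested first and wins the 0-vs-0 tie.
    std::vector<Server> servers = {{"pod0", 0}, {"pod1", 0}};
    std::cout << "best: " << pick_best(servers)->name << "\n";   // prints "best: pod0"
}
{code}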
Now, when pod1 failed for some reason (we have no idea why; maybe a crash or something else), pod0 was promoted to master even though it was not even a valid slave.
The check in MaxScale for whether a server is a valid promotion candidate considers RUNNING servers that are not read_only and not in maintenance to be valid:
https://github.com/mariadb-corporation/MaxScale/blob/23.02.2/server/modules/monitor/mariadbmon/cluster_discovery.cc#L1031
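For reference, here is a minimal sketch of what such a candidate check amounts to (assumed struct, field and function names, not the real cluster_discovery.cc code): nothing about replication lag, or about whether the server is a working slave at all, enters the decision, so a broken or severely lagging pod0 still qualifies.
{code:cpp}
#include <iostream>

// Hypothetical server state; field names are illustrative, not MaxScale's.
struct ServerState
{
    bool running;
    bool read_only;
    bool in_maintenance;
    long seconds_behind_master;   // never consulted by the check below
};

// Sketch of a validity check that only looks at running / read_only / maintenance.
bool is_valid_promotion_candidate(const ServerState& s)
{
    // No condition on replication lag or on being a functioning slave,
    // so a server like pod0 in our case still passes.
    return s.running && !s.read_only && !s.in_maintenance;
}

int main()
{
    // pod0: running, writable, not in maintenance, but roughly a day behind.
    ServerState pod0{true, false, false, 86400};
    std::cout << "pod0 valid candidate: " << std::boolalpha
              << is_valid_promotion_candidate(pod0) << "\n";   // prints true
}
{code}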