Details
Type: Bug
Status: Closed
Priority: Minor
Resolution: Duplicate
Affects Version: 23.02.2
Fix Version: None
Environment: kubernetes skysql deployment, server version 10.6
Sprint: MXS-SPRINT-225, MXS-SPRINT-226
Description
We ran into a situation where a severely lagging slave was promoted to master, causing a full working day of data loss.
From our investigation:
There is this code in MaxScale that checks for the best master:
https://github.com/mariadb-corporation/MaxScale/blob/2e4a53baaa2b7099f5f2452f04116227527ebd03/server/modules/monitor/mariadbmon/cluster_discovery.cc#L908
In our case we have 2 servers, and since one is broken, neither server has slaves. So: 2 servers, no slaves. Pod 1 is technically considered the master, but it has no slaves.
The code eventually gets here
https://github.com/mariadb-corporation/MaxScale/blob/23.02.2/server/modules/monitor/mariadbmon/cluster_discovery.cc#L414
https://github.com/mariadb-corporation/MaxScale/blob/23.02.2/server/modules/monitor/mariadbmon/cluster_discovery.cc#L298
The last one would return 0 for all of our servers, so they are all equal. This means whichever server is tested first ends up as best_reach. In our case that was pod0 (most likely because the pod names get sorted alphabetically).
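To illustrate the effect (this is a hypothetical sketch, not the actual mariadbmon code): when the reach metric is 0 for every server, a "keep the first strictly better candidate" loop makes iteration order the tie-breaker, so pod0 wins simply because it is compared first. The Server struct, reach field and pick_best() function below are made up for illustration.
{code:cpp}
#include <iostream>
#include <string>
#include <vector>

struct Server
{
    std::string name;
    int reach;   // e.g. number of reachable slaves; 0 for both pods in our case
};

const Server* pick_best(const std::vector<Server>& servers)
{
    const Server* best = nullptr;
    for (const auto& srv : servers)
    {
        // Only a strictly better reach replaces the current best, so on a
        // tie (0 vs 0) the server examined earlier keeps the slot.
        if (!best || srv.reach > best->reach)
        {
            best = &srv;
        }
    }
    return best;
}

int main()
{
    // Alphabetical order: pod0 is tested first and wins the 0-vs-0 tie.
    std::vector<Server> servers = {{"pod0", 0}, {"pod1", 0}};
    std::cout << "best: " << pick_best(servers)->name << "\n";   // prints "best: pod0"
}
{code}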
Now, when pod1 failed for some reason (we have no idea why; maybe a crash or something else), pod0 was promoted to master even though it was not even a valid slave.
The check in MaxScale for whether a server is a valid promotion candidate considers RUNNING servers that are not read_only and not in maintenance to be valid:
https://github.com/mariadb-corporation/MaxScale/blob/23.02.2/server/modules/monitor/mariadbmon/cluster_discovery.cc#L1031
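For reference, here is a minimal sketch of what such a candidate check amounts to (assumed struct, field and function names, not the real cluster_discovery.cc code): nothing about replication lag, or about whether the server is a working slave at all, enters the decision, so a broken or severely lagging pod0 still qualifies.
{code:cpp}
#include <iostream>

// Hypothetical server state; field names are illustrative, not MaxScale's.
struct ServerState
{
    bool running;
    bool read_only;
    bool in_maintenance;
    long seconds_behind_master;   // never consulted by the check below
};

// Sketch of a validity check that only looks at running / read_only / maintenance.
bool is_valid_promotion_candidate(const ServerState& s)
{
    // No condition on replication lag or on being a functioning slave,
    // so a server like pod0 in our case still passes.
    return s.running && !s.read_only && !s.in_maintenance;
}

int main()
{
    // pod0: running, writable, not in maintenance, but roughly a day behind.
    ServerState pod0{true, false, false, 86400};
    std::cout << "pod0 valid candidate: " << std::boolalpha
              << is_valid_promotion_candidate(pod0) << "\n";   // prints true
}
{code}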