Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Fixed
-
2.2.13
-
None
-
MXS-SPRINT-65
Description
As described in the support ticket. In short, a slave whose IO thread is running but whose SQL thread is stopped is in limbo, and it causes the wrong master to be selected after a failover unless the new master has other slaves.
This is again an effect of the way the 2.2 monitor works. A slave that is still connected or trying to connect to the master (IO thread is on or connecting) but is not actually replicating (SQL thread is off) is counted as a slave of that node, even if the master node is down. During switchover/failover, servers with a broken slave SQL thread are not redirected, since they are not real slaves and cannot replicate from the new master anyway. This mismatch produces the odd result where the old master remains master even after failover. In 2.3 this doesn't happen because the monitor works differently.
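To make the mismatch concrete, here is a rough toy model in Python. It is not MaxScale code: the `Replica` class, the function names, and in particular the rule "the server with the most counted slaves is picked as master" are assumptions made only for illustration. It shows how counting slaves by IO thread state while redirecting by SQL thread state can leave the old master looking like the best candidate.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Replica:
    """Hypothetical, simplified replication state of one server."""
    name: str
    master: Optional[str] = None  # server this one points at, None if none
    io_thread: str = "No"         # "Yes", "Connecting" or "No"
    sql_thread: str = "No"        # "Yes" or "No"


def counted_slaves(servers: List[Replica]) -> Dict[str, int]:
    """Monitor-style count: IO thread on or connecting is enough to count."""
    counts = {s.name: 0 for s in servers}
    for s in servers:
        if s.master is not None and s.io_thread in ("Yes", "Connecting"):
            counts[s.master] += 1
    return counts


def failover(servers: List[Replica], old_master: str, new_master: str) -> None:
    """Failover-style redirect: only slaves with a running SQL thread move."""
    for s in servers:
        if s.name == new_master:
            # The promoted server stops replicating.
            s.master, s.io_thread, s.sql_thread = None, "No", "No"
        elif s.master == old_master and s.sql_thread == "Yes":
            s.master = new_master


def pick_master(servers: List[Replica]) -> str:
    """Assumed selection rule: the server with the most counted slaves wins."""
    counts = counted_slaves(servers)
    return max(counts, key=counts.get)


servers = [
    Replica("A"),                                      # old master, now down
    Replica("B", master="A", io_thread="Connecting",
            sql_thread="Yes"),                         # healthy slave, promoted
    Replica("C", master="A", io_thread="Connecting",
            sql_thread="No"),                          # slave in limbo
]
failover(servers, old_master="A", new_master="B")
print(counted_slaves(servers))  # {'A': 1, 'B': 0, 'C': 0}
print(pick_master(servers))     # A
```

In this model server C is skipped by the redirect but still counted, so A keeps one slave while the new master B has none.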
Fixing this in 2.2 requires choosing between changing the master selection code or the failover/switchover code. I will try with the latter, since changing the former could affect various other places as well.