[MXS-2036] A slave with sql thread stopped causes wrong master after failover Created: 2018-09-04  Updated: 2020-08-25  Resolved: 2018-09-11

Status: Closed
Project: MariaDB MaxScale
Component/s: mariadbmon
Affects Version/s: 2.2.13
Fix Version/s: 2.2.14

Type: Bug Priority: Major
Reporter: Esa Korhonen Assignee: Esa Korhonen
Resolution: Fixed Votes: 0
Labels: None

Sprint: MXS-SPRINT-65

Description

This issue is described in the support ticket. In short, a slave whose IO thread is running but whose SQL thread is stopped is in limbo, and it causes the wrong master to be selected after a failover unless the new master has other slaves.

This is another consequence of how the 2.2 monitor works. A slave that is still connected or trying to connect to its master (IO thread on or connecting) but is not actually replicating (SQL thread off) is counted as a slave of that node, even if the master node is down. During switchover and failover, however, servers with a stopped slave SQL thread are not redirected, since they are not real slaves and could not replicate from the new master anyway. This mismatch produces the odd result that the old master remains master even after failover. This does not happen in 2.3, because its monitor works differently.
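The mismatch between the two rules can be reproduced with a small toy model. This is a hypothetical Python sketch, not MaxScale code: `Server`, `counts_as_slave`, `is_redirectable`, `failover`, `slave_count` and `select_master` are invented names. Master selection is simplified to "the running, non-replicating server with the most counted slaves wins", and the scenario assumes the old master becomes reachable again after the failover, a step the ticket does not spell out.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical toy model of the 2.2 behaviour described above; not MaxScale code.


@dataclass
class Server:
    name: str
    running: bool = True
    master: Optional[str] = None  # server this node replicates from, if any
    io_state: str = "No"          # slave IO thread: "Yes", "Connecting" or "No"
    sql_running: bool = False     # slave SQL thread on/off


def counts_as_slave(s: Server) -> bool:
    # Monitor rule: an IO thread that is on or connecting is enough,
    # even if the SQL thread is stopped (the "limbo" slave).
    return s.master is not None and s.io_state in ("Yes", "Connecting")


def is_redirectable(s: Server) -> bool:
    # Failover/switchover rule: only slaves whose SQL thread runs are moved.
    return counts_as_slave(s) and s.sql_running


def failover(servers: Dict[str, Server], new_master: str) -> None:
    promoted = servers[new_master]
    promoted.master, promoted.io_state, promoted.sql_running = None, "No", False
    for s in servers.values():
        if s.name != new_master and is_redirectable(s):
            s.master = new_master  # limbo slaves keep pointing at the old master


def slave_count(servers: Dict[str, Server], name: str) -> int:
    return sum(1 for s in servers.values() if s.master == name and counts_as_slave(s))


def select_master(servers: Dict[str, Server]) -> Optional[str]:
    # Hypothetical simplification: the running, non-replicating server with the
    # most counted slaves becomes master (the first one wins on a tie).
    candidates = [s.name for s in servers.values() if s.running and s.master is None]
    return max(candidates, key=lambda n: slave_count(servers, n), default=None)


servers = {
    "A": Server("A"),                                                 # old master
    "B": Server("B", master="A", io_state="Yes", sql_running=True),   # healthy slave
    "C": Server("C", master="A", io_state="Yes", sql_running=False),  # limbo slave
}
servers["A"].running = False          # master goes down
servers["C"].io_state = "Connecting"  # limbo slave keeps retrying the old master
failover(servers, "B")                # C is skipped: its SQL thread is stopped
servers["A"].running = True           # assumption: old master becomes reachable again
print(select_master(servers))         # prints A: C still counts as A's slave, B has none
```

If B had another healthy slave, that slave would be redirected and counted under B, which matches the ticket's note that the problem appears only when the new master has no other slaves.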

Fixing this in 2.2 requires changing either the master selection code or the failover/switchover code. I will try the latter, since changing the former could also affect various other places.
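A failover-side fix changes which servers get redirected rather than how the master is chosen. The sketch below is hypothetical: `Replica`, `redirect_targets` and the `include_limbo` switch are invented for illustration, and the ticket does not state exactly how the change was implemented. It only contrasts the rule described above (move only replicas with a running SQL thread) with one possible variant that also moves replicas whose IO thread is on or connecting.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    name: str
    master: Optional[str]
    io_state: str      # slave IO thread: "Yes", "Connecting" or "No"
    sql_running: bool  # slave SQL thread on/off


def redirect_targets(replicas: List[Replica], old_master: str, include_limbo: bool) -> List[str]:
    """Names of replicas (excluding the promoted server) a failover would repoint.

    include_limbo=False mirrors the behaviour described in the ticket.
    include_limbo=True is a hypothetical variant that also repoints replicas
    whose IO thread is on or connecting but whose SQL thread is stopped.
    """
    targets = []
    for r in replicas:
        # Only replicas the monitor currently counts under the old master matter.
        if r.master != old_master or r.io_state not in ("Yes", "Connecting"):
            continue
        if r.sql_running or include_limbo:
            targets.append(r.name)
    return targets


replicas = [
    Replica("C", "A", "Connecting", False),  # limbo slave
    Replica("D", "A", "Yes", True),          # healthy slave
]
print(redirect_targets(replicas, "A", include_limbo=False))  # ['D']
print(redirect_targets(replicas, "A", include_limbo=True))   # ['C', 'D']
```

Under the variant, C still would not apply events after the redirect, since its SQL thread stays stopped, but it would be counted as a slave of the new master instead of making the old master look like it still has a slave.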


Generated at Thu Feb 08 04:11:13 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.