[MXS-2010] MaxScale Failover Not Working as Expected Created: 2018-08-13  Updated: 2020-08-25  Resolved: 2018-08-30

Status: Closed
Project: MariaDB MaxScale
Component/s: mariadbmon
Affects Version/s: 2.2.13
Fix Version/s: 2.3.0

Type: Bug Priority: Major
Reporter: Chris Calender (Inactive) Assignee: Esa Korhonen
Resolution: Fixed Votes: 0
Labels: None

Sprint: MXS-SPRINT-64, MXS-SPRINT-65

 Description   

We are testing MaxScale 2.2.13 failover scenarios and have noticed that one scenario fails.

We have repeated the test enough times to confirm that the same behaviour is observed every time.
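
For reference, the failover tests presume a monitor of roughly this shape (a minimal mariadbmon sketch, assuming auto_failover and auto_rejoin are enabled; the section name, credentials, and interval are illustrative, not taken from the actual setup):

[MariaDB-Monitor]
type=monitor
module=mariadbmon
servers=node1,node2,node3
user=maxuser
password=maxpwd
monitor_interval=2000
auto_failover=true
auto_rejoin=true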

Scenario:

node1 ==> Master
node2 ==> Slave
node3 ==> Slave

Success
===================================

When node1 alone is brought down:

node1 ==> Down, rejoined as slave
node2 ==> Master, promoted as master
node3 ==> Slave, no change

Success

===================================

When node1 and node2 are brought down at the same time, node3 is promoted as master successfully:

node1 ==> Down (brought down at the same time)
node2 ==> Down (brought down at the same time)
node3 ==> Master, promoted as master

Success

====================================
====================================
When bringing both nodes back up at the same time (node2 followed by node1):

node1 ==> Running
node2 ==> Running
node3 ==> Already master

Output:

node1 ==> Slave, Running
node2 ==> Master, Running
node3 ==> Running (out of the cluster), including data loss

Failure

In the above scenario we started both nodes at the same time, node2 followed by node1, and the current master (node3) dropped to the plain Running state.

[maxscale@x18tcldgpapp06 ~]$ maxctrl list servers
┌────────┬──────────────┬──────┬─────────────┬─────────────────┬────────────┐
│ Server │ Address      │ Port │ Connections │ State           │ GTID       │
├────────┼──────────────┼──────┼─────────────┼─────────────────┼────────────┤
│ node1  │ 10.1.1.96    │ 6603 │ 0           │ Down            │ 1-3-341679 │
├────────┼──────────────┼──────┼─────────────┼─────────────────┼────────────┤
│ node2  │ 10.1.1.81    │ 6603 │ 0           │ Down            │ 1-3-341679 │
├────────┼──────────────┼──────┼─────────────┼─────────────────┼────────────┤
│ node3  │ 10.1.1.82    │ 6603 │ 0           │ Master, Running │ 1-3-341679 │
└────────┴──────────────┴──────┴─────────────┴─────────────────┴────────────┘
[maxscale@x18tcldgpapp06 ~]$ maxctrl list servers
┌────────┬──────────────┬──────┬─────────────┬─────────────────┬────────────┐
│ Server │ Address      │ Port │ Connections │ State           │ GTID       │
├────────┼──────────────┼──────┼─────────────┼─────────────────┼────────────┤
│ node1  │ 10.1.1.96    │ 6603 │ 0           │ Slave, Running  │ 1-3-341679 │
├────────┼──────────────┼──────┼─────────────┼─────────────────┼────────────┤
│ node2  │ 10.1.1.81    │ 6603 │ 0           │ Master, Running │ 1-3-341679 │
├────────┼──────────────┼──────┼─────────────┼─────────────────┼────────────┤
│ node3  │ 10.1.1.82    │ 6603 │ 0           │ Running         │ 1-1-342807 │
└────────┴──────────────┴──────┴─────────────┴─────────────────┴────────────┘
[maxscale@x18tcldgpapp06 ~]$ maxctrl list servers
┌────────┬──────────────┬──────┬─────────────┬─────────────────┬────────────┐
│ Server │ Address      │ Port │ Connections │ State           │ GTID       │
├────────┼──────────────┼──────┼─────────────┼─────────────────┼────────────┤
│ node1  │ 10.1.1.96    │ 6603 │ 0           │ Slave, Running  │ 1-3-341679 │
├────────┼──────────────┼──────┼─────────────┼─────────────────┼────────────┤
│ node2  │ 10.1.1.81    │ 6603 │ 0           │ Master, Running │ 1-3-341679 │
├────────┼──────────────┼──────┼─────────────┼─────────────────┼────────────┤
│ node3  │ 10.1.1.82    │ 6603 │ 0           │ Running         │ 1-1-343390 │
└────────┴──────────────┴──────┴─────────────┴─────────────────┴────────────┘
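
Note the GTID column: a MariaDB GTID is domain-server_id-sequence, so node3's position 1-1-343390 holds transactions that node1 and node2 (both still at 1-3-341679) never replicated; presumably these are the writes node3 accepted while acting as master, and its position even keeps advancing between listings, so it is still taking writes after dropping out of the cluster. The same positions can be checked on each node directly (a sketch using the standard MariaDB GTID variables; the login details are illustrative):

mysql -h 10.1.1.82 -P 6603 -u maxuser -p -e "SELECT @@gtid_current_pos, @@gtid_binlog_pos, @@gtid_slave_pos;"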

====================================
====================================

When failing over nodes individually, promotion works perfectly.

But when both nodes are cycled at the same time, it fails in the same way every time.

The failover maxctrl output has been uploaded for your reference.
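
For anyone reproducing this, the monitor's own view and a manual rejoin attempt can also be driven from maxctrl (a sketch; the monitor name MariaDB-Monitor is an assumption carried over from the configuration sketch above):

maxctrl show monitor MariaDB-Monitor
maxctrl call command mariadbmon rejoin MariaDB-Monitor node2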



 Comments   
Comment by Chris Calender (Inactive) [ 2018-08-14 ]

It has now also been tested multiple times with read_only=1 in my.cnf, still with the same results.
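
For clarity, that is the standard server-level setting (only the relevant fragment of my.cnf shown):

[mysqld]
read_only=1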

The previous master is not rejoining as a slave; it is being promoted to master again.

I will upload the latest logs for your reference.

Comment by Esa Korhonen [ 2018-08-30 ]

This is difficult to fix for 2.2, as the monitoring logic does not remember the previous master. In 2.3 the logic is different and these kinds of issues should not be a problem, or at least they are easier to fix.
