[MXS-3268] Maxscale should auto detect master if there is none in cluster Created: 2020-10-29  Updated: 2021-09-20  Resolved: 2021-03-31

Status: Closed
Project: MariaDB MaxScale
Component/s: mariadbmon
Affects Version/s: None
Fix Version/s: 6.0.0

Type: New Feature Priority: Major
Reporter: Nilnandan Joshi Assignee: Esa Korhonen
Resolution: Fixed Votes: 0
Labels: None

Sprint: MXS-SPRINT-127, MXS-SPRINT-128

 Description   

Hi Team,

Recently, a customer observed a scenario where the master went down and MaxScale reported "Master has failed. If master status does not change in 4 monitor passes, failover begins." However, before failover could begin, the master came back up in read-only mode and MaxScale demoted it to "Slave, Running". The log excerpt, followed by a sketch of the monitor settings that drive this timing, is below.

2020-10-17 01:14:19 error : Monitor was unable to connect to server node2[10.232.86.133:6603] : ''
2020-10-17 01:14:19 notice : Server changed state: node2[10.232.86.133:6603]: master_down. [Master, Running] -> [Down]
2020-10-17 01:14:19 warning: [mariadbmon] Master has failed. If master status does not change in 4 monitor passes, failover begins.
2020-10-17 01:14:44 warning: [mariadbmon] The current master server 'node2' is no longer valid because it is in read-only mode, but there is no valid alternative to swap to.
2020-10-17 01:14:44 error : [mariadbmon] No Master can be determined. Last known was 10.232.86.133:6603
2020-10-17 01:14:44 notice : Server changed state: node2[10.232.86.133:6603]: slave_up. [Down] -> [Slave, Running]
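The four-pass delay in the log comes from the monitor's failcount setting, evaluated once per monitor_interval. A minimal sketch of the relevant mariadbmon section, assuming illustrative server names and credentials (the real deployment's values are not in this ticket), would be:

    [Replication-Monitor]
    type=monitor
    module=mariadbmon
    servers=node1,node2,node3
    user=maxuser
    password=maxpwd
    monitor_interval=5000   # milliseconds between monitor passes
    failcount=4             # passes the master must stay down before failover starts
    auto_failover=true      # promote a slave automatically when the master fails
    auto_rejoin=true        # rejoin a returning old master as a slave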

So now they had three nodes in the "Slave, Running" state and MaxScale did not promote any of them to master. Finally, they had to restart the node2 server, and only then did failover happen.

2020-10-17 01:54:01 error : Monitor was unable to connect to server node2[10.232.86.133:6603] : ''
2020-10-17 01:54:01 error : [mariadbmon] No Master can be determined. Last known was 10.232.86.133:6603
2020-10-17 01:54:01 notice : Server changed state: node2[10.232.86.133:6603]: slave_down. [Slave, Running] -> [Down]
2020-10-17 01:54:01 warning: [mariadbmon] Master has failed. If master status does not change in 4 monitor passes, failover begins.
...
2020-10-17 01:54:21 notice : [mariadbmon] Selecting a server to promote and replace 'node2'. Candidates are: 'node1', 'node3'.
2020-10-17 01:54:21 notice : [mariadbmon] Selected 'node1'.
2020-10-17 01:54:21 notice : [mariadbmon] Performing automatic failover to replace failed master 'node2'.
2020-10-17 01:54:21 notice : [mariadbmon] Redirecting 'node3' to replicate from 'node1' instead of 'node2'.
2020-10-17 01:54:21 notice : [mariadbmon] All redirects successful.
2020-10-17 01:54:22 notice : [mariadbmon] All redirected slaves successfully started replication from 'node1'.
2020-10-17 01:54:22 notice : [mariadbmon] Failover 'node2' -> 'node1' performed.

Can we add functionality to MaxScale that checks for a master server frequently? If there is no master, it could compare the GTIDs of the remaining nodes, promote the one with the latest GTID to master, and redirect the other nodes to replicate from it.
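For illustration only, here is a rough sketch of the requested GTID-based selection. This is not MaxScale's actual implementation; the Gtid and Candidate types are hypothetical simplifications, and a real comparison would also have to handle multiple replication domains and gaps in the GTID sets:

    #include <algorithm>
    #include <cstdint>
    #include <string>
    #include <vector>

    // Hypothetical, simplified single-domain GTID: domain-server_id-sequence.
    struct Gtid
    {
        std::uint32_t domain = 0;
        std::uint32_t server_id = 0;
        std::uint64_t sequence = 0;
    };

    struct Candidate
    {
        std::string name;   // e.g. "node1"
        Gtid gtid_current;  // latest event the node has applied
    };

    // Pick the slave with the highest applied sequence as the new master.
    // Returns nullptr if the candidates span multiple replication domains,
    // in which case comparing sequences directly would be unsafe.
    const Candidate* select_new_master(const std::vector<Candidate>& candidates)
    {
        if (candidates.empty())
        {
            return nullptr;
        }

        for (const auto& c : candidates)
        {
            if (c.gtid_current.domain != candidates.front().gtid_current.domain)
            {
                return nullptr;  // mixed domains, no direct sequence comparison
            }
        }

        return &*std::max_element(candidates.begin(), candidates.end(),
                                  [](const Candidate& a, const Candidate& b) {
                                      return a.gtid_current.sequence < b.gtid_current.sequence;
                                  });
    }

In practice the monitor would read each server's gtid_current_pos and verify the health of the slave connections before promoting the winner and redirecting the remaining slaves.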



 Comments   
Comment by Johan Wikman [ 2020-11-18 ]

Is it known why node2 came back up in readonly mode? Is it made so in the server config file?
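(For reference, read-only at startup is typically enabled in the server config file, e.g. the following under [mysqld] in my.cnf, which would make the node come back read-only after every restart:

    [mysqld]
    read_only = ON

A runtime SET GLOBAL read_only=ON, by contrast, does not survive a restart.)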

Comment by Johan Wikman [ 2020-12-17 ]

nicklamb: As this is a New Feature and in the Icebox, you need to agree with toddstoffel on what to do about it.

Comment by Johan Wikman [ 2021-03-08 ]

ccalender Could you clarify what master node in point 2 refers to? To one of the Galera nodes in point 1?
