[MXS-1508] Failover is sometimes triggered on non-simple topologies

Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: 2.2.0
Fix Version/s: 2.2.2
Component/s: mariadbmon
Labels: None
Sprint: MXS-SPRINT-52

Description

The failover function is only supposed to be run on 1-depth, 1-master setups, and this is usually the case. Sometimes, however, failover is attempted on more complicated topologies, possibly causing unintended effects. The exact conditions that trigger this are unclear, as the behavior seems somewhat random. The most reliable way of reproducing this bug with 3 servers A, B, C is:
1) Start MaxScale with normal master-slave replication (A->B and A->C), then shut down A.
2) MaxScale performs failover; one of the slaves (e.g. server B) is now the master.
3) Start A again. Manually (with an SQL client) set B to replicate from A, forming A->B->C.
4) Wait for MaxScale to detect the new topology; it should show B as [Master, Relay Master, Slave, Stale Status, Running].
5) Shut down B. MaxScale will attempt a failover, but the attempt will fail.
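
The precondition described above (failover only on 1-depth, 1-master setups) can be sketched as a small topology check. This is an illustrative Python sketch only, not the mariadbmon implementation: the function name `is_simple_topology` and the `replicates_from` mapping are hypothetical, and a real monitor would derive the mapping from each server's replication status.

```python
from typing import Dict, Optional


def is_simple_topology(replicates_from: Dict[str, Optional[str]]) -> bool:
    """Return True only for a 1-depth, 1-master replication layout.

    `replicates_from` (hypothetical input) maps each server name to the
    server it replicates from, or to None if it replicates from no one.
    """
    # Servers that replicate from nobody are the master candidates.
    masters = [name for name, source in replicates_from.items() if source is None]
    if len(masters) != 1:
        return False  # zero masters or several masters: not a simple setup

    master = masters[0]
    # Every other server must replicate directly from the single master;
    # a relay (A->B->C) makes the topology deeper than one level.
    return all(
        source == master
        for name, source in replicates_from.items()
        if name != master
    )


# Layout after step 1 (A->B and A->C): failover would be allowed.
step1 = {"A": None, "B": "A", "C": "A"}
# Layout after step 3 (A->B->C): B is a relay, so failover should be refused.
step3 = {"A": None, "B": "A", "C": "B"}

print(is_simple_topology(step1))
print(is_simple_topology(step3))
```

Under these assumptions, a guard of this kind evaluated on the last known topology before the failure would accept the layout from step 1 and reject the chained layout from step 3.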

Logs:
2017-11-01 11:08:17 notice : MaxScale started with 2 worker threads, each with a stack size of 8388608 bytes.
2017-11-01 11:08:17 notice : Server changed state: LocalSlave2[127.0.0.1:3003]: new_slave. [Running] -> [Slave, Running]
2017-11-01 11:09:32 error : Monitor was unable to connect to server [127.0.0.1]:3001 : "Can't connect to MySQL server on '127.0.0.1' (107)"
2017-11-01 11:09:32 notice : [mysqlmon] Server [127.0.0.1]:3001 lost the master status.
2017-11-01 11:09:32 notice : Server changed state: LocalMaster1[127.0.0.1:3001]: master_down. [Master, Running] -> [Down]
2017-11-01 11:09:32 notice : [mysqlmon] Performing automatic failover to replace failed master 'LocalMaster1'.
2017-11-01 11:09:32 notice : [mysqlmon] Failover: Promoting server 'LocalSlave1' to master.
2017-11-01 11:09:32 notice : [mysqlmon] Failover: Redirecting slaves to new master.
2017-11-01 11:09:32 debug : [mysqlmon] Failover: Change master command is 'CHANGE MASTER TO MASTER_HOST = '127.0.0.1', MASTER_PORT = 3002, MASTER_USE_GTID = slave_pos, MASTER_USER = 'replicator', MASTER_PASSWORD = '******';'.
2017-11-01 11:09:32 notice : [mysqlmon] Failover: Slave 'LocalSlave2' redirected to new master.
2017-11-01 11:09:32 error : [mysqlmon] No Master can be determined. Last known was 127.0.0.1:3001
2017-11-01 11:09:32 debug : 140541849107328 [poll_waitevents] epoll_wait found 1 fds
2017-11-01 11:09:32 debug : 140541703866112 [poll_waitevents] epoll_wait found 1 fds
2017-11-01 11:09:37 notice : Server changed state: LocalSlave1[127.0.0.1:3002]: new_master. [Slave, Running] -> [Master, Slave, Running]
2017-11-01 11:09:37 notice : [mysqlmon] A Master Server is now available: 127.0.0.1:3002
2017-11-01 11:09:42 notice : Server changed state: LocalSlave1[127.0.0.1:3002]: new_master. [Master, Slave, Running] -> [Master, Running]
2017-11-01 11:09:42 notice : [mysqlmon] A Master Server is now available: 127.0.0.1:3002
2017-11-01 11:13:43 notice : Server changed state: LocalMaster1[127.0.0.1:3001]: server_up. [Down] -> [Running]
2017-11-01 11:13:43 debug : 140541849107328 [poll_waitevents] epoll_wait found 1 fds
2017-11-01 11:13:43 debug : 140541703866112 [poll_waitevents] epoll_wait found 1 fds
2017-11-01 11:14:18 warning: [mysqlmon] All slave servers under the current master server have been lost. Assigning Stale Master status to the old master server 'LocalSlave1' (127.0.0.1:3002).
2017-11-01 11:14:18 notice : Server changed state: LocalMaster1[127.0.0.1:3001]: new_master. [Running] -> [Master, Running]
2017-11-01 11:14:18 notice : Server changed state: LocalSlave1[127.0.0.1:3002]: new_master. [Master, Running] -> [Master, Relay Master, Slave, Stale Status, Running]
2017-11-01 11:20:19 error : Monitor was unable to connect to server [127.0.0.1]:3002 : "Can't connect to MySQL server on '127.0.0.1' (107)"
2017-11-01 11:20:19 notice : [mysqlmon] Server [127.0.0.1]:3002 lost the master status.
2017-11-01 11:20:19 warning: [mysqlmon] All slave servers under the current master server have been lost. Assigning Stale Master status to the old master server 'LocalMaster1' (127.0.0.1:3001).
2017-11-01 11:20:19 notice : Server changed state: LocalSlave1[127.0.0.1:3002]: master_down. [Master, Relay Master, Slave, Stale Status, Running] -> [Down]
2017-11-01 11:20:19 notice : Server changed state: LocalSlave2[127.0.0.1:3003]: lost_slave. [Slave, Running] -> [Running]
2017-11-01 11:20:19 notice : [mysqlmon] Performing automatic failover to replace failed master 'LocalSlave1'.
2017-11-01 11:20:19 notice : [mysqlmon] Failover: Promoting server 'LocalSlave1' to master.
2017-11-01 11:20:19 warning: [mysqlmon] Failover: Promotion failed: 'MySQL server has gone away'.
2017-11-01 11:20:19 alert : [mysqlmon] Failed to perform failover, disabling failover functionality. To enable failover functionality, manually set 'failover' to 'true' for monitor 'MySQL-Monitor' via MaxAdmin or the REST API.

People

Assignee: Esa Korhonen

Reporter: Esa Korhonen

Votes: 0

Watchers: 1

Dates

Created: 2017-11-01 09:28

Updated: 2018-03-28 10:51

Resolved: 2018-02-13 11:43
