MariaDB MaxScale / MXS-1508

Failover is sometimes triggered on non-simple topologies

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.2.0
    • Fix Version/s: 2.2.2
    • Component/s: mariadbmon
    • Labels:
      None
    • Sprint:
      MXS-SPRINT-52

      Description

      The failover function is only supposed to run on topologies with a single master and a replication depth of 1, and this is usually what happens. Occasionally, however, failover is attempted on more complicated topologies, which may cause unintended effects. The exact conditions are unclear, as the behavior appears somewhat random. The most reliable way to reproduce the bug with three servers A, B and C is:
      1) Start MaxScale (MS) with normal master-slave replication (A -> B, C), then shut down A.
      2) MS performs failover; one of the slaves (e.g. server B) is now the master.
      3) Start A again. Manually (with an SQL client) set B to replicate from A, forming A -> B -> C.
      4) Wait for MS to detect the new topology. It should show B as [Master, Relay Master, Slave, Stale Status, Running].
      5) Shut down B. MS will attempt failover, but the attempt fails.
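
The precondition described above (one master, replication depth 1) can be sketched as a small check. This is an illustrative sketch only, not the monitor's actual code: the function names `is_simple_topology` and `should_failover` are hypothetical, and the topology is assumed to be given as a mapping from each server to the server it replicates from (`None` for a server with no master).

```python
from typing import Dict, Optional

# Hypothetical representation: server name -> name of the server it
# replicates from, or None if it replicates from nobody.
Topology = Dict[str, Optional[str]]


def is_simple_topology(masters: Topology) -> bool:
    """Return True only for a single-master topology of depth 1."""
    roots = [server for server, master in masters.items() if master is None]
    if len(roots) != 1:
        # Zero roots (e.g. a replication ring) or several independent masters.
        return False
    root = roots[0]
    # Depth 1 means every other server replicates directly from the root.
    return all(master == root
               for server, master in masters.items()
               if server != root)


def should_failover(masters: Topology, failed: str) -> bool:
    """Assumed rule: fail over only if the topology is simple and the
    failed server is the single root master."""
    return is_simple_topology(masters) and masters.get(failed) is None


# Step 1 of the reproduction: A -> B, C
step1 = {"A": None, "B": "A", "C": "A"}
# Step 4 of the reproduction: A -> B -> C
step4 = {"A": None, "B": "A", "C": "B"}

print(should_failover(step1, "A"))  # simple topology, master failed
print(should_failover(step4, "B"))  # chained topology, relay failed
```

Under these assumptions, the shutdown of A in step 1 qualifies for failover, while the shutdown of B in step 5 does not, because C replicates from B rather than directly from A.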

      Logs:
      2017-11-01 11:08:17 notice : MaxScale started with 2 worker threads, each with a stack size of 8388608 bytes.
      2017-11-01 11:08:17 notice : Server changed state: LocalSlave2[127.0.0.1:3003]: new_slave. [Running] -> [Slave, Running]
      2017-11-01 11:09:32 error : Monitor was unable to connect to server [127.0.0.1]:3001 : "Can't connect to MySQL server on '127.0.0.1' (107)"
      2017-11-01 11:09:32 notice : [mysqlmon] Server [127.0.0.1]:3001 lost the master status.
      2017-11-01 11:09:32 notice : Server changed state: LocalMaster1[127.0.0.1:3001]: master_down. [Master, Running] -> [Down]
      2017-11-01 11:09:32 notice : [mysqlmon] Performing automatic failover to replace failed master 'LocalMaster1'.
      2017-11-01 11:09:32 notice : [mysqlmon] Failover: Promoting server 'LocalSlave1' to master.
      2017-11-01 11:09:32 notice : [mysqlmon] Failover: Redirecting slaves to new master.
      2017-11-01 11:09:32 debug : [mysqlmon] Failover: Change master command is 'CHANGE MASTER TO MASTER_HOST = '127.0.0.1', MASTER_PORT = 3002, MASTER_USE_GTID = slave_pos, MASTER_USER = 'replicator', MASTER_PASSWORD = '******';'.
      2017-11-01 11:09:32 notice : [mysqlmon] Failover: Slave 'LocalSlave2' redirected to new master.
      2017-11-01 11:09:32 error : [mysqlmon] No Master can be determined. Last known was 127.0.0.1:3001
      2017-11-01 11:09:32 debug : 140541849107328 [poll_waitevents] epoll_wait found 1 fds
      2017-11-01 11:09:32 debug : 140541703866112 [poll_waitevents] epoll_wait found 1 fds
      2017-11-01 11:09:37 notice : Server changed state: LocalSlave1[127.0.0.1:3002]: new_master. [Slave, Running] -> [Master, Slave, Running]
      2017-11-01 11:09:37 notice : [mysqlmon] A Master Server is now available: 127.0.0.1:3002
      2017-11-01 11:09:42 notice : Server changed state: LocalSlave1[127.0.0.1:3002]: new_master. [Master, Slave, Running] -> [Master, Running]
      2017-11-01 11:09:42 notice : [mysqlmon] A Master Server is now available: 127.0.0.1:3002
      2017-11-01 11:13:43 notice : Server changed state: LocalMaster1[127.0.0.1:3001]: server_up. [Down] -> [Running]
      2017-11-01 11:13:43 debug : 140541849107328 [poll_waitevents] epoll_wait found 1 fds
      2017-11-01 11:13:43 debug : 140541703866112 [poll_waitevents] epoll_wait found 1 fds
      2017-11-01 11:14:18 warning: [mysqlmon] All slave servers under the current master server have been lost. Assigning Stale Master status to the old master server 'LocalSlave1' (127.0.0.1:3002).
      2017-11-01 11:14:18 notice : Server changed state: LocalMaster1[127.0.0.1:3001]: new_master. [Running] -> [Master, Running]
      2017-11-01 11:14:18 notice : Server changed state: LocalSlave1[127.0.0.1:3002]: new_master. [Master, Running] -> [Master, Relay Master, Slave, Stale Status, Running]
      2017-11-01 11:20:19 error : Monitor was unable to connect to server [127.0.0.1]:3002 : "Can't connect to MySQL server on '127.0.0.1' (107)"
      2017-11-01 11:20:19 notice : [mysqlmon] Server [127.0.0.1]:3002 lost the master status.
      2017-11-01 11:20:19 warning: [mysqlmon] All slave servers under the current master server have been lost. Assigning Stale Master status to the old master server 'LocalMaster1' (127.0.0.1:3001).
      2017-11-01 11:20:19 notice : Server changed state: LocalSlave1[127.0.0.1:3002]: master_down. [Master, Relay Master, Slave, Stale Status, Running] -> [Down]
      2017-11-01 11:20:19 notice : Server changed state: LocalSlave2[127.0.0.1:3003]: lost_slave. [Slave, Running] -> [Running]
      2017-11-01 11:20:19 notice : [mysqlmon] Performing automatic failover to replace failed master 'LocalSlave1'.
      2017-11-01 11:20:19 notice : [mysqlmon] Failover: Promoting server 'LocalSlave1' to master.
      2017-11-01 11:20:19 warning: [mysqlmon] Failover: Promotion failed: 'MySQL server has gone away'.
      2017-11-01 11:20:19 alert : [mysqlmon] Failed to perform failover, disabling failover functionality. To enable failover functionality, manually set 'failover' to 'true' for monitor 'MySQL-Monitor' via MaxAdmin or the REST API.
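
To follow how each server's status evolves in a log like the one above, the "Server changed state" lines can be extracted on their own. The following is a minimal sketch, assuming the line layout seen in this log excerpt; the names `StateChange` and `parse_state_changes` are hypothetical helpers, not part of MaxScale.

```python
import re
from typing import Iterable, List, NamedTuple


class StateChange(NamedTuple):
    timestamp: str
    server: str
    address: str
    event: str
    old: List[str]
    new: List[str]


# Pattern derived from the log lines in this report; other MaxScale
# versions may format these messages differently.
_STATE_RE = re.compile(
    r"^\s*(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+\w+\s*:\s+"
    r"Server changed state: (?P<server>\w+)\[(?P<addr>[^\]]+)\]: "
    r"(?P<event>\w+)\. \[(?P<old>[^\]]*)\] -> \[(?P<new>[^\]]*)\]"
)


def _split_status(text: str) -> List[str]:
    """Split 'Master, Running' into ['Master', 'Running']."""
    return [part.strip() for part in text.split(",") if part.strip()]


def parse_state_changes(lines: Iterable[str]) -> List[StateChange]:
    """Collect all server state transitions, ignoring other log lines."""
    changes = []
    for line in lines:
        match = _STATE_RE.match(line)
        if match:
            changes.append(StateChange(
                timestamp=match.group("ts"),
                server=match.group("server"),
                address=match.group("addr"),
                event=match.group("event"),
                old=_split_status(match.group("old")),
                new=_split_status(match.group("new")),
            ))
    return changes


sample = [
    "2017-11-01 11:09:32 notice : Server changed state: "
    "LocalMaster1[127.0.0.1:3001]: master_down. [Master, Running] -> [Down]",
    "2017-11-01 11:09:32 notice : [mysqlmon] Performing automatic failover "
    "to replace failed master 'LocalMaster1'.",
]
for change in parse_state_changes(sample):
    print(change.timestamp, change.server, change.old, "->", change.new)
```

Applied to the full log, this makes it easier to see that 'LocalSlave1' carried the [Master, Relay Master, Slave, Stale Status, Running] status immediately before it went down and the second failover was attempted.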

            People

            Assignee:
            Esa Korhonen (esa.korhonen)
            Reporter:
            Esa Korhonen (esa.korhonen)