[MXS-1966] Servers lose stale slave status when MaxScale is restarted Created: 2018-07-08  Updated: 2020-02-14  Resolved: 2020-02-14

Status: Closed
Project: MariaDB MaxScale
Component/s: mariadbmon
Affects Version/s: 2.3
Fix Version/s: 2.4.0

Type: Bug Priority: Major
Reporter: markus makela Assignee: Unassigned
Resolution: Fixed Votes: 0
Labels: None


 Description   

The following steps cause all servers that were slaves of the stopped master to be assigned the wrong status.

  1. Start with one master and at least one slave
  2. Stop the master
  3. Restart MaxScale
  4. The slaves end up in the following state:

┌─────────┬───────────┬──────┬─────────────┬───────────────────────────────────┬──────────┐
│ Server  │ Address   │ Port │ Connections │ State                             │ GTID     │
├─────────┼───────────┼──────┼─────────────┼───────────────────────────────────┼──────────┤
│ server1 │ 127.0.0.1 │ 3000 │ 0           │ Down                              │          │
├─────────┼───────────┼──────┼─────────────┼───────────────────────────────────┼──────────┤
│ server2 │ 127.0.0.1 │ 3001 │ 0           │ Slave of External Server, Running │ 0-3000-9 │
├─────────┼───────────┼──────┼─────────────┼───────────────────────────────────┼──────────┤
│ server3 │ 127.0.0.1 │ 3002 │ 0           │ Slave of External Server, Running │ 0-3000-9 │
├─────────┼───────────┼──────┼─────────────┼───────────────────────────────────┼──────────┤
│ server4 │ 127.0.0.1 │ 3003 │ 0           │ Slave of External Server, Running │ 0-3000-9 │
└─────────┴───────────┴──────┴─────────────┴───────────────────────────────────┴──────────┘

The following messages are logged after MaxScale is restarted.

2018-07-09 00:15:02   error  : Monitor was unable to connect to server server1[127.0.0.1:3000] : 'Can't connect to MySQL server on '127.0.0.1' (115)'
2018-07-09 00:15:02   warning: [mariadbmon] 'server2' is a better master candidate than the current master 'server1'. Master will change if 'server1' is no longer a valid master.
2018-07-09 00:15:02   notice : Server changed state: server1[127.0.0.1:3000]: server_down. [Running] -> [Down]
2018-07-09 00:15:02   notice : Server changed state: server2[127.0.0.1:3001]: lost_slave. [Slave, Running] -> [Slave of External Server, Running]
2018-07-09 00:15:02   notice : Server changed state: server3[127.0.0.1:3002]: lost_slave. [Slave, Running] -> [Slave of External Server, Running]
2018-07-09 00:15:02   notice : Server changed state: server4[127.0.0.1:3003]: lost_slave. [Slave, Running] -> [Slave of External Server, Running]



 Comments   
Comment by markus makela [ 2018-07-08 ]

Possibly related to MXS-1965.

Comment by Esa Korhonen [ 2018-07-31 ]

This happens because the monitor does not know the server ID of the downed master, so it cannot match the slaves' replication connections to it. Two possible solutions:
1) The monitor could instead compare hostnames and ports to determine which slave connection points to which server. This is unreliable in situations where servers have different IPs in different networks etc., but would be fine for most cases.
2) The monitor journal could be expanded so that monitors can save more data in it, e.g. server IDs.
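Solution 1 can be sketched roughly as follows. This is a minimal illustration in Python, not MaxScale's actual implementation; the function and field names mirror SHOW SLAVE STATUS output and are otherwise hypothetical:

```python
# Sketch of solution 1 (hypothetical names): match each slave connection
# to a monitored server by comparing hostname and port, instead of relying
# on the master's server ID, which is unknown while the master is down.

def classify_slave_connection(slave_conn, monitored_servers):
    """Return 'internal' if the slave connection points at a monitored
    server, 'external' otherwise.

    slave_conn carries the Master_Host/Master_Port fields of a
    SHOW SLAVE STATUS row; monitored_servers is a list of (host, port)
    pairs from the monitor's configuration.
    """
    target = (slave_conn["Master_Host"], slave_conn["Master_Port"])
    return "internal" if target in monitored_servers else "external"

servers = [("127.0.0.1", 3000), ("127.0.0.1", 3001)]

# The connection targets server1 (port 3000); even with server1 down,
# hostname/port comparison identifies it as an internal slave connection.
conn = {"Master_Host": "127.0.0.1", "Master_Port": 3000}
print(classify_slave_connection(conn, servers))  # internal
```

With ID-based matching the same connection would be classified as external, which is exactly the "Slave of External Server" state seen in the description.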

Comment by markus makela [ 2020-02-14 ]

Fixed by assume_unique_hostnames.
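For context, assume_unique_hostnames is a mariadbmon parameter (on by default in 2.4) that lets the monitor assume server hostnames and ports are unique, so slave connections can be matched to servers by hostname and port. A sketch of enabling it explicitly; the section name, server list, and credentials below are illustrative:

```ini
# Illustrative mariadbmon configuration with the setting enabled explicitly.
[MariaDB-Monitor]
type=monitor
module=mariadbmon
servers=server1,server2,server3,server4
user=maxuser
password=maxpwd
assume_unique_hostnames=true
```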

Generated at Thu Feb 08 04:10:43 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.