[MXS-2454] During connectivity problems MaxScale can get confused in Master-Slave topologies and promote the slave to master within MaxScale's topology without stopping the slave threads, seeing the former master as an external master. Created: 2019-04-26  Updated: 2020-08-25  Resolved: 2019-05-20

Status: Closed
Project: MariaDB MaxScale
Component/s: mariadbmon
Affects Version/s: 2.3.4
Fix Version/s: 2.3.7

Type: Bug Priority: Major
Reporter: Juan Assignee: Esa Korhonen
Resolution: Not a Bug Votes: 0
Labels: need_feedback
Environment:

RHEL


Sprint: MXS-SPRINT-82

 Description   

It looks like an intermittent network issue that eventually confuses MaxScale:

2019-04-09 22:30:25 error : (1945819) Lost connection to the master server, closing session. Lost connection to master server while connection was idle. Connection has been idle for 28778.2 seconds. Error caused by: #HY000: Lost connection to backend server.
 
2019-04-09 22:40:41 error : (2115177) Lost connection to the master server, closing session. Lost connection to master server while waiting for a result. Connection has been idle for 0.0 seconds. Error caused by: #HY000: Lost connection to backend server. (x3)
 
2019-04-09 22:51:32 error : (1842568) Lost connection to the master server, closing session. Lost connection to master server while connection was idle. Connection has been idle for 28778.2 seconds. Error caused by: #HY000: Lost connection to backend server.

etc. until we get to:

2019-04-09 23:05:21 warning: Error during monitor update of server 'server01': Query 'SHOW ALL SLAVES STATUS;' failed: 'Lost connection to MySQL server during query'.
2019-04-09 23:05:43 error : Failure loading users data from backend [192.168.1.230:3306] for service [MasterSlave-Router]. MySQL error 2002, Can't connect to MySQL server on '192.168.1.230' (110)
2019-04-09 23:05:43 warning: [MySQLAuth] MasterSlave-Router: login attempt for user 'vetdiss'@[192.168.1.240]:36973, authentication failed. User not found.
2019-04-09 23:05:51 error : Monitor timed out when connecting to server server01[192.168.1.230:3306] : 'Can't connect to MySQL server on '192.168.1.230' (110)'
2019-04-09 23:05:51 warning: 'server02' is a better master candidate than the current master 'server01'. Master will change when 'server01' is no longer a valid master.
2019-04-09 23:05:51 notice : Server changed state: server01[192.168.1.230:3306]: master_down. [Master, Running] -> [Down]
2019-04-09 23:05:51 error : Server server01 ([192.168.1.230]:3306) lost the master status while waiting for a result. Client sessions will be closed.

At this point the monitor ejects server01 from the topology as unreachable and treats server02 as the master, but it defers resetting the slave configuration on server02 because it cannot reach server01.

2019-04-09 23:05:51 error : Server server01 ([192.168.1.230]:3306) lost the master status while waiting for a result. Client sessions will be closed.
2019-04-09 23:05:51 error : Lost connection to the master server, closing session. Lost connection to master server while connection was idle. Connection has been idle for 1360.8 seconds. Error caused by: #HY000: Lost connection to backend server.

After further connection failures, the monitor finally exceeds failcount and does this:

2019-04-09 23:06:42 warning: The current master server 'server01' is no longer valid because it has been down over 5 (failcount) monitor updates and it does not have any running slaves. Selecting new master server.
2019-04-09 23:06:42 warning: 'server01' is not a valid master candidate because it is down.
2019-04-09 23:06:42 notice : Setting 'server02' as master.
2019-04-09 23:06:42 notice : Cluster master server is replicating from an external master: server01.domain.com:3306
2019-04-09 23:06:42 notice : Server changed state: server02[192.168.1.231:3306]: new_master. [Slave of External Server, Running] -> [Master, Slave of External Server, Running]
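For context, the failcount referenced in the warning above is a mariadbmon setting that controls how many consecutive monitor passes a master may be unreachable before a new master is selected. A minimal monitor section might look like the following (values are illustrative examples, assuming MaxScale 2.3 syntax, not the reporter's actual configuration):

```ini
# Illustrative mariadbmon monitor section; server names match this
# report, other values are placeholders.
[MariaDB-Monitor]
type=monitor
module=mariadbmon
servers=server01,server02
user=maxscale_monitor
password=change_me
# Interval between monitor passes, in milliseconds.
monitor_interval=2000
# Monitor passes the master may be down before re-selection
# (the "5 (failcount) monitor updates" in the warning above).
failcount=5
```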

The problem is that, having been unable to reach server01 for some time, MaxScale now sees it as an external server. It therefore does not stop replication from it, and instead declares server02 the master of a separate one-server topology while leaving it a slave of the 'external' server01.
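The failure mode described above can be summarized with a simplified model. This is a hypothetical sketch of the decision flow, not MaxScale source code; all names, thresholds, and data structures are illustrative:

```python
# Simplified model of the master re-selection step described in this
# report. Once the old master exceeds failcount, a running slave is
# promoted; if that slave still replicates from a host the monitor no
# longer tracks as part of the cluster, the host is classified as an
# "external" master and replication from it is NOT stopped.

FAILCOUNT = 5  # monitor passes a master may be down before re-selection


def reselect_master(servers, known_cluster, down_counts):
    """Return (new_master_name, status) after the old master fails out.

    servers:       list of dicts describing monitored servers
    known_cluster: set of server names the monitor still considers
                   part of the topology (the ejected master is absent)
    down_counts:   consecutive failed monitor passes per server
    """
    old_master = next(s for s in servers if s["role"] == "master")
    if down_counts.get(old_master["name"], 0) <= FAILCOUNT:
        return None, None  # old master is still considered valid

    # Promote the first running candidate that is not the old master.
    candidate = next(
        s for s in servers
        if s["name"] != old_master["name"] and s["running"]
    )

    # The behavior reported here: the unreachable old master has been
    # ejected from the monitored cluster, so the candidate's replication
    # source now looks like an *external* server, and no STOP SLAVE is
    # issued against it.
    source = candidate.get("replicates_from")
    external = source is not None and source not in known_cluster
    return candidate["name"], (
        "slave_of_external" if external else "clean_master"
    )
```

Replaying the scenario from the logs (server01 down for more than five passes, server02 still replicating from it) yields server02 promoted in the "slave of external server" state, matching the `new_master. [Slave of External Server, Running] -> [Master, Slave of External Server, Running]` transition above.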


Generated at Thu Feb 08 04:14:14 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.