Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Not a Bug
Affects Version: 2.3.4
Environment: RHEL
Sprint: MXS-SPRINT-82
Description
It looks like an intermittent network issue that eventually confuses MaxScale:
2019-04-09 22:30:25 error : (1945819) Lost connection to the master server, closing session. Lost connection to master server while connection was idle. Connection has been idle for 28778.2 seconds. Error caused by: #HY000: Lost connection to backend server.
2019-04-09 22:40:41 error : (2115177) Lost connection to the master server, closing session. Lost connection to master server while waiting for a result. Connection has been idle for 0.0 seconds. Error caused by: #HY000: Lost connection to backend server. (x3)
2019-04-09 22:51:32 error : (1842568) Lost connection to the master server, closing session. Lost connection to master server while connection was idle. Connection has been idle for 28778.2 seconds. Error caused by: #HY000: Lost connection to backend server.
etc. until we get to:
2019-04-09 23:05:21 warning: Error during monitor update of server 'server01': Query 'SHOW ALL SLAVES STATUS;' failed: 'Lost connection to MySQL server during query'.
2019-04-09 23:05:43 error : Failure loading users data from backend [192.168.1.230:3306] for service [MasterSlave-Router]. MySQL error 2002, Can't connect to MySQL server on '192.168.1.230' (110)
2019-04-09 23:05:43 warning: [MySQLAuth] MasterSlave-Router: login attempt for user 'vetdiss'@[192.168.1.240]:36973, authentication failed. User not found.
2019-04-09 23:05:51 error : Monitor timed out when connecting to server server01[192.168.1.230:3306] : 'Can't connect to MySQL server on '192.168.1.230' (110)'
2019-04-09 23:05:51 warning: 'server02' is a better master candidate than the current master 'server01'. Master will change when 'server01' is no longer a valid master.
2019-04-09 23:05:51 notice : Server changed state: server01[192.168.1.230:3306]: master_down. [Master, Running] -> [Down]
2019-04-09 23:05:51 error : Server server01 ([192.168.1.230]:3306) lost the master status while waiting for a result. Client sessions will be closed.
So at this point the monitor ejects server01 from the topology as unreachable and treats server02 as the preferred master candidate, but defers resetting replication on server02 because it still cannot reach server01.
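To make the deferral easier to follow, here is a minimal sketch of that decision logic in Python. It is an illustration only, not MaxScale's actual implementation: the server names and the failcount value of 5 are taken from the log, while everything else (the Server class, monitor_tick and the tick counting) is assumed for the example.

FAILCOUNT = 5  # "down over 5 (failcount) monitor updates" in the log below

class Server:
    def __init__(self, name, reachable=True, running_slaves=0):
        self.name = name
        self.reachable = reachable
        self.running_slaves = running_slaves
        self.down_ticks = 0

def monitor_tick(master, candidate):
    # One monitor update: keep the current master unless it has been down
    # for more than FAILCOUNT updates and has no running slaves.
    if master.reachable:
        master.down_ticks = 0
        return master
    master.down_ticks += 1
    if master.down_ticks > FAILCOUNT and master.running_slaves == 0:
        print(f"Setting '{candidate.name}' as master.")
        return candidate
    print(f"'{candidate.name}' is a better master candidate than '{master.name}'. "
          f"Master will change when '{master.name}' is no longer a valid master.")
    return master

server01 = Server("server01", reachable=False)
server02 = Server("server02")
master = server01
for _ in range(6):  # six monitor updates with server01 unreachable
    master = monitor_tick(master, server02)

In the real log the errors simply continue: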
2019-04-09 23:05:51 error : Server server01 ([192.168.1.230]:3306) lost the master status while waiting for a result. Client sessions will be closed.
2019-04-09 23:05:51 error : Lost connection to the master server, closing session. Lost connection to master server while connection was idle. Connection has been idle for 1360.8 seconds. Error caused by: #HY000: Lost connection to backend server.
After more connection failures, server01 stays down for more than failcount monitor updates and the monitor does this:
2019-04-09 23:06:42 warning: The current master server 'server01' is no longer valid because it has been down over 5 (failcount) monitor updates and it does not have any running slaves. Selecting new master server.
2019-04-09 23:06:42 warning: 'server01' is not a valid master candidate because it is down.
2019-04-09 23:06:42 notice : Setting 'server02' as master.
2019-04-09 23:06:42 notice : Cluster master server is replicating from an external master: server01.domain.com:3306
2019-04-09 23:06:42 notice : Server changed state: server02[192.168.1.231:3306]: new_master. [Slave of External Server, Running] -> [Master, Slave of External Server, Running]
The problem is that by this point MaxScale, having been unable to reach server01 for some time, sees it as an external server. It therefore does not stop replication from server01, and declares server02 the master of a separate one-node topology while leaving it a slave of the 'external' server01.
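The replication state that causes this can be seen directly on server02 with plain SQL. Below is a minimal read-only sketch in Python: the two IP addresses are the ones from the log, while the credentials, the pymysql driver and the literal host:port comparison against the monitored addresses are assumptions made for the example, not necessarily how the monitor itself matches servers.

import pymysql

# Addresses of the monitored servers, as they appear in the log above.
MONITORED = {("192.168.1.230", 3306), ("192.168.1.231", 3306)}

# Placeholder credentials; connect to server02, the newly promoted master.
conn = pymysql.connect(host="192.168.1.231", port=3306,
                       user="maxuser", password="maxpwd",
                       cursorclass=pymysql.cursors.DictCursor)
try:
    with conn.cursor() as cur:
        cur.execute("SHOW ALL SLAVES STATUS")
        for row in cur.fetchall():
            source = (row["Master_Host"], int(row["Master_Port"]))
            looks_external = source not in MONITORED
            print(f"replicating from {source[0]}:{source[1]} "
                  f"(IO thread: {row['Slave_IO_Running']}) -> "
                  f"{'external' if looks_external else 'monitored'} source")
finally:
    conn.close()

Note that in the log above the replication source is reported as server01.domain.com:3306 rather than by IP address, so a literal comparison like the one in this sketch would also classify it as an external source.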