[MXS-2576] Columnstore Monitor inaccurately labels a UM as slave Created: 2019-06-24  Updated: 2020-08-25  Resolved: 2019-08-13

Status: Closed
Project: MariaDB MaxScale
Component/s: Monitor
Affects Version/s: 2.3.8
Fix Version/s: 2.3.12

Type: Bug Priority: Major
Reporter: Geoff Montee (Inactive) Assignee: markus makela
Resolution: Fixed Votes: 0
Labels: None

Sprint: MXS-SPRINT-87

 Description   

Columnstore Monitor can inaccurately label a server as a "slave" if the monitor's connection gets dropped in the get_cs_version() function:

https://github.com/mariadb-corporation/MaxScale/blob/maxscale-2.3.8/server/modules/monitor/csmon/csmon.cc#L56

The reason is that this function is called inside CsMonitor::update_server_status to determine if the server supports the mcsSystemPrimary function:

https://github.com/mariadb-corporation/MaxScale/blob/maxscale-2.3.8/server/modules/monitor/csmon/csmon.cc#L111

If the get_cs_version function returns a value lower than 10200, then MaxScale checks the server's configuration for the "primary" parameter.

However, if the monitor's connection is dropped inside the get_cs_version function, the function returns 0. Because 0 is lower than 10200, MaxScale checks the server's configuration for the primary parameter. The DBA expects MaxScale to use the mcsSystemPrimary function instead, so this parameter is most likely not set for the server at all. As a result, MaxScale labels the server as a "Slave", since it assumes that some other server is the primary server.

Instead of allowing the server to be incorrectly set as a "Slave", Columnstore Monitor should detect that the connection died, and it should label the server as "Down".
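
To make the difference concrete, here is a minimal, self-contained sketch of the decision logic described above. It is not the csmon.cc code: the Status enum, the VersionResult alias, and both classify_* functions are hypothetical names invented for illustration, and the only values taken from this report are the 10200 threshold and the "returns 0 on failure" behavior. The sketch models a dropped connection as an empty std::optional so that it can be told apart from a genuinely old version.

```cpp
#include <optional>

// Hypothetical status values for illustration; not MaxScale's real status bits.
enum class Status { Down, Master, Slave };

// Hypothetical stand-in for the result of the version query.
// std::nullopt models a dropped connection, a value models a parsed version.
using VersionResult = std::optional<int>;

// Threshold from the report: versions at or above 10200 are checked with
// mcsSystemPrimary, lower versions fall back to the "primary" parameter.
constexpr int MIN_VERSION_WITH_PRIMARY_FN = 10200;

// Behavior as described in the report: a failed query collapses to version 0,
// so a connection loss looks the same as a very old server.
Status classify_buggy(VersionResult version, bool fn_says_primary, bool configured_primary)
{
    int v = version.value_or(0);
    bool primary = (v >= MIN_VERSION_WITH_PRIMARY_FN) ? fn_says_primary : configured_primary;
    return primary ? Status::Master : Status::Slave;
}

// Proposed behavior: treat a failed version query as a dead connection
// and report the server as Down before any primary check is made.
Status classify_fixed(VersionResult version, bool fn_says_primary, bool configured_primary)
{
    if (!version)
    {
        return Status::Down;
    }
    bool primary = (*version >= MIN_VERSION_WITH_PRIMARY_FN) ? fn_says_primary : configured_primary;
    return primary ? Status::Master : Status::Slave;
}
```

With a dropped connection and no "primary" parameter configured, classify_buggy yields Slave even though the server was the primary, while classify_fixed yields Down. The actual fix in MaxScale may be structured differently; the sketch only shows why the failure case needs its own return path.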

Here are some relevant entries from a MaxScale error log that show this happening:

2019-06-20 23:18:30   error  : Failed to execute query on server 'srv1' ([192.168.1.44]:3306): Lost connection to MySQL server during query
2019-06-20 23:18:30   notice : Server changed state: srv1[192.168.1.44:3306]: new_slave. [Master, Running] -> [Slave, Running]
2019-06-20 23:18:35   notice : Server changed state: srv1[192.168.1.44:3306]: new_master. [Slave, Running] -> [Master, Running]



 Comments   
Comment by markus makela [ 2019-06-25 ]

Increasing query_retries should reduce the likelihood of this happening.

Generated at Thu Feb 08 04:15:08 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.