[MXS-3254] Monitor failover fails Created: 2020-10-22  Updated: 2021-10-11  Resolved: 2021-10-11

Status: Closed
Project: MariaDB MaxScale
Component/s: binlogrouter, test
Affects Version/s: 2.5.5
Fix Version/s: 2.5.16

Type: Bug
Priority: Major
Reporter: Niclas Antti
Assignee: markus makela
Resolution: Fixed
Votes: 0
Labels: None

Sprint: MXS-SPRINT-141

Description

The pinloki switchover test causes the monitor to fail as described below. This is a rare scenario that is unlikely to happen in the real world.

niclas: The pinloki test in review revealed two monitor TODOs. First (which I think has come up before), the monitor deduces that a replica is replicating from an "external" server by comparing IPs. So a server at 127.0.0.1 can be external or internal depending on where the IP comes from and how the monitor is configured. It should be consistent.
Second, if the sleep(5) in the test is replaced with test.maxscale().wait_monitor_ticks(5), the monitor ties itself in knots and maxctrl becomes unresponsive.
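
The first point can be illustrated with a minimal sketch. It is not the mariadbmon implementation: the Server type, the is_external_master helper and the addresses are hypothetical, and it assumes the check is a plain comparison of the master address reported by the replica against the configured server addresses.

```python
from typing import NamedTuple, Sequence


class Server(NamedTuple):
    """A monitored server as it might appear in the configuration (hypothetical)."""
    name: str
    address: str
    port: int


def is_external_master(master_host: str, master_port: int,
                       servers: Sequence[Server]) -> bool:
    """Return True if no configured server matches the reported master endpoint.

    The comparison is purely textual, which is the assumed source of the
    inconsistency: the same machine can be known under several addresses.
    """
    return not any(s.address == master_host and s.port == master_port
                   for s in servers)


# The same physical server, configured in two different ways.
by_loopback = [Server("server1", "127.0.0.1", 3306)]
by_lan_ip = [Server("server1", "192.168.0.10", 3306)]

# The replica reports its master as 127.0.0.1:3306 in both cases.
loopback_verdict = is_external_master("127.0.0.1", 3306, by_loopback)  # False
lan_verdict = is_external_master("127.0.0.1", 3306, by_lan_ip)         # True
```

Under these assumptions the same replication link is classified as internal in the first configuration and external in the second, which matches the inconsistency described above.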
 
esak: The monitor gets stuck?
 
niclas: Something goes awry and the monitor goes into a loop trying to STOP SLAVE, which fails.
I didn't look into it much; I just noticed that something is messed up when the two scenarios play out at the same time.
 
esak: It's likely not an infinite loop, but depends on some timeout settings.
But why does "stop slave" fail?
 
niclas: That's the part that needs to be dug into.
2020-10-22 10:49:20   warning: [mariadbmon] Query 'SET STATEMENT max_statement_time=3 FOR STOP SLAVE '';' failed on 'pinloki': 'Lost connection to MySQL server during query' (2013). Retrying with 86.9 seconds left.
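
The "Retrying with 86.9 seconds left" message suggests a retry loop bounded by an overall time budget rather than an infinite loop. The sketch below shows that general pattern only. The retry_until_deadline helper and its parameters are hypothetical and are not taken from the MaxScale source.

```python
import time
from typing import Callable, Tuple


def retry_until_deadline(attempt: Callable[[], bool],
                         total_timeout: float,
                         pause: float = 0.0,
                         clock: Callable[[], float] = time.monotonic,
                         sleep: Callable[[float], None] = time.sleep
                         ) -> Tuple[bool, int]:
    """Call attempt() until it succeeds or the time budget is used up.

    Returns (succeeded, number_of_attempts). The loop always terminates,
    but with a large budget and a query that keeps failing it can look
    like a hang from the outside.
    """
    deadline = clock() + total_timeout
    attempts = 0
    while True:
        attempts += 1
        if attempt():
            return True, attempts
        remaining = deadline - clock()
        if remaining <= 0:
            return False, attempts
        # Never sleep past the deadline.
        sleep(min(pause, remaining))
```

With a budget of around 90 seconds and a per-query limit of 3 seconds, a loop like this would keep the caller busy for the whole budget if every attempt fails, which is consistent with the unresponsiveness reported above.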
 
esak: Could there be some weird deadlock where one thread cannot advance before the other? It's a bit weird, since the monitor runs in its own thread.
 
niclas: I think it is something like that.



Comments
Comment by markus makela [ 2021-09-14 ]

The pinloki_switchover test doesn't seem to have a sleep(5) in it, but the pinloki_upgrade test does. Removing it seems to have no effect, and the test passes without it.

Comment by markus makela [ 2021-10-11 ]

This seems to have been mostly about the test itself. The only real problem I found was that when the replication position was set to a specific GTID with SET GLOBAL @@gtid_slave_pos = <GTID>, the binlogrouter did not report that GTID until replication had received a newer event.
