[MXS-3983] Add switchover-force command Created: 2022-02-03  Updated: 2023-09-14  Resolved: 2023-06-19

Status: Closed
Project: MariaDB MaxScale
Component/s: maxctrl
Affects Version/s: 2.5.14
Fix Version/s: 23.08.0

Type: New Feature Priority: Major
Reporter: Rick Pizzi Assignee: Esa Korhonen
Resolution: Fixed Votes: 4
Labels: None

Issue Links:
Relates
relates to MXS-4759 Force failover flag Closed
Sprint: MXS-SPRINT-184

 Description   

If the master becomes unresponsive (e.g. InnoDB clogged by contention issues), we want to be able to promote a replica and then kill the unrecoverable master.

This is currently not possible:

2022-02-03 13:44:35   error  : [mariadbmon] Failed to enable read_only on 'brokercloudprod-db-mdb-ms-0': Query 'SET STATEMENT max_statement_time=3 FOR SET GLOBAL read_only=1;' failed on 'brokercloudprod-db-mdb-ms-0': 'Query execution was interrupted (max_statement_time exceeded)' (1969).

Proposal: either proceed with the switchover even if read_only cannot be set within some time, or provide a switchover option that forces the operation when needed. Thank you.
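A minimal sketch of the proposed "force" semantics (illustrative only, not MaxScale code; the `switchover` and `set_read_only` helpers are hypothetical): try to set the old master read-only, but with force enabled proceed with the promotion anyway instead of aborting when the query cannot be completed.

```python
def set_read_only(server: dict) -> bool:
    """Simulated 'SET GLOBAL read_only=1'; fails when the master is hung."""
    return server.get("responsive", False)

def switchover(old_master: dict, new_master: dict, force: bool = False) -> str:
    """Promote new_master. Without force, abort if the old master cannot
    be set read_only; with force, demote the unresponsive master anyway."""
    if set_read_only(old_master):
        return f"promoted {new_master['name']} (old master set read_only)"
    if force:
        return f"promoted {new_master['name']} (read_only failed, forced)"
    raise RuntimeError("Failed to enable read_only on old master")

hung_master = {"name": "ms-0", "responsive": False}
replica = {"name": "ms-1"}

try:
    switchover(hung_master, replica)  # current behavior: aborts
except RuntimeError as e:
    print(e)

print(switchover(hung_master, replica, force=True))  # proposed behavior
```

The forced path deliberately skips the demotion step; the operator is then expected to kill the old master so it cannot accept writes.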



 Comments   
Comment by markus makela [ 2022-02-24 ]

Some sort of a --force flag for switchover might make sense.

Comment by Rick Pizzi [ 2022-03-23 ]

It looks like this also happens during automatic failovers, which is severe....

2022-03-22 21:12:56   error  : [mariadbmon] Failed to enable read_only on 'dalenys-aggreg-db-mdb-ms-1': Query 'SET STATEMENT max_statement_time=3 FOR SET GLOBAL read_only=1;' failed on 'dalenys-aggreg-db-mdb-ms-1': 'Can't connect to MySQL server on '10.107.0.163' (110)' (2002).
2022-03-22 21:12:56   error  : [mariadbmon] Failed to disable read_only on 'dalenys-aggreg-db-mdb-ms-1': Query 'SET STATEMENT max_statement_time=3 FOR SET GLOBAL read_only=0;' failed on 'dalenys-aggreg-db-mdb-ms-1': 'Can't connect to MySQL server on '10.107.0.163' (115)' (2002).
2022-03-22 21:12:56   error  : [mariadbmon] Switchover dalenys-aggreg-db-mdb-ms-1 -> dalenys-aggreg-db-mdb-ms-0 failed.
2022-03-22 21:12:56   notice : [mariadbmon] Disabling automatic cluster operations for 5 monitor ticks.
2022-03-22 21:13:02   error  : Monitor timed out when connecting to server dalenys-aggreg-db-mdb-ms-1[10.107.0.163:3306] : 'Can't connect to MySQL server on '10.107.0.163' (115)'
2022-03-22 21:13:02   notice : Server changed state: dalenys-aggreg-db-mdb-ms-1[10.107.0.163:3306]: server_down. [Running] -> [Down]
2022-03-22 21:13:05   warning: Discarding journal file '/var/lib/maxscale/MariaDB-Monitor_journal.json'. Servers described in the journal are different from the ones configured on the current monitor.
2022-03-22 21:13:05   notice : Removed 'dalenys-aggreg-db-mdb-ms-1' from 'MariaDB-Monitor'
2022-03-22 21:13:05   notice : Removed 'dalenys-aggreg-db-mdb-ms-1' from 'Read-Only-Service'
2022-03-22 21:13:05   notice : Removed 'dalenys-aggreg-db-mdb-ms-1' from 'Read-Write-Service'
2022-03-22 21:13:05   warning: [mariadbmon] Tried to find a master but no valid master server found.
2022-03-22 21:13:05   warning: [mariadbmon] 'dalenys-aggreg-db-mdb-ms-0' is not a valid master candidate because it's read_only.

I think that in the above situation the replica should be promoted anyway.
Currently, the cluster is left without a valid master...

Comment by markus makela [ 2022-03-23 ]

That happened because the server was deleted, which caused the monitor to discard the old information. Not deleting it would have prevented that.

Comment by Rick Pizzi [ 2022-03-23 ]

What would cause the server to be deleted from the MaxScale monitor's view? Automation?
In any case, the server was unreachable, then deleted, and the remaining slave was never promoted.

Comment by Rick Pizzi [ 2022-03-23 ]

The question is why the remaining slave is not considered a valid master candidate.

Comment by markus makela [ 2022-03-23 ]

Based on that output I'd say someone did a maxctrl delete server followed by a maxctrl create server. Given that it happened right after the failure, it could be some script that did it.

Comment by markus makela [ 2022-03-23 ]

There might be something strange going on as the logs above were generated with enforce_simple_topology enabled. Further investigation might be needed.

Comment by Rick Pizzi [ 2022-03-23 ]

This is SkySQL, so it may be part of the operator logic; I don't know.

But the part that worries me is where MaxScale refuses to do a failover because it is unable to set read_only on the current master.

Take, for example, the not-uncommon case where the master runs out of available connections ("Too many connections"): failover will never happen, because MaxScale will never be able to set the current master to read_only.
This is not correct in my opinion; the master may be dead, hung, or unreachable for many different reasons,
and we need to take action after a reasonable number of attempts anyway. IMHO.
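The "reasonable number of attempts" idea can be sketched as a simple retry budget (an illustrative sketch under assumed names, not MaxScale internals): after a bounded number of consecutive failed probes of the master, trigger the failover regardless of whether read_only could be set.

```python
def should_force_failover(consecutive_failures: int, max_attempts: int = 5) -> bool:
    """After max_attempts consecutive failed master probes, act anyway."""
    return consecutive_failures >= max_attempts

failures = 0
for tick in range(10):
    master_reachable = False  # e.g. "Too many connections" on every probe
    failures = 0 if master_reachable else failures + 1
    if should_force_failover(failures):
        print(f"tick {tick}: forcing failover after {failures} failed attempts")
        break
```

Any successful probe resets the counter, so the forced path only fires for a persistently unreachable master, not a transient hiccup.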

Thanks
Rick

Comment by markus makela [ 2022-03-23 ]

I agree, even if the switchover fails, automatic failover should eventually be able to promote one of the remaining servers. As to why this didn't happen, we don't know at this point and we'll need to investigate. It's possible that the action of deleting the server is what triggered this but we need to be able to reproduce this to be sure.

Comment by markus makela [ 2023-02-15 ]

I changed this to a New Feature, as the current behavior is expected behavior. esa.korhonen and rpizzi, if you think this is wrong, let me know and we can discuss what the correct classification for this issue is.

Generated at Thu Feb 08 04:25:21 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.