[MXS-4759] Force failover flag Created: 2023-09-13  Updated: 2023-10-16  Resolved: 2023-10-16

Status: Closed
Project: MariaDB MaxScale
Component/s: mariadbmon
Affects Version/s: None
Fix Version/s: 23.08.2

Type: New Feature Priority: Major
Reporter: Bryan Bancroft (Inactive) Assignee: Esa Korhonen
Resolution: Won't Fix Votes: 0
Labels: None

Issue Links:
Relates
relates to MXS-3983 Add switchover-force command Closed
Sprint: MXS-SPRINT-192

 Description   

The request is for a force-failover or cluster-refresh option: a way to make a server the master regardless of risk. This was spurred by a situation where an outage was extended because we had to remove a bad server from the cnf and restart MaxScale in order to promote a known up-to-date slave. What's needed here is a way for an admin who knows what needs to be done to act when the technology is blocking the action.

Example command

maxctrl call command mariadbmon failover MariaDB-Monitor --force
WARNING: Replica could be out of date, introducing data loss. Continue (Y/n) Y
fooshop-2 is now master

Below is the status of a problem cluster where GTID replication is off:

[ACME] mariadb@maxscale-1: ~ $ maxctrl list servers
┌───────────┬──────────────┬──────┬─────────────┬────────────────┬─────────┬─────────────────┐
│ Server    │ Address      │ Port │ Connections │ State          │ GTID    │ Monitor         │
├───────────┼──────────────┼──────┼─────────────┼────────────────┼─────────┼─────────────────┤
│ fooshop-1 │ 131.21.1.185 │ 3306 │ 0           │ Down           │ 0-1-111 │ MariaDB-Monitor │
├───────────┼──────────────┼──────┼─────────────┼────────────────┼─────────┼─────────────────┤
│ fooshop-2 │ 131.21.1.248 │ 3306 │ 0           │ Slave, Running │ 0-1-111 │ MariaDB-Monitor │
├───────────┼──────────────┼──────┼─────────────┼────────────────┼─────────┼─────────────────┤
│ fooshop-3 │ 131.21.1.89  │ 3306 │ 0           │ Slave, Running │ 0-1-111 │ MariaDB-Monitor │
└───────────┴──────────────┴──────┴─────────────┴────────────────┴─────────┴─────────────────┘
 
 
'fooshop-2' cannot be selected because its replica connection to 'fooshop-1' is not using gtid.
'fooshop-3' cannot be selected because its replica connection to 'fooshop-1' is not using gtid.
 
[ACME] mariadb@maxscale-1: ~ $ maxctrl call command mariadbmon failover MariaDB-Monitor
Error: Server at http://127.0.0.1:8989 responded with 400 Bad Request to `POST maxscale/modules/mariadbmon/failover?MariaDB-Monitor`
{
    "links": {
        "self": "http://127.0.0.1:8989/v1/maxscale/modules/mariadbmon/failover/"
    },
    "meta": {
        "errors": [
            {
                "detail": "No suitable promotion candidate found:\n'fooshop-2' cannot be selected because its replica connection to 'fooshop-1' is not using gtid.\n'fooshop-3' cannot be selected because its replica connection to 'fooshop-1' is not using gtid."
            },
            {
                "detail": "Could not autoselect promotion target for failover."
            },
            {
                "detail": "Cluster gtid domain is unknown. This is usually caused by the cluster never having a primary server while MaxScale was running."
            },
            {
                "detail": "The slave connection 'fooshop-2' -> 'fooshop-1' is not using gtid replication."
            },
            {
                "detail": "The slave connection 'fooshop-3' -> 'fooshop-1' is not using gtid replication."
            },
            {
                "detail": "Failover cancelled."
            }
        ]
    }
}



 Comments   
Comment by markus makela [ 2023-09-13 ]

The existing reset-replication command seems to do most of what is required here. It prepares the cluster for use with automatic failover even at the risk of potential data loss: https://mariadb.com/kb/en/mariadb-maxscale-2308-mariadb-monitor/#operation-details
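
For reference, a sketch of how reset-replication could be applied to the cluster above (the command and its optional promotion-target argument are documented for mariadbmon; the monitor and server names are taken from the example output):

```shell
# Rebuild replication from scratch, promoting fooshop-2 as the new primary.
# WARNING: reset-replication deletes binary logs and resets GTIDs on the
# servers, so it can lose events; it is meant for a cluster that is already
# known to be broken.
maxctrl call command mariadbmon reset-replication MariaDB-Monitor fooshop-2
```

Without the trailing server argument the monitor picks the promotion target itself.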

The documentation doesn't seem to mention whether the monitor already does this, but one improvement would be to wait for the relay log to be consumed and to auto-select the best candidate based on the existing GTID positions.

Comment by markus makela [ 2023-09-15 ]

There's an optional argument to reset-replication that allows the caller to pick which server to promote. The fact that the replication configuration was modified even though the command claims to have failed should be filed as a separate bug. It's possible that the monitor didn't pick a primary server because the servers have read_only enabled. If enforce_simple_topology had been enabled along with enforce_writable_master and enforce_read_only_slaves, the monitor might have fixed the problem on its own.
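
A sketch of a monitor section with the settings mentioned above (parameter names as documented for mariadbmon; the server list, user, and password are illustrative placeholders):

```
[MariaDB-Monitor]
type=monitor
module=mariadbmon
servers=fooshop-1,fooshop-2,fooshop-3
user=maxuser
password=maxpwd
auto_failover=true
auto_rejoin=true
# Assume a plain primary-replica topology and allow riskier automatic actions.
enforce_simple_topology=true
# Keep the primary writable and the replicas read_only so the monitor can
# repair server roles on its own.
enforce_writable_master=true
enforce_read_only_slaves=true
```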

The switchover-force is identical to the normal switchover except that it does this:

Switchover-force performs the same steps as a normal switchover but ignores any errors on the old primary. Switchover-force also does not expect the new primary to reach the gtid-position of the old, as the old primary could be receiving more events constantly. Thus, switchover-force may lose events.

If the server that was labeled Master is down and you forcefully promote another node, then by definition you're not switching over to something, you're failing over to it. As such, MXS-3983 is as complete as it can be, and if further changes were made, they'd go either into failover or, if reset-replication is inadequate, into the not-yet-existing failover-force you've suggested. For the sake of consistency it might be better to have a failover-force that is similar to reset-replication but does not reset the GTIDs.
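
For comparison, the command shapes under discussion (switchover-force exists per MXS-3983 and takes the same optional new-primary/current-primary arguments as switchover; failover-force is hypothetical and shown only to illustrate the proposal):

```shell
# Existing: forced switchover that ignores errors on the old primary and does
# not wait for the new primary to catch up, so it may lose events.
maxctrl call command mariadbmon switchover-force MariaDB-Monitor fooshop-2 fooshop-1

# Hypothetical: a failover-force that promotes a replica despite failed
# GTID/safety checks, without resetting GTIDs as reset-replication does.
maxctrl call command mariadbmon failover-force MariaDB-Monitor
```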

Comment by Esa Korhonen [ 2023-10-16 ]

Unclear if anything is needed still. Closing for now.
