Details
Bug
Status: Closed
Major
Resolution: Fixed
24.02.9, 25.01.6, 25.10.1
None
MXS-SPRINT-267, MXS-SPRINT-268
Description
During runtime modifications, the configuration management code ignores the value returned by the MariaDBMonitor post_configure() function (which checks cross-parameter dependencies) and takes the faulty configuration into use. As a result, the array of excluded servers may point to nonexistent servers, which can lead to a crash or, at the very least, to the promotion exclusion system working incorrectly. This can only happen when servers are removed from the monitor at runtime while they are still listed in the "servers_no_promotion" setting.
The correct fix to this issue is to improve config validation of the monitor settings such that removing the servers simply fails, even during runtime. Preferably the failure would come with a descriptive error message.
After the fix
The monitor refuses runtime attempts to remove a server if that server is still in "servers_no_promotion". In 25.10, "servers_no_cooperative_monitoring_locks" has a similar effect. Adding a non-monitored server to either of the settings also fails.
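The rule described above can be sketched as a single cross-parameter check. This is only an illustrative model in Python, not the MariaDBMonitor C++ code; the function name `validate_exclusion_settings` and the error text are hypothetical. It assumes a change is acceptable only if every name in an exclusion setting refers to a server that the monitor would still be monitoring after the change.

```python
def validate_exclusion_settings(monitored_servers, settings):
    """Return a list of error strings; an empty list means the change is acceptable.

    monitored_servers: server names the monitor would have after the change.
    settings: mapping of setting name -> list of server names in that setting.
    (Hypothetical helper, not part of MaxScale.)
    """
    monitored = set(monitored_servers)
    errors = []
    for setting, names in settings.items():
        for name in names:
            if name not in monitored:
                errors.append(f"{setting}: '{name}' is not a monitored server")
    return errors


# Removing db30 from the monitor while it is still in servers_no_promotion
# leaves a dangling name, so the change is refused with one error.
print(validate_exclusion_settings(
    ["db27", "db28"],
    {"servers_no_promotion": ["db27", "db30"]},
))

# A setting that only names monitored servers produces no errors.
print(validate_exclusion_settings(
    ["db27", "db28", "db30"],
    {"servers_no_promotion": ["db27", "db30"]},
))
```

Because the check runs against the server list as it would be after the change, the same test covers both cases mentioned above: removing a server that is still listed, and adding a non-monitored server to a setting.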
Original description:
We had an incident where MaxScale failed over to a replica that was listed in `servers_no_promotion`.
- Add a database server using online configuration
- Add all the replicas to the `servers_no_promotion` config of the monitor
- Remove the new server, leaving the name in `servers_no_promotion`
- Stop the primary server
We expected the cluster to stay down, since all the replicas should have been excluded from promotion. Instead, MaxScale still promoted one of the replicas.
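The expected outcome can be checked with a small model of the exclusion rule. This is a sketch only: the server names come from the log and configuration included in this report, while `promotion_candidates` is a hypothetical helper that models the rule "skip servers that are down or listed in `servers_no_promotion`" and is not the monitor's actual selection algorithm.

```python
def promotion_candidates(replicas, down, servers_no_promotion):
    """Return the replicas that remain eligible for promotion.

    Hypothetical model: a replica is eligible only if it is running
    and is not named in servers_no_promotion.
    """
    excluded = set(servers_no_promotion)
    return [r for r in replicas if r not in down and r not in excluded]


# Candidates as listed in the failover log of this report.
replicas = ["db32", "db27", "db28", "db29", "db30", "db33"]
# Servers the log reports as down or in maintenance.
down = {"db32", "db33"}
# Value of servers_no_promotion shown in the configuration.
no_promotion = ["db27", "db28", "db29", "db30", "db33"]

print(promotion_candidates(replicas, down, no_promotion))  # prints []
```

With the exclusion list honored, no candidate remains, which matches the expectation that no failover should have taken place.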
Actual log lines:
2026-04-03 00:20:27 error : Monitor was unable to connect to server db31[172.20.140.83:3306] : 'Connection to [172.20.140.83]:3306 failed. Error 2002: Can't connect to server on '172.20.140.83' (115)'
2026-04-03 00:20:27 notice : Server changed state: db31[172.20.140.83:3306]: master_down. [Master, Running] -> [Down]
2026-04-03 00:20:27 warning: [mariadbmon] Primary has failed. If primary does not return in 4 monitor tick(s), failover begins.
2026-04-03 00:20:35 notice : [mariadbmon] Selecting a server to promote and replace 'db31'. Candidates are: 'db32', 'db27', 'db28', 'db29', 'db30', 'db33'.
2026-04-03 00:20:35 warning: [mariadbmon] Replica 'db27' has gtid_strict_mode disabled. Enabling this setting is recommended. For more information, see https://mariadb.com/kb/en/library/gtid/#gtid_strict_mode
2026-04-03 00:20:35 warning: [mariadbmon] Replica 'db28' has gtid_strict_mode disabled. Enabling this setting is recommended. For more information, see https://mariadb.com/kb/en/library/gtid/#gtid_strict_mode
2026-04-03 00:20:35 warning: [mariadbmon] Replica 'db28' has log_slave_updates disabled. It is a valid candidate but replication will break for lagging replicas if 'db28' is promoted.
2026-04-03 00:20:35 warning: [mariadbmon] Replica 'db29' has gtid_strict_mode disabled. Enabling this setting is recommended. For more information, see https://mariadb.com/kb/en/library/gtid/#gtid_strict_mode
2026-04-03 00:20:35 warning: [mariadbmon] Replica 'db29' has log_slave_updates disabled. It is a valid candidate but replication will break for lagging replicas if 'db29' is promoted.
2026-04-03 00:20:35 warning: [mariadbmon] Replica 'db30' has gtid_strict_mode disabled. Enabling this setting is recommended. For more information, see https://mariadb.com/kb/en/library/gtid/#gtid_strict_mode
2026-04-03 00:20:35 warning: [mariadbmon] Replica 'db30' has log_slave_updates disabled. It is a valid candidate but replication will break for lagging replicas if 'db30' is promoted.
2026-04-03 00:20:35 warning: [mariadbmon] Some servers were disqualified for promotion:\n'db32' cannot be selected because it is down or in maintenance.\n'db33' cannot be selected because it is down or in maintenance.
2026-04-03 00:20:35 notice : [mariadbmon] Selected 'db30' because it has processed more events.
2026-04-03 00:20:35 notice : [mariadbmon] Performing automatic failover to replace failed primary 'db31'.
2026-04-03 00:20:35 notice : [mariadbmon] Redirecting 'db27', 'db28', 'db29' to replicate from 'db30' instead of 'db31'.
2026-04-03 00:20:35 notice : [mariadbmon] All redirects successful.
2026-04-03 00:20:36 notice : [mariadbmon] All redirected slaves successfully started replication from 'db30'.
2026-04-03 00:20:36 notice : [mariadbmon] Failover 'db31' -> 'db30' performed.
2026-04-03 00:20:36 notice : Server changed state: db30[172.20.137.202:3306]: new_master. [Slave, Running] -> [Master, Running]
Configuration:
$ grep "servers_no_promotion" /etc/maxscale.cnf
servers_no_promotion=db27,db28,db29,db30,db33
$ grep "servers_no_promotion" /var/lib/maxscale/maxscale.cnf.d/*
/var/lib/maxscale/maxscale.cnf.d/MariaDB-Monitor.cnf:servers_no_promotion=db27,db28,db29,db30,db33
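On versions without the fix, a persisted configuration file can be checked offline for names in `servers_no_promotion` that the monitor no longer lists in `servers`. The following is a minimal sketch, assuming the file is plain ini-style text that Python's `configparser` can read and that monitor sections carry `type=monitor`. The helper names are hypothetical, and the sample configuration, including its `servers` list and the stale name `db34`, is invented for illustration and is not taken from this report.

```python
import configparser


def split_names(value):
    """Split a comma-separated server list into clean names."""
    return [name.strip() for name in value.split(",") if name.strip()]


def dangling_exclusions(config_text):
    """Map each monitor section to servers_no_promotion names missing from servers.

    Hypothetical helper: only sections with type=monitor are inspected.
    """
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(config_text)
    result = {}
    for section in parser.sections():
        if parser.get(section, "type", fallback="") != "monitor":
            continue
        servers = split_names(parser.get(section, "servers", fallback=""))
        excluded = split_names(
            parser.get(section, "servers_no_promotion", fallback="")
        )
        missing = [name for name in excluded if name not in servers]
        if missing:
            result[section] = missing
    return result


# Invented sample: db34 was removed from "servers" but left in the exclusion list.
SAMPLE = """
[MariaDB-Monitor]
type=monitor
module=mariadbmon
servers=db27,db28,db29,db30,db31,db32,db33
servers_no_promotion=db27,db28,db29,db30,db33,db34
"""

print(dangling_exclusions(SAMPLE))
```

For the sample above the function reports `db34` under the `MariaDB-Monitor` section. An empty result means every excluded name still refers to a monitored server.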