MariaDB MaxScale / MXS-6374

MariaDB Monitor does not properly verify its configuration when performing runtime modifications


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 24.02.9, 25.01.6, 25.10.1
    • Fix Version/s: 25.01.7, 25.10.3
    • Component/s: mariadbmon
    • Labels: None
    • Sprint: MXS-SPRINT-267, MXS-SPRINT-268

    Description

      When performing runtime modifications, the configuration management code ignores the value returned by the MariaDBMonitor post_configure() function (which detects cross-parameter dependencies) and takes the faulty configuration into use. This may leave the excluded-servers array pointing to servers that no longer exist, which can lead to a crash or, at the very least, to the promotion exclusion system working incorrectly. This can only happen when servers are removed from the monitor at runtime while they are still listed in the "servers_no_promotion" setting.

      The correct fix is to improve configuration validation of the monitor settings so that removing such servers simply fails, even at runtime, preferably with a descriptive error message.

      After the fix
      The monitor refuses a runtime attempt to remove a server if that server is still listed in "servers_no_promotion". In 25.10, "servers_no_cooperative_monitoring_locks" has the same effect. Adding a non-monitored server to either setting also fails.
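      The cross-parameter check and the fixed removal path can be illustrated with a small model. This is a hedged Python sketch, not MaxScale's actual C++ implementation; only post_configure and servers_no_promotion are names from the ticket, everything else is assumed:

      ```python
      def post_configure(monitored_servers, servers_no_promotion):
          """Model of the cross-parameter dependency check: every server named
          in servers_no_promotion must also be monitored. Returns (ok, error)."""
          unknown = [s for s in servers_no_promotion if s not in monitored_servers]
          if unknown:
              return False, ("servers_no_promotion refers to non-monitored "
                             "servers: " + ", ".join(unknown))
          return True, ""

      def remove_server_runtime(monitored_servers, servers_no_promotion, server):
          """After the fix: honor the check result instead of ignoring it, so a
          removal that would leave a dangling exclusion entry is refused."""
          candidate = [s for s in monitored_servers if s != server]
          ok, err = post_configure(candidate, servers_no_promotion)
          if not ok:
              return monitored_servers, err  # removal refused, old config kept
          return candidate, ""
      ```

      With this model, removing 'db30' while it is still in servers_no_promotion returns an error and leaves the server list unchanged, which mirrors the fixed behavior; the pre-fix bug corresponds to applying the candidate list regardless of the check result.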

      Original description:
      We had an incident where MaxScale failed over to a replica that was listed in `servers_no_promotion`.

      1. Add a database server using online configuration
      2. Add all the replicas to the `servers_no_promotion` config of the monitor
      3. Remove the new server, leaving the name in `servers_no_promotion`
      4. Stop the primary server

      We expected the cluster to stay down, since all the replicas should have been excluded from promotion. Instead, MaxScale still promoted one of the replicas.
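      The expected selection behavior can be sketched as follows (a hypothetical model; the real mariadbmon candidate selection also weighs GTID positions, server state, and other criteria):

      ```python
      def select_promotion_target(running_replicas, servers_no_promotion):
          """Expected behavior: a replica listed in servers_no_promotion is
          never a failover candidate; with every running replica excluded,
          there is no target and failover must fail (cluster stays down)."""
          candidates = [s for s in running_replicas
                        if s not in servers_no_promotion]
          return candidates[0] if candidates else None
      ```

      In the incident above, all running replicas were listed in the setting, so this model returns no candidate; the bug caused the stale exclusion list to no longer match the rebuilt server objects, so 'db30' was promoted anyway.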

      Actual log lines:

      2026-04-03 00:20:27   error  : Monitor was unable to connect to server db31[172.20.140.83:3306] : 'Connection to [172.20.140.83]:3306 failed. Error 2002: Can't connect to server on '172.20.140.83' (115)'
      2026-04-03 00:20:27   notice : Server changed state: db31[172.20.140.83:3306]: master_down. [Master, Running] -> [Down]
      2026-04-03 00:20:27   warning: [mariadbmon] Primary has failed. If primary does not return in 4 monitor tick(s), failover begins.
      2026-04-03 00:20:35   notice : [mariadbmon] Selecting a server to promote and replace 'db31'. Candidates are: 'db32', 'db27', 'db28', 'db29', 'db30', 'db33'.
      2026-04-03 00:20:35   warning: [mariadbmon] Replica 'db27' has gtid_strict_mode disabled. Enabling this setting is recommended. For more information, see https://mariadb.com/kb/en/library/gtid/#gtid_strict_mode
      2026-04-03 00:20:35   warning: [mariadbmon] Replica 'db28' has gtid_strict_mode disabled. Enabling this setting is recommended. For more information, see https://mariadb.com/kb/en/library/gtid/#gtid_strict_mode
      2026-04-03 00:20:35   warning: [mariadbmon] Replica 'db28' has log_slave_updates disabled. It is a valid candidate but replication will break for lagging replicas if 'db28' is promoted.
      2026-04-03 00:20:35   warning: [mariadbmon] Replica 'db29' has gtid_strict_mode disabled. Enabling this setting is recommended. For more information, see https://mariadb.com/kb/en/library/gtid/#gtid_strict_mode
      2026-04-03 00:20:35   warning: [mariadbmon] Replica 'db29' has log_slave_updates disabled. It is a valid candidate but replication will break for lagging replicas if 'db29' is promoted.
      2026-04-03 00:20:35   warning: [mariadbmon] Replica 'db30' has gtid_strict_mode disabled. Enabling this setting is recommended. For more information, see https://mariadb.com/kb/en/library/gtid/#gtid_strict_mode
      2026-04-03 00:20:35   warning: [mariadbmon] Replica 'db30' has log_slave_updates disabled. It is a valid candidate but replication will break for lagging replicas if 'db30' is promoted.
      2026-04-03 00:20:35   warning: [mariadbmon] Some servers were disqualified for promotion:\n'db32' cannot be selected because it is down or in maintenance.\n'db33' cannot be selected because it is down or in maintenance.
      2026-04-03 00:20:35   notice : [mariadbmon] Selected 'db30' because it has processed more events.
      2026-04-03 00:20:35   notice : [mariadbmon] Performing automatic failover to replace failed primary 'db31'.
      2026-04-03 00:20:35   notice : [mariadbmon] Redirecting 'db27', 'db28', 'db29' to replicate from 'db30' instead of 'db31'.
      2026-04-03 00:20:35   notice : [mariadbmon] All redirects successful.
      2026-04-03 00:20:36   notice : [mariadbmon] All redirected slaves successfully started replication from 'db30'.
      2026-04-03 00:20:36   notice : [mariadbmon] Failover 'db31' -> 'db30' performed.
      2026-04-03 00:20:36   notice : Server changed state: db30[172.20.137.202:3306]: new_master. [Slave, Running] -> [Master, Running]
      

      Configuration:

      $ grep "servers_no_promotion" /etc/maxscale.cnf
      servers_no_promotion=db27,db28,db29,db30,db33
      $ grep "servers_no_promotion" /var/lib/maxscale/maxscale.cnf.d/*
      /var/lib/maxscale/maxscale.cnf.d/MariaDB-Monitor.cnf:servers_no_promotion=db27,db28,db29,db30,db33
      $
      

      Attachments

        1. docker-compose.yml
          2 kB
        2. maxscale.cnf
          0.7 kB
        3. maxscale-list-servers.sh
          2 kB
        4. maxscale-show-monitor.sh
          2 kB
        5. reproduce_bug.sh
          17 kB


          People

            Assignee: Esa Korhonen
            Reporter: Will Fong
            Votes: 1
            Watchers: 8

            Dates

              Created:
              Updated:
              Resolved:
