[MXS-4841] Mariadbmon selects diverged server as master Created: 2023-10-30  Updated: 2023-11-30  Resolved: 2023-11-28

Status: Closed
Project: MariaDB MaxScale
Component/s: mariadbmon
Affects Version/s: 6.4.11, 22.08.9, 23.02.5, 23.08.2
Fix Version/s: None

Type: Bug Priority: Major
Reporter: markus makela Assignee: Esa Korhonen
Resolution: Duplicate Votes: 0
Labels: None

Issue Links:
Relates
relates to MXS-4798 Broken slave promoted to master when ... Closed

 Description   

The problem occurs when a two node setup is configured with asynchronous replication with one server as the master (server1 on port 3000) and another as the slave (server2 on port 3001) and a stray write or other operation causes the slave server to logically diverge.

If the slave's SQL thread fails to execute an event, replication will stop and the server will lose it's Slave label in MaxScale. If the master fails at this point, the normal master selection will end up choosing the broken slave as it is the only remaining running server.

This is technically working as expected but this behavior makes it unreliable enough in the event of a diverged server that I believe it should be classified as a bug. The master_conditions should be amended with a new option that's enabled by default (e.g no_broken_replication) which checks that either replication is not enabled or the SQL thread has no errors in it before selecting a server as the new master.

The defaults in the monitor for server selection should follow the "safe by default" approach that prevents problems in the data itself at the cost of high availability. While this might be desirable behavior in some configurations (e.g. with an external replication source that is "not of interest"), I would argue that in those cases the user should explicitly state that they accept a logically diverged server as Master in MaxScale.

Test script:

#!/bin/bash
 
master="mariadb -D test -u maxuser -pmaxpwd -h 127.0.01 -P 3000"
slave="mariadb -D test -u maxuser -pmaxpwd -h 127.0.01 -P 3001"
$master -e "CREATE TABLE t1(id INT PRIMARY KEY, data INT)"
$master -e "INSERT INTO t1 VALUES (1, 1)"
sleep 1
$slave -e "DELETE FROM t1"
$master -e "UPDATE t1 SET data = 2 WHERE id = 1"
sleep2
docker stop server1

Output of SHOW SLAVE STATUS:

MariaDB [test]> SHOW SLAVE STATUS\G
*************************** 1. row ***************************
                Slave_IO_State: Waiting for master to send event
                   Master_Host: 127.0.0.1
                   Master_User: maxuser
                   Master_Port: 3000
                 Connect_Retry: 1
               Master_Log_File: binlog.000002
           Read_Master_Log_Pos: 943
                Relay_Log_File: mysqld-relay-bin.000003
                 Relay_Log_Pos: 1010
         Relay_Master_Log_File: binlog.000002
              Slave_IO_Running: Yes
             Slave_SQL_Running: No
          Replicate_Rewrite_DB: 
               Replicate_Do_DB: 
           Replicate_Ignore_DB: 
            Replicate_Do_Table: 
        Replicate_Ignore_Table: 
       Replicate_Wild_Do_Table: 
   Replicate_Wild_Ignore_Table: 
                    Last_Errno: 1032
                    Last_Error: Could not execute Update_rows_v1 event on table test.t1; Can't find record in 't1', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log binlog.000002, end_log_pos 912
                  Skip_Counter: 0
           Exec_Master_Log_Pos: 714
               Relay_Log_Space: 3371
               Until_Condition: None
                Until_Log_File: 
                 Until_Log_Pos: 0
            Master_SSL_Allowed: No
            Master_SSL_CA_File: 
            Master_SSL_CA_Path: 
               Master_SSL_Cert: 
             Master_SSL_Cipher: 
                Master_SSL_Key: 
         Seconds_Behind_Master: NULL
 Master_SSL_Verify_Server_Cert: No
                 Last_IO_Errno: 0
                 Last_IO_Error: 
                Last_SQL_Errno: 1032
                Last_SQL_Error: Could not execute Update_rows_v1 event on table test.t1; Can't find record in 't1', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log binlog.000002, end_log_pos 912
   Replicate_Ignore_Server_Ids: 
              Master_Server_Id: 3000
                Master_SSL_Crl: 
            Master_SSL_Crlpath: 
                    Using_Gtid: Slave_Pos
                   Gtid_IO_Pos: 0-3000-12
       Replicate_Do_Domain_Ids: 
   Replicate_Ignore_Domain_Ids: 
                 Parallel_Mode: optimistic
                     SQL_Delay: 0
           SQL_Remaining_Delay: NULL
       Slave_SQL_Running_State: 
              Slave_DDL_Groups: 10
Slave_Non_Transactional_Groups: 0
    Slave_Transactional_Groups: 2

MaxScale configuration:

[maxscale]
threads=auto
 
[server1]
type=server
address=127.0.0.1
port=3000
protocol=MariaDBBackend
 
[server2]
type=server
address=127.0.0.1
port=3001
protocol=MariaDBBackend
 
[MariaDB-Monitor]
type=monitor
module=mariadbmon
servers=server1,server2
user=maxuser
password=maxpwd
monitor_interval=1s
backend_connect_timeout=1s
backend_read_timeout=1s
backend_write_timeout=1s

MaxScale log when running that test script:

2023-10-30 08:37:54   notice : Server changed state: server1[127.0.0.1:3000]: master_up. [Down] -> [Master, Running]
2023-10-30 08:37:54   notice : Server changed state: server2[127.0.0.1:3001]: slave_up. [Down] -> [Slave, Running]
2023-10-30 08:37:54   notice : No services defined in any of the configuration files
2023-10-30 08:37:54   notice : MaxScale started with 8 worker threads.
2023-10-30 08:38:01   notice : Server changed state: server2[127.0.0.1:3001]: lost_slave. [Slave, Running] -> [Running]: Host: [127.0.0.1]:3000, IO Running: Yes, SQL Running: No, SQL Error: Could not execute Update_rows_v1 event on table test.t1; Can't find record in 't1', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log binlog.000002, end_log_pos 912
2023-10-30 08:38:03   error  : Monitor was unable to connect to server server1[127.0.0.1:3000] : 'Can't connect to server on '127.0.0.1' (115)'
2023-10-30 08:38:03   warning: [mariadbmon] 'server2' is a better primary candidate than the current primary 'server1'. Primary will change when 'server1' is no longer a valid primary.
2023-10-30 08:38:03   notice : Server changed state: server1[127.0.0.1:3000]: master_down. [Master, Running] -> [Down]
2023-10-30 08:38:08   warning: [mariadbmon] The current primary server 'server1' is no longer valid because it has been down over 5 (failcount) monitor updates and it does not have any running slaves. Selecting new primary server.
2023-10-30 08:38:08   warning: [mariadbmon] 'server1' is not a valid master candidate because it's down.
2023-10-30 08:38:08   notice : [mariadbmon] Setting 'server2' as primary.
2023-10-30 08:38:08   notice : Server changed state: server2[127.0.0.1:3001]: new_master. [Running] -> [Master, Running]


Generated at Thu Feb 08 04:31:32 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.