Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Duplicate
Affects Version/s: 6.4.11, 22.08.9, 23.02.5, 23.08.2
Fix Version/s: None
Component/s: None
Description
The problem occurs in a two-node setup configured with asynchronous replication, with one server as the master (server1 on port 3000) and the other as the slave (server2 on port 3001), when a stray write or some other operation causes the slave to logically diverge from the master.
If the slave's SQL thread fails to execute an event, replication stops and the server loses its Slave label in MaxScale. If the master fails at this point, the normal master selection ends up choosing the broken slave, as it is the only remaining running server.
Technically this is working as expected, but the behavior is unreliable enough in the presence of a diverged server that I believe it should be classified as a bug. The master_conditions parameter should be amended with a new option that is enabled by default (e.g. no_broken_replication) and checks that either replication is not configured or the SQL thread has no errors before a server is selected as the new master.
The monitor's defaults for server selection should follow a "safe by default" approach that prevents problems in the data itself at the cost of high availability. While the current behavior might be desirable in some configurations (e.g. with an external replication source that is "not of interest"), I would argue that in those cases the user should explicitly state that they accept a logically diverged server as Master in MaxScale.
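To make the proposal concrete, here is a purely hypothetical sketch of how the option could be enabled in the monitor section used in this report. The option name no_broken_replication does not exist in any current MaxScale release; it is only the suggested default, and how it would combine with the existing master_conditions values is left open.
[MariaDB-Monitor]
type=monitor
module=mariadbmon
servers=server1,server2
user=maxuser
password=maxpwd
# Hypothetical: refuse to select a server as the new master if its SQL thread
# has stopped with an error.
master_conditions=no_broken_replication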
Test script:
#!/bin/bash

master="mariadb -D test -u maxuser -pmaxpwd -h 127.0.0.1 -P 3000"
slave="mariadb -D test -u maxuser -pmaxpwd -h 127.0.0.1 -P 3001"

$master -e "CREATE TABLE t1(id INT PRIMARY KEY, data INT)"
$master -e "INSERT INTO t1 VALUES (1, 1)"
sleep 1
# Stray write directly on the slave makes it diverge from the master.
$slave -e "DELETE FROM t1"
# This update can no longer be applied on the slave (the row is gone),
# so the slave's SQL thread stops with error 1032.
$master -e "UPDATE t1 SET data = 2 WHERE id = 1"
sleep 2

docker stop server1
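After the script has run and failcount monitor intervals have passed, the resulting server states can be inspected with maxctrl (assuming maxctrl is set up to reach this MaxScale instance):
maxctrl list servers
At that point server2 is listed as Master even though its SQL thread has failed, as the log at the end of this report shows.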
Output of SHOW SLAVE STATUS:
MariaDB [test]> SHOW SLAVE STATUS\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 127.0.0.1
Master_User: maxuser
Master_Port: 3000
Connect_Retry: 1
Master_Log_File: binlog.000002
Read_Master_Log_Pos: 943
Relay_Log_File: mysqld-relay-bin.000003
Relay_Log_Pos: 1010
Relay_Master_Log_File: binlog.000002
Slave_IO_Running: Yes
Slave_SQL_Running: No
Replicate_Rewrite_DB:
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 1032
Last_Error: Could not execute Update_rows_v1 event on table test.t1; Can't find record in 't1', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log binlog.000002, end_log_pos 912
Skip_Counter: 0
Exec_Master_Log_Pos: 714
Relay_Log_Space: 3371
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 1032
Last_SQL_Error: Could not execute Update_rows_v1 event on table test.t1; Can't find record in 't1', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log binlog.000002, end_log_pos 912
Replicate_Ignore_Server_Ids:
Master_Server_Id: 3000
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: Slave_Pos
Gtid_IO_Pos: 0-3000-12
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: optimistic
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State:
Slave_DDL_Groups: 10
Slave_Non_Transactional_Groups: 0
Slave_Transactional_Groups: 2
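The Last_SQL_Errno and Last_SQL_Error fields above are what the proposed check would look at. As a rough standalone illustration only (this is not an existing MaxScale feature; the credentials and the host/port arguments are taken from the test script above), the condition could be expressed like this:
#!/bin/bash
# Illustration of the condition the proposed no_broken_replication option would
# enforce before accepting a server as a master candidate. Not part of MaxScale.
host=$1
port=$2
status=$(mariadb -u maxuser -pmaxpwd -h "$host" -P "$port" -e "SHOW SLAVE STATUS\G")
if [ -z "$status" ]; then
    # No replication configured on this server: nothing can be broken.
    echo "OK: replication not configured"
elif echo "$status" | grep -q "Last_SQL_Errno: 0"; then
    echo "OK: SQL thread has no errors"
else
    echo "REJECT: SQL thread has failed, server may have diverged"
fi
For the diverged server2 in this report, the script would print the REJECT line, since Last_SQL_Errno is 1032.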
MaxScale configuration:
[maxscale]
threads=auto

[server1]
type=server
address=127.0.0.1
port=3000
protocol=MariaDBBackend

[server2]
type=server
address=127.0.0.1
port=3001
protocol=MariaDBBackend

[MariaDB-Monitor]
type=monitor
module=mariadbmon
servers=server1,server2
user=maxuser
password=maxpwd
monitor_interval=1s
backend_connect_timeout=1s
backend_read_timeout=1s
backend_write_timeout=1s
MaxScale log when running that test script:
2023-10-30 08:37:54 notice : Server changed state: server1[127.0.0.1:3000]: master_up. [Down] -> [Master, Running]
2023-10-30 08:37:54 notice : Server changed state: server2[127.0.0.1:3001]: slave_up. [Down] -> [Slave, Running]
2023-10-30 08:37:54 notice : No services defined in any of the configuration files
2023-10-30 08:37:54 notice : MaxScale started with 8 worker threads.
2023-10-30 08:38:01 notice : Server changed state: server2[127.0.0.1:3001]: lost_slave. [Slave, Running] -> [Running]: Host: [127.0.0.1]:3000, IO Running: Yes, SQL Running: No, SQL Error: Could not execute Update_rows_v1 event on table test.t1; Can't find record in 't1', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log binlog.000002, end_log_pos 912
2023-10-30 08:38:03 error : Monitor was unable to connect to server server1[127.0.0.1:3000] : 'Can't connect to server on '127.0.0.1' (115)'
2023-10-30 08:38:03 warning: [mariadbmon] 'server2' is a better primary candidate than the current primary 'server1'. Primary will change when 'server1' is no longer a valid primary.
2023-10-30 08:38:03 notice : Server changed state: server1[127.0.0.1:3000]: master_down. [Master, Running] -> [Down]
2023-10-30 08:38:08 warning: [mariadbmon] The current primary server 'server1' is no longer valid because it has been down over 5 (failcount) monitor updates and it does not have any running slaves. Selecting new primary server.
2023-10-30 08:38:08 warning: [mariadbmon] 'server1' is not a valid master candidate because it's down.
2023-10-30 08:38:08 notice : [mariadbmon] Setting 'server2' as primary.
2023-10-30 08:38:08 notice : Server changed state: server2[127.0.0.1:3001]: new_master. [Running] -> [Master, Running]
Issue Links
- relates to MXS-4798: Broken slave promoted to master when no other servers are available (Closed)