Uploaded image for project: 'MariaDB MaxScale'
  1. MariaDB MaxScale
  2. MXS-4841

Mariadbmon selects diverged server as master

    XMLWordPrintable

Details

    • Bug
    • Status: Closed (View Workflow)
    • Major
    • Resolution: Duplicate
    • 6.4.11, 22.08.9, 23.02.5, 23.08.2
    • None
    • mariadbmon
    • None

    Description

      The problem occurs when a two node setup is configured with asynchronous replication with one server as the master (server1 on port 3000) and another as the slave (server2 on port 3001) and a stray write or other operation causes the slave server to logically diverge.

      If the slave's SQL thread fails to execute an event, replication will stop and the server will lose it's Slave label in MaxScale. If the master fails at this point, the normal master selection will end up choosing the broken slave as it is the only remaining running server.

      This is technically working as expected but this behavior makes it unreliable enough in the event of a diverged server that I believe it should be classified as a bug. The master_conditions should be amended with a new option that's enabled by default (e.g no_broken_replication) which checks that either replication is not enabled or the SQL thread has no errors in it before selecting a server as the new master.

      The defaults in the monitor for server selection should follow the "safe by default" approach that prevents problems in the data itself at the cost of high availability. While this might be desirable behavior in some configurations (e.g. with an external replication source that is "not of interest"), I would argue that in those cases the user should explicitly state that they accept a logically diverged server as Master in MaxScale.

      Test script:

      #!/bin/bash
       
      master="mariadb -D test -u maxuser -pmaxpwd -h 127.0.01 -P 3000"
      slave="mariadb -D test -u maxuser -pmaxpwd -h 127.0.01 -P 3001"
      $master -e "CREATE TABLE t1(id INT PRIMARY KEY, data INT)"
      $master -e "INSERT INTO t1 VALUES (1, 1)"
      sleep 1
      $slave -e "DELETE FROM t1"
      $master -e "UPDATE t1 SET data = 2 WHERE id = 1"
      sleep2
      docker stop server1
      
      

      Output of SHOW SLAVE STATUS:

      MariaDB [test]> SHOW SLAVE STATUS\G
      *************************** 1. row ***************************
                      Slave_IO_State: Waiting for master to send event
                         Master_Host: 127.0.0.1
                         Master_User: maxuser
                         Master_Port: 3000
                       Connect_Retry: 1
                     Master_Log_File: binlog.000002
                 Read_Master_Log_Pos: 943
                      Relay_Log_File: mysqld-relay-bin.000003
                       Relay_Log_Pos: 1010
               Relay_Master_Log_File: binlog.000002
                    Slave_IO_Running: Yes
                   Slave_SQL_Running: No
                Replicate_Rewrite_DB: 
                     Replicate_Do_DB: 
                 Replicate_Ignore_DB: 
                  Replicate_Do_Table: 
              Replicate_Ignore_Table: 
             Replicate_Wild_Do_Table: 
         Replicate_Wild_Ignore_Table: 
                          Last_Errno: 1032
                          Last_Error: Could not execute Update_rows_v1 event on table test.t1; Can't find record in 't1', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log binlog.000002, end_log_pos 912
                        Skip_Counter: 0
                 Exec_Master_Log_Pos: 714
                     Relay_Log_Space: 3371
                     Until_Condition: None
                      Until_Log_File: 
                       Until_Log_Pos: 0
                  Master_SSL_Allowed: No
                  Master_SSL_CA_File: 
                  Master_SSL_CA_Path: 
                     Master_SSL_Cert: 
                   Master_SSL_Cipher: 
                      Master_SSL_Key: 
               Seconds_Behind_Master: NULL
       Master_SSL_Verify_Server_Cert: No
                       Last_IO_Errno: 0
                       Last_IO_Error: 
                      Last_SQL_Errno: 1032
                      Last_SQL_Error: Could not execute Update_rows_v1 event on table test.t1; Can't find record in 't1', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log binlog.000002, end_log_pos 912
         Replicate_Ignore_Server_Ids: 
                    Master_Server_Id: 3000
                      Master_SSL_Crl: 
                  Master_SSL_Crlpath: 
                          Using_Gtid: Slave_Pos
                         Gtid_IO_Pos: 0-3000-12
             Replicate_Do_Domain_Ids: 
         Replicate_Ignore_Domain_Ids: 
                       Parallel_Mode: optimistic
                           SQL_Delay: 0
                 SQL_Remaining_Delay: NULL
             Slave_SQL_Running_State: 
                    Slave_DDL_Groups: 10
      Slave_Non_Transactional_Groups: 0
          Slave_Transactional_Groups: 2
      

      MaxScale configuration:

      [maxscale]
      threads=auto
       
      [server1]
      type=server
      address=127.0.0.1
      port=3000
      protocol=MariaDBBackend
       
      [server2]
      type=server
      address=127.0.0.1
      port=3001
      protocol=MariaDBBackend
       
      [MariaDB-Monitor]
      type=monitor
      module=mariadbmon
      servers=server1,server2
      user=maxuser
      password=maxpwd
      monitor_interval=1s
      backend_connect_timeout=1s
      backend_read_timeout=1s
      backend_write_timeout=1s
      

      MaxScale log when running that test script:

      2023-10-30 08:37:54   notice : Server changed state: server1[127.0.0.1:3000]: master_up. [Down] -> [Master, Running]
      2023-10-30 08:37:54   notice : Server changed state: server2[127.0.0.1:3001]: slave_up. [Down] -> [Slave, Running]
      2023-10-30 08:37:54   notice : No services defined in any of the configuration files
      2023-10-30 08:37:54   notice : MaxScale started with 8 worker threads.
      2023-10-30 08:38:01   notice : Server changed state: server2[127.0.0.1:3001]: lost_slave. [Slave, Running] -> [Running]: Host: [127.0.0.1]:3000, IO Running: Yes, SQL Running: No, SQL Error: Could not execute Update_rows_v1 event on table test.t1; Can't find record in 't1', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log binlog.000002, end_log_pos 912
      2023-10-30 08:38:03   error  : Monitor was unable to connect to server server1[127.0.0.1:3000] : 'Can't connect to server on '127.0.0.1' (115)'
      2023-10-30 08:38:03   warning: [mariadbmon] 'server2' is a better primary candidate than the current primary 'server1'. Primary will change when 'server1' is no longer a valid primary.
      2023-10-30 08:38:03   notice : Server changed state: server1[127.0.0.1:3000]: master_down. [Master, Running] -> [Down]
      2023-10-30 08:38:08   warning: [mariadbmon] The current primary server 'server1' is no longer valid because it has been down over 5 (failcount) monitor updates and it does not have any running slaves. Selecting new primary server.
      2023-10-30 08:38:08   warning: [mariadbmon] 'server1' is not a valid master candidate because it's down.
      2023-10-30 08:38:08   notice : [mariadbmon] Setting 'server2' as primary.
      2023-10-30 08:38:08   notice : Server changed state: server2[127.0.0.1:3001]: new_master. [Running] -> [Master, Running]
      

      Attachments

        Issue Links

          Activity

            People

              esa.korhonen Esa Korhonen
              markus makela markus makela
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.