Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Duplicate
Affects Version/s: 6.4.11, 22.08.9, 23.02.5, 23.08.2
Fix Version/s: None
Component/s: None
Description
The problem occurs in a two-node setup configured with asynchronous replication, with one server as the master (server1 on port 3000) and the other as the slave (server2 on port 3001), when a stray write or some other operation causes the slave to logically diverge from the master.
If the slave's SQL thread fails to execute an event, replication stops and the server loses its Slave label in MaxScale. If the master fails at this point, the normal master selection ends up choosing the broken slave, as it is the only remaining running server.
Technically this is working as expected, but the behavior is unreliable enough in the presence of a diverged server that I believe it should be classified as a bug. The master_conditions parameter should be amended with a new option that is enabled by default (e.g. no_broken_replication) and checks that either replication is not configured or the SQL thread has no errors before a server is selected as the new master.
The monitor's defaults for server selection should follow a "safe by default" approach that prevents problems in the data itself at the cost of high availability. While the current behavior might be desirable in some configurations (e.g. with an external replication source that is "not of interest"), I would argue that in those cases the user should explicitly state that they accept a logically diverged server as Master in MaxScale.
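To make the proposal concrete, here is a purely hypothetical sketch of how the option could be enabled in the monitor section used in this report. The option name no_broken_replication does not exist in any current MaxScale release; it is only the suggested default, and how it would combine with the existing master_conditions values is left open.
[MariaDB-Monitor]
type=monitor
module=mariadbmon
servers=server1,server2
user=maxuser
password=maxpwd
# Hypothetical: refuse to select a server as the new master if its SQL thread
# has stopped with an error.
master_conditions=no_broken_replication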
Test script:
#!/bin/bash

master="mariadb -D test -u maxuser -pmaxpwd -h 127.0.0.1 -P 3000"
slave="mariadb -D test -u maxuser -pmaxpwd -h 127.0.0.1 -P 3001"

$master -e "CREATE TABLE t1(id INT PRIMARY KEY, data INT)"
$master -e "INSERT INTO t1 VALUES (1, 1)"
sleep 1
# Stray write directly on the slave makes it diverge from the master.
$slave -e "DELETE FROM t1"
# This update can no longer be applied on the slave (the row is gone),
# so the slave's SQL thread stops with error 1032.
$master -e "UPDATE t1 SET data = 2 WHERE id = 1"
sleep 2

docker stop server1
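After the script has run and failcount monitor intervals have passed, the resulting server states can be inspected with maxctrl (assuming maxctrl is set up to reach this MaxScale instance):
maxctrl list servers
At that point server2 is listed as Master even though its SQL thread has failed, as the log at the end of this report shows.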
Output of SHOW SLAVE STATUS:
MariaDB [test]> SHOW SLAVE STATUS\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 127.0.0.1
Master_User: maxuser
Master_Port: 3000
Connect_Retry: 1
Master_Log_File: binlog.000002
Read_Master_Log_Pos: 943
Relay_Log_File: mysqld-relay-bin.000003
Relay_Log_Pos: 1010
Relay_Master_Log_File: binlog.000002
Slave_IO_Running: Yes
Slave_SQL_Running: No
Replicate_Rewrite_DB:
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 1032
Last_Error: Could not execute Update_rows_v1 event on table test.t1; Can't find record in 't1', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log binlog.000002, end_log_pos 912
Skip_Counter: 0
Exec_Master_Log_Pos: 714
Relay_Log_Space: 3371
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 1032
Last_SQL_Error: Could not execute Update_rows_v1 event on table test.t1; Can't find record in 't1', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log binlog.000002, end_log_pos 912
Replicate_Ignore_Server_Ids:
Master_Server_Id: 3000
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: Slave_Pos
Gtid_IO_Pos: 0-3000-12
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: optimistic
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State:
Slave_DDL_Groups: 10
Slave_Non_Transactional_Groups: 0
Slave_Transactional_Groups: 2
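The Last_SQL_Errno and Last_SQL_Error fields above are what the proposed check would look at. As a rough standalone illustration only (this is not an existing MaxScale feature; the credentials and the host/port arguments are taken from the test script above), the condition could be expressed like this:
#!/bin/bash
# Illustration of the condition the proposed no_broken_replication option would
# enforce before accepting a server as a master candidate. Not part of MaxScale.
host=$1
port=$2
status=$(mariadb -u maxuser -pmaxpwd -h "$host" -P "$port" -e "SHOW SLAVE STATUS\G")
if [ -z "$status" ]; then
    # No replication configured on this server: nothing can be broken.
    echo "OK: replication not configured"
elif echo "$status" | grep -q "Last_SQL_Errno: 0"; then
    echo "OK: SQL thread has no errors"
else
    echo "REJECT: SQL thread has failed, server may have diverged"
fi
For the diverged server2 in this report, the script would print the REJECT line, since Last_SQL_Errno is 1032.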
MaxScale configuration:
[maxscale]
threads=auto

[server1]
type=server
address=127.0.0.1
port=3000
protocol=MariaDBBackend

[server2]
type=server
address=127.0.0.1
port=3001
protocol=MariaDBBackend

[MariaDB-Monitor]
type=monitor
module=mariadbmon
servers=server1,server2
user=maxuser
password=maxpwd
monitor_interval=1s
backend_connect_timeout=1s
backend_read_timeout=1s
backend_write_timeout=1s
MaxScale log when running that test script:
2023-10-30 08:37:54 notice : Server changed state: server1[127.0.0.1:3000]: master_up. [Down] -> [Master, Running]
2023-10-30 08:37:54 notice : Server changed state: server2[127.0.0.1:3001]: slave_up. [Down] -> [Slave, Running]
2023-10-30 08:37:54 notice : No services defined in any of the configuration files
2023-10-30 08:37:54 notice : MaxScale started with 8 worker threads.
2023-10-30 08:38:01 notice : Server changed state: server2[127.0.0.1:3001]: lost_slave. [Slave, Running] -> [Running]: Host: [127.0.0.1]:3000, IO Running: Yes, SQL Running: No, SQL Error: Could not execute Update_rows_v1 event on table test.t1; Can't find record in 't1', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log binlog.000002, end_log_pos 912
2023-10-30 08:38:03 error : Monitor was unable to connect to server server1[127.0.0.1:3000] : 'Can't connect to server on '127.0.0.1' (115)'
2023-10-30 08:38:03 warning: [mariadbmon] 'server2' is a better primary candidate than the current primary 'server1'. Primary will change when 'server1' is no longer a valid primary.
2023-10-30 08:38:03 notice : Server changed state: server1[127.0.0.1:3000]: master_down. [Master, Running] -> [Down]
2023-10-30 08:38:08 warning: [mariadbmon] The current primary server 'server1' is no longer valid because it has been down over 5 (failcount) monitor updates and it does not have any running slaves. Selecting new primary server.
2023-10-30 08:38:08 warning: [mariadbmon] 'server1' is not a valid master candidate because it's down.
2023-10-30 08:38:08 notice : [mariadbmon] Setting 'server2' as primary.
2023-10-30 08:38:08 notice : Server changed state: server2[127.0.0.1:3001]: new_master. [Running] -> [Master, Running]
Issue Links
- relates to MXS-4798: Broken slave promoted to master when no other servers are available (Closed)