[MXS-1668] Switchover under load condition causes out-of-order replicated GTID Created: 2018-02-14  Updated: 2019-06-10  Resolved: 2018-04-17

Status: Closed
Project: MariaDB MaxScale
Component/s: mariadbmon
Affects Version/s: 2.2.1
Fix Version/s: 2.2.4

Type: Bug Priority: Major
Reporter: Dipti Joshi (Inactive) Assignee: Esa Korhonen
Resolution: Cannot Reproduce Votes: 0
Labels: None

Sprint: MXS-SPRINT-54, MXS-SPRINT-55, MXS-SPRINT-56

 Description   

When a switchover was done under load, the new master appeared to create GTID events that broke replication on the old master due to out-of-order replicated GTID. A hypothesis is that this is possible caused by disabling read-only on the candidate master too early by MaxScale during switchover command processing

[root@ip-172-31-36-235 centos]# maxctrl list servers
┌─────────┬───────────────┬──────┬─────────────┬─────────────────┐
│ Server  │ Address       │ Port │ Connections │ State           │
├─────────┼───────────────┼──────┼─────────────┼─────────────────┤
│ dbserv1 │ 172.31.41.18533060           │ Slave, Running  │
├─────────┼───────────────┼──────┼─────────────┼─────────────────┤
│ dbserv2 │ 172.31.33.9033060           │ Master, Running │
├─────────┼───────────────┼──────┼─────────────┼─────────────────┤
│ dbserv3 │ 172.31.34.3633060           │ Slave, Running  │
├─────────┼───────────────┼──────┼─────────────┼─────────────────┤
│ dbserv4 │ 172.31.41.7633060           │ Slave, Running  │
└─────────┴───────────────┴──────┴─────────────┴─────────────────┘
[root@ip-172-31-36-235 centos]# maxctrl call command mariadbmon switchover Rep_Monitor dbserv1 dbserv2
OK
[root@ip-172-31-36-235 centos]# maxctrl list servers
┌─────────┬───────────────┬──────┬─────────────┬─────────────────┐
│ Server  │ Address       │ Port │ Connections │ State           │
├─────────┼───────────────┼──────┼─────────────┼─────────────────┤
│ dbserv1 │ 172.31.41.18533060           │ Master, Running │
├─────────┼───────────────┼──────┼─────────────┼─────────────────┤
│ dbserv2 │ 172.31.33.9033060           │ Running         │
├─────────┼───────────────┼──────┼─────────────┼─────────────────┤
│ dbserv3 │ 172.31.34.3633060           │ Slave, Running  │
├─────────┼───────────────┼──────┼─────────────┼─────────────────┤
│ dbserv4 │ 172.31.41.7633060           │ Slave, Running  │
└─────────┴───────────────┴──────┴─────────────┴─────────────────┘
2018-02-13 19:43:03   notice : [mariadbmon] Stopped the monitor Rep_Monitor for the duration of switchover.
2018-02-13 19:43:03   notice : [mariadbmon] Demoting server 'dbserv1'.
2018-02-13 19:43:04   notice : [mariadbmon] Promoting server 'dbserv2' to master.
2018-02-13 19:43:04   notice : [mariadbmon] Old master 'dbserv1' starting replication from 'dbserv2'.
2018-02-13 19:43:04   notice : [mariadbmon] Redirecting slaves to new master.
2018-02-13 19:43:04   notice : [mariadbmon] Slave 'dbserv3' redirected to new master.
2018-02-13 19:43:04   notice : [mariadbmon] Slave 'dbserv4' redirected to new master.
2018-02-13 19:43:04   notice : [mariadbmon] Switchover dbserv1 -> dbserv2 performed.
2018-02-13 19:43:04   notice : Loaded server states from journal file: /var/lib/maxscale/Rep_Monitor/monitor.dat
2018-02-13 19:43:04   notice : Server changed state: dbserv1[172.31.41.185:3306]: new_slave. [Master, Running] -> [Slave, Running]
2018-02-13 19:43:04   notice : Server changed state: dbserv2[172.31.33.90:3306]: new_master. [Slave, Running] -> [Master, Running]
2018-02-13 19:43:04   notice : [mariadbmon] A Master Server is now available: 172.31.33.90:3306
2018-02-13 19:44:15   notice : [mariadbmon] Stopped the monitor Rep_Monitor for the duration of switchover.
2018-02-13 19:44:15   notice : [mariadbmon] Demoting server 'dbserv2'.
2018-02-13 19:44:15   notice : [mariadbmon] Promoting server 'dbserv1' to master.
2018-02-13 19:44:15   notice : [mariadbmon] Old master 'dbserv2' starting replication from 'dbserv1'.
2018-02-13 19:44:15   notice : [mariadbmon] Redirecting slaves to new master.
2018-02-13 19:44:15   notice : [mariadbmon] Slave 'dbserv3' redirected to new master.
2018-02-13 19:44:15   notice : [mariadbmon] Slave 'dbserv4' redirected to new master.
2018-02-13 19:44:15   notice : [mariadbmon] Switchover dbserv2 -> dbserv1 performed.
2018-02-13 19:44:15   notice : Loaded server states from journal file: /var/lib/maxscale/Rep_Monitor/monitor.dat
2018-02-13 19:44:15   notice : Server changed state: dbserv1[172.31.41.185:3306]: new_master. [Slave, Running] -> [Master, Running]
2018-02-13 19:44:15   notice : Server changed state: dbserv2[172.31.33.90:3306]: new_slave. [Master, Running] -> [Slave, Running]
2018-02-13 19:44:16   notice : Server changed state: dbserv2[172.31.33.90:3306]: lost_slave. [Slave, Running] -> [Running]
[root@ip-172-31-36-235 centos]# 
172.31.41.185:
MariaDB [(none)]> show variables like '%gtid%';
+------------------------+-----------------------+
| Variable_name          | Value                 |
+------------------------+-----------------------+
| gtid_binlog_pos        | 0-1-717942            |
| gtid_binlog_state      | 0-3-717926,0-1-717942 |
| gtid_current_pos       | 0-1-717942            |
| gtid_domain_id         | 0                     |
| gtid_ignore_duplicates | OFF                   |
| gtid_seq_no            | 0                     |
| gtid_slave_pos         | 0-3-717926            |
| gtid_strict_mode       | ON                    |
| last_gtid              |                       |
| wsrep_gtid_domain_id   | 0                     |
| wsrep_gtid_mode        | OFF                   |
+------------------------+-----------------------+
11 rows in set (0.00 sec)
172.31.33.90:
MariaDB [(none)]> show variables like '%gtid%';
+------------------------+-----------------------+
| Variable_name          | Value                 |
+------------------------+-----------------------+
| gtid_binlog_pos        | 0-3-717928            |
| gtid_binlog_state      | 0-1-717927,0-3-717928 |
| gtid_current_pos       | 0-3-717928            |
| gtid_domain_id         | 0                     |
| gtid_ignore_duplicates | OFF                   |
| gtid_seq_no            | 0                     |
| gtid_slave_pos         | 0-1-717927            |
| gtid_strict_mode       | ON                    |
| last_gtid              |                       |
| wsrep_gtid_domain_id   | 0                     |
| wsrep_gtid_mode        | OFF                   |
+------------------------+-----------------------+
11 rows in set (0.00 sec)
MariaDB [(none)]> 

Would you still have the output of the SHOW SLAVE STATUS on old master?
Last_SQL_Error: An attempt was made to binlog GTID 0-1-717928 which would create an out-of-order sequence number with existing GTID 0-3-717928, and gtid strict mode is enabled



 Comments   
Comment by Esa Korhonen [ 2018-04-17 ]

Never seen this happen, shouldn't the switchover-stress test find this.

Generated at Thu Feb 08 04:08:31 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.