[MDEV-33316] rpl_change_master_demote binlog race condition - Jira

Details

Type: Bug
Status: Stalled
Priority: Major
Resolution: Unresolved
Affects Version/s: 10.11, 11.0(EOL), 11.1(EOL), 11.2(EOL), 11.3(EOL)
Fix Version/s: 10.11
Component/s: Replication, Tests
Labels: None

Description

Between test case 3 and test case 4 of rpl_change_master_demote.test, the master sends ER_GTID_POSITION_NOT_FOUND_IN_BINLOG2 to the slave. The cause is a race: the slave requests to start at a GTID that is not in the master's binlogs, and the master binlogs a transaction. This puts the master's gtid_binlog_pos ahead of the slave_connect_state that the slave requested.

The resulting failure looks like:

rpl.rpl_change_master_demote 'mix'       w1 [ fail ]

        Test ended at 2024-01-24 10:39:40

CURRENT_TEST: rpl.rpl_change_master_demote

mysqltest: In included file "./include/sync_with_master_gtid.inc":

included from /home/buildbot/buildbot/build/mariadb-10.11.7/mysql-test/suite/rpl/t/rpl_change_master_demote.test at line 149:

At line 48: Failed to sync with master

The result from queries just before the failure was:

< snip >

#  * True primary is back to connection 'master'

#  * True replica is back to connection 'slave'

##############################################

connection master;

connection slave;

CHANGE MASTER TO master_host='127.0.0.1', master_port=MASTER_PORT, master_user='root', master_use_gtid=slave_pos, master_demote_to_slave=1;

# Test Case 4: If gtid_slave_pos and gtid_binlog_pos are equivalent,

# MASTER_DEMOTE_TO_SLAVE=1 will not change gtid_slave_pos.

connection master;

# update gtid_binlog_pos and demote it (we have proven this works)

INSERT INTO t1 VALUES (3);

# Update to account for statements to verify replication in include file

CHANGE MASTER TO master_host='127.0.0.1', master_port=SLAVE_PORT, master_user='root', master_use_gtid=slave_pos, master_demote_to_slave=1;

RESET SLAVE ALL;

include/save_master_gtid.inc

connection slave;

include/sync_with_master_gtid.inc

Timeout in master_gtid_wait('0-1-10', 120), current slave GTID position is: 0-2-9.

More results from queries before failure can be found in /dev/shm/var/1/log/rpl_change_master_demote.log

 - saving '/dev/shm/var/1/log/rpl.rpl_change_master_demote-mix/' to '/dev/shm/var/log/rpl.rpl_change_master_demote-mix/'

Retrying test rpl.rpl_change_master_demote, attempt(2/2)...
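
The timeout line in the log above compares a wait target (0-1-10) with the slave's current position (0-2-9). As a rough illustration of why the wait cannot complete, here is a minimal Python sketch. It assumes a simplified model in which a position counts as reached once every domain in the target has an equal or higher sequence number in the current position. The helper names parse_gtid_pos and has_reached are hypothetical and are not part of MariaDB or mysqltest; the server's real wait logic is more involved.

```python
def parse_gtid_pos(pos):
    """Parse 'domain-server-seq' items separated by commas into
    a dict of {domain_id: (server_id, seq_no)}."""
    result = {}
    for gtid in pos.split(","):
        gtid = gtid.strip()
        if not gtid:
            continue
        domain_id, server_id, seq_no = (int(part) for part in gtid.split("-"))
        result[domain_id] = (server_id, seq_no)
    return result


def has_reached(current_pos, target_pos):
    """Simplified check (an assumption, not the server's algorithm):
    every domain in the target must be present in the current position
    with a sequence number that is at least the target's."""
    current = parse_gtid_pos(current_pos)
    for domain_id, (_, target_seq) in parse_gtid_pos(target_pos).items():
        if domain_id not in current or current[domain_id][1] < target_seq:
            return False
    return True


# Values taken from the timeout message in the failure log.
print(has_reached("0-2-9", "0-1-10"))  # prints False
```

In this model the slave is one sequence number short in domain 0, which matches the reported timeout: the slave never receives 0-1-10 because its connection to the master was rejected.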

Issue Links

is part of

MDEV-33073 always green buildbot (Stalled)

MDEV-36647 No red leaves in the forest (Open)

relates to

MDEV-34554 rpl_change_master_demote sporadically fails on buildbot (Closed)

MDEV-29517 rpl.rpl_change_master_demote sporadically fails in BB (Closed)

Activity

Brandon Nesterenko added a comment - 2024-05-21 16:17

This test can also fail with:

rpl.rpl_change_master_demote 'mix'       w64 [ fail ]

        Test ended at 2024-05-20 09:17:08

CURRENT_TEST: rpl.rpl_change_master_demote

mysqltest: At line 292: "IO thread should not be running after START SLAVE UNTIL master_gtid_pos using a pre-existing GTID"

The result from queries just before the failure was:

< snip >

SELECT VARIABLE_NAME, GLOBAL_VALUE FROM INFORMATION_SCHEMA.SYSTEM_VARIABLES WHERE VARIABLE_NAME LIKE 'gtid_binlog_pos' OR VARIABLE_NAME LIKE 'gtid_slave_pos' ORDER BY VARIABLE_NAME ASC;

VARIABLE_NAME GLOBAL_VALUE

GTID_BINLOG_POS 0-1-26,1-3-4,2-1-3,3-1-2,4-3-2

GTID_SLAVE_POS  0-2-24,1-3-4,2-1-3,3-1-2,4-3-2

CHANGE MASTER TO master_host='127.0.0.1', master_port=SLAVE_PORT, master_user='root', master_use_gtid=Slave_Pos, master_demote_to_slave=1;

SELECT VARIABLE_NAME, GLOBAL_VALUE FROM INFORMATION_SCHEMA.SYSTEM_VARIABLES WHERE VARIABLE_NAME LIKE 'gtid_binlog_pos' OR VARIABLE_NAME LIKE 'gtid_slave_pos' ORDER BY VARIABLE_NAME ASC;

VARIABLE_NAME GLOBAL_VALUE

GTID_BINLOG_POS 0-1-26,1-3-4,2-1-3,3-1-2,4-3-2

GTID_SLAVE_POS  0-1-26,1-3-4,2-1-3,3-1-2,4-3-2

# GTID ssu_middle_binlog_pos should be considered in the past because

# gtid_slave_pos should be updated using the latest binlog gtids.

# The following call to sync_with_master_gtid.inc uses the latest

# binlog position and should still succeed despite the SSU stop

# position pointing to a previous event (because

# master_demote_to_slave=1 merges gtid_binlog_pos into gtid_slave_pos).

START SLAVE UNTIL master_gtid_pos="ssu_middle_binlog_pos";

Warnings:

Note  1278  It is recommended to use --skip-slave-start when doing step-by-step replication with START SLAVE UNTIL; otherwise, you will get problems if you get an unexpected slave's mariadbd restart

# Slave needs time to start and stop automatically

# Validating neither SQL nor IO threads are running..
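
The two SELECT results in the log above show gtid_slave_pos changing from 0-2-24,... to 0-1-26,... after CHANGE MASTER TO ... master_demote_to_slave=1, and the test comments describe this as merging gtid_binlog_pos into gtid_slave_pos. A minimal Python sketch of such a merge follows. It assumes a simplified rule in which, for each domain, the GTID with the higher sequence number wins. The helper names are hypothetical and this is not the server's implementation.

```python
def parse_gtid_pos(pos):
    """Parse 'domain-server-seq' items separated by commas into
    a dict of {domain_id: (server_id, seq_no)}."""
    result = {}
    for gtid in pos.split(","):
        gtid = gtid.strip()
        if not gtid:
            continue
        domain_id, server_id, seq_no = (int(part) for part in gtid.split("-"))
        result[domain_id] = (server_id, seq_no)
    return result


def format_gtid_pos(parsed):
    """Render the dict back into a comma-separated position, ordered by domain."""
    return ",".join(
        f"{domain_id}-{server_id}-{seq_no}"
        for domain_id, (server_id, seq_no) in sorted(parsed.items())
    )


def merge_positions(slave_pos, binlog_pos):
    """Simplified merge (an assumption for illustration): per domain,
    keep whichever GTID has the higher sequence number."""
    merged = parse_gtid_pos(slave_pos)
    for domain_id, (server_id, seq_no) in parse_gtid_pos(binlog_pos).items():
        if domain_id not in merged or seq_no > merged[domain_id][1]:
            merged[domain_id] = (server_id, seq_no)
    return format_gtid_pos(merged)


# Values taken from the first SELECT result in the log.
slave_before = "0-2-24,1-3-4,2-1-3,3-1-2,4-3-2"
binlog = "0-1-26,1-3-4,2-1-3,3-1-2,4-3-2"
print(merge_positions(slave_before, binlog))
```

With the values from the log, only domain 0 changes, which agrees with the second SELECT result, where GTID_SLAVE_POS equals GTID_BINLOG_POS.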

People

Assignee: Brandon Nesterenko

Reporter: Brandon Nesterenko

Votes: 0

Watchers: 2

Dates

Created: 2024-01-25 22:29

Updated: Yesterday 06:28

MariaDB Server