[MDEV-33327] rpl_seconds_behind_master_spike Sensitive to IO Thread Stop Position Created: 2024-01-29  Updated: 2024-01-30  Resolved: 2024-01-30

Status: Closed
Project: MariaDB Server
Component/s: Replication, Tests
Affects Version/s: 10.4, 10.5, 10.6, 10.11, 11.0, 11.1, 11.2, 11.3
Fix Version/s: 10.4.33, 10.5.24, 10.6.17, 10.11.7, 11.0.5, 11.1.4, 11.2.3, 11.3.2

Type: Bug Priority: Major
Reporter: Brandon Nesterenko Assignee: Brandon Nesterenko
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Relates
relates to MDEV-16091 Seconds_Behind_Master spikes to milli... Closed
relates to MDEV-32551 "Read semi-sync reply magic number er... Closed

 Description   

The test rpl.rpl_seconds_behind_master_spike can fail with the following error:

CURRENT_TEST: rpl.rpl_seconds_behind_master_spike
mysqltest: At line 63: query 'select count(*)=1 from t1' failed: ER_NO_SUCH_TABLE (1146): Table 'test.t1' doesn't exist
 
The result from queries just before the failure was:
< snip >
SET @@global.debug_dbug="+d,pause_sql_thread_on_fde,negate_clock_diff_with_master";
include/start_slave.inc
# Future events must be logged at least 2 seconds after
# the slave starts
connection master;
# Write events to ensure slave will be consistent with master
create table t1 (a int);
insert into t1 values (1);
# Flush logs on master forces slave to generate a Format description
# event in its relay log
flush logs;
connection slave;
# Ignore FDEs that happen before the CREATE/INSERT commands
SET DEBUG_SYNC='now WAIT_FOR paused_on_fde';
SET DEBUG_SYNC='now SIGNAL sql_thread_continue';
SET DEBUG_SYNC='now WAIT_FOR paused_on_fde';
SET DEBUG_SYNC='now SIGNAL sql_thread_continue';
# On the next FDE, the slave should have the master CREATE/INSERT events
SET DEBUG_SYNC='now WAIT_FOR paused_on_fde';
select count(*)=1 from t1;

The failure occurs because the test relies on a specific number of format descriptor events (FDEs). Depending on where the IO thread is stopped, an extra FDE can be sent before the transactions. The SQL thread then pauses on that FDE before executing any transactions, so the test queries t1 before the table exists.

The test should be fixed to be more flexible and to not rely on the FDE count.
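
The difference between a count-based wait and a count-independent wait can be illustrated with a small toy model. This is only a sketch and not the actual fix: the event names, the helper functions, and the baseline of two leading FDEs (inferred from the two WAIT_FOR/SIGNAL pairs in the output above) are all assumptions made for illustration, and none of them are MariaDB or mysqltest APIs.

```python
# Toy model only: event names and helpers are hypothetical,
# not MariaDB or mysqltest APIs.
FDE = "FDE"


def relay_stream(extra_fdes):
    """Events seen by the slave SQL thread, in order.

    extra_fdes models additional format descriptor events that may
    appear depending on where the IO thread was stopped.
    """
    leading = [FDE, FDE] + [FDE] * extra_fdes
    return leading + ["CREATE t1", "INSERT t1", FDE]


def applied_before_nth_pause(events, n):
    """Return the non-FDE events applied before the n-th pause on an FDE."""
    applied = []
    pauses = 0
    for event in events:
        if event == FDE:
            pauses += 1
            if pauses == n:
                break
        else:
            applied.append(event)
    return applied


def fixed_count_check(events):
    """Count-based approach: assume the third pause follows CREATE/INSERT."""
    return "CREATE t1" in applied_before_nth_pause(events, 3)


def flexible_check(events):
    """Count-independent approach: resume after each pause until t1 exists."""
    for n in range(1, events.count(FDE) + 1):
        if "CREATE t1" in applied_before_nth_pause(events, n):
            return True
    return False
```

In this model, fixed_count_check(relay_stream(1)) fails because the third pause lands on the extra FDE, while flexible_check succeeds for any number of extra FDEs.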

Note that this test became much less stable after MDEV-32551: the amended kill_zombie_dump_threads kills the IO thread more quickly, so the test fails quite often.



 Comments   
Comment by Brandon Nesterenko [ 2024-01-30 ]

Analyzed and fixed; pushed into 10.4 as e4f221a5.
