[MDEV-29639] Seconds_Behind_Master is incorrect for Delayed, Parallel Replicas
Created: 2022-09-26  Updated: 2023-09-27  Resolved: 2023-01-24

Status: Closed
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.3, 10.4, 10.5, 10.6, 10.7, 10.8, 10.9, 10.10
Fix Version/s: 10.11.2, 11.0.1, 10.3.38, 10.4.28, 10.5.19, 10.6.12, 10.7.8, 10.8.7, 10.9.5, 10.10.3

Type: Bug
Priority: Critical
Reporter: Brandon Nesterenko
Assignee: Brandon Nesterenko
Resolution: Fixed
Votes: 1
Labels: None

Issue Links:
Duplicate
duplicates MDEV-17516 Replication lag issue using parallel ... Stalled
Problem/Incident
causes MDEV-30619 Parallel Slave SQL Thread Can Update ... Closed
Relates
relates to MDEV-30458 Consolidate Serial Replica to Paralle... Open
relates to MDEV-30608 rpl.rpl_delayed_parallel_slave_sbm so... Closed
relates to MDEV-32265 seconds_behind_master is inaccurate f... Closed
relates to MDEV-31745 First Event After Starting a Delayed ... Open

 Description   

Delayed replicas (those using the MASTER_DELAY option of CHANGE MASTER TO) that are also configured to use parallel threads calculate Seconds_Behind_Master incorrectly. An earlier commit changed parallel replicas to update Seconds_Behind_Master at the time of transaction commit. On a delayed replica, however, an event's Seconds_Behind_Master is not calculated until MASTER_DELAY seconds have passed and the event has finished executing. In other words, when a new event is received, Seconds_Behind_Master is calculated using the time of the last committed event, which can produce very large values of Seconds_Behind_Master for the entire duration of MASTER_DELAY. The effect is most pronounced for workloads with infrequent transactions.
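
To make the arithmetic concrete, here is a toy Python model of the behavior described above. It is only a sketch and not the server's implementation: the function name, the rule "now minus the timestamp of a reference event", and the chosen values (10 idle seconds, MASTER_DELAY=4) are assumptions made for illustration.

```python
# Toy model only: this is NOT the server's implementation.
# Assumption: while events are pending, the replica reports
#   Seconds_Behind_Master = now - (primary timestamp of a reference event),
# and it reports 0 when nothing is pending.

def seconds_behind_master(now, reference_ts, has_pending_events):
    """Hypothetical helper modeling the reported lag in whole seconds."""
    if not has_pending_events:
        return 0
    return max(0, now - reference_ts)

MASTER_DELAY = 4        # illustrative value
last_committed_ts = 0   # primary timestamp of the last committed event
new_event_ts = 10       # a new event arrives after 10 idle seconds

# Seconds during which the new event is still waiting out MASTER_DELAY.
waiting = range(new_event_ts, new_event_ts + MASTER_DELAY)

# Behavior described in this report: the reference is the last committed event.
reported = [seconds_behind_master(t, last_committed_ts, True) for t in waiting]

# Hypothetical alternative for comparison: the reference is the new event itself.
relative_to_new_event = [seconds_behind_master(t, new_event_ts, True) for t in waiting]

print(reported)
print(relative_to_new_event)
```

In this model the idle gap is carried into the reported value for the whole delay window, while measuring against the newly received event would start from zero. The second list is shown only as a point of comparison and is not a statement about how the eventual fix behaves.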

The following MTR test highlights this issue:

--source include/master-slave.inc
--source include/have_binlog_format_row.inc
 
--echo #
--echo # Initialize test data
--connection master
create table t1 (a int);
insert into t1 values (1);
--source include/save_master_gtid.inc
 
--connection slave
--source include/sync_with_master_gtid.inc
--source include/stop_slave.inc
CHANGE MASTER TO MASTER_DELAY=4, MASTER_USE_GTID=Slave_Pos;
set @@global.slave_parallel_threads= 4;
--source include/start_slave.inc
 
--echo # Set up a long interval between now and the next event to boost SBM
--connection master
--sleep 10
 
--let $ctr=8
while($ctr)
{
    --connection slave
 
    # On the first iteration, SBM will be 0 because there are no new events
    --let $status_items= Seconds_Behind_Master
    --source include/show_slave_status.inc
 
    --connection master
    --eval insert into t1 values ($ctr)
    --send select sleep(1)
    --dec $ctr
 
    # On the first iteration, SBM will boost to 10 because of the long
    # interval, despite only just receiving the event
    --connection slave
    --source include/show_slave_status.inc
 
    --connection master
    --reap
}
 
 
 
--echo #
--echo # Cleanup
--connection master
DROP TABLE t1;
--source include/save_master_gtid.inc
 
--connection slave
--source include/sync_with_master_gtid.inc
--source include/stop_slave.inc
CHANGE MASTER TO MASTER_DELAY=0;
set @@global.slave_parallel_threads= 0;
--source include/start_slave.inc
 
--source include/rpl_end.inc
 
--echo # End of tests



 Comments   
Comment by Brandon Nesterenko [ 2022-10-11 ]

Closing as duplicate because the underlying cause is the same as MDEV-17516.

Comment by Andrei Elkin [ 2022-10-25 ]

Reopened, since MDEV-17516 is a more general issue that is also unrelated to the delayed replication option.

Comment by Brandon Nesterenko [ 2022-11-04 ]

Hi Andrei!

This is ready for review:
PR-2323

Comment by Andrei Elkin [ 2023-01-04 ]

Brandon, please find a refined approach in bb-10.3-MDEV-29639-review.

Comment by Brandon Nesterenko [ 2023-01-13 ]

Hi Andrei!

The newest commit to PR-2323 is ready for your review.

Comment by Andrei Elkin [ 2023-01-23 ]

Thanks for this work, Brandon!

Comment by Brandon Nesterenko [ 2023-01-24 ]

Fixed as d69e835 in 10.3.

No merge conflicts or test failures observed in local merge-up.
