[MDEV-13895] GTID and Master_Delay causes excessive initial delay Created: 2017-09-22  Updated: 2019-11-10  Resolved: 2019-03-28

Status: Closed
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.2.8, 10.2
Fix Version/s: 10.2.24, 10.3.14, 10.4.4

Type: Bug Priority: Major
Reporter: Marcus Zambrano Assignee: Sujatha Sivakumar (Inactive)
Resolution: Fixed Votes: 0
Labels: Seconds_Behind_Master, seconds-behind-master
Environment:

RHEL 6 x86_64


Issue Links:
Relates
relates to MDEV-17469 Replication Slave + SQL_Delay Open

 Description   

When attempting to delay a Slave attached with GTID, there appears to be an extra delay applied initially. For example, this output reflects a Slave that is already delayed by 43200 seconds. When switching to GTID replication, replication is paused until SQL_Remaining_Delay counts down to 0:

MariaDB [(none)]> stop slave; CHANGE MASTER TO master_use_gtid=current_pos; CHANGE MASTER TO MASTER_DELAY=43200; start slave;

Relay_Master_Log_File: bin-log.018039
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Exec_Master_Log_Pos: 625
Seconds_Behind_Master: 44847
Using_Gtid: Current_Pos
SQL_Delay: 43200
SQL_Remaining_Delay: 43089
Slave_SQL_Running_State: Waiting until MASTER_DELAY seconds after master executed event

When switching to conventional replication coordinates, the delay is applied as expected:

MariaDB [(none)]> change master to master_log_file='bin-log.018040', master_log_pos=20289860;
Query OK, 0 rows affected (0.25 sec)

Relay_Master_Log_File: bin-log.018040
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Exec_Master_Log_Pos: 28182467
Seconds_Behind_Master: 43636
Using_Gtid: No
SQL_Delay: 43200
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Unlocking tables



 Comments   
Comment by Elena Stepanova [ 2017-09-30 ]

Thanks for the report.

The test below is for reproducing only, do not put it into the regression suite!

The test sets delay to 20 seconds, then runs 2 inserts on master with 10-second interval, and checks status on the slave over the next minute.
Comments show which result is expect at which step.
With master_use_gtid=no everything works as expected. With master_use_gtid=current_pos, indeed, there is an extra delay:

####################################
# After 13 seconds waiting (expected behind ~13, remaining ~7, 1 row)
Slave_IO_Running = 'Yes'
Slave_SQL_Running = 'Yes'
Seconds_Behind_Master = '23'
SQL_Delay = '20'
SQL_Remaining_Delay = '17'
select * from t1;
i
####################################
# After 23 seconds waiting (expected behind 0, remaining NULL, 2 rows)
Slave_IO_Running = 'Yes'
Slave_SQL_Running = 'Yes'
Seconds_Behind_Master = '33'
SQL_Delay = '20'
SQL_Remaining_Delay = '7'
select * from t1;
i

At the end the slave catches up.

mdev13895.test

# The test uses env variable USE_GTID. Set it to 'no' or 'current_pos'
 
--source include/master-slave.inc
--source include/have_binlog_format_mixed.inc
 
create table t1 (i int);
 
--sync_slave_with_master
stop slave;
eval change master to master_use_gtid= $USE_GTID;
change master to master_delay= 20;
start slave;
 
--connection master
insert into t1 values (1);
--sleep 10
insert into t1 values (2);
 
--let $status_items= Slave_IO_Running,Slave_SQL_Running,Seconds_Behind_Master,SQL_Delay,SQL_Remaining_Delay
 
--connection slave
--echo ####################################
--echo # First check, 10 sec since 1st insert (expected behind ~10, remaining ~10, no rows)
--source include/show_slave_status.inc
select * from t1;
--sleep 5
--echo ####################################
--echo # After 5 seconds waiting (expected behind ~15, remaining ~5, no rows)
--source include/show_slave_status.inc
select * from t1;
--sleep 8
--echo ####################################
--echo # After 13 seconds waiting (expected behind ~13, remaining ~7, 1 row)
--source include/show_slave_status.inc
select * from t1;
--sleep 10
--echo ####################################
--echo # After 23 seconds waiting (expected behind 0, remaining NULL, 2 rows)
--source include/show_slave_status.inc
select * from t1;
--sleep 10
--echo ####################################
--echo # After 33 seconds waiting (expected behind 0, remaining NULL, 2 rows)
--source include/show_slave_status.inc
select * from t1;
 
 
# Cleanup
 
stop slave;
change master to master_use_gtid=no, master_delay=0;
start slave;
--connection master
drop table t1;
--sync_slave_with_master
--connection master
--source include/rpl_end.inc

Comment by Andrei Elkin [ 2018-10-29 ]

Most probably the case hosts MDEV-17469's greater-than-expected SBM which also happens there having GTID replication mode.
I suggest we follow up from here esp having a valuable how-to-repeat from Elena.

Comment by Sujatha Sivakumar (Inactive) [ 2019-03-11 ]

Hello Andrei,

Please refer the following branch for the patch.

bb-10.2-MDEV-13895

Kindly review the changes.

Comment by Andrei Elkin [ 2019-03-25 ]

The patch looks good, thank for this piece of work and analysis!

Comment by Sujatha Sivakumar (Inactive) [ 2019-03-28 ]

Fix for this issue is implemented as part of following commit in 10.2.24.

e42192d7b3821640bcf18c58dc303a2338c6d1aa

Generated at Thu Feb 08 08:09:12 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.