[MDEV-34354] Shows replication time difference in master_last_event_time after setting MASTER_DELAY > 0 in chain replication - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Major
Resolution: Fixed
Affects Version/s: 11.6.0
Fix Version/s: 11.6.0
Component/s: Replication
Labels:
None

Description

Found a small difference in Master_last_event_time in replica2 when setting up chain replication. The issue found only when setting MASTER_DELAY > 0 on replica1

Test case

MariaDB [test]> create or replace table t1 select seq from seq_1_to_1000;select now();

Query OK, 1000 rows affected (0.086 sec)

Records: 1000  Duplicates: 0  Warnings: 0

+---------------------+

| now()               |

+---------------------+

| 2024-06-07 10:16:25 |

+---------------------+

1 row in set (0.000 sec)

MariaDB [test]>

replica status from server replica1. Master delay configured to 5 sec on replica1 SQL_Delay: 5

                 Parallel_Mode: aggressive

                     SQL_Delay: 5

           SQL_Remaining_Delay: NULL

       Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates

              Slave_DDL_Groups: 8

Slave_Non_Transactional_Groups: 2

    Slave_Transactional_Groups: 1

          Retried_transactions: 0

            Max_relay_log_size: 1073741824

          Executed_log_entries: 1327

     Slave_received_heartbeats: 67

        Slave_heartbeat_period: 30.000

                Gtid_Slave_Pos: 11-100-11

        Master_last_event_time: 2024-06-07 10:16:25

         Slave_last_event_time: 2024-06-07 10:16:25

        Master_Slave_time_diff: 0

1 row in set (0.000 sec)

MariaDB [test]>

replica status from server replica2, please check Master_last_event_time difference

                     SQL_Delay: 0

           SQL_Remaining_Delay: NULL

       Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates

              Slave_DDL_Groups: 1

Slave_Non_Transactional_Groups: 9

    Slave_Transactional_Groups: 1

          Retried_transactions: 0

            Max_relay_log_size: 1073741824

          Executed_log_entries: 1308

     Slave_received_heartbeats: 91

        Slave_heartbeat_period: 30.000

                Gtid_Slave_Pos: 11-100-11

        Master_last_event_time: 2024-06-07 10:16:30

         Slave_last_event_time: 2024-06-07 10:16:25

        Master_Slave_time_diff: 5

1 row in set (0.000 sec)

MariaDB [test]>

Attachments

Issue Links

is caused by

MDEV-33856 Alternative Replication Lag Representation via Received/Executed Master Binlog Event Timestamps

Closed

Activity

Ascending order - Click to sort in descending order

Ramesh Sivaraman created issue - 2024-06-10 18:00

Ramesh Sivaraman added a comment - 2024-06-10 18:01

The issue is also present in `Seconds_behind_master`. Thank you bnestere for sharing the MTR test case to test SBM issue

Ramesh Sivaraman added a comment - 2024-06-10 18:01 The issue is also present in `Seconds_behind_master`. Thank you bnestere for sharing the MTR test case to test SBM issue

Brandon Nesterenko added a comment - 2024-06-10 20:52

The issue is that the slave overwrites the binlog event's exec_time, which includes the delay time. This makes Seconds_behind_master and master_last_event_time incorrect on slaves that follow a chain. See the following MTR test case highlighting the issue with SBM

--source include/have_binlog_format_statement.inc

--let $rpl_topology= 1->2->3

--source include/rpl_init.inc

--connection server_1

--source include/have_innodb.inc

--connection server_2

--source include/have_innodb.inc

--connection server_3

--source include/have_innodb.inc

--connection server_1

--connection server_1

create table t1 (a int) engine=aria;

--source include/save_master_gtid.inc

--connection server_3

--source include/sync_with_master_gtid.inc

--connection server_2

--source include/stop_slave.inc

change master to master_delay=5;

--source include/start_slave.inc

--connection server_3

--source include/stop_slave.inc

change master to master_delay=10;

--source include/start_slave.inc

lock tables t1 write;

--connection server_1

insert into t1 values (100);

--source include/save_master_gtid.inc

--connection server_2

--source include/sync_with_master_gtid.inc

--echo # Server_2 has new event

--connection server_3

--echo #

--echo # Sleep 3s

--sleep 3

--let $sbm= query_get_value(SHOW SLAVE STATUS, Seconds_Behind_Master, 1)

--echo # Server_3 is waiting for delay, has correct SBM: $sbm

--echo #

--echo # Now wait ~2s for server_3's master_delay to end

--let $wait_condition= SELECT count(*) FROM information_schema.processlist WHERE state LIKE 'Waiting for table metadata lock';

--source include/wait_condition.inc

--let $sbm= query_get_value(SHOW SLAVE STATUS, Seconds_Behind_Master, 1)

--echo # Once server_3 delay ends (now is waiting on lock), its SBM jumps by

--echo # server_2's MASTER_DELAY-(~2 seconds): $sbm

unlock tables;

--source include/sync_with_master_gtid.inc

--connection server_2

--source include/stop_slave.inc

change master to master_delay=0;

--source include/start_slave.inc

--connection server_3

--source include/stop_slave.inc

change master to master_delay=0;

--source include/start_slave.inc

--connection server_1

drop table t1;

--source include/rpl_end.inc

With .cnf file

!include include/default_mysqld.cnf

[mysqld.1]

log-slave-updates

gtid_domain_id=0

[mysqld.2]

log-slave-updates

[mysqld.3]

log-slave-updates

[ENV]

SERVER_MYPORT_1= @mysqld.1.port

SERVER_MYPORT_2= @mysqld.2.port

SERVER_MYPORT_3= @mysqld.3.port

As the transaction is still executing, it would be expected that after 2s, the SBM would go from 8 -> 10; but instead, it jumps down to 5.

Brandon Nesterenko added a comment - 2024-06-10 20:52 The issue is that the slave overwrites the binlog event's exec_time, which includes the delay time. This makes Seconds_behind_master and master_last_event_time incorrect on slaves that follow a chain. See the following MTR test case highlighting the issue with SBM --source include/have_binlog_format_statement.inc --let $rpl_topology= 1->2->3 --source include/rpl_init.inc --connection server_1 --source include/have_innodb.inc --connection server_2 --source include/have_innodb.inc --connection server_3 --source include/have_innodb.inc --connection server_1 --connection server_1 create table t1 (a int) engine=aria; --source include/save_master_gtid.inc --connection server_3 --source include/sync_with_master_gtid.inc --connection server_2 --source include/stop_slave.inc change master to master_delay=5; --source include/start_slave.inc --connection server_3 --source include/stop_slave.inc change master to master_delay=10; --source include/start_slave.inc lock tables t1 write; --connection server_1 insert into t1 values (100); --source include/save_master_gtid.inc --connection server_2 --source include/sync_with_master_gtid.inc --echo # Server_2 has new event --connection server_3 --echo # --echo # Sleep 3s --sleep 3 --let $sbm= query_get_value(SHOW SLAVE STATUS, Seconds_Behind_Master, 1) --echo # Server_3 is waiting for delay, has correct SBM: $sbm --echo # --echo # Now wait ~2s for server_3's master_delay to end --let $wait_condition= SELECT count(*) FROM information_schema.processlist WHERE state LIKE 'Waiting for table metadata lock'; --source include/wait_condition.inc --let $sbm= query_get_value(SHOW SLAVE STATUS, Seconds_Behind_Master, 1) --echo # Once server_3 delay ends (now is waiting on lock), its SBM jumps by --echo # server_2's MASTER_DELAY-(~2 seconds): $sbm unlock tables; --source include/sync_with_master_gtid.inc --connection server_2 --source include/stop_slave.inc change master to master_delay=0; --source include/start_slave.inc --connection server_3 --source include/stop_slave.inc change master to master_delay=0; --source include/start_slave.inc --connection server_1 drop table t1; --source include/rpl_end.inc With .cnf file !include include/default_mysqld.cnf [mysqld.1] log-slave-updates gtid_domain_id=0 [mysqld.2] log-slave-updates [mysqld.3] log-slave-updates [ENV] SERVER_MYPORT_1= @mysqld.1.port SERVER_MYPORT_2= @mysqld.2.port SERVER_MYPORT_3= @mysqld.3.port As the transaction is still executing, it would be expected that after 2s, the SBM would go from 8 -> 10; but instead, it jumps down to 5.

Brandon Nesterenko made changes - 2024-06-10 20:52

Field	Original Value	New Value
Assignee	Brandon Nesterenko [ JIRAUSER48702 ]	Michael Widenius [ monty ]

Brandon Nesterenko made changes - 2024-06-10 20:52

Status

Open [ 1 ]

Confirmed [ 10101 ]

Brandon Nesterenko made changes - 2024-08-27 21:17

Link

This issue is caused by ~~MDEV-33856~~ [ ~~MDEV-33856~~ ]

Brandon Nesterenko made changes - 2024-08-27 21:17

Fix Version/s		11.6.0 [ 29839 ]
Fix Version/s	10.6 [ 24028 ]
Affects Version/s		11.6.0 [ 29839 ]
Affects Version/s	10.6 [ 24028 ]

Brandon Nesterenko added a comment - 2024-08-27 21:19

This issue was fixed in the official commit for ~~MDEV-33856~~ in 11.6.0.

Brandon Nesterenko added a comment - 2024-08-27 21:19 This issue was fixed in the official commit for MDEV-33856 in 11.6.0.

Brandon Nesterenko made changes - 2024-08-27 21:19

issue.field.resolutiondate

2024-08-27 21:19:07.0

2024-08-27 21:19:07.052

Brandon Nesterenko made changes - 2024-08-27 21:19

Resolution		Fixed [ 1 ]
Status	Confirmed [ 10101 ]	Closed [ 6 ]

People

Assignee:: Michael Widenius

Reporter:: Ramesh Sivaraman

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 2024-06-10 18:00

Updated:: 2024-08-27 21:19

Resolved:: 2024-08-27 21:19

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server