[MDEV-17516] Replication lag issue using parallel replication

Details

Type: Bug
Status: Stalled
Priority: Major
Resolution: Unresolved
Affects Version/s: 10.1.36
Fix Version/s: 10.5
Component/s: Replication
Labels:
- seconds-behind-master

Description

With parallel replication, Seconds_Behind_Master wrongly reports 0 when the SQL thread is stopped and then restarted a long time afterwards.

This happens by design:
https://lists.launchpad.net/maria-developers/msg08958.html

but it is really a show stopper for most proxies, which send traffic to such a slave thinking it is in sync with the master.

My understanding is that Seconds_Behind_Master is only computed after the first commit, so in this case the master is 2 days ahead and on a freshly restarted slave we get this:

|   30 | system user  |                      | tsce_unedic | Connect | 2211 | altering table                                                                 | OPTIMIZE TABLE `requetes` |    0.000 |

And we can see the wrong Seconds_Behind_Master:

        Seconds_Behind_Master: 0
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 0-21-28557589
                Parallel_Mode: conservative

but on its master:

gtid_current_pos       | 0-21-28570301
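
The gap between the two positions above can be checked independently of Seconds_Behind_Master. The following is only a minimal sketch: the helper names `parse_gtid` and `seq_gap` are hypothetical, and it assumes single-domain positions of the form domain-server-sequence. Note that Gtid_IO_Pos is what the IO thread has fetched, so the applied position may be further behind still.

```python
from typing import NamedTuple


class Gtid(NamedTuple):
    domain_id: int
    server_id: int
    seq_no: int


def parse_gtid(text: str) -> Gtid:
    """Parse one MariaDB GTID of the form domain-server-sequence."""
    domain, server, seq = (int(part) for part in text.strip().split("-"))
    return Gtid(domain, server, seq)


def seq_gap(replica_pos: str, master_pos: str) -> int:
    """Return how many sequence numbers the replica trails the master by.

    Assumption: both positions hold a single GTID in the same domain.
    """
    replica, master = parse_gtid(replica_pos), parse_gtid(master_pos)
    if replica.domain_id != master.domain_id:
        raise ValueError("positions belong to different replication domains")
    return master.seq_no - replica.seq_no


# Values taken from the report: Gtid_IO_Pos on the slave, gtid_current_pos on the master.
print(seq_gap("0-21-28557589", "0-21-28570301"))  # prints 12712
```

A proxy could use a check of this kind as a second signal, so that a reported lag of 0 is not trusted while the sequence gap is still large.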

A possible solution would be to update Seconds_Behind_Master at START SLAVE by injecting a fake event carrying the max timestamp of all events read by the leader thread and sent to the worker threads.

To reproduce:
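
To make the difference concrete, here is a toy model and not server code. It assumes the current value is "now minus the timestamp of the last committed event, or 0 before any commit", and it takes the proposal literally, seeding the reference timestamp from the events already read. The function names and the `None` return for "nothing read yet" are assumptions made for illustration.

```python
from typing import Optional, Sequence


def sbm_current(now: int, last_commit_ts: Optional[int]) -> int:
    """Toy model of the reported behaviour: 0 until the first commit."""
    if last_commit_ts is None:
        return 0
    return max(0, now - last_commit_ts)


def sbm_seeded(now: int, last_commit_ts: Optional[int],
               read_event_ts: Sequence[int]) -> Optional[int]:
    """Toy model of the proposal: a synthetic event at START SLAVE carries
    the max timestamp of the events read so far and seeds the reference."""
    reference = last_commit_ts
    if reference is None and read_event_ts:
        reference = max(read_event_ts)
    if reference is None:
        return None  # assumption: nothing read yet, so the lag is unknown
    return max(0, now - reference)


# A slave restarted long after its last events were read, before any commit.
now = 200_000
read_ts = [1_000, 1_100]
print(sbm_current(now, None))          # prints 0
print(sbm_seeded(now, None, read_ts))  # prints 198900
```

Once a worker commits, both functions use the committed timestamp, so the seed only matters in the window between START SLAVE and the first commit.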

--source include/have_innodb.inc
--source include/have_binlog_format_mixed.inc
--let $rpl_topology=1->2
--source include/rpl_init.inc

# Test various aspects of parallel replication.

--connection server_1
ALTER TABLE mysql.gtid_slave_pos ENGINE=InnoDB;
CREATE TABLE t1 (a INT PRIMARY KEY, b INT) ENGINE=InnoDB;
--save_master_pos

--connection server_2
--sync_with_master
--source include/stop_slave_sql_thread.inc
SET GLOBAL slave_parallel_threads=4;

--connection server_2
--sync_with_master
--source include/stop_slave.inc
SET GLOBAL slave_parallel_threads=1;

--connection server_1
--disable_warnings
INSERT INTO t1 VALUES (1, SLEEP(100));
--wait 100s
INSERT INTO t1 VALUES (1, SLEEP(1));

--connection server_2
--source include/start_slave.inc
--let $status_items= Seconds_Behind_Master
--source include/show_slave_status.inc
--sync_with_master
--let $status_items= Seconds_Behind_Master
--source include/show_slave_status.inc

Issue Links

is duplicated by
- MDEV-29639 Seconds_Behind_Master is incorrect for Delayed, Parallel Replicas (Closed)

relates to
- MDEV-30458 Consolidate Serial Replica to Parallel Replica with 1 Worker Thread (Open)
- MDEV-30619 Parallel Slave SQL Thread Can Update Seconds_Behind_Master with Active Workers (Closed)
- MDEV-31745 First Event After Starting a Delayed Parallel Replica Shows 0 Seconds_Behind_Master (Open)
- MDEV-7837 Seconds behind Master reports incorrect value when Parallel replication is used (Closed)
- MDEV-32265 seconds_behind_master is inaccurate for Delayed replication (Closed)

Activity

VAROQUI Stephane added a comment - 2023-02-01 13:38

"memorize the last gained SBM" is not valid for a STOP SLAVE with no delay where the next event is a long query.

To preserve the definition of SBM = time difference between the last event in the queue and the oldest event in the queue being COMMITTED,
I'm still curious: why not use a heartbeat from the leader, enriched with the timestamp of the last binary log event, which would be more accurate? A slave would not start fetching events before the first heartbeat. The SBM definition then becomes the time difference between the last event in the leader and the oldest event being COMMITTED in the queue.

VAROQUI Stephane added a comment - 2023-02-01 13:49

Andrei, I have checked that the state Seconds_Behind_Master: NULL with Slave_IO_Running: No already exists, so it will not break any tool. So initializing SBM to NULL and switching Slave_IO_Running to Yes after the first event fetch is correct.
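
The heartbeat-based definition discussed in the last two comments can be sketched roughly as follows. This is a toy model only: no such heartbeat field exists in the thread above beyond the proposal itself, and the function and parameter names are hypothetical. It returns `None` (standing in for NULL) until a first heartbeat has delivered the leader's last event timestamp.

```python
from typing import Optional


def sbm_from_heartbeat(master_last_event_ts: Optional[int],
                       oldest_committing_ts: int) -> Optional[int]:
    """SBM as (last event on the leader) minus (oldest queued event being committed).

    master_last_event_ts: timestamp carried by the proposed enriched heartbeat,
        or None when no heartbeat has been received yet (assumption).
    oldest_committing_ts: timestamp of the oldest event in the slave queue
        that is currently being committed.
    """
    if master_last_event_ts is None:
        return None  # report NULL instead of a misleading 0
    return max(0, master_last_event_ts - oldest_committing_ts)


print(sbm_from_heartbeat(None, 1_000))     # prints None
print(sbm_from_heartbeat(173_800, 1_000))  # prints 172800 (2 days)
```

Because the value is derived from two event timestamps, it does not depend on the slave's wall clock, which is one reason such a definition might be attractive.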

Andrei Elkin added a comment - 2023-02-01 17:55 - edited

stephane@skysql.com, regarding the HB exploitation, your direction is great. Just not HB, but when necessary it's feasible to add something like what you propose to the master-slave connection handshake.
E.g. when the slave service is started for the first time on the (restarted) slave server. In the recommended CM...master_use_gtid = slave_pos, at handshake time the slave would receive back the end-of-transaction timestamp corresponding to its last executed GTID (without this timestamp the slave would be aware only of the GTID details of its last executed transaction).
This measure refines 'a maximum SBM'.

VAROQUI Stephane added a comment - 2023-02-01 18:06

handshake to both of you so

VAROQUI Stephane added a comment - 2023-02-01 18:14

At the same time, handshake + heartbeat would refine SBM by accounting for network time and long-waiting requests.

People

Assignee: Brandon Nesterenko
Reporter: VAROQUI Stephane
Votes: 1
Watchers: 11

Dates

Created: 2018-10-22 09:02
Updated: 2025-01-20 13:42

