[MDEV-7837] Seconds behind Master reports incorrect value when Parallel replication is used Created: 2015-03-25  Updated: 2022-09-13

Status: Stalled
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.0
Fix Version/s: 10.1

Type: Bug Priority: Major
Reporter: Guillaume Lefranc Assignee: Kristian Nielsen
Resolution: Unresolved Votes: 3
Labels: seconds-behind-master

Issue Links:
Relates
relates to MDEV-17516 Replication lag issue using parallel ... Stalled

 Description   

I am not sure how parallel replication impacts SBM calculation but the computation is obviously wrong when using parallel threads.
SBM should be (imho) calculated by comparing the timestamp of the currently executed event in the binary log to the current datetime. I understand that there might be issues when executing events in parallel however that should work for most cases.

simple reproducer: 
<pre>
#stop perfectly working slave
STOP SLAVE;
SELECT SLEEP(10);
START SLAVE;
SHOW PROCESSLIST\G
MariaDB(db-02)[(none)]> SHOW PROCESSLIST\G
*************************** 1. row ***************************
      Id: 3
    User: system user
    Host: 
      db: NULL
 Command: Connect
    Time: 10
   State: Waiting for work from SQL thread
    Info: NULL
Progress: 0.000
*************************** 2. row ***************************
      Id: 4
    User: system user
    Host: 
      db: NULL
 Command: Connect
    Time: 10
   State: Waiting for work from SQL thread
    Info: NULL
Progress: 0.000
*************************** 3. row ***************************
SHOW SLAVE STATUS\G
 
        Seconds_Behind_Master: 0

SBM first reports 0 then will increase monotonically from this point until it reaches the actual delay value (whatever is reported in "Time" for a worker thread in the processlist, for example).



 Comments   
Comment by Elena Stepanova [ 2015-03-26 ]

Hi,

From the description it sounds like regular behavior with somewhat slow-ish IO thread: the slave starts, it takes time to read an event from master during which periods the SQL thread(s) are idle, and thus SBM is 0, then an SQL thread gets the even and SBM gets updated and starts increasing.
But I might have missed something in your description.
Could you please paste, attach or upload the PROCESSLIST output for all replication-related threads, and also full SHOW SLAVE STATUS output?
Thanks.

Comment by Guillaume Lefranc [ 2015-04-16 ]

Yes, that is the case indeed. I am complaining of this behaviour because I happen to stop a slave fully, including the IO thread, then restart it a few seconds later, and it does not report the correct number of seconds behind master. As you said, that is more or less expected because it's based on events gotten from the IO thread.
I am basically asking if there be would a way to make this indication more reliable, without relying on external tools (e.g. an heartbeat table being updated).

Comment by Elena Stepanova [ 2015-04-16 ]

I doubt that – not with the current design, anyway.
But I'll assign it to knielsen, not for fixing, but for the second opinion, maybe he can suggest some neat trick.

Comment by Kristian Nielsen [ 2015-04-16 ]

Hm, I'm always confused about the seconds-behind-master value.

From how I read the comments, it seems the seconds-behind-master value is
the difference between the last event executed by the SQL thread and the
last event received by the IO thread?

I suppose it was made that way to be able to report "0 seconds" for an
up-to-date slave, even if there is clock skew between master and slave?

Though I also vaguely remember reading some code that tries to detect and
adjust for time skew between master and slave. As I said, I've always been
confused with this part of the code.

I guess it would be possible to send master's current time in the heartbeat
messages (if enabled), or even in every event sent from master, but it's not
something that I think anyone have planned...

Another option to get an idea of how far behind the slave is, is to compare
@@gtid_slave_pos on the slave with @@gtid_binlog_pos on the master...

Comment by Elena Stepanova [ 2015-04-16 ]

From how I read the comments, it seems the seconds-behind-master value is the difference between the last event executed by the SQL thread and the last event received by the IO thread?

I think more accurate would be not "last event executed" but "current event that is being executed". Otherwise it sounds right. Here is a partial description from MySQL manual:

When the slave is actively processing updates, this field shows the difference between the current timestamp on the slave and the original timestamp logged on the master for the event currently being processed on the slave.

When no event is currently being processed on the slave, this value is 0.

In essence, this field measures the time difference in seconds between the slave SQL thread and the slave I/O thread. If the network connection between master and slave is fast, the slave I/O thread is very close to the master, so this field is a good approximation of how late the slave SQL thread is compared to the master. If the network is slow, this is not a good approximation; the slave SQL thread may quite often be caught up with the slow-reading slave I/O thread, so Seconds_Behind_Master often shows a value of 0, even if the I/O thread is late compared to the master. In other words, this column is useful only for fast networks.

This time difference computation works even if the master and slave do not have identical clock times, provided that the difference, computed when the slave I/O thread starts, remains constant from then on. Any changes—including NTP updates—can lead to clock skews that can make calculation of Seconds_Behind_Master less reliable.
...

Another option to get an idea of how far behind the slave is, is to compare @@gtid_slave_pos on the slave with @@gtid_binlog_pos on the master...

I'm afraid it will be even less informative in terms of time delay, because there might be 1mln tiny little events which will execute in a minute, or there might be a single event that will take hours...

Anyway, tanj, if you want to submit a feature request for a new delay indicator, please do so, but as Kristian said above, it's not something that's currently planned.

Comment by VAROQUI Stephane [ 2015-09-16 ]

Using 10.0.16 GTID SBR we stop a slave for 3 days .

We start the replication that get a correct value for second_behind_master 160K

stop slave;
set global slave-parallel-threads=10;
start slave;

second_behind_master report 10 and slowly increase to 50 .
The difference is for sure a regression.

Comment by VAROQUI Stephane [ 2015-09-16 ]

All events have UTC timestamp , so sql thread can get local UTC timestamp and compute a diff with the event one.
Can you point me what is wrong with such computation of course it does not take care of the time to process events.
but report delay between the executions of that event , this delay will potentially change for each event , making it hard using it to predict a catch up time.

Comment by Rick James (Inactive) [ 2015-10-09 ]

The timestamp from the Master is the start time of the query on the Master, correct? So, this naturally leads to fluctuations in Seconds_behind_master.

Comment by VAROQUI Stephane [ 2015-10-10 ]

that's the point Parallel replication stop computing like this , if you stop a slave for 10 days on a busy master , restarting the slave will report 5s Seconds_behind_master despite it is still 10 days late, that's not correct and it is show stopper for Parallel replication in many cases .

Comment by VAROQUI Stephane [ 2015-10-10 ]

to make Seconds_behind_master correct the io thread should receive timestamp of event sent to the slave - sum ( following events response time until last commit )

the slave compute:

current time - execution time on master + sum of pending event execution

If we don't wan't to store sum of pending events exec time in each events this should be store in heartbeat

Comment by Kristian Nielsen [ 2015-10-15 ]

See https://lists.launchpad.net/maria-developers/msg08958.html

Comment by VAROQUI Stephane [ 2022-09-13 ]

This task is missing a variable in the equation, it should be possible on IO thread connection to fetch the time of the last binary log of the leader and refresh it using heartbeat . The correct SBM is then equal :

*MAX(Time end transactions on all SQL threads) - MAX( last bin log first connection time , last IO thread fetch time) *

With END time transaction the question remain for accounting the running transaction progress as for online DDL using START time we would ignore curent execution and this should be accurate with most transactional online DDL

This should support microsecond delay because as the new default for any query time now

At the end the heartbeat of a statement create or replace view select timestamp like proposed by many tools is probably the most elegant implementation

Generated at Thu Feb 08 07:22:38 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.