[MDEV-31895] Report a Replica's Time Difference with its Primary
Created: 2023-08-10  Updated: 2024-02-07

Status: Open
Project: MariaDB Server
Component/s: Replication
Fix Version/s: 11.5

Type: New Feature Priority: Major
Reporter: Brandon Nesterenko Assignee: Brandon Nesterenko
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Relates
relates to MDEV-16091 Seconds_Behind_Master spikes to milli... Closed
relates to MDEV-30619 Parallel Slave SQL Thread Can Update ... Closed

 Description   

Extend SHOW SLAVE STATUS to display the variable mi->clock_diff_with_master.

Additionally, MTR tests that use Seconds_Behind_Master should be updated to adjust the value they check based on this variable. This removes the need for the debug point "negate_clock_diff_with_master", so tests that include have_debug.inc only for that purpose can drop that dependency.
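A minimal Python sketch of the test-side adjustment. It assumes the server derives Seconds_Behind_Master as (replica time minus timestamp of the last executed event) minus mi->clock_diff_with_master, and that the new SHOW SLAVE STATUS column is called Clock_Diff_With_Master. The issue does not name the column, so that name is hypothetical, as is the helper function. Under those assumptions, adding the displayed difference back gives the lag as measured purely on the replica's clock, without requiring a debug build.

```python
from typing import Mapping, Optional


def uncorrected_seconds_behind_master(row: Mapping[str, str]) -> Optional[int]:
    """Return Seconds_Behind_Master with the clock-difference correction undone.

    `row` is one SHOW SLAVE STATUS row as a column-name -> string mapping.
    Assumption (not confirmed by the issue): the server computes
        Seconds_Behind_Master = (replica_now - last_event_ts) - clock_diff_with_master
    and shows clock_diff_with_master in a column named Clock_Diff_With_Master
    (hypothetical name).
    """
    sbm = row.get("Seconds_Behind_Master")
    if sbm is None or sbm == "NULL":
        # Lag is unknown (e.g. SQL thread not running): nothing to adjust.
        return None
    # Hypothetical column; a missing value is treated as "no correction applied".
    clock_diff = int(row.get("Clock_Diff_With_Master", "0"))
    return int(sbm) + clock_diff


# Usage: the reported lag was reduced by a spurious 1-second clock difference.
row = {"Seconds_Behind_Master": "9", "Clock_Diff_With_Master": "1"}
print(uncorrected_seconds_behind_master(row))  # 10
```

One caveat worth checking against the server code: if the reported value is clamped at zero, adding the difference back is only exact when the corrected lag was non-negative.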



 Comments   
Comment by Brandon Nesterenko [ 2024-02-06 ]

On further thought, I wonder whether presenting this value may be problematic. I originally encountered this issue when an expected equivalence

Seconds_Behind_Master + SQL_Remaining_Delay == SQL_Delay

didn't hold true, because the mi->clock_diff_with_master variable is used to set Seconds_Behind_Master.

This value is calculated by comparing the primary's system clock, as read with

SELECT UNIX_TIMESTAMP()

against the slave's current clock, at a granularity of one second.
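A minimal Python sketch of the calculation as described above: the primary's SELECT UNIX_TIMESTAMP() result and the replica's current time are each reduced to whole seconds and then subtracted. The function name and the sign convention (positive when the replica's clock reads later than the primary's) are assumptions for illustration, not taken from the server source.

```python
import math
import time
from typing import Optional


def clock_diff_with_master(master_unix_timestamp: str,
                           replica_now: Optional[float] = None) -> int:
    """Whole-second difference between the replica's clock and the primary's.

    `master_unix_timestamp` is the text result of SELECT UNIX_TIMESTAMP()
    on the primary. `replica_now` defaults to the replica's current time.
    Positive means the replica's clock reads later than the primary's.
    """
    if replica_now is None:
        replica_now = time.time()
    # Both readings are truncated to whole seconds before subtracting,
    # so any sub-second information is lost on both sides.
    return math.floor(replica_now) - math.floor(float(master_unix_timestamp))


print(clock_diff_with_master("1700000000", replica_now=1700000005.3))  # 5
```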

The issue in the MTR tests was that the master could be queried for its timestamp just before its clock ticked over to the next second, and by the time the replica checked its own clock, it had reached that next second. So mi->clock_diff_with_master would be computed as 1, even though the master and slave were using the same system clock.
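The boundary effect described above can be reproduced with a toy model in which the primary and replica share one clock, and the replica's reading happens a fixed latency after the primary's. Times are kept in integer microseconds to avoid floating-point rounding. The 2 ms latency is an arbitrary illustrative value, not a measurement, and the helper names are hypothetical.

```python
US_PER_S = 1_000_000


def observed_diff(master_read_us: int, latency_us: int) -> int:
    """Whole-second difference when both sides read the SAME clock.

    The primary reads at `master_read_us`; the replica reads `latency_us`
    later. Both readings are truncated to whole seconds before subtracting.
    """
    replica_read_us = master_read_us + latency_us
    return replica_read_us // US_PER_S - master_read_us // US_PER_S


def fraction_reporting_one(latency_us: int, step_us: int = 1_000) -> float:
    """Fraction of primary read times within one second that yield a diff of 1."""
    samples = range(0, US_PER_S, step_us)
    hits = sum(1 for t in samples if observed_diff(t, latency_us) == 1)
    return hits / len(samples)


# Primary reads 1 ms before a second boundary: the replica lands in the next second.
print(observed_diff(999_000, 2_000))   # 1
# Same latency, mid-second: no spurious difference.
print(observed_diff(500_000, 2_000))   # 0
print(fraction_reporting_one(2_000))   # 0.002
```

In this model, with read times spread uniformly over a second, the chance of a spurious 1 is the latency divided by one second: rare for a single check, but likely to surface eventually across many test runs.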

In a real deployment this could be problematic: the replica's and primary's clocks could be very close (synchronized within NTP tolerances), yet we would report a one-second difference (or potentially more if the system scheduler delays the replica's clock read), and raise false alarms for admins.
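The size of this error can be bounded with a short floor-function argument. Here t_m is the primary's reading in real seconds, δ is the true offset of the replica's clock from the primary's, ε ≥ 0 is the time that passes between the two readings (network round trip plus scheduling), and c is the computed difference. These symbols are introduced for illustration only.

```latex
\begin{gathered}
t_r = t_m + \delta + \varepsilon, \qquad c = \lfloor t_r \rfloor - \lfloor t_m \rfloor \\
\lfloor x \rfloor + \lfloor y \rfloor \le \lfloor x + y \rfloor \le \lfloor x \rfloor + \lfloor y \rfloor + 1
  \quad \text{with } x = t_m,\ y = \delta + \varepsilon \\
\Longrightarrow \quad c \in \{\lfloor \delta + \varepsilon \rfloor,\ \lfloor \delta + \varepsilon \rfloor + 1\}
\end{gathered}
```

So when the clocks agree to well under a second and the read is prompt (0 ≤ δ + ε < 1), the computed difference is 0 or 1 depending only on where the readings fall relative to a second boundary. A scheduling stall that pushes ε past one second can make it larger still.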

Elkin, knielsen, do you have any thoughts?

Comment by Kristian Nielsen [ 2024-02-07 ]

I think your concern that exposing additional information like clock_diff_with_master can easily lead to confusion is very real. Whether that is reason enough not to expose it, I'm not sure, though.

Having followed replication closely for many years, I have seen "Seconds_behind_master" be an endless source of confusion, stemming from the very nature of the problem it is supposed to describe. Every few years someone comes up with a new corner case where they want it to mean something different, or a new idea for how to tune the value. The end result, of course, is that its meaning becomes no clearer or less confusing, only more complicated to understand.

I have long since given up on "fixing" Seconds_behind_master; I think there will always be someone unhappy with how it works, no matter how it is changed.

Generated at Thu Feb 08 10:27:16 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.