[MDEV-9325] inaccurate seconds behind master calculation in chained replication Created: 2015-12-26  Updated: 2016-01-08  Resolved: 2016-01-08

Status: Closed
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.0.23
Fix Version/s: N/A

Type: Bug
Priority: Major
Reporter: Alex
Assignee: Unassigned
Resolution: Not a Bug
Votes: 0
Labels: need_feedback
Environment: CentOS 6.7



 Description   

Hello,

In one of my projects I have a chained replication setup:
master -> (slave|master) -> slave
The 2nd server acts as a slave of the 1st server and as a master for the 3rd server at the same time.

The configuration isn't using GTID.

Lately the 1st slave (the intermediate master) started lagging behind the primary master (because of problems I am currently working on).
However, its own slave does not lag behind it at all, because it replicates only a few tables and so replication happens quite fast.

The last slave in the chain is always 0 seconds behind its master; however, it sometimes reports the overall delay that occurs on the intermediate master (the 1st slave in the chain), as if it were aware of the whole topology and of the replication delay on that intermediate master.

What I know about the calculation of 'seconds behind master' is that the timestamp of the last executed query is compared to the timestamp of the most recent query received from the master (meaning the newest query in the relay log).
In 5.5.x (and all previous versions) it worked exactly as expected, and the slave was aware only of its delay behind its own master.
However, here the slave sometimes shows 0 seconds and sometimes the overall delay behind the very 1st master in the chain.
It could be somehow related to all the GTID info in the binary logs.

As a result, I am not sure I always know the real delay on the last slave.
I think it's important to know each slave's own delay behind its directly connected master.
Or to have an option to switch that info to chain-wide data and display the delay across the whole chain (if that's relevant to anyone at all).

Thanks!
Alex



 Comments   
Comment by Elena Stepanova [ 2016-01-08 ]

I don't see how the behavior could have been different on 5.5. Timestamps of events replicated through a chain are preserved; this is even explicitly documented in the MySQL manual in the context of Seconds_Behind_Master:

The value of Seconds_Behind_Master is based on the timestamps stored in events, which are preserved through replication. This means that if a master M1 is itself a slave of M0, any event from M1's binary log that originates from M0's binary log has M0's timestamp for that event. This enables MySQL to replicate TIMESTAMP successfully. However, the problem for Seconds_Behind_Master is that if M1 also receives direct updates from clients, the Seconds_Behind_Master value randomly fluctuates because sometimes the last event from M1 originates from M0 and sometimes is the result of a direct update on M1.

Same in 5.5, 5.6, 5.7.
A simple experiment on MariaDB 5.5 vs 10.0 confirms that it is the same there.

Possibly you had a somewhat different workflow when you were on 5.5 (for example, maybe you had direct updates on the intermediate master, which made you believe that Seconds_Behind_Master really shows the difference between the last slave and the intermediate master/slave).
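
To illustrate the effect described in the manual excerpt above, here is a minimal Python sketch. It is a simplified model and not the server's actual code: it assumes the reported value is roughly "current time minus the preserved timestamp of the event being applied", and 0 when the SQL thread has nothing to apply. The names `Event`, `seconds_behind_master`, `origin` and `clock_diff` are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    origin: str     # server where the event was first written (hypothetical label)
    timestamp: int  # original timestamp, preserved through every replication hop


def seconds_behind_master(now: int, executing: Optional[Event],
                          clock_diff: int = 0) -> int:
    """Simplified model of the value reported on the last slave.

    Assumption: 0 when no event is being applied, otherwise the age of the
    event's preserved timestamp, corrected by the clock difference.
    """
    if executing is None:
        return 0
    return max(0, now - executing.timestamp - clock_diff)


now = 1000
# M1 lags 300 s behind M0, so the M0 event it has just relayed is 300 s old.
relayed = Event(origin="M0", timestamp=now - 300)
# A direct client write on M1 carries M1's current time.
direct = Event(origin="M1", timestamp=now)

print(seconds_behind_master(now, relayed))  # 300
print(seconds_behind_master(now, direct))   # 0
print(seconds_behind_master(now, None))     # 0
```

Under these assumptions the last slave reports the intermediate master's lag while it applies an event that originated on the first master, and 0 while it applies a direct write from the intermediate master or is idle, which matches the fluctuation between 0 and the chain-wide delay.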

Comment by Alex [ 2016-01-08 ]

Hi Elena,
I've checked everything again and it seems I was wrong about the "bad" behavior. It really does seem to be by design.
I switched to MariaDB 10 when traffic increased and 5.5 replication couldn't keep up. I wanted to use parallel replication, which helped a lot.
I don't remember whether I ever had such problems with 5.5, but now I simply got tired of seeing replication monitors every second that show either 0 or the global delay. And you are absolutely right, that's caused by some direct activity on the intermediate slave/master. So I guess there is nothing to do (though I personally prefer seeing only the delay relative to the directly connected master and not through the whole chain). When all the application/replication problems that I am currently working on go away, my mailbox will "calm down" and I won't see those monitors any more, until the next turnaround of course.

Thank you for your time and the input!

Alex
