[MDEV-17516] Replication lag issue using parallel replication Created: 2018-10-22 Updated: 2023-09-27 |
|
| Status: | Stalled |
| Project: | MariaDB Server |
| Component/s: | Replication |
| Affects Version/s: | 10.1.36 |
| Fix Version/s: | 10.4, 10.5 |
| Type: | Bug | Priority: | Major |
| Reporter: | VAROQUI Stephane | Assignee: | Brandon Nesterenko |
| Resolution: | Unresolved | Votes: | 1 |
| Labels: | seconds-behind-master | ||
| Issue Links: |
|
||||||||||||||||||||||||||||
| Description |
|
Using parallel replication second behind master is wrongly reporting 0 when SQL thread is stopped and restarted long time after. This happen by design but is really a show stopper for most proxy that send traffic to such slave thinking it's in sync with master. My understanding is that slave_behind_master is computed after first commit so in this case the master is 2 days in advance and on a fresh restarted slave we get this
And we can see wrong second behind master
but on his master
A possible solution would be to update Seconds_Behind_Master by injecting a fake event in start slave with the max timestamp of all events read by the leader thread and send to to the worker threads . To reproduce :
|
| Comments |
| Comment by Brandon Nesterenko [ 2022-10-11 ] |
|
Note that updating strategy of Seconds_Behind_Master also needs to consider the behavior of delayed slaves. See |
| Comment by Brandon Nesterenko [ 2022-10-21 ] |
|
Hi Andrei! This is ready for review. PR-2296 |
| Comment by Brandon Nesterenko [ 2023-01-31 ] |
|
The cause of Seconds_Behind_Master (SBM) reporting as 0 after slave restart is that the relay log is empty until the IO thread is able to re-queue the events. That is, after STOP SLAVE has been issued, SBM is invalidated, and presented as NULL by SHOW SLAVE STATUS. Currently, this state is maintained until the slave threads are running again. However, SBM will report as 0 for the time it takes the IO thread to receive an event from the master, queue it to the relay log, and until the SQL thread reads it. This is misleading, as SBM is not actually 0, but should still be invalid because it is not yet known when the last timestamp from the master is. The proposed fix (from Elkin) is to extend the duration of the NULL/invalid state of SBM until the first event is read by the SQL thread. |
| Comment by VAROQUI Stephane [ 2023-02-01 ] |
|
NULL looks like a bad idea what about 666 ? None of the tools around monitoring will have to be changed with a valid numeric value . Unless io_thread reports "No" when second behind_masster is reporting NULL Best would be to refetch the first event from the relay log to extract last event time ever played from the rejoining node and compute the delay from current_time - first_relay_time. and later from connection set it to 0 if the first event receive from leader is same GTID as this one , i don't think this could happen as if exists multiple entry in relay log it's because of a late situation For my knowledge what reason trigger the relay log to be deleted, if they get replayed the time can be read from sql_thread ? |
| Comment by Andrei Elkin [ 2023-02-01 ] |
|
stephane@skysql.com, you have a good point and a nice recommendation. It's better not to endanger monitoring tools. |
| Comment by VAROQUI Stephane [ 2023-02-01 ] |
|
"memorize the last gained SBM", non valid for stop slave with no delay but the next event is a long query To preserve the definition of SBM = time difference of last event in queue and oldest event in the queue being COMMITTED, |
| Comment by VAROQUI Stephane [ 2023-02-01 ] |
|
Andrei i have check the state Seconds_Behind_Master: NULL Slave_IO_Running: No already exist so will not break any tool , So init SBM NULL and transition of Slave_IO_Running to yes after first event fetch is correct |
| Comment by Andrei Elkin [ 2023-02-01 ] |
|
stephane@skysql.com, to the HB exploitation |
| Comment by VAROQUI Stephane [ 2023-02-01 ] |
|
handshake to both of you so |
| Comment by VAROQUI Stephane [ 2023-02-01 ] |
|
At the same time Hancheck + Heartbeat would refine SBM by accounting network time, long waiting request |