[MDEV-25999] Unrealistic "Seconds behind master" value - Jira

Details

Type: Bug
Status: Open (View Workflow)
Priority: Major
Resolution: Unresolved
Affects Version/s: 10.2(EOL), 10.5, 10.6, 10.11, 11.4, 11.8
Fix Version/s: 10.5, 10.6, 10.11, 11.4, 11.8
Component/s: Replication
Labels:
- seconds-behind-master

Description

Seconds_Behind_Master can quickly and sporadically spike from 0 to really high values, and then go back to 0 (independent of running on a parallel or serial replica). This is due to the Seconds_Behind_Master computation having a special idleness check which overrides the value to 0. This idleness condition (and thereby when SBM is calculated/written) is based on the SQL thread queueing status:

      idle= mi->rli.sql_thread_caught_up;

Which isn't tightly coupled with user events. That is, the timestamp of internal events (e.g. Format_description_log_event, Gtid_list_log_event, and Binlog_checkpoint_event) aren't saved on the replica, and thereby the Seconds_Behind_Master value won't reflect a more recent replica when these events are processed. However, when these events are queued by the SQL thread, the idleness condition no longer holds, and then Seconds_Behind_Master will be calculated and displayed based on the last timestamp from a user transaction, which can be very large (particularly when a replica is configured with SQL_Delay. This is particularly prevalent during log rotation, as all events generated during this time are internal, and should not be considered in Seconds_Behind_Master.

The fix should update the idleness condition to not be based on the queueing state of the SQL thread, but rather that there are no new user events from the master to process (e.g. GTID_IO_Pos == GTID_Slave_Pos).

Test case to reproduce (which requires a small patch for debug_sync integration):

# TODO: Write me

# References:

#   MDEV-:

--source include/have_innodb.inc

--source include/have_log_bin.inc

# Format independent

--source include/have_binlog_format_row.inc

--source include/master-slave.inc

--connection slave

--source include/stop_slave.inc

set @@global.debug_dbug="+d,pause_sql_thread_on_next_fde";

--source include/start_slave.inc

--echo #

--echo # Initialize test data

--connection master

create table t1 (a int primary key) engine=innodb;

insert into t1 values (1);

--source include/save_master_gtid.inc

--echo # Catch slave up with master

--connection slave

--echo # Debug_sync is set on format description events, so skip 2 (1 for master binlog, 1 for slave relay log)

set debug_sync="now wait_for paused_on_fde";

set debug_sync="now signal sql_thread_continue";

set debug_sync="now wait_for paused_on_fde";

set debug_sync="now signal sql_thread_continue";

--source include/sync_with_master_gtid.inc

--echo # Rotate master binary logs to force sql thread to read internal only events

--connection master

FLUSH LOGS;

--connection slave

--echo #

--echo # Skip through format description events from relay log file & master

--echo # binlog until we get to the last one

--echo #

set debug_sync="now wait_for paused_on_fde";

set debug_sync="now signal sql_thread_continue";

set debug_sync="now wait_for paused_on_fde";

set debug_sync="now signal sql_thread_continue";

--echo # Now pause on the last format description event

set debug_sync="now wait_for paused_on_fde";

--echo # Wait 2s so SBM has a chance to increase..

--sleep 2

--echo # Seconds_Behind_Master should be zero as user events are up-to-date (but is not)

--let $status_items= Seconds_Behind_Master

--source include/show_slave_status.inc

--echo # Resume SQL thread

set debug_sync="now signal sql_thread_continue";

--echo # TODO clean up debug_dbug and debug_sync status

--echo #

--echo # Cleanup

--connection master

drop table t1;

--source include/save_master_gtid.inc

--connection slave

--source include/sync_with_master_gtid.inc

--source include/rpl_end.inc

--echo # End of .test

--echo # TODO ^ Fill out with test name

patch for debug_sync (on 10.6 community):

diff --git a/sql/log_event_server.cc b/sql/log_event_server.cc

index 8b740195b24..9fa14a6b873 100644

--- a/sql/log_event_server.cc

+++ b/sql/log_event_server.cc

@@ -2426,6 +2426,12 @@ int Format_description_log_event::do_apply_event(rpl_group_info *rgi)

   Relay_log_info *rli= rgi->rli;

   DBUG_ENTER("Format_description_log_event::do_apply_event");

+  DBUG_EXECUTE_IF("pause_sql_thread_on_next_fde", {

+    DBUG_ASSERT(!debug_sync_set_action(

+        thd, STRING_WITH_LEN(

+                 "now SIGNAL paused_on_fde WAIT_FOR sql_thread_continue")));

+  });

/*

     As a transaction NEVER spans on 2 or more binlogs:

     if we have an active transaction at this point, the master died

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

seconds-behind-master-wrong.png
310 kB
2025-04-03 21:52

Issue Links

relates to

MDEV-16091 Seconds_Behind_Master spikes to millions of seconds

Closed

Activity

Ascending order - Click to sort in descending order

Niranjan added a comment - 2021-06-23 12:46 - edited

STEPS TO REPRODUCE THE ISSUE:

== Set up replication between two MariaDB servers (running on 10.2.38)
== I do not have any other query running on the DB.
== On the "Primary" server, repeatedly flush the binary logs:
I did something like this:
flush binary logs;show binary logs;select now(),sleep(0.25);flush binary logs;show binary logs;select now(),sleep(0.25)... .... .... .... ... ...

== On the "replica" server, repeatedly monitor the seconds behind master by running "show slave status\G"

== I have grepped the output to show only the required details. Pasting a part of it below:

Replica_Output

–
Relay_Master_Log_File: mysql-bin-changelog.000455
Exec_Master_Log_Pos: 4
Seconds_Behind_Master: 307
1 row in set (0.01 sec)

–
Relay_Master_Log_File: mysql-bin-changelog.000470
Exec_Master_Log_Pos: 4
Seconds_Behind_Master: 1650
1 row in set (0.00 sec)

–
--
Relay_Master_Log_File: mysql-bin-changelog.000472
Exec_Master_Log_Pos: 4
Seconds_Behind_Master: 1650
1 row in set (0.01 sec)

–
--
Relay_Master_Log_File: mysql-bin-changelog.000474
Exec_Master_Log_Pos: 4
Seconds_Behind_Master: 1651
1 row in set (0.00 sec)

–
--
Relay_Master_Log_File: mysql-bin-changelog.000476
Exec_Master_Log_Pos: 4
Seconds_Behind_Master: 1651
1 row in set (0.00 sec)

–
--
Relay_Master_Log_File: mysql-bin-changelog.000482
Exec_Master_Log_Pos: 4
Seconds_Behind_Master: 1653
1 row in set (0.00 sec)

–
--
Relay_Master_Log_File: mysql-bin-changelog.000496
Exec_Master_Log_Pos: 4
Seconds_Behind_Master: 1659
1 row in set (0.00 sec)

–
--
Relay_Master_Log_File: mysql-bin-changelog.000498
Exec_Master_Log_Pos: 4
Seconds_Behind_Master: 1660
1 row in set (0.00 sec)

–
--
Relay_Master_Log_File: mysql-bin-changelog.000500
Exec_Master_Log_Pos: 4
Seconds_Behind_Master: 1660
1 row in set (0.00 sec)

–
--
Relay_Master_Log_File: mysql-bin-changelog.000510
Exec_Master_Log_Pos: 4
Seconds_Behind_Master: 1663
1 row in set (0.01 sec)

–
--
Relay_Master_Log_File: mysql-bin-changelog.000512
Exec_Master_Log_Pos: 4
Seconds_Behind_Master: 1663
1 row in set (0.00 sec)

–
--
Relay_Master_Log_File: mysql-bin-changelog.000514
Exec_Master_Log_Pos: 4
Seconds_Behind_Master: 1664
1 row in set (0.02 sec)

Looking at the binary logs of the "Primary" instance corresponding to "Relay_Master_Log_File" and "Exec_Master_Log_Pos", I can just see that its the starting of the file (immediately after rotation of older file).

MASTER> show binlog events in 'mysql-bin-changelog.000455';

+----------------------------+-----+-------------------+-----------+-------------+------------------------------------------------+

| Log_name                   | Pos | Event_type        | Server_id | End_log_pos | Info                                           |

+----------------------------+-----+-------------------+-----------+-------------+------------------------------------------------+

| mysql-bin-changelog.000455 |   4 | Format_desc       |       123 |         256 | Server ver: 10.2.38-MariaDB-log, Binlog ver: 4 |

| mysql-bin-changelog.000455 | 256 | Gtid_list         |       123 |         299 | [0-123-1]                                      |

| mysql-bin-changelog.000455 | 299 | Binlog_checkpoint |       123 |         352 | mysql-bin-changelog.000454                     |

| mysql-bin-changelog.000455 | 352 | Binlog_checkpoint |       123 |         405 | mysql-bin-changelog.000455                     |

| mysql-bin-changelog.000455 | 405 | Rotate            |       123 |         462 | mysql-bin-changelog.000456;pos=4               |

+----------------------------+-----+-------------------+-----------+-------------+------------------------------------------------+

5 rows in set (0.00 sec)

—
Another important observation is that the "seconds behind master" value keeps increasing.
You can refer the above output itself. Each time wrong value is returned, it is greater than the previous value.

Niranjan added a comment - 2021-06-23 12:46 - edited STEPS TO REPRODUCE THE ISSUE: == Set up replication between two MariaDB servers (running on 10.2.38) == I do not have any other query running on the DB. == On the "Primary" server, repeatedly flush the binary logs: I did something like this: flush binary logs;show binary logs;select now(),sleep(0.25);flush binary logs;show binary logs;select now(),sleep(0.25)... .... .... .... ... ... == On the "replica" server, repeatedly monitor the seconds behind master by running "show slave status\G" == I have grepped the output to show only the required details. Pasting a part of it below: Replica_Output – Relay_Master_Log_File: mysql-bin-changelog.000455 Exec_Master_Log_Pos: 4 Seconds_Behind_Master: 307 1 row in set (0.01 sec) – Relay_Master_Log_File: mysql-bin-changelog.000470 Exec_Master_Log_Pos: 4 Seconds_Behind_Master: 1650 1 row in set (0.00 sec) – -- Relay_Master_Log_File: mysql-bin-changelog.000472 Exec_Master_Log_Pos: 4 Seconds_Behind_Master: 1650 1 row in set (0.01 sec) – -- Relay_Master_Log_File: mysql-bin-changelog.000474 Exec_Master_Log_Pos: 4 Seconds_Behind_Master: 1651 1 row in set (0.00 sec) – -- Relay_Master_Log_File: mysql-bin-changelog.000476 Exec_Master_Log_Pos: 4 Seconds_Behind_Master: 1651 1 row in set (0.00 sec) – -- Relay_Master_Log_File: mysql-bin-changelog.000482 Exec_Master_Log_Pos: 4 Seconds_Behind_Master: 1653 1 row in set (0.00 sec) – -- Relay_Master_Log_File: mysql-bin-changelog.000496 Exec_Master_Log_Pos: 4 Seconds_Behind_Master: 1659 1 row in set (0.00 sec) – -- Relay_Master_Log_File: mysql-bin-changelog.000498 Exec_Master_Log_Pos: 4 Seconds_Behind_Master: 1660 1 row in set (0.00 sec) – -- Relay_Master_Log_File: mysql-bin-changelog.000500 Exec_Master_Log_Pos: 4 Seconds_Behind_Master: 1660 1 row in set (0.00 sec) – -- Relay_Master_Log_File: mysql-bin-changelog.000510 Exec_Master_Log_Pos: 4 Seconds_Behind_Master: 1663 1 row in set (0.01 sec) – -- Relay_Master_Log_File: mysql-bin-changelog.000512 Exec_Master_Log_Pos: 4 Seconds_Behind_Master: 1663 1 row in set (0.00 sec) – -- Relay_Master_Log_File: mysql-bin-changelog.000514 Exec_Master_Log_Pos: 4 Seconds_Behind_Master: 1664 1 row in set (0.02 sec) Looking at the binary logs of the "Primary" instance corresponding to "Relay_Master_Log_File" and "Exec_Master_Log_Pos", I can just see that its the starting of the file (immediately after rotation of older file). MASTER> show binlog events in 'mysql-bin-changelog.000455' ; + ----------------------------+-----+-------------------+-----------+-------------+------------------------------------------------+ | Log_name | Pos | Event_type | Server_id | End_log_pos | Info | + ----------------------------+-----+-------------------+-----------+-------------+------------------------------------------------+ | mysql-bin-changelog.000455 | 4 | Format_desc | 123 | 256 | Server ver: 10.2.38-MariaDB-log, Binlog ver: 4 | | mysql-bin-changelog.000455 | 256 | Gtid_list | 123 | 299 | [0-123-1] | | mysql-bin-changelog.000455 | 299 | Binlog_checkpoint | 123 | 352 | mysql-bin-changelog.000454 | | mysql-bin-changelog.000455 | 352 | Binlog_checkpoint | 123 | 405 | mysql-bin-changelog.000455 | | mysql-bin-changelog.000455 | 405 | Rotate | 123 | 462 | mysql-bin-changelog.000456;pos=4 | + ----------------------------+-----+-------------------+-----------+-------------+------------------------------------------------+ 5 rows in set (0.00 sec) — Another important observation is that the "seconds behind master" value keeps increasing. You can refer the above output itself. Each time wrong value is returned, it is greater than the previous value.

Andrei Elkin added a comment - 2022-05-30 10:24

bnestere, could you please verify whether the description fits to the fixed ~~MDEV-16091~~?

Andrei Elkin added a comment - 2022-05-30 10:24 bnestere , could you please verify whether the description fits to the fixed MDEV-16091 ?

Angelique Sklavounos (Inactive) added a comment - 2022-09-30 13:56

It was possible to reproduce this with 10.2.38 by following the steps in the first comment and using a script to output SHOW SLAVE STATUS\G to a file in an infinite loop until manually killed. Grepping for "Slave_SQL_Running_State: Reading event from the relay log" would sometimes show Seconds_Behind_Master as a value on the order of hundreds or thousands of seconds, and always increasing, as described.

The issue no longer appears in 10.2.42, and testing showed that commit 96de6bf did not have the behavior, while the parent commit 452c9a4 still had the issue.

Angelique Sklavounos (Inactive) added a comment - 2022-09-30 13:56 It was possible to reproduce this with 10.2.38 by following the steps in the first comment and using a script to output SHOW SLAVE STATUS\G to a file in an infinite loop until manually killed. Grepping for "Slave_SQL_Running_State: Reading event from the relay log" would sometimes show Seconds_Behind_Master as a value on the order of hundreds or thousands of seconds, and always increasing, as described. The issue no longer appears in 10.2.42, and testing showed that commit 96de6bf did not have the behavior, while the parent commit 452c9a4 still had the issue.

Brandon Nesterenko added a comment - 2025-03-12 15:40

Re-opening as this was not initially reproducible, but later investigation shows the issue is still present.

Brandon Nesterenko added a comment - 2025-03-12 15:40 Re-opening as this was not initially reproducible, but later investigation shows the issue is still present.

Jimmy Hú added a comment - 2025-03-22 04:49

Approaches suggested during today’s Replication weekly:

Deduce the idleness based on the I/O and SQL threads’ positions (e.g., as in the description, GTID_IO_Pos == GTID_Slave_Pos, assuming GTIDs are on)
- Problem: Positions alone cannot tell what’s “internal” (~~MDEV-16091~~) and what’s not.
Internals skip flagging the non-idle state but do update the timestamp
- I don’t think it matters, for the next non-internal event will overwrite the timestamp when it flags non-idle anyway.

Jimmy Hú added a comment - 2025-03-22 04:49 Approaches suggested during today’s Replication weekly: Deduce the idleness based on the I/O and SQL threads’ positions (e.g., as in the description, GTID_IO_Pos == GTID_Slave_Pos , assuming GTIDs are on) Problem: Positions alone cannot tell what’s “internal” ( MDEV-16091 ) and what’s not. Internals skip flagging the non-idle state but do update the timestamp I don’t think it matters, for the next non-internal event will overwrite the timestamp when it flags non-idle anyway.

Brandon Nesterenko added a comment - 2025-03-24 17:45

ParadoxV5 thanks for summarizing!

The second point

Internals skip flagging the non-idle state but do update the timestamp

is a little off of the idea. It should actually be:

for internal events that come from the master: update both 1) the timestamp and 2) the non-idle state (in such a way that there wouldn't be a spike in SBM (Seconds_Behind_Master in-between these updates).
for internal events that come from the IO thread (flagged as is_relay_log_event() or is_artificial_event()): neither update the timestamp nor the idle state.

Though this adds even more complexity to the Seconds_Behind_Master calculation . As the code currently sits, we don't update the timestamp for internal events ever, but we do update the idleness. I think the simplest (and probably most consistent solution) would probably be to just not update idleness for internal events either (regardless of source). This would also leave existing behavior alone as to what is actually reported by Seconds_Behind_Master (so users wouldn't start seeing the value updating when it previously hadn't). Note too, the comment

/*

          Ignore FD's timestamp as it does not reflect the slave execution

          state but likely to reflect a deep past.

...

which re-enforces the idea that the timestamp should be reflective of the slave execution state.

Brandon Nesterenko added a comment - 2025-03-24 17:45 ParadoxV5 thanks for summarizing! The second point Internals skip flagging the non-idle state but do update the timestamp is a little off of the idea. It should actually be: for internal events that come from the master: update both 1) the timestamp and 2) the non-idle state (in such a way that there wouldn't be a spike in SBM ( Seconds_Behind_Master in-between these updates). for internal events that come from the IO thread (flagged as is_relay_log_event() or is_artificial_event() ): neither update the timestamp nor the idle state. Though this adds even more complexity to the Seconds_Behind_Master calculation . As the code currently sits, we don't update the timestamp for internal events ever, but we do update the idleness. I think the simplest (and probably most consistent solution) would probably be to just not update idleness for internal events either (regardless of source). This would also leave existing behavior alone as to what is actually reported by Seconds_Behind_Master (so users wouldn't start seeing the value updating when it previously hadn't). Note too, the comment /* Ignore FD's timestamp as it does not reflect the slave execution state but likely to reflect a deep past. ... which re-enforces the idea that the timestamp should be reflective of the slave execution state.

Jimmy Hú added a comment - 2025-04-04 00:46 - edited

As the code currently sits, we don't update the timestamp for internal events ever, but we do update the idleness.

Of the 3 places that remove the idle state (sql_thread_caught_up= false), except for the one that’s part of error handling (probably unnecessary), the other two are both guarded under event_can_update_last_master_timestamp() together with the timestamp update.
In short: The current code neither updates the timestamp nor the idleness for internal events, as of ~~MDEV-16091~~. Which is

for internal events that come from the IO thread (flagged as is_relay_log_event() or is_artificial_event()): neither update the timestamp nor the idle state.

It’s uncomfortable to say, but bnestere, I doubt your analysis is it.

11.4 doesn’t look like having a regression either, and now I’m out of ideas.

Jimmy Hú added a comment - 2025-04-04 00:46 - edited As the code currently sits, we don't update the timestamp for internal events ever, but we do update the idleness. Of the 3 places that remove the idle state ( sql_thread_caught_up= false ), except for the one that’s part of error handling (probably unnecessary), the other two are both guarded under event_can_update_last_master_timestamp() together with the timestamp update. In short: The current code neither updates the timestamp nor the idleness for internal events, as of MDEV-16091 . Which is for internal events that come from the IO thread ( flagged as is_relay_log_event() or is_artificial_event() ): neither update the timestamp nor the idle state. It’s uncomfortable to say, but bnestere , I doubt your analysis is it. 11.4 doesn’t look like having a regression either, and now I’m out of ideas.

Jimmy Hú added a comment - 2025-04-04 18:35 - edited

The issue description does not reproduce the bug (if there is one).
The Seconds_Behind_Master increase reported is not from a Format Description Event alone, but rather originates from its preceding event (in this case, an INSERT).
With this addition to the provided draft MTR test:

@@ -33,11 +33,23 @@ set debug_sync="now wait_for paused_on_fde";

 set debug_sync="now signal sql_thread_continue";

 --source include/sync_with_master_gtid.inc

+--echo # Pause the upcoming INSERT without needing debug_sync

+LOCK TABLE t1 WRITE NOWAIT;

+--echo # Add a non-format description event

+--connection master

+INSERT INTO t1 VALUES (2);

 --echo # Rotate master binary logs to force sql thread to read internal only events

 --connection master

 FLUSH LOGS;

 --connection slave

+--echo # Wait 3s so SBM has a chance to increase..

+--sleep 3

+--echo # Seconds_Behind_Master should not be zero as user events are not up-to-date

+--let $status_items= Seconds_Behind_Master

+--source include/show_slave_status.inc

+UNLOCK TABLES;

 --echo #

 --echo # Skip through format description events from relay log file & master

Seconds_Behind_Master will show (at least) 3 before the pause on the last Format Description Event and (at least) 5 after.
It’s intended that:

The (new) INSERT starts the increase – that’s part of Seconds_Behind_Master’s responsibilities.
The Format Description Event does not decrease Seconds_Behind_Master, as the state does not know whether there’re more events to come and should not preëmptively reset it.

Jimmy Hú added a comment - 2025-04-04 18:35 - edited The issue description does not reproduce the bug (if there is one). The Seconds_Behind_Master increase reported is not from a Format Description Event alone, but rather originates from its preceding event (in this case, an INSERT). With this addition to the provided draft MTR test: @@ -33,11 +33,23 @@ set debug_sync="now wait_for paused_on_fde"; set debug_sync="now signal sql_thread_continue"; --source include/sync_with_master_gtid.inc +--echo # Pause the upcoming INSERT without needing debug_sync +LOCK TABLE t1 WRITE NOWAIT; +--echo # Add a non-format description event +--connection master +INSERT INTO t1 VALUES (2); + --echo # Rotate master binary logs to force sql thread to read internal only events --connection master FLUSH LOGS; --connection slave +--echo # Wait 3s so SBM has a chance to increase.. +--sleep 3 +--echo # Seconds_Behind_Master should not be zero as user events are not up-to-date +--let $status_items= Seconds_Behind_Master +--source include/show_slave_status.inc +UNLOCK TABLES; --echo # --echo # Skip through format description events from relay log file & master Seconds_Behind_Master will show (at least) 3 before the pause on the last Format Description Event and (at least) 5 after. It’s intended that: The (new) INSERT starts the increase – that’s part of Seconds_Behind_Master ’s responsibilities. The Format Description Event does not decrease Seconds_Behind_Master , as the state does not know whether there’re more events to come and should not preëmptively reset it.

People

Assignee:: Jimmy Hú

Reporter:: Niranjan

Votes:: 1 Vote for this issue

Watchers:: 11 Start watching this issue

Dates

Created:: 2021-06-23 12:32

Updated:: 2025-04-15 23:52

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates

Git Integration