Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-25999

Unrealistic "Seconds behind master" value

Details

    Description

      Seconds_Behind_Master can quickly and sporadically spike from 0 to really high values, and then go back to 0 (independent of running on a parallel or serial replica). This is due to the Seconds_Behind_Master computation having a special idleness check which overrides the value to 0. This idleness condition (and thereby when SBM is calculated/written) is based on the SQL thread queueing status:

            idle= mi->rli.sql_thread_caught_up;
      

      Which isn't tightly coupled with user events. That is, the timestamp of internal events (e.g. Format_description_log_event, Gtid_list_log_event, and Binlog_checkpoint_event) aren't saved on the replica, and thereby the Seconds_Behind_Master value won't reflect a more recent replica when these events are processed. However, when these events are queued by the SQL thread, the idleness condition no longer holds, and then Seconds_Behind_Master will be calculated and displayed based on the last timestamp from a user transaction, which can be very large (particularly when a replica is configured with SQL_Delay. This is particularly prevalent during log rotation, as all events generated during this time are internal, and should not be considered in Seconds_Behind_Master.

      The fix should update the idleness condition to not be based on the queueing state of the SQL thread, but rather that there are no new user events from the master to process (e.g. GTID_IO_Pos == GTID_Slave_Pos).

      Test case to reproduce (which requires a small patch for debug_sync integration):

      #
      # TODO: Write me
      #
      # References:
      #   MDEV-:
      #
       
      --source include/have_innodb.inc
      --source include/have_log_bin.inc
      # Format independent
      --source include/have_binlog_format_row.inc
      --source include/master-slave.inc
       
      --connection slave
      --source include/stop_slave.inc
      set @@global.debug_dbug="+d,pause_sql_thread_on_next_fde";
      --source include/start_slave.inc
       
      --echo #
      --echo # Initialize test data
      --connection master
       
      create table t1 (a int primary key) engine=innodb;
      insert into t1 values (1);
      --source include/save_master_gtid.inc
       
      --echo # Catch slave up with master
      --connection slave
      --echo # Debug_sync is set on format description events, so skip 2 (1 for master binlog, 1 for slave relay log)
      set debug_sync="now wait_for paused_on_fde";
      set debug_sync="now signal sql_thread_continue";
      set debug_sync="now wait_for paused_on_fde";
      set debug_sync="now signal sql_thread_continue";
      --source include/sync_with_master_gtid.inc
       
      --echo # Rotate master binary logs to force sql thread to read internal only events
      --connection master
      FLUSH LOGS;
       
       
      --connection slave
       
      --echo #
      --echo # Skip through format description events from relay log file & master
      --echo # binlog until we get to the last one
      --echo #
      set debug_sync="now wait_for paused_on_fde";
      set debug_sync="now signal sql_thread_continue";
      set debug_sync="now wait_for paused_on_fde";
      set debug_sync="now signal sql_thread_continue";
       
      --echo # Now pause on the last format description event
      set debug_sync="now wait_for paused_on_fde";
       
       
      --echo # Wait 2s so SBM has a chance to increase..
      --sleep 2
      --echo # Seconds_Behind_Master should be zero as user events are up-to-date (but is not)
      --let $status_items= Seconds_Behind_Master
      --source include/show_slave_status.inc
       
      --echo # Resume SQL thread
      set debug_sync="now signal sql_thread_continue";
      --echo # TODO clean up debug_dbug and debug_sync status
       
      --echo #
      --echo # Cleanup
      --connection master
      drop table t1;
      --source include/save_master_gtid.inc
       
      --connection slave
      --source include/sync_with_master_gtid.inc
       
      --source include/rpl_end.inc
      --echo # End of .test
      --echo # TODO ^ Fill out with test name
      

      patch for debug_sync (on 10.6 community):

      diff --git a/sql/log_event_server.cc b/sql/log_event_server.cc
      index 8b740195b24..9fa14a6b873 100644
      --- a/sql/log_event_server.cc
      +++ b/sql/log_event_server.cc
      @@ -2426,6 +2426,12 @@ int Format_description_log_event::do_apply_event(rpl_group_info *rgi)
         Relay_log_info *rli= rgi->rli;
         DBUG_ENTER("Format_description_log_event::do_apply_event");
       
      +  DBUG_EXECUTE_IF("pause_sql_thread_on_next_fde", {
      +    DBUG_ASSERT(!debug_sync_set_action(
      +        thd, STRING_WITH_LEN(
      +                 "now SIGNAL paused_on_fde WAIT_FOR sql_thread_continue")));
      +  });
      +
         /*
           As a transaction NEVER spans on 2 or more binlogs:
           if we have an active transaction at this point, the master died
      
      

      Attachments

        Issue Links

          Activity

            Niranjan Niranjan added a comment - - edited

            STEPS TO REPRODUCE THE ISSUE:

            == Set up replication between two MariaDB servers (running on 10.2.38)
            == I do not have any other query running on the DB.
            == On the "Primary" server, repeatedly flush the binary logs:
            I did something like this:
            flush binary logs;show binary logs;select now(),sleep(0.25);flush binary logs;show binary logs;select now(),sleep(0.25)... .... .... .... ... ...

            == On the "replica" server, repeatedly monitor the seconds behind master by running "show slave status\G"

            == I have grepped the output to show only the required details. Pasting a part of it below:

            Replica_Output


            Relay_Master_Log_File: mysql-bin-changelog.000455
            Exec_Master_Log_Pos: 4
            Seconds_Behind_Master: 307
            1 row in set (0.01 sec)


            Relay_Master_Log_File: mysql-bin-changelog.000470
            Exec_Master_Log_Pos: 4
            Seconds_Behind_Master: 1650
            1 row in set (0.00 sec)


            --
            Relay_Master_Log_File: mysql-bin-changelog.000472
            Exec_Master_Log_Pos: 4
            Seconds_Behind_Master: 1650
            1 row in set (0.01 sec)


            --
            Relay_Master_Log_File: mysql-bin-changelog.000474
            Exec_Master_Log_Pos: 4
            Seconds_Behind_Master: 1651
            1 row in set (0.00 sec)


            --
            Relay_Master_Log_File: mysql-bin-changelog.000476
            Exec_Master_Log_Pos: 4
            Seconds_Behind_Master: 1651
            1 row in set (0.00 sec)


            --
            Relay_Master_Log_File: mysql-bin-changelog.000482
            Exec_Master_Log_Pos: 4
            Seconds_Behind_Master: 1653
            1 row in set (0.00 sec)


            --
            Relay_Master_Log_File: mysql-bin-changelog.000496
            Exec_Master_Log_Pos: 4
            Seconds_Behind_Master: 1659
            1 row in set (0.00 sec)


            --
            Relay_Master_Log_File: mysql-bin-changelog.000498
            Exec_Master_Log_Pos: 4
            Seconds_Behind_Master: 1660
            1 row in set (0.00 sec)


            --
            Relay_Master_Log_File: mysql-bin-changelog.000500
            Exec_Master_Log_Pos: 4
            Seconds_Behind_Master: 1660
            1 row in set (0.00 sec)


            --
            Relay_Master_Log_File: mysql-bin-changelog.000510
            Exec_Master_Log_Pos: 4
            Seconds_Behind_Master: 1663
            1 row in set (0.01 sec)


            --
            Relay_Master_Log_File: mysql-bin-changelog.000512
            Exec_Master_Log_Pos: 4
            Seconds_Behind_Master: 1663
            1 row in set (0.00 sec)


            --
            Relay_Master_Log_File: mysql-bin-changelog.000514
            Exec_Master_Log_Pos: 4
            Seconds_Behind_Master: 1664
            1 row in set (0.02 sec)

            Looking at the binary logs of the "Primary" instance corresponding to "Relay_Master_Log_File" and "Exec_Master_Log_Pos", I can just see that its the starting of the file (immediately after rotation of older file).

            MASTER> show binlog events in 'mysql-bin-changelog.000455';
            +----------------------------+-----+-------------------+-----------+-------------+------------------------------------------------+
            | Log_name                   | Pos | Event_type        | Server_id | End_log_pos | Info                                           |
            +----------------------------+-----+-------------------+-----------+-------------+------------------------------------------------+
            | mysql-bin-changelog.000455 |   4 | Format_desc       |       123 |         256 | Server ver: 10.2.38-MariaDB-log, Binlog ver: 4 |
            | mysql-bin-changelog.000455 | 256 | Gtid_list         |       123 |         299 | [0-123-1]                                      |
            | mysql-bin-changelog.000455 | 299 | Binlog_checkpoint |       123 |         352 | mysql-bin-changelog.000454                     |
            | mysql-bin-changelog.000455 | 352 | Binlog_checkpoint |       123 |         405 | mysql-bin-changelog.000455                     |
            | mysql-bin-changelog.000455 | 405 | Rotate            |       123 |         462 | mysql-bin-changelog.000456;pos=4               |
            +----------------------------+-----+-------------------+-----------+-------------+------------------------------------------------+
            5 rows in set (0.00 sec)
            


            Another important observation is that the "seconds behind master" value keeps increasing.
            You can refer the above output itself. Each time wrong value is returned, it is greater than the previous value.

            Niranjan Niranjan added a comment - - edited STEPS TO REPRODUCE THE ISSUE: == Set up replication between two MariaDB servers (running on 10.2.38) == I do not have any other query running on the DB. == On the "Primary" server, repeatedly flush the binary logs: I did something like this: flush binary logs;show binary logs;select now(),sleep(0.25);flush binary logs;show binary logs;select now(),sleep(0.25)... .... .... .... ... ... == On the "replica" server, repeatedly monitor the seconds behind master by running "show slave status\G" == I have grepped the output to show only the required details. Pasting a part of it below: Replica_Output – Relay_Master_Log_File: mysql-bin-changelog.000455 Exec_Master_Log_Pos: 4 Seconds_Behind_Master: 307 1 row in set (0.01 sec) – Relay_Master_Log_File: mysql-bin-changelog.000470 Exec_Master_Log_Pos: 4 Seconds_Behind_Master: 1650 1 row in set (0.00 sec) – -- Relay_Master_Log_File: mysql-bin-changelog.000472 Exec_Master_Log_Pos: 4 Seconds_Behind_Master: 1650 1 row in set (0.01 sec) – -- Relay_Master_Log_File: mysql-bin-changelog.000474 Exec_Master_Log_Pos: 4 Seconds_Behind_Master: 1651 1 row in set (0.00 sec) – -- Relay_Master_Log_File: mysql-bin-changelog.000476 Exec_Master_Log_Pos: 4 Seconds_Behind_Master: 1651 1 row in set (0.00 sec) – -- Relay_Master_Log_File: mysql-bin-changelog.000482 Exec_Master_Log_Pos: 4 Seconds_Behind_Master: 1653 1 row in set (0.00 sec) – -- Relay_Master_Log_File: mysql-bin-changelog.000496 Exec_Master_Log_Pos: 4 Seconds_Behind_Master: 1659 1 row in set (0.00 sec) – -- Relay_Master_Log_File: mysql-bin-changelog.000498 Exec_Master_Log_Pos: 4 Seconds_Behind_Master: 1660 1 row in set (0.00 sec) – -- Relay_Master_Log_File: mysql-bin-changelog.000500 Exec_Master_Log_Pos: 4 Seconds_Behind_Master: 1660 1 row in set (0.00 sec) – -- Relay_Master_Log_File: mysql-bin-changelog.000510 Exec_Master_Log_Pos: 4 Seconds_Behind_Master: 1663 1 row in set (0.01 sec) – -- Relay_Master_Log_File: mysql-bin-changelog.000512 Exec_Master_Log_Pos: 4 Seconds_Behind_Master: 1663 1 row in set (0.00 sec) – -- Relay_Master_Log_File: mysql-bin-changelog.000514 Exec_Master_Log_Pos: 4 Seconds_Behind_Master: 1664 1 row in set (0.02 sec) Looking at the binary logs of the "Primary" instance corresponding to "Relay_Master_Log_File" and "Exec_Master_Log_Pos", I can just see that its the starting of the file (immediately after rotation of older file). MASTER> show binlog events in 'mysql-bin-changelog.000455' ; + ----------------------------+-----+-------------------+-----------+-------------+------------------------------------------------+ | Log_name | Pos | Event_type | Server_id | End_log_pos | Info | + ----------------------------+-----+-------------------+-----------+-------------+------------------------------------------------+ | mysql-bin-changelog.000455 | 4 | Format_desc | 123 | 256 | Server ver: 10.2.38-MariaDB-log, Binlog ver: 4 | | mysql-bin-changelog.000455 | 256 | Gtid_list | 123 | 299 | [0-123-1] | | mysql-bin-changelog.000455 | 299 | Binlog_checkpoint | 123 | 352 | mysql-bin-changelog.000454 | | mysql-bin-changelog.000455 | 352 | Binlog_checkpoint | 123 | 405 | mysql-bin-changelog.000455 | | mysql-bin-changelog.000455 | 405 | Rotate | 123 | 462 | mysql-bin-changelog.000456;pos=4 | + ----------------------------+-----+-------------------+-----------+-------------+------------------------------------------------+ 5 rows in set (0.00 sec) — Another important observation is that the "seconds behind master" value keeps increasing. You can refer the above output itself. Each time wrong value is returned, it is greater than the previous value.
            Elkin Andrei Elkin added a comment -

            bnestere, could you please verify whether the description fits to the fixed MDEV-16091?

            Elkin Andrei Elkin added a comment - bnestere , could you please verify whether the description fits to the fixed MDEV-16091 ?

            It was possible to reproduce this with 10.2.38 by following the steps in the first comment and using a script to output SHOW SLAVE STATUS\G to a file in an infinite loop until manually killed. Grepping for "Slave_SQL_Running_State: Reading event from the relay log" would sometimes show Seconds_Behind_Master as a value on the order of hundreds or thousands of seconds, and always increasing, as described.

            The issue no longer appears in 10.2.42, and testing showed that commit 96de6bf did not have the behavior, while the parent commit 452c9a4 still had the issue.

            angelique.sklavounos Angelique Sklavounos (Inactive) added a comment - It was possible to reproduce this with 10.2.38 by following the steps in the first comment and using a script to output SHOW SLAVE STATUS\G to a file in an infinite loop until manually killed. Grepping for "Slave_SQL_Running_State: Reading event from the relay log" would sometimes show Seconds_Behind_Master as a value on the order of hundreds or thousands of seconds, and always increasing, as described. The issue no longer appears in 10.2.42, and testing showed that commit 96de6bf did not have the behavior, while the parent commit 452c9a4 still had the issue.

            Re-opening as this was not initially reproducible, but later investigation shows the issue is still present.

            bnestere Brandon Nesterenko added a comment - Re-opening as this was not initially reproducible, but later investigation shows the issue is still present.
            ParadoxV5 Jimmy Hú added a comment -

            Approaches suggested during today’s Replication weekly:

            • Deduce the idleness based on the I/O and SQL threads’ positions (e.g., as in the description, GTID_IO_Pos == GTID_Slave_Pos, assuming GTIDs are on)
              • Problem: Positions alone cannot tell what’s “internal” (MDEV-16091) and what’s not.
            • Internals skip flagging the non-idle state but do update the timestamp
              • I don’t think it matters, for the next non-internal event will overwrite the timestamp when it flags non-idle anyway.
            ParadoxV5 Jimmy Hú added a comment - Approaches suggested during today’s Replication weekly: Deduce the idleness based on the I/O and SQL threads’ positions (e.g., as in the description, GTID_IO_Pos == GTID_Slave_Pos , assuming GTIDs are on) Problem: Positions alone cannot tell what’s “internal” ( MDEV-16091 ) and what’s not. Internals skip flagging the non-idle state but do update the timestamp I don’t think it matters, for the next non-internal event will overwrite the timestamp when it flags non-idle anyway.

            ParadoxV5 thanks for summarizing!

            The second point

            Internals skip flagging the non-idle state but do update the timestamp

            is a little off of the idea. It should actually be:

            • for internal events that come from the master: update both 1) the timestamp and 2) the non-idle state (in such a way that there wouldn't be a spike in SBM (Seconds_Behind_Master in-between these updates).
            • for internal events that come from the IO thread (flagged as is_relay_log_event() or is_artificial_event()): neither update the timestamp nor the idle state.

            Though this adds even more complexity to the Seconds_Behind_Master calculation . As the code currently sits, we don't update the timestamp for internal events ever, but we do update the idleness. I think the simplest (and probably most consistent solution) would probably be to just not update idleness for internal events either (regardless of source). This would also leave existing behavior alone as to what is actually reported by Seconds_Behind_Master (so users wouldn't start seeing the value updating when it previously hadn't). Note too, the comment

                    /*
                      Ignore FD's timestamp as it does not reflect the slave execution
                      state but likely to reflect a deep past.
            ...
            

            which re-enforces the idea that the timestamp should be reflective of the slave execution state.

            bnestere Brandon Nesterenko added a comment - ParadoxV5 thanks for summarizing! The second point Internals skip flagging the non-idle state but do update the timestamp is a little off of the idea. It should actually be: for internal events that come from the master: update both 1) the timestamp and 2) the non-idle state (in such a way that there wouldn't be a spike in SBM ( Seconds_Behind_Master in-between these updates). for internal events that come from the IO thread (flagged as is_relay_log_event() or is_artificial_event() ): neither update the timestamp nor the idle state. Though this adds even more complexity to the Seconds_Behind_Master calculation . As the code currently sits, we don't update the timestamp for internal events ever, but we do update the idleness. I think the simplest (and probably most consistent solution) would probably be to just not update idleness for internal events either (regardless of source). This would also leave existing behavior alone as to what is actually reported by Seconds_Behind_Master (so users wouldn't start seeing the value updating when it previously hadn't). Note too, the comment /* Ignore FD's timestamp as it does not reflect the slave execution state but likely to reflect a deep past. ... which re-enforces the idea that the timestamp should be reflective of the slave execution state.
            ParadoxV5 Jimmy Hú added a comment - - edited

            As the code currently sits, we don't update the timestamp for internal events ever, but we do update the idleness.

            Of the 3 places that remove the idle state (sql_thread_caught_up= false), except for the one that’s part of error handling (probably unnecessary), the other two are both guarded under event_can_update_last_master_timestamp() together with the timestamp update.
            In short: The current code neither updates the timestamp nor the idleness for internal events, as of MDEV-16091. Which is

            • for internal events that come from the IO thread (flagged as is_relay_log_event() or is_artificial_event()): neither update the timestamp nor the idle state.

            It’s uncomfortable to say, but bnestere, I doubt your analysis is it.

            11.4 doesn’t look like having a regression either, and now I’m out of ideas.

            ParadoxV5 Jimmy Hú added a comment - - edited As the code currently sits, we don't update the timestamp for internal events ever, but we do update the idleness. Of the 3 places that remove the idle state ( sql_thread_caught_up= false ), except for the one that’s part of error handling (probably unnecessary), the other two are both guarded under event_can_update_last_master_timestamp() together with the timestamp update. In short: The current code neither updates the timestamp nor the idleness for internal events, as of MDEV-16091 . Which is for internal events that come from the IO thread ( flagged as is_relay_log_event() or is_artificial_event() ): neither update the timestamp nor the idle state. It’s uncomfortable to say, but bnestere , I doubt your analysis is it. 11.4 doesn’t look like having a regression either, and now I’m out of ideas.
            ParadoxV5 Jimmy Hú added a comment - - edited

            The issue description does not reproduce the bug (if there is one).
            The Seconds_Behind_Master increase reported is not from a Format Description Event alone, but rather originates from its preceding event (in this case, an INSERT).
            With this addition to the provided draft MTR test:

            @@ -33,11 +33,23 @@ set debug_sync="now wait_for paused_on_fde";
             set debug_sync="now signal sql_thread_continue";
             --source include/sync_with_master_gtid.inc
             
            +--echo # Pause the upcoming INSERT without needing debug_sync
            +LOCK TABLE t1 WRITE NOWAIT;
            +--echo # Add a non-format description event
            +--connection master
            +INSERT INTO t1 VALUES (2);
            +
             --echo # Rotate master binary logs to force sql thread to read internal only events
             --connection master
             FLUSH LOGS;
             
             --connection slave
            +--echo # Wait 3s so SBM has a chance to increase..
            +--sleep 3
            +--echo # Seconds_Behind_Master should not be zero as user events are not up-to-date
            +--let $status_items= Seconds_Behind_Master
            +--source include/show_slave_status.inc
            +UNLOCK TABLES;
             
             --echo #
             --echo # Skip through format description events from relay log file & master
            

            Seconds_Behind_Master will show (at least) 3 before the pause on the last Format Description Event and (at least) 5 after.
            It’s intended that:

            • The (new) INSERT starts the increase – that’s part of Seconds_Behind_Master’s responsibilities.
            • The Format Description Event does not decrease Seconds_Behind_Master, as the state does not know whether there’re more events to come and should not preëmptively reset it.
            ParadoxV5 Jimmy Hú added a comment - - edited The issue description does not reproduce the bug (if there is one). The Seconds_Behind_Master increase reported is not from a Format Description Event alone, but rather originates from its preceding event (in this case, an INSERT). With this addition to the provided draft MTR test: @@ -33,11 +33,23 @@ set debug_sync="now wait_for paused_on_fde"; set debug_sync="now signal sql_thread_continue"; --source include/sync_with_master_gtid.inc   +--echo # Pause the upcoming INSERT without needing debug_sync +LOCK TABLE t1 WRITE NOWAIT; +--echo # Add a non-format description event +--connection master +INSERT INTO t1 VALUES (2); + --echo # Rotate master binary logs to force sql thread to read internal only events --connection master FLUSH LOGS;   --connection slave +--echo # Wait 3s so SBM has a chance to increase.. +--sleep 3 +--echo # Seconds_Behind_Master should not be zero as user events are not up-to-date +--let $status_items= Seconds_Behind_Master +--source include/show_slave_status.inc +UNLOCK TABLES;   --echo # --echo # Skip through format description events from relay log file & master Seconds_Behind_Master will show (at least) 3 before the pause on the last Format Description Event and (at least) 5 after. It’s intended that: The (new) INSERT starts the increase – that’s part of Seconds_Behind_Master ’s responsibilities. The Format Description Event does not decrease Seconds_Behind_Master , as the state does not know whether there’re more events to come and should not preëmptively reset it.

            People

              ParadoxV5 Jimmy Hú
              Niranjan Niranjan
              Votes:
              1 Vote for this issue
              Watchers:
              11 Start watching this issue

              Dates

                Created:
                Updated:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.