Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-16091

Seconds_Behind_Master spikes to millions of seconds

Details

    Description

      Hey everyone

      Rarely, we have a weird issue happening. We have a master <-> master setup. The writes are going to only one of them. Every now and then, the replication lag on the master where we send writes jumps to millions of seconds (ex: 2879736). Then, after 1s, is back to 0. After all, there's no lag, as the writes and the binlog from the other master should be ignored, as this is the server where we write to.
      Both servers have the same timezone and ntp running.
      Replication is set with GTIDs, and parallel threads.

      Sample config:
      slave_net_timeout=2
      gtid-strict-mode=1
      gtid-domain-id=1
      server_id = 101
      sync_binlog = 1
      binlog_format = ROW
      log-slave-updates
      replicate-same-server-id = 0
      log_bin=/var/log/mysql/binlog/mysql-bin.log
      log_bin_index=/var/log/mysql/binlog/mysql-bin.log.index
      relay_log=/var/log/mysql/binlog/mysql-relay-bin
      relay_log_index=/var/log/mysql/binlog/mysql-relay-bin.index
      expire_logs_days=5
      max_binlog_size=1G
      slave_parallel_threads=10

      We have a bunch of monitoring scripts looking at that value and triggering actions and alerts and we want to know what can cause this and if there's anything we can do to avoid this in the future.

      Thank you

      Attachments

        Issue Links

          Activity

            This patch will have a merge conflict in 10.4 with sql/slave.cc. Here is a commit with its resolution: ddd79e2 (taken from 10.4-MDEV-16091-merge)

            bnestere Brandon Nesterenko added a comment - This patch will have a merge conflict in 10.4 with sql/slave.cc. Here is a commit with its resolution: ddd79e2 (taken from 10.4-MDEV-16091-merge )
            james.wang james wang added a comment -

            Hi All,

            I have the same issue for 10.2.27 (not feasible to upgrade to 10.4)

            How to walk around the issue please?

            How to "The fix in 10.2.42 is to mark and remember the event as relaylog event instead use the last_master_timestamp from master again" please?

            Thanks a lot in advance

            james.wang james wang added a comment - Hi All, I have the same issue for 10.2.27 (not feasible to upgrade to 10.4) How to walk around the issue please? How to "The fix in 10.2.42 is to mark and remember the event as relaylog event instead use the last_master_timestamp from master again" please? Thanks a lot in advance
            james.wang james wang added a comment -

            @bnestere @Brandon Nesterenko we have many master-master replication configurations based on 10.2.27 and only one has such issue.

            Is there any way to walk around the issue please? thanks

            james.wang james wang added a comment - @bnestere @Brandon Nesterenko we have many master-master replication configurations based on 10.2.27 and only one has such issue. Is there any way to walk around the issue please? thanks

            The fix is visible in the related commit:

            https://github.com/MariaDB/server/commit/96de6bfd5e0d08bd99a47e8a3d60b1c2900a38f8

            "Problem:
            ========
            A slave’s relay log format description event is used when
            calculating Seconds_Behind_Master (SBM). This forces the SBM
            value to spike when processing these events, as their creation
            date is set to the timestamp that the IO thread begins.

            Solution:
            ========
            When the slave generates a format description event, mark the
            event as a relay log event so it does not update the
            rli->last_master_timestamp variable."

            That one line added in sql/log.cc can probably be included into custom 10.2.27 code you probably use to build your specific version.

            valerii Valerii Kravchuk added a comment - The fix is visible in the related commit: https://github.com/MariaDB/server/commit/96de6bfd5e0d08bd99a47e8a3d60b1c2900a38f8 "Problem: ======== A slave’s relay log format description event is used when calculating Seconds_Behind_Master (SBM). This forces the SBM value to spike when processing these events, as their creation date is set to the timestamp that the IO thread begins. Solution: ======== When the slave generates a format description event, mark the event as a relay log event so it does not update the rli->last_master_timestamp variable." That one line added in sql/log.cc can probably be included into custom 10.2.27 code you probably use to build your specific version.
            rjasdfiii Rick James added a comment -

            I first saw this bug back in MySQL 4.0. It was sporadic and soon went away without any manual action.

            See also https://bugs.mysql.com/bug.php?id=72376

            rjasdfiii Rick James added a comment - I first saw this bug back in MySQL 4.0. It was sporadic and soon went away without any manual action. See also https://bugs.mysql.com/bug.php?id=72376

            People

              bnestere Brandon Nesterenko
              genesisdigital raul (Inactive)
              Votes:
              2 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.