
Seconds_Behind_Master gives impossible values

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 10.6.18
    • Fix Version/s: N/A
    • Component/s: Galera, Replication
    • Labels: None
    • Environment: Linux x86_64

    Description

      After upgrading from 10.6.17 to 10.6.18, our monitoring started alerting multiple times a day from all replicas because Seconds_Behind_Master > 3600.

      It looks to me like the value is no longer in seconds.

      Tested by checking the value every two seconds: it jumps from 0 to 40726, may increase a bit between checks, and then resets back to 0.

      It is completely impossible for this value to jump from 0 to 40726 seconds in two seconds.
      Clearly a regression.

      The value appears to increase at the same time on all replicas; replication is set up to replicate from a Galera cluster.
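The two-second polling check described above can be sketched as a simple plausibility test: replication lag cannot legitimately grow faster than wall-clock time between samples. This is a hypothetical sketch; the function name, interval, and slack value are illustrative, not part of MariaDB or the reporter's monitoring.

```python
# Sanity check for Seconds_Behind_Master readings: lag sampled every
# POLL_INTERVAL seconds cannot legitimately grow by more than the polling
# interval plus a small slack for clock skew and scheduling jitter.

POLL_INTERVAL = 2   # seconds between samples, as in the report
SLACK = 5           # tolerance for jitter (illustrative value)

def is_impossible_jump(prev_lag, curr_lag, interval=POLL_INTERVAL, slack=SLACK):
    """True when the reported lag grew faster than wall-clock time allows."""
    if prev_lag is None or curr_lag is None:
        return False
    return (curr_lag - prev_lag) > (interval + slack)

# The jump from the report: 0 -> 40726 within two seconds.
print(is_impossible_jump(0, 40726))  # True: flag as a bogus reading
print(is_impossible_jump(10, 11))    # False: normal drift
```

A monitoring wrapper would feed consecutive `SHOW REPLICA STATUS` samples of `Seconds_Behind_Master` into this check instead of alerting on the raw value alone.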

      Attachments

        Issue Links

          Activity

            lars_42 Lars Wolff added a comment - - edited

            We just upgraded to 10.6.18 and see the exact same problem. Seconds_Behind_Master jumps between 0 and a random number from 40000 to 42000 instantly.

            Like the OP, this is a replica of a Galera cluster.

            This bug is also present in 10.11 and 11.5.


            bnestere Brandon Nesterenko added a comment -

            Hi sysprg, janlindstrom,

            This looks to be caused by MDEV-31413. Looking in the binary log, the timestamps of the events which replicate mysql.gtid_slave_pos look to be quite incorrect.

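The binlog inspection described above (looking for incorrect event timestamps) can be sketched as a small parser over `mysqlbinlog` text output. This is a hypothetical helper: it assumes the usual `#YYMMDD HH:MM:SS server id ...` event-header format, and the gap threshold is illustrative.

```python
# Spot suspicious timestamp jumps between consecutive events in
# `mysqlbinlog` text output (assumed "#YYMMDD HH:MM:SS server id ..." headers).
import re
from datetime import datetime

HEADER = re.compile(r"^#(\d{6})\s+(\d{1,2}:\d{2}:\d{2}) server id")

def event_timestamps(lines):
    """Yield datetimes parsed from binlog event header lines."""
    for line in lines:
        m = HEADER.match(line)
        if m:
            yield datetime.strptime(f"{m.group(1)} {m.group(2)}", "%y%m%d %H:%M:%S")

def suspicious_gaps(timestamps, max_gap_seconds=3600):
    """Return consecutive timestamp pairs whose gap exceeds max_gap_seconds."""
    out, prev = [], None
    for ts in timestamps:
        if prev is not None and abs((ts - prev).total_seconds()) > max_gap_seconds:
            out.append((prev, ts))
        prev = ts
    return out

sample = [
    "#240601 12:00:00 server id 1  end_log_pos 256",
    "#240601 12:00:01 server id 1  end_log_pos 412",
    "#240602  1:19:47 server id 1  end_log_pos 600",  # ~13h jump, like the bogus lag
]
gaps = suspicious_gaps(event_timestamps(sample))
print(len(gaps))  # 1
```

Running this over the binlogs of each cluster node would surface the kind of inter-server timestamp inconsistencies mentioned in the comment.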
            lars_42 Lars Wolff added a comment -

            Hi Brandon,

            I don't think this is caused by https://jira.mariadb.org/browse/MDEV-31413, as that should be fixed in 10.6.17. In my case the Galera cluster is running 10.6.18.


            bnestere Brandon Nesterenko added a comment -

            Hi lars_42,

            I actually mean that the introduction of the fix for MDEV-31413 may have caused this behavior. It is a little more complicated than just the MariaDB slave version (the reporter upgraded their slave from 10.6.17, which would also be affected); I think the versions of the nodes in the cluster are involved as well. My Galera knowledge is rather limited, so I'll leave further investigation to the Galera team.

            To sysprg and janlindstrom: I was using galera.galera_restart_replica (added with MDEV-31413), and looking through the binlogs on those nodes, I see inconsistencies in the event timestamps between servers; nodes 1 and 2 show some very strange values in their event timestamps.


            hydrapolic Tomáš Mózes added a comment -

            I'm seeing the same issue.

            3 node galera cluster (mariadb 10.6.18) with async delayed replica (mariadb 10.6.19).

            The slave now shows:
            Seconds_Behind_Master: 0 -> ~40000
            SQL_Delay: 21600

            Checking the data, the replica has the latest data and is not delayed at all. Seconds_Behind_Master gives random numbers when queried multiple times.

            Sometimes killing some pending connections to the database helps; it then starts to show numbers like 0, 1, ... 21600, but fails again after some time.

            Reproduced on 2 separate clusters, but the interesting thing is that a third cluster with the same setup doesn't have this issue.

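For an intentionally delayed replica like the one above, Seconds_Behind_Master should hover near the configured SQL_Delay once the replica is otherwise caught up, so a reading of ~40000 against SQL_Delay 21600 is itself a red flag. A minimal plausibility check, assuming an illustrative tolerance:

```python
# For a delayed replica, reported lag should not exceed the configured
# SQL_Delay by more than a small tolerance (value here is illustrative).

def delayed_replica_lag_plausible(seconds_behind, sql_delay, tolerance=300):
    """True when the reported lag is consistent with the configured delay."""
    return seconds_behind <= sql_delay + tolerance

# Values from the comment: SQL_Delay 21600, bogus reading ~40000.
print(delayed_replica_lag_plausible(21600, 21600))  # True
print(delayed_replica_lag_plausible(40000, 21600))  # False
```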

            hydrapolic Tomáš Mózes added a comment -

            Still happens with MariaDB 10.6.20. Besides the MariaDB Galera cluster, there is pt-heartbeat (https://docs.percona.com/percona-toolkit/pt-heartbeat.html) running. We discovered that restarting pt-heartbeat each hour (instead of running it constantly) makes the bogus Seconds_Behind_Master readings go away. However, I don't know why this actually happens. pt-heartbeat just keeps updating the percona.heartbeat table with the current timestamp, which is replicated and checked on the replica server. This didn't happen before.

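The heartbeat-based lag measurement mentioned above sidesteps Seconds_Behind_Master entirely: the primary writes a wall-clock timestamp, and the replica compares the replicated row against its own clock. A minimal sketch of that idea; the function name is illustrative and this is not the real percona.heartbeat schema or pt-heartbeat code:

```python
# Heartbeat-style lag check: compare the last replicated heartbeat
# timestamp with the replica's current clock.
from datetime import datetime, timezone

def heartbeat_lag_seconds(last_heartbeat_ts, now=None):
    """Replica lag as seen through a replicated heartbeat row."""
    now = now or datetime.now(timezone.utc)
    return max(0.0, (now - last_heartbeat_ts).total_seconds())

# Fixed instants for illustration: heartbeat written 3 s before the check.
wrote = datetime(2024, 6, 1, 12, 0, 0, tzinfo=timezone.utc)
seen  = datetime(2024, 6, 1, 12, 0, 3, tzinfo=timezone.utc)
print(heartbeat_lag_seconds(wrote, now=seen))  # 3.0
```

Because this measures actual data freshness, it stays correct even when Seconds_Behind_Master reports bogus values, which is why pt-heartbeat is commonly used as a cross-check.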

            janlindstrom Jan Lindström added a comment -

            This will be fixed as part of MDEV-34924.


            People

              Assignee: janlindstrom Jan Lindström
              Reporter: bleve Tuomo Soini
              Votes: 1
              Watchers: 11

