
Seconds_Behind_Master gives impossible values

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 10.6.18
    • Fix Version/s: N/A
    • Component/s: Galera, Replication
    • Labels: None
    • Environment: Linux x86_64

    Description

      After upgrading from 10.6.17 to 10.6.18, our monitoring started alerting multiple times a day from all replicas because Seconds_Behind_Master > 3600.

      It looks to me like the value is no longer in seconds.

      Tested by checking the value every two seconds: it jumps from 0 to 40726, may increase a bit between checks, and then resets back to 0.

      It is completely impossible for this value to jump from 0 to 40726 seconds in two seconds.
      Clearly a regression.

      The value appears to increase at the same time on all replicas; replication is set up to replicate from a Galera cluster.
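The two-second polling check described above can be sketched as a simple plausibility test: replication lag cannot legitimately grow faster than wall-clock time between samples. This is a hypothetical sketch; the function name, interval, and slack value are illustrative, not part of MariaDB or the reporter's monitoring.

```python
# Sanity check for Seconds_Behind_Master readings: lag sampled every
# POLL_INTERVAL seconds cannot legitimately grow by more than the polling
# interval plus a small slack for clock skew and scheduling jitter.

POLL_INTERVAL = 2   # seconds between samples, as in the report
SLACK = 5           # tolerance for jitter (illustrative value)

def is_impossible_jump(prev_lag, curr_lag, interval=POLL_INTERVAL, slack=SLACK):
    """True when the reported lag grew faster than wall-clock time allows."""
    if prev_lag is None or curr_lag is None:
        return False
    return (curr_lag - prev_lag) > (interval + slack)

# The jump from the report: 0 -> 40726 within two seconds.
print(is_impossible_jump(0, 40726))  # True: flag as a bogus reading
print(is_impossible_jump(10, 11))    # False: normal drift
```

A monitoring wrapper would feed consecutive `SHOW REPLICA STATUS` samples of `Seconds_Behind_Master` into this check instead of alerting on the raw value alone.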

      Attachments

        Issue Links

          Activity

            lars_42 Lars Wolff added a comment - - edited

            We just upgraded to 10.6.18 and see the exact same problem. Seconds_Behind_Master jumps between 0 and a random number from 40000 to 42000 instantly.

            Like the OP, this is a replica of a Galera cluster.

            This bug is also present in 10.11 and 11.5.


            bnestere Brandon Nesterenko added a comment -

            Hi sysprg, janlindstrom,

            This looks to be caused by MDEV-31413. Looking in the binary log, the timestamps of the events which replicate mysql.gtid_slave_pos look to be quite incorrect.

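The binlog inspection described above (looking for incorrect event timestamps) can be sketched as a small parser over `mysqlbinlog` text output. This is a hypothetical helper: it assumes the usual `#YYMMDD HH:MM:SS server id ...` event-header format, and the gap threshold is illustrative.

```python
# Spot suspicious timestamp jumps between consecutive events in
# `mysqlbinlog` text output (assumed "#YYMMDD HH:MM:SS server id ..." headers).
import re
from datetime import datetime

HEADER = re.compile(r"^#(\d{6})\s+(\d{1,2}:\d{2}:\d{2}) server id")

def event_timestamps(lines):
    """Yield datetimes parsed from binlog event header lines."""
    for line in lines:
        m = HEADER.match(line)
        if m:
            yield datetime.strptime(f"{m.group(1)} {m.group(2)}", "%y%m%d %H:%M:%S")

def suspicious_gaps(timestamps, max_gap_seconds=3600):
    """Return consecutive timestamp pairs whose gap exceeds max_gap_seconds."""
    out, prev = [], None
    for ts in timestamps:
        if prev is not None and abs((ts - prev).total_seconds()) > max_gap_seconds:
            out.append((prev, ts))
        prev = ts
    return out

sample = [
    "#240601 12:00:00 server id 1  end_log_pos 256",
    "#240601 12:00:01 server id 1  end_log_pos 412",
    "#240602  1:19:47 server id 1  end_log_pos 600",  # ~13h jump, like the bogus lag
]
gaps = suspicious_gaps(event_timestamps(sample))
print(len(gaps))  # 1
```

Running this over the binlogs of each cluster node would surface the kind of inter-server timestamp inconsistencies mentioned in the comment.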
            lars_42 Lars Wolff added a comment -

            Hi Brandon,

            I don't think this is caused by https://jira.mariadb.org/browse/MDEV-31413, as that should be fixed in 10.6.17. In my case the Galera cluster is running 10.6.18.


            bnestere Brandon Nesterenko added a comment -

            Hi lars_42,

            I actually mean that the introduction of the fix for MDEV-31413 may have caused this behavior. It is a little more complicated than just the MariaDB slave version (the reporter upgraded their slave from 10.6.17, which would also be affected); I think the versions of the nodes in the cluster are involved as well. My Galera knowledge is rather limited, so I'll leave further investigation to the Galera team.

            To sysprg and janlindstrom: I was using galera.galera_restart_replica (added with MDEV-31413), and looking through the binlogs on those nodes, I see inconsistencies in the event timestamps between servers; nodes 1 and 2 show some very strange values in their event timestamps.


            hydrapolic Tomáš Mózes added a comment -

            I'm seeing the same issue.

            3 node galera cluster (mariadb 10.6.18) with async delayed replica (mariadb 10.6.19).

            The slave now shows:
            Seconds_Behind_Master: 0 -> ~40000
            SQL_Delay: 21600

            Checking the data, the replica has the latest data and is not delayed at all. Seconds_Behind_Master gives random numbers when queried multiple times.

            Sometimes killing some pending connections to the database helps; it then starts to show numbers like 0, 1, ... 21600, but fails again after some time.

            Reproduced on 2 separate clusters, but the interesting thing is that a third cluster with the same setup doesn't have this issue.

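For an intentionally delayed replica like the one above, Seconds_Behind_Master should hover near the configured SQL_Delay once the replica is otherwise caught up, so a reading of ~40000 against SQL_Delay 21600 is itself a red flag. A minimal plausibility check, assuming an illustrative tolerance:

```python
# For a delayed replica, reported lag should not exceed the configured
# SQL_Delay by more than a small tolerance (value here is illustrative).

def delayed_replica_lag_plausible(seconds_behind, sql_delay, tolerance=300):
    """True when the reported lag is consistent with the configured delay."""
    return seconds_behind <= sql_delay + tolerance

# Values from the comment: SQL_Delay 21600, bogus reading ~40000.
print(delayed_replica_lag_plausible(21600, 21600))  # True
print(delayed_replica_lag_plausible(40000, 21600))  # False
```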

            hydrapolic Tomáš Mózes added a comment -

            Still happens with MariaDB 10.6.20. Besides the MariaDB Galera cluster, there is pt-heartbeat (https://docs.percona.com/percona-toolkit/pt-heartbeat.html) running. We discovered that restarting pt-heartbeat each hour (instead of running it constantly) makes the bogus Seconds_Behind_Master readings go away. However, I don't know why this actually happens. pt-heartbeat just keeps updating the percona.heartbeat table with the current timestamp, which is replicated and checked on the replica server. This didn't happen before.

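The heartbeat-based lag measurement mentioned above sidesteps Seconds_Behind_Master entirely: the primary writes a wall-clock timestamp, and the replica compares the replicated row against its own clock. A minimal sketch of that idea; the function name is illustrative and this is not the real percona.heartbeat schema or pt-heartbeat code:

```python
# Heartbeat-style lag check: compare the last replicated heartbeat
# timestamp with the replica's current clock.
from datetime import datetime, timezone

def heartbeat_lag_seconds(last_heartbeat_ts, now=None):
    """Replica lag as seen through a replicated heartbeat row."""
    now = now or datetime.now(timezone.utc)
    return max(0.0, (now - last_heartbeat_ts).total_seconds())

# Fixed instants for illustration: heartbeat written 3 s before the check.
wrote = datetime(2024, 6, 1, 12, 0, 0, tzinfo=timezone.utc)
seen  = datetime(2024, 6, 1, 12, 0, 3, tzinfo=timezone.utc)
print(heartbeat_lag_seconds(wrote, now=seen))  # 3.0
```

Because this measures actual data freshness, it stays correct even when Seconds_Behind_Master reports bogus values, which is why pt-heartbeat is commonly used as a cross-check.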

            janlindstrom Jan Lindström added a comment -

            This will be fixed as part of MDEV-34924.


            People

              Assignee: janlindstrom Jan Lindström
              Reporter: bleve Tuomo Soini
              Votes: 1
              Watchers: 11

