[MDEV-26346] Seconds_Behind_Master is 0 when replicating very slow update queries Created: 2021-08-12  Updated: 2021-08-13

Status: Open
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.3.17
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Shawn Heisey Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None
Environment:

CentOS 8, installed from packages. Dell server hardware, 64 bit.



 Description   

We have a replication that had some problems and got way behind.

Seconds_Behind_Master seems to be correct as long as replication is executing updates that complete quickly. But for this system, many updates have been made that reference fields with no index, and since the ibd file for this table is 8.4TB in size, these updates take FOREVER. It seems that while replication is doing one of these very slow updates, Seconds_Behind_Master shows 0 seconds, when at this moment, replication is over 10 days behind. There are a LOT of these updates mixed in with updates that happen a lot faster.

We have zabbix alerting set up for replication lag, but every time the alert happened, the alert quickly got resolved, because Seconds_Behind_Master would show 0 shortly after it showed the real number. So we didn't realize there was a problem right away – all the alerts were resolved automatically.

Please let me know what information you need to troubleshoot further. If you need one of the binlog files, I need to send that privately, as those logs contain confidential data protected by HIPAA.

We know that we are very far behind on the 10.3.x release. The master server is a production system that is central to everything we do for this customer. Updating it will require a significant effort including a change request that the customer must approve. We're going to get that process started, but it could take a while.



 Comments   
Comment by Shawn Heisey [ 2021-08-13 ]

I wrote a perl script to gather info about this problem and print it out. Here's the output. I will include the script momentarily.

[root@dr-prod-webservices ~]# watchrep
Fri Aug 13 11:52:13 2021: Elapsed time for behind=0: 25 (last: 909824)
Fri Aug 13 11:52:27 2021: Elapsed time for behind=0: 14 (last: 909838)
Fri Aug 13 11:53:01 2021: Elapsed time for behind=0: 34 (last: 909872)
Fri Aug 13 11:53:19 2021: Elapsed time for behind=0: 18 (last: 909890)
Fri Aug 13 11:53:32 2021: Elapsed time for behind=0: 13 (last: 909903)
Fri Aug 13 11:53:44 2021: Elapsed time for behind=0: 12 (last: 909915)
Fri Aug 13 11:54:01 2021: Elapsed time for behind=0: 17 (last: 909932)
Fri Aug 13 11:54:12 2021: Elapsed time for behind=0: 11 (last: 909943)
Fri Aug 13 11:54:21 2021: Elapsed time for behind=0: 9 (last: 909952)
Fri Aug 13 11:54:27 2021: Elapsed time for behind=0: 6 (last: 909958)
Fri Aug 13 11:54:38 2021: Elapsed time for long query: 170
Fri Aug 13 11:54:38 2021: Elapsed time for behind=0: 11 (last: 909945)
Fri Aug 13 11:55:14 2021: Elapsed time for behind=0: 36 (last: 909980)
Fri Aug 13 11:55:31 2021: Elapsed time for behind=0: 17 (last: 909997)
Fri Aug 13 11:56:25 2021: Elapsed time for behind=0: 54 (last: 910051)
Fri Aug 13 11:56:43 2021: Elapsed time for behind=0: 18 (last: 910069)
Fri Aug 13 11:57:26 2021: Elapsed time for behind=0: 43 (last: 910112)
Fri Aug 13 11:57:41 2021: Elapsed time for behind=0: 15 (last: 910127)
Fri Aug 13 11:58:04 2021: Elapsed time for behind=0: 23 (last: 910150)
Fri Aug 13 11:58:09 2021: Elapsed time for behind=0: 5 (last: 910155)
Fri Aug 13 11:58:21 2021: Elapsed time for behind=0: 12 (last: 910167)
Fri Aug 13 11:58:54 2021: Elapsed time for long query: 255
Fri Aug 13 11:58:54 2021: Elapsed time for behind=0: 33 (last: 186485)
Fri Aug 13 11:59:10 2021: Elapsed time for behind=0: 16 (last: 910167)
Fri Aug 13 12:00:09 2021: Elapsed time for behind=0: 59 (last: 910226)
Fri Aug 13 12:00:25 2021: Elapsed time for behind=0: 16 (last: 910242)
Fri Aug 13 12:00:50 2021: Elapsed time for behind=0: 25 (last: 910267)
Fri Aug 13 12:01:10 2021: Elapsed time for behind=0: 20 (last: 910287)
Fri Aug 13 12:01:43 2021: Elapsed time for behind=0: 33 (last: 910320)
Fri Aug 13 12:02:56 2021: Elapsed time for long query: 241

Comment by Shawn Heisey [ 2021-08-13 ]

Script:

#!/usr/bin/perl
 
my $logfile;
my $lastfile = "";
my $logpos;
my $lastpos = 0;
my $elapsed;
my $query_start = time;
my $behind_start = $query_start;
 
sub elapsed {
    my $start_time = shift(@_);
    return time - $start_time;
}
 
# -- Main routine starts here
while(1) {
  my $output = qx(echo "show slave status\\G" | mysql);
  my @lines = split /\n/, $output;
  foreach my $line (@lines) {
    # print ("$1 \n") if($time =~ /(\d\d:\d\d:\d\d)/);
    $logpos = $1 if $line =~ /Exec_Master_Log_Pos: (\d+)/;
    $logfile = $1 if $line =~ /Master_Log_File: (\S+)/;
    $behind = $1 if $line =~ /Seconds_Behind_Master: (\d+)/;
  }
  if ($logpos != $lastpos || $logfile != $lastfile) {
    $lastpos = $logpos;
    $lastfile = $logfile;
    $elapsed = elapsed($query_start);
    $query_start = time;
    my $now = localtime;
    if ($elapsed >= 2) {
      print "$now: Elapsed time for long query: $elapsed\n";
    }
  }
  if ($behind != 0) {
    $elapsed = elapsed($behind_start);
    my $now = localtime;
    $behind_start = time;
    if ($elapsed >= 2) {
      print "$now: Elapsed time for behind=0: $elapsed (last: $behind)\n";
    }
  }
  sleep 1;
}

Comment by Shawn Heisey [ 2021-08-13 ]

I found a mistake in the script. I don't think it actually made any difference, but I have corrected it and updated the paste in the previous comment.

Generated at Thu Feb 08 09:44:36 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.