We have a replication slave that ran into some problems and fell far behind.
Seconds_Behind_Master seems to be correct as long as replication is executing updates that complete quickly. On this system, however, many of the updates reference fields with no index, and since the ibd file for this table is 8.4TB in size, each of these updates takes an extremely long time. While replication is executing one of these very slow updates, Seconds_Behind_Master shows 0 seconds, even though replication is actually over 10 days behind at that moment. There are a lot of these slow updates, mixed in with updates that complete much faster.
We have Zabbix alerting set up for replication lag, but every time the alert fired, it was quickly resolved, because Seconds_Behind_Master would show 0 shortly after it showed the real number. As a result, we didn't realize there was a problem right away: all the alerts were resolved automatically.
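To illustrate the alerting side of this, here is a minimal Python sketch of the kind of hold-down logic that would have kept the alert open despite the intermittent 0 readings. It is not our Zabbix configuration; the class name `LagAlert` and the parameters `threshold_s` and `clear_after` are made up for illustration, and the sample values are invented. It only works around the flapping and does nothing about the incorrect Seconds_Behind_Master value itself.

```python
from typing import Optional


class LagAlert:
    """Keep a lag alert open until lag stays low for several samples in a row."""

    def __init__(self, threshold_s: int = 300, clear_after: int = 5) -> None:
        self.threshold_s = threshold_s  # lag in seconds that raises the alert (assumed value)
        self.clear_after = clear_after  # consecutive low samples required to clear (assumed value)
        self.active = False
        self._low_streak = 0

    def observe(self, seconds_behind: Optional[int]) -> bool:
        """Feed one Seconds_Behind_Master sample; return whether the alert is active."""
        # None stands for a NULL reading; this sketch chooses to treat it as a problem.
        if seconds_behind is None or seconds_behind >= self.threshold_s:
            self.active = True
            self._low_streak = 0
        elif self.active:
            # A low reading while the alert is open only counts toward clearing it.
            self._low_streak += 1
            if self._low_streak >= self.clear_after:
                self.active = False
                self._low_streak = 0
        return self.active


if __name__ == "__main__":
    alert = LagAlert(threshold_s=300, clear_after=3)
    # Invented samples: real lag (10 days = 864000 s) interleaved with bogus zeros.
    for sample in [0, 864000, 0, 0, 864000, 0, 0, 0]:
        print(sample, alert.observe(sample))
```

With logic like this, a single large reading keeps the alert open until several consecutive low readings are seen, so a brief 0 in between does not resolve it. It would still miss the problem if the real value shows up less often than the polling interval can catch.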
Please let me know what information you need to troubleshoot further. If you need one of the binlog files, I will have to send it privately, as those logs contain confidential data protected by HIPAA.
We know that we are very far behind on the 10.3.x release. The master server is a production system that is central to everything we do for this customer. Updating it will require a significant effort including a change request that the customer must approve. We're going to get that process started, but it could take a while.