[MDEV-30109] Replication lagging up to 4200 seconds with no obvious reasons Created: 2022-11-28 Updated: 2022-12-06 Resolved: 2022-12-06 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Replication |
| Affects Version/s: | 10.4.24 |
| Fix Version/s: | N/A |
| Type: | Bug | Priority: | Major |
| Reporter: | Artem S. Tashkinov | Assignee: | Unassigned |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Ubuntu 20.04.4 LTS |
||
| Attachments: |
|
| Description |
|
We are facing a very weird issue with replication, which randomly starts lagging by up to 4200 seconds. It doesn't happen instantly: replication normally runs with no lag (0 seconds of delay), then the lag starts growing gradually, by about two seconds each second, up to 4200 seconds. The master server reports that:
which is weird, considering that the current/last mariadb-bin.XXXXXXX binary log file is growing at a rate of approximately 300 KB (kilobytes) per minute, so the server does not appear to be idle. The slave (when a lag occurs) reports that:
The slave itself serves as a replication master to one or two other slaves. All the servers have plenty of RAM (128 GB, swap disabled, over 100 GB available) and run on fast SSD storage (load average of ~1 on the slave, ~3 on the master), so IO throughput is not an issue. In terms of network connectivity, there's a 1 Gbit/s link between the servers with ~0.4 ms latency (ping) and practically no packet loss (less than 0.001% of packets dropped in the worst case):
Is this behavior intentional? What can be done to debug the issue and eliminate the replication lag? As for the system logs, master for today:
Slave for today: none. No logs at all. Running perfectly. |
| Comments |
| Comment by Angelique Sklavounos (Inactive) [ 2022-11-28 ] |
|
Hi birdie, Thanks for the report. When this happens, could you please provide the output of SHOW PROCESSLIST on the slave? Also, could you please provide the output of SHOW GLOBAL VARIABLES on the slave? Thank you. |
| Comment by Artem S. Tashkinov [ 2022-11-29 ] |
|
Right now with a lag of ~4300 seconds: |
| Comment by Artem S. Tashkinov [ 2022-11-29 ] |
|
SHOW GLOBAL VARIABLES: |
| Comment by Angelique Sklavounos (Inactive) [ 2022-12-01 ] |
|
Hi birdie, The processlist doesn’t seem to indicate a cause for the lag, as the times are 0 or NULL (except for the Slave_IO and the Binlog Dump threads, but these look fine). I was curious to see “Closing tables” show up twice, as this state should only occur briefly, but the time is 0. Is it always 0, or does it increase? If it increases, disk space might be an issue: https://mariadb.com/kb/en/general-thread-states/ . The exec_master_log_pos is behind the read_master_log_pos, which is consistent with a lag. If exec_master_log_pos is still increasing (I imagine it is, because the Slave SQL thread looks fine), then perhaps try parallel replication to reduce the lag (https://mariadb.com/resources/blog/goodbye-replication-lag/), as slave_parallel_threads was set to 0.
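One way to check whether exec_master_log_pos is still increasing is to compare two SHOW SLAVE STATUS samples taken some seconds apart. The following is only a minimal Python sketch. It assumes each status row has already been fetched into a dict keyed by column name (the fetching itself is left out), that binlog file names are zero-padded so they sort as plain strings, and all sample values are made up for illustration:

```python
from typing import Mapping, Optional


def replication_backlog(status: Mapping[str, object]) -> dict:
    """Summarise one SHOW SLAVE STATUS row (column name -> value)."""
    # Master_Log_File is the binlog the IO thread is reading;
    # Relay_Master_Log_File is the binlog the SQL thread is executing.
    same_file = status["Master_Log_File"] == status["Relay_Master_Log_File"]
    read_pos = int(status["Read_Master_Log_Pos"])
    exec_pos = int(status["Exec_Master_Log_Pos"])
    # Positions are byte offsets inside one binlog file, so the difference
    # is only meaningful when both threads are on the same file.
    backlog: Optional[int] = read_pos - exec_pos if same_file else None
    lag = status.get("Seconds_Behind_Master")  # NULL (None) if SQL thread is stopped
    return {
        "same_binlog_file": same_file,
        "backlog_bytes": backlog,
        "seconds_behind_master": None if lag is None else int(lag),
    }


def sql_thread_advancing(earlier: Mapping[str, object],
                         later: Mapping[str, object]) -> bool:
    """True if the SQL thread moved forward between two samples."""
    # Assumption: file names are zero-padded, so string comparison
    # orders them the same way as their sequence numbers.
    before = (str(earlier["Relay_Master_Log_File"]), int(earlier["Exec_Master_Log_Pos"]))
    after = (str(later["Relay_Master_Log_File"]), int(later["Exec_Master_Log_Pos"]))
    return after > before


# Made-up samples, not taken from this report.
sample_1 = {
    "Master_Log_File": "mariadb-bin.000123",
    "Relay_Master_Log_File": "mariadb-bin.000123",
    "Read_Master_Log_Pos": 900000,
    "Exec_Master_Log_Pos": 100000,
    "Seconds_Behind_Master": 4200,
}
sample_2 = dict(sample_1, Read_Master_Log_Pos=950000, Exec_Master_Log_Pos=120000)

print(replication_backlog(sample_2))
print(sql_thread_advancing(sample_1, sample_2))
```

If the SQL thread is advancing but the backlog keeps growing, the slave is applying events more slowly than the master produces them, which is the case parallel replication is meant to help with.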
The server is not necessarily idle: it is just that, at that particular point in time, the Master Thread has read all the events in the binlog. These are the available master states: https://mariadb.com/kb/en/master-thread-states/
I don’t believe this is related to the lag. |
| Comment by Artem S. Tashkinov [ 2022-12-01 ] |
|
Hello @Angelique Sklavounos, Disk space is definitely not an issue: both master and slave have around 800 GB of free space. As for parallel replication, I've just enabled it on all slave servers. The lag has disappeared, but sometimes it creeps up to 15 seconds. We are monitoring the situation. We can live with a 15-second lag, but certainly not with the >5000 seconds we've seen previously. In the past we read that slave_parallel_threads > 0 is not always safe to use. Are there any precautions/pitfalls related to it, or is it totally safe? |
| Comment by Angelique Sklavounos (Inactive) [ 2022-12-06 ] |
|
Hi birdie, That's good to hear. With 10.4, the default mode of parallel replication is Conservative, which should be safe: https://mariadb.com/kb/en/parallel-replication/ |
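For reference, a hedged sketch of the statement sequence for enabling parallel replication on a running slave. The helper name `parallel_replication_statements` and the thread count of 4 are illustrative assumptions, not values from this report. The helper only builds the SQL strings; running them through a client is left out. Both variables can only be changed while the slave threads are stopped, hence the STOP SLAVE / START SLAVE pair:

```python
from typing import List

# Values accepted by slave_parallel_mode in MariaDB.
_MODES = {"conservative", "optimistic", "aggressive", "minimal", "none"}


def parallel_replication_statements(threads: int,
                                    mode: str = "conservative") -> List[str]:
    """Build the SQL statements that enable parallel replication on a slave."""
    if threads < 0:
        raise ValueError("threads must be >= 0 (0 disables parallel replication)")
    if mode not in _MODES:
        raise ValueError(f"unknown slave_parallel_mode: {mode!r}")
    return [
        "STOP SLAVE",  # slave threads must be stopped before changing the settings
        f"SET GLOBAL slave_parallel_threads = {threads}",
        f"SET GLOBAL slave_parallel_mode = '{mode}'",
        "START SLAVE",
    ]


# Illustrative thread count only; pick a value that suits the workload.
for statement in parallel_replication_statements(4):
    print(statement + ";")
```

SET GLOBAL does not survive a server restart, so the same settings would also need to go into the server configuration file to persist.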