Details
-
Bug
-
Status: Closed (View Workflow)
-
Major
-
Resolution: Not a Bug
-
10.4.24
-
None
-
Ubuntu 20.04.4 LTS
Mariadb: 10.4.24-MariaDB-1:10.4.24+maria~focal-log
Description
We are facing a very weird issue with replication which randomly starts lagging up to 4200 seconds. It doesn't happen instantly, it's normally running with no lags, i.e. there's 0 delay, then it starts gradually growing two seconds each second up to 4200 seconds.
The master server reports that:
STATE: Master has sent all binlog to slave; waiting for binlog to be updated
|
which is weird considering that the current/last mariadb-bin.XXXXXXX binary log file is growing at a rate of approximately 300KB (kilobytes) per minute which doesn't sound like the server is idle.
The slave (when a lag occurs) reports that:
Slave_IO_State: Waiting for master to send event
|
Master_Host: master.com
|
Master_User: replication
|
Master_Port: 3306
|
Connect_Retry: 5
|
Master_Log_File: mariadb-bin.002266
|
Read_Master_Log_Pos: 636045762
|
Relay_Log_File: mysqld-relay-bin.000424
|
Relay_Log_Pos: 327741349
|
Relay_Master_Log_File: mariadb-bin.002266
|
Slave_IO_Running: Yes
|
Slave_SQL_Running: Yes
|
Replicate_Do_DB:
|
Replicate_Ignore_DB:
|
Replicate_Do_Table:
|
Replicate_Ignore_Table:
|
Replicate_Wild_Do_Table:
|
Replicate_Wild_Ignore_Table:
|
Last_Errno: 0
|
Last_Error:
|
Skip_Counter: 0
|
Exec_Master_Log_Pos: 327741048
|
Relay_Log_Space: 636046422
|
Until_Condition: None
|
Until_Log_File:
|
Until_Log_Pos: 0
|
Master_SSL_Allowed: No
|
Master_SSL_CA_File:
|
Master_SSL_CA_Path:
|
Master_SSL_Cert:
|
Master_SSL_Cipher:
|
Master_SSL_Key:
|
Seconds_Behind_Master: 3230
|
Master_SSL_Verify_Server_Cert: No
|
Last_IO_Errno: 0
|
Last_IO_Error:
|
Last_SQL_Errno: 0
|
Last_SQL_Error:
|
Replicate_Ignore_Server_Ids:
|
Master_Server_Id: 105
|
Master_SSL_Crl:
|
Master_SSL_Crlpath:
|
Using_Gtid: Slave_Pos
|
Gtid_IO_Pos: 0-105-10275001360
|
Replicate_Do_Domain_Ids:
|
Replicate_Ignore_Domain_Ids:
|
Parallel_Mode: conservative
|
SQL_Delay: 0
|
SQL_Remaining_Delay: NULL
|
Slave_SQL_Running_State: Closing tables
|
Slave_DDL_Groups: 4
|
Slave_Non_Transactional_Groups: 0
|
Slave_Transactional_Groups: 154022476
|
The slave itself serves as a replication master to one or two other slaves.
All the servers are flush with RAM (128GB, swap is disabled, over 100GB of available RAM) running on top of fast SSD storage (load average around ~1 on the slave, ~3 on the master), so IO throughput is not an issue. In terms of network connectivity, there's a 1Gbit/sec link between servers with ~0.4ms latency/ping and no packets drop (less than 0.001% of dropped packets in the worst case):
sudo ping -f -w 10 master.com
|
PING master.com (AA.BB.CC.DD) 56(84) bytes of data.
|
|
--- master.com ping statistics ---
|
27255 packets transmitted, 27255 received, 0% packet loss, time 10000ms
|
rtt min/avg/max/mdev = 0.249/0.328/3.132/0.065 ms, ipg/ewma 0.366/0.313 ms
|
Is this behavior intentional? What can be done to debug the issue and eliminate the replication lag?
In terms of system logs.
Master for today:
-- Logs begin at Thu 2022-04-14 15:13:22 UTC, end at Mon 2022-11-28 10:03:49 UTC. --
|
Nov 28 00:00:10 master.com mysqld[808]: 2022-11-28 0:00:10 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 5199ms. The settings might not be optimal. (flushed=317 and evicted=0, during the time.)
|
Nov 28 00:00:24 master.com mysqld[808]: 2022-11-28 0:00:24 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 4348ms. The settings might not be optimal. (flushed=400 and evicted=0, during the time.)
|
Nov 28 04:33:31 master.com mysqld[808]: 2022-11-28 4:33:31 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 4017ms. The settings might not be optimal. (flushed=401 and evicted=0, during the time.)
|
Slave for today: none. No logs at all. Running perfectly.