[MDEV-30109] Replication lagging up to 4200 seconds with no obvious reasons Created: 2022-11-28  Updated: 2022-12-06  Resolved: 2022-12-06

Status: Closed
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.4.24
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Artem S. Tashkinov Assignee: Unassigned
Resolution: Not a Bug Votes: 0
Labels: None
Environment:

Ubuntu 20.04.4 LTS
Mariadb: 10.4.24-MariaDB-1:10.4.24+maria~focal-log


Attachments: SHOW GLOBAL VARIABLES.txt, SLAVE [SHOW FULL PROCESSLIST-G].txt

 Description   

We are facing a very weird issue with replication, which randomly starts lagging by up to 4200 seconds. It doesn't happen instantly: replication normally runs with no lag (0 seconds delay), then the delay starts growing gradually, roughly two seconds per second, up to 4200 seconds.

The master server reports that:

STATE: Master has sent all binlog to slave; waiting for binlog to be updated

which is weird considering that the current/last mariadb-bin.XXXXXXX binary log file is growing at a rate of approximately 300 KB per minute, which doesn't suggest the server is idle.

The slave (when a lag occurs) reports that:

                Slave_IO_State: Waiting for master to send event
                   Master_Host: master.com
                   Master_User: replication
                   Master_Port: 3306
                 Connect_Retry: 5
               Master_Log_File: mariadb-bin.002266
           Read_Master_Log_Pos: 636045762
                Relay_Log_File: mysqld-relay-bin.000424
                 Relay_Log_Pos: 327741349
         Relay_Master_Log_File: mariadb-bin.002266
              Slave_IO_Running: Yes
             Slave_SQL_Running: Yes
               Replicate_Do_DB: 
           Replicate_Ignore_DB: 
            Replicate_Do_Table: 
        Replicate_Ignore_Table: 
       Replicate_Wild_Do_Table: 
   Replicate_Wild_Ignore_Table: 
                    Last_Errno: 0
                    Last_Error: 
                  Skip_Counter: 0
           Exec_Master_Log_Pos: 327741048
               Relay_Log_Space: 636046422
               Until_Condition: None
                Until_Log_File: 
                 Until_Log_Pos: 0
            Master_SSL_Allowed: No
            Master_SSL_CA_File: 
            Master_SSL_CA_Path: 
               Master_SSL_Cert: 
             Master_SSL_Cipher: 
                Master_SSL_Key: 
         Seconds_Behind_Master: 3230
 Master_SSL_Verify_Server_Cert: No
                 Last_IO_Errno: 0
                 Last_IO_Error: 
                Last_SQL_Errno: 0
                Last_SQL_Error: 
   Replicate_Ignore_Server_Ids: 
              Master_Server_Id: 105
                Master_SSL_Crl: 
            Master_SSL_Crlpath: 
                    Using_Gtid: Slave_Pos
                   Gtid_IO_Pos: 0-105-10275001360
       Replicate_Do_Domain_Ids: 
   Replicate_Ignore_Domain_Ids: 
                 Parallel_Mode: conservative
                     SQL_Delay: 0
           SQL_Remaining_Delay: NULL
       Slave_SQL_Running_State: Closing tables
              Slave_DDL_Groups: 4
Slave_Non_Transactional_Groups: 0
    Slave_Transactional_Groups: 154022476

The slave itself serves as a replication master to one or two other slaves.

All the servers are well provisioned with RAM (128 GB each, swap disabled, over 100 GB of available RAM) and run on fast SSD storage (load average around ~1 on the slave, ~3 on the master), so I/O throughput is not an issue. In terms of network connectivity, there's a 1 Gbit/s link between the servers with ~0.4 ms latency/ping and no packet drops (fewer than 0.001% dropped packets in the worst case):

sudo ping -f -w 10 master.com
PING master.com (AA.BB.CC.DD) 56(84) bytes of data.
 
--- master.com ping statistics ---
27255 packets transmitted, 27255 received, 0% packet loss, time 10000ms
rtt min/avg/max/mdev = 0.249/0.328/3.132/0.065 ms, ipg/ewma 0.366/0.313 ms

Is this behavior intentional? What can be done to debug the issue and eliminate the replication lag?

In terms of system logs:

Master for today:

-- Logs begin at Thu 2022-04-14 15:13:22 UTC, end at Mon 2022-11-28 10:03:49 UTC. --
Nov 28 00:00:10 master.com mysqld[808]: 2022-11-28  0:00:10 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 5199ms. The settings might not be optimal. (flushed=317 and evicted=0, during the time.)
Nov 28 00:00:24 master.com mysqld[808]: 2022-11-28  0:00:24 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 4348ms. The settings might not be optimal. (flushed=400 and evicted=0, during the time.)
Nov 28 04:33:31 master.com mysqld[808]: 2022-11-28  4:33:31 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 4017ms. The settings might not be optimal. (flushed=401 and evicted=0, during the time.)

Slave for today: none. No logs at all. Running perfectly.



 Comments   
Comment by Angelique Sklavounos (Inactive) [ 2022-11-28 ]

Hi birdie,

Thanks for the report. When this happens, could you please provide the output of SHOW PROCESSLIST on the slave?

Also, could you please provide the output of SHOW GLOBAL VARIABLES on the slave?

Thank you.

Comment by Artem S. Tashkinov [ 2022-11-29 ]

Right now with a lag of ~4300 seconds:

SLAVE [SHOW FULL PROCESSLIST-G].txt

Comment by Artem S. Tashkinov [ 2022-11-29 ]

SHOW GLOBAL VARIABLES:

SHOW GLOBAL VARIABLES.txt

Comment by Angelique Sklavounos (Inactive) [ 2022-12-01 ]

Hi birdie

The processlist doesn’t seem to indicate a cause for the lag, as the times are 0 or NULL (except for the Slave_IO and Binlog Dump threads, but those look fine).

I was curious to see “Closing tables” show up twice, as this state should only occur briefly, but the time is 0. Is it always 0, or does it increase? In the latter case, disk space might be an issue: https://mariadb.com/kb/en/general-thread-states/

Exec_Master_Log_Pos is behind Read_Master_Log_Pos, which is consistent with a lag. If Exec_Master_Log_Pos is still increasing (I imagine it is, because the slave SQL thread looks fine), then perhaps try parallel replication to reduce the lag (https://mariadb.com/resources/blog/goodbye-replication-lag/), as slave_parallel_threads was set to 0.
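For reference, slave_parallel_threads can be changed dynamically while the slave threads are stopped; a minimal sketch (the thread count of 4 is only an example value, to be tuned to the workload, not a recommendation):

```sql
-- Run on the slave. slave_parallel_threads can only be changed
-- while the slave threads are stopped.
STOP SLAVE;
SET GLOBAL slave_parallel_threads = 4;  -- example value, tune to the workload
START SLAVE;
```

To make the setting survive a server restart, it should also be added to the server configuration file.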

which is weird considering that the current/last mariadb-bin.XXXXXXX binary log file is growing at a rate of approximately 300KB (kilobytes) per minute which doesn't sound like the server is idle.

The server is not necessarily idle; the state just means that, at that particular point in time, the master thread had sent all the events currently in the binlog to the slave. These are the available master states: https://mariadb.com/kb/en/master-thread-states/

InnoDB: page_cleaner: 1000ms intended loop took 5199ms. The settings might not be optimal. (flushed=317 and evicted=0, during the time.)

I don’t believe this is related to the lag. MDEV-13670 might be of interest.

Comment by Artem S. Tashkinov [ 2022-12-01 ]

Hello @Angelique Sklavounos,

Disk space is definitely not an issue: both master and slave have around 800GB of free space.

As for parallel replication, I've just enabled it on all slave servers. The lag has disappeared, though it sometimes creeps up to 15 seconds. We are monitoring the situation. We can live with a 15-second lag, but certainly not with the >5000 seconds we've seen previously.

In the past we read that slave_parallel_threads > 0 is not always safe to use. Are there any precautions or pitfalls related to it, or is it totally safe?

Comment by Angelique Sklavounos (Inactive) [ 2022-12-06 ]

Hi birdie,

That's good to hear. In 10.4, the default mode of parallel replication is conservative, which should be safe: https://mariadb.com/kb/en/parallel-replication/
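For completeness, a persistent configuration sketch (the thread count is an example value, and conservative is already the 10.4 default, so spelling it out is optional):

```ini
# my.cnf on the slave -- example values, not a recommendation
[mysqld]
slave_parallel_threads = 4
slave_parallel_mode    = conservative
```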

Generated at Thu Feb 08 10:13:43 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.