[MDEV-30109] Replication lagging up to 4200 seconds with no obvious reasons - Jira

XML

Word

Printable

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Major
Resolution: Not a Bug
Affects Version/s: 10.4.24
Fix Version/s: N/A
Component/s: Replication
Labels:
None
Environment:
Ubuntu 20.04.4 LTS
Mariadb: 10.4.24-MariaDB-1:10.4.24+maria~focal-log

Description

We are facing a very weird issue with replication which randomly starts lagging up to 4200 seconds. It doesn't happen instantly, it's normally running with no lags, i.e. there's 0 delay, then it starts gradually growing two seconds each second up to 4200 seconds.

The master server reports that:

STATE: Master has sent all binlog to slave; waiting for binlog to be updated

which is weird considering that the current/last mariadb-bin.XXXXXXX binary log file is growing at a rate of approximately 300KB (kilobytes) per minute which doesn't sound like the server is idle.

The slave (when a lag occurs) reports that:

                Slave_IO_State: Waiting for master to send event

                   Master_Host: master.com

                   Master_User: replication

                   Master_Port: 3306

                 Connect_Retry: 5

               Master_Log_File: mariadb-bin.002266

           Read_Master_Log_Pos: 636045762

                Relay_Log_File: mysqld-relay-bin.000424

                 Relay_Log_Pos: 327741349

         Relay_Master_Log_File: mariadb-bin.002266

              Slave_IO_Running: Yes

             Slave_SQL_Running: Yes

               Replicate_Do_DB:

           Replicate_Ignore_DB:

            Replicate_Do_Table:

        Replicate_Ignore_Table:

       Replicate_Wild_Do_Table:

   Replicate_Wild_Ignore_Table:

                    Last_Errno: 0

                    Last_Error:

                  Skip_Counter: 0

           Exec_Master_Log_Pos: 327741048

               Relay_Log_Space: 636046422

               Until_Condition: None

                Until_Log_File:

                 Until_Log_Pos: 0

            Master_SSL_Allowed: No

            Master_SSL_CA_File:

            Master_SSL_CA_Path:

               Master_SSL_Cert:

             Master_SSL_Cipher:

                Master_SSL_Key:

         Seconds_Behind_Master: 3230

 Master_SSL_Verify_Server_Cert: No

                 Last_IO_Errno: 0

                 Last_IO_Error:

                Last_SQL_Errno: 0

                Last_SQL_Error:

   Replicate_Ignore_Server_Ids:

              Master_Server_Id: 105

                Master_SSL_Crl:

            Master_SSL_Crlpath:

                    Using_Gtid: Slave_Pos

                   Gtid_IO_Pos: 0-105-10275001360

       Replicate_Do_Domain_Ids:

   Replicate_Ignore_Domain_Ids:

                 Parallel_Mode: conservative

                     SQL_Delay: 0

           SQL_Remaining_Delay: NULL

       Slave_SQL_Running_State: Closing tables

              Slave_DDL_Groups: 4

Slave_Non_Transactional_Groups: 0

    Slave_Transactional_Groups: 154022476

The slave itself serves as a replication master to one or two other slaves.

All the servers are flush with RAM (128GB, swap is disabled, over 100GB of available RAM) running on top of fast SSD storage (load average around ~1 on the slave, ~3 on the master), so IO throughput is not an issue. In terms of network connectivity, there's a 1Gbit/sec link between servers with ~0.4ms latency/ping and no packets drop (less than 0.001% of dropped packets in the worst case):

sudo ping -f -w 10 master.com

PING master.com (AA.BB.CC.DD) 56(84) bytes of data.

--- master.com ping statistics ---

27255 packets transmitted, 27255 received, 0% packet loss, time 10000ms

rtt min/avg/max/mdev = 0.249/0.328/3.132/0.065 ms, ipg/ewma 0.366/0.313 ms

Is this behavior intentional? What can be done to debug the issue and eliminate the replication lag?

In terms of system logs.

Master for today:

-- Logs begin at Thu 2022-04-14 15:13:22 UTC, end at Mon 2022-11-28 10:03:49 UTC. --

Nov 28 00:00:10 master.com mysqld[808]: 2022-11-28  0:00:10 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 5199ms. The settings might not be optimal. (flushed=317 and evicted=0, during the time.)

Nov 28 00:00:24 master.com mysqld[808]: 2022-11-28  0:00:24 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 4348ms. The settings might not be optimal. (flushed=400 and evicted=0, during the time.)

Nov 28 04:33:31 master.com mysqld[808]: 2022-11-28  4:33:31 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 4017ms. The settings might not be optimal. (flushed=401 and evicted=0, during the time.)

Slave for today: none. No logs at all. Running perfectly.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

SHOW GLOBAL VARIABLES.txt
19 kB
2022-11-29 06:01
SLAVE [SHOW FULL PROCESSLIST-G].txt
2 kB
2022-11-29 05:55

Activity

People

Assignee:: Unassigned

Reporter:: Artem S. Tashkinov

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 2022-11-28 09:57

Updated:: 2022-12-06 13:19

Resolved:: 2022-12-06 13:19

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.