Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-30109

Replication lagging up to 4200 seconds with no obvious reasons

    XMLWordPrintable

Details

    • Bug
    • Status: Closed (View Workflow)
    • Major
    • Resolution: Not a Bug
    • 10.4.24
    • N/A
    • Replication
    • None
    • Ubuntu 20.04.4 LTS
      Mariadb: 10.4.24-MariaDB-1:10.4.24+maria~focal-log

    Description

      We are facing a very weird issue with replication which randomly starts lagging up to 4200 seconds. It doesn't happen instantly, it's normally running with no lags, i.e. there's 0 delay, then it starts gradually growing two seconds each second up to 4200 seconds.

      The master server reports that:

      STATE: Master has sent all binlog to slave; waiting for binlog to be updated
      

      which is weird considering that the current/last mariadb-bin.XXXXXXX binary log file is growing at a rate of approximately 300KB (kilobytes) per minute which doesn't sound like the server is idle.

      The slave (when a lag occurs) reports that:

                      Slave_IO_State: Waiting for master to send event
                         Master_Host: master.com
                         Master_User: replication
                         Master_Port: 3306
                       Connect_Retry: 5
                     Master_Log_File: mariadb-bin.002266
                 Read_Master_Log_Pos: 636045762
                      Relay_Log_File: mysqld-relay-bin.000424
                       Relay_Log_Pos: 327741349
               Relay_Master_Log_File: mariadb-bin.002266
                    Slave_IO_Running: Yes
                   Slave_SQL_Running: Yes
                     Replicate_Do_DB: 
                 Replicate_Ignore_DB: 
                  Replicate_Do_Table: 
              Replicate_Ignore_Table: 
             Replicate_Wild_Do_Table: 
         Replicate_Wild_Ignore_Table: 
                          Last_Errno: 0
                          Last_Error: 
                        Skip_Counter: 0
                 Exec_Master_Log_Pos: 327741048
                     Relay_Log_Space: 636046422
                     Until_Condition: None
                      Until_Log_File: 
                       Until_Log_Pos: 0
                  Master_SSL_Allowed: No
                  Master_SSL_CA_File: 
                  Master_SSL_CA_Path: 
                     Master_SSL_Cert: 
                   Master_SSL_Cipher: 
                      Master_SSL_Key: 
               Seconds_Behind_Master: 3230
       Master_SSL_Verify_Server_Cert: No
                       Last_IO_Errno: 0
                       Last_IO_Error: 
                      Last_SQL_Errno: 0
                      Last_SQL_Error: 
         Replicate_Ignore_Server_Ids: 
                    Master_Server_Id: 105
                      Master_SSL_Crl: 
                  Master_SSL_Crlpath: 
                          Using_Gtid: Slave_Pos
                         Gtid_IO_Pos: 0-105-10275001360
             Replicate_Do_Domain_Ids: 
         Replicate_Ignore_Domain_Ids: 
                       Parallel_Mode: conservative
                           SQL_Delay: 0
                 SQL_Remaining_Delay: NULL
             Slave_SQL_Running_State: Closing tables
                    Slave_DDL_Groups: 4
      Slave_Non_Transactional_Groups: 0
          Slave_Transactional_Groups: 154022476
      

      The slave itself serves as a replication master to one or two other slaves.

      All the servers are flush with RAM (128GB, swap is disabled, over 100GB of available RAM) running on top of fast SSD storage (load average around ~1 on the slave, ~3 on the master), so IO throughput is not an issue. In terms of network connectivity, there's a 1Gbit/sec link between servers with ~0.4ms latency/ping and no packets drop (less than 0.001% of dropped packets in the worst case):

      sudo ping -f -w 10 master.com
      PING master.com (AA.BB.CC.DD) 56(84) bytes of data.
       
      --- master.com ping statistics ---
      27255 packets transmitted, 27255 received, 0% packet loss, time 10000ms
      rtt min/avg/max/mdev = 0.249/0.328/3.132/0.065 ms, ipg/ewma 0.366/0.313 ms
      

      Is this behavior intentional? What can be done to debug the issue and eliminate the replication lag?

      In terms of system logs.

      Master for today:

      -- Logs begin at Thu 2022-04-14 15:13:22 UTC, end at Mon 2022-11-28 10:03:49 UTC. --
      Nov 28 00:00:10 master.com mysqld[808]: 2022-11-28  0:00:10 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 5199ms. The settings might not be optimal. (flushed=317 and evicted=0, during the time.)
      Nov 28 00:00:24 master.com mysqld[808]: 2022-11-28  0:00:24 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 4348ms. The settings might not be optimal. (flushed=400 and evicted=0, during the time.)
      Nov 28 04:33:31 master.com mysqld[808]: 2022-11-28  4:33:31 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 4017ms. The settings might not be optimal. (flushed=401 and evicted=0, during the time.)
      

      Slave for today: none. No logs at all. Running perfectly.

      Attachments

        Activity

          People

            Unassigned Unassigned
            birdie Artem S. Tashkinov
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.