[MDEV-39334] "Waiting for the slave SQL thread to free enough relay log space" Causes silent replication failure - Jira

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 10.11.14
Fix Version/s: None
Component/s: Replication
Labels:
- gtid_slave_pos
Environment:
OS: Debian 12

Bug Category:
Can result in unexpected behaviour

Description

In a MariaDB primary-replica setup, our replica server stopped receiving new data from the primary DB even though both the Slave_IO_Running and Slave_SQL_Running variables reported "Yes".

We noticed the issue when "SHOW SLAVE HOSTS;" on the primary reported only one replica instead of the expected two.

In an attempt to fix the issue, we ran the "STOP SLAVE" and "START SLAVE" commands on the replica. When we checked the replica status after the restart, it reported the error shown in the output below.

We believe this replica has been effectively stopped by this issue for multiple months. Because MariaDB never fully failed and never reported "Slave_IO_Running" or "Slave_SQL_Running" as "No", our monitoring software did not detect the problem.
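A check that looks only at Slave_IO_Running and Slave_SQL_Running would have passed for the whole period, and Seconds_Behind_Master also reads 0 in the first status output below. The following is a minimal sketch of a stricter check, assuming the monitoring tool can capture the vertical text output of SHOW SLAVE STATUS. The function names parse_slave_status and replication_problems are hypothetical and not part of MariaDB or any monitoring product; the field names and the state string are taken from the output in this report.

```python
# State string reported in Slave_IO_State while the I/O thread is blocked.
STALL_STATE = "Waiting for the slave SQL thread to free enough relay log space"


def parse_slave_status(text):
    """Parse the vertical output of SHOW SLAVE STATUS into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        # Skip blank lines, the row separator, and lines without "Key: value".
        if ":" not in line or line.lstrip().startswith("*"):
            continue
        # Split on the first colon only; error messages may contain colons.
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def replication_problems(fields):
    """Return a list of problems; an empty list means nothing was flagged."""
    problems = []
    for thread in ("Slave_IO_Running", "Slave_SQL_Running"):
        if fields.get(thread) != "Yes":
            problems.append(f"{thread} is {fields.get(thread)!r}")
    if fields.get("Slave_IO_State") == STALL_STATE:
        problems.append("I/O thread is blocked waiting for relay log space")
    if fields.get("Last_IO_Errno", "0") != "0":
        problems.append(f"Last_IO_Errno is {fields['Last_IO_Errno']}")
    return problems


# A few fields copied from the status output taken before the restart.
sample = """
                Slave_IO_State: Waiting for the slave SQL thread to free enough relay log space
              Slave_IO_Running: Yes
             Slave_SQL_Running: Yes
                 Last_IO_Errno: 0
"""
print(replication_problems(parse_slave_status(sample)))
```

With the sample fields, both threads report "Yes" and the only problem flagged is the blocked I/O thread. How long that state should persist before alerting is a judgment call this sketch does not make.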

We do not know which specific steps reproduce the issue.

MariaDB [(none)]> show slave hosts;
+-----------+----------------+------+-----------+
| Server_id | Host           | Port | Master_id |
+-----------+----------------+------+-----------+
|         2 | db2            | 3306 |         1 |
+-----------+----------------+------+-----------+
1 row in set (0.000 sec)

DB3 (Replica) "SHOW SLAVE STATUS;" before/after restart output:

MariaDB [(none)]> show slave status\G

*************************** 1. row ***************************

                Slave_IO_State: Waiting for the slave SQL thread to free enough relay log space

                   Master_Host: db1.domain.com

                   Master_User: rep_user

                   Master_Port: 3306

                 Connect_Retry: 60

               Master_Log_File: bin-log.000108

           Read_Master_Log_Pos: 169830505

                Relay_Log_File: bin-relay.000017

                 Relay_Log_Pos: 169830760

         Relay_Master_Log_File: bin-log.000108

              Slave_IO_Running: Yes

             Slave_SQL_Running: Yes

               Replicate_Do_DB:

           Replicate_Ignore_DB:

            Replicate_Do_Table:

        Replicate_Ignore_Table:

       Replicate_Wild_Do_Table:

   Replicate_Wild_Ignore_Table:

                    Last_Errno: 0

                    Last_Error:

                  Skip_Counter: 0

           Exec_Master_Log_Pos: 169830463

               Relay_Log_Space: 1077914767

               Until_Condition: None

                Until_Log_File:

                 Until_Log_Pos: 0

            Master_SSL_Allowed: No

            Master_SSL_CA_File:

            Master_SSL_CA_Path:

               Master_SSL_Cert:

             Master_SSL_Cipher:

                Master_SSL_Key:

         Seconds_Behind_Master: 0

 Master_SSL_Verify_Server_Cert: No

                 Last_IO_Errno: 0

                 Last_IO_Error:

                Last_SQL_Errno: 0

                Last_SQL_Error:

   Replicate_Ignore_Server_Ids:

              Master_Server_Id: 1

                Master_SSL_Crl:

            Master_SSL_Crlpath:

                    Using_Gtid: Slave_Pos

                   Gtid_IO_Pos: 0-1-424511

       Replicate_Do_Domain_Ids:

   Replicate_Ignore_Domain_Ids:

                 Parallel_Mode: optimistic

                     SQL_Delay: 0

           SQL_Remaining_Delay: NULL

       Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates

              Slave_DDL_Groups: 3441

Slave_Non_Transactional_Groups: 96

    Slave_Transactional_Groups: 105912

          Replicate_Rewrite_DB:

1 row in set (0.000 sec)

MariaDB [(none)]> stop slave;

Query OK, 0 rows affected (5 min 7.683 sec)

MariaDB [(none)]> start slave;

Query OK, 0 rows affected (0.059 sec)

MariaDB [(none)]> show slave status\G

*************************** 1. row ***************************

                Slave_IO_State:

                   Master_Host: db1.domain.com

                   Master_User: rep_user

                   Master_Port: 3306

                 Connect_Retry: 60

               Master_Log_File: bin-log.000108

           Read_Master_Log_Pos: 169830463

                Relay_Log_File: bin-relay.000001

                 Relay_Log_Pos: 4

         Relay_Master_Log_File: bin-log.000108

              Slave_IO_Running: No

             Slave_SQL_Running: Yes

               Replicate_Do_DB:

           Replicate_Ignore_DB:

            Replicate_Do_Table:

        Replicate_Ignore_Table:

       Replicate_Wild_Do_Table:

   Replicate_Wild_Ignore_Table:

                    Last_Errno: 0

                    Last_Error:

                  Skip_Counter: 0

           Exec_Master_Log_Pos: 169830463

               Relay_Log_Space: 296

               Until_Condition: None

                Until_Log_File:

                 Until_Log_Pos: 0

            Master_SSL_Allowed: No

            Master_SSL_CA_File:

            Master_SSL_CA_Path:

               Master_SSL_Cert:

             Master_SSL_Cipher:

                Master_SSL_Key:

         Seconds_Behind_Master: NULL

 Master_SSL_Verify_Server_Cert: No

                 Last_IO_Errno: 1236

                 Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Could not find GTID state requested by slave in any binlog files. Probably the slave state is too old and required binlog files have been purged.'

                Last_SQL_Errno: 0

                Last_SQL_Error:

   Replicate_Ignore_Server_Ids:

              Master_Server_Id: 1

                Master_SSL_Crl:

            Master_SSL_Crlpath:

                    Using_Gtid: Slave_Pos

                   Gtid_IO_Pos: 0-1-424511

       Replicate_Do_Domain_Ids:

   Replicate_Ignore_Domain_Ids:

                 Parallel_Mode: optimistic

                     SQL_Delay: 0

           SQL_Remaining_Delay: NULL

       Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates

              Slave_DDL_Groups: 3441

Slave_Non_Transactional_Groups: 96

    Slave_Transactional_Groups: 105912

          Replicate_Rewrite_DB:

1 row in set (0.000 sec)

MariaDB [(none)]>
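For reference, the figures in the first status output can be compared against the configured relay-log-space-limit of '1G' (see the DB3 configuration below). This is only a worked-arithmetic sketch: parse_size is a hypothetical helper, and it assumes the usual 1024-based meaning of the K/M/G suffixes in option values. It does not identify the root cause.

```python
# 1024-based multipliers for option-file size suffixes.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3}


def parse_size(value):
    """Convert an option value such as '1G' or '250M' to a number of bytes."""
    value = value.strip().strip("'\"")
    suffix = value[-1].upper()
    if suffix in UNITS:
        return int(value[:-1]) * UNITS[suffix]
    return int(value)


relay_log_space_limit = parse_size("1G")  # relay-log-space-limit from the DB3 config
relay_log_space = 1077914767              # Relay_Log_Space before the restart
read_pos = 169830505                      # Read_Master_Log_Pos before the restart
exec_pos = 169830463                      # Exec_Master_Log_Pos before the restart

print("limit in bytes:", relay_log_space_limit)
print("bytes over limit:", relay_log_space - relay_log_space_limit)
print("read minus exec:", read_pos - exec_pos)
```

Before the restart, Relay_Log_Space was above the 1G limit while Read_Master_Log_Pos and Exec_Master_Log_Pos, both in bin-log.000108, differed by only a few dozen bytes. That is consistent with the I/O thread waiting on space even though the SQL thread reported that it had read all relay log.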

db3.err, truncated for brevity:

2026-01-30 21:54:35 356 [ERROR] Error reading packet from server: Lost connection to server during query (server_errno=2013)

2026-01-30 21:54:35 356 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'bin-log.000100' at position 804997841; GTID position '0-1-424259', GTID event skip 48998

2026-01-30 21:54:36 356 [Note] Slave IO thread is reconnected to receive Gtid_log_event 0-1-424260. It is to skip 48998 already received events including the gtid one

2026-01-30 22:00:50 356 [ERROR] Error reading packet from server: Lost connection to server during query (server_errno=2013)

2026-01-30 22:00:50 356 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'bin-log.000104' at position 1077960884; GTID position '0-1-424421', GTID event skip 33015

2026-01-30 22:00:50 356 [Note] Slave IO thread is reconnected to receive Gtid_log_event 0-1-424422. It is to skip 33015 already received events including the gtid one

2026-01-30 22:27:50 357 [Note] Error reading relay log event: slave SQL thread was killed

2026-01-30 22:27:50 357 [Note] Slave SQL thread exiting, replication stopped in log 'bin-log.000104' at position 163705474; GTID position '0-1-424421', master: db1.domain.com:3306

2026-01-30 22:32:09 356 [ERROR] Slave I/O thread aborted while waiting for relay log space

2026-01-30 22:32:09 356 [Note] Slave I/O thread exiting, read up to log 'bin-log.000104', position 163705516; GTID position 0-1-424421, master db1.domain.com:3306

2026-01-30 22:32:15 5212 [Note] Slave I/O thread: Start asynchronous replication to master 'rep_user@db1.domain.com:3306' in log 'bin-log.000104' at position 163705474

2026-01-30 22:32:15 5213 [Note] Slave SQL thread initialized, starting replication in log 'bin-log.000104' at position 163705474, relay log '/db/mysql/log-relay/bin-relay.000001' position: 4; GTID position '0-1-424421'

2026-01-30 22:32:15 5212 [Note] Slave I/O thread: connected to master 'rep_user@db1.domain.com:3306',replication starts at GTID position '0-1-424421'

2026-01-30 22:36:22 5212 [ERROR] Error reading packet from server: Lost connection to server during query (server_errno=2013)

2026-01-30 22:36:22 5212 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'bin-log.000105' at position 145331453; GTID position '0-1-424433', GTID event skip 6627

2026-01-30 22:36:22 5212 [Note] Slave IO thread is reconnected to receive Gtid_log_event 0-1-424434. It is to skip 6627 already received events including the gtid one

2026-01-30 22:37:55 5212 [ERROR] Error reading packet from server: Lost connection to server during query (server_errno=2013)

2026-01-30 22:37:55 5212 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'bin-log.000108' at position 479305461; GTID position '0-1-424511', GTID event skip 9403

2026-01-30 22:37:55 5212 [Note] Slave IO thread is reconnected to receive Gtid_log_event 0-1-424512. It is to skip 9403 already received events including the gtid one

2026-01-30 22:44:21 5212 [ERROR] Error reading packet from server: Lost connection to server during query (server_errno=2013)

2026-01-30 22:44:21 5212 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'bin-log.000108' at position 1077911864; GTID position '0-1-424511', GTID event skip 27933

2026-01-30 22:44:21 5212 [Note] Slave IO thread is reconnected to receive Gtid_log_event 0-1-424512. It is to skip 27933 already received events including the gtid one

...

2026-04-14 14:09:45 5213 [Note] Error reading relay log event: slave SQL thread was killed

2026-04-14 14:09:45 5213 [Note] Slave SQL thread exiting, replication stopped in log 'bin-log.000108' at position 169830463; GTID position '0-1-424511', master: db1.domain.com:3306

2026-04-14 14:14:52 5212 [ERROR] Slave I/O thread aborted while waiting for relay log space

2026-04-14 14:14:52 5212 [Note] Slave I/O thread exiting, read up to log 'bin-log.000108', position 169830505; GTID position 0-1-424511, master db1.domain.com:3306

2026-04-14 14:15:01 29825 [Note] Slave I/O thread: Start asynchronous replication to master 'rep_user@db1.domain.com:3306' in log 'bin-log.000108' at position 169830463

2026-04-14 14:15:01 29826 [Note] Slave SQL thread initialized, starting replication in log 'bin-log.000108' at position 169830463, relay log '/db/mysql/log-relay/bin-relay.000001' position: 4; GTID position '0-1-424511'

2026-04-14 14:15:01 29825 [Note] Slave I/O thread: connected to master 'rep_user@db1.domain.com:3306',replication starts at GTID position '0-1-424511'

2026-04-14 14:15:01 29825 [ERROR] Error reading packet from server: Could not find GTID state requested by slave in any binlog files. Probably the slave state is too old and required binlog files have been purged. (server_errno=1236)

2026-04-14 14:15:01 29825 [ERROR] Slave I/O: Got fatal error 1236 from master when reading data from binary log: 'Could not find GTID state requested by slave in any binlog files. Probably the slave state is too old and required binlog files have been purged.', Internal MariaDB error code: 1236

2026-04-14 14:15:01 29825 [Note] Slave I/O thread exiting, read up to log 'bin-log.000108', position 169830463; GTID position 0-1-424511, master db1.domain.com:3306
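To estimate how long the replica sat at the same position, the GTID positions in the error log can be lined up against their timestamps. The sketch below is one hypothetical way to do that; gtid_timeline and longest_unchanged are illustrative names, and the regular expression assumes the timestamp and "GTID position" formats seen in the excerpt above. Because the excerpt is truncated, the result describes only the lines shown, not everything the server logged.

```python
import re
from datetime import datetime

# Timestamp at the start of the line, then a GTID position (quoted or not).
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) .*GTID position '?(\d+-\d+-\d+)'?"
)


def gtid_timeline(log_text):
    """Yield (timestamp, gtid_position) for each log line that reports one."""
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            yield stamp, match.group(2)


def longest_unchanged(timeline):
    """Return (gtid, first_seen, last_seen) for the position reported over
    the longest time span. Raises ValueError if the timeline is empty."""
    first_seen, last_seen = {}, {}
    for stamp, gtid in timeline:
        first_seen.setdefault(gtid, stamp)
        last_seen[gtid] = stamp
    gtid = max(first_seen, key=lambda g: last_seen[g] - first_seen[g])
    return gtid, first_seen[gtid], last_seen[gtid]


# Two lines copied verbatim from the db3.err excerpt above.
sample_log = """
2026-01-30 22:37:55 5212 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'bin-log.000108' at position 479305461; GTID position '0-1-424511', GTID event skip 9403
2026-04-14 14:14:52 5212 [Note] Slave I/O thread exiting, read up to log 'bin-log.000108', position 169830505; GTID position 0-1-424511, master db1.domain.com:3306
"""

gtid, first, last = longest_unchanged(gtid_timeline(sample_log))
print(gtid, "unchanged for", last - first)
```

On the two sample lines, the same GTID position is reported more than ten weeks apart, which matches our estimate that the replica was stalled for multiple months.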

MariaDB configuration file for DB3:

[mysqld]

datadir                          = '/db/mysql/data'

socket                           = '/db/mysql/mysql.sock'

tmpdir                           = '/db/mysql/tmp'

bind_address                     = '0.0.0.0'

port                             = '3306'

server-id                        = '3'

gtid_strict_mode                 = '1'

gtid_ignore_duplicates           = '1'

log_slave_updates                = '1'

relay-log-space-limit            = '1G'

relay_log_purge                  = '1'

read_only                        = '1'

skip-slave-start

relay-log                        = '/db/mysql/log-relay/bin-relay'

report_host                      = 'db3'

log-error

log-warnings

log_bin                          = '/db/mysql/log-bin/bin-log'

expire_logs_days                 = '14'

log_bin_trust_function_creators  = '1'

max_binlog_size                  = '250M'

binlog_format                    = 'ROW'

sync_binlog                      = '1'

sql_mode                         = ''

thread_cache_size                = '100'

max_connections                  = '1000'

tmp_table_size                   = '100M'

max_heap_table_size              = '100M'

max_allowed_packet               = '1G'

query_cache_size                 = '0'

query_cache_type                 = '0'

table_definition_cache           = '50000'

table_open_cache                 = '50000'

open_files_limit                 = '100000'

wait_timeout                     = '3600'

user                             = 'mysql'

innodb-buffer-pool-instances     = '4'

innodb_file_per_table            = '1'

innodb_data_home_dir             = '/db/mysql/innodb'

innodb_log_group_home_dir        = '/db/mysql/innodb'

innodb_buffer_pool_size          = '1G'

innodb_log_file_size             = '1G'

innodb_log_files_in_group        = '2'

innodb_thread_concurrency        = '8'

innodb_flush_method              = 'O_DIRECT'

innodb_flush_log_at_trx_commit   = '1'

innodb_io_capacity               = '2100'

innodb_open_files                = '50000'

concurrent_insert = 2

# Full Text Search

ft_min_word_len                  = '3'

ft_max_word_len                  = '35'

# Encryption

loose-innodb-encryption-threads        = '4'

loose-innodb-encryption-rotate-key-age = '1'

# UTF-8 Support

init_connect='SET collation_connection = utf8_unicode_ci'

init_connect='SET NAMES utf8'

character-set-server             = utf8

collation-server                 = utf8_unicode_ci

skip-character-set-client-handshake

skip-name-resolve

!include /etc/mysql/mariadb.conf.d/file_key_management.cnf

[client]

socket = '/db/mysql/mysql.sock'

socket = /db/mysql/mysql.sock

Issue Links

relates to

MDEV-38906 Do not resume IO Threads in the middle of an event group

Open