[MDEV-8134] The relay-log is not flushed after the slave-relay-log.999999 showed Created: 2015-05-11  Updated: 2023-10-04  Resolved: 2021-01-21

Status: Closed
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.0.14, 10.0, 10.1
Fix Version/s: 10.2.37, 10.3.28, 10.4.18, 10.5.9

Type: Bug Priority: Critical
Reporter: ilhwan Kim Assignee: Sujatha Sivakumar (Inactive)
Resolution: Fixed Votes: 5
Labels: contribution, slave-relay-log
Environment:

Centos 6.5 64bit


Attachments: File 999999.diff    

 Description   

I am using MariaDB 10.0.14 at CentOS 6.5.
And, as I mentioned at the title of this question, the relay-log is not
flushed after the slave-relay-log.999999 showed when using
"Salve_parallel_threads:10" setting. like showed blow.

  • binlog_format: ROW
  • Slave_parallel_threads:10

Everything are working fine except the "slave-relay-log.******" files
continue to exist at the disk which will finally cause the disk full.
If I change the value of Slave_parallel_threads setting from 10 to 0,
the log will be flushed. Howevery "PK duplicate warning" error logs come
next.

Is there any setting should be used with Slave_parallel_threads setting?
Any help will be great.

Best regareds,



 Comments   
Comment by Elena Stepanova [ 2015-06-05 ]

Hi,

Could you please clarify: by "not flushed" do you mean that the logs are not closed/re-opened? Or that they do not get rotated? Or that they do not get purged?

So far, given that the eventual problem is "disk full", I assume the latter, that old relay logs do not get purged automatically. However, I cannot reproduce it, seems to be working fine for me:

# Relay log files, after restart
slave-relay-bin.999998
slave-relay-bin.999999
slave-relay-bin.index
#
# Contents of relay-log.info:
./slave-relay-bin.999999
536
master-bin.000001
313
#
# Contents of relay index file:
./slave-relay-bin.999998
./slave-relay-bin.999999
#-----------------------------------------
flush relay logs;
#-----------------------------------------
# Relay log files
slave-relay-bin.1000000
slave-relay-bin.999999
slave-relay-bin.index
#
# Contents of relay-log.info:
./slave-relay-bin.1000000
492
master-bin.000001
313
#
# Contents of relay index file:
./slave-relay-bin.999999
./slave-relay-bin.1000000
#-----------------------------------------
flush relay logs;
#-----------------------------------------
# Relay log files
slave-relay-bin.1000000
slave-relay-bin.1000001
slave-relay-bin.index
#
# Contents of relay-log.info:
./slave-relay-bin.1000001
492
master-bin.000001
313
#
# Contents of relay index file:
./slave-relay-bin.1000000
./slave-relay-bin.1000001
#-----------------------------------------
flush relay logs;
#-----------------------------------------
# Relay log files
slave-relay-bin.1000001
slave-relay-bin.1000002
slave-relay-bin.index
#
# Contents of relay-log.info:
./slave-relay-bin.1000002
492
master-bin.000001
313
#
# Contents of relay index file:
./slave-relay-bin.1000001
./slave-relay-bin.1000002
 
select @@slave_parallel_threads, @@version;
@@slave_parallel_threads	@@version
10	10.0.14-MariaDB-log

Files won't get purged if replication is stopped or aborted, but since you are saying that everything else works fine, I assume it is not the case.

Can you somehow demonstrate the problem? E.g. list the contents of the datadir, then run flush logs on the slave or or on the master, wait a bit, list the contents of the datadir again, the contents of relay-log.info file, and show slave status output?

Also, if you are using multi-source replication or some other non-standard replication setup, please do say so. And please attach your cnf file(s).

Thanks.

Comment by ilhwan Kim [ 2015-06-08 ]

Hi,

Our Replication configuration : master 1node , slave 2node ( slave2 - master - slave1 )

1. slave1 server status

# Slave_parallel_threads = 10

# show slave status\G
 
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 10.0.2.120
                  Master_User: repluser
                  Master_Port: 3306
                Connect_Retry: 10
              Master_Log_File: server-bin.000003
          Read_Master_Log_Pos: 327
               Relay_Log_File: slave1-relay-log.999999
                Relay_Log_Pos: 7979
        Relay_Master_Log_File: server-bin.000003
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 327
              Relay_Log_Space: 8161616
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 1
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
                   Using_Gtid: No
                  Gtid_IO_Pos: 
1 row in set (0.00 sec)
 
# Contents of slave1-relay-log.index
/Maria_DBLog/slave1-relay-log.999999
/Maria_DBLog/slave1-relay-log.1000000
/Maria_DBLog/slave1-relay-log.1000001
....
/Maria_DBLog/slave1-relay-log.1000937
/Maria_DBLog/slave1-relay-log.1000938
/Maria_DBLog/slave1-relay-log.1000939
# -----------------------------------------------------------
 
# Contents of slave1-relay-log.info
/Maria_DBLog/slave1-relay-log.999999
7979
server-bin.000002
327
686
Maria_DBLog/slave1-relay-log.999999
7979
server-bin.000001
499387
# -----------------------------------------------------------

2. slave2 server status

# Slave_parallel_threads = 10 -> Same as slave1

# Stop slave; Slave_paralle_threads = 0 ; start slave
-> change to slave2-relay-log.1000936 after purge slave2-relay-log.999999

# show slave status\G
 
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 10.0.2.120
                  Master_User: repluser
                  Master_Port: 3306
                Connect_Retry: 10
              Master_Log_File: server-bin.000003
          Read_Master_Log_Pos: 327
               Relay_Log_File: slave2-relay-log.1000936
                Relay_Log_Pos: 536
        Relay_Master_Log_File: server-bin.000003
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 327
              Relay_Log_Space: 1202
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 1
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
                   Using_Gtid: No
                  Gtid_IO_Pos: 
1 row in set (0.00 sec)
 
# Contents of slave2-relay-log.index
/Maria_DBLog/slave2-relay-log.1000935
/Maria_DBLog/slave2-relay-log.1000936
# -----------------------------------------------------------
 
# Contents of slave2-relay-log.info
/Maria_DBLog/slave2-relay-log.1000936
536
server-bin.000003
327
686
 
Maria_DBLog/slave2-relay-log.999999
7979
server-bin.000001
7691686
/Maria_DBLog/slave2-relay-log.999999
7979
server-bin.000001
7691686
# -----------------------------------------------------------
 

master , slave1, slave2 -> data is same

# -----------------------------------------------------------

slave1 my.cnf
# -----------------------------------------------------------------------------
server-id = 2
report-host = dev-dbslave1

max-relay-log-size = 10K
master-info-file = /Maria_DBLog/master.info
relay-log = /Maria_DBLog/slave1-relay-log.bin
relay-log-info-file = /Maria_DBLog/slave1-relay-log.info
relay-log-index = /Maria_DBLog/slave1-relay-log.index
slave-skip-errors = all
low-priority-updates

slave_parallel_threads = 10
slave_parallel_max_queued = 524288
# -----------------------------------------------------------------------------

Thanks.

Comment by Elena Stepanova [ 2015-06-08 ]

Thanks.
Apparently, my previous test with flush relay logs was irrelevant.
Here is another one (it's a crude one just to demonstrate the problem, not to be included into the regression suite):

Test case

# IMPORTANT: Run with --mysqld=--slave-parallel-threads=1 --mysqld=--max-relay-log-size=100K
 
--source include/master-slave.inc
--source include/have_binlog_format_row.inc
 
--enable_connect_log
 
--connection master
create table t1 (i int, c varchar(1024));
flush binary logs;
flush binary logs;
 
--sync_slave_with_master
 
--echo # 
--echo # Stop slave server
--echo # 
 
--let $datadir = `select @@datadir`
 
--exec echo "wait" > $MYSQLTEST_VARDIR/tmp/mysqld.2.expect
--shutdown_server 10
--source include/wait_until_disconnected.inc
 
--echo # 
--echo # Cheat: get close to 999999
--echo # by renaming relay logs and modifying index/info files
--echo # 
 
--exec sed -i "s/slave-relay-bin.00000/slave-relay-bin.99999/g" $datadir/relay-log.info
--exec sed -i "s/slave-relay-bin.00000/slave-relay-bin.99999/g" $datadir/slave-relay-bin.index
--disable_abort_on_error
--move_file $datadir/slave-relay-bin.000001 $datadir/slave-relay-bin.999991
--move_file $datadir/slave-relay-bin.000002 $datadir/slave-relay-bin.999992
--move_file $datadir/slave-relay-bin.000003 $datadir/slave-relay-bin.999993
--move_file $datadir/slave-relay-bin.000004 $datadir/slave-relay-bin.999994
--move_file $datadir/slave-relay-bin.000005 $datadir/slave-relay-bin.999995
--move_file $datadir/slave-relay-bin.000006 $datadir/slave-relay-bin.999996
--enable_abort_on_error
 
--echo # 
--echo # Restart slave server
--echo # 
 
--exec echo "restart" > $MYSQLTEST_VARDIR/tmp/mysqld.2.expect
--enable_reconnect
--source include/wait_until_connected_again.inc
--source include/start_slave.inc
 
--echo #-----------------------------------------
--echo # Relay log files after restart:
--list_files $datadir slave-relay-bin.*
--echo #
--echo # Contents of relay-log.info:
--cat_file $datadir/relay-log.info
--echo #
--echo # Contents of relay index file:
--cat_file $datadir/slave-relay-bin.index
--echo #-----------------------------------------
 
--connection master
--echo #
--echo # Insert some data to generate enough amount of binary logs
--echo #
--let $count = 1000
--disable_query_log
while ($count)
{
	eval insert into t1 values (1001 - $count, repeat('a',1000));
	dec $count;
}
--enable_query_log
 
--sync_slave_with_master
 
--echo #-----------------------------------------
--echo # Relay log files:
--list_files $datadir slave-relay-bin.*
--echo #
--echo # Contents of relay-log.info:
--cat_file $datadir/relay-log.info
--echo #
--echo # Contents of relay index file:
--cat_file $datadir/slave-relay-bin.index
--echo #-----------------------------------------
 
--let $status_items=Master_Log_File,Relay_Log_File,Relay_Log_Pos,Relay_Master_Log_File,Exec_Master_Log_Pos,Seconds_Behind_Master,Slave_IO_Running,Slave_SQL_Running
--source include/show_slave_status.inc

Actual result

...
# Insert some data to generate enough amount of binary logs
#
connection slave;
#-----------------------------------------
# Relay log files:
slave-relay-bin.1000000
slave-relay-bin.1000001
slave-relay-bin.1000002
slave-relay-bin.1000003
slave-relay-bin.1000004
slave-relay-bin.1000005
slave-relay-bin.1000006
slave-relay-bin.1000007
slave-relay-bin.1000008
slave-relay-bin.1000009
slave-relay-bin.999999
slave-relay-bin.index
#
# Contents of relay-log.info:
./slave-relay-bin.999999
101456
master-bin.000003
1187367
#
# Contents of relay index file:
./slave-relay-bin.999999
./slave-relay-bin.1000000
./slave-relay-bin.1000001
./slave-relay-bin.1000002
./slave-relay-bin.1000003
./slave-relay-bin.1000004
./slave-relay-bin.1000005
./slave-relay-bin.1000006
./slave-relay-bin.1000007
./slave-relay-bin.1000008
./slave-relay-bin.1000009
#-----------------------------------------
Master_Log_File = 'master-bin.000003'
Relay_Log_File = 'slave-relay-bin.999999'
Relay_Log_Pos = '101456'
Relay_Master_Log_File = 'master-bin.000003'
Exec_Master_Log_Pos = '1187367'
Seconds_Behind_Master = '0'
Slave_IO_Running = 'Yes'
Slave_SQL_Running = 'Yes'

Comment by jasung gu [ 2015-06-09 ]

I have same problem.

How I can solve this problem?

Comment by ilhwan Kim [ 2016-06-29 ]

mariadb 10.0.26 version same result .. ㅜㅜ

Comment by David Zhao [ 2019-04-25 ]

I also came into this issue a few days ago, and I have located this bug and fixed it, I've attached my patch here.

The problem is that in Relay_log_info::inc_group_relay_log_pos() function, they way you compare two log name is via strcmp() function, this is fine when log name sequence number are both of the same digits( 6 digits). but when the number goes to 7 digits, a 99999 compares greater than 1000000, which is wrong, hence the bug.

Besides this bug, I also located another bug caused by the same mistake in handle_queued_pos_update(), which could cause parallel replication issues when log name sequence number goes to 7 digits.

999999.diff

Comment by Maria M Pflaum [ 2021-01-06 ]

julien.fritsch It turns out I don't need a custom build for the customer. Sorry for the confusion. I just need to let them know when it's available in the next 10.2.x CS release.

Comment by Sujatha Sivakumar (Inactive) [ 2021-01-12 ]

Hello Andrei,

Can you please review fix for MDEV-8134.

Patch: https://github.com/MariaDB/server/commit/ffbb7348f67fdb78bb6c19cbf3d8d890e9ea29d7

BuildBot: bb-10.2-sujatha

Regarding usage of 'binlog_id' for file name comparison, this variable is dedicated for binary logs. Its usage is tightly coupled with binlog checkpoint. This variable monotonically increases during server runtime, starting from 1. If binary logs are cleared by using RESET MASTER the binlog_id is not reset. Only upon server restart it is reset to 1. Hence we cannot use this variable for file_name comparison.

Comment by Sachin Setiya (Inactive) [ 2021-01-20 ]

Okay to push

Comment by Sujatha Sivakumar (Inactive) [ 2021-01-21 ]

Fix is implemented in 10.2.37.

Patch has been cherry-picked to higher versions and tested.
No major merge conflicts were observed.

10.3: https://github.com/MariaDB/server/commit/811dac176c10973d90f110480d2d353c452d78b9
10.4: https://github.com/MariaDB/server/commit/0f2e60f07763de230baa057818c161ce3d59b994
10.5: https://github.com/MariaDB/server/commit/3aeb78968f662629abd467773f7924fd66772022
10.6: https://github.com/MariaDB/server/commit/822e224170f99478b09ddfe03d316fee8b36a3ec

Generated at Thu Feb 08 07:24:53 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.