Details
-
Bug
-
Status: Closed (View Workflow)
-
Critical
-
Resolution: Fixed
-
10.0.14, 10.0(EOL), 10.1(EOL)
-
Centos 6.5 64bit
Description
I am using MariaDB 10.0.14 at CentOS 6.5.
And, as I mentioned at the title of this question, the relay-log is not
flushed after the slave-relay-log.999999 showed when using
"Salve_parallel_threads:10" setting. like showed blow.
- binlog_format: ROW
- Slave_parallel_threads:10
Everything are working fine except the "slave-relay-log.******" files
continue to exist at the disk which will finally cause the disk full.
If I change the value of Slave_parallel_threads setting from 10 to 0,
the log will be flushed. Howevery "PK duplicate warning" error logs come
next.
Is there any setting should be used with Slave_parallel_threads setting?
Any help will be great.
Best regareds,
Attachments
Issue Links
Activity
Hi,
Our Replication configuration : master 1node , slave 2node ( slave2 - master - slave1 )
1. slave1 server status
# Slave_parallel_threads = 10
# show slave status\G
|
|
*************************** 1. row ***************************
|
Slave_IO_State: Waiting for master to send event
|
Master_Host: 10.0.2.120
|
Master_User: repluser
|
Master_Port: 3306
|
Connect_Retry: 10
|
Master_Log_File: server-bin.000003
|
Read_Master_Log_Pos: 327
|
Relay_Log_File: slave1-relay-log.999999
|
Relay_Log_Pos: 7979
|
Relay_Master_Log_File: server-bin.000003
|
Slave_IO_Running: Yes
|
Slave_SQL_Running: Yes
|
Replicate_Do_DB:
|
Replicate_Ignore_DB:
|
Replicate_Do_Table:
|
Replicate_Ignore_Table:
|
Replicate_Wild_Do_Table:
|
Replicate_Wild_Ignore_Table:
|
Last_Errno: 0
|
Last_Error:
|
Skip_Counter: 0
|
Exec_Master_Log_Pos: 327
|
Relay_Log_Space: 8161616
|
Until_Condition: None
|
Until_Log_File:
|
Until_Log_Pos: 0
|
Master_SSL_Allowed: No
|
Master_SSL_CA_File:
|
Master_SSL_CA_Path:
|
Master_SSL_Cert:
|
Master_SSL_Cipher:
|
Master_SSL_Key:
|
Seconds_Behind_Master: 0
|
Master_SSL_Verify_Server_Cert: No
|
Last_IO_Errno: 0
|
Last_IO_Error:
|
Last_SQL_Errno: 0
|
Last_SQL_Error:
|
Replicate_Ignore_Server_Ids:
|
Master_Server_Id: 1
|
Master_SSL_Crl:
|
Master_SSL_Crlpath:
|
Using_Gtid: No
|
Gtid_IO_Pos:
|
1 row in set (0.00 sec)
|
|
# Contents of slave1-relay-log.index
|
/Maria_DBLog/slave1-relay-log.999999
|
/Maria_DBLog/slave1-relay-log.1000000
|
/Maria_DBLog/slave1-relay-log.1000001
|
....
|
/Maria_DBLog/slave1-relay-log.1000937
|
/Maria_DBLog/slave1-relay-log.1000938
|
/Maria_DBLog/slave1-relay-log.1000939
|
# -----------------------------------------------------------
|
|
# Contents of slave1-relay-log.info
|
/Maria_DBLog/slave1-relay-log.999999
|
7979
|
server-bin.000002
|
327
|
686
|
Maria_DBLog/slave1-relay-log.999999
|
7979
|
server-bin.000001
|
499387
|
# -----------------------------------------------------------
|
2. slave2 server status
# Slave_parallel_threads = 10 -> Same as slave1
# Stop slave; Slave_paralle_threads = 0 ; start slave
-> change to slave2-relay-log.1000936 after purge slave2-relay-log.999999
# show slave status\G
|
|
*************************** 1. row ***************************
|
Slave_IO_State: Waiting for master to send event
|
Master_Host: 10.0.2.120
|
Master_User: repluser
|
Master_Port: 3306
|
Connect_Retry: 10
|
Master_Log_File: server-bin.000003
|
Read_Master_Log_Pos: 327
|
Relay_Log_File: slave2-relay-log.1000936
|
Relay_Log_Pos: 536
|
Relay_Master_Log_File: server-bin.000003
|
Slave_IO_Running: Yes
|
Slave_SQL_Running: Yes
|
Replicate_Do_DB:
|
Replicate_Ignore_DB:
|
Replicate_Do_Table:
|
Replicate_Ignore_Table:
|
Replicate_Wild_Do_Table:
|
Replicate_Wild_Ignore_Table:
|
Last_Errno: 0
|
Last_Error:
|
Skip_Counter: 0
|
Exec_Master_Log_Pos: 327
|
Relay_Log_Space: 1202
|
Until_Condition: None
|
Until_Log_File:
|
Until_Log_Pos: 0
|
Master_SSL_Allowed: No
|
Master_SSL_CA_File:
|
Master_SSL_CA_Path:
|
Master_SSL_Cert:
|
Master_SSL_Cipher:
|
Master_SSL_Key:
|
Seconds_Behind_Master: 0
|
Master_SSL_Verify_Server_Cert: No
|
Last_IO_Errno: 0
|
Last_IO_Error:
|
Last_SQL_Errno: 0
|
Last_SQL_Error:
|
Replicate_Ignore_Server_Ids:
|
Master_Server_Id: 1
|
Master_SSL_Crl:
|
Master_SSL_Crlpath:
|
Using_Gtid: No
|
Gtid_IO_Pos:
|
1 row in set (0.00 sec)
|
|
# Contents of slave2-relay-log.index
|
/Maria_DBLog/slave2-relay-log.1000935
|
/Maria_DBLog/slave2-relay-log.1000936
|
# -----------------------------------------------------------
|
|
# Contents of slave2-relay-log.info
|
/Maria_DBLog/slave2-relay-log.1000936
|
536
|
server-bin.000003
|
327
|
686
|
|
Maria_DBLog/slave2-relay-log.999999
|
7979
|
server-bin.000001
|
7691686
|
/Maria_DBLog/slave2-relay-log.999999
|
7979
|
server-bin.000001
|
7691686
|
# -----------------------------------------------------------
|
|
master , slave1, slave2 -> data is same
# -----------------------------------------------------------
slave1 my.cnf
# -----------------------------------------------------------------------------
server-id = 2
report-host = dev-dbslave1
max-relay-log-size = 10K
master-info-file = /Maria_DBLog/master.info
relay-log = /Maria_DBLog/slave1-relay-log.bin
relay-log-info-file = /Maria_DBLog/slave1-relay-log.info
relay-log-index = /Maria_DBLog/slave1-relay-log.index
slave-skip-errors = all
low-priority-updates
slave_parallel_threads = 10
slave_parallel_max_queued = 524288
# -----------------------------------------------------------------------------
Thanks.
Thanks.
Apparently, my previous test with flush relay logs was irrelevant.
Here is another one (it's a crude one just to demonstrate the problem, not to be included into the regression suite):
Test case |
# IMPORTANT: Run with --mysqld=--slave-parallel-threads=1 --mysqld=--max-relay-log-size=100K
|
|
--source include/master-slave.inc
|
--source include/have_binlog_format_row.inc
|
|
--enable_connect_log
|
|
--connection master
|
create table t1 (i int, c varchar(1024));
|
flush binary logs;
|
flush binary logs;
|
|
--sync_slave_with_master
|
|
--echo #
|
--echo # Stop slave server
|
--echo #
|
|
--let $datadir = `select @@datadir`
|
|
--exec echo "wait" > $MYSQLTEST_VARDIR/tmp/mysqld.2.expect
|
--shutdown_server 10
|
--source include/wait_until_disconnected.inc
|
|
--echo #
|
--echo # Cheat: get close to 999999
|
--echo # by renaming relay logs and modifying index/info files
|
--echo #
|
|
--exec sed -i "s/slave-relay-bin.00000/slave-relay-bin.99999/g" $datadir/relay-log.info
|
--exec sed -i "s/slave-relay-bin.00000/slave-relay-bin.99999/g" $datadir/slave-relay-bin.index
|
--disable_abort_on_error
|
--move_file $datadir/slave-relay-bin.000001 $datadir/slave-relay-bin.999991
|
--move_file $datadir/slave-relay-bin.000002 $datadir/slave-relay-bin.999992
|
--move_file $datadir/slave-relay-bin.000003 $datadir/slave-relay-bin.999993
|
--move_file $datadir/slave-relay-bin.000004 $datadir/slave-relay-bin.999994
|
--move_file $datadir/slave-relay-bin.000005 $datadir/slave-relay-bin.999995
|
--move_file $datadir/slave-relay-bin.000006 $datadir/slave-relay-bin.999996
|
--enable_abort_on_error
|
|
--echo #
|
--echo # Restart slave server
|
--echo #
|
|
--exec echo "restart" > $MYSQLTEST_VARDIR/tmp/mysqld.2.expect
|
--enable_reconnect
|
--source include/wait_until_connected_again.inc
|
--source include/start_slave.inc
|
|
--echo #-----------------------------------------
|
--echo # Relay log files after restart:
|
--list_files $datadir slave-relay-bin.*
|
--echo #
|
--echo # Contents of relay-log.info:
|
--cat_file $datadir/relay-log.info
|
--echo #
|
--echo # Contents of relay index file:
|
--cat_file $datadir/slave-relay-bin.index
|
--echo #-----------------------------------------
|
|
--connection master
|
--echo #
|
--echo # Insert some data to generate enough amount of binary logs
|
--echo #
|
--let $count = 1000
|
--disable_query_log
|
while ($count)
|
{
|
eval insert into t1 values (1001 - $count, repeat('a',1000));
|
dec $count;
|
}
|
--enable_query_log
|
|
--sync_slave_with_master
|
|
--echo #-----------------------------------------
|
--echo # Relay log files:
|
--list_files $datadir slave-relay-bin.*
|
--echo #
|
--echo # Contents of relay-log.info:
|
--cat_file $datadir/relay-log.info
|
--echo #
|
--echo # Contents of relay index file:
|
--cat_file $datadir/slave-relay-bin.index
|
--echo #-----------------------------------------
|
|
--let $status_items=Master_Log_File,Relay_Log_File,Relay_Log_Pos,Relay_Master_Log_File,Exec_Master_Log_Pos,Seconds_Behind_Master,Slave_IO_Running,Slave_SQL_Running
|
--source include/show_slave_status.inc
|
Actual result |
...
|
# Insert some data to generate enough amount of binary logs
|
#
|
connection slave;
|
#-----------------------------------------
|
# Relay log files:
|
slave-relay-bin.1000000
|
slave-relay-bin.1000001
|
slave-relay-bin.1000002
|
slave-relay-bin.1000003
|
slave-relay-bin.1000004
|
slave-relay-bin.1000005
|
slave-relay-bin.1000006
|
slave-relay-bin.1000007
|
slave-relay-bin.1000008
|
slave-relay-bin.1000009
|
slave-relay-bin.999999
|
slave-relay-bin.index
|
#
|
# Contents of relay-log.info:
|
./slave-relay-bin.999999
|
101456
|
master-bin.000003
|
1187367
|
#
|
# Contents of relay index file:
|
./slave-relay-bin.999999
|
./slave-relay-bin.1000000
|
./slave-relay-bin.1000001
|
./slave-relay-bin.1000002
|
./slave-relay-bin.1000003
|
./slave-relay-bin.1000004
|
./slave-relay-bin.1000005
|
./slave-relay-bin.1000006
|
./slave-relay-bin.1000007
|
./slave-relay-bin.1000008
|
./slave-relay-bin.1000009
|
#-----------------------------------------
|
Master_Log_File = 'master-bin.000003'
|
Relay_Log_File = 'slave-relay-bin.999999'
|
Relay_Log_Pos = '101456'
|
Relay_Master_Log_File = 'master-bin.000003'
|
Exec_Master_Log_Pos = '1187367'
|
Seconds_Behind_Master = '0'
|
Slave_IO_Running = 'Yes'
|
Slave_SQL_Running = 'Yes'
|
I also came into this issue a few days ago, and I have located this bug and fixed it, I've attached my patch here.
The problem is that in Relay_log_info::inc_group_relay_log_pos() function, they way you compare two log name is via strcmp() function, this is fine when log name sequence number are both of the same digits( 6 digits). but when the number goes to 7 digits, a 99999 compares greater than 1000000, which is wrong, hence the bug.
Besides this bug, I also located another bug caused by the same mistake in handle_queued_pos_update(), which could cause parallel replication issues when log name sequence number goes to 7 digits.
julien.fritsch It turns out I don't need a custom build for the customer. Sorry for the confusion. I just need to let them know when it's available in the next 10.2.x CS release.
Hello Andrei,
Can you please review fix for MDEV-8134.
Patch: https://github.com/MariaDB/server/commit/ffbb7348f67fdb78bb6c19cbf3d8d890e9ea29d7
BuildBot: bb-10.2-sujatha
Regarding usage of 'binlog_id' for file name comparison, this variable is dedicated for binary logs. Its usage is tightly coupled with binlog checkpoint. This variable monotonically increases during server runtime, starting from 1. If binary logs are cleared by using RESET MASTER the binlog_id is not reset. Only upon server restart it is reset to 1. Hence we cannot use this variable for file_name comparison.
Fix is implemented in 10.2.37.
Patch has been cherry-picked to higher versions and tested.
No major merge conflicts were observed.
10.3: https://github.com/MariaDB/server/commit/811dac176c10973d90f110480d2d353c452d78b9
10.4: https://github.com/MariaDB/server/commit/0f2e60f07763de230baa057818c161ce3d59b994
10.5: https://github.com/MariaDB/server/commit/3aeb78968f662629abd467773f7924fd66772022
10.6: https://github.com/MariaDB/server/commit/822e224170f99478b09ddfe03d316fee8b36a3ec
Hi,
Could you please clarify: by "not flushed" do you mean that the logs are not closed/re-opened? Or that they do not get rotated? Or that they do not get purged?
So far, given that the eventual problem is "disk full", I assume the latter, that old relay logs do not get purged automatically. However, I cannot reproduce it, seems to be working fine for me:
# Relay log files, after restart
slave-relay-bin.999998
slave-relay-bin.999999
slave-relay-bin.index
#
# Contents of relay-log.info:
./slave-relay-bin.999999
536
master-bin.000001
313
#
# Contents of relay index file:
./slave-relay-bin.999998
./slave-relay-bin.999999
#-----------------------------------------
flush relay logs;
#-----------------------------------------
# Relay log files
slave-relay-bin.1000000
slave-relay-bin.999999
slave-relay-bin.index
#
# Contents of relay-log.info:
./slave-relay-bin.1000000
492
master-bin.000001
313
#
# Contents of relay index file:
./slave-relay-bin.999999
./slave-relay-bin.1000000
#-----------------------------------------
flush relay logs;
#-----------------------------------------
# Relay log files
slave-relay-bin.1000000
slave-relay-bin.1000001
slave-relay-bin.index
#
# Contents of relay-log.info:
./slave-relay-bin.1000001
492
master-bin.000001
313
#
# Contents of relay index file:
./slave-relay-bin.1000000
./slave-relay-bin.1000001
#-----------------------------------------
flush relay logs;
#-----------------------------------------
# Relay log files
slave-relay-bin.1000001
slave-relay-bin.1000002
slave-relay-bin.index
#
# Contents of relay-log.info:
./slave-relay-bin.1000002
492
master-bin.000001
313
#
# Contents of relay index file:
./slave-relay-bin.1000001
./slave-relay-bin.1000002
select @@slave_parallel_threads, @@version;
@@slave_parallel_threads @@version
10 10.0.14-MariaDB-log
Files won't get purged if replication is stopped or aborted, but since you are saying that everything else works fine, I assume it is not the case.
Can you somehow demonstrate the problem? E.g. list the contents of the datadir, then run flush logs on the slave or or on the master, wait a bit, list the contents of the datadir again, the contents of relay-log.info file, and show slave status output?
Also, if you are using multi-source replication or some other non-standard replication setup, please do say so. And please attach your cnf file(s).
Thanks.