[MDEV-32953] main.rpl_mysqldump_slave Fails with "Master binlog wasn't deleted" Assertion Created: 2023-12-05  Updated: 2023-12-12  Resolved: 2023-12-07

Status: Closed
Project: MariaDB Server
Component/s: Replication, Tests
Affects Version/s: 10.4
Fix Version/s: 10.4.33, 10.5.24, 10.6.17, 10.11.7, 11.0.5, 11.1.4, 11.2.3

Type: Bug Priority: Critical
Reporter: Brandon Nesterenko Assignee: Brandon Nesterenko
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Issue split
split to MDEV-32969 Mariadb-dump --delete-master-logs May... Open
Problem/Incident
is caused by MDEV-32611 Test suite is missing dump option del... Closed
Relates
relates to MDEV-20219 PURGE BINARY LOGS does not always purge Stalled

 Description   

After MDEV-32611, main.rpl_mysqldump_slave sporadically fails with the following:

main.rpl_mysqldump_slave 'row'           w5 [ fail ]
        Test ended at 2023-12-05 18:06:55
 
CURRENT_TEST: main.rpl_mysqldump_slave
mysqldump: Got error: 1049: "Unknown database 'no_such_db'" when selecting the database
mysqltest: At line 125: Master binlog wasn't deleted after mariadb-dump with --delete-master-logs.
 
The result from queries just before the failure was: 
< snip >
-- CHANGE MASTER TO MASTER_LOG_FILE='slave-bin.000001', MASTER_LOG_POS=BINLOG_START;
-- CHANGE MASTER TO MASTER_USE_GTID=slave_pos;
-- SET GLOBAL gtid_slave_pos='0-2-1003';
 
3. --master-data --single-transaction
 
-- CHANGE MASTER TO MASTER_LOG_FILE='slave-bin.000001', MASTER_LOG_POS=BINLOG_START;
CHANGE MASTER TO MASTER_USE_GTID=slave_pos;
SET GLOBAL gtid_slave_pos='0-2-1003';
connection master;
CREATE TABLE t (
id int
);
insert into t values (1); 
insert into t values (2); 
drop table t;
-- CHANGE MASTER TO MASTER_LOG_FILE='master-bin.000002', MASTER_LOG_POS=BINLOG_START;
-- SET GLOBAL gtid_slave_pos='0-1-1005';
# postdump_first_binary_log_filename: master-bin.000001
# postdump_binlog_filename: master-bin.000002
 
More results from queries before failure can be found in /home/buildbot/amd64-debian-10/build/mysql-test/var/5/log/rpl_mysqldump_slave.log

The problem is that --delete-master-logs immediately purges logs after flushing, and the active binlog dump thread can still be using the old log when the purge executes, disallowing the file from being deleted. That is because there is validation logic in the purge code to ensure that nobody is actively using that log before it deletes it:

static my_bool log_in_use_callback(THD *thd, const char *log_name)
{
  my_bool result= 0;
  mysql_mutex_lock(&thd->LOCK_thd_data);
  if (auto linfo= thd->current_linfo)
    result= !strcmp(log_name, linfo->log_file_name);
  mysql_mutex_unlock(&thd->LOCK_thd_data);
  return result;
}
 
 
bool log_in_use(const char* log_name)
{
  return server_threads.iterate(log_in_use_callback, log_name);
}

If a user is using mysqldump --delete-master-logs with active binlog dump thread connections, they can run into this with their workaround options being
1) Re-run PURGE BINARY LOGS once the replicas have caught up
2) Temporarily disconnect replicas from the primary while running the dump (though this option requires master_use_gtid=Slave_pos)



 Comments   
Comment by Yuchen Pei [ 2023-12-07 ]

Looks pretty frequent now, failing at 5 builders at say 13dd787530863954fcd17fbaf83d46e7170667f6. Shall we raise the prio to critical?

Comment by Brandon Nesterenko [ 2023-12-07 ]

Hi Andrei!

This is ready for review: PR 2900

The underlying cause of the test failure may need additional discussion, as I'm not sure if it is a bug or a feature. But the proposed fix in PR 2900 at least institutes a work-around in the test in the mean-time.

Comment by Brandon Nesterenko [ 2023-12-07 ]

Fixed and pushed into 10.4 as 9be7e03f70e

Generated at Thu Feb 08 10:35:16 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.