[MDEV-30860] Race condition between buffer pool flush and log file deletion in mariadb-backup --prepare Created: 2023-03-16  Updated: 2023-03-16  Resolved: 2023-03-16

Status: Closed
Project: MariaDB Server
Component/s: Backup
Affects Version/s: 10.5, 10.6, 10.7, 10.8, 10.9, 10.10, 10.11, 11.0
Fix Version/s: 10.11.3, 11.0.2, 10.5.20, 10.6.13, 10.7.8, 10.8.8, 10.9.6, 10.10.4

Type: Bug Priority: Major
Reporter: Marko Mäkelä Assignee: Marko Mäkelä
Resolution: Fixed Votes: 0
Labels: crash, race

Issue Links:
Problem/Incident
is caused by MDEV-23855 InnoDB log checkpointing causes regre... Closed

 Description   

The test mariabackup.innodb_force_recovery occasionally failed in a 10.6 based development branch that contains a fix of MDEV-26827 and MDEV-26055. We have some rather rare failures of the test occurring on buildbot as well, like this:

10.10 4ce6e78059c1a91765c91afa5330737bd7568d0a

mariabackup.innodb_force_recovery 'innodb' w13 [ fail ]  timeout after 480 seconds
        Test ended at 2022-07-28 12:27:20
 
Test case timeout after 480 seconds
 
== /buildbot/amd64-ubuntu-2004-msan/build/mysql-test/var/13/log/innodb_force_recovery.log == 
CREATE TABLE t(i INT) ENGINE INNODB;
INSERT INTO t VALUES(1);
# "innodb_force_recovery=1" should be allowed with "--prepare" only (mariabackup)
FOUND 1 /should only be used with "--prepare"/ in backup.log
# "innodb_force_recovery=1" should be allowed with "--apply-log" only (innobackupex)
FOUND 1 /should only be used with "--apply-log"/ in backup.log
# "innodb_force_recovery" should be limited to "SRV_FORCE_IGNORE_CORRUPT" (mariabackup)
FOUND 1 /innodb_force_recovery = 1/ in backup.log
# "innodb_force_recovery" should be limited to "SRV_FORCE_IGNORE_CORRUPT" (innobackupex)
FOUND 1 /innodb_force_recovery = 1/ in backup.log
# "innodb_force_recovery" should be read from "backup-my.cnf" (mariabackup)
FOUND 1 /innodb_force_recovery = 1/ in backup.log
# "innodb_force_recovery=1" should be read from "backup-my.cnf" (innobackupex)
FOUND 1 /innodb_force_recovery = 1/ in backup.log
# "innodb_force_recovery" from the command line should override "backup-my.cnf" (mariabackup)
NOT FOUND /innodb_force_recovery = 1/ in backup.log
# "innodb_force_recovery" from the command line should override "backup-my.cnf" (innobackupex)
NOT FOUND /innodb_force_recovery = 1/ in backup.log
# shutdown server
 
 == /buildbot/amd64-ubuntu-2004-msan/build/mysql-test/var/13/tmp/analyze-timeout-mysqld.1.err ==
mysqltest: Could not open connection 'default' after 500 attempts: 2002 Can't connect to local server through socket '/buildbot/amd64-ubuntu-2004-msan/build/mysql-test/var/tmp/13/mys' (2)

This fix worked in the case that I was able to reproduce:

diff --git a/storage/innobase/srv/srv0start.cc b/storage/innobase/srv/srv0start.cc
index 5bf085079c8..4a01326047a 100644
--- a/storage/innobase/srv/srv0start.cc
+++ b/storage/innobase/srv/srv0start.cc
@@ -905,9 +905,7 @@ static lsn_t srv_prepare_to_delete_redo_log_file(bool old_exists)
 {
   DBUG_ENTER("srv_prepare_to_delete_redo_log_file");
 
-  /* Disable checkpoints in the page cleaner. */
-  ut_ad(!recv_sys.recovery_on);
-  recv_sys.recovery_on= true;
+  ut_ad(recv_sys.recovery_on);
 
   /* Clean the buffer pool. */
   buf_flush_sync();
@@ -1589,10 +1587,10 @@ dberr_t srv_start(bool create_new_db)
 			}
 		}
 
-		recv_sys.debug_free();
-
 		if (srv_operation == SRV_OPERATION_RESTORE
 		    || srv_operation == SRV_OPERATION_RESTORE_EXPORT) {
+			buf_flush_sync();
+			recv_sys.debug_free();
 			/* After applying the redo log from
 			SRV_OPERATION_BACKUP, flush the changes
 			to the data files and truncate or delete the log.
@@ -1684,6 +1682,8 @@ dberr_t srv_start(bool create_new_db)
 				return(srv_init_abort(err));
 			}
 		}
+
+		recv_sys.debug_free();
 	}
 
 	ut_ad(err == DB_SUCCESS);

The failure that I observed was caused by the fact that buf_flush_page_cleaner() took a little longer to write out the recovered modified pages, and this thread had already executed recv_sys.debug_free(), which causes the page cleaner to attempt to write checkpoint information. The failures that I observed were caused by the fact that this thread had already closed and deleted the ib_logfile0 when buf_flush_page_cleaner() unexpectedly invoked log_checkpoint_low().

This type of failure should be caused by MDEV-23855, which made log checkpoints the responsibility of the page cleaner thread.

Because I have no stack traces of the test case timeouts, I am unable say if the above patch would fix them.


Generated at Thu Feb 08 10:19:26 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.