Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-30860

Race condition between buffer pool flush and log file deletion in mariadb-backup --prepare

    XMLWordPrintable

Details

    Description

      The test mariabackup.innodb_force_recovery occasionally failed in a 10.6 based development branch that contains a fix of MDEV-26827 and MDEV-26055. We have some rather rare failures of the test occurring on buildbot as well, like this:

      10.10 4ce6e78059c1a91765c91afa5330737bd7568d0a

      mariabackup.innodb_force_recovery 'innodb' w13 [ fail ]  timeout after 480 seconds
              Test ended at 2022-07-28 12:27:20
       
      Test case timeout after 480 seconds
       
      == /buildbot/amd64-ubuntu-2004-msan/build/mysql-test/var/13/log/innodb_force_recovery.log == 
      CREATE TABLE t(i INT) ENGINE INNODB;
      INSERT INTO t VALUES(1);
      # "innodb_force_recovery=1" should be allowed with "--prepare" only (mariabackup)
      FOUND 1 /should only be used with "--prepare"/ in backup.log
      # "innodb_force_recovery=1" should be allowed with "--apply-log" only (innobackupex)
      FOUND 1 /should only be used with "--apply-log"/ in backup.log
      # "innodb_force_recovery" should be limited to "SRV_FORCE_IGNORE_CORRUPT" (mariabackup)
      FOUND 1 /innodb_force_recovery = 1/ in backup.log
      # "innodb_force_recovery" should be limited to "SRV_FORCE_IGNORE_CORRUPT" (innobackupex)
      FOUND 1 /innodb_force_recovery = 1/ in backup.log
      # "innodb_force_recovery" should be read from "backup-my.cnf" (mariabackup)
      FOUND 1 /innodb_force_recovery = 1/ in backup.log
      # "innodb_force_recovery=1" should be read from "backup-my.cnf" (innobackupex)
      FOUND 1 /innodb_force_recovery = 1/ in backup.log
      # "innodb_force_recovery" from the command line should override "backup-my.cnf" (mariabackup)
      NOT FOUND /innodb_force_recovery = 1/ in backup.log
      # "innodb_force_recovery" from the command line should override "backup-my.cnf" (innobackupex)
      NOT FOUND /innodb_force_recovery = 1/ in backup.log
      # shutdown server
       
       == /buildbot/amd64-ubuntu-2004-msan/build/mysql-test/var/13/tmp/analyze-timeout-mysqld.1.err ==
      mysqltest: Could not open connection 'default' after 500 attempts: 2002 Can't connect to local server through socket '/buildbot/amd64-ubuntu-2004-msan/build/mysql-test/var/tmp/13/mys' (2)
      

      This fix worked in the case that I was able to reproduce:

      diff --git a/storage/innobase/srv/srv0start.cc b/storage/innobase/srv/srv0start.cc
      index 5bf085079c8..4a01326047a 100644
      --- a/storage/innobase/srv/srv0start.cc
      +++ b/storage/innobase/srv/srv0start.cc
      @@ -905,9 +905,7 @@ static lsn_t srv_prepare_to_delete_redo_log_file(bool old_exists)
       {
         DBUG_ENTER("srv_prepare_to_delete_redo_log_file");
       
      -  /* Disable checkpoints in the page cleaner. */
      -  ut_ad(!recv_sys.recovery_on);
      -  recv_sys.recovery_on= true;
      +  ut_ad(recv_sys.recovery_on);
       
         /* Clean the buffer pool. */
         buf_flush_sync();
      @@ -1589,10 +1587,10 @@ dberr_t srv_start(bool create_new_db)
       			}
       		}
       
      -		recv_sys.debug_free();
      -
       		if (srv_operation == SRV_OPERATION_RESTORE
       		    || srv_operation == SRV_OPERATION_RESTORE_EXPORT) {
      +			buf_flush_sync();
      +			recv_sys.debug_free();
       			/* After applying the redo log from
       			SRV_OPERATION_BACKUP, flush the changes
       			to the data files and truncate or delete the log.
      @@ -1684,6 +1682,8 @@ dberr_t srv_start(bool create_new_db)
       				return(srv_init_abort(err));
       			}
       		}
      +
      +		recv_sys.debug_free();
       	}
       
       	ut_ad(err == DB_SUCCESS);
      

      The failure that I observed was caused by the fact that buf_flush_page_cleaner() took a little longer to write out the recovered modified pages, and this thread had already executed recv_sys.debug_free(), which causes the page cleaner to attempt to write checkpoint information. The failures that I observed were caused by the fact that this thread had already closed and deleted the ib_logfile0 when buf_flush_page_cleaner() unexpectedly invoked log_checkpoint_low().

      This type of failure should be caused by MDEV-23855, which made log checkpoints the responsibility of the page cleaner thread.

      Because I have no stack traces of the test case timeouts, I am unable say if the above patch would fix them.

      Attachments

        Issue Links

          Activity

            People

              marko Marko Mäkelä
              marko Marko Mäkelä
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.