Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-28185

InnoDB generates redundant log checkpoints

Details

    Description

      The function log_checkpoint() or log_checkpoint_low() tries to avoid redundant writes. Here is the 10.3 version of log_checkpoint():

      	if (oldest_lsn
      	    > log_sys.last_checkpoint_lsn + SIZE_OF_MLOG_CHECKPOINT) {
      		/* Some log has been written since the previous checkpoint. */
      	} else if (srv_shutdown_state > SRV_SHUTDOWN_INITIATED) {
      		/* MariaDB 10.3 startup expects the redo log file to be
      		logically empty (not even containing a MLOG_CHECKPOINT record)
      		after a clean shutdown. Perform an extra checkpoint at
      		shutdown. */
      	} else {
      		/* Do nothing, because nothing was logged (other than
      		a MLOG_CHECKPOINT marker) since the previous checkpoint. */
      		log_mutex_exit();
      		return(true);
      	}
      ...
      	if (fil_names_clear(flush_lsn, do_write)) {
      

      This logic fails to take two things into account:

      1. The function fil_names_clear() may write some file names before the checkpoint record.
      2. Before MDEV-14425, the checkpoint record may be interrupted by 4+12 or 8+12 bytes of log block trailer and header.

      In either case, the difference of oldest_lsn and log_sys.last_checkpoint_lsn could be more than the size of the checkpoint record.

      The simplest way to fix this could be to avoid writing a new checkpoint if oldest_lsn did not change since the previous checkpoint.

      Attachments

        Issue Links

          Activity

            My attempt at introducing a log_sys.last_checkpoint_last_lsn field did not work. It could incorrectly cause log_checkpoint() to be skipped, if all log had been durably written during a previous flushing batch, but some pages had not been flushed. We should also memorize the previous end_lsn for that to work. This was revealed by the test innodb.innodb_bug34300.

            A simpler fix is to just check if the previous checkpoint only wrote a FILE_CHECKPOINT record, and skip the checkpoint in that case only. And skip it even if the record spanned two 512-byte log blocks. This fix was already implemented in MDEV-14425 for 10.8 and later.

            marko Marko Mäkelä added a comment - My attempt at introducing a log_sys.last_checkpoint_last_lsn field did not work. It could incorrectly cause log_checkpoint() to be skipped, if all log had been durably written during a previous flushing batch, but some pages had not been flushed. We should also memorize the previous end_lsn for that to work. This was revealed by the test innodb.innodb_bug34300 . A simpler fix is to just check if the previous checkpoint only wrote a FILE_CHECKPOINT record, and skip the checkpoint in that case only. And skip it even if the record spanned two 512-byte log blocks. This fix was already implemented in MDEV-14425 for 10.8 and later.

            I verified the fix with the following patch applied on top of it:

            diff --git a/storage/innobase/buf/buf0flu.cc b/storage/innobase/buf/buf0flu.cc
            index 00d96935461..c63f850cc36 100644
            --- a/storage/innobase/buf/buf0flu.cc
            +++ b/storage/innobase/buf/buf0flu.cc
            @@ -1790,6 +1790,7 @@ static bool log_checkpoint_low(lsn_t oldest_lsn, lsn_t end_lsn)
                 Perform an extra checkpoint at shutdown. */;
               else
               {
            +    ut_ad(!age || age == SIZE_OF_FILE_CHECKPOINT);
                 /* Do nothing, because nothing was logged (other than a
                 FILE_CHECKPOINT record) since the previous checkpoint. */
                 mysql_mutex_unlock(&log_sys.mutex);
            

            This added assertion would eventually fail during the following execution:

            ./mtr --mysqld=--skip-innodb-file-per-table --parallel=auto --repeat=1000 innodb.innodb_bug30423{,,,,,,,,,,,,,,,,,,}
            

            In the core dump, we had age=28, and the oldest_lsn field (which was equal to the latest checkpoint LSN) was 0x3a2d811. That is, (oldest_lsn&511)==0x11, which is only 3 bytes after the start of the log block payload area (0x0c). In other words, without the fix, we would have written a redundant FILE_CHECKPOINT record.

            Redundant FILE_CHECKPOINT records will still be written if during the previous checkpoint, any FILE_MODIFY} records were written.

            marko Marko Mäkelä added a comment - I verified the fix with the following patch applied on top of it: diff --git a/storage/innobase/buf/buf0flu.cc b/storage/innobase/buf/buf0flu.cc index 00d96935461..c63f850cc36 100644 --- a/storage/innobase/buf/buf0flu.cc +++ b/storage/innobase/buf/buf0flu.cc @@ -1790,6 +1790,7 @@ static bool log_checkpoint_low(lsn_t oldest_lsn, lsn_t end_lsn) Perform an extra checkpoint at shutdown. */; else { + ut_ad(!age || age == SIZE_OF_FILE_CHECKPOINT); /* Do nothing, because nothing was logged (other than a FILE_CHECKPOINT record) since the previous checkpoint. */ mysql_mutex_unlock(&log_sys.mutex); This added assertion would eventually fail during the following execution: ./mtr --mysqld=--skip-innodb-file-per-table --parallel=auto --repeat=1000 innodb.innodb_bug30423{,,,,,,,,,,,,,,,,,,} In the core dump, we had age=28 , and the oldest_lsn field (which was equal to the latest checkpoint LSN) was 0x3a2d811 . That is, (oldest_lsn&511)==0x11 , which is only 3 bytes after the start of the log block payload area ( 0x0c ). In other words, without the fix, we would have written a redundant FILE_CHECKPOINT record. Redundant FILE_CHECKPOINT records will still be written if during the previous checkpoint, any FILE_MODIFY } records were written.

            People

              marko Marko Mäkelä
              marko Marko Mäkelä
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.