[MDEV-28185] InnoDB generates redundant log checkpoints Created: 2022-03-28  Updated: 2022-04-05  Resolved: 2022-03-30

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.2, 10.3, 10.4, 10.5, 10.6, 10.7, 10.8, 10.9
Fix Version/s: 10.5.16, 10.6.8, 10.7.4

Type: Bug Priority: Major
Reporter: Marko Mäkelä Assignee: Marko Mäkelä
Resolution: Fixed Votes: 0
Labels: performance

Issue Links:
Relates
relates to MDEV-24845 Oddities around innodb_fatal_semaphor... Closed

 Description   

The function log_checkpoint() or log_checkpoint_low() tries to avoid redundant writes. Here is the 10.3 version of log_checkpoint():

	if (oldest_lsn
	    > log_sys.last_checkpoint_lsn + SIZE_OF_MLOG_CHECKPOINT) {
		/* Some log has been written since the previous checkpoint. */
	} else if (srv_shutdown_state > SRV_SHUTDOWN_INITIATED) {
		/* MariaDB 10.3 startup expects the redo log file to be
		logically empty (not even containing a MLOG_CHECKPOINT record)
		after a clean shutdown. Perform an extra checkpoint at
		shutdown. */
	} else {
		/* Do nothing, because nothing was logged (other than
		a MLOG_CHECKPOINT marker) since the previous checkpoint. */
		log_mutex_exit();
		return(true);
	}
...
	if (fil_names_clear(flush_lsn, do_write)) {

This logic fails to take two things into account:

  1. The function fil_names_clear() may write some file names before the checkpoint record.
  2. Before MDEV-14425, the checkpoint record may be interrupted by 4+12 or 8+12 bytes of log block trailer and header.

In either case, the difference of oldest_lsn and log_sys.last_checkpoint_lsn could be more than the size of the checkpoint record.

The simplest way to fix this could be to avoid writing a new checkpoint if oldest_lsn did not change since the previous checkpoint.



 Comments   
Comment by Marko Mäkelä [ 2022-03-29 ]

My attempt at introducing a log_sys.last_checkpoint_last_lsn field did not work. It could incorrectly cause log_checkpoint() to be skipped, if all log had been durably written during a previous flushing batch, but some pages had not been flushed. We should also memorize the previous end_lsn for that to work. This was revealed by the test innodb.innodb_bug34300.

A simpler fix is to just check if the previous checkpoint only wrote a FILE_CHECKPOINT record, and skip the checkpoint in that case only. And skip it even if the record spanned two 512-byte log blocks. This fix was already implemented in MDEV-14425 for 10.8 and later.

Comment by Marko Mäkelä [ 2022-03-30 ]

I verified the fix with the following patch applied on top of it:

diff --git a/storage/innobase/buf/buf0flu.cc b/storage/innobase/buf/buf0flu.cc
index 00d96935461..c63f850cc36 100644
--- a/storage/innobase/buf/buf0flu.cc
+++ b/storage/innobase/buf/buf0flu.cc
@@ -1790,6 +1790,7 @@ static bool log_checkpoint_low(lsn_t oldest_lsn, lsn_t end_lsn)
     Perform an extra checkpoint at shutdown. */;
   else
   {
+    ut_ad(!age || age == SIZE_OF_FILE_CHECKPOINT);
     /* Do nothing, because nothing was logged (other than a
     FILE_CHECKPOINT record) since the previous checkpoint. */
     mysql_mutex_unlock(&log_sys.mutex);

This added assertion would eventually fail during the following execution:

./mtr --mysqld=--skip-innodb-file-per-table --parallel=auto --repeat=1000 innodb.innodb_bug30423{,,,,,,,,,,,,,,,,,,}

In the core dump, we had age=28, and the oldest_lsn field (which was equal to the latest checkpoint LSN) was 0x3a2d811. That is, (oldest_lsn&511)==0x11, which is only 3 bytes after the start of the log block payload area (0x0c). In other words, without the fix, we would have written a redundant FILE_CHECKPOINT record.

Redundant FILE_CHECKPOINT records will still be written if during the previous checkpoint, any FILE_MODIFY} records were written.

Generated at Thu Feb 08 09:58:45 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.