[MDEV-28185] InnoDB generates redundant log checkpoints - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Major
Resolution: Fixed
Affects Version/s: 10.2(EOL), 10.3(EOL), 10.4(EOL), 10.5, 10.6, 10.7(EOL), 10.8(EOL), 10.9(EOL)
Fix Version/s: 10.5.16, 10.6.8, 10.7.4
Component/s: Storage Engine - InnoDB
Labels:
- performance

Description

The function log_checkpoint() or log_checkpoint_low() tries to avoid redundant writes. Here is the 10.3 version of log_checkpoint():

	if (oldest_lsn

	    > log_sys.last_checkpoint_lsn + SIZE_OF_MLOG_CHECKPOINT) {

		/* Some log has been written since the previous checkpoint. */

	} else if (srv_shutdown_state > SRV_SHUTDOWN_INITIATED) {

		/* MariaDB 10.3 startup expects the redo log file to be

		logically empty (not even containing a MLOG_CHECKPOINT record)

		after a clean shutdown. Perform an extra checkpoint at

		shutdown. */

	} else {

		/* Do nothing, because nothing was logged (other than

		a MLOG_CHECKPOINT marker) since the previous checkpoint. */

		log_mutex_exit();

		return(true);

...

	if (fil_names_clear(flush_lsn, do_write)) {

This logic fails to take two things into account:

The function fil_names_clear() may write some file names before the checkpoint record.
Before ~~MDEV-14425~~, the checkpoint record may be interrupted by 4+12 or 8+12 bytes of log block trailer and header.

In either case, the difference of oldest_lsn and log_sys.last_checkpoint_lsn could be more than the size of the checkpoint record.

The simplest way to fix this could be to avoid writing a new checkpoint if oldest_lsn did not change since the previous checkpoint.

Attachments

Issue Links

relates to

MDEV-24845 Oddities around innodb_fatal_semaphore_wait_threshold and global.innodb_disallow_writes

Closed

Activity

Ascending order - Click to sort in descending order

Marko Mäkelä added a comment - 2022-03-29 16:59

My attempt at introducing a log_sys.last_checkpoint_last_lsn field did not work. It could incorrectly cause log_checkpoint() to be skipped, if all log had been durably written during a previous flushing batch, but some pages had not been flushed. We should also memorize the previous end_lsn for that to work. This was revealed by the test innodb.innodb_bug34300.

A simpler fix is to just check if the previous checkpoint only wrote a FILE_CHECKPOINT record, and skip the checkpoint in that case only. And skip it even if the record spanned two 512-byte log blocks. This fix was already implemented in ~~MDEV-14425~~ for 10.8 and later.

Marko Mäkelä added a comment - 2022-03-29 16:59 My attempt at introducing a log_sys.last_checkpoint_last_lsn field did not work. It could incorrectly cause log_checkpoint() to be skipped, if all log had been durably written during a previous flushing batch, but some pages had not been flushed. We should also memorize the previous end_lsn for that to work. This was revealed by the test innodb.innodb_bug34300 . A simpler fix is to just check if the previous checkpoint only wrote a FILE_CHECKPOINT record, and skip the checkpoint in that case only. And skip it even if the record spanned two 512-byte log blocks. This fix was already implemented in MDEV-14425 for 10.8 and later.

Marko Mäkelä added a comment - 2022-03-30 04:57

I verified the fix with the following patch applied on top of it:

diff --git a/storage/innobase/buf/buf0flu.cc b/storage/innobase/buf/buf0flu.cc

index 00d96935461..c63f850cc36 100644

--- a/storage/innobase/buf/buf0flu.cc

+++ b/storage/innobase/buf/buf0flu.cc

@@ -1790,6 +1790,7 @@ static bool log_checkpoint_low(lsn_t oldest_lsn, lsn_t end_lsn)

     Perform an extra checkpoint at shutdown. */;

   else

+    ut_ad(!age || age == SIZE_OF_FILE_CHECKPOINT);

     /* Do nothing, because nothing was logged (other than a

     FILE_CHECKPOINT record) since the previous checkpoint. */

     mysql_mutex_unlock(&log_sys.mutex);

This added assertion would eventually fail during the following execution:

./mtr --mysqld=--skip-innodb-file-per-table --parallel=auto --repeat=1000 innodb.innodb_bug30423{,,,,,,,,,,,,,,,,,,}

In the core dump, we had age=28, and the oldest_lsn field (which was equal to the latest checkpoint LSN) was 0x3a2d811. That is, (oldest_lsn&511)==0x11, which is only 3 bytes after the start of the log block payload area (0x0c). In other words, without the fix, we would have written a redundant FILE_CHECKPOINT record.

Redundant FILE_CHECKPOINT records will still be written if during the previous checkpoint, any FILE_MODIFY} records were written.

Marko Mäkelä added a comment - 2022-03-30 04:57 I verified the fix with the following patch applied on top of it: diff --git a/storage/innobase/buf/buf0flu.cc b/storage/innobase/buf/buf0flu.cc index 00d96935461..c63f850cc36 100644 --- a/storage/innobase/buf/buf0flu.cc +++ b/storage/innobase/buf/buf0flu.cc @@ -1790,6 +1790,7 @@ static bool log_checkpoint_low(lsn_t oldest_lsn, lsn_t end_lsn) Perform an extra checkpoint at shutdown. */; else { + ut_ad(!age || age == SIZE_OF_FILE_CHECKPOINT); /* Do nothing, because nothing was logged (other than a FILE_CHECKPOINT record) since the previous checkpoint. */ mysql_mutex_unlock(&log_sys.mutex); This added assertion would eventually fail during the following execution: ./mtr --mysqld=--skip-innodb-file-per-table --parallel=auto --repeat=1000 innodb.innodb_bug30423{,,,,,,,,,,,,,,,,,,} In the core dump, we had age=28 , and the oldest_lsn field (which was equal to the latest checkpoint LSN) was 0x3a2d811 . That is, (oldest_lsn&511)==0x11 , which is only 3 bytes after the start of the log block payload area ( 0x0c ). In other words, without the fix, we would have written a redundant FILE_CHECKPOINT record. Redundant FILE_CHECKPOINT records will still be written if during the previous checkpoint, any FILE_MODIFY } records were written.

People

Assignee:: Marko Mäkelä

Reporter:: Marko Mäkelä

Votes:: 0 Vote for this issue

Watchers:: 1 Start watching this issue

Dates

Created:: 2022-03-28 12:33

Updated:: 2024-04-05 18:34

Resolved:: 2022-03-30 04:57

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server