[MDEV-19739] Avoid unnecessary page reads on recovery with innodb_log_optimize_ddl=ON Created: 2019-06-12  Updated: 2023-11-27  Resolved: 2019-06-12

Status: Closed
Project: MariaDB Server
Component/s: mariabackup, Storage Engine - InnoDB
Affects Version/s: 10.2.24, 10.3.15, 10.4.5
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Marko Mäkelä Assignee: Marko Mäkelä
Resolution: Won't Fix Votes: 0
Labels: recovery

Issue Links:
Relates
relates to MDEV-12699 Improve crash recovery of corrupted d... Closed
relates to MDEV-19738 Doublewrite buffer is unnecessarily u... Closed

 Description   

The MDEV-12699 fix is too conservative when it comes to the MLOG_INDEX_LOAD records, which are written on certain types of ALTER TABLE operations when innodb_log_optimize_ddl=ON (the default setting). The patch explains it:

diff --git a/storage/innobase/log/log0recv.cc b/storage/innobase/log/log0recv.cc
index dbce6f90fc2..c6bc91f8273 100644
--- a/storage/innobase/log/log0recv.cc
+++ b/storage/innobase/log/log0recv.cc
@@ -2434,7 +2434,8 @@ void recv_apply_hashed_log_recs(bool last_batch)
 					goto skip;
 				}
 
-				if (space->enable_lsn) {
+				if (UT_LIST_GET_FIRST(recv_addr->rec_list)
+				    ->start_lsn < space->enable_lsn) {
 do_read:
 					fil_space_release(space);
 					recv_addr->state = RECV_NOT_PROCESSED;

There is no need to read pages that were (re)initialized after the latest MLOG_INDEX_LOAD record for the tablespace was written. This was noticed while testing MDEV-19738.



 Comments   
Comment by Marko Mäkelä [ 2019-06-12 ]

I tried a more extensive change on 10.2. Below, the first hunk seemed to be OK in my very light testing.

diff --git a/storage/innobase/log/log0recv.cc b/storage/innobase/log/log0recv.cc
index dbce6f90fc2..ec9a75968cd 100644
--- a/storage/innobase/log/log0recv.cc
+++ b/storage/innobase/log/log0recv.cc
@@ -2434,7 +2434,8 @@ void recv_apply_hashed_log_recs(bool last_batch)
 					goto skip;
 				}
 
-				if (space->enable_lsn) {
+				if (UT_LIST_GET_FIRST(recv_addr->rec_list)
+				    ->start_lsn < space->enable_lsn) {
 do_read:
 					fil_space_release(space);
 					recv_addr->state = RECV_NOT_PROCESSED;
@@ -2757,6 +2758,31 @@ recv_mlog_index_load(ulint space_id, ulint page_no, lsn_t lsn)
 	}
 }
 
+/** Determine if a log record needs to be buffered and applied.
+@param[in]	store		whether to store page operations
+@param[in]	space_id	tablespace identifier
+@param[in]	lsn		end log sequence number of the mini-transaction
+@return	whether the log record needs to be buffered */
+static bool recv_needed(store_t store, ulint space_id, lsn_t lsn)
+{
+	ut_ad(lsn);
+
+	if (store == STORE_NO) {
+		return false;
+	}
+
+	if (fil_space_t* space = fil_space_get(space_id)) {
+		return space->enable_lsn < lsn;
+	}
+
+	if (store == STORE_IF_EXISTS) {
+		return false;
+	}
+
+	recv_spaces_t::iterator it = recv_spaces.find(space_id);
+	return it == recv_spaces.end() || it->second.enable_lsn < lsn;
+}
+
 /** Parse log records from a buffer and optionally store them to a
 hash table to wait merging to file pages.
 @param[in]	checkpoint_lsn	the LSN of the latest checkpoint
@@ -2892,16 +2918,8 @@ bool recv_parse_log_recs(lsn_t checkpoint_lsn, store_t store, bool apply)
 			break;
 #endif /* UNIV_LOG_LSN_DEBUG */
 		default:
-			switch (store) {
-			case STORE_NO:
-				break;
-			case STORE_IF_EXISTS:
-				if (fil_space_get_flags(space)
-				    == ULINT_UNDEFINED) {
-					break;
-				}
-				/* fall through */
-			case STORE_YES:
+			if (recv_needed(store, space,
+					recv_sys->recovered_lsn)) {
 				recv_add_to_hash_table(
 					type, space, page_no, body,
 					ptr + len, old_lsn,
@@ -3076,16 +3094,8 @@ bool recv_parse_log_recs(lsn_t checkpoint_lsn, store_t store, bool apply)
 				recv_parse_or_apply_log_rec_body(). */
 				break;
 			default:
-				switch (store) {
-				case STORE_NO:
-					break;
-				case STORE_IF_EXISTS:
-					if (fil_space_get_flags(space)
-					    == ULINT_UNDEFINED) {
-						break;
-					}
-					/* fall through */
-				case STORE_YES:
+				if (recv_needed(store, space,
+						new_recovered_lsn)) {
 					recv_add_to_hash_table(
 						type, space, page_no,
 						body, ptr + len,

The subsequent hunks probably make the code in the first hunk unreachable. Those hunks will also break crash recovery in the test innodb.innodb-64k-crash. This test creates two tables, inserts some data, then creates a number of indexes, inserts some more data, and finally kills and restarts the server. Recovery would hit an assertion failure:

10.2 90fec9602f2e24abd9cbb3cd2e1cadaa436a7ceb

2019-06-12 17:02:28 139917914093504 [Note] InnoDB: Starting a batch to recover 236 pages from redo log.
2019-06-12 17:02:28 0x7f4120c6d700  InnoDB: Assertion failure in file /mariadb/10.2/storage/innobase/log/log0recv.cc line 1546
#7  0x000055c6749ffe23 in recv_parse_or_apply_log_rec_body (type=MLOG_1BYTE, ptr=0x7f412207f2a0 "\002j\200\376", end_ptr=0x7f412207f2a4 "", space_id=4, page_no=0, apply=true, block=0x7f4121f9e4f0, mtr=0x7f4120c6b830) at /mariadb/10.2/storage/innobase/log/log0recv.cc:1546
#8  0x000055c6749f7855 in recv_recover_page (block=0x7f4121f9e4f0, mtr=..., recv_addr=0x7f412207f258, init_lsn=0) at /mariadb/10.2/storage/innobase/log/log0recv.cc:2153
#9  0x000055c6749f6abc in recv_recover_page (bpage=0x7f4121f9e4f0) at /mariadb/10.2/storage/innobase/log/log0recv.cc:2272
#10 0x000055c674c29f6d in buf_page_io_complete (bpage=0x7f4121f9e4f0, dblwr=true, evict=false) at /mariadb/10.2/storage/innobase/buf/buf0buf.cc:6067

This page (the first page of the tablespace) was never written after the initial .ibd file creation. With the patch, we wrongly skipped some redo log records for this page. The assertion caught this, because FIL_PAGE_TYPE had not been set in the dummy page that was written as part of the file creation.

This patch could work if we would perform log checkpoints before writing MLOG_INDEX_LOAD records, or if we flushed all pages of the tablespace before writing the MLOG_INDEX_LOAD record. Currently, before writing MLOG_INDEX_LOAD we do wait for FlushObserver::flush() to return, but that function will only wait for all pages of the newly created indexes to be written out. For other pages of the tablespace, buf_page_t::flush_observer will be NULL. While debugging the test case, I observed that the page 0 was the last-but-one entry in buf_pool->flush_list. That would explain why the block had never been written out.

Unfortunately, it looks like the page reads will be necessary. The only way to avoid them would be to ensure that all pages of the tablespace will be written to the data file before writing the MLOG_INDEX_LOAD record. (An even more extreme way would be to remove the FlushObserver altogether and to initiate a redo log checkpoint.) This might defeat the purpose of the innodb_log_optimize_ddl parameter, which was to improve performance during ALTER TABLE operations.

Generated at Thu Feb 08 08:53:59 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.