Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-19739

Avoid unnecessary page reads on recovery with innodb_log_optimize_ddl=ON

Details

    Description

      The MDEV-12699 fix is too conservative when it comes to the MLOG_INDEX_LOAD records, which are written on certain types of ALTER TABLE operations when innodb_log_optimize_ddl=ON (the default setting). The patch explains it:

      diff --git a/storage/innobase/log/log0recv.cc b/storage/innobase/log/log0recv.cc
      index dbce6f90fc2..c6bc91f8273 100644
      --- a/storage/innobase/log/log0recv.cc
      +++ b/storage/innobase/log/log0recv.cc
      @@ -2434,7 +2434,8 @@ void recv_apply_hashed_log_recs(bool last_batch)
       					goto skip;
       				}
       
      -				if (space->enable_lsn) {
      +				if (UT_LIST_GET_FIRST(recv_addr->rec_list)
      +				    ->start_lsn < space->enable_lsn) {
       do_read:
       					fil_space_release(space);
       					recv_addr->state = RECV_NOT_PROCESSED;
      

      There is no need to read pages that were (re)initialized after the latest MLOG_INDEX_LOAD record for the tablespace was written. This was noticed while testing MDEV-19738.

      Attachments

        Issue Links

          Activity

            I tried a more extensive change on 10.2. Below, the first hunk seemed to be OK in my very light testing.

            diff --git a/storage/innobase/log/log0recv.cc b/storage/innobase/log/log0recv.cc
            index dbce6f90fc2..ec9a75968cd 100644
            --- a/storage/innobase/log/log0recv.cc
            +++ b/storage/innobase/log/log0recv.cc
            @@ -2434,7 +2434,8 @@ void recv_apply_hashed_log_recs(bool last_batch)
             					goto skip;
             				}
             
            -				if (space->enable_lsn) {
            +				if (UT_LIST_GET_FIRST(recv_addr->rec_list)
            +				    ->start_lsn < space->enable_lsn) {
             do_read:
             					fil_space_release(space);
             					recv_addr->state = RECV_NOT_PROCESSED;
            @@ -2757,6 +2758,31 @@ recv_mlog_index_load(ulint space_id, ulint page_no, lsn_t lsn)
             	}
             }
             
            +/** Determine if a log record needs to be buffered and applied.
            +@param[in]	store		whether to store page operations
            +@param[in]	space_id	tablespace identifier
            +@param[in]	lsn		end log sequence number of the mini-transaction
            +@return	whether the log record needs to be buffered */
            +static bool recv_needed(store_t store, ulint space_id, lsn_t lsn)
            +{
            +	ut_ad(lsn);
            +
            +	if (store == STORE_NO) {
            +		return false;
            +	}
            +
            +	if (fil_space_t* space = fil_space_get(space_id)) {
            +		return space->enable_lsn < lsn;
            +	}
            +
            +	if (store == STORE_IF_EXISTS) {
            +		return false;
            +	}
            +
            +	recv_spaces_t::iterator it = recv_spaces.find(space_id);
            +	return it == recv_spaces.end() || it->second.enable_lsn < lsn;
            +}
            +
             /** Parse log records from a buffer and optionally store them to a
             hash table to wait merging to file pages.
             @param[in]	checkpoint_lsn	the LSN of the latest checkpoint
            @@ -2892,16 +2918,8 @@ bool recv_parse_log_recs(lsn_t checkpoint_lsn, store_t store, bool apply)
             			break;
             #endif /* UNIV_LOG_LSN_DEBUG */
             		default:
            -			switch (store) {
            -			case STORE_NO:
            -				break;
            -			case STORE_IF_EXISTS:
            -				if (fil_space_get_flags(space)
            -				    == ULINT_UNDEFINED) {
            -					break;
            -				}
            -				/* fall through */
            -			case STORE_YES:
            +			if (recv_needed(store, space,
            +					recv_sys->recovered_lsn)) {
             				recv_add_to_hash_table(
             					type, space, page_no, body,
             					ptr + len, old_lsn,
            @@ -3076,16 +3094,8 @@ bool recv_parse_log_recs(lsn_t checkpoint_lsn, store_t store, bool apply)
             				recv_parse_or_apply_log_rec_body(). */
             				break;
             			default:
            -				switch (store) {
            -				case STORE_NO:
            -					break;
            -				case STORE_IF_EXISTS:
            -					if (fil_space_get_flags(space)
            -					    == ULINT_UNDEFINED) {
            -						break;
            -					}
            -					/* fall through */
            -				case STORE_YES:
            +				if (recv_needed(store, space,
            +						new_recovered_lsn)) {
             					recv_add_to_hash_table(
             						type, space, page_no,
             						body, ptr + len,
            

            The subsequent hunks probably make the code in the first hunk unreachable. Those hunks will also break crash recovery in the test innodb.innodb-64k-crash. This test creates two tables, inserts some data, then creates a number of indexes, inserts some more data, and finally kills and restarts the server. Recovery would hit an assertion failure:

            10.2 90fec9602f2e24abd9cbb3cd2e1cadaa436a7ceb

            2019-06-12 17:02:28 139917914093504 [Note] InnoDB: Starting a batch to recover 236 pages from redo log.
            2019-06-12 17:02:28 0x7f4120c6d700  InnoDB: Assertion failure in file /mariadb/10.2/storage/innobase/log/log0recv.cc line 1546
            …
            #7  0x000055c6749ffe23 in recv_parse_or_apply_log_rec_body (type=MLOG_1BYTE, ptr=0x7f412207f2a0 "\002j\200\376", end_ptr=0x7f412207f2a4 "", space_id=4, page_no=0, apply=true, block=0x7f4121f9e4f0, mtr=0x7f4120c6b830) at /mariadb/10.2/storage/innobase/log/log0recv.cc:1546
            #8  0x000055c6749f7855 in recv_recover_page (block=0x7f4121f9e4f0, mtr=..., recv_addr=0x7f412207f258, init_lsn=0) at /mariadb/10.2/storage/innobase/log/log0recv.cc:2153
            #9  0x000055c6749f6abc in recv_recover_page (bpage=0x7f4121f9e4f0) at /mariadb/10.2/storage/innobase/log/log0recv.cc:2272
            #10 0x000055c674c29f6d in buf_page_io_complete (bpage=0x7f4121f9e4f0, dblwr=true, evict=false) at /mariadb/10.2/storage/innobase/buf/buf0buf.cc:6067
            

            This page (the first page of the tablespace) was never written after the initial .ibd file creation. With the patch, we wrongly skipped some redo log records for this page. The assertion caught this, because FIL_PAGE_TYPE had not been set in the dummy page that was written as part of the file creation.

            This patch could work if we would perform log checkpoints before writing MLOG_INDEX_LOAD records, or if we flushed all pages of the tablespace before writing the MLOG_INDEX_LOAD record. Currently, before writing MLOG_INDEX_LOAD we do wait for FlushObserver::flush() to return, but that function will only wait for all pages of the newly created indexes to be written out. For other pages of the tablespace, buf_page_t::flush_observer will be NULL. While debugging the test case, I observed that the page 0 was the last-but-one entry in buf_pool->flush_list. That would explain why the block had never been written out.

            Unfortunately, it looks like the page reads will be necessary. The only way to avoid them would be to ensure that all pages of the tablespace will be written to the data file before writing the MLOG_INDEX_LOAD record. (An even more extreme way would be to remove the FlushObserver altogether and to initiate a redo log checkpoint.) This might defeat the purpose of the innodb_log_optimize_ddl parameter, which was to improve performance during ALTER TABLE operations.

            marko Marko Mäkelä added a comment - I tried a more extensive change on 10.2. Below, the first hunk seemed to be OK in my very light testing. diff --git a/storage/innobase/log/log0recv.cc b/storage/innobase/log/log0recv.cc index dbce6f90fc2..ec9a75968cd 100644 --- a/storage/innobase/log/log0recv.cc +++ b/storage/innobase/log/log0recv.cc @@ -2434,7 +2434,8 @@ void recv_apply_hashed_log_recs(bool last_batch) goto skip; } - if (space->enable_lsn) { + if (UT_LIST_GET_FIRST(recv_addr->rec_list) + ->start_lsn < space->enable_lsn) { do_read: fil_space_release(space); recv_addr->state = RECV_NOT_PROCESSED; @@ -2757,6 +2758,31 @@ recv_mlog_index_load(ulint space_id, ulint page_no, lsn_t lsn) } } +/** Determine if a log record needs to be buffered and applied. +@param[in] store whether to store page operations +@param[in] space_id tablespace identifier +@param[in] lsn end log sequence number of the mini-transaction +@return whether the log record needs to be buffered */ +static bool recv_needed(store_t store, ulint space_id, lsn_t lsn) +{ + ut_ad(lsn); + + if (store == STORE_NO) { + return false; + } + + if (fil_space_t* space = fil_space_get(space_id)) { + return space->enable_lsn < lsn; + } + + if (store == STORE_IF_EXISTS) { + return false; + } + + recv_spaces_t::iterator it = recv_spaces.find(space_id); + return it == recv_spaces.end() || it->second.enable_lsn < lsn; +} + /** Parse log records from a buffer and optionally store them to a hash table to wait merging to file pages. @param[in] checkpoint_lsn the LSN of the latest checkpoint @@ -2892,16 +2918,8 @@ bool recv_parse_log_recs(lsn_t checkpoint_lsn, store_t store, bool apply) break; #endif /* UNIV_LOG_LSN_DEBUG */ default: - switch (store) { - case STORE_NO: - break; - case STORE_IF_EXISTS: - if (fil_space_get_flags(space) - == ULINT_UNDEFINED) { - break; - } - /* fall through */ - case STORE_YES: + if (recv_needed(store, space, + recv_sys->recovered_lsn)) { recv_add_to_hash_table( type, space, page_no, body, ptr + len, old_lsn, @@ -3076,16 +3094,8 @@ bool recv_parse_log_recs(lsn_t checkpoint_lsn, store_t store, bool apply) recv_parse_or_apply_log_rec_body(). */ break; default: - switch (store) { - case STORE_NO: - break; - case STORE_IF_EXISTS: - if (fil_space_get_flags(space) - == ULINT_UNDEFINED) { - break; - } - /* fall through */ - case STORE_YES: + if (recv_needed(store, space, + new_recovered_lsn)) { recv_add_to_hash_table( type, space, page_no, body, ptr + len, The subsequent hunks probably make the code in the first hunk unreachable. Those hunks will also break crash recovery in the test innodb.innodb-64k-crash . This test creates two tables, inserts some data, then creates a number of indexes, inserts some more data, and finally kills and restarts the server. Recovery would hit an assertion failure: 10.2 90fec9602f2e24abd9cbb3cd2e1cadaa436a7ceb 2019-06-12 17:02:28 139917914093504 [Note] InnoDB: Starting a batch to recover 236 pages from redo log. 2019-06-12 17:02:28 0x7f4120c6d700 InnoDB: Assertion failure in file /mariadb/10.2/storage/innobase/log/log0recv.cc line 1546 … #7 0x000055c6749ffe23 in recv_parse_or_apply_log_rec_body (type=MLOG_1BYTE, ptr=0x7f412207f2a0 "\002j\200\376", end_ptr=0x7f412207f2a4 "", space_id=4, page_no=0, apply=true, block=0x7f4121f9e4f0, mtr=0x7f4120c6b830) at /mariadb/10.2/storage/innobase/log/log0recv.cc:1546 #8 0x000055c6749f7855 in recv_recover_page (block=0x7f4121f9e4f0, mtr=..., recv_addr=0x7f412207f258, init_lsn=0) at /mariadb/10.2/storage/innobase/log/log0recv.cc:2153 #9 0x000055c6749f6abc in recv_recover_page (bpage=0x7f4121f9e4f0) at /mariadb/10.2/storage/innobase/log/log0recv.cc:2272 #10 0x000055c674c29f6d in buf_page_io_complete (bpage=0x7f4121f9e4f0, dblwr=true, evict=false) at /mariadb/10.2/storage/innobase/buf/buf0buf.cc:6067 This page (the first page of the tablespace) was never written after the initial .ibd file creation. With the patch, we wrongly skipped some redo log records for this page. The assertion caught this, because FIL_PAGE_TYPE had not been set in the dummy page that was written as part of the file creation. This patch could work if we would perform log checkpoints before writing MLOG_INDEX_LOAD records, or if we flushed all pages of the tablespace before writing the MLOG_INDEX_LOAD record. Currently, before writing MLOG_INDEX_LOAD we do wait for FlushObserver::flush() to return, but that function will only wait for all pages of the newly created indexes to be written out. For other pages of the tablespace, buf_page_t::flush_observer will be NULL . While debugging the test case, I observed that the page 0 was the last-but-one entry in buf_pool->flush_list . That would explain why the block had never been written out. Unfortunately, it looks like the page reads will be necessary. The only way to avoid them would be to ensure that all pages of the tablespace will be written to the data file before writing the MLOG_INDEX_LOAD record. (An even more extreme way would be to remove the FlushObserver altogether and to initiate a redo log checkpoint.) This might defeat the purpose of the innodb_log_optimize_ddl parameter, which was to improve performance during ALTER TABLE operations.

            People

              marko Marko Mäkelä
              marko Marko Mäkelä
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.