MDEV-14481: Execute InnoDB crash recovery in the background

Details

    Description

      InnoDB startup unnecessarily waits for recovered redo log records to be applied to the data files.

      In fact, while the function trx_sys_init_at_db_start() is executing, the pages that it requests from the buffer pool will normally have any recovered redo log applied to them in the background.

      Basically, we only need to remove or refactor some calls in the InnoDB server startup. Some of this was performed in MDEV-19514 and MDEV-21216.
      The crash recovery would ‘complete’ at the time the next redo log checkpoint is written.

      We should rewrite or remove recv_recovery_from_checkpoint_finish() and recv_sys.apply(true) so that they will not wait for any page flushing to complete (already done in MDEV-27022). While doing this, we must also remove buf_pool_t::flush_rbt (removed in MDEV-23399) and use the normal flushing mechanism that strictly obeys the ARIES write-ahead logging protocol (implemented in MDEV-24626).
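
      For reference, the ARIES write-ahead logging rule that the normal flushing mechanism must obey boils down to this: a dirty page may be written to the data file only after the redo log covering its newest modification is durable. A minimal stand-alone sketch of that invariant follows; the names (DirtyPage, durable_lsn, flush_log_up_to(), write_page()) are simplified placeholders, not InnoDB's actual API.

      #include <cassert>
      #include <cstdint>

      using lsn_t = uint64_t;

      struct DirtyPage
      {
        lsn_t oldest_modification; /* LSN of the first unflushed change */
        lsn_t newest_modification; /* LSN of the latest change */
      };

      /* The redo log has been made durable up to this LSN. */
      static lsn_t durable_lsn= 0;

      /* Pretend to write and fsync the redo log up to the given LSN. */
      static void flush_log_up_to(lsn_t lsn)
      {
        if (lsn > durable_lsn)
          durable_lsn= lsn;
      }

      /* Write-ahead logging: the log must reach the disk before the page does. */
      static void write_page(const DirtyPage &page)
      {
        flush_log_up_to(page.newest_modification);
        assert(durable_lsn >= page.newest_modification);
        /* ... only now may the page be written to the data file ... */
      }

      int main()
      {
        DirtyPage page{100, 180};
        write_page(page); /* forces the log to LSN 180 before the page write */
        return 0;
      }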

      Attachments

        Issue Links

          Activity

             I put an assertion here and it looks like the code is dead code. This is not a proof yet, however.

             This means that all CPU-intensive work is already done in tpool. The only thing left to optimize is the wait at the end of recv_sys_t::apply(). That wait mostly exists so that we know when to call the recovery cleanup. The wait plus cleanup could be put in a separate thread, or the wait could be removed completely and the cleanup moved to a log checkpoint. Simply making recv_sys_t::apply() asynchronous results in races on buffer pool pages. Also, the wait condition should be changed from waiting for zero pending reads to something else.

             kevg Eugene Kosov (Inactive) added a comment
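
             One way to picture the "wait + cleanup in a separate thread" idea above is the following sketch. The names (pending_reads, on_page_read_complete(), recv_cleanup()) are made-up stand-ins, not the actual recv_sys_t members; the point is only that the cleanup can be triggered by the last completing read instead of blocking at the end of recv_sys_t::apply().

             #include <atomic>
             #include <chrono>
             #include <condition_variable>
             #include <mutex>
             #include <thread>

             static std::atomic<unsigned> pending_reads{0}; /* pages still being read in */
             static std::mutex recv_mutex;
             static std::condition_variable recv_cv;
             static bool recovery_done= false;

             /* Called from the read completion path once a page has been recovered. */
             static void on_page_read_complete()
             {
               if (pending_reads.fetch_sub(1) == 1)
               {
                 std::lock_guard<std::mutex> lock(recv_mutex);
                 recv_cv.notify_all();
               }
             }

             /* Free the memory that recovery allocated for redo log records, etc. */
             static void recv_cleanup()
             {
               std::lock_guard<std::mutex> lock(recv_mutex);
               recovery_done= true;
             }

             int main()
             {
               pending_reads= 3;

               /* Instead of blocking at the end of recv_sys_t::apply(), a background
               thread waits for the pending reads to drain and then runs the cleanup. */
               std::thread cleanup_thread([]
               {
                 std::unique_lock<std::mutex> lock(recv_mutex);
                 recv_cv.wait(lock, [] { return pending_reads.load() == 0; });
                 lock.unlock();
                 recv_cleanup();
               });

               /* Simulate asynchronous page reads completing. */
               for (int i= 0; i < 3; i++)
               {
                 std::this_thread::sleep_for(std::chrono::milliseconds(10));
                 on_page_read_complete();
               }
               cleanup_thread.join();
               return 0;
             }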

             Yes, that code path could indeed be unreachable. In case a background read-ahead is concurrently completing for this same page, that read should be protected by a page X-latch and IO-fix, which will not be released before buf_page_read_complete() has invoked recv_recover_page(). So, there cannot possibly be any changes left to apply. If this analysis is correct, we could remove quite a bit of code without any ill effect:

            diff --git a/storage/innobase/log/log0recv.cc b/storage/innobase/log/log0recv.cc
            index 8e79a9b7e87..28010af4f7a 100644
            --- a/storage/innobase/log/log0recv.cc
            +++ b/storage/innobase/log/log0recv.cc
            @@ -2713,27 +2713,7 @@ void recv_sys_t::apply(bool last_batch)
                     }
                     continue;
                   case page_recv_t::RECV_NOT_PROCESSED:
            -        mtr.start();
            -        mtr.set_log_mode(MTR_LOG_NO_REDO);
            -        if (buf_block_t *block= buf_page_get_low(page_id, 0, RW_X_LATCH,
            -                                                 nullptr, BUF_GET_IF_IN_POOL,
            -                                                 __FILE__, __LINE__,
            -                                                 &mtr, nullptr, false))
            -        {
            -          buf_block_dbg_add_level(block, SYNC_NO_ORDER_CHECK);
            -          recv_recover_page(block, mtr, p);
            -          ut_ad(mtr.has_committed());
            -        }
            -        else
            -        {
            -          mtr.commit();
            -          recv_read_in_area(page_id);
            -          break;
            -        }
            -        map::iterator r= p++;
            -        r->second.log.clear();
            -        pages.erase(r);
            -        continue;
            +        recv_read_in_area(page_id);
                   }
             
                   goto next_page;
            

             marko Marko Mäkelä added a comment

             I filed MDEV-27610 for removing the unnecessary wait and the dead code.

            I thought that my patch might need to be revised further, to replace the goto next_page with p++, but that would cause widespread test failures. It turns out that recv_read_in_area(page_id) will transition the current block as well as possibly some following blocks from RECV_NOT_PROCESSED to RECV_BEING_READ. So, the goto next_page was doing the right thing.

             marko Marko Mäkelä added a comment - edited
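
             To illustrate why the goto next_page works here, the following toy model shows a read-ahead call flipping the state of the current page and of a few following pages, so that the scan can simply move on. The map, the states and read_ahead() are simplified stand-ins for the recv_sys_t structures, not actual server code.

             #include <cstdint>
             #include <cstdio>
             #include <map>

             enum class page_state { RECV_NOT_PROCESSED, RECV_BEING_READ };

             typedef uint64_t page_no_t;
             static std::map<page_no_t, page_state> pages;

             /* Stand-in for recv_read_in_area(): issue an asynchronous read for the
             given page and a few neighbouring pages, marking them RECV_BEING_READ. */
             static void read_ahead(page_no_t page_no)
             {
               for (page_no_t i= page_no; i < page_no + 4; i++)
               {
                 auto it= pages.find(i);
                 if (it != pages.end() && it->second == page_state::RECV_NOT_PROCESSED)
                   it->second= page_state::RECV_BEING_READ;
               }
             }

             int main()
             {
               for (page_no_t i= 1; i <= 6; i++)
                 pages.emplace(i, page_state::RECV_NOT_PROCESSED);

               for (auto p= pages.begin(); p != pages.end(); ++p)
               {
                 if (p->second == page_state::RECV_NOT_PROCESSED)
                   read_ahead(p->first); /* may also cover the following pages */
                 /* Moving on to the next page (the equivalent of goto next_page)
                 is enough: the read-ahead has already changed the state of every
                 page it covers, so they will not be picked up again. */
               }

               for (const auto &p : pages)
                 std::printf("page %llu: %s\n", (unsigned long long) p.first,
                             p.second == page_state::RECV_BEING_READ
                             ? "being read" : "not processed");
               return 0;
             }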

            It could make sense to make this an opt-in feature, by introducing a Boolean global parameter:

            1. If the server is started up with innodb_recover_in_background=ON (disabled by default), the last recovery batch would occur while SQL commands are accepted.
             2. An explicit statement SET GLOBAL innodb_recover_in_background=OFF could be executed to ensure that the recovery has been completed before starting a "serious" workload. This would free up the buffer pool space that was allocated for redo log records during recovery.
             marko Marko Mäkelä added a comment
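
             A rough sketch of the semantics such a switch could have: with the flag ON, the final recovery batch runs in a background thread while the server accepts SQL; setting the flag OFF blocks until that batch has finished. The names (BackgroundRecovery, last_batch()) are hypothetical; a real implementation would hook into the InnoDB system-variable machinery instead.

             #include <chrono>
             #include <thread>

             /* Hypothetical model of innodb_recover_in_background. */
             class BackgroundRecovery
             {
             public:
               /* innodb_recover_in_background=ON: start the final recovery batch
               in the background and let the server accept SQL meanwhile. */
               void start()
               {
                 worker= std::thread([] { last_batch(); });
               }

               /* SET GLOBAL innodb_recover_in_background=OFF: only return to the
               client once the final batch has completed. */
               void disable()
               {
                 if (worker.joinable())
                   worker.join();
               }

             private:
               /* Stand-in for applying the remaining redo log records. */
               static void last_batch()
               {
                 std::this_thread::sleep_for(std::chrono::milliseconds(50));
               }

               std::thread worker;
             };

             int main()
             {
               BackgroundRecovery recovery;
               recovery.start();   /* the server would accept SQL while this runs */
               recovery.disable(); /* blocks until recovery has completed */
               return 0;
             }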

             This fundamentally conflicts with the recovery performance improvements that I implemented in MDEV-29911. Basically, MDEV-29911 allocates most of the available buffer pool for log records, to reduce I/O traffic for data pages. This means that while the final recovery batch is running, only a tiny buffer pool might be available to users. Because recovery is multi-threaded, most of the recovery time should actually be spent parsing and storing log records (in a single thread), while actually applying the log to buffer pool pages is fast (especially when innodb_read_io_threads is set to a high value).

             I do not think that implementing recovery in the background would add that much value compared to the complexity of implementing and testing it.

             marko Marko Mäkelä added a comment

            People

              marko Marko Mäkelä
              marko Marko Mäkelä
              Votes: 2
              Watchers: 14

