MDEV-14481: Execute InnoDB crash recovery in the background

Details

    Description

      InnoDB startup unnecessarily waits for recovered redo log records to be applied to the data files.

      In fact, while the function trx_sys_init_at_db_start() is executing, the pages that it requests from the buffer pool will normally have any recovered redo log applied to them in the background.

      Basically, we only need to remove or refactor some calls in the InnoDB server startup. Some of this was performed in MDEV-19514 and MDEV-21216.
      The crash recovery would ‘complete’ at the time the next redo log checkpoint is written.

      We should rewrite or remove recv_recovery_from_checkpoint_finish() and recv_sys.apply(true) so that they will not wait for any page flushing to complete (already done in MDEV-27022). While doing this, we must also remove buf_pool_t::flush_rbt (removed in MDEV-23399) and use the normal flushing mechanism that strictly obeys the ARIES write-ahead logging protocol (implemented in MDEV-24626).
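
      For reference, the ARIES write-ahead logging rule that the normal flushing mechanism must obey boils down to this: a dirty page may be written to the data file only after the redo log covering its newest modification is durable. A minimal stand-alone sketch of that invariant follows; the names (DirtyPage, durable_lsn, flush_log_up_to(), write_page()) are simplified placeholders, not InnoDB's actual API.

      #include <cassert>
      #include <cstdint>

      using lsn_t = uint64_t;

      struct DirtyPage
      {
        lsn_t oldest_modification; /* LSN of the first unflushed change */
        lsn_t newest_modification; /* LSN of the latest change */
      };

      /* The redo log has been made durable up to this LSN. */
      static lsn_t durable_lsn= 0;

      /* Pretend to write and fsync the redo log up to the given LSN. */
      static void flush_log_up_to(lsn_t lsn)
      {
        if (lsn > durable_lsn)
          durable_lsn= lsn;
      }

      /* Write-ahead logging: the log must reach the disk before the page does. */
      static void write_page(const DirtyPage &page)
      {
        flush_log_up_to(page.newest_modification);
        assert(durable_lsn >= page.newest_modification);
        /* ... only now may the page be written to the data file ... */
      }

      int main()
      {
        DirtyPage page{100, 180};
        write_page(page); /* forces the log to LSN 180 before the page write */
        return 0;
      }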

      Attachments

        Issue Links

          Activity

             I put an assertion here and it looks like the code is dead code. This is not a proof yet, however.

             This means that all CPU-intensive work is already done in tpool. The only thing left to optimize is the wait at the end of recv_sys_t::apply(). That wait mostly exists so that we know when to call the recovery cleanup. The wait plus cleanup could be put in a separate thread, or the wait could be removed completely and the cleanup moved to a log checkpoint. Simply making recv_sys_t::apply() asynchronous results in races on buffer pool pages. Also, the wait condition should be changed from waiting for zero pending reads to something else.

             kevg Eugene Kosov (Inactive) added a comment
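
             One way to picture the "wait + cleanup in a separate thread" idea above is the following sketch. The names (pending_reads, on_page_read_complete(), recv_cleanup()) are made-up stand-ins, not the actual recv_sys_t members; the point is only that the cleanup can be triggered by the last completing read instead of blocking at the end of recv_sys_t::apply().

             #include <atomic>
             #include <chrono>
             #include <condition_variable>
             #include <mutex>
             #include <thread>

             static std::atomic<unsigned> pending_reads{0}; /* pages still being read in */
             static std::mutex recv_mutex;
             static std::condition_variable recv_cv;
             static bool recovery_done= false;

             /* Called from the read completion path once a page has been recovered. */
             static void on_page_read_complete()
             {
               if (pending_reads.fetch_sub(1) == 1)
               {
                 std::lock_guard<std::mutex> lock(recv_mutex);
                 recv_cv.notify_all();
               }
             }

             /* Free the memory that recovery allocated for redo log records, etc. */
             static void recv_cleanup()
             {
               std::lock_guard<std::mutex> lock(recv_mutex);
               recovery_done= true;
             }

             int main()
             {
               pending_reads= 3;

               /* Instead of blocking at the end of recv_sys_t::apply(), a background
               thread waits for the pending reads to drain and then runs the cleanup. */
               std::thread cleanup_thread([]
               {
                 std::unique_lock<std::mutex> lock(recv_mutex);
                 recv_cv.wait(lock, [] { return pending_reads.load() == 0; });
                 lock.unlock();
                 recv_cleanup();
               });

               /* Simulate asynchronous page reads completing. */
               for (int i= 0; i < 3; i++)
               {
                 std::this_thread::sleep_for(std::chrono::milliseconds(10));
                 on_page_read_complete();
               }
               cleanup_thread.join();
               return 0;
             }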

             Yes, that code path could indeed be unreachable. In case a background read-ahead is concurrently completing for this same page, that read should be protected by a page X-latch and IO-fix, which will not be released before buf_page_read_complete() has invoked recv_recover_page(). So, there cannot possibly be any changes left to apply. If this analysis is correct, we could remove quite a bit of code without any ill effect:

            diff --git a/storage/innobase/log/log0recv.cc b/storage/innobase/log/log0recv.cc
            index 8e79a9b7e87..28010af4f7a 100644
            --- a/storage/innobase/log/log0recv.cc
            +++ b/storage/innobase/log/log0recv.cc
            @@ -2713,27 +2713,7 @@ void recv_sys_t::apply(bool last_batch)
                     }
                     continue;
                   case page_recv_t::RECV_NOT_PROCESSED:
            -        mtr.start();
            -        mtr.set_log_mode(MTR_LOG_NO_REDO);
            -        if (buf_block_t *block= buf_page_get_low(page_id, 0, RW_X_LATCH,
            -                                                 nullptr, BUF_GET_IF_IN_POOL,
            -                                                 __FILE__, __LINE__,
            -                                                 &mtr, nullptr, false))
            -        {
            -          buf_block_dbg_add_level(block, SYNC_NO_ORDER_CHECK);
            -          recv_recover_page(block, mtr, p);
            -          ut_ad(mtr.has_committed());
            -        }
            -        else
            -        {
            -          mtr.commit();
            -          recv_read_in_area(page_id);
            -          break;
            -        }
            -        map::iterator r= p++;
            -        r->second.log.clear();
            -        pages.erase(r);
            -        continue;
            +        recv_read_in_area(page_id);
                   }
             
                   goto next_page;
            

             marko Marko Mäkelä added a comment

             I filed MDEV-27610 for removing the unnecessary wait and the dead code.

            I thought that my patch might need to be revised further, to replace the goto next_page with p++, but that would cause widespread test failures. It turns out that recv_read_in_area(page_id) will transition the current block as well as possibly some following blocks from RECV_NOT_PROCESSED to RECV_BEING_READ. So, the goto next_page was doing the right thing.

             marko Marko Mäkelä added a comment - edited
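
             To illustrate why the goto next_page works here, the following toy model shows a read-ahead call flipping the state of the current page and of a few following pages, so that the scan can simply move on. The map, the states and read_ahead() are simplified stand-ins for the recv_sys_t structures, not actual server code.

             #include <cstdint>
             #include <cstdio>
             #include <map>

             enum class page_state { RECV_NOT_PROCESSED, RECV_BEING_READ };

             typedef uint64_t page_no_t;
             static std::map<page_no_t, page_state> pages;

             /* Stand-in for recv_read_in_area(): issue an asynchronous read for the
             given page and a few neighbouring pages, marking them RECV_BEING_READ. */
             static void read_ahead(page_no_t page_no)
             {
               for (page_no_t i= page_no; i < page_no + 4; i++)
               {
                 auto it= pages.find(i);
                 if (it != pages.end() && it->second == page_state::RECV_NOT_PROCESSED)
                   it->second= page_state::RECV_BEING_READ;
               }
             }

             int main()
             {
               for (page_no_t i= 1; i <= 6; i++)
                 pages.emplace(i, page_state::RECV_NOT_PROCESSED);

               for (auto p= pages.begin(); p != pages.end(); ++p)
               {
                 if (p->second == page_state::RECV_NOT_PROCESSED)
                   read_ahead(p->first); /* may also cover the following pages */
                 /* Moving on to the next page (the equivalent of goto next_page)
                 is enough: the read-ahead has already changed the state of every
                 page it covers, so they will not be picked up again. */
               }

               for (const auto &p : pages)
                 std::printf("page %llu: %s\n", (unsigned long long) p.first,
                             p.second == page_state::RECV_BEING_READ
                             ? "being read" : "not processed");
               return 0;
             }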

            It could make sense to make this an opt-in feature, by introducing a Boolean global parameter:

            1. If the server is started up with innodb_recover_in_background=ON (disabled by default), the last recovery batch would occur while SQL commands are accepted.
             2. An explicit statement SET GLOBAL innodb_recover_in_background=OFF could be executed to ensure that the recovery has been completed before starting a "serious" workload. This would free up the buffer pool space that was allocated for redo log records during recovery.
             marko Marko Mäkelä added a comment
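
             A rough sketch of the semantics such a switch could have: with the flag ON, the final recovery batch runs in a background thread while the server accepts SQL; setting the flag OFF blocks until that batch has finished. The names (BackgroundRecovery, last_batch()) are hypothetical; a real implementation would hook into the InnoDB system-variable machinery instead.

             #include <chrono>
             #include <thread>

             /* Hypothetical model of innodb_recover_in_background. */
             class BackgroundRecovery
             {
             public:
               /* innodb_recover_in_background=ON: start the final recovery batch
               in the background and let the server accept SQL meanwhile. */
               void start()
               {
                 worker= std::thread([] { last_batch(); });
               }

               /* SET GLOBAL innodb_recover_in_background=OFF: only return to the
               client once the final batch has completed. */
               void disable()
               {
                 if (worker.joinable())
                   worker.join();
               }

             private:
               /* Stand-in for applying the remaining redo log records. */
               static void last_batch()
               {
                 std::this_thread::sleep_for(std::chrono::milliseconds(50));
               }

               std::thread worker;
             };

             int main()
             {
               BackgroundRecovery recovery;
               recovery.start();   /* the server would accept SQL while this runs */
               recovery.disable(); /* blocks until recovery has completed */
               return 0;
             }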

             This fundamentally conflicts with the recovery performance improvements that I implemented in MDEV-29911. Basically, MDEV-29911 allocates most of the available buffer pool for log records, to reduce I/O traffic for data pages. This means that while the final recovery batch is running, only a tiny buffer pool might be available to users. Because recovery is multi-threaded, most of the recovery time should actually be spent parsing and storing log records (in a single thread), while actually applying the log to buffer pool pages is fast (especially when innodb_read_io_threads is set to a high value).

             I do not think that implementing recovery in the background would add that much value compared to the complexity of implementing and testing it.

             marko Marko Mäkelä added a comment

            People

              marko Marko Mäkelä
              marko Marko Mäkelä
              Votes: 2
              Watchers: 14

