[MDEV-14481] Execute InnoDB crash recovery in the background Created: 2017-11-23 Updated: 2023-04-13 Resolved: 2023-04-12 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Fix Version/s: | N/A |
| Type: | Task | Priority: | Major |
| Reporter: | Marko Mäkelä | Assignee: | Marko Mäkelä |
| Resolution: | Won't Do | Votes: | 2 |
| Labels: | performance, recovery | ||
| Attachments: |
|
| Issue Links: |
|
| Description |
|
InnoDB startup unnecessarily waits for the recovered redo log records to be applied to the data files. In fact, normally, while the function trx_sys_init_at_db_start() is executing, the pages that it requests from the buffer pool will have any recovered redo log applied to them in the background. Basically, we only need to remove or refactor some calls in the InnoDB server startup. Some of this has already been done; we should rewrite or remove what remains. |
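The idea in the description can be sketched in a few lines. The following is an illustrative model only (hypothetical names and types, not actual InnoDB code): recovered redo records are buffered per page, and whichever thread first requests a page applies the pending records to it, so startup does not have to wait for the whole log to be applied first.

```cpp
// Illustrative sketch of on-demand redo apply; names are hypothetical.
#include <cstdint>
#include <map>
#include <string>
#include <vector>

using page_id_t = uint64_t;

struct page_t {
  std::string data;  // page contents as read from the data file
  uint64_t lsn = 0;  // log sequence number up to which the page is recovered
};

struct log_rec_t {
  uint64_t lsn;         // end LSN of this record
  std::string payload;  // what to apply (simplified: appended to the page)
};

// Redo records parsed from the log but not yet applied, keyed by page.
std::map<page_id_t, std::vector<log_rec_t>> pending_redo;
std::map<page_id_t, page_t> buffer_pool;

// Fetch a page, applying any buffered redo first. Records whose LSN is not
// newer than the page are skipped: they were already applied before the crash.
page_t &get_page(page_id_t id) {
  page_t &page = buffer_pool[id];
  auto it = pending_redo.find(id);
  if (it != pending_redo.end()) {
    for (const log_rec_t &rec : it->second)
      if (rec.lsn > page.lsn) {
        page.data += rec.payload;
        page.lsn = rec.lsn;
      }
    pending_redo.erase(it);
  }
  return page;
}
```

With this shape, trx_sys_init_at_db_start() would simply call get_page() and get fully recovered pages, with no up-front wait.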
| Comments |
| Comment by Marko Mäkelä [ 2018-01-16 ] |
|
If the aim of this task is to shorten the downtime after a crash, then as noted in |
| Comment by Marko Mäkelä [ 2018-02-13 ] |
|
|
| Comment by Marko Mäkelä [ 2018-09-05 ] |
|
In MariaDB Server 10.4, as part of removing the crash-upgrade support for the pre- |
| Comment by Marko Mäkelä [ 2019-02-04 ] |
|
The buf_pool_invalidate() call was already removed as part of
This would address the above comment. With this change, I got an anomaly that did not look familiar to me:
The problem is occasionally repeatable by running the test alone:
Without the patch, this problem does not occur. Note: I did not test whether crash recovery now actually executes in the background; that could already be the case. The above patch only tries to remove a read of every .ibd file whenever any crash recovery is needed. |
| Comment by Marko Mäkelä [ 2019-02-04 ] |
|
It seems that in order to properly fix this, we must get rid of the separate buf_pool->flush_rbt and this code:
This buffer pool flush unnecessarily forces crash recovery to wait for all changes to be written back to the data files. The flush would happen anyway at the next redo log checkpoint. We must be careful to ensure that buf_page_t::oldest_modification is correctly set during crash recovery. The unnecessary flush happens before this point:
There is no call to log_checkpoint() before the server has started accepting connections. So, the main thing that needs to be done seems to be merging buf_pool->flush_list and buf_pool->flush_rbt, which should be a good idea in its own right, to improve InnoDB page flushing performance. |
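As a rough illustration of why the separate flush_rbt exists and what merging it away means (hypothetical types below, not the actual InnoDB structures): the flush list must stay ordered by oldest_modification so that a checkpoint can advance safely; normal operation appends in LSN order, while recovery must insert out of order.

```cpp
// Illustrative model of the flush list ordering invariant.
#include <cstdint>
#include <list>

struct dirty_page_t {
  uint64_t oldest_modification;  // LSN of the first unflushed change
};

struct flush_list_t {
  std::list<dirty_page_t> pages;  // ordered by oldest_modification, oldest first

  // Normal operation: the new dirty page has the newest oldest_modification,
  // so a plain append keeps the list sorted.
  void append(dirty_page_t page) { pages.push_back(page); }

  // Recovery: pages become dirty at out-of-order LSNs, so insert at the
  // sorted position; this is what the separate red-black tree was for.
  void insert_sorted(dirty_page_t page) {
    auto pos = pages.begin();
    while (pos != pages.end() &&
           pos->oldest_modification < page.oldest_modification)
      ++pos;
    pages.insert(pos, page);
  }

  // A checkpoint may advance up to the oldest unflushed modification.
  uint64_t checkpoint_limit() const {
    return pages.empty() ? UINT64_MAX : pages.front().oldest_modification;
  }
};
```

Merging the two structures means using one ordered insert path instead of keeping a shadow tree in sync with the list.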
| Comment by Marko Mäkelä [ 2019-02-04 ] |
|
Merging flush_list and flush_rbt could be part of |
| Comment by Marko Mäkelä [ 2019-03-22 ] |
|
As part of this, we should probably prevent anything from being added to the adaptive hash index before crash recovery has been completed. Alternatively, we should not omit calls to btr_search_drop_page_hash_index(block) while redo log is being applied. |
| Comment by Marko Mäkelä [ 2019-04-05 ] |
|
|
| Comment by Marko Mäkelä [ 2019-04-09 ] |
|
In |
| Comment by Marko Mäkelä [ 2019-04-12 ] |
|
I removed the call to buf_page_invalidate(). While thiru was working on |
| Comment by Marko Mäkelä [ 2019-11-05 ] |
|
As part of this work, we must remove the variable recv_lsn_checks_on and improve buf_page_check_lsn() so that any pending log records in recv_sys will be considered. |
| Comment by Marko Mäkelä [ 2019-12-04 ] |
|
Here is a simple test case:
And here is some instrumentation that attempts to artificially delay the last recovery batch by 10 seconds, so that it could finish some time after we have started executing SQL:
The output of such an invocation currently looks like this. I thought that it would be a good idea to disable log checkpoints, to slow down recovery further:
We would hope that the now() timestamps prove that the SQL was running concurrently with the last crash recovery batch. Currently, that is not the case; instead, the 10-second sleep that our instrumentation introduced delays the InnoDB startup:
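The observed behavior can be reduced to a minimal sketch, under the assumption (names are illustrative, not actual server code) that the startup thread blocks on a "recovery done" condition before accepting connections; any delay in the last recovery batch then adds to the downtime instead of overlapping with SQL execution.

```cpp
// Minimal model of startup blocking on recovery completion.
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

static std::mutex m;
static std::condition_variable cv;
static bool recovery_done = false;

// Stands in for applying the last recovery batch, artificially delayed.
void last_recovery_batch(std::chrono::milliseconds delay) {
  std::this_thread::sleep_for(delay);
  std::lock_guard<std::mutex> lock(m);
  recovery_done = true;
  cv.notify_all();
}

// Returns how long startup had to wait before it could accept connections.
std::chrono::milliseconds startup_wait_for_recovery() {
  auto start = std::chrono::steady_clock::now();
  std::unique_lock<std::mutex> lock(m);
  cv.wait(lock, [] { return recovery_done; });
  return std::chrono::duration_cast<std::chrono::milliseconds>(
      std::chrono::steady_clock::now() - start);
}
```

The goal of this task is effectively to remove that blocking wait from the startup path, so that the condition is only consulted lazily, per page.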
|
| Comment by zhai weixiang [ 2019-12-05 ] |
|
I once implemented a similar feature. Here is a note on a bug that I hit: during runtime, the server may create a new page without reading it from disk, so the pending redo log for that page is not applied. If a foreground thread then modifies the page in memory, the background thread may afterwards apply the buffered redo log to it and take the page back to an older version. |
| Comment by Marko Mäkelä [ 2019-12-05 ] |
|
zhaiwx1987, your comment reminds me of a related issue. I think that the scenario that you mention should be prevented if a call to buf_page_create() from anywhere other than the recovery code discards any buffered redo log for the page. Actually, in that scenario, there should have been an MLOG_INIT_FREE_PAGE record ( |
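The proposed rule can be sketched as follows (hypothetical names, not the actual buf_page_create() code): creating a page anew makes its contents fully defined from the creation LSN onwards, so any buffered redo up to that LSN is stale and must be discarded, which closes the race described above.

```cpp
// Illustrative sketch: page creation discards stale buffered redo.
#include <algorithm>
#include <cstdint>
#include <map>
#include <vector>

using page_id_t = uint64_t;
struct log_rec_t { uint64_t lsn; };

// Redo records buffered by recovery, keyed by page id.
std::map<page_id_t, std::vector<log_rec_t>> pending_redo;

// Create a page anew (e.g. a freshly allocated B-tree page), without reading
// it from disk. All buffered redo up to and including create_lsn is stale,
// because the caller fully defines the page contents from this LSN on.
void page_create(page_id_t id, uint64_t create_lsn) {
  auto it = pending_redo.find(id);
  if (it == pending_redo.end()) return;
  auto &recs = it->second;
  recs.erase(std::remove_if(recs.begin(), recs.end(),
                            [&](const log_rec_t &r) {
                              return r.lsn <= create_lsn;
                            }),
             recs.end());
  if (recs.empty()) pending_redo.erase(it);
}
```

Records written after the creation LSN still apply, so only the prefix is dropped.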
| Comment by Marko Mäkelä [ 2019-12-06 ] |
|
When we enable recovery in the background, we will also allow buffer pool resizing during recovery. It could be simplest to disallow shrinking the buffer pool before recovery has been completed. The parsed redo log records are currently stored in the buffer pool, in BUF_BLOCK_STATE_MEMORY blocks. If we allow recovery with innodb_read_only ( |
| Comment by Marko Mäkelä [ 2020-02-16 ] |
|
As part of this task, we must remove or rewrite most conditions on recv_recovery_is_on(). |
| Comment by Marko Mäkelä [ 2020-02-26 ] |
|
As part of this task, we should think about what to do with innodb_force_recovery. If recovery is executed in the background, we would no longer be able to abort startup because of a corrupted page. We could still abort because of a corrupted log record. If innodb_force_recovery is not set, the expectation should be that no data will be lost, and that something can be done to fix the corruption (say, mounting a file system where a data file was stored) before restarting the recovery. This could imply that innodb_force_recovery=0 (the default setting) must prevent log checkpoints from being executed until everything has been fully recovered. Users could explicitly SET GLOBAL innodb_force_recovery=1 to ‘unblock’ the database. Executing recovery in the background might further raise the importance of page corruption handling ( |
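The checkpoint-gating policy proposed above could be sketched like this (illustrative names only, not actual server code): with innodb_force_recovery=0, log checkpoints stay blocked until every page has been fully recovered, so a failed recovery can always be retried from the same log; setting innodb_force_recovery>=1 unblocks checkpoints explicitly.

```cpp
// Illustrative sketch of gating checkpoints on recovery completion.
struct recovery_state_t {
  unsigned force_recovery = 0;   // models innodb_force_recovery
  bool fully_recovered = false;  // all buffered redo has been applied
};

// May we advance the log checkpoint (and thus discard old log)?
// Discarding log before full recovery would make a retry impossible,
// so it is only allowed once recovery is complete, or when the user
// has explicitly accepted the risk via force_recovery.
bool checkpoint_allowed(const recovery_state_t &s) {
  return s.fully_recovered || s.force_recovery >= 1;
}
```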
| Comment by Marko Mäkelä [ 2020-08-03 ] |
|
The use of the predicate srv_startup_is_before_trx_rollback_phase to disable read-ahead in buf0rea.cc will likely have to be replaced in this task, because we could reset that predicate before the redo log apply completes. In fact, if we are running with innodb_read_only=ON, we could complete the rollback of recovered transactions in the buffer pool while the redo log for some other pages has still not been applied. (In that mode, we would not write anything to the redo log or to the data files; we would only keep the recovered pages in the buffer pool.) |
| Comment by Marko Mäkelä [ 2020-10-28 ] |
|
This was blocked by |
| Comment by Marko Mäkelä [ 2021-10-27 ] |
|
Many preparations for this task were already done. In the last batch in recv_sys_t::apply(bool last_batch), we would want to sort buf_pool.flush_list instead of emptying it (buf_flush_sync()). That could possibly be done as a separate bug fix (to speed up startup in 10.5+ when crash recovery is needed). |
| Comment by Eugene Kosov (Inactive) [ 2021-11-26 ] |
|
Here is some profiling info for some data. What can be improved? |
| Comment by Vladislav Vaintroub [ 2021-11-26 ] |
|
I think 1) is not a huge issue. Whether you mmap or not, the biggest problem is bringing the data from disk to memory, not the copies between memory and memory. A huge mmap of a full file, tens of GB and potentially larger than RAM, might behave hostilely towards other applications running on the same machine. Doing “direct reads” instead of populating the file system cache is as good as mmap, provided that you can do direct reads, or use some posix_fadvise, or the like. As I see it, the biggest potential is in recv_sys_t::parse, including this rbtree. Parallelizing, as in 3), could be good too. |
| Comment by Marko Mäkelä [ 2021-11-28 ] |
|
For what it is worth: when we have the log in persistent memory, we would mmap ib_logfile0 for log writes anyway. So, we might simply mmap the whole log_sys.buf to the file at all times. Note: I am not advocating the use of mmap when we are using a normal log file; that would not fly on 32-bit architectures, where virtual address space is already scarce. In the more distant future, we might want to experiment with asynchronous reads of the log during recovery (when the log is not stored in persistent memory). The challenge there would be that memory needs to be reserved for the read buffers for a longer period of time. We might want to reserve that memory from the buffer pool, like we currently do for the parsed redo log snippets in recv_sys_t::alloc(). We might allocate all that memory ahead of the asynchronous read. Copying data from the log read buffer to recv_sys.pages seems to be unavoidable, because recv_sys_t::add() would both ‘compact’ and ‘expand’ the data (remove checksums and other ‘framing’, and add log_phys_t framing that imposes alignment constraints). |
| Comment by Marko Mäkelä [ 2021-11-28 ] |
|
wlad, I believe that parallelizing recv_sys_t::apply() could bring the most benefit if a large portion of the pages can be recovered without reading them into the buffer pool. A scenario could be that the server was killed during a massive INSERT that was allocating many new pages. (This optimization was first introduced earlier.) It could turn out that most of the time, most recovered pages will have to be read into the buffer pool. In that case, the redo log would be applied in the page read completion callback. I do not expect that parallelizing the recv_read_in_area() calls (submitting the asynchronous read requests) in recv_sys_t::apply() would help much. During recovery, the buffer pool is occupied by recv_sys.pages as well as by data pages needing recovery. Starting a heavy workload concurrently with recovery could cause heavy thrashing, due to the small portion of the buffer pool that remains usable. kevg, we would need an interface that allows users to wait for recovery to finish before starting a heavy workload. We will also have to think about what to do with loading ib_buffer_pool. If recovery is needed, maybe we should only process ib_buffer_pool if the user is explicitly waiting for recovery to finish. |
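One way to parallelize the apply phase, sketched here with hypothetical names (not actual InnoDB code), is to partition the recovered pages by page id so that each worker owns a disjoint subset and no two threads ever touch the same page, avoiding page-level locking entirely.

```cpp
// Illustrative sketch of partitioned parallel redo apply.
#include <atomic>
#include <cstdint>
#include <map>
#include <thread>
#include <vector>

using page_id_t = uint64_t;

// pages[id] = number of redo records to apply to that page (simplified).
std::map<page_id_t, int> pages;
std::atomic<int> records_applied{0};

// Each worker handles only the pages whose id hashes to its slot, so the
// workers read the shared map concurrently but never work on the same page.
void apply_worker(unsigned slot, unsigned n_workers) {
  for (const auto &p : pages)
    if (p.first % n_workers == slot)
      records_applied += p.second;  // stands in for applying the records
}

void parallel_apply(unsigned n_workers) {
  std::vector<std::thread> workers;
  for (unsigned i = 0; i < n_workers; i++)
    workers.emplace_back(apply_worker, i, n_workers);
  for (auto &t : workers) t.join();
}
```

As noted above, this pays off mainly when pages can be recovered without disk reads; when reads dominate, the apply naturally moves into the read completion callbacks instead.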
| Comment by Eugene Kosov (Inactive) [ 2022-01-12 ] |
|
I put an assertion here, and it looks like the code is dead code. This is not proof yet, however. This means that all CPU-intensive work is already being done in tpool, and the only thing to optimize is the wait at the end of recv_sys_t::apply(). The wait is mostly needed in order to know when to invoke the recovery cleanup. The wait plus the cleanup could be moved to a separate thread, or the wait could be removed completely and the cleanup moved to a log checkpoint. Simply making recv_sys_t::apply() asynchronous results in races on buffer pool pages. Also, the wait condition would have to be changed from waiting for zero pending reads to something else. |
| Comment by Marko Mäkelä [ 2022-01-24 ] |
|
Yes, that code path could indeed be unreachable. In case a background read-ahead is concurrently completing for the same page, that read should be protected by a page X-latch and an IO-fix, which will not be released before buf_page_read_complete() has invoked recv_recover_page(). So, there cannot possibly be any changes left to apply. If this analysis is correct, we could remove quite a bit of code without any ill effect:
|
| Comment by Marko Mäkelä [ 2022-01-25 ] |
|
I filed a separate issue about this. I thought that my patch might need to be revised further, to replace the goto next_page with p++, but that would cause widespread test failures. It turns out that recv_read_in_area(page_id) transitions the current block, as well as possibly some following blocks, from RECV_NOT_PROCESSED to RECV_BEING_READ. So, the goto next_page was doing the right thing. |
| Comment by Marko Mäkelä [ 2023-02-09 ] |
|
It could make sense to make this an opt-in feature, by introducing a Boolean global parameter:
|
| Comment by Marko Mäkelä [ 2023-04-12 ] |
|
This fundamentally conflicts with the recovery performance improvements that I implemented more recently. I do not think that implementing recovery in the background would add enough value to justify the complexity of implementing and testing it. |