Details
- Bug
- Status: Closed
- Priority: Critical
- Resolution: Fixed
- Affects versions: 10.5, 10.6, 10.7(EOL), 10.8(EOL), 10.9(EOL), 10.10(EOL), 10.11
- None
Description
A user is facing a scenario where mariabackup --prepare has been running for more than 12 hours, reporting page numbers to recover. It then goes through multiple iterations of recovery with no indication of how much work or how much time remains to complete the process.
It would be great if `--prepare` could report the percentage of page recovery completed per iteration, as well as a global completion percentage consolidating all iteration batches.
Attachments
- big_restore.log (33 kB)
- preparelog.txt (332 kB)
Issue Links
- blocks
  - MDEV-32042 Special handling of crash recovery in buf_page_get_gen() may cause overhead (Closed)
  - MDEV-34830 LSN in the future is not being treated as serious corruption (Closed)
- causes
  - MDEV-31386 InnoDB: Failing assertion: page_type == i_s_page_type[page_type].type_value (Closed)
  - MDEV-31827 InnoDB multi-batch recovery stops prematurely (Closed)
- is blocked by
  - MDEV-31080 InnoDB: Failing assertion: space->size == check.size in fil_validate() during deferred tablespace recovery (Closed)
- relates to
  - MDEV-11027 InnoDB log recovery is too noisy (Closed)
  - MDEV-31350 test innodb.recovery_memory failed on '21 failed attempts to flush a page' (Closed)
  - MDEV-31353 InnoDB recovery hangs after reporting corruption (Closed)
  - MDEV-31362 recv_sys_t::apply(bool): Assertion `!last_batch || recovered_lsn == scanned_lsn' failed (Closed)
  - MDEV-34062 mariadb-backup --backup is extremely slow at copying ib_logfile0 (Closed)
  - MDEV-34070 mariadb-backup --prepare with larger --use-memory may be slower (Confirmed)
  - MDEV-34907 Bogus debug assertion failure in multi-batch recovery while parsing FILE_ records (Closed)
  - MDEV-14481 Execute InnoDB crash recovery in the background (Closed)
  - MDEV-30053 mariabackup should indicate memory needed for prepare (Open)
  - MDEV-32029 Assertion failures in log_sort_flush_list upon crash recovery (Closed)
  - MDEV-34156 InnoDB fails to apply the redo log to page_compressed tablespace (Closed)
Activity
marko, the reporting is not giving any idea of how many batches are left; we really need a percentage of completion for the overall prepare phase.
Since it is reading the redo log file, it shouldn't be complicated to produce a percentage by comparing the current LSN against the final LSN. What do you think?
Thanks
Rick
I believe that there may be some overlap between recovery batches, that is, the same pages may be updated in multiple batches.
We should be able to give an estimate of how many batches will be needed after the current one, but it may be impossible to give an accurate estimate of the number of pages that are awaiting recovery in subsequent batches.
I would really like to see a complete log of --prepare or a server crash recovery that involves very many batches, so that I can validate some ideas that I have about this.
marko `--prepare` log has been attached for your reference.
To anyone who tries to reproduce this: if preparing a backup only took a short time, there cannot have been much log written since the latest checkpoint. You would probably need some concurrent write workload on the server while the backup is running. While the backup is running, the server can receive further writes and produce more log. Backup cannot switch to a later checkpoint on the fly; it will copy all log from the checkpoint that was the latest when the backup started.
Starting with MariaDB Server 10.5.7 (after the refactoring in MDEV-23855 and MDEV-23399), log checkpoints are triggered at the end of each batch of writing modified pages from the buffer pool to the data files. That can happen every second if you have enabled "background flushing" (see MDEV-27295). The innodb_log_file_size and innodb_buffer_pool_size matter a lot.
When preparing a backup, the buffer pool size is specified by the --use-memory parameter. preparelog.txt seems to be using the default size of about 100MiB. The interesting part of the log starts like this:
2022-10-29 12:50:33 0 [Note] InnoDB: Initializing buffer pool, total size = 104857600, chunk size = 104857600
|
2022-10-29 12:50:33 0 [Note] InnoDB: Completed initialization of buffer pool
|
2022-10-29 12:50:33 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=113635532803184,113635804884887
|
2022-10-29 12:50:48 0 [Note] InnoDB: Read redo log up to LSN=113648002206720
|
The difference between the two LSNs in the checkpoint message is 272,081,703 bytes. This roughly corresponds to the distance between the previous and the latest checkpoint.
In the InnoDB crash recovery, until all of the circular ib_logfile0 has been parsed, we cannot know the last recovery LSN. We would know an upper bound of it: in the worst case, there will be log records right up to the file offset where the latest checkpoint started. The formula used to be more complicated, but with MDEV-14425 (MariaDB Server 10.8), the maximum end LSN should be the checkpoint LSN plus the size of the payload area (the size of ib_logfile0 minus the 12,288-byte header).
In mariadb-backup --prepare, the last recovery LSN would actually be known: it is roughly the sum of the checkpoint LSN and the size of the append-only ib_logfile0 that was created during backup.
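To make the LSN-based percentage idea concrete, here is a minimal standalone sketch (illustrative names and made-up numbers, not InnoDB code); for server crash recovery, end_lsn would only be the upper bound discussed above:
#include <cstdint>
#include <cstdio>

// Hypothetical helper, not an InnoDB function: percentage of the redo log
// parsed so far, given the checkpoint LSN, the current parse position and
// the known (or estimated upper-bound) end LSN.
static double redo_parse_progress(uint64_t checkpoint_lsn,
                                  uint64_t current_lsn,
                                  uint64_t end_lsn)
{
  if (end_lsn <= checkpoint_lsn || current_lsn >= end_lsn)
    return 100.0;
  if (current_lsn <= checkpoint_lsn)
    return 0.0;
  return 100.0 * double(current_lsn - checkpoint_lsn)
               / double(end_lsn - checkpoint_lsn);
}

int main()
{
  // Made-up LSNs, only to show the calculation.
  std::printf("parsed %.0f%% of the redo log\n",
              redo_parse_progress(1000000, 2000000, 5000000));
}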
In the log, the latest message about reading log that I can see is the following:
2022-10-30 1:44:21 0 [Note] InnoDB: To recover: 527 pages from log
|
2022-10-30 1:44:35 0 [Note] InnoDB: Starting a batch to recover 5062 pages from redo log.
|
2022-10-30 1:44:35 0 [Note] InnoDB: Starting a batch to recover 5289 pages from redo log.
|
2022-10-30 1:44:36 0 [Note] InnoDB: Read redo log up to LSN=113635785294848
|
2022-10-30 1:44:36 0 [Note] InnoDB: Starting a batch to recover 5390 pages from redo log.
|
This is actually 19,590,039 less than the LSN at the time the checkpoint was triggered before the backup was initiated.
In MDEV-21351 (MariaDB 10.5.1), the memory management during crash recovery was simplified and improved. The rule in recv_sys_t::is_memory_exhausted() is that the number of buffer pool pages that are allocated for redo log records must not exceed a third of the size of the buffer pool. Because memory blocks can be freed as soon as all log records in them have been applied to data pages, new recovery batches can be started. That is why we are seeing so many of those tiny recovery batches (such as 5,062 pages) above, starting several times per second.
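As a minimal sketch of that rule (illustrative only; the actual check is recv_sys_t::is_memory_exhausted(), whose details may differ):
#include <cstddef>
#include <cstdio>

// Illustrative sketch of the batching rule described above, not the actual
// InnoDB implementation: once the buffer pool pages holding parsed log
// records exceed one third of the buffer pool, stop parsing and apply what
// has been stored so far in a batch.
static bool log_record_memory_exhausted(std::size_t buf_pool_pages,
                                        std::size_t pages_holding_log)
{
  return pages_holding_log > buf_pool_pages / 3;
}

int main()
{
  const std::size_t buf_pool_pages = 6400;  // about 100 MiB of 16 KiB pages
  std::printf("%d\n", log_record_memory_exhausted(buf_pool_pages, 2200));
}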
The largest batch seems to be the following:
2022-10-29 12:52:29 0 [Note] InnoDB: Starting a batch to recover 82554 pages from redo log.
|
This means that up to 33MiB of redo log records (one third of the 100MiB buffer pool) are referring to that many pages. Initially, the buffer pool has space for 4,266 pages. Maybe some page reads can be avoided (MDEV-12699) when the pages are being freed or completely initialized according to log records. The 82,554 pages are 19 times the size of the available buffer pool.
According to the log, things seem to run extremely slowly:
2022-10-29 13:14:03 0 [Note] InnoDB: To recover: 988 pages from log
|
2022-10-29 13:14:18 0 [Note] InnoDB: To recover: 929 pages from log
|
How could it take 15 seconds to apply changes to only 59 pages? Was there any other I/O or CPU load concurrently running on the computer or the storage system where the backup was being prepared? What is the maximum write speed of the storage where the backup was being prepared? Are tables in ROW_FORMAT=COMPRESSED or page_compressed being used?
What is the size of the ib_logfile0 in the backup directory?
It might make sense to change the rule "use up to one third of the buffer pool for log records" to something like "reserve two thirds of the buffer pool for data pages to process in a batch". Any performance improvements would only be possible if someone can provide a copy of a backup with which this can be reproduced.
Depending on the server workload, one trick to improve --prepare speed would be to trigger a log checkpoint right before starting the backup:
SET @save_pct= @@GLOBAL.innodb_max_dirty_pages_pct; |
SET @save_pct_lwm= @@GLOBAL.innodb_max_dirty_pages_pct_lwm; |
|
SET GLOBAL innodb_max_dirty_pages_pct_lwm=0.0; |
SET GLOBAL innodb_max_dirty_pages_pct=0.0; |
|
-- wait for the following to reach 0
|
SELECT variable_value FROM information_schema.global_status |
WHERE variable_name = 'INNODB_BUFFER_POOL_PAGES_DIRTY'; |
|
SET GLOBAL innodb_max_dirty_pages_pct = @save_pct; |
SET GLOBAL innodb_max_dirty_pages_pct_lwm = @save_pct_lwm; |
-- start the backup |
If this occurred with MariaDB 10.6.5, there have been some bug fixes to recovery performance since then: MDEV-27022, MDEV-27058, MDEV-27183, MDEV-27610, not to mention several correctness fixes (mostly affecting the recovery of DDL operations).
It would be interesting to know how fast the mariadb-backup --prepare would run when using a more recent executable.
While testing the upgrade logic of MDEV-29694, I noticed that we are missing progress reporting for scanning and parsing the log:
10.8 0751bfbcaf08c7c32de8d90b41b6c50ab525da6f |
2022-12-05 10:30:41 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=6574044319
|
2022-12-05 10:31:11 0 [Note] InnoDB: Read redo log up to LSN=6574044319
|
2022-12-05 10:31:47 0 [Note] InnoDB: Starting a batch to recover 333533 pages from redo log.
|
2022-12-05 10:31:47 0 [Note] InnoDB: To recover: 333533 pages from log
|
2022-12-05 10:32:16 0 [Note] InnoDB: To recover: 273605 pages from log
|
2022-12-05 10:32:16 0 [Note] InnoDB: 4 transaction(s) which must be rolled back or cleaned up in total 4 row operations to undo
|
2022-12-05 10:32:16 0 [Note] InnoDB: Trx id counter is 27822492
|
2022-12-05 10:32:16 0 [Note] InnoDB: Starting final batch to recover 273219 pages from redo log.
|
2022-12-05 10:32:23 0 [Note] InnoDB: 128 rollback segments are active.
|
We are supposed to generate progress reports every 15 seconds. That was not the case here, between the following lines:
2022-12-05 10:30:41 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=6574044319
|
2022-12-05 10:31:11 0 [Note] InnoDB: Read redo log up to LSN=6574044319
|
and
2022-12-05 10:31:47 0 [Note] InnoDB: To recover: 333533 pages from log
|
2022-12-05 10:32:16 0 [Note] InnoDB: To recover: 273605 pages from log
|
This was the recovery of a debug server that was killed in the middle of Sysbench oltp_update_index, with innodb_change_buffering_debug=1, running on RAM disk.
The recovery was conducted on a non-debug server using innodb_log_file_size=4G, innodb_buffer_pool_size=10G.
Because the killed server was a debug build, it was writing additional page checksum records (MDEV-18976). Disabling the creation of those records had the following effect on recovery performance:
10.8 0751bfbcaf08c7c32de8d90b41b6c50ab525da6f |
2022-12-05 11:58:26 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=7523651578
|
2022-12-05 11:58:55 0 [Note] InnoDB: Read redo log up to LSN=7523651578
|
2022-12-05 11:59:30 0 [Note] InnoDB: Starting a batch to recover 339735 pages from redo log.
|
2022-12-05 11:59:30 0 [Note] InnoDB: To recover: 339735 pages from log
|
2022-12-05 11:59:47 0 [Note] InnoDB: To recover: 158840 pages from log
|
2022-12-05 11:59:47 0 [Note] InnoDB: Starting final batch to recover 158583 pages from redo log.
|
2022-12-05 11:59:48 0 [Note] InnoDB: 128 rollback segments are active.
|
Also in this case, several messages between 11:58:26 and 11:59:30 are significantly more than 15 seconds apart.
For the previous run, we could observe a recovery speed of 59928 pages / 29 seconds = 2066 pages/s, or if we count data page bytes, 33 MB/s.
With this run, for the first batch, we can calculate 180895 pages / 17 seconds = 10641 pages/s, or 166 MiB/s. That is more reasonable.
In any case, I can confirm that the progress reporting needs to be improved. The performance feels somewhat low too; we need to check for performance bottlenecks with a profiler.
If I recover using a smaller buffer pool, then there will be frequent messages about starting a new recovery batch, like this:
2023-03-30 14:16:10 0 [Note] InnoDB: Completed initialization of buffer pool
|
2023-03-30 14:16:10 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=890094812
|
2023-03-30 14:16:25 0 [Note] InnoDB: Starting a batch to recover 12978 pages from redo log.
|
2023-03-30 14:16:25 0 [Note] InnoDB: To recover: 12978 pages from log
|
2023-03-30 14:16:26 0 [Note] InnoDB: Starting a batch to recover 13571 pages from redo log.
|
2023-03-30 14:16:30 0 [Note] InnoDB: Starting a batch to recover 80468 pages from redo log.
|
2023-03-30 14:16:37 0 [Note] InnoDB: Starting a batch to recover 84957 pages from redo log.
|
2023-03-30 14:16:40 0 [Note] InnoDB: To recover: 50796 pages from log
|
The "To recover:" messages would appear every 15 seconds. However, if I start the recovery of the same data directory with a larger buffer pool, then it will be silent for longer periods of time:
2023-03-30 14:18:05 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=890094812
|
2023-03-30 14:18:38 0 [Note] InnoDB: Trying to recover page [page id: space=5, page number=2166] from the doublewrite buffer.
|
2023-03-30 14:18:38 0 [Note] InnoDB: Recovered page [page id: space=5, page number=2166] to './test/sbtest1.ibd' from the doublewrite buffer.
|
2023-03-30 14:18:39 0 [Note] InnoDB: Read redo log up to LSN=890094812
|
2023-03-30 14:19:12 0 [Note] InnoDB: Starting a batch to recover 153487 pages from redo log.
|
2023-03-30 14:19:12 0 [Note] InnoDB: To recover: 153487 pages from log
|
2023-03-30 14:19:56 0 [Note] InnoDB: Starting a batch to recover 183721 pages from redo log.
|
2023-03-30 14:19:56 0 [Note] InnoDB: To recover: 183721 pages from log
|
2023-03-30 14:20:11 0 [Note] InnoDB: To recover: 18334 pages from log
|
2023-03-30 14:20:46 0 [Note] InnoDB: Starting a batch to recover 192135 pages from redo log.
|
2023-03-30 14:20:46 0 [Note] InnoDB: To recover: 192135 pages from log
|
2023-03-30 14:21:01 0 [Note] InnoDB: To recover: 30398 pages from log
|
2023-03-30 14:21:34 0 [Note] InnoDB: Starting a batch to recover 188234 pages from redo log.
|
2023-03-30 14:21:34 0 [Note] InnoDB: To recover: 188234 pages from log
|
2023-03-30 14:21:49 0 [Note] InnoDB: To recover: 48679 pages from log
|
2023-03-30 14:22:01 0 [Note] InnoDB: 27 transaction(s) which must be rolled back or cleaned up in total 27 row operations to undo
|
2023-03-30 14:22:01 0 [Note] InnoDB: Trx id counter is 37362293
|
2023-03-30 14:22:01 0 [Note] InnoDB: Starting final batch to recover 104686 pages from redo log.
|
2023-03-30 14:22:04 0 [Note] InnoDB: To recover: 29018 pages from log
|
2023-03-30 14:22:05 0 [Note] InnoDB: 128 rollback segments are active.
|
2023-03-30 14:22:05 0 [Note] InnoDB: Starting in background the rollback of recovered transactions
|
…
|
2023-03-30 14:22:05 0 [Note] InnoDB: Rollback of non-prepared transactions completed
|
2023-03-30 14:22:05 0 [Note] InnoDB: File './ibtmp1' size is now 12.000MiB.
|
2023-03-30 14:22:05 0 [Note] InnoDB: log sequence number 5943243155 (memory-mapped); transaction id 37362321
|
The "To recover:" messages should be output every 15 seconds, but there were no messages at 14:20:26 or 14:20:41. The "Starting a batch to recover" messages are unconditionally output at the start of each batch.
During this experiment, it also caught my eye that the recovery is occupying only one CPU core. I suppose that recovery avoids reading most data pages thanks to MDEV-12699, and most of the recovery will simply initialize pages in the buffer pool. To analyze that, I started a perf record -g during a recovery of the same data directory contents with the same innodb_buffer_pool_size=10g, and I noticed another anomaly for this repeated run:
2023-03-30 14:30:58 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=890094812
|
2023-03-30 14:31:33 0 [Note] InnoDB: Read redo log up to LSN=890094812
|
2023-03-30 14:32:07 0 [Note] InnoDB: Starting a batch to recover 153487 pages from redo log.
|
2023-03-30 14:32:07 0 [Note] InnoDB: To recover: 153487 pages from log
|
2023-03-30 14:32:55 0 [Note] InnoDB: Starting a batch to recover 183721 pages from redo log.
|
2023-03-30 14:32:55 0 [Note] InnoDB: To recover: 183721 pages from log
|
There was no progress message about "Read redo log up to" for the first 35 seconds of the recovery. Also, the second "To recover" message was issued 48 seconds after the first one. It looks like the messages are somehow being missed sporadically. Another try with innodb_buffer_pool_size=20g:
2023-03-30 14:38:19 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=890094812
|
2023-03-30 14:39:22 0 [Note] InnoDB: Read redo log up to LSN=890094812
|
2023-03-30 14:40:33 0 [Note] InnoDB: Starting a batch to recover 249675 pages from redo log.
|
2023-03-30 14:40:33 0 [Note] InnoDB: To recover: 249675 pages from log
|
2023-03-30 14:40:48 0 [Note] InnoDB: To recover: 65746 pages from log
|
2023-03-30 14:42:20 0 [Note] InnoDB: Starting a batch to recover 299196 pages from redo log.
|
2023-03-30 14:42:20 0 [Note] InnoDB: To recover: 299196 pages from log
|
2023-03-30 14:42:35 0 [Note] InnoDB: To recover: 204605 pages from log
|
2023-03-30 14:42:50 0 [Note] InnoDB: To recover: 76653 pages from log
|
2023-03-30 14:43:05 0 [Note] InnoDB: To recover: 3039 pages from log
|
2023-03-30 14:43:18 0 [Note] InnoDB: 27 transaction(s) which must be rolled back or cleaned up in total 27 row operations to undo
|
2023-03-30 14:43:18 0 [Note] InnoDB: Trx id counter is 37362293
|
2023-03-30 14:43:18 0 [Note] InnoDB: Starting final batch to recover 104676 pages from redo log.
|
2023-03-30 14:43:20 0 [Note] InnoDB: To recover: 79977 pages from log
|
2023-03-30 14:43:22 0 [Note] InnoDB: 128 rollback segments are active.
|
2023-03-30 14:43:22 0 [Note] InnoDB: Starting in background the rollback of recovered transactions
|
…
|
2023-03-30 14:43:22 0 [Note] InnoDB: Rollback of non-prepared transactions completed
|
2023-03-30 14:43:22 0 [Note] InnoDB: Removed temporary tablespace data file: "./ibtmp1"
|
2023-03-30 14:43:22 0 [Note] InnoDB: Setting O_DIRECT on file ./ibtmp1 failed
|
2023-03-30 14:43:22 0 [Note] InnoDB: Setting file './ibtmp1' size to 12.000MiB. Physically writing the file full; Please wait ...
|
2023-03-30 14:43:22 0 [Note] InnoDB: File './ibtmp1' size is now 12.000MiB.
|
2023-03-30 14:43:22 0 [Note] InnoDB: log sequence number 5943243155 (memory-mapped); transaction id 37362321
|
Right before issuing the first message "Starting a batch" it was busy parsing the redo log. The "Read redo log up to LSN" message is intended to be issued every 15 seconds, just like the "To recover:" ones. Because perf failed to collect samples for some reason, I ran it once more, this time with innodb_buffer_pool_size=30g, to be met with similar initial silence:
2023-03-30 14:50:36 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=890094812
|
2023-03-30 14:52:09 0 [Note] InnoDB: Read redo log up to LSN=890094812
|
2023-03-30 14:53:59 0 [Note] InnoDB: Starting a batch to recover 343907 pages from redo log.
|
2023-03-30 14:53:59 0 [Note] InnoDB: To recover: 343907 pages from log
|
2023-03-30 14:54:14 0 [Note] InnoDB: To recover: 162142 pages from log
|
2023-03-30 14:54:29 0 [Note] InnoDB: To recover: 16129 pages from log
|
2023-03-30 14:55:18 0 [Note] InnoDB: To recover: 200958 pages from log
|
2023-03-30 14:55:20 0 [Note] InnoDB: 27 transaction(s) which must be rolled back or cleaned up in total 27 row operations to undo
|
2023-03-30 14:55:20 0 [Note] InnoDB: Trx id counter is 37362293
|
2023-03-30 14:55:20 0 [Note] InnoDB: Starting final batch to recover 200464 pages from redo log.
|
2023-03-30 14:55:33 0 [Note] InnoDB: To recover: 61463 pages from log
|
2023-03-30 14:55:42 0 [Note] InnoDB: 128 rollback segments are active.
|
2023-03-30 14:55:42 0 [Note] InnoDB: Starting in background the rollback of recovered transactions
|
…
|
2023-03-30 14:55:42 0 [Note] InnoDB: Rollback of non-prepared transactions completed
|
2023-03-30 14:55:42 0 [Note] InnoDB: File './ibtmp1' size is now 12.000MiB.
|
2023-03-30 14:55:42 0 [Note] InnoDB: log sequence number 5943243155 (memory-mapped); transaction id 37362321
|
During the 3½ minutes of the (necessarily) single-threaded initial parsing of the redo log, only one "Read redo log up to LSN" message was issued. The "To recover" messages were issued almost reliably during this run. The perf report data shows that some page reads did take place, so some log should have been applied in multiple threads. Yet, the CPU utilization hovered close to 100% (of one CPU core) for the entire run.
I got a bit distracted with recovery performance testing, reading the perf report. In my previous comment, I mentioned recovery with innodb_buffer_pool_size=30g:
2023-03-30 14:50:36 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=890094812
|
…
|
2023-03-30 14:55:42 0 [Note] InnoDB: log sequence number 5943243155 (memory-mapped); transaction id 37362321
|
The total time was 5 minutes and 6 seconds. I managed to optimize away 2 minutes:
2023-03-30 18:09:29 0 [Note] InnoDB: Completed initialization of buffer pool
|
2023-03-30 18:11:38 0 [Note] InnoDB: To recover: 447139 pages from log
|
2023-03-30 18:11:45 0 [Note] InnoDB: 27 transaction(s) which must be rolled back or cleaned up in total 27 row operations to undo
|
2023-03-30 18:11:45 0 [Note] InnoDB: Trx id counter is 37362293
|
2023-03-30 18:11:45 0 [Note] InnoDB: Starting final batch to recover 446636 pages from redo log.
|
2023-03-30 18:11:53 0 [Note] InnoDB: To recover: 296136 pages from log
|
2023-03-30 18:12:08 0 [Note] InnoDB: To recover: 89868 pages from log
|
2023-03-30 18:12:23 0 [Note] InnoDB: To recover: 23926 pages from log
|
2023-03-30 18:12:33 0 [Note] InnoDB: 128 rollback segments are active.
|
…
|
2023-03-30 18:12:33 0 [Note] InnoDB: log sequence number 5943243155 (memory-mapped); transaction id 37362321
|
That is 3 minutes and 4 seconds for the recovery. Most of the time goes into parsing and storing log records, which I think must be single-threaded.
I tried to revert one change, to see if it was for the better or for the worse:
2023-03-30 18:37:54 0 [Note] InnoDB: Completed initialization of buffer pool
|
2023-03-30 18:40:11 0 [Note] InnoDB: To recover: 447139 pages from log
|
2023-03-30 18:40:21 0 [Note] InnoDB: 27 transaction(s) which must be rolled back or cleaned up in total 27 row operations to undo
|
2023-03-30 18:40:21 0 [Note] InnoDB: Trx id counter is 37362293
|
2023-03-30 18:40:21 0 [Note] InnoDB: Starting final batch to recover 446636 pages from redo log.
|
2023-03-30 18:40:26 0 [Note] InnoDB: To recover: 364557 pages from log
|
2023-03-30 18:40:41 0 [Note] InnoDB: To recover: 128886 pages from log
|
2023-03-30 18:40:56 0 [Note] InnoDB: To recover: 43070 pages from log
|
2023-03-30 18:41:10 0 [Note] InnoDB: 128 rollback segments are active.
|
…
|
2023-03-30 18:41:10 0 [Note] InnoDB: Setting file './ibtmp1' size to 12.000MiB. Physically writing the file full; Please wait ...
|
This is 3 minutes and 16 seconds, a little worse.
The last 48 or 49 seconds of these runs are actual log application, which could be parallelized. If we assume 16 concurrent threads, ideally that could be shaved to 3 seconds; that is, the total time for the faster run would be 2 minutes and 32 seconds. I think that I will try that as well. The more reliable progress reporting would only be icing on the cake.
The messages InnoDB: To recover: would only be output when applying log to a page in recv_recover_page(). The initial parsing stage can take a long time. I added an intentionally typo-ed message to the log parsing loop in recv_scan_log(). This is what the output looks like (recovering a different data directory than yesterday):
2023-03-31 12:06:21 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=45451
|
2023-03-31 12:06:36 0 [Note] InnoDB: To resover: 110099 pages from log
|
2023-03-31 12:06:51 0 [Note] InnoDB: To resover: 160222 pages from log
|
2023-03-31 12:07:06 0 [Note] InnoDB: To resover: 213224 pages from log
|
2023-03-31 12:07:21 0 [Note] InnoDB: To resover: 262026 pages from log
|
2023-03-31 12:07:41 0 [Note] InnoDB: Read redo log up to LSN=45451
|
2023-03-31 12:07:56 0 [Note] InnoDB: To resover: 91522 pages from log
|
2023-03-31 12:08:11 0 [Note] InnoDB: To resover: 130295 pages from log
|
2023-03-31 12:08:26 0 [Note] InnoDB: To resover: 169361 pages from log
|
2023-03-31 12:08:41 0 [Note] InnoDB: To resover: 212187 pages from log
|
2023-03-31 12:08:56 0 [Note] InnoDB: To resover: 247702 pages from log
|
2023-03-31 12:09:11 0 [Note] InnoDB: To resover: 283953 pages from log
|
2023-03-31 12:09:24 0 [Note] InnoDB: Starting a batch to recover 315871 pages from redo log.
|
2023-03-31 12:09:26 0 [Note] InnoDB: To recover: 314811 pages from log
|
2023-03-31 12:09:41 0 [Note] InnoDB: To recover: 100224 pages from log
|
2023-03-31 12:09:56 0 [Note] InnoDB: To recover: 8134 pages from log
|
2023-03-31 12:10:11 0 [Note] InnoDB: To resover: 100637 pages from log
|
2023-03-31 12:10:26 0 [Note] InnoDB: To resover: 140568 pages from log
|
2023-03-31 12:10:41 0 [Note] InnoDB: To resover: 185950 pages from log
|
2023-03-31 12:10:56 0 [Note] InnoDB: To resover: 225344 pages from log
|
2023-03-31 12:11:11 0 [Note] InnoDB: To resover: 259438 pages from log
|
2023-03-31 12:11:22 0 [Note] InnoDB: 16 transaction(s) which must be rolled back or cleaned up in total 16 row operations to undo
|
2023-03-31 12:11:22 0 [Note] InnoDB: Trx id counter is 41494841
|
2023-03-31 12:11:22 0 [Note] InnoDB: Starting final batch to recover 271971 pages from redo log.
|
2023-03-31 12:11:26 0 [Note] InnoDB: To recover: 212473 pages from log
|
2023-03-31 12:11:41 0 [Note] InnoDB: To recover: 49345 pages from log
|
2023-03-31 12:11:51 0 [Note] InnoDB: 128 rollback segments are active.
|
…
|
2023-03-31 12:11:51 0 [Note] InnoDB: log sequence number 6269792715 (memory-mapped); transaction id 41494858
|
That is, for the first 3 minutes recovery would likely not execute recv_recover_page() at all, only recv_scan_log() (typo-ed To resover). After the first To recover message at 12:09:26 both types of messages are being interleaved.
The tight loop in recv_scan_log() might not be the right place for the message output, though. Let us compare how quickly the same data directory recovered before I added the message output:
2023-03-31 11:58:42 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=45451
|
2023-03-31 12:00:00 0 [Note] InnoDB: Read redo log up to LSN=45451
|
2023-03-31 12:01:35 0 [Note] InnoDB: Starting a batch to recover 315871 pages from redo log.
|
2023-03-31 12:01:35 0 [Note] InnoDB: To recover: 315871 pages from log
|
2023-03-31 12:01:50 0 [Note] InnoDB: To recover: 120018 pages from log
|
2023-03-31 12:02:05 0 [Note] InnoDB: To recover: 1754 pages from log
|
2023-03-31 12:03:19 0 [Note] InnoDB: To recover: 272422 pages from log
|
2023-03-31 12:03:23 0 [Note] InnoDB: 16 transaction(s) which must be rolled back or cleaned up in total 16 row operations to undo
|
2023-03-31 12:03:23 0 [Note] InnoDB: Trx id counter is 41494841
|
2023-03-31 12:03:23 0 [Note] InnoDB: Starting final batch to recover 271971 pages from redo log.
|
2023-03-31 12:03:34 0 [Note] InnoDB: To recover: 116378 pages from log
|
2023-03-31 12:03:49 0 [Note] InnoDB: To recover: 4730 pages from log
|
2023-03-31 12:03:51 0 [Note] InnoDB: 128 rollback segments are active.
|
2023-03-31 12:03:51 0 [Note] InnoDB: Starting in background the rollback of recovered transactions
|
…
|
2023-03-31 12:03:51 0 [Note] InnoDB: log sequence number 6269792715 (memory-mapped); transaction id 41494858
|
With the messages added: 12:11:51 - 12:06:21 = 5 minutes and 30 seconds
Without the messages: 12:03:51 - 11:58:42 = 5 minutes and 9 seconds
I will try to find a more suitable place for the message output, to avoid introducing any significant performance regression.
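One inexpensive pattern for that (an assumption on my part, not necessarily what was eventually committed) is to amortize the clock read over many loop iterations, so that the hot parsing loop usually only decrements a counter:
#include <chrono>
#include <cstddef>
#include <cstdio>

// Sketch only, not the InnoDB implementation: read the clock once per
// `interval` calls instead of on every iteration of a hot loop, and still
// print the progress line at most once per 15 seconds.
class cheap_progress
{
  using clock = std::chrono::steady_clock;
  clock::time_point last_report{clock::now()};
  unsigned countdown = 0;

public:
  void tick(std::size_t pages_to_recover, unsigned interval = 4096)
  {
    if (countdown)          // fast path: no clock read at all
    {
      countdown--;
      return;
    }
    countdown = interval;
    const clock::time_point now = clock::now();
    if (now - last_report >= std::chrono::seconds(15))
    {
      std::printf("To recover: %zu pages from log\n", pages_to_recover);
      last_report = now;
    }
  }
};

int main()
{
  cheap_progress p;
  for (std::size_t i = 0; i < 1000000; i++)
    p.tick(1000000 - i);
}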
I confirmed with ATTRIBUTE_NOINLINE that virtually all CPU cycles of recv_sys_t::recover_low() are being spent in recv_sys_t::free(). I believe that implementing MDEV-29445 will help with that. In total, these two functions waste over 10% of single-threaded CPU time:
5.18% mariadbd mariadbd [.] recv_sys_t::add
|
4.78% mariadbd mariadbd [.] recv_sys_t::free
|
I reran the recovery with an extra parameter --innodb-buffer-pool-chunk-size=30g to work around MDEV-25342, which would normally divide the buffer pool into 64 chunks:
2023-03-31 13:21:37 0 [Note] InnoDB: Initializing buffer pool, total size = 30.000GiB, chunk size = 30.000GiB
|
2023-03-31 13:21:38 0 [Note] InnoDB: Completed initialization of buffer pool
|
2023-03-31 13:21:38 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=45451
|
…
|
2023-03-31 13:26:07 0 [Note] InnoDB: log sequence number 6269792715 (memory-mapped); transaction id 41494858
|
There was no impact on the recovery time (still 5½ minutes, because the extra To resover messages were present). The ranking of recv_sys_t::free() dropped a bit:
2.92% mariadbd mariadbd [.] recv_sys_t::free
|
The remaining performance hog in that function is an atomic operation. We are repurposing a field in the buffer block descriptor in this custom allocator, and we do not actually need any atomic operation because we are holding recv_sys.mutex:
@@ -1411,8 +1422,11 @@ inline void recv_sys_t::free(const void *data)
|
ut_ad(block->page.state() == buf_page_t::MEMORY);
|
ut_ad(static_cast<uint16_t>(block->page.access_time - 1) <
|
srv_page_size);
|
- ut_ad(block->page.access_time >= 1U << 16);
|
- if (!((block->page.access_time -= 1U << 16) >> 16))
|
+ unsigned a= block->page.access_time;
|
+ ut_ad(a >= 1U << 16);
|
+ a-= 1U << 16;
|
+ block->page.access_time= a;
|
+ if (!(a >> 16))
|
{
|
UT_LIST_REMOVE(blocks, block);
|
MEM_MAKE_ADDRESSABLE(block->page.frame, srv_page_size); |
In recv_sys_t::alloc() we are already using relaxed loads and stores instead of the costly atomic read-modify-write (lock xadd in 80486 and its successors). With the above change, the recv_sys_t::free() improved and the recovery time improved by a whopping 25% (82 seconds) from 5½ minutes to 4 minutes and 8 seconds:
1.15% mariadbd mariadbd [.] recv_sys_t::free
|
It looks like the remaining time in this function is dominated by the loop that iterates over the buffer pool chunks (in this case, only 1, but frequently executed conditions destroy performance). That would be addressed by MDEV-29445.
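In std::atomic terms, the difference looks like this (a toy illustration of the point above; the InnoDB field in question is not a std::atomic, but the generated code is analogous):
#include <atomic>
#include <mutex>

// Toy illustration: when a counter is only ever modified while holding a
// mutex, a relaxed load plus store is enough; an atomic read-modify-write
// (fetch_sub, i.e. LOCK XADD on x86) only adds cost. Illustrative names.
struct counter_under_mutex
{
  std::mutex m;
  std::atomic<unsigned> refs{2};

  void release_expensive()     // what the code effectively did before
  {
    std::lock_guard<std::mutex> g(m);
    refs.fetch_sub(1);         // locked read-modify-write, not needed here
  }

  void release_cheap()         // what the change above switches to
  {
    std::lock_guard<std::mutex> g(m);
    unsigned a = refs.load(std::memory_order_relaxed);
    refs.store(a - 1, std::memory_order_relaxed);
  }
};

int main()
{
  counter_under_mutex c;
  c.release_expensive();
  c.release_cheap();
}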
I was thinking that something like this could improve performance a lot in case a single recovery batch is not sufficient:
diff --git a/storage/innobase/log/log0recv.cc b/storage/innobase/log/log0recv.cc
|
index 148ab4ce55b..d1479456e62 100644
|
--- a/storage/innobase/log/log0recv.cc
|
+++ b/storage/innobase/log/log0recv.cc
|
@@ -4212,7 +4212,8 @@ dberr_t recv_recovery_from_checkpoint_start()
|
ut_ad(srv_force_recovery <= SRV_FORCE_NO_UNDO_LOG_SCAN);
|
|
if (rescan) {
|
- recv_sys.lsn = log_sys.next_checkpoint_lsn;
|
+ recv_sys.lsn = std::max(log_sys.next_checkpoint_lsn,
|
+ recv_sys.last_stored_lsn);
|
rescan = recv_scan_log(true);
|
if ((recv_sys.is_corrupt_log()
|
&& !srv_force_recovery) |
This exact change will make the test innodb.recovery_memory fail, and I imagine that the failure is somehow related to the FILE_ records for file name tracking. During the first scan of the redo log, we may parse log records for tablespace files that may have been deleted or renamed after the time of the latest checkpoint. Until we see a FILE_RENAME or FILE_DELETE record for the tablespace ID, we cannot know what the right thing to do is. Currently, the log between the latest checkpoint and its corresponding FILE_CHECKPOINT record is being parsed in "store if exists" mode: if the tablespace for some pages is not found, we will discard the records. It could be better to unconditionally store all log records in recv_sys.pages and check for inconsistency upon reaching the FILE_CHECKPOINT record. Some added progress reporting highlights this problem:
2023-03-31 16:52:01 0 [Note] InnoDB: Initializing buffer pool, total size = 30.000GiB, chunk size = 30.000GiB
|
2023-03-31 16:52:01 0 [Note] InnoDB: Completed initialization of buffer pool
|
2023-03-31 16:52:01 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=45451
|
2023-03-31 16:52:16 0 [Note] InnoDB: Parsed redo log up to LSN=1742292932; to recover: 106297 pages
|
2023-03-31 16:52:31 0 [Note] InnoDB: Parsed redo log up to LSN=2311978604; to recover: 156337 pages
|
2023-03-31 16:52:46 0 [Note] InnoDB: Parsed redo log up to LSN=2942556105; to recover: 210401 pages
|
2023-03-31 16:53:01 0 [Note] InnoDB: Parsed redo log up to LSN=3504149533; to recover: 259067 pages
|
2023-03-31 16:53:23 0 [Note] InnoDB: Read redo log up to LSN=45451
|
2023-03-31 16:53:38 0 [Note] InnoDB: Parsed redo log up to LSN=1645302567; to recover: 93217 pages
|
2023-03-31 16:53:53 0 [Note] InnoDB: Parsed redo log up to LSN=2166257717; to recover: 138937 pages
|
2023-03-31 16:54:08 0 [Note] InnoDB: Parsed redo log up to LSN=2746862824; to recover: 188829 pages
|
2023-03-31 16:54:23 0 [Note] InnoDB: Parsed redo log up to LSN=3229740111; to recover: 231309 pages
|
2023-03-31 16:54:38 0 [Note] InnoDB: Parsed redo log up to LSN=3706313376; to recover: 272336 pages
|
2023-03-31 16:54:53 0 [Note] InnoDB: Parsed redo log up to LSN=4173821625; to recover: 312515 pages
|
2023-03-31 16:54:54 0 [Note] InnoDB: Starting a batch to recover 315871 pages from redo log.
|
2023-03-31 16:55:08 0 [Note] InnoDB: To recover: 95902 pages from log
|
2023-03-31 16:55:26 0 [Note] InnoDB: Parsed redo log up to LSN=4213203801; to recover: 0 pages
|
2023-03-31 16:55:40 0 [Note] InnoDB: Parsed redo log up to LSN=4665667725; to recover: 124497 pages
|
2023-03-31 16:55:55 0 [Note] InnoDB: Parsed redo log up to LSN=5227132263; to recover: 174975 pages
|
2023-03-31 16:56:10 0 [Note] InnoDB: Parsed redo log up to LSN=5722640474; to recover: 221146 pages
|
2023-03-31 16:56:25 0 [Note] InnoDB: Parsed redo log up to LSN=6131373971; to recover: 259411 pages
|
2023-03-31 16:56:34 0 [Note] InnoDB: 16 transaction(s) which must be rolled back or cleaned up in total 16 row operations to undo
|
2023-03-31 16:56:34 0 [Note] InnoDB: Trx id counter is 41494841
|
2023-03-31 16:56:34 0 [Note] InnoDB: Starting final batch to recover 271971 pages from redo log.
|
2023-03-31 16:56:40 0 [Note] InnoDB: To recover: 155518 pages from log
|
…
|
2023-03-31 16:56:53 0 [Note] InnoDB: log sequence number 6269792715 (memory-mapped); transaction id 41494847
|
The time between 16:52:01 and 16:53:23 (82 seconds) was basically wasted, because we will restart parsing the log from the checkpoint. Another way of looking at it is to interpolate the timestamp corresponding to the LSN that was displayed at 16:53:01. It could be around 16:54:30 (a little before a somewhat larger LSN was reported at 16:54:38). Again, the difference is more than 80 seconds. That is quite a bit, considering that the total recovery time is 4 minutes and 52 seconds (292 seconds). We could save another 29% of time there. Together with the previous savings in recv_sys_t::free(), this should halve the recovery time!
This could affect many users, because my recovery attempt was with a larger buffer pool (30G) than was used for originally creating the data directory (20G). With an equal-sized or smaller buffer pool and infrequent log checkpoints, a multi-batch recovery should be very likely.
Also the memory limit (do not let log records occupy more than ⅓ of the buffer pool) feels rather arbitrary. Do we really need to reserve ⅔ of a huge buffer pool for copies of data pages during recovery, or would something like 128 or 256 pages suffice? Recovery only posts batches of 32 asynchronous page reads when it needs to read pages, and during the recovery of my test datasets I hardly see any page reads taking place. However, we must keep in mind that this is a rather special test: no checkpoint had been written after the server was bootstrapped, and practically everything can be recovered (MDEV-12699) without reading any pages from the data files. I will have to test also a case where there was a checkpoint between loading the test data and running the UPDATE workload.
I tried "pushing down" the memory limit to recv_sys_t::add(), but it needs some more work, because the test innodb.recovery_memory would occasionally fail.
Today, I made some progress on "pushing down OOM checks". In case we run out of memory, we will resume parsing the log from the start of the current mini-transaction, not from the start of the log. The memory usage limit is more reasonable too: for larger buffer pools, only BUF_LRU_MIN_LEN (256) pages are reserved for the to-be-recovered pages. Even that is somewhat of an overkill, because it is 4 times the maximum value of innodb_read_io_threads=64.
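Before the log excerpt below, here is a minimal sketch of the revised limit as described (illustrative names and constants, not the committed code):
#include <cstddef>
#include <cstdio>

// Sketch of the revised rule: for large buffer pools, reserve only
// BUF_LRU_MIN_LEN pages for the data pages that a batch works on, and let
// parsed log records occupy the rest (the old rule capped log records at
// one third of the pool). Not the committed implementation.
static const std::size_t BUF_LRU_MIN_LEN = 256;

static std::size_t max_pages_for_log_records(std::size_t buf_pool_pages)
{
  return buf_pool_pages > BUF_LRU_MIN_LEN ? buf_pool_pages - BUF_LRU_MIN_LEN
                                          : 0;
}

int main()
{
  std::printf("100 MiB pool: up to %zu pages of log records\n",
              max_pages_for_log_records(6400));
  std::printf("10 GiB pool: up to %zu pages of log records\n",
              max_pages_for_log_records(655360));
}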
2023-04-03 16:38:35 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=950555258
|
2023-04-03 16:38:50 0 [Note] InnoDB: Parsed redo log up to LSN=2014717452; to recover: 126885 pages
|
2023-04-03 16:39:05 0 [Note] InnoDB: Parsed redo log up to LSN=2656759712; to recover: 181122 pages
|
2023-04-03 16:39:20 0 [Note] InnoDB: Parsed redo log up to LSN=3326397264; to recover: 234593 pages
|
2023-04-03 16:39:35 0 [Note] InnoDB: Parsed redo log up to LSN=3906055802; to recover: 283483 pages
|
2023-04-03 16:39:50 0 [Note] InnoDB: Parsed redo log up to LSN=4484524600; to recover: 332314 pages
|
2023-04-03 16:40:05 0 [Note] InnoDB: Parsed redo log up to LSN=5066015464; to recover: 381400 pages
|
2023-04-03 16:40:20 0 [Note] InnoDB: Parsed redo log up to LSN=5680977027; to recover: 428113 pages
|
2023-04-03 16:40:35 0 [Note] InnoDB: To recover: 454513 pages from log
|
2023-04-03 16:40:36 0 [Note] InnoDB: 26 transaction(s) which must be rolled back or cleaned up in total 26 row operations to undo
|
2023-04-03 16:40:36 0 [Note] InnoDB: Trx id counter is 37992126
|
2023-04-03 16:40:36 0 [Note] InnoDB: Starting final batch to recover 454423 pages from redo log.
|
2023-04-03 16:40:50 0 [Note] InnoDB: To recover: 206743 pages from log
|
2023-04-03 16:41:05 0 [Note] InnoDB: To recover: 67047 pages from log
|
2023-04-03 16:41:20 0 [Note] InnoDB: To recover: 22798 pages from log
|
2023-04-03 16:41:30 0 [Note] InnoDB: 128 rollback segments are active.
|
2023-04-03 16:41:30 0 [Note] InnoDB: Starting in background the rollback of recovered transactions
|
…
|
2023-04-03 16:41:31 0 [Note] InnoDB: log sequence number 6041702829 (memory-mapped); transaction id 37992153
|
This run is not exactly comparable to what I ran on Friday, because there is almost 1GiB less of log to recover (the checkpoint LSN starts at almost 1G). Anyway, the 82-second waste is gone, and we are down from 292 seconds to 176 seconds. For a fairer comparison to the previous run, let us multiply that by 6/5: 1.2 × 176 s ≈ 3.5 minutes, much less than the almost 5 minutes.
To speed up recovery further, I think that both RECV_WILL_NOT_READ and RECV_NOT_PROCESSED should be treated by buf_read_recv_pages(). For RECV_WILL_NOT_READ, we need to introduce a "fake read" operation, which would invoke the recovery function directly, without actually reading anything from the file system.
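As a conceptual sketch of that dispatch (my own illustration, not InnoDB code): a page in the RECV_WILL_NOT_READ state would have its apply step invoked directly, as if a read had just completed, while other pages go through a normal asynchronous read first.
#include <cstdio>
#include <functional>
#include <vector>

// Conceptual sketch only. The enum mirrors the state names mentioned above;
// everything else is illustrative.
enum class recv_state { WILL_NOT_READ, NOT_PROCESSED };

struct page_task
{
  unsigned page_no;
  recv_state state;
};

static void apply_log_to_page(unsigned page_no)
{
  std::printf("applying log to page %u\n", page_no);
}

static void submit(const page_task &t,
                   const std::function<void(unsigned)> &read_async)
{
  if (t.state == recv_state::WILL_NOT_READ)
    apply_log_to_page(t.page_no);   // "fake read": no file access at all
  else
    read_async(t.page_no);          // normal path: read the page, then apply
}

int main()
{
  std::vector<page_task> tasks{{1, recv_state::WILL_NOT_READ},
                               {2, recv_state::NOT_PROCESSED}};
  for (const page_task &t : tasks)
    submit(t, [](unsigned p) { std::printf("reading page %u\n", p); });
}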
I am afraid that the parsing (here: about 40 MiB/s for the first 121 seconds) must remain single-threaded. With parallel processing, it should be feasible to cut the remaining 55 seconds to 2 or 3 seconds. The original run that produced the log was with 64 concurrent connections, maybe 1 to 1½ minutes for loading the data and another 1½ minutes for the workload. That is, with that final fix, the database recovery should consume a similar amount of time as creating the database took in the first place.
It looks like I may have gotten the improved OOM processing to work. We will avoid re-parsing everything from the recovery checkpoint onwards, but we do have to re-parse everything from the start of the mini-transaction at which we ran out of memory for log records. For larger buffer pools, only 256 pages will be reserved for actual buffer pool pages; the rest can be consumed by log records.
I implemented a proof of concept for asynchronous "fake reads" for multi-threaded recovery. In my quick test with innodb_log_file_size=10G and innodb_buffer_pool_size=20G, it now recovers in a single batch when using the same-size buffer pool. The concurrency is not good yet: the best I saw with innodb_read_io_threads=64 (the maximum) was slightly less than 370% of one CPU core being utilized, and typically it was around 150% during the log application. I will streamline the work submission, and I will have to retest with innodb_buffer_pool_chunk_size=20g because the recv_sys.mutex could be a bottleneck in the memory management:
24,89% mariadbd mariadbd [.] page_recv_t::recs_t::clear
|
18,16% mariadbd mariadbd [.] recv_sys_t::parse<recv_buf, true>
|
12,39% mariadbd mariadbd [.] std::_Rb_tree<page_id_t const, std::pair<page_id_t const, page_recv_t>, std::_Select1st<std::pair<page_id_t const, page_recv_t> >, std::less<page_id_t const>, ut_allocator<std::pair<page_id_t const, page_recv_t>, true> >::_M_emplace_unique<page_id_t const&, page_recv_t>
|
5,20% mariadbd mariadbd [.] recv_recover_page
|
4,80% mariadbd mariadbd [.] recv_sys_t::add
|
3,80% mariadbd mariadbd [.] log_phys_t::apply
|
3,64% mariadbd [kernel.kallsyms] [k] clear_page_erms
|
1,21% mariadbd libc.so.6 [.] _int_free
|
1,06% mariadbd libc.so.6 [.] __memmove_avx_unaligned_erms_rtm
|
0,98% mariadbd mariadbd [.] crc32c_3way
|
0,90% mariadbd mariadbd [.] page_apply_insert_dynamic
|
Using a single buffer pool chunk sped it up, and the CPU utilization could have been over 300% for the most part (still a far cry from the 40 that nproc reports):
21,85% mariadbd mariadbd [.] recv_sys_t::parse<recv_buf, true>
|
14,65% mariadbd mariadbd [.] std::_Rb_tree<page_id_t const, std::pair<page_id_t const, page_recv_t>, std::_Select1st<std::pair<page_id_t const, page_recv_t> >, std::less<page_id_t const>, ut_allocator<std::pair<page_id_t const, page_recv_t>, true> >::_M_emplace_unique<page_id_t const&, page_recv_t>
|
7,24% mariadbd [kernel.kallsyms] [k] clear_page_erms
|
5,62% mariadbd mariadbd [.] recv_recover_page
|
4,53% mariadbd mariadbd [.] recv_sys_t::add
|
4,52% mariadbd mariadbd [.] log_phys_t::apply
|
3,90% mariadbd mariadbd [.] page_recv_t::recs_t::clear
|
1,50% mariadbd libc.so.6 [.] _int_free
|
1,26% mariadbd libc.so.6 [.] __memmove_avx_unaligned_erms_rtm
|
1,26% mariadbd mariadbd [.] page_apply_insert_dynamic
|
1,17% mariadbd [kernel.kallsyms] [k] copy_page
|
1,13% mariadbd mariadbd [.] crc32c_3way
|
It now took this much wall clock time:
2023-04-04 16:27:31 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=928593106
|
2023-04-04 16:27:46 0 [Note] InnoDB: Parsed redo log up to LSN=2027073563; to recover: 128070 pages
|
…
|
2023-04-04 16:29:01 0 [Note] InnoDB: Parsed redo log up to LSN=5331384873; to recover: 402752 pages
|
2023-04-04 16:29:16 0 [Note] InnoDB: To recover: 453542 pages from log
|
2023-04-04 16:29:21 0 [Note] InnoDB: 15 transaction(s) which must be rolled back or cleaned up in total 15 row operations to undo
|
2023-04-04 16:29:21 0 [Note] InnoDB: Trx id counter is 37845466
|
2023-04-04 16:29:21 0 [Note] InnoDB: Starting final batch to recover 453104 pages from redo log.
|
…
|
2023-04-04 16:30:04 0 [Note] InnoDB: log sequence number 6018444686 (memory-mapped); transaction id 37845482
|
The log parsing speed is about (6018444686-928593106)/105s = 46 MiB/s. That time is unaffected by the innodb_buffer_pool_chunk_size; I checked two previous runs.
Applying the log took 43 seconds. With the default innodb_buffer_pool_chunk_size and innodb_read_io_threads=4, it took 189 seconds:
2023-04-04 16:03:32 0 [Note] InnoDB: Starting final batch to recover 453104 pages from redo log.
|
…
|
2023-04-04 16:06:41 0 [Note] InnoDB: log sequence number 6018444686 (memory-mapped); transaction id 37845482
|
With the default innodb_buffer_pool_chunk_size and the maximum innodb_read_io_threads=64, it took a couple of seconds more (191 seconds):
2023-04-04 16:13:01 0 [Note] InnoDB: Starting final batch to recover 453104 pages from redo log.
|
…
|
2023-04-04 16:16:12 0 [Note] InnoDB: log sequence number 6018444686 (memory-mapped); transaction id 37845482
|
We must reduce contention on recv_sys.mutex. One way could be to lazily clear entries from recv_sys.pages. Hopefully then it will be acceptably fast even without specifying a maximal innodb_buffer_pool_chunk_size.
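One possible shape of that lazy clearing, sketched under my own assumptions (this is not the actual recv_sys.pages code): mark applied entries under the mutex and erase them in bulk later.
#include <cstdio>
#include <iterator>
#include <map>

// Sketch of "lazily clearing" a page map: applied entries are only marked
// while the mutex is held, and physically erased later in one sweep, so the
// per-page critical section stays short. Illustrative names only.
struct page_recs
{
  bool applied = false;
  // ... the parsed log records for the page would live here ...
};

struct lazy_page_map
{
  std::map<unsigned, page_recs> pages;

  void mark_applied(unsigned page_no)  // cheap, done under the mutex
  {
    pages[page_no].applied = true;
  }

  void sweep()                         // done occasionally, in bulk
  {
    for (auto it = pages.begin(); it != pages.end(); )
      it = it->second.applied ? pages.erase(it) : std::next(it);
  }
};

int main()
{
  lazy_page_map m;
  m.pages[1] = {};
  m.pages[2] = {};
  m.mark_applied(1);
  m.sweep();
  std::printf("%zu pages remain\n", m.pages.size());
}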
When we run out of memory while storing parsed log records into recv_sys.pages, we only need to rewind to the checkpoint LSN if the first call to recv_init_crash_recovery_spaces() reported that some tablespaces may be missing. Later, a special version of this validation loop could be implemented for DDL workloads.
I tried to recover from a crash that used a 20GiB buffer pool and 10GiB redo log, by using only 1GiB buffer pool:
sql/mariadbd --gdb --innodb-log-file-size=10g --innodb-buffer-pool-{,chunk-}size=1g --innodb-read-io-threads=64 --datadir /dev/shm/sbtest.backup/
|
Progress reporting looks like this:
2023-04-08 21:32:46 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=503549688
|
2023-04-08 21:32:49 0 [Note] InnoDB: Recovery ran out of memory at LSN 1083025911
|
2023-04-08 21:32:58 0 [Note] InnoDB: End of log at LSN=4483551634
|
2023-04-08 21:32:59 0 [Note] InnoDB: To recover: LSN 1153652154/4483551634; 32645 pages
|
2023-04-08 21:33:08 0 [Note] InnoDB: To recover: LSN 1637462775/4483551634; 91426 pages
|
2023-04-08 21:33:23 0 [Note] InnoDB: Parsed redo log up to LSN=1961012462; to recover: 106911 pages
|
2023-04-08 21:33:23 0 [Note] InnoDB: To recover: LSN 1987860863/4483551634; 109501 pages
|
2023-04-08 21:33:38 0 [Note] InnoDB: Parsed redo log up to LSN=2220397229; to recover: 97796 pages
|
2023-04-08 21:33:41 0 [Note] InnoDB: To recover: LSN 2330818601/4483551634; 108920 pages
|
2023-04-08 21:33:56 0 [Note] InnoDB: Parsed redo log up to LSN=2581361215; to recover: 99426 pages
|
2023-04-08 21:33:58 0 [Note] InnoDB: To recover: LSN 2667451691/4483551634; 107164 pages
|
2023-04-08 21:34:12 0 [Note] InnoDB: To recover: LSN 3002207709/4483551634; 105708 pages
|
2023-04-08 21:34:27 0 [Note] InnoDB: Parsed redo log up to LSN=3248729981; to recover: 109527 pages
|
2023-04-08 21:34:29 0 [Note] InnoDB: To recover: LSN 3318003147/4483551634; 116461 pages
|
2023-04-08 21:34:44 0 [Note] InnoDB: Parsed redo log up to LSN=3517427589; to recover: 105170 pages
|
2023-04-08 21:34:47 0 [Note] InnoDB: To recover: LSN 3638007653/4483551634; 117075 pages
|
2023-04-08 21:35:02 0 [Note] InnoDB: Parsed redo log up to LSN=3826053254; to recover: 104253 pages
|
2023-04-08 21:35:06 0 [Note] InnoDB: To recover: LSN 3959684337/4483551634; 117624 pages
|
2023-04-08 21:35:21 0 [Note] InnoDB: Parsed redo log up to LSN=4139117168; to recover: 103447 pages
|
2023-04-08 21:35:25 0 [Note] InnoDB: To recover: LSN 4280878385/4483551634; 117791 pages
|
2023-04-08 21:35:40 0 [Note] InnoDB: Parsed redo log up to LSN=4432727985; to recover: 100850 pages
|
2023-04-08 21:35:41 0 [Note] InnoDB: 14 transaction(s) which must be rolled back or cleaned up in total 14 row operations to undo
|
2023-04-08 21:35:41 0 [Note] InnoDB: Trx id counter is 26268325
|
2023-04-08 21:35:42 0 [Note] InnoDB: To recover: 105496 pages
|
2023-04-08 21:35:46 0 [Note] InnoDB: 128 rollback segments are active.
|
2023-04-08 21:35:46 0 [Note] InnoDB: Starting in background the rollback of recovered transactions
|
…
|
2023-04-08 21:35:46 0 [Note] InnoDB: log sequence number 4483551634 (memory-mapped); transaction id 26268340
|
At the start of each recovery batch, a message "To recover:" will be output unconditionally. Subsequent messages of the same or different type will be output every 15 seconds. The final recovery batch (105496 pages) started at 21:35:42 and finished at 21:35:46. For reporting progress before the final batch, we report both the current recovered LSN and the final LSN, as well as the number of pages currently waiting to be recovered. The number of pages increases during parsing and decreases while applying log.
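A small sketch of how such a combined line can be composed from the quantities mentioned (current recovered LSN, final LSN, pending pages); the names are illustrative:
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Illustrative only: format the dual progress line shown above,
// "To recover: LSN <current>/<final>; <N> pages".
static void report_batch_progress(uint64_t current_lsn, uint64_t final_lsn,
                                  std::size_t pending_pages)
{
  std::printf("To recover: LSN %llu/%llu; %zu pages\n",
              static_cast<unsigned long long>(current_lsn),
              static_cast<unsigned long long>(final_lsn), pending_pages);
}

int main()
{
  // Figures taken from the log excerpt above.
  report_batch_progress(1153652154ULL, 4483551634ULL, 32645);
}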
For comparison, I recovered the same dataset using a 10GiB buffer pool (half the size of the original buffer pool). It will run out of memory, even though the redo log is only 10GiB:
2023-04-08 21:40:07 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=503549688
|
2023-04-08 21:40:22 0 [Note] InnoDB: Parsed redo log up to LSN=1990840177; to recover: 124806 pages
|
2023-04-08 21:40:37 0 [Note] InnoDB: Parsed redo log up to LSN=2729777071; to recover: 186123 pages
|
2023-04-08 21:40:52 0 [Note] InnoDB: Parsed redo log up to LSN=3488599173; to recover: 248397 pages
|
2023-04-08 21:41:07 0 [Note] InnoDB: Parsed redo log up to LSN=4177856618; to recover: 306469 pages
|
2023-04-08 21:41:07 0 [Note] InnoDB: Recovery ran out of memory at LSN 4189599815
|
2023-04-08 21:41:07 0 [Note] InnoDB: End of log at LSN=4483551634
|
2023-04-08 21:41:08 0 [Note] InnoDB: To recover: LSN 4189599815/4483551634; 307490 pages
|
2023-04-08 21:41:23 0 [Note] InnoDB: To recover: LSN 4189599815/4483551634; 197159 pages
|
2023-04-08 21:41:38 0 [Note] InnoDB: To recover: LSN 4189599815/4483551634; 67623 pages
|
2023-04-08 21:41:53 0 [Note] InnoDB: Parsed redo log up to LSN=4353924218; to recover: 102083 pages
|
2023-04-08 21:41:57 0 [Note] InnoDB: 14 transaction(s) which must be rolled back or cleaned up in total 14 row operations to undo
|
2023-04-08 21:41:57 0 [Note] InnoDB: Trx id counter is 26268325
|
2023-04-08 21:41:57 0 [Note] InnoDB: To recover: 114621 pages
|
2023-04-08 21:42:01 0 [Note] InnoDB: 128 rollback segments are active.
|
2023-04-08 21:42:01 0 [Note] InnoDB: Starting in background the rollback of recovered transactions
|
…
|
2023-04-08 21:42:01 0 [Note] InnoDB: log sequence number 4483551634 (memory-mapped); transaction id 26268340
|
Finally, with a 20GiB buffer pool (the same size as on the server that was killed during the workload):
2023-04-08 21:43:04 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=503549688
|
2023-04-08 21:43:19 0 [Note] InnoDB: Parsed redo log up to LSN=1998701027; to recover: 125560 pages
|
2023-04-08 21:43:34 0 [Note] InnoDB: Parsed redo log up to LSN=2734136874; to recover: 186446 pages
|
2023-04-08 21:43:49 0 [Note] InnoDB: Parsed redo log up to LSN=3499505504; to recover: 249378 pages
|
2023-04-08 21:44:04 0 [Note] InnoDB: Parsed redo log up to LSN=4183247844; to recover: 306964 pages
|
2023-04-08 21:44:10 0 [Note] InnoDB: End of log at LSN=4483551634
|
2023-04-08 21:44:12 0 [Note] InnoDB: 14 transaction(s) which must be rolled back or cleaned up in total 14 row operations to undo
|
2023-04-08 21:44:12 0 [Note] InnoDB: Trx id counter is 26268325
|
2023-04-08 21:44:14 0 [Note] InnoDB: To recover: 331797 pages
|
2023-04-08 21:44:36 0 [Note] InnoDB: 128 rollback segments are active.
|
2023-04-08 21:44:36 0 [Note] InnoDB: Starting in background the rollback of recovered transactions
|
2023-04-08 21:44:36 0 [Note] InnoDB: To roll back: 14 transactions, 14 rows
|
…
|
2023-04-08 21:44:36 0 [Note] InnoDB: log sequence number 4483551634 (memory-mapped); transaction id 26268328
|
Summary:
buffer pool size | time until end-of-log detection | total recovery time
1GiB             | 12s                             | 180s
10GiB            | 60s                             | 114s
20GiB            | 66s                             | 92s
When using a smaller buffer pool, the end LSN will be reached faster, because once we run out of memory, the rest will be just parsed and discarded. When all parsed log fits in the buffer pool, parsing will take 66 seconds and applying the log takes 26 seconds. When pages need to be read from the file system and written to slower storage than the RAM disk that I was using, it will probably look quite different. In any case, using more RAM and CPUs should be helpful.
When pages need to be read from the file system, recv_sys.mutex may continue to be a significant bottleneck, depending on the type of storage. I only avoid acquiring it in the new "fake asynchronous read" code path.
Before MDEV-14425 simplified the log format in MariaDB Server 10.8, there were two log parsers: a log block parser and a log record parser. The log records were not contiguous in the log file: they could be interrupted by the headers and footers of 512-byte log blocks. The mapping from log sequence number (LSN) to byte offsets was more complicated. Starting with 10.8, the log file offset is simply a base offset + LSN modulo (ib_logfile0 file size - 12288 bytes), and the only discontinuity is when ib_logfile0 wraps around from the end to byte offset 12288.
Due to this, it is not feasible to port some of the log parsing optimizations to 10.6 or earlier versions.
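In code form, the 10.8+ mapping described above looks roughly like this (a sketch under the stated assumptions; the 12,288 bytes are the header size mentioned, and the names are illustrative, not the exact log_sys code):
#include <cstdint>
#include <cstdio>

// Sketch of the 10.8+ mapping: the circular payload area of ib_logfile0
// starts at byte 12288, so a file offset is that base offset plus the LSN
// distance modulo the payload size.
static const uint64_t HEADER_SIZE = 12288;

static uint64_t lsn_to_file_offset(uint64_t lsn, uint64_t base_lsn,
                                   uint64_t file_size)
{
  const uint64_t payload = file_size - HEADER_SIZE;
  return HEADER_SIZE + (lsn - base_lsn) % payload;
}

int main()
{
  // A 1 GiB log file; made-up LSNs, just to show the calculation.
  const uint64_t file_size = 1ULL << 30;
  const uint64_t base_lsn  = 100000;
  std::printf("%llu\n",
              (unsigned long long) lsn_to_file_offset(base_lsn + 5, base_lsn,
                                                      file_size));
}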
I am working on a 10.6 port of this. It will be less efficient due to the more complicated memory management and parsing of redo log records.
The following is from a recovery of Sysbench oltp_update_index:
2023-04-17 12:21:55 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=628599973,5653727799
|
2023-04-17 12:22:10 0 [Note] InnoDB: Read redo log up to LSN=1935453184
|
2023-04-17 12:22:25 0 [Note] InnoDB: Read redo log up to LSN=2584783872
|
2023-04-17 12:22:40 0 [Note] InnoDB: Read redo log up to LSN=3218648064
|
2023-04-17 12:22:55 0 [Note] InnoDB: Read redo log up to LSN=3803884544
|
2023-04-17 12:23:02 0 [Note] InnoDB: Recovery ran out of memory at LSN 4068555446
|
2023-04-17 12:23:10 0 [Note] InnoDB: Read redo log up to LSN=5356432384
|
2023-04-17 12:23:25 0 [Note] InnoDB: Read redo log up to LSN=1747627008
|
2023-04-17 12:23:40 0 [Note] InnoDB: Read redo log up to LSN=2296949760
|
2023-04-17 12:23:55 0 [Note] InnoDB: Read redo log up to LSN=2934352896
|
2023-04-17 12:24:10 0 [Note] InnoDB: Read redo log up to LSN=3416960000
|
2023-04-17 12:24:25 0 [Note] InnoDB: Read redo log up to LSN=3911756800
|
2023-04-17 12:24:39 0 [Note] InnoDB: To recover: LSN 4375342769/4375358464; 338590 pages
|
2023-04-17 12:24:54 0 [Note] InnoDB: To recover: LSN 4375342769/4375358464; 325799 pages
|
2023-04-17 12:25:09 0 [Note] InnoDB: To recover: LSN 4375342769/4375358464; 301991 pages
|
2023-04-17 12:25:24 0 [Note] InnoDB: To recover: LSN 4375342769/4375358464; 273831 pages
|
2023-04-17 12:25:39 0 [Note] InnoDB: To recover: LSN 4375342769/4375358464; 241575 pages
|
2023-04-17 12:25:54 0 [Note] InnoDB: To recover: LSN 4375342769/4375358464; 194727 pages
|
2023-04-17 12:26:09 0 [Note] InnoDB: To recover: LSN 4375342769/4375358464; 105895 pages
|
2023-04-17 12:26:24 0 [Note] InnoDB: To recover: LSN 4375342769/4375358464; 36263 pages
|
2023-04-17 12:26:43 0 [Note] InnoDB: Read redo log up to LSN=4375424000
|
2023-04-17 12:26:58 0 [Note] InnoDB: Read redo log up to LSN=4886932480
|
2023-04-17 12:27:13 0 [Note] InnoDB: Read redo log up to LSN=5513325568
|
2023-04-17 12:27:18 0 [Note] InnoDB: 19 transaction(s) which must be rolled back or cleaned up in total 19 row operations to undo
|
2023-04-17 12:27:18 0 [Note] InnoDB: Trx id counter is 37315565
|
2023-04-17 12:27:20 0 [Note] InnoDB: To recover: 205090 pages
|
2023-04-17 12:27:51 0 [Note] InnoDB: 128 rollback segments are active.
|
2023-04-17 12:27:51 0 [Note] InnoDB: Starting in background the rollback of recovered transactions
|
2023-04-17 12:27:51 0 [Note] InnoDB: To roll back: 19 transactions, 19 rows
|
…
|
2023-04-17 12:27:51 0 [Note] InnoDB: Rollback of non-prepared transactions completed
|
2023-04-17 12:27:51 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
|
2023-04-17 12:27:51 0 [Note] InnoDB: 10.6.13 started; log sequence number 5670373116; transaction id 37315585
|
In 10.6, if we run out of memory, we will always restart parsing the log from the start.
32,71% mariadbd mariadbd [.] recv_sys_t::erase
|
19,96% mariadbd mariadbd [.] recv_sys_t::parse
|
13,90% mariadbd mariadbd [.] …::_M_emplace_unique<page_id_t const&, page_recv_t>
|
3,19% mariadbd libstdc++.so.6.0.30 [.] std::_Rb_tree_increment
|
3,16% mariadbd mariadbd [.] recv_sys_t::add
|
2,91% mariadbd mariadbd [.] recv_recover_page
|
The CPU time was dominated by adding and removing items, because I forgot to set innodb_buffer_pool_chunk_size equal to the innodb_buffer_pool_size (10GiB). With a 30GiB buffer pool (the same as before the server was killed) in one chunk, it recovers faster:
2023-04-17 12:32:59 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=628599973,5653727799
2023-04-17 12:33:14 0 [Note] InnoDB: Read redo log up to LSN=1984539648
2023-04-17 12:33:29 0 [Note] InnoDB: Read redo log up to LSN=2710875136
2023-04-17 12:33:44 0 [Note] InnoDB: Read redo log up to LSN=3358895104
2023-04-17 12:33:59 0 [Note] InnoDB: Read redo log up to LSN=3965299712
2023-04-17 12:34:14 0 [Note] InnoDB: Read redo log up to LSN=4557417472
2023-04-17 12:34:29 0 [Note] InnoDB: Read redo log up to LSN=5219527680
2023-04-17 12:34:43 0 [Note] InnoDB: 19 transaction(s) which must be rolled back or cleaned up in total 19 row operations to undo
2023-04-17 12:34:43 0 [Note] InnoDB: Trx id counter is 37315565
2023-04-17 12:34:45 0 [Note] InnoDB: To recover: 450915 pages
2023-04-17 12:35:07 0 [Note] InnoDB: 128 rollback segments are active.
2023-04-17 12:35:07 0 [Note] InnoDB: Starting in background the rollback of recovered transactions
2023-04-17 12:35:07 0 [Note] InnoDB: To roll back: 19 transactions, 19 rows
…
2023-04-17 12:35:07 0 [Note] InnoDB: Rollback of non-prepared transactions completed
2023-04-17 12:35:07 0 [Note] InnoDB: Removed temporary tablespace data file: "./ibtmp1"
2023-04-17 12:35:07 0 [Note] InnoDB: Creating shared tablespace for temporary tables
2023-04-17 12:35:07 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
2023-04-17 12:35:07 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
2023-04-17 12:35:07 0 [Note] InnoDB: 10.6.13 started; log sequence number 5670373116; transaction id 37315585
Now the CPU usage makes more sense too:
24,31% mariadbd mariadbd [.] recv_sys_t::parse
18,11% mariadbd mariadbd [.] …::_M_emplace_unique<page_id_t const&, page_recv_t>
11,15% mariadbd mariadbd [.] recv_sys_t::erase
6,68% mariadbd mariadbd [.] recv_recover_page
5,43% mariadbd mariadbd [.] log_phys_t::apply
3,74% mariadbd mariadbd [.] recv_sys_t::add
Finally, here is a recovery with a 5GiB buffer pool in one chunk:
2023-04-17 12:38:34 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=628599973,5653727799
2023-04-17 12:38:49 0 [Note] InnoDB: Read redo log up to LSN=1963895808
2023-04-17 12:39:01 0 [Note] InnoDB: Recovery ran out of memory at LSN 2534560930
2023-04-17 12:39:04 0 [Note] InnoDB: Read redo log up to LSN=3312233472
2023-04-17 12:39:19 0 [Note] InnoDB: Read redo log up to LSN=1599646720
2023-04-17 12:39:34 0 [Note] InnoDB: Read redo log up to LSN=2160831488
2023-04-17 12:39:48 0 [Note] InnoDB: To recover: LSN 2806789376/2806819840; 195082 pages
2023-04-17 12:40:03 0 [Note] InnoDB: To recover: LSN 2806789376/2806819840; 63507 pages
2023-04-17 12:40:18 0 [Note] InnoDB: Read redo log up to LSN=3195776000
2023-04-17 12:40:33 0 [Note] InnoDB: Read redo log up to LSN=3687099392
2023-04-17 12:40:48 0 [Note] InnoDB: Read redo log up to LSN=4165315584
2023-04-17 12:40:54 0 [Note] InnoDB: To recover: LSN 4374395699/4374440960; 241454 pages
2023-04-17 12:41:09 0 [Note] InnoDB: To recover: LSN 4374395699/4374440960; 123701 pages
2023-04-17 12:41:24 0 [Note] InnoDB: Read redo log up to LSN=4508724224
2023-04-17 12:41:39 0 [Note] InnoDB: Read redo log up to LSN=5094550528
2023-04-17 12:41:54 0 [Note] InnoDB: 19 transaction(s) which must be rolled back or cleaned up in total 19 row operations to undo
2023-04-17 12:41:54 0 [Note] InnoDB: Trx id counter is 37315565
2023-04-17 12:41:54 0 [Note] InnoDB: To recover: 205230 pages
…
2023-04-17 12:42:02 0 [Note] InnoDB: Rollback of non-prepared transactions completed
2023-04-17 12:42:02 0 [Note] InnoDB: Removed temporary tablespace data file: "./ibtmp1"
2023-04-17 12:42:02 0 [Note] InnoDB: Creating shared tablespace for temporary tables
2023-04-17 12:42:02 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
2023-04-17 12:42:02 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
2023-04-17 12:42:02 0 [Note] InnoDB: 10.6.13 started; log sequence number 5670373116; transaction id 37315585
Summary:
| buffer pool size | time until parsing OOM or EOF | total recovery time |
|---|---|---|
| 5GiB | 27s | 208s |
| 10GiB | 67s | 356s |
| 30GiB | 106s | 128s |
The 10GiB case was with multiple buffer pool chunks. That is why it consumed much more time than the recovery with a 5GiB buffer pool in one chunk.
Here are some perf report numbers for the 10.6 recovery with innodb_buffer_pool_size=5G in one chunk:
26,68% mariadbd mariadbd [.] recv_sys_t::parse
17,08% mariadbd mariadbd [.] …::_M_emplace_unique<page_id_t const&, page_recv_t>
6,52% mariadbd mariadbd [.] recv_sys_t::erase
4,53% mariadbd libstdc++.so.6.0.30 [.] std::_Rb_tree_increment
4,49% mariadbd libc.so.6 [.] pthread_mutex_lock@@GLIBC_2.2.5
4,34% mariadbd mariadbd [.] recv_sys_t::add
4,07% mariadbd mariadbd [.] log_phys_t::apply
4,04% mariadbd mariadbd [.] recv_recover_page
2,26% mariadbd mariadbd [.] fil_space_t::get
2,09% mariadbd libc.so.6 [.] __GI___pthread_mutex_unlock_usercnt
1,49% mariadbd libc.so.6 [.] _int_free
1,35% mariadbd mariadbd [.] page_apply_insert_dynamic
1,26% mariadbd libc.so.6 [.] __memmove_avx_unaligned_erms_rtm
1,01% mariadbd [kernel.kallsyms] [k] clear_page_erms
0,92% mariadbd [kernel.kallsyms] [k] copy_user_enhanced_fast_string
0,87% mariadbd [kernel.kallsyms] [k] syscall_exit_to_user_mode
0,80% mariadbd mariadbd [.] crc32c_3way
One possible improvement would be to invoke recv_sys.pages.clear() instead of erasing individual elements. Only a couple of percent of the CPU time was actually spent on interesting or known-to-be CPU-intensive work (applying log to a page, or computing checksums).
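To make that point concrete, here is a minimal stand-alone sketch using a plain std::map with a hypothetical value type (page_records); it is not the actual recv_sys_t::pages container or its element types.

```cpp
// Illustration only: draining a large ordered map element by element does
// per-node bookkeeping and tree rebalancing on every erase(), whereas
// clear() destroys all nodes in a single traversal.
#include <cstdint>
#include <map>

struct page_records { int placeholder; };  // hypothetical stand-in

// Loosely corresponds to the recv_sys_t::erase hot spot in the profiles above.
void drain_one_by_one(std::map<uint64_t, page_records> &pages)
{
  for (auto it = pages.begin(); it != pages.end(); )
    it = pages.erase(it);
}

// Cheaper when the whole map is being discarded anyway.
void drain_all_at_once(std::map<uint64_t, page_records> &pages)
{
  pages.clear();
}
```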
The out-of-memory handling in the 10.6 (pre-MDEV-14425) version of this will be inferior to the later version in two respects. If recovery runs out of memory in the 10.6 version:
- It will restart parsing and storing the log from the checkpoint. It can never resume from the "OOM LSN".
- Each batch will scan the log to the end. I did not get my optimization of this to work; some DDL crash recovery tests would occasionally fail.
Unfortunately, both the 10.8 and the 10.6 versions of this are acting up. One test that occasionally fails is innodb.alter_copy. I have not yet tested whether that test is stable in the original development version. In 11.0, the buffer pool and recovery were simplified thanks to MDEV-29694 (removing the InnoDB change buffer); that could play a role here.
I got an rr replay trace of the test innodb.alter_copy from the 10.8 version where fil_validate() fails during IORequest::write_complete(), on a tablespace that is in the process of being created in recv_sys_t::recover_deferred(). I think that we must protect that better by holding fil_system.mutex until the tablespace has been fully created. This was fixed in MDEV-31080.
I was asked to improve the wording of the "Recovery ran out of memory at LSN" message. I think that the following should be more informative:
[Note] InnoDB: Multi-batch recovery needed at LSN 4189599815
There is not enough time to test this thoroughly before the upcoming quarterly releases.
With the current fix that I have for 10.6 (and which might be feasible to port to 10.5 as well), some excessive or unnecessary log scanning will take place. When the buffer pool is small, the log reads will basically be O(n²) instead of O(n). This will be better in the version that uses the MDEV-14425 log format.
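A back-of-envelope model of that difference (my own sketch, not InnoDB code; the log size and batch count below are arbitrary assumptions):

```cpp
// Rough cost model: if the log between the checkpoint and its end holds
// log_bytes and recovery needs `batches` passes, then restarting every pass
// from the checkpoint scans roughly batches * log_bytes, while resuming from
// the LSN where the previous pass stopped scans each byte only once.
// Because the number of batches itself grows with the amount of log, the
// former is roughly quadratic in the log size, the latter linear.
#include <cstdint>
#include <cstdio>

int main()
{
  const uint64_t log_bytes = 5ULL << 30;  // assume ~5 GiB of log to apply
  const uint64_t batches = 8;             // assumed number of recovery batches

  const uint64_t restart_from_checkpoint = batches * log_bytes;
  const uint64_t resume_from_stored_lsn = log_bytes;

  std::printf("restart from checkpoint: ~%llu GiB scanned\n",
              (unsigned long long) (restart_from_checkpoint >> 30));
  std::printf("resume from stored LSN:  ~%llu GiB scanned\n",
              (unsigned long long) (resume_from_stored_lsn >> 30));
  return 0;
}
```

Under these assumptions the old behaviour scans about 40 GiB of log versus 5 GiB for the resuming variant.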
I am currently working around the slow prepare of our 1.7TB database when doing a manual backup and restore by specifying the amount of memory to use:
sudo mariabackup --prepare --use-memory=464G
But the problem is that when an SST happens as part of normal recovery, it uses the default 100MB for use-memory, which then results in hours and hours of preparation.
Is it not possible to set --use-memory somewhere in the configuration so that the SST recovery script will use it?
stephanvos, can you please file a separate bug against wsrep_sst_mariabackup.sh so that this parameter can be specified? It is also conceivable that mariadb-backup --prepare would read the use_memory parameter from an appropriate section of a configuration file.
Thanks, I will do so.
Thank you, I see that you filed MDEV-35749 for the wsrep_sst_mariabackup.sh problem.
In MDEV-11027, the recovery progress reporting was refactored. For a normal server startup, progress should be reported every 15 seconds. I do not immediately see any reason why it would be different for mariadb-backup --backup.
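For reference, that reporting cadence amounts to a simple time-based throttle. The sketch below is my own illustration, not the server's actual reporting code; the class name progress_reporter is hypothetical.

```cpp
// Sketch of time-throttled progress output: print at most one message
// every 15 seconds, no matter how often the caller checks in.
#include <chrono>
#include <cstdint>
#include <cstdio>

class progress_reporter
{
  std::chrono::steady_clock::time_point last_report
    = std::chrono::steady_clock::now();

public:
  void maybe_report(uint64_t current_lsn)
  {
    const auto now = std::chrono::steady_clock::now();
    if (now - last_report >= std::chrono::seconds(15))
    {
      std::printf("[Note] InnoDB: Read redo log up to LSN=%llu\n",
                  (unsigned long long) current_lsn);
      last_report = now;
    }
  }
};
```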
Can you provide a test case that demonstrates how the lack of progress reporting can be reproduced?