[MDEV-29911] InnoDB recovery and mariadb-backup --prepare fail to report detailed progress Created: 2022-10-29  Updated: 2023-08-30  Resolved: 2023-05-19

Status: Closed
Project: MariaDB Server
Component/s: mariabackup, Storage Engine - InnoDB
Affects Version/s: 10.5, 10.6, 10.7, 10.8, 10.9, 10.10, 10.11
Fix Version/s: 10.6.15, 10.9.8, 10.10.6, 10.11.5, 11.0.3, 11.1.2

Type: Bug Priority: Critical
Reporter: Faisal Saeed (Inactive) Assignee: Marko Mäkelä
Resolution: Fixed Votes: 2
Labels: None

Attachments: big_restore.log, preparelog.txt
Issue Links:
Blocks
blocks MDEV-32042 Special handling of crash recovery in... Closed
is blocked by MDEV-31080 InnoDB: Failing assertion: space->siz... Closed
Problem/Incident
causes MDEV-31386 InnoDB: Failing assertion: page_type ... Closed
causes MDEV-31827 InnoDB multi-batch recovery stops pre... Closed
Relates
relates to MDEV-11027 InnoDB log recovery is too noisy Closed
relates to MDEV-31350 test innodb.recovery_memory failed on... Closed
relates to MDEV-31353 InnoDB recovery hangs after reporting... Closed
relates to MDEV-31362 recv_sys_t::apply(bool): Assertion `!... Closed
relates to MDEV-14481 Execute InnoDB crash recovery in the ... Closed
relates to MDEV-30053 mariabackup should indicate memory ne... Open
relates to MDEV-32029 Assertion failures in log_sort_flush_... Closed

 Description   

A user's mariadb-backup --prepare has been running for 12+ hours, reporting page numbers to recover. It then goes through multiple iterations of recovery with no indication of how much work is remaining or how much time it will take to complete.

It would be great if `--prepare` could report the percentage of page recovery completed per iteration, as well as a global completion percentage consolidating all iteration batches.



 Comments   
Comment by Marko Mäkelä [ 2022-10-30 ]

In MDEV-11027, the recovery progress reporting was refactored. For the normal server startup, progress should be reported every 15 seconds.

I do not immediately see any reason why it would be different for mariadb-backup --prepare.

Can you provide a test case with which the lack of progress reporting can be reproduced?

Comment by Rick Pizzi [ 2022-10-30 ]

marko the reporting is not giving any idea of how many batches are left - we really need a percentage of completion for the overall prepare phase.
Since it is reading the redo log file, it shouldn't be complicated to produce a percentage by comparing the current LSN against the final LSN. What do you think?

Thanks
Rick

Comment by Marko Mäkelä [ 2022-11-02 ]

I believe that there may be some overlap between recovery batches, that is, the same pages may be updated in multiple batches.

We should be able to give an estimate of how many batches will be needed after the current one, but it may be impossible to give an accurate estimate of the number of pages that are awaiting recovery in subsequent batches.

I would really like to see a complete log of --prepare or a server crash recovery that involves very many batches, so that I can validate some ideas that I have about this.

Comment by Faisal Saeed (Inactive) [ 2022-11-02 ]

marko `--prepare` log has been attached for your reference.

Comment by Marko Mäkelä [ 2022-11-03 ]

To anyone who tries to reproduce this: if preparing a backup only took a short time, there cannot have been much log written since the latest checkpoint. You would probably need some concurrent write workload on the server while the backup is running. While the backup is running, the server can receive further writes and produce more log. Backup cannot switch to a later checkpoint on the fly; it will copy all log from the checkpoint that was the latest when the backup started.

Log checkpoints starting with MariaDB Server 10.5.7 (after the refactoring in MDEV-23855 and MDEV-23399) are triggered at the end of each batch of writing modified pages from the buffer pool to the data files. That can happen every second if you have enabled "background flushing" (see MDEV-27295). The innodb_log_file_size and innodb_buffer_pool_size matter a lot.

When preparing a backup, the buffer pool size is specified by the --use-memory parameter. preparelog.txt seems to be using the default size of about 100MiB. The interesting part of the log starts like this:

2022-10-29 12:50:33 0 [Note] InnoDB: Initializing buffer pool, total size = 104857600, chunk size = 104857600
2022-10-29 12:50:33 0 [Note] InnoDB: Completed initialization of buffer pool
2022-10-29 12:50:33 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=113635532803184,113635804884887
2022-10-29 12:50:48 0 [Note] InnoDB: Read redo log up to LSN=113648002206720

The difference between the two LSNs in the checkpoint message is 272,081,703 bytes. This roughly corresponds to the distance between the previous and the latest checkpoint.

In the InnoDB crash recovery, until all of the circular ib_logfile0 has been parsed, we cannot know the last recovery LSN. We would know an upper bound of it: in the worst case, there will be log records right up to the file offset where the latest checkpoint started. The formula used to be more complicated, but with MDEV-14425 (MariaDB Server 10.8), the maximum end LSN should be the checkpoint LSN plus the size of the payload area (the size of ib_logfile0 minus the 12,288-byte header).

In mariadb-backup --prepare, the last recovery LSN would actually be known: it is roughly the sum of the checkpoint LSN and the size of the append-only ib_logfile0 that was created during backup.

In the log, the latest message about reading log that I can see is the following:

2022-10-30  1:44:21 0 [Note] InnoDB: To recover: 527 pages from log
2022-10-30  1:44:35 0 [Note] InnoDB: Starting a batch to recover 5062 pages from redo log.
2022-10-30  1:44:35 0 [Note] InnoDB: Starting a batch to recover 5289 pages from redo log.
2022-10-30  1:44:36 0 [Note] InnoDB: Read redo log up to LSN=113635785294848
2022-10-30  1:44:36 0 [Note] InnoDB: Starting a batch to recover 5390 pages from redo log.

This is actually 19,590,039 less than the LSN at the time the checkpoint was triggered before the backup was initiated.

In MDEV-21351 (MariaDB 10.5.1), the memory management during crash recovery was simplified and improved. The rule in recv_sys_t::is_memory_exhausted() is that the number of buffer pool pages that are allocated for redo log records must not exceed a third of the size of the buffer pool. Because memory blocks can be freed as soon as all log records in them have been applied to data pages, new recovery batches can be started. That is why we are seeing so many of those tiny recovery batches (such as 5,062 pages) above, starting several times per second.

The largest batch seems to be the following:

2022-10-29 12:52:29 0 [Note] InnoDB: Starting a batch to recover 82554 pages from redo log.

This means that up to 33MiB of redo log records (one third of the 100MiB buffer pool) are referring to that many pages. Initially, the buffer pool has space for 4,266 pages. Maybe some page reads can be avoided (MDEV-12699) when the pages are being freed or completely initialized according to log records. The 82,554 pages are 19 times the size of the available buffer pool.

According to the log, things seem to run extremely slowly:

2022-10-29 13:14:03 0 [Note] InnoDB: To recover: 988 pages from log
2022-10-29 13:14:18 0 [Note] InnoDB: To recover: 929 pages from log

How could it take 15 seconds to apply changes to only 59 pages? Was there any other I/O or CPU load concurrently running on the computer or the storage system where the backup was being prepared? What is the maximum write speed of the storage where the backup was being prepared? Are tables in ROW_FORMAT=COMPRESSED or page_compressed being used?

What is the size of the ib_logfile0 in the backup directory?

It might make sense to change the rule "use up to one third of the buffer pool for log records" to something like "reserve two thirds of the buffer pool for data pages to process in a batch". Any performance improvements would only be possible if someone can provide a copy of a backup with which this can be reproduced.

Depending on the server workload, one trick to improve --prepare speed would be to trigger a log checkpoint right before starting the backup:

SET @save_pct= @@GLOBAL.innodb_max_dirty_pages_pct;
SET @save_pct_lwm= @@GLOBAL.innodb_max_dirty_pages_pct_lwm;
 
SET GLOBAL innodb_max_dirty_pages_pct_lwm=0.0;
SET GLOBAL innodb_max_dirty_pages_pct=0.0;
 
-- wait for the following to reach 0
SELECT variable_value FROM information_schema.global_status
WHERE variable_name = 'INNODB_BUFFER_POOL_PAGES_DIRTY';
 
SET GLOBAL innodb_max_dirty_pages_pct = @save_pct;
SET GLOBAL innodb_max_dirty_pages_pct_lwm = @save_pct_lwm;
-- start the backup

Comment by Marko Mäkelä [ 2022-11-07 ]

If this occurred with MariaDB 10.6.5, there have been some bug fixes to recovery performance since then: MDEV-27022, MDEV-27058, MDEV-27183, MDEV-27610, not to mention several correctness fixes (mostly affecting the recovery of DDL operations).

It would be interesting to know how fast the mariadb-backup --prepare would run when using a more recent executable.

Comment by Marko Mäkelä [ 2022-12-05 ]

While testing the upgrade logic of MDEV-29694, I noticed that we are missing progress reporting for scanning and parsing the log:

10.8 0751bfbcaf08c7c32de8d90b41b6c50ab525da6f

2022-12-05 10:30:41 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=6574044319
2022-12-05 10:31:11 0 [Note] InnoDB: Read redo log up to LSN=6574044319
2022-12-05 10:31:47 0 [Note] InnoDB: Starting a batch to recover 333533 pages from redo log.
2022-12-05 10:31:47 0 [Note] InnoDB: To recover: 333533 pages from log
2022-12-05 10:32:16 0 [Note] InnoDB: To recover: 273605 pages from log
2022-12-05 10:32:16 0 [Note] InnoDB: 4 transaction(s) which must be rolled back or cleaned up in total 4 row operations to undo
2022-12-05 10:32:16 0 [Note] InnoDB: Trx id counter is 27822492
2022-12-05 10:32:16 0 [Note] InnoDB: Starting final batch to recover 273219 pages from redo log.
2022-12-05 10:32:23 0 [Note] InnoDB: 128 rollback segments are active.

We are supposed to generate progress reports every 15 seconds. That was not the case here, between the following lines:

2022-12-05 10:30:41 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=6574044319
2022-12-05 10:31:11 0 [Note] InnoDB: Read redo log up to LSN=6574044319

and

2022-12-05 10:31:47 0 [Note] InnoDB: To recover: 333533 pages from log
2022-12-05 10:32:16 0 [Note] InnoDB: To recover: 273605 pages from log

This was the recovery of a debug server that was killed in the middle of Sysbench oltp_update_index with innodb_change_buffering_debug=1, running on a RAM disk.
The recovery was conducted on a non-debug server using innodb_log_file_size=4G and innodb_buffer_pool_size=10G.

Because the killed server was a debug build, it was writing additional page checksum records (MDEV-18976). Disabling the creation of those records had the following effect on recovery performance:

10.8 0751bfbcaf08c7c32de8d90b41b6c50ab525da6f

2022-12-05 11:58:26 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=7523651578
2022-12-05 11:58:55 0 [Note] InnoDB: Read redo log up to LSN=7523651578
2022-12-05 11:59:30 0 [Note] InnoDB: Starting a batch to recover 339735 pages from redo log.
2022-12-05 11:59:30 0 [Note] InnoDB: To recover: 339735 pages from log
2022-12-05 11:59:47 0 [Note] InnoDB: To recover: 158840 pages from log
2022-12-05 11:59:47 0 [Note] InnoDB: Starting final batch to recover 158583 pages from redo log.
2022-12-05 11:59:48 0 [Note] InnoDB: 128 rollback segments are active.

Also in this case, several messages between 11:58:26 and 11:59:30 are significantly more than 15 seconds apart.

For the previous run, we could observe a recovery speed of 59928 pages / 29 seconds = 2066 pages/s, or if we count data page bytes, 33 MB/s.

With this run, for the first batch, we can calculate 180895 pages / 17 seconds = 10641 pages/s, or 166 MiB/s. That is more reasonable.

In any case, I can confirm that the progress reporting needs to be improved. The performance feels somewhat low too; we need to check for performance bottlenecks with a profiler.

Comment by Marko Mäkelä [ 2023-03-30 ]

If I recover using a smaller buffer pool, then there will be frequent messages about starting a new recovery batch, like this:

2023-03-30 14:16:10 0 [Note] InnoDB: Completed initialization of buffer pool
2023-03-30 14:16:10 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=890094812
2023-03-30 14:16:25 0 [Note] InnoDB: Starting a batch to recover 12978 pages from redo log.
2023-03-30 14:16:25 0 [Note] InnoDB: To recover: 12978 pages from log
2023-03-30 14:16:26 0 [Note] InnoDB: Starting a batch to recover 13571 pages from redo log.
2023-03-30 14:16:30 0 [Note] InnoDB: Starting a batch to recover 80468 pages from redo log.
2023-03-30 14:16:37 0 [Note] InnoDB: Starting a batch to recover 84957 pages from redo log.
2023-03-30 14:16:40 0 [Note] InnoDB: To recover: 50796 pages from log

The "To recover:" messages would appear every 15 seconds. However, if I start the recovery of the same data directory with a larger buffer pool, then it will be silent for longer periods of time:

2023-03-30 14:18:05 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=890094812
2023-03-30 14:18:38 0 [Note] InnoDB: Trying to recover page [page id: space=5, page number=2166] from the doublewrite buffer.
2023-03-30 14:18:38 0 [Note] InnoDB: Recovered page [page id: space=5, page number=2166] to './test/sbtest1.ibd' from the doublewrite buffer.
2023-03-30 14:18:39 0 [Note] InnoDB: Read redo log up to LSN=890094812
2023-03-30 14:19:12 0 [Note] InnoDB: Starting a batch to recover 153487 pages from redo log.
2023-03-30 14:19:12 0 [Note] InnoDB: To recover: 153487 pages from log
2023-03-30 14:19:56 0 [Note] InnoDB: Starting a batch to recover 183721 pages from redo log.
2023-03-30 14:19:56 0 [Note] InnoDB: To recover: 183721 pages from log
2023-03-30 14:20:11 0 [Note] InnoDB: To recover: 18334 pages from log
2023-03-30 14:20:46 0 [Note] InnoDB: Starting a batch to recover 192135 pages from redo log.
2023-03-30 14:20:46 0 [Note] InnoDB: To recover: 192135 pages from log
2023-03-30 14:21:01 0 [Note] InnoDB: To recover: 30398 pages from log
2023-03-30 14:21:34 0 [Note] InnoDB: Starting a batch to recover 188234 pages from redo log.
2023-03-30 14:21:34 0 [Note] InnoDB: To recover: 188234 pages from log
2023-03-30 14:21:49 0 [Note] InnoDB: To recover: 48679 pages from log
2023-03-30 14:22:01 0 [Note] InnoDB: 27 transaction(s) which must be rolled back or cleaned up in total 27 row operations to undo
2023-03-30 14:22:01 0 [Note] InnoDB: Trx id counter is 37362293
2023-03-30 14:22:01 0 [Note] InnoDB: Starting final batch to recover 104686 pages from redo log.
2023-03-30 14:22:04 0 [Note] InnoDB: To recover: 29018 pages from log
2023-03-30 14:22:05 0 [Note] InnoDB: 128 rollback segments are active.
2023-03-30 14:22:05 0 [Note] InnoDB: Starting in background the rollback of recovered transactions
2023-03-30 14:22:05 0 [Note] InnoDB: Rollback of non-prepared transactions completed
2023-03-30 14:22:05 0 [Note] InnoDB: File './ibtmp1' size is now 12.000MiB.
2023-03-30 14:22:05 0 [Note] InnoDB: log sequence number 5943243155 (memory-mapped); transaction id 37362321

The "To recover:" messages should be output every 15 seconds, but there were no messages at 14:20:26 or 14:20:41. The "Starting a batch to recover" messages are unconditionally output at the start of each batch.

During this experiment, it also caught my eye that the recovery is occupying only one CPU core. I suppose that recovery avoids reading most data pages thanks to MDEV-12699, and most of the recovery will simply initialize pages in the buffer pool. To analyze that, I started a perf record -g during a recovery of the same data directory contents with the same innodb_buffer_pool_size=10g, and I noticed another anomaly for this repeated run:

2023-03-30 14:30:58 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=890094812
2023-03-30 14:31:33 0 [Note] InnoDB: Read redo log up to LSN=890094812
2023-03-30 14:32:07 0 [Note] InnoDB: Starting a batch to recover 153487 pages from redo log.
2023-03-30 14:32:07 0 [Note] InnoDB: To recover: 153487 pages from log
2023-03-30 14:32:55 0 [Note] InnoDB: Starting a batch to recover 183721 pages from redo log.
2023-03-30 14:32:55 0 [Note] InnoDB: To recover: 183721 pages from log

There was no progress message about "Read redo log up to" for the first 35 seconds of the recovery. Also, the second "To recover" message was issued 48 seconds after the first one. It looks like the messages are somehow being missed sporadically. Another try with innodb_buffer_pool_size=20g:

2023-03-30 14:38:19 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=890094812
2023-03-30 14:39:22 0 [Note] InnoDB: Read redo log up to LSN=890094812
2023-03-30 14:40:33 0 [Note] InnoDB: Starting a batch to recover 249675 pages from redo log.
2023-03-30 14:40:33 0 [Note] InnoDB: To recover: 249675 pages from log
2023-03-30 14:40:48 0 [Note] InnoDB: To recover: 65746 pages from log
2023-03-30 14:42:20 0 [Note] InnoDB: Starting a batch to recover 299196 pages from redo log.
2023-03-30 14:42:20 0 [Note] InnoDB: To recover: 299196 pages from log
2023-03-30 14:42:35 0 [Note] InnoDB: To recover: 204605 pages from log
2023-03-30 14:42:50 0 [Note] InnoDB: To recover: 76653 pages from log
2023-03-30 14:43:05 0 [Note] InnoDB: To recover: 3039 pages from log
2023-03-30 14:43:18 0 [Note] InnoDB: 27 transaction(s) which must be rolled back or cleaned up in total 27 row operations to undo
2023-03-30 14:43:18 0 [Note] InnoDB: Trx id counter is 37362293
2023-03-30 14:43:18 0 [Note] InnoDB: Starting final batch to recover 104676 pages from redo log.
2023-03-30 14:43:20 0 [Note] InnoDB: To recover: 79977 pages from log
2023-03-30 14:43:22 0 [Note] InnoDB: 128 rollback segments are active.
2023-03-30 14:43:22 0 [Note] InnoDB: Starting in background the rollback of recovered transactions
2023-03-30 14:43:22 0 [Note] InnoDB: Rollback of non-prepared transactions completed
2023-03-30 14:43:22 0 [Note] InnoDB: Removed temporary tablespace data file: "./ibtmp1"
2023-03-30 14:43:22 0 [Note] InnoDB: Setting O_DIRECT on file ./ibtmp1 failed
2023-03-30 14:43:22 0 [Note] InnoDB: Setting file './ibtmp1' size to 12.000MiB. Physically writing the file full; Please wait ...
2023-03-30 14:43:22 0 [Note] InnoDB: File './ibtmp1' size is now 12.000MiB.
2023-03-30 14:43:22 0 [Note] InnoDB: log sequence number 5943243155 (memory-mapped); transaction id 37362321

Right before issuing the first message "Starting a batch" it was busy parsing the redo log. The "Read redo log up to LSN" message is intended to be issued every 15 seconds, just like the "To recover:" ones. Because perf failed to collect samples for some reason, I ran it once more, this time with innodb_buffer_pool_size=30g, to be met with similar initial silence:

2023-03-30 14:50:36 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=890094812
2023-03-30 14:52:09 0 [Note] InnoDB: Read redo log up to LSN=890094812
2023-03-30 14:53:59 0 [Note] InnoDB: Starting a batch to recover 343907 pages from redo log.
2023-03-30 14:53:59 0 [Note] InnoDB: To recover: 343907 pages from log
2023-03-30 14:54:14 0 [Note] InnoDB: To recover: 162142 pages from log
2023-03-30 14:54:29 0 [Note] InnoDB: To recover: 16129 pages from log
2023-03-30 14:55:18 0 [Note] InnoDB: To recover: 200958 pages from log
2023-03-30 14:55:20 0 [Note] InnoDB: 27 transaction(s) which must be rolled back or cleaned up in total 27 row operations to undo
2023-03-30 14:55:20 0 [Note] InnoDB: Trx id counter is 37362293
2023-03-30 14:55:20 0 [Note] InnoDB: Starting final batch to recover 200464 pages from redo log.
2023-03-30 14:55:33 0 [Note] InnoDB: To recover: 61463 pages from log
2023-03-30 14:55:42 0 [Note] InnoDB: 128 rollback segments are active.
2023-03-30 14:55:42 0 [Note] InnoDB: Starting in background the rollback of recovered transactions
2023-03-30 14:55:42 0 [Note] InnoDB: Rollback of non-prepared transactions completed
2023-03-30 14:55:42 0 [Note] InnoDB: File './ibtmp1' size is now 12.000MiB.
2023-03-30 14:55:42 0 [Note] InnoDB: log sequence number 5943243155 (memory-mapped); transaction id 37362321

During the 3½ minutes of the (necessarily) single-threaded initial parsing of the redo log, only one "Read redo log up to LSN" message was issued. The "To recover" messages were issued almost reliably during this run. The perf report data shows that some page reads did take place, so some log should have been applied in multiple threads. Yet, the CPU utilization hovered close to 100% (of one CPU core) for the entire run.

Comment by Marko Mäkelä [ 2023-03-30 ]

I got a bit distracted with recovery performance testing, reading the perf report. In my previous comment, I mentioned recovery with innodb_buffer_pool_size=30g:

2023-03-30 14:50:36 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=890094812
2023-03-30 14:55:42 0 [Note] InnoDB: log sequence number 5943243155 (memory-mapped); transaction id 37362321

The total time was 5 minutes and 6 seconds. I managed to optimize away 2 minutes:

2023-03-30 18:09:29 0 [Note] InnoDB: Completed initialization of buffer pool
2023-03-30 18:11:38 0 [Note] InnoDB: To recover: 447139 pages from log
2023-03-30 18:11:45 0 [Note] InnoDB: 27 transaction(s) which must be rolled back or cleaned up in total 27 row operations to undo
2023-03-30 18:11:45 0 [Note] InnoDB: Trx id counter is 37362293
2023-03-30 18:11:45 0 [Note] InnoDB: Starting final batch to recover 446636 pages from redo log.
2023-03-30 18:11:53 0 [Note] InnoDB: To recover: 296136 pages from log
2023-03-30 18:12:08 0 [Note] InnoDB: To recover: 89868 pages from log
2023-03-30 18:12:23 0 [Note] InnoDB: To recover: 23926 pages from log
2023-03-30 18:12:33 0 [Note] InnoDB: 128 rollback segments are active.
2023-03-30 18:12:33 0 [Note] InnoDB: log sequence number 5943243155 (memory-mapped); transaction id 37362321

That is 3 minutes and 4 seconds for the recovery. Most of the time goes into parsing and storing log records, which I think must be single-threaded.

I tried to revert one change, to see if it was for the better or for the worse:

2023-03-30 18:37:54 0 [Note] InnoDB: Completed initialization of buffer pool
2023-03-30 18:40:11 0 [Note] InnoDB: To recover: 447139 pages from log
2023-03-30 18:40:21 0 [Note] InnoDB: 27 transaction(s) which must be rolled back or cleaned up in total 27 row operations to undo
2023-03-30 18:40:21 0 [Note] InnoDB: Trx id counter is 37362293
2023-03-30 18:40:21 0 [Note] InnoDB: Starting final batch to recover 446636 pages from redo log.
2023-03-30 18:40:26 0 [Note] InnoDB: To recover: 364557 pages from log
2023-03-30 18:40:41 0 [Note] InnoDB: To recover: 128886 pages from log
2023-03-30 18:40:56 0 [Note] InnoDB: To recover: 43070 pages from log
2023-03-30 18:41:10 0 [Note] InnoDB: 128 rollback segments are active.
2023-03-30 18:41:10 0 [Note] InnoDB: Setting file './ibtmp1' size to 12.000MiB. Physically writing the file full; Please wait ...

This is 3 minutes and 16 seconds, a little worse.

The last 48 or 49 seconds of these runs is actual recovery, which could be subjected to parallelization. If we assume 16 concurrent threads, ideally we could get that shaved to 3 seconds, that is, the total time for the faster run would be 2 minutes and 32 seconds. I think that I will try that as well. The more reliable progress reporting would only be some icing on the cake.

Comment by Marko Mäkelä [ 2023-03-31 ]

The messages InnoDB: To recover: would only be output when applying log to a page in recv_recover_page(). The initial parsing stage can take a long time. I added an intentionally typo-ed message to the log parsing loop in recv_scan_log(). This is what the output looks like (recovering a different data directory than yesterday):

2023-03-31 12:06:21 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=45451
2023-03-31 12:06:36 0 [Note] InnoDB: To resover: 110099 pages from log
2023-03-31 12:06:51 0 [Note] InnoDB: To resover: 160222 pages from log
2023-03-31 12:07:06 0 [Note] InnoDB: To resover: 213224 pages from log
2023-03-31 12:07:21 0 [Note] InnoDB: To resover: 262026 pages from log
2023-03-31 12:07:41 0 [Note] InnoDB: Read redo log up to LSN=45451
2023-03-31 12:07:56 0 [Note] InnoDB: To resover: 91522 pages from log
2023-03-31 12:08:11 0 [Note] InnoDB: To resover: 130295 pages from log
2023-03-31 12:08:26 0 [Note] InnoDB: To resover: 169361 pages from log
2023-03-31 12:08:41 0 [Note] InnoDB: To resover: 212187 pages from log
2023-03-31 12:08:56 0 [Note] InnoDB: To resover: 247702 pages from log
2023-03-31 12:09:11 0 [Note] InnoDB: To resover: 283953 pages from log
2023-03-31 12:09:24 0 [Note] InnoDB: Starting a batch to recover 315871 pages from redo log.
2023-03-31 12:09:26 0 [Note] InnoDB: To recover: 314811 pages from log
2023-03-31 12:09:41 0 [Note] InnoDB: To recover: 100224 pages from log
2023-03-31 12:09:56 0 [Note] InnoDB: To recover: 8134 pages from log
2023-03-31 12:10:11 0 [Note] InnoDB: To resover: 100637 pages from log
2023-03-31 12:10:26 0 [Note] InnoDB: To resover: 140568 pages from log
2023-03-31 12:10:41 0 [Note] InnoDB: To resover: 185950 pages from log
2023-03-31 12:10:56 0 [Note] InnoDB: To resover: 225344 pages from log
2023-03-31 12:11:11 0 [Note] InnoDB: To resover: 259438 pages from log
2023-03-31 12:11:22 0 [Note] InnoDB: 16 transaction(s) which must be rolled back or cleaned up in total 16 row operations to undo
2023-03-31 12:11:22 0 [Note] InnoDB: Trx id counter is 41494841
2023-03-31 12:11:22 0 [Note] InnoDB: Starting final batch to recover 271971 pages from redo log.
2023-03-31 12:11:26 0 [Note] InnoDB: To recover: 212473 pages from log
2023-03-31 12:11:41 0 [Note] InnoDB: To recover: 49345 pages from log
2023-03-31 12:11:51 0 [Note] InnoDB: 128 rollback segments are active.
2023-03-31 12:11:51 0 [Note] InnoDB: log sequence number 6269792715 (memory-mapped); transaction id 41494858

That is, for the first 3 minutes recovery would likely not execute recv_recover_page() at all, only recv_scan_log() (typo-ed To resover). After the first To recover message at 12:09:26 both types of messages are being interleaved.

The tight loop in recv_scan_log() might not be the right place for the message output, though. Let us compare how quickly the same data directory was recovered before I added the message output:

2023-03-31 11:58:42 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=45451
2023-03-31 12:00:00 0 [Note] InnoDB: Read redo log up to LSN=45451
2023-03-31 12:01:35 0 [Note] InnoDB: Starting a batch to recover 315871 pages from redo log.
2023-03-31 12:01:35 0 [Note] InnoDB: To recover: 315871 pages from log
2023-03-31 12:01:50 0 [Note] InnoDB: To recover: 120018 pages from log
2023-03-31 12:02:05 0 [Note] InnoDB: To recover: 1754 pages from log
2023-03-31 12:03:19 0 [Note] InnoDB: To recover: 272422 pages from log
2023-03-31 12:03:23 0 [Note] InnoDB: 16 transaction(s) which must be rolled back or cleaned up in total 16 row operations to undo
2023-03-31 12:03:23 0 [Note] InnoDB: Trx id counter is 41494841
2023-03-31 12:03:23 0 [Note] InnoDB: Starting final batch to recover 271971 pages from redo log.
2023-03-31 12:03:34 0 [Note] InnoDB: To recover: 116378 pages from log
2023-03-31 12:03:49 0 [Note] InnoDB: To recover: 4730 pages from log
2023-03-31 12:03:51 0 [Note] InnoDB: 128 rollback segments are active.
2023-03-31 12:03:51 0 [Note] InnoDB: Starting in background the rollback of recovered transactions
2023-03-31 12:03:51 0 [Note] InnoDB: log sequence number 6269792715 (memory-mapped); transaction id 41494858

With the messages added: 12:11:51 - 12:06:21 = 5 minutes and 30 seconds
Without the messages: 12:03:51 - 11:58:42 = 5 minutes and 9 seconds

I will try to find a more suitable place for the message output, to avoid introducing any significant performance regression.

Comment by Marko Mäkelä [ 2023-03-31 ]

I confirmed with ATTRIBUTE_NOINLINE that virtually all CPU cycles of recv_sys_t::recover_low() are being spent in recv_sys_t::free(). I believe that implementing MDEV-29445 will help with that. In total, these two functions waste close to 10% of single-threaded CPU time:

   5.18%  mariadbd         mariadbd              [.] recv_sys_t::add
   4.78%  mariadbd         mariadbd              [.] recv_sys_t::free

I reran the recovery with an extra parameter --innodb-buffer-pool-chunk-size=30g to work around MDEV-25342, which would normally divide the buffer pool into 64 chunks:

2023-03-31 13:21:37 0 [Note] InnoDB: Initializing buffer pool, total size = 30.000GiB, chunk size = 30.000GiB
2023-03-31 13:21:38 0 [Note] InnoDB: Completed initialization of buffer pool
2023-03-31 13:21:38 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=45451
2023-03-31 13:26:07 0 [Note] InnoDB: log sequence number 6269792715 (memory-mapped); transaction id 41494858

There was no impact on the recovery time (still 5½ minutes, because the extra To resover messages were present). The ranking of recv_sys_t::free() dropped a bit:

   2.92%  mariadbd         mariadbd              [.] recv_sys_t::free

The remaining performance hog in that function is an atomic operation. We are repurposing a field in the buffer block descriptor in this custom allocator, and we do not actually need any atomic operation because we are holding recv_sys.mutex:

@@ -1411,8 +1422,11 @@ inline void recv_sys_t::free(const void *data)
     ut_ad(block->page.state() == buf_page_t::MEMORY);
     ut_ad(static_cast<uint16_t>(block->page.access_time - 1) <
           srv_page_size);
-    ut_ad(block->page.access_time >= 1U << 16);
-    if (!((block->page.access_time -= 1U << 16) >> 16))
+    unsigned a= block->page.access_time;
+    ut_ad(a >= 1U << 16);
+    a-= 1U << 16;
+    block->page.access_time= a;
+    if (!(a >> 16))
     {
       UT_LIST_REMOVE(blocks, block);
       MEM_MAKE_ADDRESSABLE(block->page.frame, srv_page_size);

In recv_sys_t::alloc() we are already using relaxed loads and stores instead of the costly atomic read-modify-write (lock xadd in 80486 and its successors). With the above change, recv_sys_t::free() became cheaper and the recovery time improved by a whopping 25% (82 seconds), from 5½ minutes to 4 minutes and 8 seconds:

   1.15%  mariadbd         mariadbd             [.] recv_sys_t::free

It looks like the remaining time in this function is dominated by the loop that iterates over the buffer pool chunks (in this case, only 1, but frequently executed conditions destroy performance). That would be addressed by MDEV-29445.

Comment by Marko Mäkelä [ 2023-03-31 ]

I was thinking that something like this could improve performance a lot in case a single recovery batch is not sufficient:

diff --git a/storage/innobase/log/log0recv.cc b/storage/innobase/log/log0recv.cc
index 148ab4ce55b..d1479456e62 100644
--- a/storage/innobase/log/log0recv.cc
+++ b/storage/innobase/log/log0recv.cc
@@ -4212,7 +4212,8 @@ dberr_t recv_recovery_from_checkpoint_start()
 		ut_ad(srv_force_recovery <= SRV_FORCE_NO_UNDO_LOG_SCAN);
 
 		if (rescan) {
-			recv_sys.lsn = log_sys.next_checkpoint_lsn;
+			recv_sys.lsn = std::max(log_sys.next_checkpoint_lsn,
+						recv_sys.last_stored_lsn);
 			rescan = recv_scan_log(true);
 			if ((recv_sys.is_corrupt_log()
 			     && !srv_force_recovery)

This exact change will make the test innodb.recovery_memory fail, and I imagine that it is somehow related to the FILE_ records for file name tracking. During the first scan of the redo log, we may parse log records for tablespace files that may have been deleted or renamed after the time of the latest checkpoint. Until we see a FILE_RENAME or FILE_DELETE record for the tablespace ID, we cannot know what the right thing to do is. Currently, the log between the latest checkpoint and its corresponding FILE_CHECKPOINT record is being parsed in "store if exists" mode: if the tablespace for some pages is not found, we will discard the records. It could be better to unconditionally store all log records in recv_sys.pages and check for inconsistency upon reaching the FILE_CHECKPOINT record. Some added progress reporting highlights this problem:

2023-03-31 16:52:01 0 [Note] InnoDB: Initializing buffer pool, total size = 30.000GiB, chunk size = 30.000GiB
2023-03-31 16:52:01 0 [Note] InnoDB: Completed initialization of buffer pool
2023-03-31 16:52:01 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=45451
2023-03-31 16:52:16 0 [Note] InnoDB: Parsed redo log up to LSN=1742292932; to recover: 106297 pages
2023-03-31 16:52:31 0 [Note] InnoDB: Parsed redo log up to LSN=2311978604; to recover: 156337 pages
2023-03-31 16:52:46 0 [Note] InnoDB: Parsed redo log up to LSN=2942556105; to recover: 210401 pages
2023-03-31 16:53:01 0 [Note] InnoDB: Parsed redo log up to LSN=3504149533; to recover: 259067 pages
2023-03-31 16:53:23 0 [Note] InnoDB: Read redo log up to LSN=45451
2023-03-31 16:53:38 0 [Note] InnoDB: Parsed redo log up to LSN=1645302567; to recover: 93217 pages
2023-03-31 16:53:53 0 [Note] InnoDB: Parsed redo log up to LSN=2166257717; to recover: 138937 pages
2023-03-31 16:54:08 0 [Note] InnoDB: Parsed redo log up to LSN=2746862824; to recover: 188829 pages
2023-03-31 16:54:23 0 [Note] InnoDB: Parsed redo log up to LSN=3229740111; to recover: 231309 pages
2023-03-31 16:54:38 0 [Note] InnoDB: Parsed redo log up to LSN=3706313376; to recover: 272336 pages
2023-03-31 16:54:53 0 [Note] InnoDB: Parsed redo log up to LSN=4173821625; to recover: 312515 pages
2023-03-31 16:54:54 0 [Note] InnoDB: Starting a batch to recover 315871 pages from redo log.
2023-03-31 16:55:08 0 [Note] InnoDB: To recover: 95902 pages from log
2023-03-31 16:55:26 0 [Note] InnoDB: Parsed redo log up to LSN=4213203801; to recover: 0 pages
2023-03-31 16:55:40 0 [Note] InnoDB: Parsed redo log up to LSN=4665667725; to recover: 124497 pages
2023-03-31 16:55:55 0 [Note] InnoDB: Parsed redo log up to LSN=5227132263; to recover: 174975 pages
2023-03-31 16:56:10 0 [Note] InnoDB: Parsed redo log up to LSN=5722640474; to recover: 221146 pages
2023-03-31 16:56:25 0 [Note] InnoDB: Parsed redo log up to LSN=6131373971; to recover: 259411 pages
2023-03-31 16:56:34 0 [Note] InnoDB: 16 transaction(s) which must be rolled back or cleaned up in total 16 row operations to undo
2023-03-31 16:56:34 0 [Note] InnoDB: Trx id counter is 41494841
2023-03-31 16:56:34 0 [Note] InnoDB: Starting final batch to recover 271971 pages from redo log.
2023-03-31 16:56:40 0 [Note] InnoDB: To recover: 155518 pages from log
2023-03-31 16:56:53 0 [Note] InnoDB: log sequence number 6269792715 (memory-mapped); transaction id 41494847

The time between 16:52:01 and 16:53:23 (82 seconds) was basically wasted, because we will restart parsing the log from the checkpoint. Another way of looking at it is to interpolate the timestamp corresponding to the LSN that was displayed at 16:53:01. It could be around 16:54:30 (a little before a somewhat larger LSN was reported at 16:54:38). Again, the difference is more than 80 seconds. That is quite a bit, considering that the total recovery time is 4 minutes and 52 seconds (292 seconds). We could save another 29% of time there. Together with the previous savings in recv_sys_t::free(), this should halve the recovery time!

This could affect many users, because my recovery attempt was with a larger buffer pool (30G) than was used for originally creating the data directory (20G). With an equal-sized or smaller buffer pool and infrequent log checkpoints, a multi-batch recovery should be very likely.

Also the memory limit (do not let log records occupy more than ⅓ of the buffer pool) feels rather arbitrary. Do we really need to reserve ⅔ of a huge buffer pool for copies of data pages during recovery, or would something like 128 or 256 pages suffice? Recovery only posts batches of 32 asynchronous page reads when it needs to read pages, and during the recovery of my test datasets I hardly see any page reads taking place. However, we must keep in mind that this is a rather special test: no checkpoint had been written after the server was bootstrapped, and practically everything can be recovered (MDEV-12699) without reading any pages from the data files. I will have to test also a case where there was a checkpoint between loading the test data and running the UPDATE workload.

I tried "pushing down" the memory limit to recv_sys_t::add(), but it needs some more work, because the test innodb.recovery_memory would occasionally fail.

Comment by Marko Mäkelä [ 2023-04-03 ]

Today, I made some progress on "pushing down OOM checks". In case we run out of memory, we will resume parsing the log from the start of the current mini-transaction, not from the start of the log. The memory usage limit is more reasonable too: for larger buffer pools, only BUF_LRU_MIN_LEN (256) pages are reserved for the to-be-recovered pages. Even that is somewhat of an overkill, because it is 4 times the maximum value of innodb_read_io_threads=64.

2023-04-03 16:38:35 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=950555258
2023-04-03 16:38:50 0 [Note] InnoDB: Parsed redo log up to LSN=2014717452; to recover: 126885 pages
2023-04-03 16:39:05 0 [Note] InnoDB: Parsed redo log up to LSN=2656759712; to recover: 181122 pages
2023-04-03 16:39:20 0 [Note] InnoDB: Parsed redo log up to LSN=3326397264; to recover: 234593 pages
2023-04-03 16:39:35 0 [Note] InnoDB: Parsed redo log up to LSN=3906055802; to recover: 283483 pages
2023-04-03 16:39:50 0 [Note] InnoDB: Parsed redo log up to LSN=4484524600; to recover: 332314 pages
2023-04-03 16:40:05 0 [Note] InnoDB: Parsed redo log up to LSN=5066015464; to recover: 381400 pages
2023-04-03 16:40:20 0 [Note] InnoDB: Parsed redo log up to LSN=5680977027; to recover: 428113 pages
2023-04-03 16:40:35 0 [Note] InnoDB: To recover: 454513 pages from log
2023-04-03 16:40:36 0 [Note] InnoDB: 26 transaction(s) which must be rolled back or cleaned up in total 26 row operations to undo
2023-04-03 16:40:36 0 [Note] InnoDB: Trx id counter is 37992126
2023-04-03 16:40:36 0 [Note] InnoDB: Starting final batch to recover 454423 pages from redo log.
2023-04-03 16:40:50 0 [Note] InnoDB: To recover: 206743 pages from log
2023-04-03 16:41:05 0 [Note] InnoDB: To recover: 67047 pages from log
2023-04-03 16:41:20 0 [Note] InnoDB: To recover: 22798 pages from log
2023-04-03 16:41:30 0 [Note] InnoDB: 128 rollback segments are active.
2023-04-03 16:41:30 0 [Note] InnoDB: Starting in background the rollback of recovered transactions
2023-04-03 16:41:31 0 [Note] InnoDB: log sequence number 6041702829 (memory-mapped); transaction id 37992153

This run is not exactly comparable to what I ran on Friday, because there is almost 1GiB less of log to recover (the checkpoint LSN starts at almost 1G). Anyway, the 82-second waste is gone, and we are down from 292 seconds to 176 seconds. For a more fair comparison to the previous run, let us multiply that by 6/5: 1.2×176s=3.5 minutes, much less than the almost 5 minutes.

To speed up recovery further, I think that both RECV_WILL_NOT_READ and RECV_NOT_PROCESSED should be treated by buf_read_recv_pages(). For RECV_WILL_NOT_READ, we need to introduce a "fake read" operation, which would invoke the recovery function directly, without actually reading anything from the file system.
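The "fake read" dispatch suggested above could look roughly like this. The enum values mirror the recovery states mentioned in the text, but the helper functions are hypothetical stand-ins, not real InnoDB APIs:

```cpp
// Illustrative recovery states, as named in the text above.
enum recv_addr_state { RECV_NOT_PROCESSED, RECV_BEING_READ, RECV_WILL_NOT_READ };

int pages_read = 0;   // counters only to make the sketch observable
int pages_faked = 0;

void apply_log_directly() { ++pages_faked; }  // page fully rebuilt from log
void submit_async_read() { ++pages_read; }    // normal read-then-apply path

void recover_page(recv_addr_state state)
{
  if (state == RECV_WILL_NOT_READ)
    apply_log_directly();  // "fake read": no file system I/O at all
  else
    submit_async_read();   // RECV_NOT_PROCESSED: read the page first
}
```

The point of the fake read is that pages whose entire contents can be reconstructed from the log never touch the data files.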

I am afraid that the parsing (here: about 40 MiB/s for the first 121 seconds) must remain single-threaded. With parallel processing of the log application, it should be feasible to cut the remaining 55 seconds to 2 or 3 seconds. The original run that produced the log used 64 concurrent connections, maybe 1 to 1½ minutes for loading the data and another 1½ minutes for the workload. That is, with that final fix, database recovery should consume a similar amount of time as creating the database took in the first place.

Comment by Marko Mäkelä [ 2023-04-04 ]

It looks like I may have gotten the improved OOM processing to work. We will avoid re-parsing everything from the recovery checkpoint onwards, but we do have to re-parse everything from the start of the mini-transaction at which we ran out of memory for log records. For larger buffer pools, only 256 pages will be reserved for actual buffer pool pages; the rest can be consumed by log records.
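The rewind behaviour described above can be sketched as follows; the names are hypothetical, not the actual InnoDB identifiers. The parser remembers the LSN at which the current mini-transaction started, so running out of memory only rewinds to that point rather than all the way back to the checkpoint:

```cpp
#include <cstdint>

// Hypothetical parser state for the OOM handling described above.
struct parse_state {
  uint64_t lsn;            // current parse position
  uint64_t mtr_start_lsn;  // start of the mini-transaction being stored
  bool out_of_memory;      // did storing its records exhaust the budget?
};

uint64_t resume_lsn(const parse_state &p)
{
  // Pre-improvement code had to restart from the checkpoint LSN instead.
  return p.out_of_memory ? p.mtr_start_lsn : p.lsn;
}
```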

I implemented a proof of concept for asynchronous "fake reads" for multi-threaded recovery. In my quick test with innodb_log_file_size=10G and innodb_buffer_pool_size=20G, it now recovers in a single batch when using the same-size buffer pool. The concurrency is not good yet: the best I saw with innodb_read_io_threads=64 (the maximum) was slightly less than 370% of one CPU core being utilized, and typically it was around 150% during the log application. I will streamline the work submission, and I will have to retest with innodb_buffer_pool_chunk_size=20g because the recv_sys.mutex could be a bottleneck in the memory management:

  24,89%  mariadbd         mariadbd              [.] page_recv_t::recs_t::clear
  18,16%  mariadbd         mariadbd              [.] recv_sys_t::parse<recv_buf, true>
  12,39%  mariadbd         mariadbd              [.] std::_Rb_tree<page_id_t const, std::pair<page_id_t const, page_recv_t>, std::_Select1st<std::pair<page_id_t const, page_recv_t> >, std::less<page_id_t const>, ut_allocator<std::pair<page_id_t const, page_recv_t>, true> >::_M_emplace_unique<page_id_t const&, page_recv_t>
   5,20%  mariadbd         mariadbd              [.] recv_recover_page
   4,80%  mariadbd         mariadbd              [.] recv_sys_t::add
   3,80%  mariadbd         mariadbd              [.] log_phys_t::apply
   3,64%  mariadbd         [kernel.kallsyms]     [k] clear_page_erms
   1,21%  mariadbd         libc.so.6             [.] _int_free
   1,06%  mariadbd         libc.so.6             [.] __memmove_avx_unaligned_erms_rtm
   0,98%  mariadbd         mariadbd              [.] crc32c_3way
   0,90%  mariadbd         mariadbd              [.] page_apply_insert_dynamic

Using a single buffer pool chunk sped it up, and the CPU utilization could have been over 300% for the most part (still a far cry from the 40 that nproc reports):

  21,85%  mariadbd         mariadbd              [.] recv_sys_t::parse<recv_buf, true>
  14,65%  mariadbd         mariadbd              [.] std::_Rb_tree<page_id_t const, std::pair<page_id_t const, page_recv_t>, std::_Select1st<std::pair<page_id_t const, page_recv_t> >, std::less<page_id_t const>, ut_allocator<std::pair<page_id_t const, page_recv_t>, true> >::_M_emplace_unique<page_id_t const&, page_recv_t>
   7,24%  mariadbd         [kernel.kallsyms]     [k] clear_page_erms
   5,62%  mariadbd         mariadbd              [.] recv_recover_page
   4,53%  mariadbd         mariadbd              [.] recv_sys_t::add
   4,52%  mariadbd         mariadbd              [.] log_phys_t::apply
   3,90%  mariadbd         mariadbd              [.] page_recv_t::recs_t::clear
   1,50%  mariadbd         libc.so.6             [.] _int_free
   1,26%  mariadbd         libc.so.6             [.] __memmove_avx_unaligned_erms_rtm
   1,26%  mariadbd         mariadbd              [.] page_apply_insert_dynamic
   1,17%  mariadbd         [kernel.kallsyms]     [k] copy_page
   1,13%  mariadbd         mariadbd              [.] crc32c_3way

It now took this much wall clock time:

2023-04-04 16:27:31 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=928593106
2023-04-04 16:27:46 0 [Note] InnoDB: Parsed redo log up to LSN=2027073563; to recover: 128070 pages
2023-04-04 16:29:01 0 [Note] InnoDB: Parsed redo log up to LSN=5331384873; to recover: 402752 pages
2023-04-04 16:29:16 0 [Note] InnoDB: To recover: 453542 pages from log
2023-04-04 16:29:21 0 [Note] InnoDB: 15 transaction(s) which must be rolled back or cleaned up in total 15 row operations to undo
2023-04-04 16:29:21 0 [Note] InnoDB: Trx id counter is 37845466
2023-04-04 16:29:21 0 [Note] InnoDB: Starting final batch to recover 453104 pages from redo log.
2023-04-04 16:30:04 0 [Note] InnoDB: log sequence number 6018444686 (memory-mapped); transaction id 37845482

The log parsing speed is about (6018444686-928593106)/105s = 46 MiB/s. That time is unaffected by the innodb_buffer_pool_chunk_size; I checked two previous runs.

Applying the log took 43 seconds. With the default innodb_buffer_pool_chunk_size and innodb_read_io_threads=4, it took 189 seconds:

2023-04-04 16:03:32 0 [Note] InnoDB: Starting final batch to recover 453104 pages from redo log.
2023-04-04 16:06:41 0 [Note] InnoDB: log sequence number 6018444686 (memory-mapped); transaction id 37845482

With the default innodb_buffer_pool_chunk_size and the maximum innodb_read_io_threads=64, it took a couple of seconds more (191 seconds):

2023-04-04 16:13:01 0 [Note] InnoDB: Starting final batch to recover 453104 pages from redo log.
2023-04-04 16:16:12 0 [Note] InnoDB: log sequence number 6018444686 (memory-mapped); transaction id 37845482

We must reduce contention on recv_sys.mutex. One way could be to lazily clear entries from recv_sys.pages. Hopefully then it will be acceptably fast even without specifying a maximal innodb_buffer_pool_chunk_size.

Comment by Marko Mäkelä [ 2023-04-08 ]

When we run out of memory while storing parsed log records into recv_sys.pages, we only need to rewind to the checkpoint LSN if the first call to recv_init_crash_recovery_spaces() reported that some tablespaces may be missing. Later, a special version of this validation loop could be implemented for DDL workloads.

I tried to recover from a crash that used a 20GiB buffer pool and 10GiB redo log, by using only 1GiB buffer pool:

sql/mariadbd --gdb --innodb-log-file-size=10g --innodb-buffer-pool-{,chunk-}size=1g --innodb-read-io-threads=64 --datadir /dev/shm/sbtest.backup/

Progress reporting looks like this:

2023-04-08 21:32:46 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=503549688
2023-04-08 21:32:49 0 [Note] InnoDB: Recovery ran out of memory at LSN 1083025911
2023-04-08 21:32:58 0 [Note] InnoDB: End of log at LSN=4483551634
2023-04-08 21:32:59 0 [Note] InnoDB: To recover: LSN 1153652154/4483551634; 32645 pages
2023-04-08 21:33:08 0 [Note] InnoDB: To recover: LSN 1637462775/4483551634; 91426 pages
2023-04-08 21:33:23 0 [Note] InnoDB: Parsed redo log up to LSN=1961012462; to recover: 106911 pages
2023-04-08 21:33:23 0 [Note] InnoDB: To recover: LSN 1987860863/4483551634; 109501 pages
2023-04-08 21:33:38 0 [Note] InnoDB: Parsed redo log up to LSN=2220397229; to recover: 97796 pages
2023-04-08 21:33:41 0 [Note] InnoDB: To recover: LSN 2330818601/4483551634; 108920 pages
2023-04-08 21:33:56 0 [Note] InnoDB: Parsed redo log up to LSN=2581361215; to recover: 99426 pages
2023-04-08 21:33:58 0 [Note] InnoDB: To recover: LSN 2667451691/4483551634; 107164 pages
2023-04-08 21:34:12 0 [Note] InnoDB: To recover: LSN 3002207709/4483551634; 105708 pages
2023-04-08 21:34:27 0 [Note] InnoDB: Parsed redo log up to LSN=3248729981; to recover: 109527 pages
2023-04-08 21:34:29 0 [Note] InnoDB: To recover: LSN 3318003147/4483551634; 116461 pages
2023-04-08 21:34:44 0 [Note] InnoDB: Parsed redo log up to LSN=3517427589; to recover: 105170 pages
2023-04-08 21:34:47 0 [Note] InnoDB: To recover: LSN 3638007653/4483551634; 117075 pages
2023-04-08 21:35:02 0 [Note] InnoDB: Parsed redo log up to LSN=3826053254; to recover: 104253 pages
2023-04-08 21:35:06 0 [Note] InnoDB: To recover: LSN 3959684337/4483551634; 117624 pages
2023-04-08 21:35:21 0 [Note] InnoDB: Parsed redo log up to LSN=4139117168; to recover: 103447 pages
2023-04-08 21:35:25 0 [Note] InnoDB: To recover: LSN 4280878385/4483551634; 117791 pages
2023-04-08 21:35:40 0 [Note] InnoDB: Parsed redo log up to LSN=4432727985; to recover: 100850 pages
2023-04-08 21:35:41 0 [Note] InnoDB: 14 transaction(s) which must be rolled back or cleaned up in total 14 row operations to undo
2023-04-08 21:35:41 0 [Note] InnoDB: Trx id counter is 26268325
2023-04-08 21:35:42 0 [Note] InnoDB: To recover: 105496 pages
2023-04-08 21:35:46 0 [Note] InnoDB: 128 rollback segments are active.
2023-04-08 21:35:46 0 [Note] InnoDB: Starting in background the rollback of recovered transactions
2023-04-08 21:35:46 0 [Note] InnoDB: log sequence number 4483551634 (memory-mapped); transaction id 26268340

At the start of each recovery batch, a message "To recover:" will be output unconditionally. Subsequent messages of the same or different type will be output every 15 seconds. The final recovery batch (105496 pages) started at 21:35:42 and finished at 21:35:46. For reporting progress before the final batch, we report both the current recovered LSN and the final LSN, as well as the number of pages currently waiting to be recovered. The number of pages increases during parsing and decreases while applying log.

For comparison, I recovered the same dataset using a 10GiB buffer pool (half the size of the original buffer pool). It will run out of memory, even though the redo log is only 10GiB:

2023-04-08 21:40:07 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=503549688
2023-04-08 21:40:22 0 [Note] InnoDB: Parsed redo log up to LSN=1990840177; to recover: 124806 pages
2023-04-08 21:40:37 0 [Note] InnoDB: Parsed redo log up to LSN=2729777071; to recover: 186123 pages
2023-04-08 21:40:52 0 [Note] InnoDB: Parsed redo log up to LSN=3488599173; to recover: 248397 pages
2023-04-08 21:41:07 0 [Note] InnoDB: Parsed redo log up to LSN=4177856618; to recover: 306469 pages
2023-04-08 21:41:07 0 [Note] InnoDB: Recovery ran out of memory at LSN 4189599815
2023-04-08 21:41:07 0 [Note] InnoDB: End of log at LSN=4483551634
2023-04-08 21:41:08 0 [Note] InnoDB: To recover: LSN 4189599815/4483551634; 307490 pages
2023-04-08 21:41:23 0 [Note] InnoDB: To recover: LSN 4189599815/4483551634; 197159 pages
2023-04-08 21:41:38 0 [Note] InnoDB: To recover: LSN 4189599815/4483551634; 67623 pages
2023-04-08 21:41:53 0 [Note] InnoDB: Parsed redo log up to LSN=4353924218; to recover: 102083 pages
2023-04-08 21:41:57 0 [Note] InnoDB: 14 transaction(s) which must be rolled back or cleaned up in total 14 row operations to undo
2023-04-08 21:41:57 0 [Note] InnoDB: Trx id counter is 26268325
2023-04-08 21:41:57 0 [Note] InnoDB: To recover: 114621 pages
2023-04-08 21:42:01 0 [Note] InnoDB: 128 rollback segments are active.
2023-04-08 21:42:01 0 [Note] InnoDB: Starting in background the rollback of recovered transactions
2023-04-08 21:42:01 0 [Note] InnoDB: log sequence number 4483551634 (memory-mapped); transaction id 26268340

Finally, with 20GiB buffer pool size (same size as the server that was killed during workload)

2023-04-08 21:43:04 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=503549688
2023-04-08 21:43:19 0 [Note] InnoDB: Parsed redo log up to LSN=1998701027; to recover: 125560 pages
2023-04-08 21:43:34 0 [Note] InnoDB: Parsed redo log up to LSN=2734136874; to recover: 186446 pages
2023-04-08 21:43:49 0 [Note] InnoDB: Parsed redo log up to LSN=3499505504; to recover: 249378 pages
2023-04-08 21:44:04 0 [Note] InnoDB: Parsed redo log up to LSN=4183247844; to recover: 306964 pages
2023-04-08 21:44:10 0 [Note] InnoDB: End of log at LSN=4483551634
2023-04-08 21:44:12 0 [Note] InnoDB: 14 transaction(s) which must be rolled back or cleaned up in total 14 row operations to undo
2023-04-08 21:44:12 0 [Note] InnoDB: Trx id counter is 26268325
2023-04-08 21:44:14 0 [Note] InnoDB: To recover: 331797 pages
2023-04-08 21:44:36 0 [Note] InnoDB: 128 rollback segments are active.
2023-04-08 21:44:36 0 [Note] InnoDB: Starting in background the rollback of recovered transactions
2023-04-08 21:44:36 0 [Note] InnoDB: To roll back: 14 transactions, 14 rows
2023-04-08 21:44:36 0 [Note] InnoDB: log sequence number 4483551634 (memory-mapped); transaction id 26268328

Summary:

buffer pool size | time until end-of-log detection | total recovery time
1GiB             | 12s                             | 180s
10GiB            | 60s                             | 114s
20GiB            | 66s                             | 92s

When using a smaller buffer pool, the end LSN will be reached faster, because once we run out of memory, the rest of the log will just be parsed and discarded. When all parsed log fits in the buffer pool, parsing takes 66 seconds and applying the log takes 26 seconds. When pages need to be read from the file system and written to storage slower than the RAM disk that I was using, it will probably look quite different. In any case, using more RAM and CPUs should be helpful.

When pages need to be read from the file system, recv_sys.mutex may continue to be a significant bottleneck, depending on the type of storage. I only avoid acquiring it in the new "fake asynchronous read" code path.

Comment by Marko Mäkelä [ 2023-04-14 ]

Before MDEV-14425 simplified the log format in MariaDB Server 10.8, there were two log parsers: a log block parser and a log record parser. The log records were not contiguous in the log file: they could be interrupted by the headers and footers of 512-byte log blocks, so the mapping from log sequence number (LSN) to byte offset was more complicated. Starting with 10.8, the log file offset is simply a base offset plus the LSN modulo (ib_logfile0 file size - 12288 bytes), and the only discontinuity is when ib_logfile0 wraps around from the end of the file back to byte offset 12288.

Due to this, it is not feasible to port some of the log parsing optimizations to 10.6 or earlier versions.
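The 10.8+ mapping described above can be sketched in a few lines; this is an illustrative reconstruction with hypothetical names, not the actual InnoDB function:

```cpp
#include <cstdint>

// The first 12288 bytes of ib_logfile0 hold the header and checkpoint
// blocks; the circular payload area starts right after and wraps back
// to byte offset 12288.
constexpr uint64_t LOG_START = 12288;

// base_lsn is the LSN corresponding to file offset LOG_START.
uint64_t lsn_to_offset(uint64_t lsn, uint64_t base_lsn, uint64_t file_size)
{
  const uint64_t capacity = file_size - LOG_START;  // usable payload bytes
  return LOG_START + (lsn - base_lsn) % capacity;
}
```

The single modulo operation is what makes contiguous parsing feasible, in contrast to the pre-10.8 format where 512-byte block headers and footers interrupted the records.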

Comment by Marko Mäkelä [ 2023-04-17 ]

I am working on a 10.6 port of this. It will be less efficient due to the more complicated memory management and parsing of redo log records.

The following is from a recovery of Sysbench oltp_update_index:

2023-04-17 12:21:55 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=628599973,5653727799
2023-04-17 12:22:10 0 [Note] InnoDB: Read redo log up to LSN=1935453184
2023-04-17 12:22:25 0 [Note] InnoDB: Read redo log up to LSN=2584783872
2023-04-17 12:22:40 0 [Note] InnoDB: Read redo log up to LSN=3218648064
2023-04-17 12:22:55 0 [Note] InnoDB: Read redo log up to LSN=3803884544
2023-04-17 12:23:02 0 [Note] InnoDB: Recovery ran out of memory at LSN 4068555446
2023-04-17 12:23:10 0 [Note] InnoDB: Read redo log up to LSN=5356432384
2023-04-17 12:23:25 0 [Note] InnoDB: Read redo log up to LSN=1747627008
2023-04-17 12:23:40 0 [Note] InnoDB: Read redo log up to LSN=2296949760
2023-04-17 12:23:55 0 [Note] InnoDB: Read redo log up to LSN=2934352896
2023-04-17 12:24:10 0 [Note] InnoDB: Read redo log up to LSN=3416960000
2023-04-17 12:24:25 0 [Note] InnoDB: Read redo log up to LSN=3911756800
2023-04-17 12:24:39 0 [Note] InnoDB: To recover: LSN 4375342769/4375358464; 338590 pages
2023-04-17 12:24:54 0 [Note] InnoDB: To recover: LSN 4375342769/4375358464; 325799 pages
2023-04-17 12:25:09 0 [Note] InnoDB: To recover: LSN 4375342769/4375358464; 301991 pages
2023-04-17 12:25:24 0 [Note] InnoDB: To recover: LSN 4375342769/4375358464; 273831 pages
2023-04-17 12:25:39 0 [Note] InnoDB: To recover: LSN 4375342769/4375358464; 241575 pages
2023-04-17 12:25:54 0 [Note] InnoDB: To recover: LSN 4375342769/4375358464; 194727 pages
2023-04-17 12:26:09 0 [Note] InnoDB: To recover: LSN 4375342769/4375358464; 105895 pages
2023-04-17 12:26:24 0 [Note] InnoDB: To recover: LSN 4375342769/4375358464; 36263 pages
2023-04-17 12:26:43 0 [Note] InnoDB: Read redo log up to LSN=4375424000
2023-04-17 12:26:58 0 [Note] InnoDB: Read redo log up to LSN=4886932480
2023-04-17 12:27:13 0 [Note] InnoDB: Read redo log up to LSN=5513325568
2023-04-17 12:27:18 0 [Note] InnoDB: 19 transaction(s) which must be rolled back or cleaned up in total 19 row operations to undo
2023-04-17 12:27:18 0 [Note] InnoDB: Trx id counter is 37315565
2023-04-17 12:27:20 0 [Note] InnoDB: To recover: 205090 pages
2023-04-17 12:27:51 0 [Note] InnoDB: 128 rollback segments are active.
2023-04-17 12:27:51 0 [Note] InnoDB: Starting in background the rollback of recovered transactions
2023-04-17 12:27:51 0 [Note] InnoDB: To roll back: 19 transactions, 19 rows
2023-04-17 12:27:51 0 [Note] InnoDB: Rollback of non-prepared transactions completed
2023-04-17 12:27:51 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
2023-04-17 12:27:51 0 [Note] InnoDB: 10.6.13 started; log sequence number 5670373116; transaction id 37315585

In 10.6, if we run out of memory, we will always restart parsing the log from the start.

  32,71%  mariadbd         mariadbd              [.] recv_sys_t::erase
  19,96%  mariadbd         mariadbd              [.] recv_sys_t::parse
  13,90%  mariadbd         mariadbd              [.] …::_M_emplace_unique<page_id_t const&, page_recv_t>
   3,19%  mariadbd         libstdc++.so.6.0.30   [.] std::_Rb_tree_increment
   3,16%  mariadbd         mariadbd              [.] recv_sys_t::add
   2,91%  mariadbd         mariadbd              [.] recv_recover_page                                                                                                                                                                                                                                                                                                                     

The CPU time was dominated by adding and removing items, because I forgot to set innodb_buffer_pool_chunk_size equal to innodb_buffer_pool_size (10GiB). With a 30GiB buffer pool (the same as before the server was killed) in one chunk, it recovers faster:

2023-04-17 12:32:59 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=628599973,5653727799
2023-04-17 12:33:14 0 [Note] InnoDB: Read redo log up to LSN=1984539648
2023-04-17 12:33:29 0 [Note] InnoDB: Read redo log up to LSN=2710875136
2023-04-17 12:33:44 0 [Note] InnoDB: Read redo log up to LSN=3358895104
2023-04-17 12:33:59 0 [Note] InnoDB: Read redo log up to LSN=3965299712
2023-04-17 12:34:14 0 [Note] InnoDB: Read redo log up to LSN=4557417472
2023-04-17 12:34:29 0 [Note] InnoDB: Read redo log up to LSN=5219527680
2023-04-17 12:34:43 0 [Note] InnoDB: 19 transaction(s) which must be rolled back or cleaned up in total 19 row operations to undo
2023-04-17 12:34:43 0 [Note] InnoDB: Trx id counter is 37315565
2023-04-17 12:34:45 0 [Note] InnoDB: To recover: 450915 pages
2023-04-17 12:35:07 0 [Note] InnoDB: 128 rollback segments are active.
2023-04-17 12:35:07 0 [Note] InnoDB: Starting in background the rollback of recovered transactions
2023-04-17 12:35:07 0 [Note] InnoDB: To roll back: 19 transactions, 19 rows
2023-04-17 12:35:07 0 [Note] InnoDB: Rollback of non-prepared transactions completed
2023-04-17 12:35:07 0 [Note] InnoDB: Removed temporary tablespace data file: "./ibtmp1"
2023-04-17 12:35:07 0 [Note] InnoDB: Creating shared tablespace for temporary tables
2023-04-17 12:35:07 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
2023-04-17 12:35:07 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
2023-04-17 12:35:07 0 [Note] InnoDB: 10.6.13 started; log sequence number 5670373116; transaction id 37315585

Now the CPU usage makes more sense too:

  24,31%  mariadbd         mariadbd               [.] recv_sys_t::parse
  18,11%  mariadbd         mariadbd               [.] …::_M_emplace_unique<page_id_t const&, page_recv_t>
  11,15%  mariadbd         mariadbd               [.] recv_sys_t::erase
   6,68%  mariadbd         mariadbd               [.] recv_recover_page
   5,43%  mariadbd         mariadbd               [.] log_phys_t::apply
   3,74%  mariadbd         mariadbd               [.] recv_sys_t::add

Finally, here is a recovery with a 5GiB buffer pool in one chunk:

2023-04-17 12:38:34 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=628599973,5653727799
2023-04-17 12:38:49 0 [Note] InnoDB: Read redo log up to LSN=1963895808
2023-04-17 12:39:01 0 [Note] InnoDB: Recovery ran out of memory at LSN 2534560930
2023-04-17 12:39:04 0 [Note] InnoDB: Read redo log up to LSN=3312233472
2023-04-17 12:39:19 0 [Note] InnoDB: Read redo log up to LSN=1599646720
2023-04-17 12:39:34 0 [Note] InnoDB: Read redo log up to LSN=2160831488
2023-04-17 12:39:48 0 [Note] InnoDB: To recover: LSN 2806789376/2806819840; 195082 pages
2023-04-17 12:40:03 0 [Note] InnoDB: To recover: LSN 2806789376/2806819840; 63507 pages
2023-04-17 12:40:18 0 [Note] InnoDB: Read redo log up to LSN=3195776000
2023-04-17 12:40:33 0 [Note] InnoDB: Read redo log up to LSN=3687099392
2023-04-17 12:40:48 0 [Note] InnoDB: Read redo log up to LSN=4165315584
2023-04-17 12:40:54 0 [Note] InnoDB: To recover: LSN 4374395699/4374440960; 241454 pages
2023-04-17 12:41:09 0 [Note] InnoDB: To recover: LSN 4374395699/4374440960; 123701 pages
2023-04-17 12:41:24 0 [Note] InnoDB: Read redo log up to LSN=4508724224
2023-04-17 12:41:39 0 [Note] InnoDB: Read redo log up to LSN=5094550528
2023-04-17 12:41:54 0 [Note] InnoDB: 19 transaction(s) which must be rolled back or cleaned up in total 19 row operations to undo
2023-04-17 12:41:54 0 [Note] InnoDB: Trx id counter is 37315565
2023-04-17 12:41:54 0 [Note] InnoDB: To recover: 205230 pages
2023-04-17 12:42:02 0 [Note] InnoDB: Rollback of non-prepared transactions completed
2023-04-17 12:42:02 0 [Note] InnoDB: Removed temporary tablespace data file: "./ibtmp1"
2023-04-17 12:42:02 0 [Note] InnoDB: Creating shared tablespace for temporary tables
2023-04-17 12:42:02 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
2023-04-17 12:42:02 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
2023-04-17 12:42:02 0 [Note] InnoDB: 10.6.13 started; log sequence number 5670373116; transaction id 37315585

Summary:

buffer pool size | time until parsing OOM or EOF | total recovery time
5GiB             | 27s                           | 208s
10GiB            | 67s                           | 356s
30GiB            | 106s                          | 128s

The 10GiB case was with multiple buffer pool chunks. That is why it consumed much more time than a recovery with 5GiB buffer pool in one chunk.

Comment by Marko Mäkelä [ 2023-04-17 ]

Here are some perf report numbers for the 10.6 recovery with innodb_buffer_pool_size=5G in one chunk:

  26,68%  mariadbd         mariadbd                [.] recv_sys_t::parse
  17,08%  mariadbd         mariadbd                [.] …::_M_emplace_unique<page_id_t const&, page_recv_t>
   6,52%  mariadbd         mariadbd                [.] recv_sys_t::erase
   4,53%  mariadbd         libstdc++.so.6.0.30     [.] std::_Rb_tree_increment
   4,49%  mariadbd         libc.so.6               [.] pthread_mutex_lock@@GLIBC_2.2.5
   4,34%  mariadbd         mariadbd                [.] recv_sys_t::add
   4,07%  mariadbd         mariadbd                [.] log_phys_t::apply
   4,04%  mariadbd         mariadbd                [.] recv_recover_page
   2,26%  mariadbd         mariadbd                [.] fil_space_t::get
   2,09%  mariadbd         libc.so.6               [.] __GI___pthread_mutex_unlock_usercnt
   1,49%  mariadbd         libc.so.6               [.] _int_free
   1,35%  mariadbd         mariadbd                [.] page_apply_insert_dynamic
   1,26%  mariadbd         libc.so.6               [.] __memmove_avx_unaligned_erms_rtm
   1,01%  mariadbd         [kernel.kallsyms]       [k] clear_page_erms
   0,92%  mariadbd         [kernel.kallsyms]       [k] copy_user_enhanced_fast_string
   0,87%  mariadbd         [kernel.kallsyms]       [k] syscall_exit_to_user_mode
   0,80%  mariadbd         mariadbd                [.] crc32c_3way

One possible improvement would be to invoke recv_sys.pages.clear() instead of erasing individual elements with erase(). Only a couple of percent of the CPU time was actually spent on interesting or known-to-be CPU-intensive work (applying log to a page, or computing checksums).
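The clear()-versus-erase() idea can be sketched as follows. This is a hypothetical illustration (the names are stand-ins for recv_sys.mutex and recv_sys.pages, not the real identifiers): per-element erasure pays for a mutex acquisition and tree rebalancing on every page, while a batch clear() tears down the whole tree under a single lock:

```cpp
#include <cstdint>
#include <map>
#include <mutex>

std::mutex recovery_mutex;              // stand-in for recv_sys.mutex
std::map<uint64_t, int> pending_pages;  // stand-in for recv_sys.pages

void erase_one(uint64_t page_id)  // per-element pattern: lock per page
{
  std::lock_guard<std::mutex> g(recovery_mutex);
  pending_pages.erase(page_id);
}

void clear_all()  // amortized pattern: one lock, one tree teardown
{
  std::lock_guard<std::mutex> g(recovery_mutex);
  pending_pages.clear();
}
```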

Comment by Marko Mäkelä [ 2023-04-17 ]

The out-of-memory handling in the 10.6 (pre-MDEV-14425) version of this will be inferior to the later version in two respects.

If recovery runs out of memory in the 10.6 version,

  • It will restart parsing and storing the log from the checkpoint. It can never resume from the "OOM LSN".
  • Each batch will scan the log to the end. I did not get my optimization of this to work; some DDL crash recovery tests would occasionally fail.

Comment by Marko Mäkelä [ 2023-04-18 ]

Unfortunately, both the 10.8 and 10.6 versions of this are acting up. One test that occasionally fails is innodb.alter_copy. I have not yet tested whether that test is stable in the original development version. In 11.0, the buffer pool and recovery were simplified thanks to MDEV-29694 (the removal of the InnoDB change buffer). That could play a role here.

Comment by Marko Mäkelä [ 2023-04-18 ]

I got an rr replay trace of the test innodb.alter_copy from the 10.8 version where fil_validate() fails during IORequest::write_complete(), on a tablespace that is in the process of being created in recv_sys_t::recover_deferred(). I think that we must protect that better by holding fil_system.mutex until the tablespace has been fully created. This was fixed in MDEV-31080.

Comment by Marko Mäkelä [ 2023-04-20 ]

I was asked to improve the wording of the "Recovery ran out of memory at LSN" message. I think that the following should be more informative:

[Note] InnoDB: Multi-batch recovery needed at LSN 4189599815

Comment by Marko Mäkelä [ 2023-04-21 ]

There is not enough time to test this thoroughly before the upcoming quarterly releases.

With the current fix that I have for 10.6 (which might be feasible to port to 10.5 as well), some excessive or unnecessary log scanning will take place. When the buffer pool is small, the log reads will basically be O(n²) instead of O(n). This will be better in the version that uses the MDEV-14425 log format.

Generated at Thu Feb 08 10:12:13 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.