Details
- Bug
- Status: Closed
- Priority: Critical
- Resolution: Fixed
- Affects versions: 10.5, 10.6, 10.7(EOL), 10.8(EOL), 10.9(EOL), 10.10(EOL), 10.11
- None
Description
A user is facing a scenario where mariabackup --prepare has been running for more than 12 hours, reporting page numbers to recover. It then goes through multiple iterations of recovery with no indication of how much work or how much time remains to complete the process.
It would be great if `--prepare` could report the percentage of page recovery completed per iteration, as well as a global completion percentage consolidating all iteration batches.
Attachments
- big_restore.log (33 kB)
- preparelog.txt (332 kB)
Issue Links
- blocks
  - MDEV-32042 Special handling of crash recovery in buf_page_get_gen() may cause overhead (Closed)
  - MDEV-34830 LSN in the future is not being treated as serious corruption (Closed)
- causes
  - MDEV-31386 InnoDB: Failing assertion: page_type == i_s_page_type[page_type].type_value (Closed)
  - MDEV-31827 InnoDB multi-batch recovery stops prematurely (Closed)
- is blocked by
  - MDEV-31080 InnoDB: Failing assertion: space->size == check.size in fil_validate() during deferred tablespace recovery (Closed)
- relates to
  - MDEV-11027 InnoDB log recovery is too noisy (Closed)
  - MDEV-31350 test innodb.recovery_memory failed on '21 failed attempts to flush a page' (Closed)
  - MDEV-31353 InnoDB recovery hangs after reporting corruption (Closed)
  - MDEV-31362 recv_sys_t::apply(bool): Assertion `!last_batch || recovered_lsn == scanned_lsn' failed (Closed)
  - MDEV-34062 mariadb-backup --backup is extremely slow at copying ib_logfile0 (Closed)
  - MDEV-34070 mariadb-backup --prepare with larger --use-memory may be slower (Confirmed)
  - MDEV-34907 Bogus debug assertion failure in multi-batch recovery while parsing FILE_ records (Closed)
  - MDEV-14481 Execute InnoDB crash recovery in the background (Closed)
  - MDEV-30053 mariabackup should indicate memory needed for prepare (Open)
  - MDEV-32029 Assertion failures in log_sort_flush_list upon crash recovery (Closed)
  - MDEV-34156 InnoDB fails to apply the redo log to page_compressed tablespace (Closed)
Activity
marko, the reporting is not giving any idea of how many batches are left; we really need a percentage of completion for the overall prepare phase.
Since it is reading the redo log file, it shouldn't be complicated to produce a percentage by comparing the current LSN against the final LSN. What do you think?
Thanks
Rick
I believe that there may be some overlap between recovery batches, that is, the same pages may be updated in multiple batches.
We should be able to give an estimate of how many batches will be needed after the current one, but it may be impossible to give an accurate estimate of the number of pages that are awaiting recovery in subsequent batches.
I would really like to see a complete log of --prepare or a server crash recovery that involves very many batches, so that I can validate some ideas that I have about this.
marko `--prepare` log has been attached for your reference.
To anyone who tries to reproduce this: if preparing a backup only took a short time, there cannot have been much log written since the latest checkpoint. You would probably need some concurrent write workload on the server while the backup is running. While the backup is running, the server can receive further writes and produce more log. Backup cannot switch to a later checkpoint on the fly; it will copy all log from the checkpoint that was the latest when the backup started.
Starting with MariaDB Server 10.5.7 (after the refactoring in MDEV-23855 and MDEV-23399), log checkpoints are triggered at the end of each batch of writing modified pages from the buffer pool to the data files. That can happen every second if you have enabled "background flushing" (see MDEV-27295). The innodb_log_file_size and innodb_buffer_pool_size matter a lot.
When preparing a backup, the buffer pool size is specified by the --use-memory parameter. preparelog.txt seems to be using the default size of about 100MiB. The interesting part of the log starts like this:
2022-10-29 12:50:33 0 [Note] InnoDB: Initializing buffer pool, total size = 104857600, chunk size = 104857600
|
2022-10-29 12:50:33 0 [Note] InnoDB: Completed initialization of buffer pool
|
2022-10-29 12:50:33 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=113635532803184,113635804884887
|
2022-10-29 12:50:48 0 [Note] InnoDB: Read redo log up to LSN=113648002206720
|
The difference between the two LSNs in the checkpoint message is 272,081,703 bytes. This roughly corresponds to the distance between the previous and the latest checkpoint.
In the InnoDB crash recovery, until all of the circular ib_logfile0 has been parsed, we cannot know the last recovery LSN. We would know an upper bound of it: in the worst case, there will be log records right up to the file offset where the latest checkpoint started. The formula used to be more complicated, but with MDEV-14425 (MariaDB Server 10.8), the maximum end LSN should be the checkpoint LSN plus the size of the payload area (the size of ib_logfile0 minus the 12,288-byte header).
In mariadb-backup --prepare, the last recovery LSN would actually be known: it is roughly the sum of the checkpoint LSN and the size of the append-only ib_logfile0 that was created during backup.
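To make the LSN-based percentage idea concrete, here is a minimal standalone sketch (illustrative names and made-up numbers, not InnoDB code); for server crash recovery, end_lsn would only be the upper bound discussed above:
#include <cstdint>
#include <cstdio>

// Hypothetical helper, not an InnoDB function: percentage of the redo log
// parsed so far, given the checkpoint LSN, the current parse position and
// the known (or estimated upper-bound) end LSN.
static double redo_parse_progress(uint64_t checkpoint_lsn,
                                  uint64_t current_lsn,
                                  uint64_t end_lsn)
{
  if (end_lsn <= checkpoint_lsn || current_lsn >= end_lsn)
    return 100.0;
  if (current_lsn <= checkpoint_lsn)
    return 0.0;
  return 100.0 * double(current_lsn - checkpoint_lsn)
               / double(end_lsn - checkpoint_lsn);
}

int main()
{
  // Made-up LSNs, only to show the calculation.
  std::printf("parsed %.0f%% of the redo log\n",
              redo_parse_progress(1000000, 2000000, 5000000));
}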
In the log, the latest message about reading log that I can see is the following:
2022-10-30 1:44:21 0 [Note] InnoDB: To recover: 527 pages from log
|
2022-10-30 1:44:35 0 [Note] InnoDB: Starting a batch to recover 5062 pages from redo log.
|
2022-10-30 1:44:35 0 [Note] InnoDB: Starting a batch to recover 5289 pages from redo log.
|
2022-10-30 1:44:36 0 [Note] InnoDB: Read redo log up to LSN=113635785294848
|
2022-10-30 1:44:36 0 [Note] InnoDB: Starting a batch to recover 5390 pages from redo log.
|
This is actually 19,590,039 less than the LSN at the time the checkpoint was triggered before the backup was initiated.
In MDEV-21351 (MariaDB 10.5.1), the memory management during crash recovery was simplified and improved. The rule in recv_sys_t::is_memory_exhausted() is that the number of buffer pool pages that are allocated for redo log records must not exceed a third of the size of the buffer pool. Because memory blocks can be freed as soon as all log records in them have been applied to data pages, new recovery batches can be started. That is why we are seeing so many of those tiny recovery batches (such as 5,062 pages) above, starting several times per second.
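As a minimal sketch of that rule (illustrative only; the actual check is recv_sys_t::is_memory_exhausted(), whose details may differ):
#include <cstddef>
#include <cstdio>

// Illustrative sketch of the batching rule described above, not the actual
// InnoDB implementation: once the buffer pool pages holding parsed log
// records exceed one third of the buffer pool, stop parsing and apply what
// has been stored so far in a batch.
static bool log_record_memory_exhausted(std::size_t buf_pool_pages,
                                        std::size_t pages_holding_log)
{
  return pages_holding_log > buf_pool_pages / 3;
}

int main()
{
  const std::size_t buf_pool_pages = 6400;  // about 100 MiB of 16 KiB pages
  std::printf("%d\n", log_record_memory_exhausted(buf_pool_pages, 2200));
}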
The largest batch seems to be the following:
2022-10-29 12:52:29 0 [Note] InnoDB: Starting a batch to recover 82554 pages from redo log.
|
This means that up to 33MiB of redo log records (one third of the 100MiB buffer pool) are referring to that many pages. Initially, the buffer pool has space for 4,266 pages. Maybe some page reads can be avoided (MDEV-12699) when the pages are being freed or completely initialized according to log records. The 82,554 pages are 19 times the size of the available buffer pool.
According to the log, things seem to run extremely slowly:
2022-10-29 13:14:03 0 [Note] InnoDB: To recover: 988 pages from log
|
2022-10-29 13:14:18 0 [Note] InnoDB: To recover: 929 pages from log
|
How could it take 15 seconds to apply changes to only 59 pages? Was there any other I/O or CPU load concurrently running on the computer or the storage system where the backup was being prepared? What is the maximum write speed of the storage where the backup was being prepared? Are tables in ROW_FORMAT=COMPRESSED or page_compressed being used?
What is the size of the ib_logfile0 in the backup directory?
It might make sense to change the rule "use up to one third of the buffer pool for log records" to something like "reserve two thirds of the buffer pool for data pages to process in a batch". Any performance improvements would only be possible if someone can provide a copy of a backup with which this can be reproduced.
Depending on the server workload, one trick to improve --prepare speed would be to trigger a log checkpoint right before starting the backup:
SET @save_pct= @@GLOBAL.innodb_max_dirty_pages_pct; |
SET @save_pct_lwm= @@GLOBAL.innodb_max_dirty_pages_pct_lwm; |
|
SET GLOBAL innodb_max_dirty_pages_pct_lwm=0.0; |
SET GLOBAL innodb_max_dirty_pages_pct=0.0; |
|
-- wait for the following to reach 0
|
SELECT variable_value FROM information_schema.global_status |
WHERE variable_name = 'INNODB_BUFFER_POOL_PAGES_DIRTY'; |
|
SET GLOBAL innodb_max_dirty_pages_pct = @save_pct; |
SET GLOBAL innodb_max_dirty_pages_pct_lwm = @save_pct_lwm; |
-- start the backup |
If this occurred with MariaDB 10.6.5, there have been some bug fixes to recovery performance since then: MDEV-27022, MDEV-27058, MDEV-27183, MDEV-27610, not to mention several correctness fixes (mostly affecting the recovery of DDL operations).
It would be interesting to know how fast the mariadb-backup --prepare would run when using a more recent executable.
While testing the upgrade logic of MDEV-29694, I noticed that we are missing progress reporting for scanning and parsing the log:
10.8 0751bfbcaf08c7c32de8d90b41b6c50ab525da6f |
2022-12-05 10:30:41 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=6574044319
|
2022-12-05 10:31:11 0 [Note] InnoDB: Read redo log up to LSN=6574044319
|
2022-12-05 10:31:47 0 [Note] InnoDB: Starting a batch to recover 333533 pages from redo log.
|
2022-12-05 10:31:47 0 [Note] InnoDB: To recover: 333533 pages from log
|
2022-12-05 10:32:16 0 [Note] InnoDB: To recover: 273605 pages from log
|
2022-12-05 10:32:16 0 [Note] InnoDB: 4 transaction(s) which must be rolled back or cleaned up in total 4 row operations to undo
|
2022-12-05 10:32:16 0 [Note] InnoDB: Trx id counter is 27822492
|
2022-12-05 10:32:16 0 [Note] InnoDB: Starting final batch to recover 273219 pages from redo log.
|
2022-12-05 10:32:23 0 [Note] InnoDB: 128 rollback segments are active.
|
We are supposed to generate progress reports every 15 seconds. That was not the case here, between the following lines:
2022-12-05 10:30:41 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=6574044319
|
2022-12-05 10:31:11 0 [Note] InnoDB: Read redo log up to LSN=6574044319
|
and
2022-12-05 10:31:47 0 [Note] InnoDB: To recover: 333533 pages from log
|
2022-12-05 10:32:16 0 [Note] InnoDB: To recover: 273605 pages from log
|
This was the recovery of a debug server that was killed in the middle of Sysbench oltp_update_index, with innodb_change_buffering_debug=1, running on RAM disk.
The recovery was conducted on a non-debug server using innodb_log_file_size=4G, innodb_buffer_pool_size=10G.
Because the killed server was a debug build, it was writing additional page checksum records (MDEV-18976). Disabling the creation of those records had the following effect on recovery performance:
10.8 0751bfbcaf08c7c32de8d90b41b6c50ab525da6f |
2022-12-05 11:58:26 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=7523651578
|
2022-12-05 11:58:55 0 [Note] InnoDB: Read redo log up to LSN=7523651578
|
2022-12-05 11:59:30 0 [Note] InnoDB: Starting a batch to recover 339735 pages from redo log.
|
2022-12-05 11:59:30 0 [Note] InnoDB: To recover: 339735 pages from log
|
2022-12-05 11:59:47 0 [Note] InnoDB: To recover: 158840 pages from log
|
2022-12-05 11:59:47 0 [Note] InnoDB: Starting final batch to recover 158583 pages from redo log.
|
2022-12-05 11:59:48 0 [Note] InnoDB: 128 rollback segments are active.
|
Also in this case, several messages between 11:58:26 and 11:59:30 are significantly more than 15 seconds apart.
For the previous run, we could observe a recovery speed of 59928 pages / 29 seconds = 2066 pages/s, or if we count data page bytes, 33 MB/s.
With this run, for the first batch, we can calculate 180895 pages / 17 seconds = 10641 pages/s, or 166 MiB/s. That is more reasonable.
In any case, I can confirm that the progress reporting needs to be improved. The performance feels somewhat low too; we need to check for performance bottlenecks with a profiler.
If I recover using a smaller buffer pool, then there will be frequent messages about starting a new recovery batch, like this:
2023-03-30 14:16:10 0 [Note] InnoDB: Completed initialization of buffer pool
|
2023-03-30 14:16:10 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=890094812
|
2023-03-30 14:16:25 0 [Note] InnoDB: Starting a batch to recover 12978 pages from redo log.
|
2023-03-30 14:16:25 0 [Note] InnoDB: To recover: 12978 pages from log
|
2023-03-30 14:16:26 0 [Note] InnoDB: Starting a batch to recover 13571 pages from redo log.
|
2023-03-30 14:16:30 0 [Note] InnoDB: Starting a batch to recover 80468 pages from redo log.
|
2023-03-30 14:16:37 0 [Note] InnoDB: Starting a batch to recover 84957 pages from redo log.
|
2023-03-30 14:16:40 0 [Note] InnoDB: To recover: 50796 pages from log
|
The "To recover:" messages would appear every 15 seconds. However, if I start the recovery of the same data directory with a larger buffer pool, then it will be silent for longer periods of time:
2023-03-30 14:18:05 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=890094812
|
2023-03-30 14:18:38 0 [Note] InnoDB: Trying to recover page [page id: space=5, page number=2166] from the doublewrite buffer.
|
2023-03-30 14:18:38 0 [Note] InnoDB: Recovered page [page id: space=5, page number=2166] to './test/sbtest1.ibd' from the doublewrite buffer.
|
2023-03-30 14:18:39 0 [Note] InnoDB: Read redo log up to LSN=890094812
|
2023-03-30 14:19:12 0 [Note] InnoDB: Starting a batch to recover 153487 pages from redo log.
|
2023-03-30 14:19:12 0 [Note] InnoDB: To recover: 153487 pages from log
|
2023-03-30 14:19:56 0 [Note] InnoDB: Starting a batch to recover 183721 pages from redo log.
|
2023-03-30 14:19:56 0 [Note] InnoDB: To recover: 183721 pages from log
|
2023-03-30 14:20:11 0 [Note] InnoDB: To recover: 18334 pages from log
|
2023-03-30 14:20:46 0 [Note] InnoDB: Starting a batch to recover 192135 pages from redo log.
|
2023-03-30 14:20:46 0 [Note] InnoDB: To recover: 192135 pages from log
|
2023-03-30 14:21:01 0 [Note] InnoDB: To recover: 30398 pages from log
|
2023-03-30 14:21:34 0 [Note] InnoDB: Starting a batch to recover 188234 pages from redo log.
|
2023-03-30 14:21:34 0 [Note] InnoDB: To recover: 188234 pages from log
|
2023-03-30 14:21:49 0 [Note] InnoDB: To recover: 48679 pages from log
|
2023-03-30 14:22:01 0 [Note] InnoDB: 27 transaction(s) which must be rolled back or cleaned up in total 27 row operations to undo
|
2023-03-30 14:22:01 0 [Note] InnoDB: Trx id counter is 37362293
|
2023-03-30 14:22:01 0 [Note] InnoDB: Starting final batch to recover 104686 pages from redo log.
|
2023-03-30 14:22:04 0 [Note] InnoDB: To recover: 29018 pages from log
|
2023-03-30 14:22:05 0 [Note] InnoDB: 128 rollback segments are active.
|
2023-03-30 14:22:05 0 [Note] InnoDB: Starting in background the rollback of recovered transactions
|
…
|
2023-03-30 14:22:05 0 [Note] InnoDB: Rollback of non-prepared transactions completed
|
2023-03-30 14:22:05 0 [Note] InnoDB: File './ibtmp1' size is now 12.000MiB.
|
2023-03-30 14:22:05 0 [Note] InnoDB: log sequence number 5943243155 (memory-mapped); transaction id 37362321
|
The "To recover:" messages should be output every 15 seconds, but there were no messages at 14:20:26 or 14:20:41. The "Starting a batch to recover" messages are unconditionally output at the start of each batch.
During this experiment, it also caught my eye that the recovery is occupying only one CPU core. I suppose that recovery avoids reading most data pages thanks to MDEV-12699, and most of the recovery will simply initialize pages in the buffer pool. To analyze that, I started a perf record -g during a recovery of the same data directory contents with the same innodb_buffer_pool_size=10g, and I noticed another anomaly for this repeated run:
2023-03-30 14:30:58 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=890094812
|
2023-03-30 14:31:33 0 [Note] InnoDB: Read redo log up to LSN=890094812
|
2023-03-30 14:32:07 0 [Note] InnoDB: Starting a batch to recover 153487 pages from redo log.
|
2023-03-30 14:32:07 0 [Note] InnoDB: To recover: 153487 pages from log
|
2023-03-30 14:32:55 0 [Note] InnoDB: Starting a batch to recover 183721 pages from redo log.
|
2023-03-30 14:32:55 0 [Note] InnoDB: To recover: 183721 pages from log
|
There was no progress message about "Read redo log up to" for the first 35 seconds of the recovery. Also, the second "To recover" message was issued 48 seconds after the first one. It looks like the messages are somehow being missed sporadically. Another try with innodb_buffer_pool_size=20g:
2023-03-30 14:38:19 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=890094812
|
2023-03-30 14:39:22 0 [Note] InnoDB: Read redo log up to LSN=890094812
|
2023-03-30 14:40:33 0 [Note] InnoDB: Starting a batch to recover 249675 pages from redo log.
|
2023-03-30 14:40:33 0 [Note] InnoDB: To recover: 249675 pages from log
|
2023-03-30 14:40:48 0 [Note] InnoDB: To recover: 65746 pages from log
|
2023-03-30 14:42:20 0 [Note] InnoDB: Starting a batch to recover 299196 pages from redo log.
|
2023-03-30 14:42:20 0 [Note] InnoDB: To recover: 299196 pages from log
|
2023-03-30 14:42:35 0 [Note] InnoDB: To recover: 204605 pages from log
|
2023-03-30 14:42:50 0 [Note] InnoDB: To recover: 76653 pages from log
|
2023-03-30 14:43:05 0 [Note] InnoDB: To recover: 3039 pages from log
|
2023-03-30 14:43:18 0 [Note] InnoDB: 27 transaction(s) which must be rolled back or cleaned up in total 27 row operations to undo
|
2023-03-30 14:43:18 0 [Note] InnoDB: Trx id counter is 37362293
|
2023-03-30 14:43:18 0 [Note] InnoDB: Starting final batch to recover 104676 pages from redo log.
|
2023-03-30 14:43:20 0 [Note] InnoDB: To recover: 79977 pages from log
|
2023-03-30 14:43:22 0 [Note] InnoDB: 128 rollback segments are active.
|
2023-03-30 14:43:22 0 [Note] InnoDB: Starting in background the rollback of recovered transactions
|
…
|
2023-03-30 14:43:22 0 [Note] InnoDB: Rollback of non-prepared transactions completed
|
2023-03-30 14:43:22 0 [Note] InnoDB: Removed temporary tablespace data file: "./ibtmp1"
|
2023-03-30 14:43:22 0 [Note] InnoDB: Setting O_DIRECT on file ./ibtmp1 failed
|
2023-03-30 14:43:22 0 [Note] InnoDB: Setting file './ibtmp1' size to 12.000MiB. Physically writing the file full; Please wait ...
|
2023-03-30 14:43:22 0 [Note] InnoDB: File './ibtmp1' size is now 12.000MiB.
|
2023-03-30 14:43:22 0 [Note] InnoDB: log sequence number 5943243155 (memory-mapped); transaction id 37362321
|
Right before issuing the first message "Starting a batch" it was busy parsing the redo log. The "Read redo log up to LSN" message is intended to be issued every 15 seconds, just like the "To recover:" ones. Because perf failed to collect samples for some reason, I ran it once more, this time with innodb_buffer_pool_size=30g, to be met with similar initial silence:
2023-03-30 14:50:36 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=890094812
|
2023-03-30 14:52:09 0 [Note] InnoDB: Read redo log up to LSN=890094812
|
2023-03-30 14:53:59 0 [Note] InnoDB: Starting a batch to recover 343907 pages from redo log.
|
2023-03-30 14:53:59 0 [Note] InnoDB: To recover: 343907 pages from log
|
2023-03-30 14:54:14 0 [Note] InnoDB: To recover: 162142 pages from log
|
2023-03-30 14:54:29 0 [Note] InnoDB: To recover: 16129 pages from log
|
2023-03-30 14:55:18 0 [Note] InnoDB: To recover: 200958 pages from log
|
2023-03-30 14:55:20 0 [Note] InnoDB: 27 transaction(s) which must be rolled back or cleaned up in total 27 row operations to undo
|
2023-03-30 14:55:20 0 [Note] InnoDB: Trx id counter is 37362293
|
2023-03-30 14:55:20 0 [Note] InnoDB: Starting final batch to recover 200464 pages from redo log.
|
2023-03-30 14:55:33 0 [Note] InnoDB: To recover: 61463 pages from log
|
2023-03-30 14:55:42 0 [Note] InnoDB: 128 rollback segments are active.
|
2023-03-30 14:55:42 0 [Note] InnoDB: Starting in background the rollback of recovered transactions
|
…
|
2023-03-30 14:55:42 0 [Note] InnoDB: Rollback of non-prepared transactions completed
|
2023-03-30 14:55:42 0 [Note] InnoDB: File './ibtmp1' size is now 12.000MiB.
|
2023-03-30 14:55:42 0 [Note] InnoDB: log sequence number 5943243155 (memory-mapped); transaction id 37362321
|
During the 3½ minutes of the (necessarily) single-threaded initial parsing of the redo log, only one "Read redo log up to LSN" message was issued. The "To recover" messages were issued almost reliably during this run. The perf report data shows that some page reads did take place, so some log should have been applied in multiple threads. Yet, the CPU utilization hovered close to 100% (of one CPU core) for the entire run.
I got a bit distracted with recovery performance testing, reading the perf report. In my previous comment, I mentioned recovery with innodb_buffer_pool_size=30g:
2023-03-30 14:50:36 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=890094812
|
…
|
2023-03-30 14:55:42 0 [Note] InnoDB: log sequence number 5943243155 (memory-mapped); transaction id 37362321
|
The total time was 5 minutes and 6 seconds. I managed to optimize away 2 minutes:
2023-03-30 18:09:29 0 [Note] InnoDB: Completed initialization of buffer pool
|
2023-03-30 18:11:38 0 [Note] InnoDB: To recover: 447139 pages from log
|
2023-03-30 18:11:45 0 [Note] InnoDB: 27 transaction(s) which must be rolled back or cleaned up in total 27 row operations to undo
|
2023-03-30 18:11:45 0 [Note] InnoDB: Trx id counter is 37362293
|
2023-03-30 18:11:45 0 [Note] InnoDB: Starting final batch to recover 446636 pages from redo log.
|
2023-03-30 18:11:53 0 [Note] InnoDB: To recover: 296136 pages from log
|
2023-03-30 18:12:08 0 [Note] InnoDB: To recover: 89868 pages from log
|
2023-03-30 18:12:23 0 [Note] InnoDB: To recover: 23926 pages from log
|
2023-03-30 18:12:33 0 [Note] InnoDB: 128 rollback segments are active.
|
…
|
2023-03-30 18:12:33 0 [Note] InnoDB: log sequence number 5943243155 (memory-mapped); transaction id 37362321
|
That is 3 minutes and 4 seconds for the recovery. Most of the time goes into parsing and storing log records, which I think must be single-threaded.
I tried to revert one change, to see if it was for the better or for the worse:
2023-03-30 18:37:54 0 [Note] InnoDB: Completed initialization of buffer pool
|
2023-03-30 18:40:11 0 [Note] InnoDB: To recover: 447139 pages from log
|
2023-03-30 18:40:21 0 [Note] InnoDB: 27 transaction(s) which must be rolled back or cleaned up in total 27 row operations to undo
|
2023-03-30 18:40:21 0 [Note] InnoDB: Trx id counter is 37362293
|
2023-03-30 18:40:21 0 [Note] InnoDB: Starting final batch to recover 446636 pages from redo log.
|
2023-03-30 18:40:26 0 [Note] InnoDB: To recover: 364557 pages from log
|
2023-03-30 18:40:41 0 [Note] InnoDB: To recover: 128886 pages from log
|
2023-03-30 18:40:56 0 [Note] InnoDB: To recover: 43070 pages from log
|
2023-03-30 18:41:10 0 [Note] InnoDB: 128 rollback segments are active.
|
…
|
2023-03-30 18:41:10 0 [Note] InnoDB: Setting file './ibtmp1' size to 12.000MiB. Physically writing the file full; Please wait ...
|
This is 3 minutes and 16 seconds, a little worse.
The last 48 or 49 seconds of these runs are actual log application, which could be parallelized. If we assume 16 concurrent threads, ideally that could be shaved to 3 seconds; that is, the total time for the faster run would be 2 minutes and 32 seconds. I think that I will try that as well. The more reliable progress reporting would only be icing on the cake.
The messages InnoDB: To recover: would only be output when applying log to a page in recv_recover_page(). The initial parsing stage can take a long time. I added an intentionally typo-ed message to the log parsing loop in recv_scan_log(). This is what the output looks like (recovering a different data directory than yesterday):
2023-03-31 12:06:21 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=45451
|
2023-03-31 12:06:36 0 [Note] InnoDB: To resover: 110099 pages from log
|
2023-03-31 12:06:51 0 [Note] InnoDB: To resover: 160222 pages from log
|
2023-03-31 12:07:06 0 [Note] InnoDB: To resover: 213224 pages from log
|
2023-03-31 12:07:21 0 [Note] InnoDB: To resover: 262026 pages from log
|
2023-03-31 12:07:41 0 [Note] InnoDB: Read redo log up to LSN=45451
|
2023-03-31 12:07:56 0 [Note] InnoDB: To resover: 91522 pages from log
|
2023-03-31 12:08:11 0 [Note] InnoDB: To resover: 130295 pages from log
|
2023-03-31 12:08:26 0 [Note] InnoDB: To resover: 169361 pages from log
|
2023-03-31 12:08:41 0 [Note] InnoDB: To resover: 212187 pages from log
|
2023-03-31 12:08:56 0 [Note] InnoDB: To resover: 247702 pages from log
|
2023-03-31 12:09:11 0 [Note] InnoDB: To resover: 283953 pages from log
|
2023-03-31 12:09:24 0 [Note] InnoDB: Starting a batch to recover 315871 pages from redo log.
|
2023-03-31 12:09:26 0 [Note] InnoDB: To recover: 314811 pages from log
|
2023-03-31 12:09:41 0 [Note] InnoDB: To recover: 100224 pages from log
|
2023-03-31 12:09:56 0 [Note] InnoDB: To recover: 8134 pages from log
|
2023-03-31 12:10:11 0 [Note] InnoDB: To resover: 100637 pages from log
|
2023-03-31 12:10:26 0 [Note] InnoDB: To resover: 140568 pages from log
|
2023-03-31 12:10:41 0 [Note] InnoDB: To resover: 185950 pages from log
|
2023-03-31 12:10:56 0 [Note] InnoDB: To resover: 225344 pages from log
|
2023-03-31 12:11:11 0 [Note] InnoDB: To resover: 259438 pages from log
|
2023-03-31 12:11:22 0 [Note] InnoDB: 16 transaction(s) which must be rolled back or cleaned up in total 16 row operations to undo
|
2023-03-31 12:11:22 0 [Note] InnoDB: Trx id counter is 41494841
|
2023-03-31 12:11:22 0 [Note] InnoDB: Starting final batch to recover 271971 pages from redo log.
|
2023-03-31 12:11:26 0 [Note] InnoDB: To recover: 212473 pages from log
|
2023-03-31 12:11:41 0 [Note] InnoDB: To recover: 49345 pages from log
|
2023-03-31 12:11:51 0 [Note] InnoDB: 128 rollback segments are active.
|
…
|
2023-03-31 12:11:51 0 [Note] InnoDB: log sequence number 6269792715 (memory-mapped); transaction id 41494858
|
That is, for the first 3 minutes recovery would likely not execute recv_recover_page() at all, only recv_scan_log() (typo-ed To resover). After the first To recover message at 12:09:26 both types of messages are being interleaved.
The tight loop in recv_scan_log() might not be the right place for the message output, though. Let us compare how quickly the same data directory recovered before I added the message output:
2023-03-31 11:58:42 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=45451
|
2023-03-31 12:00:00 0 [Note] InnoDB: Read redo log up to LSN=45451
|
2023-03-31 12:01:35 0 [Note] InnoDB: Starting a batch to recover 315871 pages from redo log.
|
2023-03-31 12:01:35 0 [Note] InnoDB: To recover: 315871 pages from log
|
2023-03-31 12:01:50 0 [Note] InnoDB: To recover: 120018 pages from log
|
2023-03-31 12:02:05 0 [Note] InnoDB: To recover: 1754 pages from log
|
2023-03-31 12:03:19 0 [Note] InnoDB: To recover: 272422 pages from log
|
2023-03-31 12:03:23 0 [Note] InnoDB: 16 transaction(s) which must be rolled back or cleaned up in total 16 row operations to undo
|
2023-03-31 12:03:23 0 [Note] InnoDB: Trx id counter is 41494841
|
2023-03-31 12:03:23 0 [Note] InnoDB: Starting final batch to recover 271971 pages from redo log.
|
2023-03-31 12:03:34 0 [Note] InnoDB: To recover: 116378 pages from log
|
2023-03-31 12:03:49 0 [Note] InnoDB: To recover: 4730 pages from log
|
2023-03-31 12:03:51 0 [Note] InnoDB: 128 rollback segments are active.
|
2023-03-31 12:03:51 0 [Note] InnoDB: Starting in background the rollback of recovered transactions
|
…
|
2023-03-31 12:03:51 0 [Note] InnoDB: log sequence number 6269792715 (memory-mapped); transaction id 41494858
|
With the messages added: 12:11:51 - 12:06:21 = 5 minutes and 30 seconds
Without the messages: 12:03:51 - 11:58:42 = 5 minutes and 9 seconds
I will try to find a more suitable place for the message output, to avoid introducing any significant performance regression.
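One inexpensive pattern for that (an assumption on my part, not necessarily what was eventually committed) is to amortize the clock read over many loop iterations, so that the hot parsing loop usually only decrements a counter:
#include <chrono>
#include <cstddef>
#include <cstdio>

// Sketch only, not the InnoDB implementation: read the clock once per
// `interval` calls instead of on every iteration of a hot loop, and still
// print the progress line at most once per 15 seconds.
class cheap_progress
{
  using clock = std::chrono::steady_clock;
  clock::time_point last_report{clock::now()};
  unsigned countdown = 0;

public:
  void tick(std::size_t pages_to_recover, unsigned interval = 4096)
  {
    if (countdown)          // fast path: no clock read at all
    {
      countdown--;
      return;
    }
    countdown = interval;
    const clock::time_point now = clock::now();
    if (now - last_report >= std::chrono::seconds(15))
    {
      std::printf("To recover: %zu pages from log\n", pages_to_recover);
      last_report = now;
    }
  }
};

int main()
{
  cheap_progress p;
  for (std::size_t i = 0; i < 1000000; i++)
    p.tick(1000000 - i);
}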
I confirmed with ATTRIBUTE_NOINLINE that virtually all CPU cycles of recv_sys_t::recover_low() are being spent in recv_sys_t::free(). I believe that implementing MDEV-29445 will help with that. In total, these two functions waste over 10% of single-threaded CPU time:
5.18% mariadbd mariadbd [.] recv_sys_t::add
|
4.78% mariadbd mariadbd [.] recv_sys_t::free
|
I reran the recovery with an extra parameter --innodb-buffer-pool-chunk-size=30g to work around MDEV-25342, which would normally divide the buffer pool into 64 chunks:
2023-03-31 13:21:37 0 [Note] InnoDB: Initializing buffer pool, total size = 30.000GiB, chunk size = 30.000GiB
|
2023-03-31 13:21:38 0 [Note] InnoDB: Completed initialization of buffer pool
|
2023-03-31 13:21:38 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=45451
|
…
|
2023-03-31 13:26:07 0 [Note] InnoDB: log sequence number 6269792715 (memory-mapped); transaction id 41494858
|
There was no impact on the recovery time (still 5½ minutes, because the extra To resover messages were present). The ranking of recv_sys_t::free() dropped a bit:
2.92% mariadbd mariadbd [.] recv_sys_t::free
|
The remaining performance hog in that function is an atomic operation. We are repurposing a field in the buffer block descriptor in this custom allocator, and we do not actually need any atomic operation because we are holding recv_sys.mutex:
@@ -1411,8 +1422,11 @@ inline void recv_sys_t::free(const void *data)
|
ut_ad(block->page.state() == buf_page_t::MEMORY);
|
ut_ad(static_cast<uint16_t>(block->page.access_time - 1) <
|
srv_page_size);
|
- ut_ad(block->page.access_time >= 1U << 16);
|
- if (!((block->page.access_time -= 1U << 16) >> 16))
|
+ unsigned a= block->page.access_time;
|
+ ut_ad(a >= 1U << 16);
|
+ a-= 1U << 16;
|
+ block->page.access_time= a;
|
+ if (!(a >> 16))
|
{
|
UT_LIST_REMOVE(blocks, block);
|
MEM_MAKE_ADDRESSABLE(block->page.frame, srv_page_size); |
In recv_sys_t::alloc() we are already using relaxed loads and stores instead of the costly atomic read-modify-write (lock xadd in 80486 and its successors). With the above change, the recv_sys_t::free() improved and the recovery time improved by a whopping 25% (82 seconds) from 5½ minutes to 4 minutes and 8 seconds:
1.15% mariadbd mariadbd [.] recv_sys_t::free
|
It looks like the remaining time in this function is dominated by the loop that iterates over the buffer pool chunks (in this case, only 1, but frequently executed conditions destroy performance). That would be addressed by MDEV-29445.
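In std::atomic terms, the difference looks like this (a toy illustration of the point above; the InnoDB field in question is not a std::atomic, but the generated code is analogous):
#include <atomic>
#include <mutex>

// Toy illustration: when a counter is only ever modified while holding a
// mutex, a relaxed load plus store is enough; an atomic read-modify-write
// (fetch_sub, i.e. LOCK XADD on x86) only adds cost. Illustrative names.
struct counter_under_mutex
{
  std::mutex m;
  std::atomic<unsigned> refs{2};

  void release_expensive()     // what the code effectively did before
  {
    std::lock_guard<std::mutex> g(m);
    refs.fetch_sub(1);         // locked read-modify-write, not needed here
  }

  void release_cheap()         // what the change above switches to
  {
    std::lock_guard<std::mutex> g(m);
    unsigned a = refs.load(std::memory_order_relaxed);
    refs.store(a - 1, std::memory_order_relaxed);
  }
};

int main()
{
  counter_under_mutex c;
  c.release_expensive();
  c.release_cheap();
}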
I was thinking that something like this could improve performance a lot in case a single recovery batch is not sufficient:
diff --git a/storage/innobase/log/log0recv.cc b/storage/innobase/log/log0recv.cc
|
index 148ab4ce55b..d1479456e62 100644
|
--- a/storage/innobase/log/log0recv.cc
|
+++ b/storage/innobase/log/log0recv.cc
|
@@ -4212,7 +4212,8 @@ dberr_t recv_recovery_from_checkpoint_start()
|
ut_ad(srv_force_recovery <= SRV_FORCE_NO_UNDO_LOG_SCAN);
|
|
if (rescan) {
|
- recv_sys.lsn = log_sys.next_checkpoint_lsn;
|
+ recv_sys.lsn = std::max(log_sys.next_checkpoint_lsn,
|
+ recv_sys.last_stored_lsn);
|
rescan = recv_scan_log(true);
|
if ((recv_sys.is_corrupt_log()
|
&& !srv_force_recovery) |
This exact change will make the test innodb.recovery_memory fail, and I imagine that the failure is somehow related to the FILE_ records for file name tracking. During the first scan of the redo log, we may parse log records for tablespace files that may have been deleted or renamed after the time of the latest checkpoint. Until we see a FILE_RENAME or FILE_DELETE record for the tablespace ID, we cannot know what the right thing to do is. Currently, the log between the latest checkpoint and its corresponding FILE_CHECKPOINT record is being parsed in "store if exists" mode: if the tablespace for some pages is not found, we will discard the records. It could be better to unconditionally store all log records in recv_sys.pages and check for inconsistency upon reaching the FILE_CHECKPOINT record. Some added progress reporting highlights this problem:
2023-03-31 16:52:01 0 [Note] InnoDB: Initializing buffer pool, total size = 30.000GiB, chunk size = 30.000GiB
|
2023-03-31 16:52:01 0 [Note] InnoDB: Completed initialization of buffer pool
|
2023-03-31 16:52:01 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=45451
|
2023-03-31 16:52:16 0 [Note] InnoDB: Parsed redo log up to LSN=1742292932; to recover: 106297 pages
|
2023-03-31 16:52:31 0 [Note] InnoDB: Parsed redo log up to LSN=2311978604; to recover: 156337 pages
|
2023-03-31 16:52:46 0 [Note] InnoDB: Parsed redo log up to LSN=2942556105; to recover: 210401 pages
|
2023-03-31 16:53:01 0 [Note] InnoDB: Parsed redo log up to LSN=3504149533; to recover: 259067 pages
|
2023-03-31 16:53:23 0 [Note] InnoDB: Read redo log up to LSN=45451
|
2023-03-31 16:53:38 0 [Note] InnoDB: Parsed redo log up to LSN=1645302567; to recover: 93217 pages
|
2023-03-31 16:53:53 0 [Note] InnoDB: Parsed redo log up to LSN=2166257717; to recover: 138937 pages
|
2023-03-31 16:54:08 0 [Note] InnoDB: Parsed redo log up to LSN=2746862824; to recover: 188829 pages
|
2023-03-31 16:54:23 0 [Note] InnoDB: Parsed redo log up to LSN=3229740111; to recover: 231309 pages
|
2023-03-31 16:54:38 0 [Note] InnoDB: Parsed redo log up to LSN=3706313376; to recover: 272336 pages
|
2023-03-31 16:54:53 0 [Note] InnoDB: Parsed redo log up to LSN=4173821625; to recover: 312515 pages
|
2023-03-31 16:54:54 0 [Note] InnoDB: Starting a batch to recover 315871 pages from redo log.
|
2023-03-31 16:55:08 0 [Note] InnoDB: To recover: 95902 pages from log
|
2023-03-31 16:55:26 0 [Note] InnoDB: Parsed redo log up to LSN=4213203801; to recover: 0 pages
|
2023-03-31 16:55:40 0 [Note] InnoDB: Parsed redo log up to LSN=4665667725; to recover: 124497 pages
|
2023-03-31 16:55:55 0 [Note] InnoDB: Parsed redo log up to LSN=5227132263; to recover: 174975 pages
|
2023-03-31 16:56:10 0 [Note] InnoDB: Parsed redo log up to LSN=5722640474; to recover: 221146 pages
|
2023-03-31 16:56:25 0 [Note] InnoDB: Parsed redo log up to LSN=6131373971; to recover: 259411 pages
|
2023-03-31 16:56:34 0 [Note] InnoDB: 16 transaction(s) which must be rolled back or cleaned up in total 16 row operations to undo
|
2023-03-31 16:56:34 0 [Note] InnoDB: Trx id counter is 41494841
|
2023-03-31 16:56:34 0 [Note] InnoDB: Starting final batch to recover 271971 pages from redo log.
|
2023-03-31 16:56:40 0 [Note] InnoDB: To recover: 155518 pages from log
|
…
|
2023-03-31 16:56:53 0 [Note] InnoDB: log sequence number 6269792715 (memory-mapped); transaction id 41494847
|
The time between 16:52:01 and 16:53:23 (82 seconds) was basically wasted, because we will restart parsing the log from the checkpoint. Another way of looking at it is to interpolate the timestamp corresponding to the LSN that was displayed at 16:53:01. It could be around 16:54:30 (a little before a somewhat larger LSN was reported at 16:54:38). Again, the difference is more than 80 seconds. That is quite a bit, considering that the total recovery time is 4 minutes and 52 seconds (292 seconds). We could save another 29% of time there. Together with the previous savings in recv_sys_t::free(), this should halve the recovery time!
This could affect many users, because my recovery attempt was with a larger buffer pool (30G) than was used for originally creating the data directory (20G). With an equal-sized or smaller buffer pool and infrequent log checkpoints, a multi-batch recovery should be very likely.
Also the memory limit (do not let log records occupy more than ⅓ of the buffer pool) feels rather arbitrary. Do we really need to reserve ⅔ of a huge buffer pool for copies of data pages during recovery, or would something like 128 or 256 pages suffice? Recovery only posts batches of 32 asynchronous page reads when it needs to read pages, and during the recovery of my test datasets I hardly see any page reads taking place. However, we must keep in mind that this is a rather special test: no checkpoint had been written after the server was bootstrapped, and practically everything can be recovered (MDEV-12699) without reading any pages from the data files. I will have to test also a case where there was a checkpoint between loading the test data and running the UPDATE workload.
I tried "pushing down" the memory limit to recv_sys_t::add(), but it needs some more work, because the test innodb.recovery_memory would occasionally fail.
Today, I made some progress on "pushing down OOM checks". In case we run out of memory, we will resume parsing the log from the start of the current mini-transaction, not from the start of the log. The memory usage limit is more reasonable too: for larger buffer pools, only BUF_LRU_MIN_LEN (256) pages are reserved for the to-be-recovered pages. Even that is somewhat of an overkill, because it is 4 times the maximum value of innodb_read_io_threads=64.
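Before the log excerpt below, here is a minimal sketch of the revised limit as described (illustrative names and constants, not the committed code):
#include <cstddef>
#include <cstdio>

// Sketch of the revised rule: for large buffer pools, reserve only
// BUF_LRU_MIN_LEN pages for the data pages that a batch works on, and let
// parsed log records occupy the rest (the old rule capped log records at
// one third of the pool). Not the committed implementation.
static const std::size_t BUF_LRU_MIN_LEN = 256;

static std::size_t max_pages_for_log_records(std::size_t buf_pool_pages)
{
  return buf_pool_pages > BUF_LRU_MIN_LEN ? buf_pool_pages - BUF_LRU_MIN_LEN
                                          : 0;
}

int main()
{
  std::printf("100 MiB pool: up to %zu pages of log records\n",
              max_pages_for_log_records(6400));
  std::printf("10 GiB pool: up to %zu pages of log records\n",
              max_pages_for_log_records(655360));
}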
2023-04-03 16:38:35 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=950555258
|
2023-04-03 16:38:50 0 [Note] InnoDB: Parsed redo log up to LSN=2014717452; to recover: 126885 pages
|
2023-04-03 16:39:05 0 [Note] InnoDB: Parsed redo log up to LSN=2656759712; to recover: 181122 pages
|
2023-04-03 16:39:20 0 [Note] InnoDB: Parsed redo log up to LSN=3326397264; to recover: 234593 pages
|
2023-04-03 16:39:35 0 [Note] InnoDB: Parsed redo log up to LSN=3906055802; to recover: 283483 pages
|
2023-04-03 16:39:50 0 [Note] InnoDB: Parsed redo log up to LSN=4484524600; to recover: 332314 pages
|
2023-04-03 16:40:05 0 [Note] InnoDB: Parsed redo log up to LSN=5066015464; to recover: 381400 pages
|
2023-04-03 16:40:20 0 [Note] InnoDB: Parsed redo log up to LSN=5680977027; to recover: 428113 pages
|
2023-04-03 16:40:35 0 [Note] InnoDB: To recover: 454513 pages from log
|
2023-04-03 16:40:36 0 [Note] InnoDB: 26 transaction(s) which must be rolled back or cleaned up in total 26 row operations to undo
|
2023-04-03 16:40:36 0 [Note] InnoDB: Trx id counter is 37992126
|
2023-04-03 16:40:36 0 [Note] InnoDB: Starting final batch to recover 454423 pages from redo log.
|
2023-04-03 16:40:50 0 [Note] InnoDB: To recover: 206743 pages from log
|
2023-04-03 16:41:05 0 [Note] InnoDB: To recover: 67047 pages from log
|
2023-04-03 16:41:20 0 [Note] InnoDB: To recover: 22798 pages from log
|
2023-04-03 16:41:30 0 [Note] InnoDB: 128 rollback segments are active.
|
2023-04-03 16:41:30 0 [Note] InnoDB: Starting in background the rollback of recovered transactions
|
…
|
2023-04-03 16:41:31 0 [Note] InnoDB: log sequence number 6041702829 (memory-mapped); transaction id 37992153
|
This run is not exactly comparable to what I ran on Friday, because there is almost 1GiB less of log to recover (the checkpoint LSN starts at almost 1G). Anyway, the 82-second waste is gone, and we are down from 292 seconds to 176 seconds. For a fairer comparison to the previous run, let us multiply that by 6/5: 1.2 × 176 s ≈ 3.5 minutes, much less than the almost 5 minutes.
To speed up recovery further, I think that both RECV_WILL_NOT_READ and RECV_NOT_PROCESSED should be treated by buf_read_recv_pages(). For RECV_WILL_NOT_READ, we need to introduce a "fake read" operation, which would invoke the recovery function directly, without actually reading anything from the file system.
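As a conceptual sketch of that dispatch (my own illustration, not InnoDB code): a page in the RECV_WILL_NOT_READ state would have its apply step invoked directly, as if a read had just completed, while other pages go through a normal asynchronous read first.
#include <cstdio>
#include <functional>
#include <vector>

// Conceptual sketch only. The enum mirrors the state names mentioned above;
// everything else is illustrative.
enum class recv_state { WILL_NOT_READ, NOT_PROCESSED };

struct page_task
{
  unsigned page_no;
  recv_state state;
};

static void apply_log_to_page(unsigned page_no)
{
  std::printf("applying log to page %u\n", page_no);
}

static void submit(const page_task &t,
                   const std::function<void(unsigned)> &read_async)
{
  if (t.state == recv_state::WILL_NOT_READ)
    apply_log_to_page(t.page_no);   // "fake read": no file access at all
  else
    read_async(t.page_no);          // normal path: read the page, then apply
}

int main()
{
  std::vector<page_task> tasks{{1, recv_state::WILL_NOT_READ},
                               {2, recv_state::NOT_PROCESSED}};
  for (const page_task &t : tasks)
    submit(t, [](unsigned p) { std::printf("reading page %u\n", p); });
}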
I am afraid that the parsing (here: about 40 MiB/s for the first 121 seconds) must remain single-threaded. With parallel processing, it should be feasible to cut the remaining 55 seconds to 2 or 3 seconds. The original run that produced the log was with 64 concurrent connections, maybe 1 to 1½ minutes for loading the data and another 1½ minutes for the workload. That is, with that final fix, the database recovery should consume a similar amount of time as creating the database took in the first place.
It looks like I may have gotten the improved OOM processing to work. We will avoid re-parsing everything from the recovery checkpoint onwards, but we do have to re-parse everything from the start of the mini-transaction at which we ran out of memory for log records. For larger buffer pools, only 256 pages will be reserved for actual buffer pool pages; the rest can be consumed by log records.
I implemented a proof of concept for asynchronous "fake reads" for multi-threaded recovery. In my quick test with innodb_log_file_size=10G and innodb_buffer_pool_size=20G, it now recovers in a single batch when using the same-size buffer pool. The concurrency is not good yet: the best I saw with innodb_read_io_threads=64 (the maximum) was slightly less than 370% of one CPU core being utilized, and typically it was around 150% during the log application. I will streamline the work submission, and I will have to retest with innodb_buffer_pool_chunk_size=20g because the recv_sys.mutex could be a bottleneck in the memory management:
24,89% mariadbd mariadbd [.] page_recv_t::recs_t::clear
|
18,16% mariadbd mariadbd [.] recv_sys_t::parse<recv_buf, true>
|
12,39% mariadbd mariadbd [.] std::_Rb_tree<page_id_t const, std::pair<page_id_t const, page_recv_t>, std::_Select1st<std::pair<page_id_t const, page_recv_t> >, std::less<page_id_t const>, ut_allocator<std::pair<page_id_t const, page_recv_t>, true> >::_M_emplace_unique<page_id_t const&, page_recv_t>
|
5,20% mariadbd mariadbd [.] recv_recover_page
|
4,80% mariadbd mariadbd [.] recv_sys_t::add
|
3,80% mariadbd mariadbd [.] log_phys_t::apply
|
3,64% mariadbd [kernel.kallsyms] [k] clear_page_erms
|
1,21% mariadbd libc.so.6 [.] _int_free
|
1,06% mariadbd libc.so.6 [.] __memmove_avx_unaligned_erms_rtm
|
0,98% mariadbd mariadbd [.] crc32c_3way
|
0,90% mariadbd mariadbd [.] page_apply_insert_dynamic
|
Using a single buffer pool chunk sped it up, and the CPU utilization could have been over 300% for the most part (still a far cry from the 40 that nproc reports):
21,85% mariadbd mariadbd [.] recv_sys_t::parse<recv_buf, true>
|
14,65% mariadbd mariadbd [.] std::_Rb_tree<page_id_t const, std::pair<page_id_t const, page_recv_t>, std::_Select1st<std::pair<page_id_t const, page_recv_t> >, std::less<page_id_t const>, ut_allocator<std::pair<page_id_t const, page_recv_t>, true> >::_M_emplace_unique<page_id_t const&, page_recv_t>
|
7,24% mariadbd [kernel.kallsyms] [k] clear_page_erms
|
5,62% mariadbd mariadbd [.] recv_recover_page
|
4,53% mariadbd mariadbd [.] recv_sys_t::add
|
4,52% mariadbd mariadbd [.] log_phys_t::apply
|
3,90% mariadbd mariadbd [.] page_recv_t::recs_t::clear
|
1,50% mariadbd libc.so.6 [.] _int_free
|
1,26% mariadbd libc.so.6 [.] __memmove_avx_unaligned_erms_rtm
|
1,26% mariadbd mariadbd [.] page_apply_insert_dynamic
|
1,17% mariadbd [kernel.kallsyms] [k] copy_page
|
1,13% mariadbd mariadbd [.] crc32c_3way
|
It now took this much wall clock time:
2023-04-04 16:27:31 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=928593106
|
2023-04-04 16:27:46 0 [Note] InnoDB: Parsed redo log up to LSN=2027073563; to recover: 128070 pages
|
…
|
2023-04-04 16:29:01 0 [Note] InnoDB: Parsed redo log up to LSN=5331384873; to recover: 402752 pages
|
2023-04-04 16:29:16 0 [Note] InnoDB: To recover: 453542 pages from log
|
2023-04-04 16:29:21 0 [Note] InnoDB: 15 transaction(s) which must be rolled back or cleaned up in total 15 row operations to undo
|
2023-04-04 16:29:21 0 [Note] InnoDB: Trx id counter is 37845466
|
2023-04-04 16:29:21 0 [Note] InnoDB: Starting final batch to recover 453104 pages from redo log.
|
…
|
2023-04-04 16:30:04 0 [Note] InnoDB: log sequence number 6018444686 (memory-mapped); transaction id 37845482
|
The log parsing speed is about (6018444686-928593106)/105s = 46 MiB/s. That time is unaffected by the innodb_buffer_pool_chunk_size; I checked two previous runs.
Applying the log took 43 seconds. With the default innodb_buffer_pool_chunk_size and innodb_read_io_threads=4, it took 189 seconds:
2023-04-04 16:03:32 0 [Note] InnoDB: Starting final batch to recover 453104 pages from redo log.
|
…
|
2023-04-04 16:06:41 0 [Note] InnoDB: log sequence number 6018444686 (memory-mapped); transaction id 37845482
|
With the default innodb_buffer_pool_chunk_size and the maximum innodb_read_io_threads=64, it took a couple of seconds more (191 seconds):
2023-04-04 16:13:01 0 [Note] InnoDB: Starting final batch to recover 453104 pages from redo log.
|
…
|
2023-04-04 16:16:12 0 [Note] InnoDB: log sequence number 6018444686 (memory-mapped); transaction id 37845482
|
We must reduce contention on recv_sys.mutex. One way could be to lazily clear entries from recv_sys.pages. Hopefully then it will be acceptably fast even without specifying a maximal innodb_buffer_pool_chunk_size.
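One possible shape of that lazy clearing, sketched under my own assumptions (this is not the actual recv_sys.pages code): mark applied entries under the mutex and erase them in bulk later.
#include <cstdio>
#include <iterator>
#include <map>

// Sketch of "lazily clearing" a page map: applied entries are only marked
// while the mutex is held, and physically erased later in one sweep, so the
// per-page critical section stays short. Illustrative names only.
struct page_recs
{
  bool applied = false;
  // ... the parsed log records for the page would live here ...
};

struct lazy_page_map
{
  std::map<unsigned, page_recs> pages;

  void mark_applied(unsigned page_no)  // cheap, done under the mutex
  {
    pages[page_no].applied = true;
  }

  void sweep()                         // done occasionally, in bulk
  {
    for (auto it = pages.begin(); it != pages.end(); )
      it = it->second.applied ? pages.erase(it) : std::next(it);
  }
};

int main()
{
  lazy_page_map m;
  m.pages[1] = {};
  m.pages[2] = {};
  m.mark_applied(1);
  m.sweep();
  std::printf("%zu pages remain\n", m.pages.size());
}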
When we run out of memory while storing parsed log records into recv_sys.pages, we only need to rewind to the checkpoint LSN if the first call to recv_init_crash_recovery_spaces() reported that some tablespaces may be missing. Later, a special version of this validation loop could be implemented for DDL workloads.
I tried to recover from a crash that used a 20GiB buffer pool and 10GiB redo log, by using only 1GiB buffer pool:
sql/mariadbd --gdb --innodb-log-file-size=10g --innodb-buffer-pool-{,chunk-}size=1g --innodb-read-io-threads=64 --datadir /dev/shm/sbtest.backup/
|
Progress reporting looks like this:
2023-04-08 21:32:46 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=503549688
|
2023-04-08 21:32:49 0 [Note] InnoDB: Recovery ran out of memory at LSN 1083025911
|
2023-04-08 21:32:58 0 [Note] InnoDB: End of log at LSN=4483551634
|
2023-04-08 21:32:59 0 [Note] InnoDB: To recover: LSN 1153652154/4483551634; 32645 pages
|
2023-04-08 21:33:08 0 [Note] InnoDB: To recover: LSN 1637462775/4483551634; 91426 pages
|
2023-04-08 21:33:23 0 [Note] InnoDB: Parsed redo log up to LSN=1961012462; to recover: 106911 pages
|
2023-04-08 21:33:23 0 [Note] InnoDB: To recover: LSN 1987860863/4483551634; 109501 pages
|
2023-04-08 21:33:38 0 [Note] InnoDB: Parsed redo log up to LSN=2220397229; to recover: 97796 pages
|
2023-04-08 21:33:41 0 [Note] InnoDB: To recover: LSN 2330818601/4483551634; 108920 pages
|
2023-04-08 21:33:56 0 [Note] InnoDB: Parsed redo log up to LSN=2581361215; to recover: 99426 pages
|
2023-04-08 21:33:58 0 [Note] InnoDB: To recover: LSN 2667451691/4483551634; 107164 pages
|
2023-04-08 21:34:12 0 [Note] InnoDB: To recover: LSN 3002207709/4483551634; 105708 pages
|
2023-04-08 21:34:27 0 [Note] InnoDB: Parsed redo log up to LSN=3248729981; to recover: 109527 pages
|
2023-04-08 21:34:29 0 [Note] InnoDB: To recover: LSN 3318003147/4483551634; 116461 pages
|
2023-04-08 21:34:44 0 [Note] InnoDB: Parsed redo log up to LSN=3517427589; to recover: 105170 pages
|
2023-04-08 21:34:47 0 [Note] InnoDB: To recover: LSN 3638007653/4483551634; 117075 pages
|
2023-04-08 21:35:02 0 [Note] InnoDB: Parsed redo log up to LSN=3826053254; to recover: 104253 pages
|
2023-04-08 21:35:06 0 [Note] InnoDB: To recover: LSN 3959684337/4483551634; 117624 pages
|
2023-04-08 21:35:21 0 [Note] InnoDB: Parsed redo log up to LSN=4139117168; to recover: 103447 pages
|
2023-04-08 21:35:25 0 [Note] InnoDB: To recover: LSN 4280878385/4483551634; 117791 pages
|
2023-04-08 21:35:40 0 [Note] InnoDB: Parsed redo log up to LSN=4432727985; to recover: 100850 pages
|
2023-04-08 21:35:41 0 [Note] InnoDB: 14 transaction(s) which must be rolled back or cleaned up in total 14 row operations to undo
|
2023-04-08 21:35:41 0 [Note] InnoDB: Trx id counter is 26268325
|
2023-04-08 21:35:42 0 [Note] InnoDB: To recover: 105496 pages
|
2023-04-08 21:35:46 0 [Note] InnoDB: 128 rollback segments are active.
|
2023-04-08 21:35:46 0 [Note] InnoDB: Starting in background the rollback of recovered transactions
|
…
|
2023-04-08 21:35:46 0 [Note] InnoDB: log sequence number 4483551634 (memory-mapped); transaction id 26268340
|
At the start of each recovery batch, a message "To recover:" will be output unconditionally. Subsequent messages of the same or different type will be output every 15 seconds. The final recovery batch (105496 pages) started at 21:35:42 and finished at 21:35:46. For reporting progress before the final batch, we report both the current recovered LSN and the final LSN, as well as the number of pages currently waiting to be recovered. The number of pages increases during parsing and decreases while applying log.
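A small sketch of how such a combined line can be composed from the quantities mentioned (current recovered LSN, final LSN, pending pages); the names are illustrative:
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Illustrative only: format the dual progress line shown above,
// "To recover: LSN <current>/<final>; <N> pages".
static void report_batch_progress(uint64_t current_lsn, uint64_t final_lsn,
                                  std::size_t pending_pages)
{
  std::printf("To recover: LSN %llu/%llu; %zu pages\n",
              static_cast<unsigned long long>(current_lsn),
              static_cast<unsigned long long>(final_lsn), pending_pages);
}

int main()
{
  // Figures taken from the log excerpt above.
  report_batch_progress(1153652154ULL, 4483551634ULL, 32645);
}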
For comparison, I recovered the same dataset using a 10GiB buffer pool (half the size of the original buffer pool). It will run out of memory, even though the redo log is only 10GiB:
2023-04-08 21:40:07 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=503549688
|
2023-04-08 21:40:22 0 [Note] InnoDB: Parsed redo log up to LSN=1990840177; to recover: 124806 pages
|
2023-04-08 21:40:37 0 [Note] InnoDB: Parsed redo log up to LSN=2729777071; to recover: 186123 pages
|
2023-04-08 21:40:52 0 [Note] InnoDB: Parsed redo log up to LSN=3488599173; to recover: 248397 pages
|
2023-04-08 21:41:07 0 [Note] InnoDB: Parsed redo log up to LSN=4177856618; to recover: 306469 pages
|
2023-04-08 21:41:07 0 [Note] InnoDB: Recovery ran out of memory at LSN 4189599815
|
2023-04-08 21:41:07 0 [Note] InnoDB: End of log at LSN=4483551634
|
2023-04-08 21:41:08 0 [Note] InnoDB: To recover: LSN 4189599815/4483551634; 307490 pages
|
2023-04-08 21:41:23 0 [Note] InnoDB: To recover: LSN 4189599815/4483551634; 197159 pages
|
2023-04-08 21:41:38 0 [Note] InnoDB: To recover: LSN 4189599815/4483551634; 67623 pages
|
2023-04-08 21:41:53 0 [Note] InnoDB: Parsed redo log up to LSN=4353924218; to recover: 102083 pages
|
2023-04-08 21:41:57 0 [Note] InnoDB: 14 transaction(s) which must be rolled back or cleaned up in total 14 row operations to undo
|
2023-04-08 21:41:57 0 [Note] InnoDB: Trx id counter is 26268325
|
2023-04-08 21:41:57 0 [Note] InnoDB: To recover: 114621 pages
|
2023-04-08 21:42:01 0 [Note] InnoDB: 128 rollback segments are active.
|
2023-04-08 21:42:01 0 [Note] InnoDB: Starting in background the rollback of recovered transactions
|
…
|
2023-04-08 21:42:01 0 [Note] InnoDB: log sequence number 4483551634 (memory-mapped); transaction id 26268340
|
Finally, with a 20GiB buffer pool (the same size as on the server that was killed during the workload):
2023-04-08 21:43:04 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=503549688
|
2023-04-08 21:43:19 0 [Note] InnoDB: Parsed redo log up to LSN=1998701027; to recover: 125560 pages
|
2023-04-08 21:43:34 0 [Note] InnoDB: Parsed redo log up to LSN=2734136874; to recover: 186446 pages
|
2023-04-08 21:43:49 0 [Note] InnoDB: Parsed redo log up to LSN=3499505504; to recover: 249378 pages
|
2023-04-08 21:44:04 0 [Note] InnoDB: Parsed redo log up to LSN=4183247844; to recover: 306964 pages
|
2023-04-08 21:44:10 0 [Note] InnoDB: End of log at LSN=4483551634
|
2023-04-08 21:44:12 0 [Note] InnoDB: 14 transaction(s) which must be rolled back or cleaned up in total 14 row operations to undo
|
2023-04-08 21:44:12 0 [Note] InnoDB: Trx id counter is 26268325
|
2023-04-08 21:44:14 0 [Note] InnoDB: To recover: 331797 pages
|
2023-04-08 21:44:36 0 [Note] InnoDB: 128 rollback segments are active.
|
2023-04-08 21:44:36 0 [Note] InnoDB: Starting in background the rollback of recovered transactions
|
2023-04-08 21:44:36 0 [Note] InnoDB: To roll back: 14 transactions, 14 rows
|
…
|
2023-04-08 21:44:36 0 [Note] InnoDB: log sequence number 4483551634 (memory-mapped); transaction id 26268328
|
Summary:
buffer pool size | time until end-of-log detection | total recovery time
1GiB             | 12s                             | 180s
10GiB            | 60s                             | 114s
20GiB            | 66s                             | 92s
When using a smaller buffer pool, the end LSN will be reached faster, because once we run out of memory, the rest will be just parsed and discarded. When all parsed log fits in the buffer pool, parsing will take 66 seconds and applying the log takes 26 seconds. When pages need to be read from the file system and written to slower storage than the RAM disk that I was using, it will probably look quite different. In any case, using more RAM and CPUs should be helpful.
When pages need to be read from the file system, recv_sys.mutex may continue to be a significant bottleneck, depending on the type of storage. I only avoid acquiring it in the new "fake asynchronous read" code path.
Before MDEV-14425 simplified the log format in MariaDB Server 10.8, there were two log parsers: a log block parser and a log record parser. The log records were not contiguous in the log file: they could be interrupted by the headers and footers of 512-byte log blocks. The mapping from log sequence number (LSN) to byte offsets was more complicated. Starting with 10.8, the log file offset is simply a base offset + LSN modulo (ib_logfile0 file size - 12288 bytes), and the only discontinuity is when ib_logfile0 wraps around from the end to byte offset 12288.
Due to this, it is not feasible to port some of the log parsing optimizations to 10.6 or earlier versions.
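In code form, the 10.8+ mapping described above looks roughly like this (a sketch under the stated assumptions; the 12,288 bytes are the header size mentioned, and the names are illustrative, not the exact log_sys code):
#include <cstdint>
#include <cstdio>

// Sketch of the 10.8+ mapping: the circular payload area of ib_logfile0
// starts at byte 12288, so a file offset is that base offset plus the LSN
// distance modulo the payload size.
static const uint64_t HEADER_SIZE = 12288;

static uint64_t lsn_to_file_offset(uint64_t lsn, uint64_t base_lsn,
                                   uint64_t file_size)
{
  const uint64_t payload = file_size - HEADER_SIZE;
  return HEADER_SIZE + (lsn - base_lsn) % payload;
}

int main()
{
  // A 1 GiB log file; made-up LSNs, just to show the calculation.
  const uint64_t file_size = 1ULL << 30;
  const uint64_t base_lsn  = 100000;
  std::printf("%llu\n",
              (unsigned long long) lsn_to_file_offset(base_lsn + 5, base_lsn,
                                                      file_size));
}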
I am working on a 10.6 port of this. It will be less efficient due to the more complicated memory management and parsing of redo log records.
The following is from a recovery of Sysbench oltp_update_index:
2023-04-17 12:21:55 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=628599973,5653727799
|
2023-04-17 12:22:10 0 [Note] InnoDB: Read redo log up to LSN=1935453184
|
2023-04-17 12:22:25 0 [Note] InnoDB: Read redo log up to LSN=2584783872
|
2023-04-17 12:22:40 0 [Note] InnoDB: Read redo log up to LSN=3218648064
|
2023-04-17 12:22:55 0 [Note] InnoDB: Read redo log up to LSN=3803884544
|
2023-04-17 12:23:02 0 [Note] InnoDB: Recovery ran out of memory at LSN 4068555446
|
2023-04-17 12:23:10 0 [Note] InnoDB: Read redo log up to LSN=5356432384
|
2023-04-17 12:23:25 0 [Note] InnoDB: Read redo log up to LSN=1747627008
|
2023-04-17 12:23:40 0 [Note] InnoDB: Read redo log up to LSN=2296949760
|
2023-04-17 12:23:55 0 [Note] InnoDB: Read redo log up to LSN=2934352896
|
2023-04-17 12:24:10 0 [Note] InnoDB: Read redo log up to LSN=3416960000
|
2023-04-17 12:24:25 0 [Note] InnoDB: Read redo log up to LSN=3911756800
|
2023-04-17 12:24:39 0 [Note] InnoDB: To recover: LSN 4375342769/4375358464; 338590 pages
|
2023-04-17 12:24:54 0 [Note] InnoDB: To recover: LSN 4375342769/4375358464; 325799 pages
|
2023-04-17 12:25:09 0 [Note] InnoDB: To recover: LSN 4375342769/4375358464; 301991 pages
|
2023-04-17 12:25:24 0 [Note] InnoDB: To recover: LSN 4375342769/4375358464; 273831 pages
|
2023-04-17 12:25:39 0 [Note] InnoDB: To recover: LSN 4375342769/4375358464; 241575 pages
|
2023-04-17 12:25:54 0 [Note] InnoDB: To recover: LSN 4375342769/4375358464; 194727 pages
|
2023-04-17 12:26:09 0 [Note] InnoDB: To recover: LSN 4375342769/4375358464; 105895 pages
|
2023-04-17 12:26:24 0 [Note] InnoDB: To recover: LSN 4375342769/4375358464; 36263 pages
|
2023-04-17 12:26:43 0 [Note] InnoDB: Read redo log up to LSN=4375424000
|
2023-04-17 12:26:58 0 [Note] InnoDB: Read redo log up to LSN=4886932480
|
2023-04-17 12:27:13 0 [Note] InnoDB: Read redo log up to LSN=5513325568
|
2023-04-17 12:27:18 0 [Note] InnoDB: 19 transaction(s) which must be rolled back or cleaned up in total 19 row operations to undo
|
2023-04-17 12:27:18 0 [Note] InnoDB: Trx id counter is 37315565
|
2023-04-17 12:27:20 0 [Note] InnoDB: To recover: 205090 pages
|
2023-04-17 12:27:51 0 [Note] InnoDB: 128 rollback segments are active.
|
2023-04-17 12:27:51 0 [Note] InnoDB: Starting in background the rollback of recovered transactions
|
2023-04-17 12:27:51 0 [Note] InnoDB: To roll back: 19 transactions, 19 rows
|
…
|
2023-04-17 12:27:51 0 [Note] InnoDB: Rollback of non-prepared transactions completed
|
2023-04-17 12:27:51 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
|
2023-04-17 12:27:51 0 [Note] InnoDB: 10.6.13 started; log sequence number 5670373116; transaction id 37315585
|
In 10.6, if we run out of memory, we will always restart parsing the log from the start.
32,71% mariadbd mariadbd [.] recv_sys_t::erase
|
19,96% mariadbd mariadbd [.] recv_sys_t::parse
|
13,90% mariadbd mariadbd [.] …::_M_emplace_unique<page_id_t const&, page_recv_t>
|
3,19% mariadbd libstdc++.so.6.0.30 [.] std::_Rb_tree_increment
|
3,16% mariadbd mariadbd [.] recv_sys_t::add
|
2,91% mariadbd mariadbd [.] recv_recover_page
|
The CPU time was dominated by adding and removing items, because I forgot to set innodb_buffer_pool_chunk_size equal to the innodb_buffer_pool_size (10GiB). With a 30GiB buffer pool (the same as before the server was killed) in one chunk, it recovers faster:
2023-04-17 12:32:59 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=628599973,5653727799
2023-04-17 12:33:14 0 [Note] InnoDB: Read redo log up to LSN=1984539648
2023-04-17 12:33:29 0 [Note] InnoDB: Read redo log up to LSN=2710875136
2023-04-17 12:33:44 0 [Note] InnoDB: Read redo log up to LSN=3358895104
2023-04-17 12:33:59 0 [Note] InnoDB: Read redo log up to LSN=3965299712
2023-04-17 12:34:14 0 [Note] InnoDB: Read redo log up to LSN=4557417472
2023-04-17 12:34:29 0 [Note] InnoDB: Read redo log up to LSN=5219527680
2023-04-17 12:34:43 0 [Note] InnoDB: 19 transaction(s) which must be rolled back or cleaned up in total 19 row operations to undo
2023-04-17 12:34:43 0 [Note] InnoDB: Trx id counter is 37315565
2023-04-17 12:34:45 0 [Note] InnoDB: To recover: 450915 pages
2023-04-17 12:35:07 0 [Note] InnoDB: 128 rollback segments are active.
2023-04-17 12:35:07 0 [Note] InnoDB: Starting in background the rollback of recovered transactions
2023-04-17 12:35:07 0 [Note] InnoDB: To roll back: 19 transactions, 19 rows
…
2023-04-17 12:35:07 0 [Note] InnoDB: Rollback of non-prepared transactions completed
2023-04-17 12:35:07 0 [Note] InnoDB: Removed temporary tablespace data file: "./ibtmp1"
2023-04-17 12:35:07 0 [Note] InnoDB: Creating shared tablespace for temporary tables
2023-04-17 12:35:07 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
2023-04-17 12:35:07 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
2023-04-17 12:35:07 0 [Note] InnoDB: 10.6.13 started; log sequence number 5670373116; transaction id 37315585
Now the CPU usage makes more sense too:
24,31% mariadbd mariadbd [.] recv_sys_t::parse
18,11% mariadbd mariadbd [.] …::_M_emplace_unique<page_id_t const&, page_recv_t>
11,15% mariadbd mariadbd [.] recv_sys_t::erase
6,68% mariadbd mariadbd [.] recv_recover_page
5,43% mariadbd mariadbd [.] log_phys_t::apply
3,74% mariadbd mariadbd [.] recv_sys_t::add
Finally, here is a recovery with a 5GiB buffer pool in one chunk:
2023-04-17 12:38:34 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=628599973,5653727799
2023-04-17 12:38:49 0 [Note] InnoDB: Read redo log up to LSN=1963895808
2023-04-17 12:39:01 0 [Note] InnoDB: Recovery ran out of memory at LSN 2534560930
2023-04-17 12:39:04 0 [Note] InnoDB: Read redo log up to LSN=3312233472
2023-04-17 12:39:19 0 [Note] InnoDB: Read redo log up to LSN=1599646720
2023-04-17 12:39:34 0 [Note] InnoDB: Read redo log up to LSN=2160831488
2023-04-17 12:39:48 0 [Note] InnoDB: To recover: LSN 2806789376/2806819840; 195082 pages
2023-04-17 12:40:03 0 [Note] InnoDB: To recover: LSN 2806789376/2806819840; 63507 pages
2023-04-17 12:40:18 0 [Note] InnoDB: Read redo log up to LSN=3195776000
2023-04-17 12:40:33 0 [Note] InnoDB: Read redo log up to LSN=3687099392
2023-04-17 12:40:48 0 [Note] InnoDB: Read redo log up to LSN=4165315584
2023-04-17 12:40:54 0 [Note] InnoDB: To recover: LSN 4374395699/4374440960; 241454 pages
2023-04-17 12:41:09 0 [Note] InnoDB: To recover: LSN 4374395699/4374440960; 123701 pages
2023-04-17 12:41:24 0 [Note] InnoDB: Read redo log up to LSN=4508724224
2023-04-17 12:41:39 0 [Note] InnoDB: Read redo log up to LSN=5094550528
2023-04-17 12:41:54 0 [Note] InnoDB: 19 transaction(s) which must be rolled back or cleaned up in total 19 row operations to undo
2023-04-17 12:41:54 0 [Note] InnoDB: Trx id counter is 37315565
2023-04-17 12:41:54 0 [Note] InnoDB: To recover: 205230 pages
…
2023-04-17 12:42:02 0 [Note] InnoDB: Rollback of non-prepared transactions completed
2023-04-17 12:42:02 0 [Note] InnoDB: Removed temporary tablespace data file: "./ibtmp1"
2023-04-17 12:42:02 0 [Note] InnoDB: Creating shared tablespace for temporary tables
2023-04-17 12:42:02 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
2023-04-17 12:42:02 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
2023-04-17 12:42:02 0 [Note] InnoDB: 10.6.13 started; log sequence number 5670373116; transaction id 37315585
Summary:
| buffer pool size | time until parsing OOM or EOF | total recovery time |
|---|---|---|
| 5GiB | 27s | 208s |
| 10GiB | 67s | 356s |
| 30GiB | 106s | 128s |
The 10GiB case was with multiple buffer pool chunks. That is why it consumed much more time than the recovery with a 5GiB buffer pool in one chunk.
Here are some perf report numbers for the 10.6 recovery with innodb_buffer_pool_size=5G in one chunk:
26,68% mariadbd mariadbd [.] recv_sys_t::parse
17,08% mariadbd mariadbd [.] …::_M_emplace_unique<page_id_t const&, page_recv_t>
6,52% mariadbd mariadbd [.] recv_sys_t::erase
4,53% mariadbd libstdc++.so.6.0.30 [.] std::_Rb_tree_increment
4,49% mariadbd libc.so.6 [.] pthread_mutex_lock@@GLIBC_2.2.5
4,34% mariadbd mariadbd [.] recv_sys_t::add
4,07% mariadbd mariadbd [.] log_phys_t::apply
4,04% mariadbd mariadbd [.] recv_recover_page
2,26% mariadbd mariadbd [.] fil_space_t::get
2,09% mariadbd libc.so.6 [.] __GI___pthread_mutex_unlock_usercnt
1,49% mariadbd libc.so.6 [.] _int_free
1,35% mariadbd mariadbd [.] page_apply_insert_dynamic
1,26% mariadbd libc.so.6 [.] __memmove_avx_unaligned_erms_rtm
1,01% mariadbd [kernel.kallsyms] [k] clear_page_erms
0,92% mariadbd [kernel.kallsyms] [k] copy_user_enhanced_fast_string
0,87% mariadbd [kernel.kallsyms] [k] syscall_exit_to_user_mode
0,80% mariadbd mariadbd [.] crc32c_3way
One possible improvement would be to invoke recv_sys.pages.clear() instead of erasing individual elements. Only a couple of percent of the CPU time was actually spent on interesting or known-to-be CPU-intensive work (applying log to a page, or computing checksums).
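To make that point concrete, here is a minimal stand-alone sketch using a plain std::map with a hypothetical value type (page_records); it is not the actual recv_sys_t::pages container or its element types.

```cpp
// Illustration only: draining a large ordered map element by element does
// per-node bookkeeping and tree rebalancing on every erase(), whereas
// clear() destroys all nodes in a single traversal.
#include <cstdint>
#include <map>

struct page_records { int placeholder; };  // hypothetical stand-in

// Loosely corresponds to the recv_sys_t::erase hot spot in the profiles above.
void drain_one_by_one(std::map<uint64_t, page_records> &pages)
{
  for (auto it = pages.begin(); it != pages.end(); )
    it = pages.erase(it);
}

// Cheaper when the whole map is being discarded anyway.
void drain_all_at_once(std::map<uint64_t, page_records> &pages)
{
  pages.clear();
}
```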
The out-of-memory handling in the 10.6 (pre-MDEV-14425) version of this will be inferior to the later version in two respects. If recovery runs out of memory in the 10.6 version:
- It will restart parsing and storing the log from the checkpoint. It can never resume from the "OOM LSN".
- Each batch will scan the log to the end. I did not get my optimization of this to work; some DDL crash recovery tests would occasionally fail.
Unfortunately, both the 10.8 and the 10.6 versions of this are acting up. One test that occasionally fails is innodb.alter_copy. I have not yet tested whether that test is stable in the original development version. In 11.0, the buffer pool and recovery were simplified thanks to MDEV-29694 (removing the InnoDB change buffer); that could play a role here.
I got an rr replay trace of the test innodb.alter_copy from the 10.8 version where fil_validate() fails during IORequest::write_complete(), on a tablespace that is in the process of being created in recv_sys_t::recover_deferred(). I think that we must protect that better by holding fil_system.mutex until the tablespace has been fully created. This was fixed in MDEV-31080.
I was asked to improve the wording of the "Recovery ran out of memory at LSN" message. I think that the following should be more informative:
[Note] InnoDB: Multi-batch recovery needed at LSN 4189599815
There is not enough time to test this thoroughly before the upcoming quarterly releases.
With the current fix that I have for 10.6 (and which might be feasible to port to 10.5 as well), some excessive or unnecessary log scanning will take place. When the buffer pool is small, the log reads will basically be O(n²) instead of O(n). This will be better in the version that uses the MDEV-14425 log format.
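A back-of-envelope model of that difference (my own sketch, not InnoDB code; the log size and batch count below are arbitrary assumptions):

```cpp
// Rough cost model: if the log between the checkpoint and its end holds
// log_bytes and recovery needs `batches` passes, then restarting every pass
// from the checkpoint scans roughly batches * log_bytes, while resuming from
// the LSN where the previous pass stopped scans each byte only once.
// Because the number of batches itself grows with the amount of log, the
// former is roughly quadratic in the log size, the latter linear.
#include <cstdint>
#include <cstdio>

int main()
{
  const uint64_t log_bytes = 5ULL << 30;  // assume ~5 GiB of log to apply
  const uint64_t batches = 8;             // assumed number of recovery batches

  const uint64_t restart_from_checkpoint = batches * log_bytes;
  const uint64_t resume_from_stored_lsn = log_bytes;

  std::printf("restart from checkpoint: ~%llu GiB scanned\n",
              (unsigned long long) (restart_from_checkpoint >> 30));
  std::printf("resume from stored LSN:  ~%llu GiB scanned\n",
              (unsigned long long) (resume_from_stored_lsn >> 30));
  return 0;
}
```

Under these assumptions the old behaviour scans about 40 GiB of log versus 5 GiB for the resuming variant.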
I am currently working around the slow prepare of our 1.7TB database when doing a manual backup and restore by specifying the amount of memory to use:
sudo mariabackup --prepare --use-memory=464G
But the problem is that when an SST happens as part of normal recovery, it uses the default 100MB for use-memory, which then results in hours and hours of preparation.
Is it not possible to set --use-memory somewhere in the configuration so that the SST recovery script will use it?
stephanvos, can you please file a separate bug against wsrep_sst_mariabackup.sh so that this parameter can be specified? It is also conceivable that mariadb-backup --prepare would read the use_memory parameter from an appropriate section of a configuration file.
Thanks, I will do so.
Thank you, I see that you filed MDEV-35749 for the wsrep_sst_mariabackup.sh problem.
In MDEV-11027, the recovery progress reporting was refactored. For a normal server startup, progress should be reported every 15 seconds. I do not immediately see any reason why it would be different for mariadb-backup --backup.
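For reference, that reporting cadence amounts to a simple time-based throttle. The sketch below is my own illustration, not the server's actual reporting code; the class name progress_reporter is hypothetical.

```cpp
// Sketch of time-throttled progress output: print at most one message
// every 15 seconds, no matter how often the caller checks in.
#include <chrono>
#include <cstdint>
#include <cstdio>

class progress_reporter
{
  std::chrono::steady_clock::time_point last_report
    = std::chrono::steady_clock::now();

public:
  void maybe_report(uint64_t current_lsn)
  {
    const auto now = std::chrono::steady_clock::now();
    if (now - last_report >= std::chrono::seconds(15))
    {
      std::printf("[Note] InnoDB: Read redo log up to LSN=%llu\n",
                  (unsigned long long) current_lsn);
      last_report = now;
    }
  }
};
```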
Can you provide a test case that demonstrates how the lack of progress reporting can be reproduced?