[MDEV-19586] Replace recv_sys_t::addr_hash with a std::map Created: 2019-05-24  Updated: 2023-11-27  Resolved: 2019-06-11

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Fix Version/s: 10.5.0

Type: Task Priority: Major
Reporter: Marko Mäkelä Assignee: Marko Mäkelä
Resolution: Fixed Votes: 0
Labels: performance, recovery

Attachments: File Remove_mlog_init.patch     File Remove_recv_data_t.patch    
Issue Links:
PartOf
Relates
relates to MDEV-23380 InnoDB reads a page from disk despite... Closed
relates to MDEV-12353 Efficient InnoDB redo log record format Closed
relates to MDEV-12699 Improve crash recovery of corrupted d... Closed
relates to MDEV-27326 Mariabackup being overwhelmed during ... Closed

 Description   

MDEV-12353 suggests one idea for improving InnoDB crash recovery speed: Read the to-be-recovered pages sorted by page number. Currently, recv_apply_hashed_log_recs() picks a ‘random’ page number from recv_sys->addr_hash and then reads a number of pages around that number.

The most straightforward way to achieve this would seem to be to replace the hash table with an ordered data structure, such as std::map, which would automatically be iterated in an ordered fashion.

While we are at it, we should merge the MDEV-12699 mlog_init_t into the same data structure.



 Comments   
Comment by Marko Mäkelä [ 2019-05-31 ]

kevg gave some valuable feedback:

  • The std::forward_list node carries a pointer to the next element, but also a pointer to the data. It would be better to store the data inline in the node, next to the pointer to the next element.
  • recv_sys.pages does not need to be a pointer.

Some more things remain to be done:

  • Remove mlog_init_t and store the information in recv_sys.pages.
  • Remove recv_data_copy_to_buf() and make the records in recv_sys.pages point directly to recv_sys.buf.
Comment by Marko Mäkelä [ 2019-06-10 ]

It turns out that some cleanup has to be deferred.

Comment by Marko Mäkelä [ 2019-06-11 ]

I evaluated the performance with the following simple benchmark:

--source include/have_innodb.inc
--source include/have_sequence.inc
--source include/have_partition.inc
 
SET GLOBAL innodb_flush_log_at_trx_commit=1;
CREATE TABLE t1 (a SERIAL, b CHAR(255) NOT NULL)
ENGINE=InnoDB
PARTITION BY HASH(a) PARTITIONS 30
SELECT NULL, 'b' FROM seq_1_to_1000000;
SELECT now();
--let $shutdown_timeout= 0
--source include/shutdown_mysqld.inc
--source include/start_mysqld.inc
SELECT now();
DROP TABLE t1;

./mtr --mysqld=--skip-innodb-buffer-pool-load-at-startup --mysqld=--innodb-buffer-pool-size=100m --mysqld=--innodb-page-size=4k innodb.recover

To my surprise, recv_sys_t::add() turned out to be a bottleneck: it would spend a lot of time seeking to the end of the singly-linked list of records. Introducing the shortcut pointer recv_sys_t::recs_t::last fixed this bottleneck, and the performance became similar to that of the hash table implementation.

I found the bottleneck by removing the line

--source include/start_mysqld.inc

(which would lead to the test failing), then backing up the data directory, and timing the recovery as follows:

sudo perf record su marko -c '../sql/mysqld --skip-innodb-buffer-pool-load-at-startup --innodb-page-size=4k --innodb-buffer-pool-size=100m --lc-messages-dir=/dev/shm/10.5/sql/share/  --datadir=/dev/shm/10.5/mysql-test/var/log/innodb.recover-innodb/mysqld.1/data --shutdown'

The last (invalid) option will cause the server to shut down immediately after InnoDB startup. Finally, I would analyze the data with

sudo perf report

This could be repeated as many times as needed, always restoring the data directory in between, so that exactly the same redo log records would be processed on each execution.

A larger recovery test, with a bigger buffer pool and data set and possibly on a hard disk, would be needed to observe a significant performance difference.

Generated at Thu Feb 08 08:52:49 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.