[MDEV-11634] Improve the InnoDB change buffer - Jira

Details

Type: Task
Status: Closed (View Workflow)
Priority: Major
Resolution: Won't Do
Fix Version/s: N/A
Component/s: Storage Engine - InnoDB
Labels:
- innodb
- performance

Description

The InnoDB change buffer (https://blogs.oracle.com/mysqlinnodb/entry/mysql_5_5_innodb_change) aims to make the write patterns of leaf pages of non-unique, non-spatial indexes more sequential. If a leaf page is not present in the buffer pool, the operation can be buffered by writing a record to the special change buffer B-tree, provided that no page overflow or underflow can occur. When the page is read into the buffer pool for whatever reason, the buffered changes will be merged into it.
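The buffering and merge-on-read behaviour described above can be illustrated with a toy model. This is only a sketch: the class and method names are hypothetical and are not InnoDB functions, pages are plain Python lists, and the page overflow/underflow check is deliberately left out.

```python
from collections import defaultdict


class ToyChangeBuffer:
    """Toy model: buffer changes for leaf pages that are not cached."""

    def __init__(self):
        self.buffer_pool = {}              # page_id -> records of cached pages
        self.disk = defaultdict(list)      # page_id -> records of persistent pages
        self.buffered = defaultdict(list)  # page_id -> pending (op, record) pairs
        self.page_reads = 0                # number of simulated page reads

    def read_page(self, page_id):
        """Load a page into the buffer pool and merge buffered changes into it."""
        if page_id not in self.buffer_pool:
            self.page_reads += 1
            page = list(self.disk[page_id])
            for op, record in self.buffered.pop(page_id, []):
                self._apply(page, op, record)
            self.buffer_pool[page_id] = page
        return self.buffer_pool[page_id]

    def change(self, page_id, op, record):
        """Apply directly if the page is cached; otherwise buffer the change."""
        if page_id in self.buffer_pool:
            self._apply(self.buffer_pool[page_id], op, record)
        else:
            self.buffered[page_id].append((op, record))

    @staticmethod
    def _apply(page, op, record):
        if op == "insert":
            page.append(record)
            page.sort()
        elif op == "delete" and record in page:
            page.remove(record)
```

In this model, two changes to an uncached page cost no page read at all; the single read happens only when something else requests the page, and the pending changes are applied at that point.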

The change buffer format has severe design problems. We still support all change buffer formats (MySQL 4.0 and earlier, MySQL 4.1 with innodb_file_per_table, 5.0 with ROW_FORMAT=COMPACT, 5.5 with delete and purge buffering), even though an upgrade should always be preceded by a slow shutdown that should have emptied the change buffer.

The key in the 5.5 format is (tablespace_id, page_number, operation_count), followed by the operation code (insert/delete/purge), record metadata, and the actual data of the record.

On DROP INDEX or DROP TABLE in a shared tablespace, InnoDB cannot easily delete all buffered records for the tablespace, so it does not even try. Instead, when a page is allocated, InnoDB drops any buffered changes that still exist for it.

If the InnoDB change buffer key were something like (tablespace_id, index_id, page_number), it would be easy to discard all buffered changes for a given index. We could even avoid writing index metadata to the change buffer records. But this would require that the dictionary metadata be available to the buffer pool interface that takes care of merging buffered changes.
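To see why such a key would make discarding easy: in a structure ordered by (tablespace_id, index_id, page_number), all entries for one index form a single contiguous range. The sketch below uses a sorted Python list in place of the change buffer B-tree; the function name is hypothetical, and the key layout is the one proposed above, not the existing 5.5 format.

```python
import bisect


def discard_index(sorted_keys, tablespace_id, index_id):
    """Remove all keys of one index from a sorted list of
    (tablespace_id, index_id, page_number) tuples.

    The entries of one index are adjacent in key order, so a single
    range deletion is enough. Returns the removed keys.
    """
    # A shorter tuple sorts before any longer tuple that it prefixes,
    # so these two probes bracket exactly the keys of the given index.
    lo = bisect.bisect_left(sorted_keys, (tablespace_id, index_id))
    hi = bisect.bisect_left(sorted_keys, (tablespace_id, index_id + 1))
    removed = sorted_keys[lo:hi]
    del sorted_keys[lo:hi]
    return removed
```

With the 5.5 key (tablespace_id, page_number, operation_count), the pages of one index are scattered among those of every other index in the tablespace, so no such range exists.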

Allocating the change buffer in the InnoDB system tablespace is problematic. IMPORT/EXPORT would work better if this link to the system tablespace did not exist.
On the other hand, while having a dedicated change buffer in each tablespace would make IMPORT/EXPORT easier, the page write access patterns would be less sequential than with the current global change buffer in the system tablespace.

If the InnoDB change buffer is to be preserved, it would be good to define it as a no-rollback persistent table that privileged users can read.

Attachments

- MDEV-11634-10.5.pdf (22 kB, 2022-01-31 12:06)
- MDEV-11634-10.6.pdf (22 kB, 2022-01-31 12:06)
- MDEV-11634-10.8B.pdf (28 kB, 2022-02-17 16:32)
- Stop-buffering-delete-purge-operations.patch (57 kB, 2019-10-08 07:21)

Issue Links

is blocked by

- MDEV-17598 InnoDB index option for per-record transaction ID (Open)
- MDEV-19514 Defer change buffer merge until pages are requested (Closed)

relates to

- MDEV-515 innodb bulk insert (Closed)
- MDEV-13637 InnoDB change buffer housekeeping can cause redo log overrun and possibly deadlocks (Closed)
- MDEV-14094 benchmark effects of innodb change buffer (Closed)
- MDEV-22930 Unnecessary contention on rw_lock_list_mutex in ibuf_dummy_index_create() (Closed)
- MDEV-29694 Remove the InnoDB change buffer (Closed)
- MDEV-13485 MTR tests fail massively with --innodb-sync-debug (Closed)

Activity

Marko Mäkelä added a comment - 2019-10-08 07:28

Based on benchmarks that were run for MDEV-19344 for a 10.2 version of Stop-buffering-delete-purge-operations.patch, the change buffering for purge operations is beneficial for performance when the buffer pool is slightly smaller than the working set.

So, it looks like we should maintain (and extend) the use of delete buffering. The logical change buffer format should allow us to remove the buffer pool watch mechanism, which is part of Stop-buffering-delete-purge-operations.patch. The patch almost cleanly applies to 10.5 as of a07be05302ccc3baea83b7920e9162f3e91dfdcc; it is for the MDEV-19514 development branch.

Marko Mäkelä added a comment - 2020-05-06 06:09

MDEV-19514 preserved one invocation of arbitrary change buffer merges during normal operation: ibuf_insert_low() could invoke ibuf_read_merge_pages() to merge other buffered changes before buffering one more change. This invocation could apply to any buffered index, not the one that is currently being inserted into. As part of this task, we should remove or adjust this call. We could merge buffered changes for the current index, or we could refuse to buffer the current change because there already exist too many buffered changes for other indexes.
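The two alternatives can be written down as a small decision function. This is a sketch of the policy choice only: the names, the threshold parameter, and the dictionary of per-index counts are all hypothetical and do not correspond to the actual ibuf_insert_low() code.

```python
from enum import Enum


class Action(Enum):
    BUFFER = "buffer"                            # buffer the change as usual
    MERGE_CURRENT_INDEX = "merge_current_index"  # merge this index's changes first
    REFUSE = "refuse"                            # do not buffer; apply to the page directly


def decide(buffered_per_index, current_index, max_buffered, prefer_merge):
    """Choose what to do when one more change is about to be buffered.

    buffered_per_index maps index_id -> number of buffered changes.
    No branch ever merges changes that belong to other indexes.
    """
    total = sum(buffered_per_index.values())
    if total < max_buffered:
        return Action.BUFFER
    if prefer_merge and buffered_per_index.get(current_index, 0) > 0:
        return Action.MERGE_CURRENT_INDEX
    return Action.REFUSE
```

For example, `decide({1: 2, 2: 8}, current_index=1, max_buffered=10, prefer_merge=True)` selects a merge restricted to index 1, while the same call with `prefer_merge=False` refuses to buffer.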

Axel Schwenke added a comment - 2022-01-31 12:07

Attached benchmark graphs for MariaDB 10.5 and 10.6

Axel Schwenke added a comment - 2022-02-17 16:36

Attached MDEV-11634-10.8B.pdf with results for 10.8. Two 10.8 commits were tested. The newer one (blue and pink lines) includes MDEV-27774. It is faster, and the difference between innodb_change_buffering=all and innodb_change_buffering=none is smaller.

Marko Mäkelä added a comment - 2022-11-02 07:03

Come to think of it, the current InnoDB change buffer is conceptually similar to a local write-ahead log. Because it is managed in a number of persistent redo-logged data pages (the change buffer bitmap pages in persistent tablespaces and the special B-tree in the system tablespace), any operations involving the change buffer will cause excessive redo logging, compared to a situation where the changes would be applied to the index directly.

A possible reimplementation of the change buffer could avoid log write amplification by changing how the log is written:

- A record written to the global write-ahead log ib_logfile0 would merely indicate that some changes exist for an index page (or index) in a separate local log.
- Each local log (say, a "change buffer" for each index) would allow the individual index pages to be recovered.
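A toy model of this split between a global log and per-index local logs might look as follows. It is a sketch under several assumptions: the class and record names are made up, logs are in-memory lists, and one marker is written per index whose local log is non-empty (the text leaves open whether the marker is per page or per index).

```python
from collections import defaultdict


class ToyLogs:
    """Global log holds only markers; local logs hold the change payload."""

    def __init__(self):
        self.global_log = []                 # small marker records only
        self.local_logs = defaultdict(list)  # index_id -> [(page_no, change)]

    def log_change(self, index_id, page_no, change):
        """Write the payload to the local log; write a global marker once."""
        if not self.local_logs[index_id]:
            self.global_log.append(("has_local_changes", index_id))
        self.local_logs[index_id].append((page_no, change))

    def recover_page(self, index_id, page_no):
        """Return the changes to replay for one index page, in log order."""
        return [c for p, c in self.local_logs.get(index_id, []) if p == page_no]
```

In this model the global log grows by one record per index rather than one per change, which is the write amplification saving; the cost is that recovering a page requires scanning that index's local log.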

Some cases of recovery could run faster, because the global log would become much smaller. But the first access to affected indexes could be much slower, depending on how the local logs are managed. Also, log checkpoints could become extremely slow if we had to apply lots of buffered changes from the local logs to the index pages.

Further challenges are related to the MVCC implementation, at least until MDEV-17598 has been implemented. Rollback and purge currently have to check whether it is safe to remove a secondary index record, which constitutes a dependency on undo logs and concurrently active transactions. There already is an open bug about this, in MDEV-29823.
