[MDEV-11634] Improve the InnoDB change buffer Created: 2016-12-22 Updated: 2023-03-06 Resolved: 2023-03-06 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Fix Version/s: | N/A |
| Type: | Task | Priority: | Major |
| Reporter: | Marko Mäkelä | Assignee: | Unassigned |
| Resolution: | Won't Do | Votes: | 3 |
| Labels: | innodb, performance | ||
| Attachments: |
|
||||||||||||||||||||||||||||||||||||||||
| Issue Links: |
|
||||||||||||||||||||||||||||||||||||||||
| Description |
|
The InnoDB change buffer https://blogs.oracle.com/mysqlinnodb/entry/mysql_5_5_innodb_change aims to make the write patterns of leaf pages of non-unique, non-spatial indexes more sequential. If a leaf page is not present in the buffer pool, the operation can be buffered by writing a record to the special change buffer B-tree, provided that no page overflow or underflow can occur. When the page is read into the buffer pool for whatever reason, the change buffer will be merged to it. The change buffer format has severe design problems. Actually we still support all change buffer formats (MySQL 4.0 and earlier MySQL 4.1 with innodb_file_per_table, 5.0 with ROW_FORMAT=COMPACT, 5.5 with delete and purge buffering), even though an upgrade should always be preceded by a slow shutdown that should have emptied the change buffer. The key in the 5.5 format is (tablespace_id, page_number, operation_count), followed by the operation code (insert/delete/purge), record metadata, and the actual data of the record. On DROP INDEX or DROP TABLE in a shared tablespace, InnoDB cannot easily delete all buffered records for the tablespace. So, it will not even try. Instead, on page allocation, InnoDB will try to drop buffered changes if any existed. If the InnoDB change buffer key was something like (tablespace_id, index_id, page_number), it would be easy to discard all buffered changes for a given index. We could even avoid writing index metadata to the change buffer records. But this would require that the dictionary metadata be available to the buffer pool interface that takes care of merging buffered changes. Allocating the change buffer in the InnoDB system tablespace is problematic. IMPORT/EXPORT would work better if this link to the system tablespace did not exist. If the InnoDB change buffer is to be preserved, it would be good to define it as a no-rollback persistent table that privileged users can read. |
| Comments |
| Comment by Marko Mäkelä [ 2019-04-09 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The implementation of For We could also allow the change buffer merge to split or merge pages as needed. This would remove the need for the "free bits bitmap". We might eliminate the entire change buffer bitmap pages (page 1+n·innodb_page_size), and incorporate the "merge needed" bit to the allocation bitmap pages (page n·innodb_page_size). This optimized format would be only available to data files that are in the | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-09-30 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
As noted in We can improve the change buffer by:
With such a logical format, there is no need to avoid any page splits or merges, or to keep track of the fill factor of individual pages. We will also be able to mark a unique index corrupted if duplicate values were successfully buffered during unique_checks=0. Starting the key with (table_id,index_id) allows us to easily drop any buffered changes when a table or index is dropped or rebuilt. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-10-01 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
A change buffer merge would be initiated whenever a search reaches a secondary index leaf page for which the bitmap page indicates there to exist buffered changes. Based on the potential key range of the secondary index leaf page (which is indicated by the keys on the above-leaf-level page), we would be able to find all buffered changes in the page. The change buffer records could be in one of the following formats:
The sequence_number field would be needed for buffering delete-mark and delete (purge) operations. If we only buffer inserts, then the order of applying the operations does not matter. If we only support insert buffering, we can simply seek to (table_id,index_id,index_fields) with the minimum bound of the index_fields for the leaf page, and then scan the range until the maximum bound of the index_fields is reached, or we run out of records for that (table_id,index_id). If we support change buffering, we have to seek to (table_id,index_id,index_page_number,0) and then scan the range with increasing sequence_number until we run out of records for that (table_id,index_id,index_page_number). If applying the buffered changes involves splitting or merging secondary index leaf pages to another page for which buffered changes exist, it could get a little tricky, but that could be manageable if we set the ‘buffered changes exist’ flag conservatively, forcing any access to the adjacent pages to check if any buffered changes can be applied. This could be simpler if we do not store index_page_number in the change buffer key. It looks like we should determine whether innodb_change_buffering=all brings any performance benefit over innodb_change_buffering=inserts. If not, maybe the new format should support only insert buffering? When the delete-buffering feature was originally implemented for the InnoDB+ that later became part of MySQL 5.5, no such experiments were conducted. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-10-03 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
It turns out that the delete (purge) buffering (BTR_DELETE) is only attempted on purge, and not at all by rollback. Here is proof of that:
The test is reporting
both before and after the XA ROLLBACK statement. None of the secondary index pages are present in the buffer pool after the slow shutdown, and the pages are only being loaded by the XA ROLLBACK. That rollback is invoking buf_page_get_gen() not only for the secondary index root page and some node pointer pages, but also to retrieve the leaf page:
In fact, the flag BTR_DELETE (which could have been passed here) is only being used by purge, never by rollback. Similarly, the flag BTR_DELETE_MARK is only being used by row_upd_sec_index_entry() (as part of an DELETE, or an UPDATE of a PRIMARY KEY column), never on rollback. So, the delete buffering was never being used to its full potential. I will prepare a patch that will remove the BTR_DELETE buffering and the buffer pool watch mechanism. It will be interesting to see how this affects performance. Similar performance should be obtained by setting innodb_change_buffering=changes. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-10-04 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
It turns out that we can employ the simple change buffer record format Whenever we are about to buffer an operation, we would have to search for (table_id,index_id,index_fields) anyway, to find the correct position for inserting the record into the change buffer. If we find a match, we would merely update the operation in the change buffer record, instead of inserting a new record. The order of applying the operations to the index page does not matter, because we only buffer the latest pending operation for any given key, and applying the operations will be idempotent. We should make use of delete-mark, delete-unmark and delete buffering also on ROLLBACK, to speed up the rollback of large DELETE or UPDATE transactions. Rollback of INSERT would delete records directly from the secondary index pages, as it always has been the case. A separate benchmark has to be run to determine if it would be worthwhile to buffer delete operations (not only on purge, but also on ROLLBACK). When updating the operation of an existing change buffer record, we could try to perform some consistency checks. Perhaps a buffered 'insert' operation should never be replaced with a buffered 'delete-unmark' operation. If the to-be-buffered operation is the same as in the already buffered record, it might be a sign of corruption. Multiple 'delete' operations are probably fine, because sometimes ROLLBACK can perform a purge-like operation, and purge could attempt the same again. Also, when the ROLLBACK of a row operation occurs, sometimes the row operation was not completed for all indexes, yet the rollback will be attempted on every index. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-10-08 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Based on benchmarks that were run for So, it looks like we should maintain (and extend) the use of delete buffering. The logical change buffer format should allow us to remove the buffer pool watch mechanism, which is part of Stop-buffering-delete-purge-operations.patch | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-05-06 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Axel Schwenke [ 2022-01-31 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Attached benchmark graphs for MariaDB 10.5 and 10.6 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Axel Schwenke [ 2022-02-17 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Attached MDEV-11634-10.8B.pdf | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-11-02 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Come to think of it, the current InnoDB change buffer is conceptually similar to a local write-ahead log. Because it is managed in a number of persistent redo-logged data pages (the change buffer bitmap pages in persistent tablespaces and the special B-tree in the system tablespace), any operations involving the change buffer will cause excessive redo logging, compared to a situation where the changes would be applied to the index directly. A possible reimplementation of the change buffer could avoid log write amplification by changing the way how the log is written:
Some cases of recovery could run faster, because the global log would become much smaller. But, the first access to affected indexes could be much slower, depending on how the local logs are managed. Also, log checkpoints could become extremely slow if we would have to apply lots of buffered changes from the local logs to the index pages. Further challenges are related to the MVCC implementation at least until MDEV-17598 has been implemented. Rollback and purge currently have to check if it is safe to remove a secondary index record, which constitutes a dependency to undo logs and concurrently active transactions. There already is an open bug about this, in MDEV-29823. |