[MDEV-11634] Improve the InnoDB change buffer - Jira

Details

Type: Task
Status: Closed (View Workflow)
Priority: Major
Resolution: Won't Do
Fix Version/s: N/A
Component/s: Storage Engine - InnoDB
Labels:
- innodb
- performance

Description

The change buffer format has severe design problems. Actually we still support all change buffer formats (MySQL 4.0 and earlier MySQL 4.1 with innodb_file_per_table, 5.0 with ROW_FORMAT=COMPACT, 5.5 with delete and purge buffering), even though an upgrade should always be preceded by a slow shutdown that should have emptied the change buffer.

The key in the 5.5 format is (tablespace_id, page_number, operation_count), followed by the operation code (insert/delete/purge), record metadata, and the actual data of the record.

On DROP INDEX or DROP TABLE in a shared tablespace, InnoDB cannot easily delete all buffered records for the tablespace. So, it will not even try. Instead, on page allocation, InnoDB will try to drop buffered changes if any existed.

If the InnoDB change buffer key was something like (tablespace_id, index_id, page_number), it would be easy to discard all buffered changes for a given index. We could even avoid writing index metadata to the change buffer records. But this would require that the dictionary metadata be available to the buffer pool interface that takes care of merging buffered changes.

Allocating the change buffer in the InnoDB system tablespace is problematic. IMPORT/EXPORT would work better if this link to the system tablespace did not exist.
On the other hand, while having a dedicated change buffer in each tablespace would make IMPORT/EXPORT easier, the page write access pattens would be less sequential than with the current global change buffer in the system tablespace.

If the InnoDB change buffer is to be preserved, it would be good to define it as a no-rollback persistent table that privileged users can read.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

MDEV-11634-10.5.pdf
22 kB
2022-01-31 12:06
MDEV-11634-10.6.pdf
22 kB
2022-01-31 12:06
MDEV-11634-10.8B.pdf
28 kB
2022-02-17 16:32
Stop-buffering-delete-purge-operations.patch
57 kB
2019-10-08 07:21

Issue Links

is blocked by

MDEV-17598 InnoDB index option for per-record transaction ID

Open

MDEV-19514 Defer change buffer merge until pages are requested

Closed

relates to

MDEV-515 innodb bulk insert

Closed

MDEV-13637 InnoDB change buffer housekeeping can cause redo log overrun and possibly deadlocks

Closed

MDEV-14094 benchmark effects of innodb change buffer

Closed

MDEV-22930 Unnecessary contention on rw_lock_list_mutex in ibuf_dummy_index_create()

Closed

MDEV-29694 Remove the InnoDB change buffer

Closed

MDEV-13485 MTR tests fail massively with --innodb-sync-debug

Closed

mentioned in: Page Loading...

(3 relates to, 1 mentioned in)

Activity

Ascending order - Click to sort in descending order

Marko Mäkelä created issue - 2016-12-22 09:59

Marko Mäkelä made changes - 2017-08-24 11:07

Field	Original Value	New Value
Link		This issue relates to ~~MDEV-13485~~ [ ~~MDEV-13485~~ ]

Marko Mäkelä made changes - 2017-08-24 11:48

Link

This issue relates to ~~MDEV-13637~~ [ ~~MDEV-13637~~ ]

Ralf Gebhardt made changes - 2018-11-22 10:52

Fix Version/s

10.4 [ 22408 ]

Marko Mäkelä made changes - 2019-04-09 06:26

Link

This issue relates to ~~MDEV-14094~~ [ ~~MDEV-14094~~ ]

Marko Mäkelä made changes - 2019-04-09 06:26

Link

This issue relates to ~~MDEV-515~~ [ ~~MDEV-515~~ ]

Marko Mäkelä made changes - 2019-04-09 06:26

Link

This issue relates to ~~MDEV-14481~~ [ ~~MDEV-14481~~ ]

Marko Mäkelä added a comment - 2019-04-09 06:26 - edited

The implementation of ~~MDEV-515~~ could reduce the need for the change buffering.

For ~~MDEV-14481~~, ~~MDEV-19514~~ must defer the change buffer merge to the moment when the secondary index leaf page is requested by a user thread. Thanks to this, the index metadata will be available, and we do not have to store any metadata in the change buffer records.

We could also allow the change buffer merge to split or merge pages as needed. This would remove the need for the "free bits bitmap". We might eliminate the entire change buffer bitmap pages (page 1+n·innodb_page_size), and incorporate the "merge needed" bit to the allocation bitmap pages (page n·innodb_page_size).

This optimized format would be only available to data files that are in the ~~MDEV-12026~~ innodb_checksum_algorithm=full_crc32 format. The absence of InnoDB change buffer bitmap pages would be identified by a new bit in FSP_SPACE_FLAGS.

Marko Mäkelä added a comment - 2019-04-09 06:26 - edited The implementation of MDEV-515 could reduce the need for the change buffering. For MDEV-14481 , MDEV-19514 must defer the change buffer merge to the moment when the secondary index leaf page is requested by a user thread. Thanks to this, the index metadata will be available, and we do not have to store any metadata in the change buffer records. We could also allow the change buffer merge to split or merge pages as needed. This would remove the need for the "free bits bitmap". We might eliminate the entire change buffer bitmap pages (page 1+n· innodb_page_size ), and incorporate the "merge needed" bit to the allocation bitmap pages (page n· innodb_page_size ). This optimized format would be only available to data files that are in the MDEV-12026 innodb_checksum_algorithm=full_crc32 format. The absence of InnoDB change buffer bitmap pages would be identified by a new bit in FSP_SPACE_FLAGS .

Marko Mäkelä made changes - 2019-04-09 09:06

NRE Projects		RM_105_CANDIDATE
Labels	innodb	innodb performance

Ralf Gebhardt made changes - 2019-05-17 13:25

Link

This issue blocks ~~MDEV-16526~~ [ ~~MDEV-16526~~ ]

Marko Mäkelä made changes - 2019-05-17 13:32

Link

This issue relates to ~~MDEV-14481~~ [ ~~MDEV-14481~~ ]

Marko Mäkelä made changes - 2019-05-17 13:32

Link

This issue blocks ~~MDEV-16526~~ [ ~~MDEV-16526~~ ]

Marko Mäkelä made changes - 2019-05-17 13:33

Link

This issue is blocked by ~~MDEV-19514~~ [ ~~MDEV-19514~~ ]

Marko Mäkelä added a comment - 2019-09-30 08:35

As noted in ~~MDEV-19514~~, the change buffer does give some notable performance improvement. Hence, we should not remove it altogether.

We can improve the change buffer by:

using a dedicated data file instead of the system tablespace
changing to a logical format: (table_id,index_id,sequence_number)↦(operation,index_fields)

With such a logical format, there is no need to avoid any page splits or merges, or to keep track of the fill factor of individual pages. We will also be able to mark a unique index corrupted if duplicate values were successfully buffered during unique_checks=0.

Starting the key with (table_id,index_id) allows us to easily drop any buffered changes when a table or index is dropped or rebuilt.

Marko Mäkelä added a comment - 2019-09-30 08:35 As noted in MDEV-19514 , the change buffer does give some notable performance improvement. Hence, we should not remove it altogether. We can improve the change buffer by: using a dedicated data file instead of the system tablespace changing to a logical format: (table_id,index_id,sequence_number)↦(operation,index_fields) With such a logical format, there is no need to avoid any page splits or merges, or to keep track of the fill factor of individual pages. We will also be able to mark a unique index corrupted if duplicate values were successfully buffered during unique_checks=0 . Starting the key with (table_id,index_id) allows us to easily drop any buffered changes when a table or index is dropped or rebuilt.

Marko Mäkelä made changes - 2019-09-30 08:35

Summary

Improve or remove the InnoDB change buffer

Improve the InnoDB change buffer

Marko Mäkelä added a comment - 2019-10-01 07:28

A change buffer merge would be initiated whenever a search reaches a secondary index leaf page for which the bitmap page indicates there to exist buffered changes. Based on the potential key range of the secondary index leaf page (which is indicated by the keys on the above-leaf-level page), we would be able to find all buffered changes in the page.

The change buffer records could be in one of the following formats:

(table_id,index_id,index_fields) if we are going to support insert buffering only
(table_id,index_id,index_page_number,sequence_number)↦(operation,index_fields) for supporting insert, delete-mark, and delete (purge) buffering

The sequence_number field would be needed for buffering delete-mark and delete (purge) operations. If we only buffer inserts, then the order of applying the operations does not matter.

If we only support insert buffering, we can simply seek to (table_id,index_id,index_fields) with the minimum bound of the index_fields for the leaf page, and then scan the range until the maximum bound of the index_fields is reached, or we run out of records for that (table_id,index_id).

If we support change buffering, we have to seek to (table_id,index_id,index_page_number,0) and then scan the range with increasing sequence_number until we run out of records for that (table_id,index_id,index_page_number).

If applying the buffered changes involves splitting or merging secondary index leaf pages to another page for which buffered changes exist, it could get a little tricky, but that could be manageable if we set the ‘buffered changes exist’ flag conservatively, forcing any access to the adjacent pages to check if any buffered changes can be applied. This could be simpler if we do not store index_page_number in the change buffer key.

It looks like we should determine whether innodb_change_buffering=all brings any performance benefit over innodb_change_buffering=inserts. If not, maybe the new format should support only insert buffering? When the delete-buffering feature was originally implemented for the InnoDB+ that later became part of MySQL 5.5, no such experiments were conducted.

Marko Mäkelä added a comment - 2019-10-01 07:28 A change buffer merge would be initiated whenever a search reaches a secondary index leaf page for which the bitmap page indicates there to exist buffered changes. Based on the potential key range of the secondary index leaf page (which is indicated by the keys on the above-leaf-level page), we would be able to find all buffered changes in the page. The change buffer records could be in one of the following formats: (table_id,index_id,index_fields) if we are going to support insert buffering only (table_id,index_id,index_page_number,sequence_number)↦(operation,index_fields) for supporting insert, delete-mark, and delete (purge) buffering The sequence_number field would be needed for buffering delete-mark and delete (purge) operations. If we only buffer inserts, then the order of applying the operations does not matter. If we only support insert buffering, we can simply seek to (table_id,index_id,index_fields) with the minimum bound of the index_fields for the leaf page, and then scan the range until the maximum bound of the index_fields is reached, or we run out of records for that (table_id,index_id). If we support change buffering, we have to seek to (table_id,index_id,index_page_number,0) and then scan the range with increasing sequence_number until we run out of records for that (table_id,index_id,index_page_number). If applying the buffered changes involves splitting or merging secondary index leaf pages to another page for which buffered changes exist, it could get a little tricky, but that could be manageable if we set the ‘buffered changes exist’ flag conservatively, forcing any access to the adjacent pages to check if any buffered changes can be applied. This could be simpler if we do not store index_page_number in the change buffer key. It looks like we should determine whether innodb_change_buffering=all brings any performance benefit over innodb_change_buffering=inserts . If not, maybe the new format should support only insert buffering? When the delete-buffering feature was originally implemented for the InnoDB+ that later became part of MySQL 5.5, no such experiments were conducted.

Marko Mäkelä added a comment - 2019-10-03 06:43

It turns out that the delete (purge) buffering (BTR_DELETE) is only attempted on purge, and not at all by rollback. Here is proof of that:

--source include/have_innodb.inc

--source include/have_sequence.inc

connect (con1,localhost,root,,);

CREATE TABLE t(a SERIAL, b CHAR(255) NOT NULL DEFAULT '', KEY(b))ENGINE=InnoDB;

XA START '123';

INSERT INTO t (a) SELECT * FROM seq_1_to_10000;

XA END '123';

XA PREPARE '123';

connection default;

--echo # Kill the server to work around MDEV-742

--let $shutdown_timeout=0

--source include/restart_mysqld.inc

disconnect con1;

--echo # Initiate slow shutdown, to have an "empty" buffer pool after restart

let $datadir=`select @@datadir`;

--let $shutdown_timeout=60

SET GLOBAL innodb_fast_shutdown=0;

--source include/shutdown_mysqld.inc

--remove_file $datadir/ib_buffer_pool

--source include/start_mysqld.inc

replace_regex /.*operations:.* (insert.*delete.*delete \d+).*discarded .*/\1/;

SHOW ENGINE INNODB STATUS;

XA ROLLBACK '123';

replace_regex /.*operations:.* (insert.*delete.*delete \d+).*discarded .*/\1/;

SHOW ENGINE INNODB STATUS;

DROP TABLE t;

./mtr --mysqld=--innodb-page-size=4k testname

The test is reporting

InnoDB		insert 0, delete mark 0, delete 0

both before and after the XA ROLLBACK statement. None of the secondary index pages are present in the buffer pool after the slow shutdown, and the pages are only being loaded by the XA ROLLBACK. That rollback is invoking buf_page_get_gen() not only for the secondary index root page and some node pointer pages, but also to retrieve the leaf page:

#0  0x0000555556826717 in btr_pcur_open_low (index=0x555558448818, level=0,

    tuple=0x7fff8c1408d8, mode=PAGE_CUR_LE, latch_mode=16386,

    cursor=0x7ffff03ecae8,

    file=0x555556e70fc5 "/mariadb/10.5/storage/innobase/row/row0row.cc",

    line=1313, autoinc=0, mtr=0x7ffff03ecd90)

    at /mariadb/10.5/storage/innobase/include/btr0pcur.ic:441

#1  0x000055555682712e in row_search_index_entry (index=0x555558448818,

    entry=0x7fff8c1408d8, mode=16386, pcur=0x7ffff03ecae8, mtr=0x7ffff03ecd90)

    at /mariadb/10.5/storage/innobase/row/row0row.cc:1313

#2  0x0000555556b034c3 in row_undo_ins_remove_sec_low (mode=16386,

    index=0x555558448818, entry=0x7fff8c1408d8, thr=0x7fff8c13a268)

    at /mariadb/10.5/storage/innobase/row/row0uins.cc:280

#3  0x0000555556b03062 in row_undo_ins_remove_sec (index=0x555558448818,

    entry=0x7fff8c1408d8, thr=0x7fff8c13a268)

    at /mariadb/10.5/storage/innobase/row/row0uins.cc:344

#4  0x0000555556b005e0 in row_undo_ins_remove_sec_rec (node=0x7fff8c13eac8,

    thr=0x7fff8c13a268) at /mariadb/10.5/storage/innobase/row/row0uins.cc:524

#5  0x0000555556aff745 in row_undo_ins (node=0x7fff8c13eac8,

    thr=0x7fff8c13a268) at /mariadb/10.5/storage/innobase/row/row0uins.cc:576

#6  0x000055555684980b in row_undo (node=0x7fff8c13eac8, thr=0x7fff8c13a268)

    at /mariadb/10.5/storage/innobase/row/row0undo.cc:438

#7  0x0000555556849380 in row_undo_step (thr=0x7fff8c13a268)

    at /mariadb/10.5/storage/innobase/row/row0undo.cc:499

…

#15 0x00005555565ff26b in innobase_rollback_by_xid (hton=0x5555580cc1c8,

    xid=0x7fff8c014b70)

    at /mariadb/10.5/storage/innobase/handler/ha_innodb.cc:16737

In fact, the flag BTR_DELETE (which could have been passed here) is only being used by purge, never by rollback. Similarly, the flag BTR_DELETE_MARK is only being used by row_upd_sec_index_entry() (as part of an DELETE, or an UPDATE of a PRIMARY KEY column), never on rollback. So, the delete buffering was never being used to its full potential.

I will prepare a patch that will remove the BTR_DELETE buffering and the buffer pool watch mechanism. It will be interesting to see how this affects performance. Similar performance should be obtained by setting innodb_change_buffering=changes.

Marko Mäkelä added a comment - 2019-10-03 06:43 It turns out that the delete (purge) buffering ( BTR_DELETE ) is only attempted on purge, and not at all by rollback. Here is proof of that: --source include/have_innodb.inc --source include/have_sequence.inc connect (con1,localhost,root,,); CREATE TABLE t(a SERIAL, b CHAR (255) NOT NULL DEFAULT '' , KEY (b))ENGINE=InnoDB; XA START '123' ; INSERT INTO t (a) SELECT * FROM seq_1_to_10000; XA END '123' ; XA PREPARE '123' ; connection default ; --echo # Kill the server to work around MDEV-742 --let $shutdown_timeout=0 --source include/restart_mysqld.inc disconnect con1; --echo # Initiate slow shutdown, to have an "empty" buffer pool after restart let $datadir=` select @@datadir`; --let $shutdown_timeout=60 SET GLOBAL innodb_fast_shutdown=0; --source include/shutdown_mysqld.inc --remove_file $datadir/ib_buffer_pool --source include/start_mysqld.inc replace_regex /.*operations:.* ( insert .* delete .* delete \d+).*discarded .*/\1/; SHOW ENGINE INNODB STATUS; XA ROLLBACK '123' ; replace_regex /.*operations:.* ( insert .* delete .* delete \d+).*discarded .*/\1/; SHOW ENGINE INNODB STATUS; DROP TABLE t; ./mtr --mysqld=--innodb-page-size=4k testname The test is reporting InnoDB insert 0, delete mark 0, delete 0 both before and after the XA ROLLBACK statement. None of the secondary index pages are present in the buffer pool after the slow shutdown, and the pages are only being loaded by the XA ROLLBACK . That rollback is invoking buf_page_get_gen() not only for the secondary index root page and some node pointer pages, but also to retrieve the leaf page: #0 0x0000555556826717 in btr_pcur_open_low (index=0x555558448818, level=0, tuple=0x7fff8c1408d8, mode=PAGE_CUR_LE, latch_mode=16386, cursor=0x7ffff03ecae8, file=0x555556e70fc5 "/mariadb/10.5/storage/innobase/row/row0row.cc", line=1313, autoinc=0, mtr=0x7ffff03ecd90) at /mariadb/10.5/storage/innobase/include/btr0pcur.ic:441 #1 0x000055555682712e in row_search_index_entry (index=0x555558448818, entry=0x7fff8c1408d8, mode=16386, pcur=0x7ffff03ecae8, mtr=0x7ffff03ecd90) at /mariadb/10.5/storage/innobase/row/row0row.cc:1313 #2 0x0000555556b034c3 in row_undo_ins_remove_sec_low (mode=16386, index=0x555558448818, entry=0x7fff8c1408d8, thr=0x7fff8c13a268) at /mariadb/10.5/storage/innobase/row/row0uins.cc:280 #3 0x0000555556b03062 in row_undo_ins_remove_sec (index=0x555558448818, entry=0x7fff8c1408d8, thr=0x7fff8c13a268) at /mariadb/10.5/storage/innobase/row/row0uins.cc:344 #4 0x0000555556b005e0 in row_undo_ins_remove_sec_rec (node=0x7fff8c13eac8, thr=0x7fff8c13a268) at /mariadb/10.5/storage/innobase/row/row0uins.cc:524 #5 0x0000555556aff745 in row_undo_ins (node=0x7fff8c13eac8, thr=0x7fff8c13a268) at /mariadb/10.5/storage/innobase/row/row0uins.cc:576 #6 0x000055555684980b in row_undo (node=0x7fff8c13eac8, thr=0x7fff8c13a268) at /mariadb/10.5/storage/innobase/row/row0undo.cc:438 #7 0x0000555556849380 in row_undo_step (thr=0x7fff8c13a268) at /mariadb/10.5/storage/innobase/row/row0undo.cc:499 … #15 0x00005555565ff26b in innobase_rollback_by_xid (hton=0x5555580cc1c8, xid=0x7fff8c014b70) at /mariadb/10.5/storage/innobase/handler/ha_innodb.cc:16737 In fact, the flag BTR_DELETE (which could have been passed here) is only being used by purge, never by rollback. Similarly, the flag BTR_DELETE_MARK is only being used by row_upd_sec_index_entry() (as part of an DELETE , or an UPDATE of a PRIMARY KEY column), never on rollback. So, the delete buffering was never being used to its full potential. I will prepare a patch that will remove the BTR_DELETE buffering and the buffer pool watch mechanism. It will be interesting to see how this affects performance. Similar performance should be obtained by setting innodb_change_buffering=changes .

Marko Mäkelä added a comment - 2019-10-04 12:30 - edited

It turns out that we can employ the simple change buffer record format
(table_id,index_id,index_fields,operation)
and only buffer the latest operation (insert, delete-mark, delete-unmark, delete).

Whenever we are about to buffer an operation, we would have to search for (table_id,index_id,index_fields) anyway, to find the correct position for inserting the record into the change buffer. If we find a match, we would merely update the operation in the change buffer record, instead of inserting a new record.

The order of applying the operations to the index page does not matter, because we only buffer the latest pending operation for any given key, and applying the operations will be idempotent.

We should make use of delete-mark, delete-unmark and delete buffering also on ROLLBACK, to speed up the rollback of large DELETE or UPDATE transactions. Rollback of INSERT would delete records directly from the secondary index pages, as it always has been the case.

A separate benchmark has to be run to determine if it would be worthwhile to buffer delete operations (not only on purge, but also on ROLLBACK).

When updating the operation of an existing change buffer record, we could try to perform some consistency checks. Perhaps a buffered 'insert' operation should never be replaced with a buffered 'delete-unmark' operation. If the to-be-buffered operation is the same as in the already buffered record, it might be a sign of corruption. Multiple 'delete' operations are probably fine, because sometimes ROLLBACK can perform a purge-like operation, and purge could attempt the same again. Also, when the ROLLBACK of a row operation occurs, sometimes the row operation was not completed for all indexes, yet the rollback will be attempted on every index.

Marko Mäkelä added a comment - 2019-10-04 12:30 - edited It turns out that we can employ the simple change buffer record format (table_id,index_id,index_fields,operation) and only buffer the latest operation (insert, delete-mark, delete-unmark, delete). Whenever we are about to buffer an operation, we would have to search for (table_id,index_id,index_fields) anyway, to find the correct position for inserting the record into the change buffer. If we find a match, we would merely update the operation in the change buffer record, instead of inserting a new record. The order of applying the operations to the index page does not matter, because we only buffer the latest pending operation for any given key, and applying the operations will be idempotent. We should make use of delete-mark, delete-unmark and delete buffering also on ROLLBACK , to speed up the rollback of large DELETE or UPDATE transactions. Rollback of INSERT would delete records directly from the secondary index pages, as it always has been the case. A separate benchmark has to be run to determine if it would be worthwhile to buffer delete operations (not only on purge, but also on ROLLBACK ). When updating the operation of an existing change buffer record, we could try to perform some consistency checks. Perhaps a buffered 'insert' operation should never be replaced with a buffered 'delete-unmark' operation. If the to-be-buffered operation is the same as in the already buffered record, it might be a sign of corruption. Multiple 'delete' operations are probably fine, because sometimes ROLLBACK can perform a purge-like operation, and purge could attempt the same again. Also, when the ROLLBACK of a row operation occurs, sometimes the row operation was not completed for all indexes, yet the rollback will be attempted on every index.

Marko Mäkelä made changes - 2019-10-08 07:21

Attachment

Stop-buffering-delete-purge-operations.patch [ 49140 ]

Marko Mäkelä added a comment - 2019-10-08 07:28

Based on benchmarks that were run for ~~MDEV-19344~~ for a 10.2 version of Stop-buffering-delete-purge-operations.patch, the change buffering for purge operations is beneficial for performance when the buffer pool is slightly smaller than the working set.

So, it looks like we should maintain (and extend) the use of delete buffering. The logical change buffer format should allow us to remove the buffer pool watch mechanism, which is part of Stop-buffering-delete-purge-operations.patch. The patch almost cleanly applies to 10.5 as of a07be05302ccc3baea83b7920e9162f3e91dfdcc; it is for the ~~MDEV-19514~~ development branch.

Marko Mäkelä added a comment - 2019-10-08 07:28 Based on benchmarks that were run for MDEV-19344 for a 10.2 version of Stop-buffering-delete-purge-operations.patch , the change buffering for purge operations is beneficial for performance when the buffer pool is slightly smaller than the working set. So, it looks like we should maintain (and extend) the use of delete buffering. The logical change buffer format should allow us to remove the buffer pool watch mechanism, which is part of Stop-buffering-delete-purge-operations.patch . The patch almost cleanly applies to 10.5 as of a07be05302ccc3baea83b7920e9162f3e91dfdcc; it is for the MDEV-19514 development branch.

Marko Mäkelä made changes - 2020-01-13 14:02

Link

This issue blocks MDEV-11633 [ MDEV-11633 ]

Marko Mäkelä made changes - 2020-01-13 14:06

Parent	MDEV-11633 [ 59463 ]
Issue Type	Technical task [ 7 ]	Task [ 3 ]

Marko Mäkelä added a comment - 2020-05-06 06:09

~~MDEV-19514~~ preserved one invocation of arbitrary change buffer merges during normal operation: ibuf_insert_low() could invoke ibuf_read_merge_pages() to merge other buffered changes before buffering one more change. This invocation could apply to any buffered index, not the one that is currently being inserted into. As part of this task, we should remove or adjust this call. We could merge buffered changes for the current index, or we could refuse to buffer the current change because there already exist too many buffered changes for other indexes.

Marko Mäkelä added a comment - 2020-05-06 06:09 MDEV-19514 preserved one invocation of arbitrary change buffer merges during normal operation: ibuf_insert_low() could invoke ibuf_read_merge_pages() to merge other buffered changes before buffering one more change. This invocation could apply to any buffered index, not the one that is currently being inserted into. As part of this task, we should remove or adjust this call. We could merge buffered changes for the current index, or we could refuse to buffer the current change because there already exist too many buffered changes for other indexes.

Marko Mäkelä made changes - 2020-06-18 13:25

Link

This issue relates to ~~MDEV-22930~~ [ ~~MDEV-22930~~ ]

Rob Schwyzer (Inactive) made changes - 2021-05-27 16:35

Remote Link

This issue links to "Page (MariaDB Confluence)" [ 31241 ]

Sergei Golubchik made changes - 2021-12-06 21:21

Workflow

MariaDB v3 [ 78866 ]

MariaDB v4 [ 130572 ]

Marko Mäkelä made changes - 2022-01-25 16:39

Link

This issue relates to MENT-1396 [ MENT-1396 ]

Axel Schwenke made changes - 2022-01-31 12:06

Attachment		MDEV-11634-10.5.pdf [ 61941 ]
Attachment		MDEV-11634-10.6.pdf [ 61942 ]

Axel Schwenke added a comment - 2022-01-31 12:07

Attached benchmark graphs for MariaDB 10.5 and 10.6

Axel Schwenke added a comment - 2022-01-31 12:07 Attached benchmark graphs for MariaDB 10.5 and 10.6

Axel Schwenke made changes - 2022-02-17 16:32

Attachment

MDEV-11634-10.8B.pdf [ 62273 ]

Axel Schwenke added a comment - 2022-02-17 16:36

Attached MDEV-11634-10.8B.pdf with results for 10.8. Two 10.8 commits were tested. The newer one (blue and pink lines) includes ~~MDEV-27774~~. It is both faster and the difference between innodb_change_buffering=all and innodb_change_buffering=none is smaller.

Axel Schwenke added a comment - 2022-02-17 16:36 Attached MDEV-11634-10.8B.pdf with results for 10.8. Two 10.8 commits were tested. The newer one (blue and pink lines) includes MDEV-27774 . It is both faster and the difference between innodb_change_buffering=all and innodb_change_buffering=none is smaller.

AirFocus made changes - 2022-08-09 16:11

Description

The InnoDB change buffer https://blogs.oracle.com/mysqlinnodb/entry/mysql_5_5_innodb_change aims to make the write patterns of leaf pages of non-unique, non-spatial indexes more sequential. If a leaf page is not present in the buffer pool, the operation can be buffered by writing a record to the special change buffer B-tree, provided that no page overflow or underflow can occur. When the page is read into the buffer pool for whatever reason, the change buffer will be merged to it.

The change buffer format has severe design problems. Actually we still support all change buffer formats (MySQL 4.0 and earlier MySQL 4.1 with innodb_file_per_table, 5.0 with ROW_FORMAT=COMPACT, 5.5 with delete and purge buffering), even though an upgrade should always be preceded by a slow shutdown that should have emptied the change buffer.

The key in the 5.5 format is (tablespace_id, page_number, operation_count), followed by the operation code (insert/delete/purge), record metadata, and the actual data of the record.

On DROP INDEX or DROP TABLE in a shared tablespace, InnoDB cannot easily delete all buffered records for the tablespace. So, it will not even try. Instead, on page allocation, InnoDB will try to drop buffered changes if any existed.

If the InnoDB change buffer key was something like (tablespace_id, index_id, page_number), it would be easy to discard all buffered changes for a given index. We could even avoid writing index metadata to the change buffer records. But this would require that the dictionary metadata be available to the buffer pool interface that takes care of merging buffered changes.

Allocating the change buffer in the InnoDB system tablespace is problematic. IMPORT/EXPORT would work better if this link to the system tablespace did not exist.
On the other hand, while having a dedicated change buffer in each tablespace would make IMPORT/EXPORT easier, the page write access pattens would be less sequential than with the current global change buffer in the system tablespace.

If the InnoDB change buffer is to be preserved, it would be good to define it as a no-rollback persistent table that privileged users can read.

The InnoDB change buffer https://blogs.oracle.com/mysqlinnodb/entry/mysql_5_5_innodb_change aims to make the write patterns of leaf pages of non-unique, non-spatial indexes more sequential. If a leaf page is not present in the buffer pool, the operation can be buffered by writing a record to the special change buffer B\-tree, provided that no page overflow or underflow can occur. When the page is read into the buffer pool for whatever reason, the change buffer will be merged to it.

The change buffer format has severe design problems. Actually we still support all change buffer formats (MySQL 4.0 and earlier MySQL 4.1 with innodb_file_per_table, 5.0 with ROW_FORMAT=COMPACT, 5.5 with delete and purge buffering), even though an upgrade should always be preceded by a slow shutdown that should have emptied the change buffer.

The key in the 5.5 format is (tablespace_id, page_number, operation\_count), followed by the operation code (insert/delete/purge), record metadata, and the actual data of the record.

On DROP INDEX or DROP TABLE in a shared tablespace, InnoDB cannot easily delete all buffered records for the tablespace. So, it will not even try. Instead, on page allocation, InnoDB will try to drop buffered changes if any existed.

If the InnoDB change buffer key was something like (tablespace_id, index_id, page\_number), it would be easy to discard all buffered changes for a given index. We could even avoid writing index metadata to the change buffer records. But this would require that the dictionary metadata be available to the buffer pool interface that takes care of merging buffered changes.

Allocating the change buffer in the InnoDB system tablespace is problematic. IMPORT/EXPORT would work better if this link to the system tablespace did not exist.
On the other hand, while having a dedicated change buffer in each tablespace would make IMPORT/EXPORT easier, the page write access pattens would be less sequential than with the current global change buffer in the system tablespace.

If the InnoDB change buffer is to be preserved, it would be good to define it as a no\-rollback persistent table that privileged users can read.

Marko Mäkelä made changes - 2022-11-02 07:03

Link

This issue is blocked by MDEV-17598 [ MDEV-17598 ]

Marko Mäkelä made changes - 2022-11-02 07:03

Link

This issue is blocked by MDEV-29823 [ MDEV-29823 ]

Marko Mäkelä added a comment - 2022-11-02 07:03

Come to think of it, the current InnoDB change buffer is conceptually similar to a local write-ahead log. Because it is managed in a number of persistent redo-logged data pages (the change buffer bitmap pages in persistent tablespaces and the special B-tree in the system tablespace), any operations involving the change buffer will cause excessive redo logging, compared to a situation where the changes would be applied to the index directly.

A possible reimplementation of the change buffer could avoid log write amplification by changing the way how the log is written:

A record written to the global write-ahead log ib_logfile0 would merely indicate that some changes exist for an index page (or index) in a separate local log.
Each local log (say, a "change buffer" for each index) would allow the individual index pages to be recovered.

Some cases of recovery could run faster, because the global log would become much smaller. But, the first access to affected indexes could be much slower, depending on how the local logs are managed. Also, log checkpoints could become extremely slow if we would have to apply lots of buffered changes from the local logs to the index pages.

Further challenges are related to the MVCC implementation at least until MDEV-17598 has been implemented. Rollback and purge currently have to check if it is safe to remove a secondary index record, which constitutes a dependency to undo logs and concurrently active transactions. There already is an open bug about this, in MDEV-29823.

Marko Mäkelä added a comment - 2022-11-02 07:03 Come to think of it, the current InnoDB change buffer is conceptually similar to a local write-ahead log. Because it is managed in a number of persistent redo-logged data pages (the change buffer bitmap pages in persistent tablespaces and the special B-tree in the system tablespace), any operations involving the change buffer will cause excessive redo logging, compared to a situation where the changes would be applied to the index directly. A possible reimplementation of the change buffer could avoid log write amplification by changing the way how the log is written: A record written to the global write-ahead log ib_logfile0 would merely indicate that some changes exist for an index page (or index) in a separate local log. Each local log (say, a "change buffer" for each index) would allow the individual index pages to be recovered. Some cases of recovery could run faster, because the global log would become much smaller. But, the first access to affected indexes could be much slower, depending on how the local logs are managed. Also, log checkpoints could become extremely slow if we would have to apply lots of buffered changes from the local logs to the index pages. Further challenges are related to the MVCC implementation at least until MDEV-17598 has been implemented. Rollback and purge currently have to check if it is safe to remove a secondary index record, which constitutes a dependency to undo logs and concurrently active transactions. There already is an open bug about this, in MDEV-29823 .

Ralf Gebhardt made changes - 2023-03-06 08:56

Link

This issue blocks MDEV-11633 [ MDEV-11633 ]

Ralf Gebhardt made changes - 2023-03-06 08:58

Fix Version/s		N/A [ 14700 ]
Assignee	Marko Mäkelä [ marko ]
Resolution		Won't Do [ 10201 ]
Status	Open [ 1 ]	Closed [ 6 ]

Ralf Gebhardt made changes - 2023-03-06 08:59

Link

This issue relates to ~~MDEV-29694~~ [ ~~MDEV-29694~~ ]

Ralf Gebhardt made changes - 2023-03-06 09:44

Link

This issue is blocked by MDEV-29823 [ MDEV-29823 ]

People

Assignee:: Unassigned

Reporter:: Marko Mäkelä

Votes:: 3 Vote for this issue

Watchers:: 14 Start watching this issue

Dates

Created:: 2016-12-22 09:59

Updated:: 2023-03-06 09:44

Resolved:: 2023-03-06 08:58

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates

Git Integration