Details

    Description

      The InnoDB change buffer https://blogs.oracle.com/mysqlinnodb/entry/mysql_5_5_innodb_change aims to make the write patterns of leaf pages of non-unique, non-spatial indexes more sequential. If a leaf page is not present in the buffer pool, the operation can be buffered by writing a record to the special change buffer B-tree, provided that no page overflow or underflow can occur. When the page is read into the buffer pool for whatever reason, the change buffer will be merged to it.

      The change buffer format has severe design problems. Actually we still support all change buffer formats (MySQL 4.0 and earlier MySQL 4.1 with innodb_file_per_table, 5.0 with ROW_FORMAT=COMPACT, 5.5 with delete and purge buffering), even though an upgrade should always be preceded by a slow shutdown that should have emptied the change buffer.

      The key in the 5.5 format is (tablespace_id, page_number, operation_count), followed by the operation code (insert/delete/purge), record metadata, and the actual data of the record.

      On DROP INDEX or DROP TABLE in a shared tablespace, InnoDB cannot easily delete all buffered records for the tablespace. So, it will not even try. Instead, on page allocation, InnoDB will try to drop buffered changes if any existed.

      If the InnoDB change buffer key was something like (tablespace_id, index_id, page_number), it would be easy to discard all buffered changes for a given index. We could even avoid writing index metadata to the change buffer records. But this would require that the dictionary metadata be available to the buffer pool interface that takes care of merging buffered changes.

      Allocating the change buffer in the InnoDB system tablespace is problematic. IMPORT/EXPORT would work better if this link to the system tablespace did not exist.
      On the other hand, while having a dedicated change buffer in each tablespace would make IMPORT/EXPORT easier, the page write access pattens would be less sequential than with the current global change buffer in the system tablespace.

      If the InnoDB change buffer is to be preserved, it would be good to define it as a no-rollback persistent table that privileged users can read.

      Attachments

        Issue Links

          Activity

            marko Marko Mäkelä created issue -
            marko Marko Mäkelä made changes -
            Field Original Value New Value
            marko Marko Mäkelä made changes -
            ralf.gebhardt Ralf Gebhardt made changes -
            Fix Version/s 10.4 [ 22408 ]
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä added a comment - - edited

            The implementation of MDEV-515 could reduce the need for the change buffering.

            For MDEV-14481, MDEV-19514 must defer the change buffer merge to the moment when the secondary index leaf page is requested by a user thread. Thanks to this, the index metadata will be available, and we do not have to store any metadata in the change buffer records.

            We could also allow the change buffer merge to split or merge pages as needed. This would remove the need for the "free bits bitmap". We might eliminate the entire change buffer bitmap pages (page 1+n·innodb_page_size), and incorporate the "merge needed" bit to the allocation bitmap pages (page n·innodb_page_size).

            This optimized format would be only available to data files that are in the MDEV-12026 innodb_checksum_algorithm=full_crc32 format. The absence of InnoDB change buffer bitmap pages would be identified by a new bit in FSP_SPACE_FLAGS.

            marko Marko Mäkelä added a comment - - edited The implementation of MDEV-515 could reduce the need for the change buffering. For MDEV-14481 , MDEV-19514 must defer the change buffer merge to the moment when the secondary index leaf page is requested by a user thread. Thanks to this, the index metadata will be available, and we do not have to store any metadata in the change buffer records. We could also allow the change buffer merge to split or merge pages as needed. This would remove the need for the "free bits bitmap". We might eliminate the entire change buffer bitmap pages (page 1+n· innodb_page_size ), and incorporate the "merge needed" bit to the allocation bitmap pages (page n· innodb_page_size ). This optimized format would be only available to data files that are in the MDEV-12026 innodb_checksum_algorithm=full_crc32 format. The absence of InnoDB change buffer bitmap pages would be identified by a new bit in FSP_SPACE_FLAGS .
            marko Marko Mäkelä made changes -
            NRE Projects RM_105_CANDIDATE
            Labels innodb innodb performance
            ralf.gebhardt Ralf Gebhardt made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -

            As noted in MDEV-19514, the change buffer does give some notable performance improvement. Hence, we should not remove it altogether.

            We can improve the change buffer by:

            • using a dedicated data file instead of the system tablespace
            • changing to a logical format: (table_id,index_id,sequence_number)↦(operation,index_fields)

            With such a logical format, there is no need to avoid any page splits or merges, or to keep track of the fill factor of individual pages. We will also be able to mark a unique index corrupted if duplicate values were successfully buffered during unique_checks=0.

            Starting the key with (table_id,index_id) allows us to easily drop any buffered changes when a table or index is dropped or rebuilt.

            marko Marko Mäkelä added a comment - As noted in MDEV-19514 , the change buffer does give some notable performance improvement. Hence, we should not remove it altogether. We can improve the change buffer by: using a dedicated data file instead of the system tablespace changing to a logical format: (table_id,index_id,sequence_number)↦(operation,index_fields) With such a logical format, there is no need to avoid any page splits or merges, or to keep track of the fill factor of individual pages. We will also be able to mark a unique index corrupted if duplicate values were successfully buffered during unique_checks=0 . Starting the key with (table_id,index_id) allows us to easily drop any buffered changes when a table or index is dropped or rebuilt.
            marko Marko Mäkelä made changes -
            Summary Improve or remove the InnoDB change buffer Improve the InnoDB change buffer

            A change buffer merge would be initiated whenever a search reaches a secondary index leaf page for which the bitmap page indicates there to exist buffered changes. Based on the potential key range of the secondary index leaf page (which is indicated by the keys on the above-leaf-level page), we would be able to find all buffered changes in the page.

            The change buffer records could be in one of the following formats:

            • (table_id,index_id,index_fields) if we are going to support insert buffering only
            • (table_id,index_id,index_page_number,sequence_number)↦(operation,index_fields) for supporting insert, delete-mark, and delete (purge) buffering

            The sequence_number field would be needed for buffering delete-mark and delete (purge) operations. If we only buffer inserts, then the order of applying the operations does not matter.

            If we only support insert buffering, we can simply seek to (table_id,index_id,index_fields) with the minimum bound of the index_fields for the leaf page, and then scan the range until the maximum bound of the index_fields is reached, or we run out of records for that (table_id,index_id).

            If we support change buffering, we have to seek to (table_id,index_id,index_page_number,0) and then scan the range with increasing sequence_number until we run out of records for that (table_id,index_id,index_page_number).

            If applying the buffered changes involves splitting or merging secondary index leaf pages to another page for which buffered changes exist, it could get a little tricky, but that could be manageable if we set the ‘buffered changes exist’ flag conservatively, forcing any access to the adjacent pages to check if any buffered changes can be applied. This could be simpler if we do not store index_page_number in the change buffer key.

            It looks like we should determine whether innodb_change_buffering=all brings any performance benefit over innodb_change_buffering=inserts. If not, maybe the new format should support only insert buffering? When the delete-buffering feature was originally implemented for the InnoDB+ that later became part of MySQL 5.5, no such experiments were conducted.

            marko Marko Mäkelä added a comment - A change buffer merge would be initiated whenever a search reaches a secondary index leaf page for which the bitmap page indicates there to exist buffered changes. Based on the potential key range of the secondary index leaf page (which is indicated by the keys on the above-leaf-level page), we would be able to find all buffered changes in the page. The change buffer records could be in one of the following formats: (table_id,index_id,index_fields) if we are going to support insert buffering only (table_id,index_id,index_page_number,sequence_number)↦(operation,index_fields) for supporting insert, delete-mark, and delete (purge) buffering The sequence_number field would be needed for buffering delete-mark and delete (purge) operations. If we only buffer inserts, then the order of applying the operations does not matter. If we only support insert buffering, we can simply seek to (table_id,index_id,index_fields) with the minimum bound of the index_fields for the leaf page, and then scan the range until the maximum bound of the index_fields is reached, or we run out of records for that (table_id,index_id). If we support change buffering, we have to seek to (table_id,index_id,index_page_number,0) and then scan the range with increasing sequence_number until we run out of records for that (table_id,index_id,index_page_number). If applying the buffered changes involves splitting or merging secondary index leaf pages to another page for which buffered changes exist, it could get a little tricky, but that could be manageable if we set the ‘buffered changes exist’ flag conservatively, forcing any access to the adjacent pages to check if any buffered changes can be applied. This could be simpler if we do not store index_page_number in the change buffer key. It looks like we should determine whether innodb_change_buffering=all brings any performance benefit over innodb_change_buffering=inserts . If not, maybe the new format should support only insert buffering? When the delete-buffering feature was originally implemented for the InnoDB+ that later became part of MySQL 5.5, no such experiments were conducted.

            It turns out that the delete (purge) buffering (BTR_DELETE) is only attempted on purge, and not at all by rollback. Here is proof of that:

            --source include/have_innodb.inc
            --source include/have_sequence.inc
             
            connect (con1,localhost,root,,);
             
            CREATE TABLE t(a SERIAL, b CHAR(255) NOT NULL DEFAULT '', KEY(b))ENGINE=InnoDB;
            XA START '123';
            INSERT INTO t (a) SELECT * FROM seq_1_to_10000;
            XA END '123';
            XA PREPARE '123';
             
            connection default;
             
            --echo # Kill the server to work around MDEV-742
            --let $shutdown_timeout=0
            --source include/restart_mysqld.inc
            disconnect con1;
             
            --echo # Initiate slow shutdown, to have an "empty" buffer pool after restart
            let $datadir=`select @@datadir`;
            --let $shutdown_timeout=60
            SET GLOBAL innodb_fast_shutdown=0;
            --source include/shutdown_mysqld.inc
             
            --remove_file $datadir/ib_buffer_pool
            --source include/start_mysqld.inc
             
            replace_regex /.*operations:.* (insert.*delete.*delete \d+).*discarded .*/\1/;
            SHOW ENGINE INNODB STATUS;
            XA ROLLBACK '123';
            replace_regex /.*operations:.* (insert.*delete.*delete \d+).*discarded .*/\1/;
            SHOW ENGINE INNODB STATUS;
            DROP TABLE t;
            

            ./mtr --mysqld=--innodb-page-size=4k testname
            

            The test is reporting

            InnoDB		insert 0, delete mark 0, delete 0
            

            both before and after the XA ROLLBACK statement. None of the secondary index pages are present in the buffer pool after the slow shutdown, and the pages are only being loaded by the XA ROLLBACK. That rollback is invoking buf_page_get_gen() not only for the secondary index root page and some node pointer pages, but also to retrieve the leaf page:

            #0  0x0000555556826717 in btr_pcur_open_low (index=0x555558448818, level=0, 
                tuple=0x7fff8c1408d8, mode=PAGE_CUR_LE, latch_mode=16386, 
                cursor=0x7ffff03ecae8, 
                file=0x555556e70fc5 "/mariadb/10.5/storage/innobase/row/row0row.cc", 
                line=1313, autoinc=0, mtr=0x7ffff03ecd90)
                at /mariadb/10.5/storage/innobase/include/btr0pcur.ic:441
            #1  0x000055555682712e in row_search_index_entry (index=0x555558448818, 
                entry=0x7fff8c1408d8, mode=16386, pcur=0x7ffff03ecae8, mtr=0x7ffff03ecd90)
                at /mariadb/10.5/storage/innobase/row/row0row.cc:1313
            #2  0x0000555556b034c3 in row_undo_ins_remove_sec_low (mode=16386, 
                index=0x555558448818, entry=0x7fff8c1408d8, thr=0x7fff8c13a268)
                at /mariadb/10.5/storage/innobase/row/row0uins.cc:280
            #3  0x0000555556b03062 in row_undo_ins_remove_sec (index=0x555558448818, 
                entry=0x7fff8c1408d8, thr=0x7fff8c13a268)
                at /mariadb/10.5/storage/innobase/row/row0uins.cc:344
            #4  0x0000555556b005e0 in row_undo_ins_remove_sec_rec (node=0x7fff8c13eac8, 
                thr=0x7fff8c13a268) at /mariadb/10.5/storage/innobase/row/row0uins.cc:524
            #5  0x0000555556aff745 in row_undo_ins (node=0x7fff8c13eac8, 
                thr=0x7fff8c13a268) at /mariadb/10.5/storage/innobase/row/row0uins.cc:576
            #6  0x000055555684980b in row_undo (node=0x7fff8c13eac8, thr=0x7fff8c13a268)
                at /mariadb/10.5/storage/innobase/row/row0undo.cc:438
            #7  0x0000555556849380 in row_undo_step (thr=0x7fff8c13a268)
                at /mariadb/10.5/storage/innobase/row/row0undo.cc:499
            #15 0x00005555565ff26b in innobase_rollback_by_xid (hton=0x5555580cc1c8, 
                xid=0x7fff8c014b70)
                at /mariadb/10.5/storage/innobase/handler/ha_innodb.cc:16737
            

            In fact, the flag BTR_DELETE (which could have been passed here) is only being used by purge, never by rollback. Similarly, the flag BTR_DELETE_MARK is only being used by row_upd_sec_index_entry() (as part of an DELETE, or an UPDATE of a PRIMARY KEY column), never on rollback. So, the delete buffering was never being used to its full potential.

            I will prepare a patch that will remove the BTR_DELETE buffering and the buffer pool watch mechanism. It will be interesting to see how this affects performance. Similar performance should be obtained by setting innodb_change_buffering=changes.

            marko Marko Mäkelä added a comment - It turns out that the delete (purge) buffering ( BTR_DELETE ) is only attempted on purge, and not at all by rollback. Here is proof of that: --source include/have_innodb.inc --source include/have_sequence.inc   connect (con1,localhost,root,,);   CREATE TABLE t(a SERIAL, b CHAR (255) NOT NULL DEFAULT '' , KEY (b))ENGINE=InnoDB; XA START '123' ; INSERT INTO t (a) SELECT * FROM seq_1_to_10000; XA END '123' ; XA PREPARE '123' ;   connection default ;   --echo # Kill the server to work around MDEV-742 --let $shutdown_timeout=0 --source include/restart_mysqld.inc disconnect con1;   --echo # Initiate slow shutdown, to have an "empty" buffer pool after restart let $datadir=` select @@datadir`; --let $shutdown_timeout=60 SET GLOBAL innodb_fast_shutdown=0; --source include/shutdown_mysqld.inc   --remove_file $datadir/ib_buffer_pool --source include/start_mysqld.inc   replace_regex /.*operations:.* ( insert .* delete .* delete \d+).*discarded .*/\1/; SHOW ENGINE INNODB STATUS; XA ROLLBACK '123' ; replace_regex /.*operations:.* ( insert .* delete .* delete \d+).*discarded .*/\1/; SHOW ENGINE INNODB STATUS; DROP TABLE t; ./mtr --mysqld=--innodb-page-size=4k testname The test is reporting InnoDB insert 0, delete mark 0, delete 0 both before and after the XA ROLLBACK statement. None of the secondary index pages are present in the buffer pool after the slow shutdown, and the pages are only being loaded by the XA ROLLBACK . That rollback is invoking buf_page_get_gen() not only for the secondary index root page and some node pointer pages, but also to retrieve the leaf page: #0 0x0000555556826717 in btr_pcur_open_low (index=0x555558448818, level=0, tuple=0x7fff8c1408d8, mode=PAGE_CUR_LE, latch_mode=16386, cursor=0x7ffff03ecae8, file=0x555556e70fc5 "/mariadb/10.5/storage/innobase/row/row0row.cc", line=1313, autoinc=0, mtr=0x7ffff03ecd90) at /mariadb/10.5/storage/innobase/include/btr0pcur.ic:441 #1 0x000055555682712e in row_search_index_entry (index=0x555558448818, entry=0x7fff8c1408d8, mode=16386, pcur=0x7ffff03ecae8, mtr=0x7ffff03ecd90) at /mariadb/10.5/storage/innobase/row/row0row.cc:1313 #2 0x0000555556b034c3 in row_undo_ins_remove_sec_low (mode=16386, index=0x555558448818, entry=0x7fff8c1408d8, thr=0x7fff8c13a268) at /mariadb/10.5/storage/innobase/row/row0uins.cc:280 #3 0x0000555556b03062 in row_undo_ins_remove_sec (index=0x555558448818, entry=0x7fff8c1408d8, thr=0x7fff8c13a268) at /mariadb/10.5/storage/innobase/row/row0uins.cc:344 #4 0x0000555556b005e0 in row_undo_ins_remove_sec_rec (node=0x7fff8c13eac8, thr=0x7fff8c13a268) at /mariadb/10.5/storage/innobase/row/row0uins.cc:524 #5 0x0000555556aff745 in row_undo_ins (node=0x7fff8c13eac8, thr=0x7fff8c13a268) at /mariadb/10.5/storage/innobase/row/row0uins.cc:576 #6 0x000055555684980b in row_undo (node=0x7fff8c13eac8, thr=0x7fff8c13a268) at /mariadb/10.5/storage/innobase/row/row0undo.cc:438 #7 0x0000555556849380 in row_undo_step (thr=0x7fff8c13a268) at /mariadb/10.5/storage/innobase/row/row0undo.cc:499 … #15 0x00005555565ff26b in innobase_rollback_by_xid (hton=0x5555580cc1c8, xid=0x7fff8c014b70) at /mariadb/10.5/storage/innobase/handler/ha_innodb.cc:16737 In fact, the flag BTR_DELETE (which could have been passed here) is only being used by purge, never by rollback. Similarly, the flag BTR_DELETE_MARK is only being used by row_upd_sec_index_entry() (as part of an DELETE , or an UPDATE of a PRIMARY KEY column), never on rollback. So, the delete buffering was never being used to its full potential. I will prepare a patch that will remove the BTR_DELETE buffering and the buffer pool watch mechanism. It will be interesting to see how this affects performance. Similar performance should be obtained by setting innodb_change_buffering=changes .
            marko Marko Mäkelä added a comment - - edited

            It turns out that we can employ the simple change buffer record format
            (table_id,index_id,index_fields,operation)
            and only buffer the latest operation (insert, delete-mark, delete-unmark, delete).

            Whenever we are about to buffer an operation, we would have to search for (table_id,index_id,index_fields) anyway, to find the correct position for inserting the record into the change buffer. If we find a match, we would merely update the operation in the change buffer record, instead of inserting a new record.

            The order of applying the operations to the index page does not matter, because we only buffer the latest pending operation for any given key, and applying the operations will be idempotent.

            We should make use of delete-mark, delete-unmark and delete buffering also on ROLLBACK, to speed up the rollback of large DELETE or UPDATE transactions. Rollback of INSERT would delete records directly from the secondary index pages, as it always has been the case.

            A separate benchmark has to be run to determine if it would be worthwhile to buffer delete operations (not only on purge, but also on ROLLBACK).

            When updating the operation of an existing change buffer record, we could try to perform some consistency checks. Perhaps a buffered 'insert' operation should never be replaced with a buffered 'delete-unmark' operation. If the to-be-buffered operation is the same as in the already buffered record, it might be a sign of corruption. Multiple 'delete' operations are probably fine, because sometimes ROLLBACK can perform a purge-like operation, and purge could attempt the same again. Also, when the ROLLBACK of a row operation occurs, sometimes the row operation was not completed for all indexes, yet the rollback will be attempted on every index.

            marko Marko Mäkelä added a comment - - edited It turns out that we can employ the simple change buffer record format (table_id,index_id,index_fields,operation) and only buffer the latest operation (insert, delete-mark, delete-unmark, delete). Whenever we are about to buffer an operation, we would have to search for (table_id,index_id,index_fields) anyway, to find the correct position for inserting the record into the change buffer. If we find a match, we would merely update the operation in the change buffer record, instead of inserting a new record. The order of applying the operations to the index page does not matter, because we only buffer the latest pending operation for any given key, and applying the operations will be idempotent. We should make use of delete-mark, delete-unmark and delete buffering also on ROLLBACK , to speed up the rollback of large DELETE or UPDATE transactions. Rollback of INSERT would delete records directly from the secondary index pages, as it always has been the case. A separate benchmark has to be run to determine if it would be worthwhile to buffer delete operations (not only on purge, but also on ROLLBACK ). When updating the operation of an existing change buffer record, we could try to perform some consistency checks. Perhaps a buffered 'insert' operation should never be replaced with a buffered 'delete-unmark' operation. If the to-be-buffered operation is the same as in the already buffered record, it might be a sign of corruption. Multiple 'delete' operations are probably fine, because sometimes ROLLBACK can perform a purge-like operation, and purge could attempt the same again. Also, when the ROLLBACK of a row operation occurs, sometimes the row operation was not completed for all indexes, yet the rollback will be attempted on every index.
            marko Marko Mäkelä made changes -

            Based on benchmarks that were run for MDEV-19344 for a 10.2 version of Stop-buffering-delete-purge-operations.patch, the change buffering for purge operations is beneficial for performance when the buffer pool is slightly smaller than the working set.

            So, it looks like we should maintain (and extend) the use of delete buffering. The logical change buffer format should allow us to remove the buffer pool watch mechanism, which is part of Stop-buffering-delete-purge-operations.patch. The patch almost cleanly applies to 10.5 as of a07be05302ccc3baea83b7920e9162f3e91dfdcc; it is for the MDEV-19514 development branch.

            marko Marko Mäkelä added a comment - Based on benchmarks that were run for MDEV-19344 for a 10.2 version of Stop-buffering-delete-purge-operations.patch , the change buffering for purge operations is beneficial for performance when the buffer pool is slightly smaller than the working set. So, it looks like we should maintain (and extend) the use of delete buffering. The logical change buffer format should allow us to remove the buffer pool watch mechanism, which is part of Stop-buffering-delete-purge-operations.patch . The patch almost cleanly applies to 10.5 as of a07be05302ccc3baea83b7920e9162f3e91dfdcc; it is for the MDEV-19514 development branch.
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            Parent MDEV-11633 [ 59463 ]
            Issue Type Technical task [ 7 ] Task [ 3 ]

            MDEV-19514 preserved one invocation of arbitrary change buffer merges during normal operation: ibuf_insert_low() could invoke ibuf_read_merge_pages() to merge other buffered changes before buffering one more change. This invocation could apply to any buffered index, not the one that is currently being inserted into. As part of this task, we should remove or adjust this call. We could merge buffered changes for the current index, or we could refuse to buffer the current change because there already exist too many buffered changes for other indexes.

            marko Marko Mäkelä added a comment - MDEV-19514 preserved one invocation of arbitrary change buffer merges during normal operation: ibuf_insert_low() could invoke ibuf_read_merge_pages() to merge other buffered changes before buffering one more change. This invocation could apply to any buffered index, not the one that is currently being inserted into. As part of this task, we should remove or adjust this call. We could merge buffered changes for the current index, or we could refuse to buffer the current change because there already exist too many buffered changes for other indexes.
            marko Marko Mäkelä made changes -
            rob.schwyzer@mariadb.com Rob Schwyzer (Inactive) made changes -
            serg Sergei Golubchik made changes -
            Workflow MariaDB v3 [ 78866 ] MariaDB v4 [ 130572 ]
            marko Marko Mäkelä made changes -
            axel Axel Schwenke made changes -
            Attachment MDEV-11634-10.5.pdf [ 61941 ]
            Attachment MDEV-11634-10.6.pdf [ 61942 ]
            axel Axel Schwenke added a comment -

            Attached benchmark graphs for MariaDB 10.5 and 10.6

            axel Axel Schwenke added a comment - Attached benchmark graphs for MariaDB 10.5 and 10.6
            axel Axel Schwenke made changes -
            Attachment MDEV-11634-10.8B.pdf [ 62273 ]
            axel Axel Schwenke added a comment -

            Attached MDEV-11634-10.8B.pdf with results for 10.8. Two 10.8 commits were tested. The newer one (blue and pink lines) includes MDEV-27774. It is both faster and the difference between innodb_change_buffering=all and innodb_change_buffering=none is smaller.

            axel Axel Schwenke added a comment - Attached MDEV-11634-10.8B.pdf with results for 10.8. Two 10.8 commits were tested. The newer one (blue and pink lines) includes MDEV-27774 . It is both faster and the difference between innodb_change_buffering=all and innodb_change_buffering=none is smaller.
            AirFocus AirFocus made changes -
            Description The InnoDB change buffer https://blogs.oracle.com/mysqlinnodb/entry/mysql_5_5_innodb_change aims to make the write patterns of leaf pages of non-unique, non-spatial indexes more sequential. If a leaf page is not present in the buffer pool, the operation can be buffered by writing a record to the special change buffer B-tree, provided that no page overflow or underflow can occur. When the page is read into the buffer pool for whatever reason, the change buffer will be merged to it.

            The change buffer format has severe design problems. Actually we still support all change buffer formats (MySQL 4.0 and earlier MySQL 4.1 with innodb_file_per_table, 5.0 with ROW_FORMAT=COMPACT, 5.5 with delete and purge buffering), even though an upgrade should always be preceded by a slow shutdown that should have emptied the change buffer.

            The key in the 5.5 format is (tablespace_id, page_number, operation_count), followed by the operation code (insert/delete/purge), record metadata, and the actual data of the record.

            On DROP INDEX or DROP TABLE in a shared tablespace, InnoDB cannot easily delete all buffered records for the tablespace. So, it will not even try. Instead, on page allocation, InnoDB will try to drop buffered changes if any existed.

            If the InnoDB change buffer key was something like (tablespace_id, index_id, page_number), it would be easy to discard all buffered changes for a given index. We could even avoid writing index metadata to the change buffer records. But this would require that the dictionary metadata be available to the buffer pool interface that takes care of merging buffered changes.

            Allocating the change buffer in the InnoDB system tablespace is problematic. IMPORT/EXPORT would work better if this link to the system tablespace did not exist.
            On the other hand, while having a dedicated change buffer in each tablespace would make IMPORT/EXPORT easier, the page write access pattens would be less sequential than with the current global change buffer in the system tablespace.

            If the InnoDB change buffer is to be preserved, it would be good to define it as a no-rollback persistent table that privileged users can read.
            The InnoDB change buffer https://blogs.oracle.com/mysqlinnodb/entry/mysql_5_5_innodb_change aims to make the write patterns of leaf pages of non-unique, non-spatial indexes more sequential. If a leaf page is not present in the buffer pool, the operation can be buffered by writing a record to the special change buffer B\-tree, provided that no page overflow or underflow can occur. When the page is read into the buffer pool for whatever reason, the change buffer will be merged to it.

            The change buffer format has severe design problems. Actually we still support all change buffer formats (MySQL 4.0 and earlier MySQL 4.1 with innodb_file_per_table, 5.0 with ROW_FORMAT=COMPACT, 5.5 with delete and purge buffering), even though an upgrade should always be preceded by a slow shutdown that should have emptied the change buffer.

            The key in the 5.5 format is (tablespace_id, page_number, operation\_count), followed by the operation code (insert/delete/purge), record metadata, and the actual data of the record.

            On DROP INDEX or DROP TABLE in a shared tablespace, InnoDB cannot easily delete all buffered records for the tablespace. So, it will not even try. Instead, on page allocation, InnoDB will try to drop buffered changes if any existed.

            If the InnoDB change buffer key was something like (tablespace_id, index_id, page\_number), it would be easy to discard all buffered changes for a given index. We could even avoid writing index metadata to the change buffer records. But this would require that the dictionary metadata be available to the buffer pool interface that takes care of merging buffered changes.

            Allocating the change buffer in the InnoDB system tablespace is problematic. IMPORT/EXPORT would work better if this link to the system tablespace did not exist.
            On the other hand, while having a dedicated change buffer in each tablespace would make IMPORT/EXPORT easier, the page write access pattens would be less sequential than with the current global change buffer in the system tablespace.

            If the InnoDB change buffer is to be preserved, it would be good to define it as a no\-rollback persistent table that privileged users can read.
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -

            Come to think of it, the current InnoDB change buffer is conceptually similar to a local write-ahead log. Because it is managed in a number of persistent redo-logged data pages (the change buffer bitmap pages in persistent tablespaces and the special B-tree in the system tablespace), any operations involving the change buffer will cause excessive redo logging, compared to a situation where the changes would be applied to the index directly.

            A possible reimplementation of the change buffer could avoid log write amplification by changing the way how the log is written:

            • A record written to the global write-ahead log ib_logfile0 would merely indicate that some changes exist for an index page (or index) in a separate local log.
            • Each local log (say, a "change buffer" for each index) would allow the individual index pages to be recovered.

            Some cases of recovery could run faster, because the global log would become much smaller. But, the first access to affected indexes could be much slower, depending on how the local logs are managed. Also, log checkpoints could become extremely slow if we would have to apply lots of buffered changes from the local logs to the index pages.

            Further challenges are related to the MVCC implementation at least until MDEV-17598 has been implemented. Rollback and purge currently have to check if it is safe to remove a secondary index record, which constitutes a dependency to undo logs and concurrently active transactions. There already is an open bug about this, in MDEV-29823.

            marko Marko Mäkelä added a comment - Come to think of it, the current InnoDB change buffer is conceptually similar to a local write-ahead log. Because it is managed in a number of persistent redo-logged data pages (the change buffer bitmap pages in persistent tablespaces and the special B-tree in the system tablespace), any operations involving the change buffer will cause excessive redo logging, compared to a situation where the changes would be applied to the index directly. A possible reimplementation of the change buffer could avoid log write amplification by changing the way how the log is written: A record written to the global write-ahead log ib_logfile0 would merely indicate that some changes exist for an index page (or index) in a separate local log. Each local log (say, a "change buffer" for each index) would allow the individual index pages to be recovered. Some cases of recovery could run faster, because the global log would become much smaller. But, the first access to affected indexes could be much slower, depending on how the local logs are managed. Also, log checkpoints could become extremely slow if we would have to apply lots of buffered changes from the local logs to the index pages. Further challenges are related to the MVCC implementation at least until MDEV-17598 has been implemented. Rollback and purge currently have to check if it is safe to remove a secondary index record, which constitutes a dependency to undo logs and concurrently active transactions. There already is an open bug about this, in MDEV-29823 .
            ralf.gebhardt Ralf Gebhardt made changes -
            ralf.gebhardt Ralf Gebhardt made changes -
            Fix Version/s N/A [ 14700 ]
            Assignee Marko Mäkelä [ marko ]
            Resolution Won't Do [ 10201 ]
            Status Open [ 1 ] Closed [ 6 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            ralf.gebhardt Ralf Gebhardt made changes -

            People

              Unassigned Unassigned
              marko Marko Mäkelä
              Votes:
              3 Vote for this issue
              Watchers:
              14 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.