    Description

      The InnoDB change buffer (https://blogs.oracle.com/mysqlinnodb/entry/mysql_5_5_innodb_change) aims to make the write patterns of leaf pages of non-unique, non-spatial indexes more sequential. If a leaf page is not present in the buffer pool, the operation can be buffered by writing a record to the special change buffer B-tree, provided that no page overflow or underflow can occur. When the page is read into the buffer pool for whatever reason, the buffered changes will be merged into it.

      The change buffer format has severe design problems. In fact, we still support all change buffer formats (MySQL 4.0 and earlier, MySQL 4.1 with innodb_file_per_table, 5.0 with ROW_FORMAT=COMPACT, 5.5 with delete and purge buffering), even though an upgrade should always be preceded by a slow shutdown, which should have emptied the change buffer.

      The key in the 5.5 format is (tablespace_id, page_number, operation_count), followed by the operation code (insert/delete/purge), record metadata, and the actual data of the record.
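A minimal sketch of how such a key keeps the buffered operations for one page adjacent and in arrival order. This is a toy model, not InnoDB code: the class `ChangeBuffer`, its methods, and the use of a sorted Python list in place of the B-tree are all assumptions made for illustration.

```python
import bisect
from typing import NamedTuple


class Key(NamedTuple):
    tablespace_id: int
    page_number: int
    operation_count: int


class ChangeBuffer:
    """Toy stand-in for the change buffer B-tree: a list kept sorted by key."""

    def __init__(self):
        self._keys = []    # sorted list of Key
        self._values = {}  # Key -> (operation, record)

    def _page_range(self, tablespace_id, page_number):
        # All keys for one page form one contiguous slice of the sorted list.
        lo = bisect.bisect_left(self._keys, Key(tablespace_id, page_number, 0))
        hi = bisect.bisect_left(self._keys, Key(tablespace_id, page_number + 1, 0))
        return lo, hi

    def buffer(self, tablespace_id, page_number, operation, record):
        """Buffer one operation ("insert", "delete" or "purge") for a page."""
        lo, hi = self._page_range(tablespace_id, page_number)
        # The per-page counter preserves the order in which operations arrived.
        key = Key(tablespace_id, page_number, hi - lo)
        self._keys.insert(hi, key)
        self._values[key] = (operation, record)
        return key

    def merge(self, tablespace_id, page_number):
        """Remove and return every buffered operation for a page, in order."""
        lo, hi = self._page_range(tablespace_id, page_number)
        keys = self._keys[lo:hi]
        del self._keys[lo:hi]
        return [self._values.pop(k) for k in keys]


buf = ChangeBuffer()
buf.buffer(5, 100, "insert", b"a")
buf.buffer(5, 7, "delete", b"b")
buf.buffer(5, 100, "purge", b"a")
merged = buf.merge(5, 100)  # [("insert", b"a"), ("purge", b"a")]
```

Because the page number comes directly after the tablespace identifier, a merge is a single range scan. The same ordering is what makes per-index operations awkward, since the key says nothing about which index a page belongs to.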

      On DROP INDEX or DROP TABLE in a shared tablespace, InnoDB cannot easily delete all buffered records for the dropped index or table, so it does not even try. Instead, when a page is allocated, InnoDB tries to drop any buffered changes that still exist for it.

      If the InnoDB change buffer key were something like (tablespace_id, index_id, page_number), it would be easy to discard all buffered changes for a given index. We could even avoid writing index metadata to the change buffer records. But this would require that the dictionary metadata be available to the buffer pool interface that takes care of merging buffered changes.
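To illustrate why the alternative key ordering would help, here is a small sketch using plain tuples in a sorted list. The sample identifiers and the helper `discard_index` are hypothetical; the point is only that, with index_id as the second key component, all changes for one index form a contiguous range.

```python
import bisect

# Hypothetical buffered changes as (tablespace_id, index_id, page_number).
changes = [(5, 11, 40), (5, 12, 41), (5, 11, 42), (5, 12, 43), (6, 11, 40)]


def discard_index(sorted_keys, tablespace_id, index_id):
    """Drop all keys of one index from a list sorted by
    (tablespace_id, index_id, page_number); return how many were removed."""
    lo = bisect.bisect_left(sorted_keys, (tablespace_id, index_id))
    hi = bisect.bisect_left(sorted_keys, (tablespace_id, index_id + 1))
    del sorted_keys[lo:hi]
    return hi - lo


# Proposed ordering: the changes for index 11 in tablespace 5 are adjacent.
by_index = sorted(changes)
removed = discard_index(by_index, 5, 11)

# Ordering by (tablespace_id, page_number), as in the current key: the same
# changes are interleaved with those of index 12.
by_page = sorted(changes, key=lambda c: (c[0], c[2]))
index_ids_in_page_order = [c[1] for c in by_page]
```

With the page-number ordering, the entries of the two indexes alternate, so no single range delete can remove one index without touching the other.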

      Allocating the change buffer in the InnoDB system tablespace is problematic. IMPORT/EXPORT would work better if this link to the system tablespace did not exist.
      On the other hand, while having a dedicated change buffer in each tablespace would make IMPORT/EXPORT easier, the page write access patterns would be less sequential than with the current global change buffer in the system tablespace.

      If the InnoDB change buffer is to be preserved, it would be good to define it as a no-rollback persistent table that privileged users can read.

          Activity

            marko Marko Mäkelä added a comment -

            Based on benchmarks that were run for MDEV-19344 for a 10.2 version of Stop-buffering-delete-purge-operations.patch, the change buffering for purge operations is beneficial for performance when the buffer pool is slightly smaller than the working set.

            So, it looks like we should maintain (and extend) the use of delete buffering. The logical change buffer format should allow us to remove the buffer pool watch mechanism, which is part of Stop-buffering-delete-purge-operations.patch. The patch almost cleanly applies to 10.5 as of a07be05302ccc3baea83b7920e9162f3e91dfdcc; it is for the MDEV-19514 development branch.

            marko Marko Mäkelä added a comment -

            MDEV-19514 preserved one invocation of arbitrary change buffer merges during normal operation: ibuf_insert_low() could invoke ibuf_read_merge_pages() to merge other buffered changes before buffering one more change. This invocation could apply to any buffered index, not only the one that is currently being inserted into. As part of this task, we should remove or adjust this call. We could merge buffered changes for the current index, or we could refuse to buffer the current change because too many buffered changes already exist for other indexes.
            axel Axel Schwenke added a comment -

            Attached benchmark graphs for MariaDB 10.5 and 10.6

            axel Axel Schwenke added a comment -

            Attached MDEV-11634-10.8B.pdf with results for 10.8. Two 10.8 commits were tested. The newer one (blue and pink lines) includes MDEV-27774. It is faster, and the difference between innodb_change_buffering=all and innodb_change_buffering=none is smaller.


            marko Marko Mäkelä added a comment -

            Come to think of it, the current InnoDB change buffer is conceptually similar to a local write-ahead log. Because it is managed in a number of persistent redo-logged data pages (the change buffer bitmap pages in persistent tablespaces and the special B-tree in the system tablespace), any operations involving the change buffer will cause excessive redo logging, compared to a situation where the changes would be applied to the index directly.

            A possible reimplementation of the change buffer could avoid log write amplification by changing how the log is written:

            • A record written to the global write-ahead log ib_logfile0 would merely indicate that some changes exist for an index page (or index) in a separate local log.
            • Each local log (say, a "change buffer" for each index) would allow the individual index pages to be recovered.

            Some cases of recovery could run faster, because the global log would become much smaller. But the first access to affected indexes could be much slower, depending on how the local logs are managed. Also, log checkpoints could become extremely slow if we had to apply lots of buffered changes from the local logs to the index pages.

            Further challenges are related to the MVCC implementation, at least until MDEV-17598 has been implemented. Rollback and purge currently have to check if it is safe to remove a secondary index record, which constitutes a dependency on undo logs and concurrently active transactions. There already is an open bug about this, in MDEV-29823.
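The two-level logging idea from the comment above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class `TwoLevelLog`, the marker tuple, and the choice to announce each index once in the global log are hypothetical and not part of any existing InnoDB design.

```python
from collections import defaultdict


class TwoLevelLog:
    """Toy model: a small global log that only points at per-index local logs."""

    def __init__(self):
        self.global_log = []                 # markers only, no change payload
        self.local_logs = defaultdict(list)  # index_id -> [(page_number, change)]
        self._announced = set()              # indexes already noted in global_log

    def log_change(self, index_id, page_number, change):
        # Assumption: one marker per index suffices to tell recovery that the
        # local log of that index must be consulted.
        if index_id not in self._announced:
            self.global_log.append(("has_local_changes", index_id))
            self._announced.add(index_id)
        self.local_logs[index_id].append((page_number, change))

    def recover_page(self, index_id, page_number):
        """Return the changes to replay for one page, in logged order."""
        return [c for p, c in self.local_logs.get(index_id, []) if p == page_number]


log = TwoLevelLog()
log.log_change(11, 40, "insert r1")
log.log_change(11, 42, "insert r2")
log.log_change(11, 40, "purge r1")
log.log_change(12, 7, "insert r3")
```

In this model the global log grows with the number of affected indexes rather than with the number of changes, which reflects the smaller global log mentioned above. The cost moves to `recover_page`, which has to scan the local log on first access.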

            People

              Assignee: Unassigned
              Reporter: Marko Mäkelä
              Votes: 3
              Watchers: 14
