Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-23521

Implement an InnoDB row format that can use overflow pages for smaller columns

Details

    Description

      With the DYNAMIC row format, variable-length fields can only be stored on overflow pages if they are defined with a maximum size of 256 bytes or more., and fixed-length fields can only be stored on overflow pages if their maximum size is 768 bytes or more. From the documentation:

      For BLOB and TEXT columns, only values longer than 40 bytes are considered for storage on overflow pages. For VARBINARY and VARCHAR columns, only values longer than 255 bytes are considered for storage on overflow pages. Bytes that are stored to track a value's length do not count towards these limits. These limits are only based on the length of the actual column's data.

      These limits differ from the limits for the COMPACT row format, where the limit is 767 bytes for all types.

      Fixed-length columns greater than 767 bytes are encoded as variable-length columns, so they can also be stored in overflow pages if the table's row size is greater than half of innodb_page_size. Even though a column using the CHAR data type can hold at most 255 characters, a CHAR column can still exceed 767 bytes in some cases. For example, a char(255) column can exceed 767 bytes if the character set is utf8mb4.

      https://mariadb.com/kb/en/library/innodb-dynamic-row-format/#overflow-pages-with-the-dynamic-row-format

      A lot of users run into "row size too large" errors due to this:

      https://mariadb.com/kb/en/library/troubleshooting-row-size-too-large-errors-with-innodb/

      A counter-intuitive solution to many of these issues is to actually increase the length of variable-length fields, so that they can be stored on overflow pages.

      https://mariadb.com/kb/en/library/troubleshooting-row-size-too-large-errors-with-innodb/#increasing-the-length-of-varbinary-columns

      https://mariadb.com/kb/en/library/troubleshooting-row-size-too-large-errors-with-innodb/#increasing-the-length-of-varchar-columns

      It might be a good idea to implement a row format that allows smaller pages to be stored on overflow pages, so that users don't have to use this workaround.

      Attachments

        Issue Links

          Activity

            For the reference, MDEV-12026 and MDEV-18644 (which implemented a fairly minor refinement to the existing InnoDB file format) were logged as 33 working days. I do not know who came up with the estimate of 5 days, but I think that it is completely unrealistic. As far as I can understand, this would involve designing not only a new index page format but also changes to the undo logging, to support that.

            Designing and implementing a completely new index page format for InnoDB would be a multi-month effort. If we go down that route, I would consider implementing a completely new B-tree storage engine, including features like the following:

            • Similar to MDEV-17598, have per-record transaction identifiers in secondary indexes.
            • Similar to RocksDB, store every visible record version in the clustered index (essentially, include DB_TRX_ID in the PRIMARY KEY). And implement some nice page format that allows column values to be shared between records, to minimize the overhead of updating a single column.

            These changes would greatly simplify MVCC and checks for implicit locks. The undo log would only be consulted for rollback and purge. For that to work, it would have to record (table_id,primary_key) for every transaction.

            I have also been thinking about writing undo log records in the same stream with redo log (which depends on MDEV-12353 and MDEV-14425). We would use memory-mapped files to persistent memory. But, it looks like the RocksDB-like change (including DB_TRX_ID in the PRIMARY KEY) would greatly reduce accesses to the undo log, and we might as well keep separate undo log pages, to avoid introducing a hardware dependency, and avoid introducing constraints for redo log checkpoints.

            marko Marko Mäkelä added a comment - For the reference, MDEV-12026 and MDEV-18644 (which implemented a fairly minor refinement to the existing InnoDB file format) were logged as 33 working days. I do not know who came up with the estimate of 5 days, but I think that it is completely unrealistic. As far as I can understand, this would involve designing not only a new index page format but also changes to the undo logging, to support that. Designing and implementing a completely new index page format for InnoDB would be a multi-month effort. If we go down that route, I would consider implementing a completely new B-tree storage engine, including features like the following: Similar to MDEV-17598 , have per-record transaction identifiers in secondary indexes. Similar to RocksDB, store every visible record version in the clustered index (essentially, include DB_TRX_ID in the PRIMARY KEY ). And implement some nice page format that allows column values to be shared between records, to minimize the overhead of updating a single column. These changes would greatly simplify MVCC and checks for implicit locks. The undo log would only be consulted for rollback and purge. For that to work, it would have to record (table_id,primary_key) for every transaction. I have also been thinking about writing undo log records in the same stream with redo log (which depends on MDEV-12353 and MDEV-14425 ). We would use memory-mapped files to persistent memory. But, it looks like the RocksDB-like change (including DB_TRX_ID in the PRIMARY KEY ) would greatly reduce accesses to the undo log, and we might as well keep separate undo log pages, to avoid introducing a hardware dependency, and avoid introducing constraints for redo log checkpoints.

            People

              marko Marko Mäkelä
              GeoffMontee Geoff Montee (Inactive)
              Votes:
              2 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.