[MDEV-23521] Implement an InnoDB row format that can use overflow pages for smaller columns Created: 2019-10-23  Updated: 2023-11-30

Status: Open
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Fix Version/s: None

Type: New Feature Priority: Minor
Reporter: Geoff Montee (Inactive) Assignee: Marko Mäkelä
Resolution: Unresolved Votes: 2
Labels: None

Issue Links:
Relates
relates to MDEV-10364 Row size too large (> 8126) Closed
relates to MDEV-16969 Fix error message as promised in upst... Open
relates to MDEV-20256 "Row size too large" on latest releas... Closed
relates to MDEV-20268 create table fails with row size too ... Closed
relates to MDEV-20346 Cannot create a certain table: Row si... Closed
relates to MDEV-20448 Add link to release notes about "Row ... Closed
relates to MDEV-20786 MariaDB server writes warnings into e... Closed
relates to MDEV-20832 Don't print "row size too large" warn... Closed
relates to MDEV-21261 When performing instant add column, I... Confirmed
relates to MDEV-19292 "Row size too large" error when creat... Closed
relates to MDEV-20194 Warnings inconsistently issued upon C... Closed
relates to MDEV-20400 Implement a way to query an InnoDB ta... Confirmed
relates to MDEV-20949 Stop performing unnecessary maximum r... Closed
relates to MDEV-21429 TRUNCATE and OPTIMIZE are being refus... Closed
relates to MDEV-22585 Incorrect note being written to the e... Open

 Description   

With the DYNAMIC row format, variable-length fields can only be stored on overflow pages if they are defined with a maximum size of 256 bytes or more., and fixed-length fields can only be stored on overflow pages if their maximum size is 768 bytes or more. From the documentation:

For BLOB and TEXT columns, only values longer than 40 bytes are considered for storage on overflow pages. For VARBINARY and VARCHAR columns, only values longer than 255 bytes are considered for storage on overflow pages. Bytes that are stored to track a value's length do not count towards these limits. These limits are only based on the length of the actual column's data.

These limits differ from the limits for the COMPACT row format, where the limit is 767 bytes for all types.

Fixed-length columns greater than 767 bytes are encoded as variable-length columns, so they can also be stored in overflow pages if the table's row size is greater than half of innodb_page_size. Even though a column using the CHAR data type can hold at most 255 characters, a CHAR column can still exceed 767 bytes in some cases. For example, a char(255) column can exceed 767 bytes if the character set is utf8mb4.

https://mariadb.com/kb/en/library/innodb-dynamic-row-format/#overflow-pages-with-the-dynamic-row-format

A lot of users run into "row size too large" errors due to this:

https://mariadb.com/kb/en/library/troubleshooting-row-size-too-large-errors-with-innodb/

A counter-intuitive solution to many of these issues is to actually increase the length of variable-length fields, so that they can be stored on overflow pages.

https://mariadb.com/kb/en/library/troubleshooting-row-size-too-large-errors-with-innodb/#increasing-the-length-of-varbinary-columns

https://mariadb.com/kb/en/library/troubleshooting-row-size-too-large-errors-with-innodb/#increasing-the-length-of-varchar-columns

It might be a good idea to implement a row format that allows smaller pages to be stored on overflow pages, so that users don't have to use this workaround.



 Comments   
Comment by Marko Mäkelä [ 2019-12-11 ]

For the reference, MDEV-12026 and MDEV-18644 (which implemented a fairly minor refinement to the existing InnoDB file format) were logged as 33 working days. I do not know who came up with the estimate of 5 days, but I think that it is completely unrealistic. As far as I can understand, this would involve designing not only a new index page format but also changes to the undo logging, to support that.

Designing and implementing a completely new index page format for InnoDB would be a multi-month effort. If we go down that route, I would consider implementing a completely new B-tree storage engine, including features like the following:

  • Similar to MDEV-17598, have per-record transaction identifiers in secondary indexes.
  • Similar to RocksDB, store every visible record version in the clustered index (essentially, include DB_TRX_ID in the PRIMARY KEY). And implement some nice page format that allows column values to be shared between records, to minimize the overhead of updating a single column.

These changes would greatly simplify MVCC and checks for implicit locks. The undo log would only be consulted for rollback and purge. For that to work, it would have to record (table_id,primary_key) for every transaction.

I have also been thinking about writing undo log records in the same stream with redo log (which depends on MDEV-12353 and MDEV-14425). We would use memory-mapped files to persistent memory. But, it looks like the RocksDB-like change (including DB_TRX_ID in the PRIMARY KEY) would greatly reduce accesses to the undo log, and we might as well keep separate undo log pages, to avoid introducing a hardware dependency, and avoid introducing constraints for redo log checkpoints.

Generated at Thu Feb 08 09:23:01 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.