[MDEV-16291] Allow ALGORITHM=NOCOPY for most ALTER TABLE in InnoDB Created: 2018-05-25 Updated: 2024-01-18 |
|
| Status: | Open |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Fix Version/s: | None |
| Type: | Task | Priority: | Major |
| Reporter: | Marko Mäkelä | Assignee: | Marko Mäkelä |
| Resolution: | Unresolved | Votes: | 6 |
| Labels: | ddl | ||
| Issue Links: |
|
| Description |
|
Operations that may fail will require validation by reading all data in the table, and thus they cannot be instantaneous. While such operations cannot support ALGORITHM=INSTANT, we can support ALGORITHM=NOCOPY: perform the validation and use the MDEV-11424 mechanism for representing the operation, without modifying any data. Examples of such operations include:
Examples of changes that will require the table to be rebuilt (that will not be supported with ALGORITHM=NOCOPY) include:
Metadata format changes
TO BE DEFINED: What metadata exactly needs to be stored and how? At least the following:
Data format changes
Since this work is a refinement of MDEV-11424, any INSERT or UPDATE after an ALTER TABLE operation that changes the data format would cause the clustered index leaf page to be rewritten in ROW_FORMAT=REDUNDANT. In this work, we must store a 'conversion recipe number' in the record or page header.
We may choose to limit the number of conversion recipes that can be supported. If an ALTER TABLE operation would require a conversion recipe, but there is no space to store one, the operation would be refused with ALGORITHM=NOCOPY. After any ALTER TABLE that rebuilds the table, the maximum number of conversion recipes will be available again.
How it works
The 'default row' record needs to store the conversion recipes that allow each clustered index leaf page record format version to be converted to the latest table definition. These recipes will be cached in dict_table_t. The 'default row' record will only be read when the table definition is first loaded into the data dictionary cache. Adding or dropping indexes, or renaming columns or indexes, would not touch the 'default row' record. If we store recipes for converting from version 1 to the latest, version 2 to the latest, and so on, then each instant ALTER TABLE operation that changes the data format would have to update every stored recipe.
Read and write access to the clustered index would be affected as follows: reads would have to be prepared to convert records from an old format to the latest one; writes would produce records in the latest format.
Possible future work: Cleaning up conversion recipes
There could be a separate online operation for clean-up, such as OPTIMIZE INDEX PRIMARY ON t1, which would also defragment the index. This would be useful if writes to the table are rare, because reads would not write back any converted records. The clean-up operation could also remove old version history from the 'default row' record. |
| Comments |
| Comment by Marko Mäkelä [ 2018-05-30 ] |
|
There are a few reasons why I would require ROW_FORMAT changes to trigger a full table rebuild:
Technically, we could allow each clustered index leaf page to be in a different ROW_FORMAT, but it would be tricky to implement and test the logic for the column prefix indexes, and it could also be tricky to document the limitations when changing from ROW_FORMAT=DYNAMIC (the default since MariaDB Server 10.2) to a more restrictive format. |
| Comment by Marko Mäkelä [ 2018-05-30 ] |
|
Similar to ROW_FORMAT, changing KEY_BLOCK_SIZE (which only applies to ROW_FORMAT=COMPRESSED, which does not support any form of instant ALTER TABLE) or encryption parameters will require that the table be rebuilt. |
| Comment by Marko Mäkelä [ 2018-11-01 ] |
|
Currently, column type conversions are implemented in Copy_field::do_copy(), which is called by copy_data_between_tables(). This makes use of a function pointer to a conversion function, such as Field_long::store() or do_copy_not_null(). These conversion functions require that the data be available in Field::ptr. InnoDB stores data in a different internal format: integers are stored in big-endian byte order with the sign bit inverted, so that data can be compared with memcmp(). In order to use Copy_field, the ALGORITHM=INPLACE code in InnoDB would have to convert both the source data and the copied data. It seems that we would have to refactor Copy_field and Field::get_copy_func() so that the copied data would be in the storage engine format. |
| Comment by Marko Mäkelä [ 2019-09-19 ] |
|
MDEV-17520 would be revived as part of this task.
For dictionary format versioning, we could use the hidden system fields DB_TRX_ID and DB_ROLL_PTR as follows. If DB_TRX_ID=0 (a possibility that was introduced in an earlier change), DB_ROLL_PTR does not refer to any undo log record and could instead carry the dictionary format version. Whenever DB_TRX_ID is being reset to 0, we would also update the dictionary format version in DB_ROLL_PTR. This work would extend the metadata BLOB that was introduced in an earlier task.
We could extend the metadata BLOB further, to remember the DB_TRX_ID of each instant ALTER TABLE operation that introduced a new dictionary format version. This would allow us to look up the corresponding clustered index definition, or the recipe to convert the record to the latest format. This is needed in a rare case where some purgeable history existed for the table while instant ALTER TABLE was executed. We might skip this, and do a full purge before continuing with the ALTER TABLE (so that all clustered index records will carry DB_TRX_ID=0). |