MariaDB Server: MDEV-16291

Allow ALGORITHM=NOCOPY for most ALTER TABLE in InnoDB

Details

    Description

      MDEV-11424 introduces support for ALGORITHM=INSTANT for failure-free ALTER TABLE operations that do not involve adding or dropping indexes.

      Operations that may fail will require validation by reading all data in the table, and thus they cannot be instantaneous. While we cannot support ALGORITHM=INSTANT, we can support ALGORITHM=NOCOPY, that is, perform the validation and use the MDEV-11424 mechanism for representing the operation, without modifying any data.

      Examples of such operations include:

      • Adding or removing the UNSIGNED attribute of an integer column
      • Adding the NOT NULL attribute
      • Shortening CHAR or VARCHAR (to produce errors or warnings for data truncation)
      • Changing the character encoding of a CHAR or VARCHAR

      Examples of changes that will require the table to be rebuilt (that will not be supported with ALGORITHM=NOCOPY) include:

      • Adding, dropping or changing PRIMARY KEY columns
      • Changing the ROW_FORMAT or ENGINE
      • Altering a table that is in ROW_FORMAT=COMPRESSED

      Metadata format changes

      • This work will extend the 'default row' record introduced in MDEV-11369 further, to store 'conversion recipes' that make it possible to process records that correspond to an earlier definition of the table. To indicate the presence of the recipes in the 'default row', a new bit in info_bits will be set, and an additional BLOB column is appended to the record for storing the recipes.

      TO BE DEFINED: Exactly what metadata needs to be stored, and how? At least the following:

      • Character set encoding, so that data can be converted to the newest encoding
      • Whether integer columns are signed or unsigned
      • The original fixed length of string columns
      • Do we want to support conversions between string and integer columns? ENUM? BIT? SET?
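A minimal sketch of what one such recipe entry might look like, covering the items listed above. All names and the byte layout are hypothetical, not MariaDB code; the point is only that the per-column metadata is small and can be serialized into the BLOB appended to the 'default row' record:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-column conversion recipe entry.
struct col_recipe_t {
  uint16_t charset_id;   // source character set of a CHAR/VARCHAR column
  bool     is_unsigned;  // signedness of an integer column
  uint16_t fixed_len;    // original fixed length of a string column
};

// Serialize recipe entries into a byte buffer, as they might be stored
// in the BLOB column appended to the 'default row' record.
std::vector<uint8_t> serialize(const std::vector<col_recipe_t>& recipes) {
  std::vector<uint8_t> blob;
  for (const col_recipe_t& r : recipes) {
    blob.push_back(uint8_t(r.charset_id >> 8));   // big-endian charset id
    blob.push_back(uint8_t(r.charset_id));
    blob.push_back(r.is_unsigned ? 1 : 0);
    blob.push_back(uint8_t(r.fixed_len >> 8));    // big-endian length
    blob.push_back(uint8_t(r.fixed_len));
  }
  return blob;
}
```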

      Data format changes

      Since this work is a refinement of MDEV-11424, any INSERT or UPDATE after an ALTER TABLE operation that changes the data format would cause the clustered index leaf page to be rewritten in ROW_FORMAT=REDUNDANT.

      In this work, we must store a 'conversion recipe number' in the record or page header.

      • We can repurpose the 64-bit page header field PAGE_MAX_TRX_ID. MDEV-6076 repurposed it for PAGE_ROOT_AUTO_INC in the root page. If the table consists of a single root page or it later shrinks to a single page, we can simply convert the table to non-instant format. Having a per-page format would require that all records of the page be converted to the newest format on any write.
      • In the ROW_FORMAT=REDUNDANT record header, there are 3 spare info_bits that we could use to support a small maximum number of conversion recipes.
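To illustrate the second option: assuming (hypothetically) that the 3 spare info_bits occupy the low bits of the info byte, a recipe number of up to 7 could be stored and read back with simple masking. The mask position is illustrative; the real REDUNDANT record header layout differs:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative only: assume the 3 spare info_bits are the 3 low bits
// of the info byte, giving recipe numbers 0..7 (0 = current format).
constexpr uint8_t RECIPE_MASK = 0x07;

uint8_t set_recipe_no(uint8_t info_bits, uint8_t recipe_no) {
  assert(recipe_no <= RECIPE_MASK);
  // Clear the spare bits, then store the recipe number.
  return uint8_t((info_bits & ~RECIPE_MASK) | recipe_no);
}

uint8_t get_recipe_no(uint8_t info_bits) {
  return info_bits & RECIPE_MASK;
}
```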

      We may choose to limit the number of conversion recipes that can be supported. If an ALTER TABLE operation would require a conversion recipe, but there is no space to store one, the operation would be refused with ALGORITHM=NOCOPY. After any ALTER TABLE that rebuilds the table, the maximum number of conversion recipes will be available again.

      How it works

      The 'default row' record needs to store the conversion recipes that allow each clustered index leaf page record format version to be converted to the latest table definition. These recipes will be cached in dict_table_t. The 'default row' record will only be read when the table definition is first loaded to the data dictionary cache.

      Adding or dropping indexes, or renaming columns or indexes, would not touch the 'default row' record.

      If we store recipes for converting from version 1 to latest, 2 to latest, and so on, then each instant ALTER (MDEV-11424, MDEV-11369, MDEV-15562) may have to rewrite the old recipes, so that they will work from 1 to latest+1, 2 to latest+1 and so on, and finally add a recipe from latest to latest+1. Because we will store the recipes in a BLOB, and because BLOBs are copy-on-write in InnoDB, all conversion recipe pages will have to be rewritten in any case.
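The rewriting of the recipe list on each instant ALTER can be modeled with a toy example, in which a 'row' is just an int and each recipe is a function converting a row from version i to the latest version (everything here is illustrative, not InnoDB code):

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Toy model: a "row" is an int; recipes[i-1] converts a row written
// under dictionary version i to the latest version.
using recipe = std::function<int(int)>;

// On an instant ALTER that introduces a new latest version, compose
// every stored recipe with the new conversion step, then append a
// recipe for the previous latest version (the new step itself).
void add_version(std::vector<recipe>& recipes, recipe step) {
  for (recipe& r : recipes) {
    recipe old = r;
    r = [old, step](int row) { return step(old(row)); };
  }
  recipes.push_back(step);
}
```

This mirrors the observation in the text: every existing recipe has to be rewritten on each instant ALTER, which is acceptable because the recipe BLOB is copy-on-write anyway.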

      Read and write access to the clustered index would be affected as follows:
      Writes would always use the latest dictionary version. If we use a per-page identifier (instead of a per-record one) and the page being written to contains other old-format records, we could convert all of them while we are modifying the page.

      Reads would have to be prepared to convert records from old format to the latest one.

      Possible future work: Cleaning up conversion recipes

      There could be a separate online operation for clean-up, such as OPTIMIZE INDEX PRIMARY ON t1, which would also defragment the index. This would be useful if writes to the table are rare, because reads would not write back any converted records.

      The clean-up operation could also remove old version history from the 'default row' record. After MDEV-16264, we could also have a background cleanup operation that would (with some granularity) remember the last processed PRIMARY KEY values, by storing them in the (otherwise ignored) keys of the 'default row' records.

      Activity

            There are a few reasons why I would require a full table rebuild for ROW_FORMAT changes:

            • The attribute is supposed to apply to all indexes of the table; if we allowed NOCOPY operation, we would only change the format of subsequently modified clustered index leaf pages.
            • The maximum record length depends on the ROW_FORMAT, and so does the maximum column prefix length in secondary indexes.
            • Undo logging of column prefix indexes depends on the ROW_FORMAT.

            Technically, we could allow each clustered index leaf page to be in a different ROW_FORMAT, but it would be tricky to implement and test the logic for the column prefix indexes, and it could also be tricky to document the limitations when changing from ROW_FORMAT=DYNAMIC (the default since MariaDB Server 10.2) to a more restrictive format.

            marko Marko Mäkelä added a comment

            Similar to ROW_FORMAT, changing KEY_BLOCK_SIZE (which only applies to ROW_FORMAT=COMPRESSED, which does not support any form of instant ALTER TABLE) or encryption parameters will require that the table be rebuilt.

            marko Marko Mäkelä added a comment

            Currently, column type conversions are implemented in Copy_field::do_copy(), which is called by copy_data_between_tables(). This makes use of a function pointer, pointing to a conversion function, such as Field_long::store() or do_copy_not_null(). These conversion functions require that the data be available in Field::ptr.

            InnoDB stores data in a different format internally. Integers are stored in big-endian format, and the sign bit is inverted, so that data can be compared with memcmp(). In order to use Copy_field, the ALGORITHM=INPLACE code in InnoDB would have to convert both the source data and the copied data. It seems that we would have to refactor Copy_field and Field::get_copy_func() so that the copied data would be in the storage engine format.
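The integer encoding described above can be sketched as follows: big-endian byte order with the sign bit inverted, so that memcmp() on the stored bytes agrees with numeric comparison. The helper name is made up; only the encoding itself is taken from the text:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Store a signed 32-bit integer the way InnoDB keeps it internally:
// big-endian, with the sign bit flipped, so that memcmp() on the
// stored bytes yields the numeric ordering.
void store_int32(uint8_t* dst, int32_t value) {
  uint32_t u = uint32_t(value) ^ 0x80000000U;  // invert the sign bit
  dst[0] = uint8_t(u >> 24);
  dst[1] = uint8_t(u >> 16);
  dst[2] = uint8_t(u >> 8);
  dst[3] = uint8_t(u);
}
```

Because the SQL layer's Copy_field works on the little-endian, sign-bit-intact Field::ptr representation, any conversion done through it would need a transformation to and from this format on both ends.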

            marko Marko Mäkelä added a comment

            MDEV-17520 would be revived as part of this task.

            For dictionary format versioning, we could use the hidden system fields DB_TRX_ID and DB_ROLL_PTR as follows.

            If DB_TRX_ID=0 (a possibility that was introduced in MDEV-12288), the most significant bit in DB_ROLL_PTR must be set. Up until now, all the 55 least significant bits were 0. This task will repurpose the least significant bits to be the dictionary format version.

            Whenever DB_TRX_ID is being reset to 0, we would also update the dictionary format version in DB_ROLL_PTR.
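A sketch of the proposed DB_ROLL_PTR encoding, assuming a 56-bit (7-byte) value whose most significant bit is the flag that must be set when DB_TRX_ID=0, with the 55 low bits repurposed as the dictionary format version (the constant names are illustrative):

```cpp
#include <cassert>
#include <cstdint>

// DB_ROLL_PTR is 7 bytes = 56 bits. When DB_TRX_ID=0, bit 55 is set
// and, until this task, the 55 least significant bits were all 0.
constexpr uint64_t ROLL_PTR_FLAG   = 1ULL << 55;
constexpr uint64_t VERSION_MASK    = ROLL_PTR_FLAG - 1;  // low 55 bits

// Build a DB_ROLL_PTR carrying a dictionary format version.
uint64_t make_versioned_roll_ptr(uint64_t version) {
  assert(version <= VERSION_MASK);
  return ROLL_PTR_FLAG | version;
}

// Extract the dictionary format version from such a DB_ROLL_PTR.
uint64_t roll_ptr_version(uint64_t roll_ptr) {
  return roll_ptr & VERSION_MASK;
}
```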

            This work would extend the metadata BLOB, which was introduced in MDEV-15562, to store conversion recipes for converting clustered index leaf page records from older dictionary format versions to the latest one.

            We could extend the metadata BLOB further, to remember the DB_TRX_ID of each instant ALTER TABLE operation that introduced a new dictionary format version. This would allow us to look up the corresponding clustered index definition, or the recipe to convert the record to the latest format. This is needed in a rare case where some purgeable history existed for the table while instant ALTER TABLE was executed. We might skip this, and do a full purge before continuing with the ALTER TABLE (so that all clustered index records will carry DB_TRX_ID=0).

            marko Marko Mäkelä added a comment

            People

              marko Marko Mäkelä
              Votes: 6
              Watchers: 6