Details

    • 10.2.4-5, 10.3.1-2

    Description

      Add support for instant add column for InnoDB when adding a new column with a default value last.

      We have got patches from Alibaba and TenCent for doing this. We should choose either one or merge them.

      The idea in both patches is to add a new row type to InnoDB where the number of columns
      are stored in each row in 1-2 bytes. When creating a row we store the current number of columns. This allows us to fill any extra columns with default values on read.

      Attachments

        Issue Links

          Activity

            The original InnoDB ROW_FORMAT=REDUNDANT should support Postgres-style instant operations out of the box, because the record header explicitly stores the number of fields as well as the start address of each field. The Postgres-style operations are ADD COLUMN…LAST and DROP COLUMN (the latter is implemented by hiding columns and storing dummy columns for new records).

            A simple proof of concept exists in MariaDB 10.2 and upstream MySQL 5.7: The column MERGE_THRESHOLD was appended to the InnoDB data dictionary table SYS_INDEXES without rewriting existing records. Old data dictionary entries will deliver a default value.

            While ROW_FORMAT=REDUNDANT introduces some storage overhead, it should also provide faster access, because the start address of each field is explicitly stored in the index record header. No changes to applications are needed if innodb_default_row_format=redundant is set in the configuration files.

            Also in this task, ALTER TABLE will not be instantaneous if it includes any data-writing operations, such as ADD [UNIQUE] INDEX. DROP COLUMN may imply DROP INDEX, but that is reasonably fast (not traversing the entire index tree).

            In a later MariaDB version, we could support a wider range of instant ALTER TABLE operations by repurposing some bits or bytes in the clustered index leaf page record header for representing a data dictionary version, by keeping all versions of the table definition in the data dictionary, and by converting old-format records to the current format. Note that some operations will need to write to data files and cannot be instantaneous. Such operations may also fail. Examples include introducing a NOT NULL attribute, changing the data type of a column, or ADD UNIQUE INDEX.

            marko Marko Mäkelä added a comment - The original InnoDB ROW_FORMAT=REDUNDANT should support Postgres-style instant operations out of the box, because the record header explicitly stores the number of fields as well as the start address of each field. The Postgres-style operations are ADD COLUMN…LAST and DROP COLUMN (the latter is implemented by hiding columns and storing dummy columns for new records). A simple proof of concept exists in MariaDB 10.2 and upstream MySQL 5.7: The column MERGE_THRESHOLD was appended to the InnoDB data dictionary table SYS_INDEXES without rewriting existing records. Old data dictionary entries will deliver a default value. While ROW_FORMAT=REDUNDANT introduces some storage overhead, it should also provide faster access, because the start address of each field is explicitly stored in the index record header. No changes to applications are needed if innodb_default_row_format=redundant is set in the configuration files. Also in this task, ALTER TABLE will not be instantaneous if it includes any data-writing operations, such as ADD [UNIQUE] INDEX. DROP COLUMN may imply DROP INDEX, but that is reasonably fast (not traversing the entire index tree). In a later MariaDB version, we could support a wider range of instant ALTER TABLE operations by repurposing some bits or bytes in the clustered index leaf page record header for representing a data dictionary version, by keeping all versions of the table definition in the data dictionary, and by converting old-format records to the current format. Note that some operations will need to write to data files and cannot be instantaneous. Such operations may also fail. Examples include introducing a NOT NULL attribute, changing the data type of a column, or ADD UNIQUE INDEX.
            vinchen vinchen added a comment -

            Hi, I am the main designer of instant adding column from Tencent.

            It looks exciting of your new ideas of storeing dictionary version in record. Maybe it's a big project.

            But I have some questions of it:
            1. create table t1(a int, b int, c int); – dictionary version 1
            alter table t1 drop column c; – dictionary version 2
            alter table t1 add column c int not null default 10; – dictionary version 3

            How should the record of version 1 to identify the value of column "c“ ?
            Maybe the correct value is 10(default value). Not the value stored in the record(version 1).
            Maybe it should check all the dictionary versions to get the true result?

            2. Do it bring the complexity of parsing the record, and resulting in the performance loss?

            vinchen vinchen added a comment - Hi, I am the main designer of instant adding column from Tencent. It looks exciting of your new ideas of storeing dictionary version in record. Maybe it's a big project. But I have some questions of it: 1. create table t1(a int, b int, c int); – dictionary version 1 alter table t1 drop column c; – dictionary version 2 alter table t1 add column c int not null default 10; – dictionary version 3 How should the record of version 1 to identify the value of column "c“ ? Maybe the correct value is 10(default value). Not the value stored in the record(version 1). Maybe it should check all the dictionary versions to get the true result? 2. Do it bring the complexity of parsing the record, and resulting in the performance loss?

            Hi Vinchen, it was nice to get your feedback.

            Here are short answers to your questions.
            1. MDEV-11424 contains an extended version of your example that answers the question. We must use the value 10.
            2. Yes. To reduce the amount of repeated conversion of readers (which would not write back the converted records), we could say that any write to a page could update all records in the page to the newest version. In this case, we would only need one version identifier per page.

            I think that we should also defer DROP COLUMN to MDEV-11424.
            I am looking forward to your comments in MDEV-11424.
            In this ticket, let us discuss only the ADD COLUMN.

            marko Marko Mäkelä added a comment - Hi Vinchen, it was nice to get your feedback. Here are short answers to your questions. 1. MDEV-11424 contains an extended version of your example that answers the question. We must use the value 10. 2. Yes. To reduce the amount of repeated conversion of readers (which would not write back the converted records), we could say that any write to a page could update all records in the page to the newest version. In this case, we would only need one version identifier per page. I think that we should also defer DROP COLUMN to MDEV-11424 . I am looking forward to your comments in MDEV-11424 . In this ticket, let us discuss only the ADD COLUMN.

            I thought (it's not what Marko has written, but pretty close) about storing a "table definition version" per row. And — in some separate place, for example, in a new InnoDB system table — a set of steps of how to convert the row from the version N-1 to the version N. In the first version the only supported step could be "add column X at the end with the default value Y". Then this feature will be identical to your "instant add column", only in-row format will be different. But later this version-based idea could be extended to support other steps, like, "drop column X", "move column X to be after column Y", and so on. When supporting only adding a column, this is not a big project.

            1. yes, to convert a row from the version 1 to version 3, InnoDB would need to do two steps "drop column c", "add column c, default 10". it should execute all steps to transform from version 1 to 3.

            2. may be, but not necessarily. it depends on what these steps will be. if they're defined in terms of bytes, than executing a step will be a memcpy, no need to parse a row. On the other hand, InnoDB needs to parse a row anyway, to convert it to mysql row format. So, doing the conversion on the parsed row is also possible, it will not cause additional re-parsing.

            serg Sergei Golubchik added a comment - I thought (it's not what Marko has written, but pretty close) about storing a "table definition version" per row. And — in some separate place, for example, in a new InnoDB system table — a set of steps of how to convert the row from the version N-1 to the version N. In the first version the only supported step could be "add column X at the end with the default value Y". Then this feature will be identical to your "instant add column", only in-row format will be different. But later this version-based idea could be extended to support other steps, like, "drop column X", "move column X to be after column Y", and so on. When supporting only adding a column, this is not a big project. 1. yes, to convert a row from the version 1 to version 3, InnoDB would need to do two steps "drop column c", "add column c, default 10". it should execute all steps to transform from version 1 to 3. 2. may be, but not necessarily. it depends on what these steps will be. if they're defined in terms of bytes, than executing a step will be a memcpy, no need to parse a row. On the other hand, InnoDB needs to parse a row anyway, to convert it to mysql row format. So, doing the conversion on the parsed row is also possible, it will not cause additional re-parsing.

            I am against having a separate table for for table version as this makes it hard to do easy export/import of tables.

            What we need first and foremost is to be able to quickly add new columns last, as this is one of the most common ALTER TABLE operations that can cause problems in production. I don't want to see this task being delayed infinitely just because we want be over ambitious with what we want to do. Now we need something that we can have stable in February!

            monty Michael Widenius added a comment - I am against having a separate table for for table version as this makes it hard to do easy export/import of tables. What we need first and foremost is to be able to quickly add new columns last, as this is one of the most common ALTER TABLE operations that can cause problems in production. I don't want to see this task being delayed infinitely just because we want be over ambitious with what we want to do. Now we need something that we can have stable in February!

            serg, monty, I think that your comments apply to MDEV-11424. Please let us keep this MDEV-11369 to be clean of that.

            My proposal for this MDEV-11369 is that we cover ALTER TABLE…ADD COLUMN…LAST for tables that are already in ROW_FORMAT=REDUNDANT. In these tables, each record contains the number of columns. We can easily append columns to the end; upon reading, we would replace missing column values with the DEFAULT value.

            However, the .frm file should somehow remember the original column count, so that we can refuse the change of column DEFAULT values for those columns that have been added after the table was originally created. This would prevent the problematic scenario that vinchen presented on 2016-11-30.

            I think that related to this change, we should introduce a number of ALGORITHM variants. Quoting from MDEV-11424: We might want to introduce ALTER TABLE qualifiers to prevent negative surprises. For example, ALGORITHM=QUICK would refuse ADD INDEX, but it would allow DROP INDEX and any metadata-only changes. ALGORITHM=NOCOPY would allow ADD INDEX, but it would refuse to rebuild the whole table.

            In MDEV-11369, changing the DEFAULT value would be allowed with ALGORITHM=QUICK when we are not changing the DEFAULT of already added columns. And DROP COLUMN would always be refused with ALGORITHM=NOCOPY.

            marko Marko Mäkelä added a comment - serg , monty , I think that your comments apply to MDEV-11424 . Please let us keep this MDEV-11369 to be clean of that. My proposal for this MDEV-11369 is that we cover ALTER TABLE…ADD COLUMN…LAST for tables that are already in ROW_FORMAT=REDUNDANT. In these tables, each record contains the number of columns. We can easily append columns to the end; upon reading, we would replace missing column values with the DEFAULT value. However, the .frm file should somehow remember the original column count, so that we can refuse the change of column DEFAULT values for those columns that have been added after the table was originally created. This would prevent the problematic scenario that vinchen presented on 2016-11-30. I think that related to this change, we should introduce a number of ALGORITHM variants. Quoting from MDEV-11424 : We might want to introduce ALTER TABLE qualifiers to prevent negative surprises. For example, ALGORITHM=QUICK would refuse ADD INDEX, but it would allow DROP INDEX and any metadata-only changes. ALGORITHM=NOCOPY would allow ADD INDEX, but it would refuse to rebuild the whole table. In MDEV-11369 , changing the DEFAULT value would be allowed with ALGORITHM=QUICK when we are not changing the DEFAULT of already added columns. And DROP COLUMN would always be refused with ALGORITHM=NOCOPY.
            marko Marko Mäkelä added a comment - - edited

            Page-level format tracking instead of row-level

            Tencent has implemented instant ADD COLUMN by introducing new ROW_FORMAT variants that explicitly store the number of fields in index records. With the original InnoDB format (ROW_FORMAT=REDUNDANT) we could have something like that as is, because that format stores the number of fields.

            However, there is an alternative that is compatible with old data files that have been created in any ROW_FORMAT. Instead of tracking the number of columns for each record, we would track it for each page.

            Implementation of the instant ADD COLUMN operation

            As suggested in MDEV-11424 for more general instant ALTER, we can repurpose the unused field PAGE_MAX_TRX_ID in clustered index leaf pages. All InnoDB versions at least since MySQL 3.23.49 wrote this field as 0. We would only support ADD COLUMN with the following restrictions:

            1. Only ADD COLUMN…LAST. Any DROP COLUMN or change of column order will require a full table rebuild.
            2. Only constant DEFAULT values.
            3. Changing the DEFAULT of instantly-added columns will require a full table rebuild.

            A new field PAGE_CLUST_LEAF_FIELDS will be aliased to PAGE_MAX_TRX_ID and defined as follows:

            1. On non-leaf pages, there is no PAGE_CLUST_LEAF_FIELDS. (In clustered index non-leaf pages other than the root, the PAGE_MAX_TRX_ID will remain unused (0). For the clustered index root page, MDEV-6076 introduced PAGE_ROOT_AUTO_INC.)
            2. If the clustered index consists of a single page (root and leaf), PAGE_CLUST_LEAF_FIELDS does not exist, and the records must be in the newest format. (MDEV-6076 repurposed the PAGE_MAX_TRX_ID of the clustered index root page as PAGE_ROOT_AUTO_INC.)
            3. If the PAGE_CLUST_LEAF_FIELDS of the leftmost leaf page is 0, all PAGE_CLUST_LEAF_FIELDS must be 0. (These are tables where instant ALTER TABLE (MDEV-11369 or MDEV-11424) has not been used, or the table has been rebuilt since then.)
            4. In the leftmost leaf page, PAGE_CLUST_LEAF_FIELDS contains the original number of fields in the clustered index, before any instant ADD COLUMN operations. This field will be cached in dict_table_t::clust_leaf_fields. The records in the leftmost leaf page must always have the latest number of columns.
            5. In other leaf pages of the clustered index, if PAGE_CLUST_LEAF_FIELDS is 0, the number of fields in the records will be dict_table_t::clust_leaf_fields. Else, PAGE_CLUST_LEAF_FIELDS is the number of fields in the records, and it must be between index->table->clust_leaf_fields+1 and index->n_fields.
            6. Only the PAGE_CLUST_LEAF_FIELDS values 1 to 1023 will be reserved for this purpose. (InnoDB does not support indexes with more fields.) The values 1024 to (1<<64)-1 could be used by MDEV-11424 later.

            From the above it follows that instant ADD COLUMN must convert the leftmost leaf page or the root-leaf page to the newest format. Unless the conversion results into a page split, only a single data page will be modified.

            The data dictionary (both the .frm file and the InnoDB SYS_ tables) will contain the latest table definition. The number of clustered index fields before the first instant ADD COLUMN is only stored in dict_table_t::clust_leaf_fields the PAGE_CLUST_LEAF_FIELDS field of the leftmost leaf page. If multiple ALTER TABLE…ADD COLUMN statements are executed on the same table, the dict_table_t::clust_leaf_fields will not change.

            When determining whether changing the DEFAULT value of some columns requires a table rebuild, InnoDB must determine if the clustered index position of any of the affected columns is dict_table_t::clust_leaf_fields or greater. If table->clust_leaf_fields == 0, instant ADD COLUMN has not been used, and any DEFAULT values can be changed instantly.

            Data format changes

            The format of secondary index pages will be unaffected by instant ALTER TABLE.

            The clustered index leaf page format is affected by the instant ADD COLUMN, by the introduction of the PAGE_CLUST_LEAF_FIELDS (repurposed from PAGE_MAX_TRX_ID).

            Non-leaf, non-root clustered index pages will be unaffected.

            The clustered index root page format will have to be slightly changed to avoid an unintended change to the node pointer records when ROW_FORMAT is COMPACT, DYNAMIC or COMPRESSED. There is no problem with ROW_FORMAT=REDUNDANT.

            In clustered index node pointer records, all fields will be NOT NULL, because the PRIMARY KEY columns cannot be NULL. Nevertheless, the COMPACT, DYNAMIC and COMPRESSED clustered index node pointer record header will unnecessarily allocate 0‥128 zero bytes to represent NULL flags in the node pointers. The 0‥128 corresponds to UT_BITS_IN_BYTES(index->n_nullable). index->n_nullable can be 0‥1017 in the clustered index.

            So, we must find one byte in the clustered index root page to store the number of the redundant zero bytes in the node pointer record header. Luckily, only 3 bits of the 16-bit field PAGE_DIRECTION are used.

            We will repurpose the most significant byte of PAGE_DIRECTION in all COMPACT, DYNAMIC, REDUNDANT index root pages as follows: If the field is nonzero, it will be set to 1+UT_BITS_IN_BYTES(index->n_nullable), to be used when interpreting the node pointer records. (The byte was always set to 0 until now.) The value will also be cached in a new field dict_index_t::node_ptr_null_bytes (always 0 if ROW_FORMAT=REDUNDANT). This will not only solve the problem at hand, but it will also allow us to avoid wasting space on node pointer pages in the future.

            Changes to read operations

            When reading records from a page, we must use the current number of fields in the page. This will require some modifications to rec_get_offsets() or its callers, especially because UT_BITS_IN_BYTES(index->n_nullable) may change. When returning a record to the SQL layer, InnoDB has to fill in the DEFAULT values of any instantly-added columns that were missing from the leaf page.

            If an index is created on an instantly added column, InnoDB may need to access the DEFAULT values of instantly added columns. For this purpose, we probably should move row_prebuilt_t::default_rec to dict_table_t::default_rec.

            Changes to write operations

            Because we maintain the number of fields at the page level, all records within the clustered index leaf page must have the same number of fields.

            This means that whenever any record within the page is inserted, or a not-yet-instantiated instantly added column is updated, all records on the page must be converted. This could require the leaf page to be split into multiple pages.

            Page splits and merges will convert all records in all affected leaf pages to the newest format.

            Page reorganize does not need to convert records, and maybe it should, because the conversion could cause a page overflow. This could mean that page reorganize should memcpy() the individual records instead of invoking page_copy_rec_list_end_no_locks() which would by default use the latest index definition.

            It is optional to convert pages when records are purged (deleted) or the delete-mark bit is changed (for example, BEGIN; DELETE FROM instantly_altered_table; ROLLBACK).

            Example

            CREATE TABLE t(
              a INT NOT NULL, b INT NOT NULL,
              c1 CHAR(255) NOT NULL DEFAULT '',
              c2 CHAR(255) NOT NULL DEFAULT '',
              c3 CHAR(255) NOT NULL DEFAULT '',
              c4 CHAR(255) NOT NULL DEFAULT '',
              c5 CHAR(255) NOT NULL DEFAULT '',
              c6 CHAR(255) NOT NULL DEFAULT '',
              c7 CHAR(255) NOT NULL DEFAULT '',
              c8 CHAR(255) NOT NULL DEFAULT '',
              PRIMARY KEY(a,b), INDEX(b)
            ) ENGINE=InnoDB;
             
            BEGIN;
            INSERT INTO t(a,b) VALUES(1,1),(1,2),(1,3),(1,4),(1,5);
            INSERT INTO t(a,b) SELECT a+1,b FROM t;
            INSERT INTO t(a,b) SELECT a+2,b FROM t;
            INSERT INTO t(a,b) SELECT a+4,b FROM t;
            COMMIT;
            

            Our example table consists of multiple B-tree pages in the clustered index. There also is a secondary index which contains entries (b,a). The secondary index pages and the non-leaf clustered index pages (other than the clustered index root page) will be unaffected by instant ADD COLUMN.

            Let us demonstrate a few instant ADD COLUMN operations:

            ALTER TABLE t ADD COLUMN d CHAR(10) NOT NULL DEFAULT '';
            # Above, only the leftmost leaf page was converted to include column d.
            UPDATE t SET c8='eight' WHERE a=8;
            # The UPDATE may convert the rightmost leaf page(s) to include column d.
            # Column d was not updated, so the page conversion is optional.
            ALTER TABLE t ADD COLUMN e INT NULL;
            # Again, the ALTER only converted the leftmost leaf page.
            UPDATE t SET e=a*b WHERE a=7;
            # The UPDATE will convert the affected leaf page records to include columns d,e.
            

            Finally, let us demonstrate a few limitations:

            # Indexes can be added to instantly added columns, but not instantly.
            ALTER TABLE t ADD INDEX(e);
            

            Changing the DEFAULT value of a column was an instant operation until now. However, changing the DEFAULT of an instantly added column will rebuild the table, no matter if the column is declared as NULL or NOT NULL. Likewise, DROP COLUMN of non-virtual columns will rebuild the table:

            ALTER TABLE t CHANGE d d CHAR(10) NOT NULL DEFAULT 'unknown';
            ALTER TABLE t CHANGE e e INT NULL DEFAULT 10;
            # Dropping any non-virtual columns will rebuild the table.
            ALTER TABLE t DROP COLUMN c1, DROP COLUMN d, DROP COLUMN e;
            DROP TABLE t;
            

            Maybe we should introduce ALGORITHM=NOCOPY so that time-consuming table-rebuilding operations can be refused.

            marko Marko Mäkelä added a comment - - edited Page-level format tracking instead of row-level Tencent has implemented instant ADD COLUMN by introducing new ROW_FORMAT variants that explicitly store the number of fields in index records. With the original InnoDB format (ROW_FORMAT=REDUNDANT) we could have something like that as is, because that format stores the number of fields. However, there is an alternative that is compatible with old data files that have been created in any ROW_FORMAT. Instead of tracking the number of columns for each record, we would track it for each page. Implementation of the instant ADD COLUMN operation As suggested in MDEV-11424 for more general instant ALTER, we can repurpose the unused field PAGE_MAX_TRX_ID in clustered index leaf pages. All InnoDB versions at least since MySQL 3.23.49 wrote this field as 0. We would only support ADD COLUMN with the following restrictions: Only ADD COLUMN…LAST. Any DROP COLUMN or change of column order will require a full table rebuild. Only constant DEFAULT values. Changing the DEFAULT of instantly-added columns will require a full table rebuild. A new field PAGE_CLUST_LEAF_FIELDS will be aliased to PAGE_MAX_TRX_ID and defined as follows: On non-leaf pages, there is no PAGE_CLUST_LEAF_FIELDS. (In clustered index non-leaf pages other than the root, the PAGE_MAX_TRX_ID will remain unused (0). For the clustered index root page, MDEV-6076 introduced PAGE_ROOT_AUTO_INC.) If the clustered index consists of a single page (root and leaf), PAGE_CLUST_LEAF_FIELDS does not exist, and the records must be in the newest format. ( MDEV-6076 repurposed the PAGE_MAX_TRX_ID of the clustered index root page as PAGE_ROOT_AUTO_INC.) If the PAGE_CLUST_LEAF_FIELDS of the leftmost leaf page is 0, all PAGE_CLUST_LEAF_FIELDS must be 0. (These are tables where instant ALTER TABLE ( MDEV-11369 or MDEV-11424 ) has not been used, or the table has been rebuilt since then.) In the leftmost leaf page, PAGE_CLUST_LEAF_FIELDS contains the original number of fields in the clustered index, before any instant ADD COLUMN operations. This field will be cached in dict_table_t::clust_leaf_fields. The records in the leftmost leaf page must always have the latest number of columns. In other leaf pages of the clustered index, if PAGE_CLUST_LEAF_FIELDS is 0, the number of fields in the records will be dict_table_t::clust_leaf_fields. Else, PAGE_CLUST_LEAF_FIELDS is the number of fields in the records, and it must be between index->table->clust_leaf_fields+1 and index->n_fields. Only the PAGE_CLUST_LEAF_FIELDS values 1 to 1023 will be reserved for this purpose. (InnoDB does not support indexes with more fields.) The values 1024 to (1<<64)-1 could be used by MDEV-11424 later. From the above it follows that instant ADD COLUMN must convert the leftmost leaf page or the root-leaf page to the newest format. Unless the conversion results into a page split, only a single data page will be modified. The data dictionary (both the .frm file and the InnoDB SYS_ tables) will contain the latest table definition. The number of clustered index fields before the first instant ADD COLUMN is only stored in dict_table_t::clust_leaf_fields the PAGE_CLUST_LEAF_FIELDS field of the leftmost leaf page. If multiple ALTER TABLE…ADD COLUMN statements are executed on the same table, the dict_table_t::clust_leaf_fields will not change. When determining whether changing the DEFAULT value of some columns requires a table rebuild, InnoDB must determine if the clustered index position of any of the affected columns is dict_table_t::clust_leaf_fields or greater. If table->clust_leaf_fields == 0, instant ADD COLUMN has not been used, and any DEFAULT values can be changed instantly. Data format changes The format of secondary index pages will be unaffected by instant ALTER TABLE. The clustered index leaf page format is affected by the instant ADD COLUMN, by the introduction of the PAGE_CLUST_LEAF_FIELDS (repurposed from PAGE_MAX_TRX_ID). Non-leaf, non-root clustered index pages will be unaffected. The clustered index root page format will have to be slightly changed to avoid an unintended change to the node pointer records when ROW_FORMAT is COMPACT, DYNAMIC or COMPRESSED. There is no problem with ROW_FORMAT=REDUNDANT. In clustered index node pointer records, all fields will be NOT NULL, because the PRIMARY KEY columns cannot be NULL. Nevertheless, the COMPACT, DYNAMIC and COMPRESSED clustered index node pointer record header will unnecessarily allocate 0‥128 zero bytes to represent NULL flags in the node pointers. The 0‥128 corresponds to UT_BITS_IN_BYTES(index->n_nullable). index->n_nullable can be 0‥1017 in the clustered index. So, we must find one byte in the clustered index root page to store the number of the redundant zero bytes in the node pointer record header. Luckily, only 3 bits of the 16-bit field PAGE_DIRECTION are used. We will repurpose the most significant byte of PAGE_DIRECTION in all COMPACT, DYNAMIC, REDUNDANT index root pages as follows: If the field is nonzero, it will be set to 1+UT_BITS_IN_BYTES(index->n_nullable), to be used when interpreting the node pointer records. (The byte was always set to 0 until now.) The value will also be cached in a new field dict_index_t::node_ptr_null_bytes (always 0 if ROW_FORMAT=REDUNDANT). This will not only solve the problem at hand, but it will also allow us to avoid wasting space on node pointer pages in the future. Changes to read operations When reading records from a page, we must use the current number of fields in the page. This will require some modifications to rec_get_offsets() or its callers, especially because UT_BITS_IN_BYTES(index->n_nullable) may change. When returning a record to the SQL layer, InnoDB has to fill in the DEFAULT values of any instantly-added columns that were missing from the leaf page. If an index is created on an instantly added column, InnoDB may need to access the DEFAULT values of instantly added columns. For this purpose, we probably should move row_prebuilt_t::default_rec to dict_table_t::default_rec. Changes to write operations Because we maintain the number of fields at the page level, all records within the clustered index leaf page must have the same number of fields. This means that whenever any record within the page is inserted, or a not-yet-instantiated instantly added column is updated, all records on the page must be converted. This could require the leaf page to be split into multiple pages. Page splits and merges will convert all records in all affected leaf pages to the newest format. Page reorganize does not need to convert records, and maybe it should, because the conversion could cause a page overflow. This could mean that page reorganize should memcpy() the individual records instead of invoking page_copy_rec_list_end_no_locks() which would by default use the latest index definition. It is optional to convert pages when records are purged (deleted) or the delete-mark bit is changed (for example, BEGIN; DELETE FROM instantly_altered_table; ROLLBACK). Example CREATE TABLE t( a INT NOT NULL, b INT NOT NULL, c1 CHAR(255) NOT NULL DEFAULT '', c2 CHAR(255) NOT NULL DEFAULT '', c3 CHAR(255) NOT NULL DEFAULT '', c4 CHAR(255) NOT NULL DEFAULT '', c5 CHAR(255) NOT NULL DEFAULT '', c6 CHAR(255) NOT NULL DEFAULT '', c7 CHAR(255) NOT NULL DEFAULT '', c8 CHAR(255) NOT NULL DEFAULT '', PRIMARY KEY(a,b), INDEX(b) ) ENGINE=InnoDB;   BEGIN; INSERT INTO t(a,b) VALUES(1,1),(1,2),(1,3),(1,4),(1,5); INSERT INTO t(a,b) SELECT a+1,b FROM t; INSERT INTO t(a,b) SELECT a+2,b FROM t; INSERT INTO t(a,b) SELECT a+4,b FROM t; COMMIT; Our example table consists of multiple B-tree pages in the clustered index. There also is a secondary index which contains entries (b,a). The secondary index pages and the non-leaf clustered index pages (other than the clustered index root page) will be unaffected by instant ADD COLUMN. Let us demonstrate a few instant ADD COLUMN operations: ALTER TABLE t ADD COLUMN d CHAR(10) NOT NULL DEFAULT ''; # Above, only the leftmost leaf page was converted to include column d. UPDATE t SET c8='eight' WHERE a=8; # The UPDATE may convert the rightmost leaf page(s) to include column d. # Column d was not updated, so the page conversion is optional. ALTER TABLE t ADD COLUMN e INT NULL; # Again, the ALTER only converted the leftmost leaf page. UPDATE t SET e=a*b WHERE a=7; # The UPDATE will convert the affected leaf page records to include columns d,e. Finally, let us demonstrate a few limitations: # Indexes can be added to instantly added columns, but not instantly. ALTER TABLE t ADD INDEX(e); Changing the DEFAULT value of a column was an instant operation until now. However, changing the DEFAULT of an instantly added column will rebuild the table, no matter if the column is declared as NULL or NOT NULL. Likewise, DROP COLUMN of non-virtual columns will rebuild the table: ALTER TABLE t CHANGE d d CHAR(10) NOT NULL DEFAULT 'unknown'; ALTER TABLE t CHANGE e e INT NULL DEFAULT 10; # Dropping any non-virtual columns will rebuild the table. ALTER TABLE t DROP COLUMN c1, DROP COLUMN d, DROP COLUMN e; DROP TABLE t; Maybe we should introduce ALGORITHM=NOCOPY so that time-consuming table-rebuilding operations can be refused.
            vinchen vinchen added a comment -

            Hi Marko,

            —
            >> In clustered index node pointer records, all fields will be NOT NULL, because the PRIMARY KEY columns cannot be NULL. Nevertheless, the COMPACT, DYNAMIC and COMPRESSED clustered index node pointer record header will unnecessarily allocate 0‥128 zero bytes to represent NULL flags in the node pointers. The 0‥128 corresponds to UT_BITS_IN_BYTES(index->n_nullable). index->n_nullable can be 0‥1017 in the clustered index.
            >> So, we must find one byte in the clustered index root page to store the number of the redundant zero bytes in the node pointer record header. Luckily, only 3 bits of the 16-bit field PAGE_DIRECTION are used.

            I think we don't need to change the "clustered index node pointer page"
            In our Tencent's instant add column implementation, we introduce dict_index_t::n_nullable_core, just mean that nullable column number of "the first dict_table_t::clust_leaf_fields in the cluster index".
            And it can calculate easily by dict_table_t::clust_leaf_fields and dict_index_t::n_fields when loading dictionary of one table.
            When the dict_table_t::clust_leaf_fields is zero, dict_index_t::n_nullable_core is always equal to dict_index_t::n_nullable.
            When the dict_table_t::clust_leaf_fields is nonzero, dict_index_t::n_nullable_core is always less than or equal to dict_index_t::n_nullable.
            For clustered index node pointer page, we should always use UT_BITS_IN_BYTES(index->n_nullable_core) to represent NULL flags in the node pointers(instead of UT_BITS_IN_BYTES(index->n_nullable)).
            When the dict_table_t::clust_leaf_fields is nonzero and PAGE_CLUST_LEAF_FIELDS is zero, rec_get_offsets() of cluster index leaf page should use UT_BITS_IN_BYTES(index->n_nullable_core) also.
            Whenever instant add column happened, index->n_nullable_core would be not change.

            —

            In other hand, we think that the leftmost page brings some complexity of page spliting, page merging or page rasing, and so on.
            It maybe brings logic procesing of the leftmost page in lots of code. (Maybe I am wrong)
            I think we should reuse the page FSP_DICT_HDR_PAGE_NO(when space_id != 0) to store the addition dictonary information for MDEV-11369 or MDEV-11424 or other.
            And for MDEV-11369, we should only storage the dict_table_t::clust_leaf_fields in the page.
            When loading one table(space_id > 0), we can get dict_table_t::clust_leaf_fields from page FSP_DICT_HDR_PAGE_NO, instead of the leafmost page.
            Therefore, the leftmost page is the same as other cluster index page.
            And there are more space to storage the addition dictionary information for other feature.

            But, when space_id = 0 in a table, the instant add column shoule be refused.
            However, the "innodb_file_per_table" is on defalut when mysql version >= 5.6.6.
            And in most of production environment, innodb_file_per_table is always on.

            I think that this constraint can be tolerated, and more common for the new feature (like MDEV-11424), and simplifies the implementation also.
            How do you think?

            vinchen vinchen added a comment - Hi Marko, — >> In clustered index node pointer records, all fields will be NOT NULL, because the PRIMARY KEY columns cannot be NULL. Nevertheless, the COMPACT, DYNAMIC and COMPRESSED clustered index node pointer record header will unnecessarily allocate 0‥128 zero bytes to represent NULL flags in the node pointers. The 0‥128 corresponds to UT_BITS_IN_BYTES(index->n_nullable). index->n_nullable can be 0‥1017 in the clustered index. >> So, we must find one byte in the clustered index root page to store the number of the redundant zero bytes in the node pointer record header. Luckily, only 3 bits of the 16-bit field PAGE_DIRECTION are used. I think we don't need to change the "clustered index node pointer page" In our Tencent's instant add column implementation, we introduce dict_index_t::n_nullable_core, just mean that nullable column number of "the first dict_table_t::clust_leaf_fields in the cluster index". And it can calculate easily by dict_table_t::clust_leaf_fields and dict_index_t::n_fields when loading dictionary of one table. When the dict_table_t::clust_leaf_fields is zero, dict_index_t::n_nullable_core is always equal to dict_index_t::n_nullable. When the dict_table_t::clust_leaf_fields is nonzero, dict_index_t::n_nullable_core is always less than or equal to dict_index_t::n_nullable. For clustered index node pointer page, we should always use UT_BITS_IN_BYTES(index->n_nullable_core) to represent NULL flags in the node pointers(instead of UT_BITS_IN_BYTES(index->n_nullable)). When the dict_table_t::clust_leaf_fields is nonzero and PAGE_CLUST_LEAF_FIELDS is zero, rec_get_offsets() of cluster index leaf page should use UT_BITS_IN_BYTES(index->n_nullable_core) also. Whenever instant add column happened, index->n_nullable_core would be not change. — In other hand, we think that the leftmost page brings some complexity of page spliting, page merging or page rasing, and so on. It maybe brings logic procesing of the leftmost page in lots of code. (Maybe I am wrong) I think we should reuse the page FSP_DICT_HDR_PAGE_NO(when space_id != 0) to store the addition dictonary information for MDEV-11369 or MDEV-11424 or other. And for MDEV-11369 , we should only storage the dict_table_t::clust_leaf_fields in the page. When loading one table(space_id > 0), we can get dict_table_t::clust_leaf_fields from page FSP_DICT_HDR_PAGE_NO, instead of the leafmost page. Therefore, the leftmost page is the same as other cluster index page. And there are more space to storage the addition dictionary information for other feature. But, when space_id = 0 in a table, the instant add column shoule be refused. However, the "innodb_file_per_table" is on defalut when mysql version >= 5.6.6. And in most of production environment, innodb_file_per_table is always on. I think that this constraint can be tolerated, and more common for the new feature (like MDEV-11424 ), and simplifies the implementation also. How do you think?
            marko Marko Mäkelä added a comment - - edited

            Hi Vinchen,
            Thank you for your feedback.
            I have a long-term goal of making the InnoDB data files self-descriptive, so that eventually we can remove the InnoDB data dictionary tables (MDEV-11655) and even the system tablespace (MDEV-11633).

            Because of this goal, I think that all data regarding the past history of ‘instant ADD COLUMN’ operations must be available in the data file itself.

            The data dictionary (both the .frm file and the SYS_* table records) would only keep the newest table definition. We would be unable to retrieve the original number of columns or fields from the data dictionary. That data should be stored in the index tree only. Any data that is not reported by SHOW CREATE TABLE and is only written to the SYS_* table records is working against the long-term goal set in MDEV-11655.

            So, the dict_table_t::clust_leaf_fields or n_nullable_core would have to be stored in some way in the clustered index itself.
            In my plan, the original number of fields (before any instant ADD COLUMN) would be stored in the leftmost leaf page of the clustered index. (You challenged that suggestion; let me address that separately.)

            The need for storing the original n_null_bytes

            How can we get to a leaf page from the clustered index root? We are unfortunately suffering from a two bad design decisions here. (1) Instead of storing the page number of the leftmost leaf at a fixed location, InnoDB allocates a large number of useless bytes (which are all ignored when the REC_MIN_REC_FLAG is set, and that flag is only set for the leftmost node pointer record at each level). (2) In ROW_FORMAT=COMPACT and later, we are unnecessarily allocating null flags for clustered index node pointer records.

            An instant ADD COLUMN operation cannot possibly rewrite all node pointer pages, because the operation would not be instant any more. (In my proposal above, it would write the root page and the leftmost leaf page.) So, when reading or writing clustered index node pointer records we must act as if no ADD COLUMN had been executed.

            Let us consider an example:

            CREATE TABLE t (
             pk VARCHAR(255) PRIMARY KEY,
             c0 INT, c1 INT, c2 INT, c3 INT, c4 INT, c5 INT, c6 INT, c7 INT
            ) ROW_FORMAT=DYNAMIC;
            INSERT INTO t(pk) VALUES(''),('1'),('2'),…,('1000');
            ALTER TABLE t ADD COLUMN c8 INT;
            UPDATE t SET c8=1;
            ALTER TABLE t ADD COLUMN c9 INT;
            

            In this example, the format of the node pointer records would be (pk,child_page_number). In ROW_FORMAT!=REDUNDANT, the actual length of the field pk is stored in the record header, before the null flags. (Yes, storing the null flags for the clustered index node pointer is a complete waste; no PRIMARY KEY column can be NULL.) In this case, originally there was exactly 1 byte allocated for the null flags for the 8 columns c1…c8, at byte offset -5. If there had been 9 nullable columns, there would be 2 bytes of NULL flags, at byte offset -5 and -6.
            Because we cannot afford to modify all node pointer pages in instant ADD COLUMN, we must interpret the node pointer pages in an appropriate way after the instant ADD COLUMN. That is, we must know how many bytes of null flags were written before any instant ADD COLUMN operation, and we must keep using the same number of bytes.

            By the time the first ALTER is executed, there should be multiple leaf pages. Let us say that we are interpreting a node pointer ('500',child_page_number). If this was the leftmost node pointer (REC_INFO_MIN_REC_FLAG is set), all comparisons would ignore the actual key because of the flag, but we still must know the length of the PK fields so that we can read the child_page_number. Otherwise, we must know the length of the PK fields already for the purpose of comparing to our search key.

            In ROW_FORMAT!=REDUNDANT, the lengths of variable-length fields are stored in the record header before the null flags. We must use the correct number of null flags, or otherwise we could read the length of the PK fields from the wrong place. For example, if we read the length from a null_flags byte which would be 0, we would incorrectly treat the length as 0, and then we would interpret the PK field contents as the child_page_number.

            I think that this example demonstrates that we do need some information in the root page that allows us to avoid rewriting all node pointer pages in instant ADD COLUMN operations. The simplest fix that I was able to come up with was to repurpose the high-order 8 bits of the PAGE_DIRECTION field.

            Where to store the original number of columns?

            My proposal makes the leftmost leaf page special, because it would store the original number of clustered index fields, that is, the number of fields before any instant ADD COLUMN operations, in the repurposed PAGE_MAX_TRX_ID field.

            Vinchen suggests that this could complicate code that deals with splitting or merging pages or adding or removing tree levels. I think that we would only know the answer after prototyping it. We could introduce a simple predicate to make code readable:

            #define page_is_left_leaf(page) \
            page_is_leaf(page) && *static_cast<uint32_t*>(page + FIL_PAGE_PREV) == FIL_NULL
            

            If we indeed store the dict_table_t::clust_leaf_fields somewhere else than the leftmost leaf page, all leaf pages would be treated equal, and instant ADD COLUMN would never have to convert any leaf page, except when there is a single leaf page (which is also a root page). On any leaf page, the PAGE_CLUST_LEAF_FIELDS would either be the number of fields stored in the page, or if it is 0, it would refer to dict_table_t::clust_leaf_fields. We could also store the number of null flags in this page, instead of repurposing PAGE_DIRECTION.

            Vinchen’s proposal for off-index storage is FSP_DICT_HDR_PAGE_NO. Currently, this is the DICT_HDR page that currently only exists in the system tablespace (page number 7 in tablespace 0). We cannot use a fixed page number, because this scheme should be compatible with old data files. But we could allocate a new page and store its number in one of the first pages of the tablespace. Oracle did something similar in MySQL 8.0.0 to reserve space for ‘serialized dictionary information’ in WL#7053. Like Vinchen pointed out, it would be challenging to make this work with the system tablespace (or shared tablespaces in general).

            I agree that it is a good long-term goal to store low-level dictionary information in the data file itself. Something like that is definitely needed for more generic instant ALTER operations (MDEV-11424) and also for the transactional data dictionary (MDEV-11655; at least we should store the secondary index root page numbers).

            However, I would prefer not to introduce a new subformat at this point of time, because we would have to support the format in all future versions. Once we get around to designing the details of MDEV-11424 or MDEV-11655, we should have a better idea of what the format should look like. If it turns out that the page splits and merges indeed get very messy, we may of course revisit this decision.

            marko Marko Mäkelä added a comment - - edited Hi Vinchen, Thank you for your feedback. I have a long-term goal of making the InnoDB data files self-descriptive, so that eventually we can remove the InnoDB data dictionary tables ( MDEV-11655 ) and even the system tablespace ( MDEV-11633 ). Because of this goal, I think that all data regarding the past history of ‘instant ADD COLUMN’ operations must be available in the data file itself. The data dictionary (both the .frm file and the SYS_* table records) would only keep the newest table definition. We would be unable to retrieve the original number of columns or fields from the data dictionary. That data should be stored in the index tree only. Any data that is not reported by SHOW CREATE TABLE and is only written to the SYS_* table records is working against the long-term goal set in MDEV-11655 . So, the dict_table_t::clust_leaf_fields or n_nullable_core would have to be stored in some way in the clustered index itself. In my plan, the original number of fields (before any instant ADD COLUMN) would be stored in the leftmost leaf page of the clustered index. (You challenged that suggestion; let me address that separately.) The need for storing the original n_null_bytes How can we get to a leaf page from the clustered index root? We are unfortunately suffering from a two bad design decisions here. (1) Instead of storing the page number of the leftmost leaf at a fixed location, InnoDB allocates a large number of useless bytes (which are all ignored when the REC_MIN_REC_FLAG is set, and that flag is only set for the leftmost node pointer record at each level). (2) In ROW_FORMAT=COMPACT and later, we are unnecessarily allocating null flags for clustered index node pointer records. An instant ADD COLUMN operation cannot possibly rewrite all node pointer pages, because the operation would not be instant any more. (In my proposal above, it would write the root page and the leftmost leaf page.) So, when reading or writing clustered index node pointer records we must act as if no ADD COLUMN had been executed. Let us consider an example: CREATE TABLE t ( pk VARCHAR(255) PRIMARY KEY, c0 INT, c1 INT, c2 INT, c3 INT, c4 INT, c5 INT, c6 INT, c7 INT ) ROW_FORMAT=DYNAMIC; INSERT INTO t(pk) VALUES(''),('1'),('2'),…,('1000'); ALTER TABLE t ADD COLUMN c8 INT; UPDATE t SET c8=1; ALTER TABLE t ADD COLUMN c9 INT; In this example, the format of the node pointer records would be (pk,child_page_number). In ROW_FORMAT!=REDUNDANT, the actual length of the field pk is stored in the record header, before the null flags. (Yes, storing the null flags for the clustered index node pointer is a complete waste; no PRIMARY KEY column can be NULL.) In this case, originally there was exactly 1 byte allocated for the null flags for the 8 columns c1…c8, at byte offset -5. If there had been 9 nullable columns, there would be 2 bytes of NULL flags, at byte offset -5 and -6. Because we cannot afford to modify all node pointer pages in instant ADD COLUMN, we must interpret the node pointer pages in an appropriate way after the instant ADD COLUMN. That is, we must know how many bytes of null flags were written before any instant ADD COLUMN operation, and we must keep using the same number of bytes. By the time the first ALTER is executed, there should be multiple leaf pages. Let us say that we are interpreting a node pointer ('500',child_page_number). If this was the leftmost node pointer (REC_INFO_MIN_REC_FLAG is set), all comparisons would ignore the actual key because of the flag, but we still must know the length of the PK fields so that we can read the child_page_number. Otherwise, we must know the length of the PK fields already for the purpose of comparing to our search key. In ROW_FORMAT!=REDUNDANT, the lengths of variable-length fields are stored in the record header before the null flags. We must use the correct number of null flags, or otherwise we could read the length of the PK fields from the wrong place. For example, if we read the length from a null_flags byte which would be 0, we would incorrectly treat the length as 0, and then we would interpret the PK field contents as the child_page_number. I think that this example demonstrates that we do need some information in the root page that allows us to avoid rewriting all node pointer pages in instant ADD COLUMN operations. The simplest fix that I was able to come up with was to repurpose the high-order 8 bits of the PAGE_DIRECTION field. Where to store the original number of columns? My proposal makes the leftmost leaf page special, because it would store the original number of clustered index fields, that is, the number of fields before any instant ADD COLUMN operations, in the repurposed PAGE_MAX_TRX_ID field. Vinchen suggests that this could complicate code that deals with splitting or merging pages or adding or removing tree levels. I think that we would only know the answer after prototyping it. We could introduce a simple predicate to make code readable: #define page_is_left_leaf(page) \ page_is_leaf(page) && *static_cast<uint32_t*>(page + FIL_PAGE_PREV) == FIL_NULL If we indeed store the dict_table_t::clust_leaf_fields somewhere else than the leftmost leaf page, all leaf pages would be treated equal, and instant ADD COLUMN would never have to convert any leaf page, except when there is a single leaf page (which is also a root page). On any leaf page, the PAGE_CLUST_LEAF_FIELDS would either be the number of fields stored in the page, or if it is 0, it would refer to dict_table_t::clust_leaf_fields. We could also store the number of null flags in this page, instead of repurposing PAGE_DIRECTION. Vinchen’s proposal for off-index storage is FSP_DICT_HDR_PAGE_NO. Currently, this is the DICT_HDR page that currently only exists in the system tablespace (page number 7 in tablespace 0). We cannot use a fixed page number, because this scheme should be compatible with old data files. But we could allocate a new page and store its number in one of the first pages of the tablespace. Oracle did something similar in MySQL 8.0.0 to reserve space for ‘serialized dictionary information’ in WL#7053 . Like Vinchen pointed out, it would be challenging to make this work with the system tablespace (or shared tablespaces in general). I agree that it is a good long-term goal to store low-level dictionary information in the data file itself. Something like that is definitely needed for more generic instant ALTER operations ( MDEV-11424 ) and also for the transactional data dictionary ( MDEV-11655 ; at least we should store the secondary index root page numbers). However, I would prefer not to introduce a new subformat at this point of time, because we would have to support the format in all future versions. Once we get around to designing the details of MDEV-11424 or MDEV-11655 , we should have a better idea of what the format should look like. If it turns out that the page splits and merges indeed get very messy, we may of course revisit this decision.

            As noted in the MDEV-12123 fix, after IMPORT TABLESPACE all the clustered index pages may contain a (bogus) transaction ID in PAGE_MAX_TRX_ID. We can detect this to some extent. While seeking to the leftmost leaf page, we can check if the PAGE_MAX_TRX_ID is nonzero on any of the non-leaf pages, or if the ID on the root page matches the one on the leaf page.
            There would appear to be a problem if the clustered index tree height is only 1. In this case, there are no other node pointer pages than the root page, where PAGE_MAX_TRX_ID was repurposed as PAGE_ROOT_AUTO_INC. Perhaps in this case, we can check all leaf pages if they are carrying the same PAGE_MAX_TRX_ID value, and then decide that this is an old imported table, not one where instant ALTER TABLE…ADD COLUMN was used.

            marko Marko Mäkelä added a comment - As noted in the MDEV-12123 fix , after IMPORT TABLESPACE all the clustered index pages may contain a (bogus) transaction ID in PAGE_MAX_TRX_ID. We can detect this to some extent. While seeking to the leftmost leaf page, we can check if the PAGE_MAX_TRX_ID is nonzero on any of the non-leaf pages, or if the ID on the root page matches the one on the leaf page. There would appear to be a problem if the clustered index tree height is only 1. In this case, there are no other node pointer pages than the root page, where PAGE_MAX_TRX_ID was repurposed as PAGE_ROOT_AUTO_INC. Perhaps in this case, we can check all leaf pages if they are carrying the same PAGE_MAX_TRX_ID value, and then decide that this is an old imported table, not one where instant ALTER TABLE…ADD COLUMN was used.
            marko Marko Mäkelä added a comment - - edited

            Hi vinchen, it was a pleasure for me to see your live presentation of a revised idea that uses per-record tagging while maintaining backward compatibility with old data files.

            The following detailed design proposal is based on what we agreed upon today. Some consistency checks and tests could be omitted from the first prototype. Also support for ROW_FORMAT=REDUNDANT can be omitted from the first prototype. I would like to have good test coverage for this. I can try to help with tests if needed.

            ALTER TABLE syntax changes

            It would be nice to introduce new syntax to prevent nasty surprises. When an operation is expected to be quick, it could be better to return an error than to perform a disruptive (resource-intensive and time-consuming) operation.

            ALGORITHM=INSTANT will refuse any operation that must modify any data files.

            ALGORITHM=NOCOPY will refuse any operation that would rebuild the clustered index (and the whole table).

            With respect to the allowed operations, ALGORITHM=INSTANT is a subset of ALGORITHM=NOCOPY which is a subset of ALGORITHM=INPLACE.

            Example:

            # No data file change (instant operation)
            ALTER TABLE t ADD COLUMN b INT, ALGORITHM=INSTANT;
            # The following will change data files (ADD INDEX), but not rebuild the table:
            ALTER TABLE t ADD COLUMN c INT, ADD INDEX(c), ALGORITHM=NOCOPY;
            # The following are changing data files (not instant operation)
            --error ER_ALTER_OPERATION_NOT_SUPPORTED_REASON
            ALTER TABLE t DROP INDEX c, ALGORITHM=INSTANT;
            --error ER_ALTER_OPERATION_NOT_SUPPORTED_REASON
            ALTER TABLE t ADD COLUMN d INT, ADD INDEX(d), ALGORITHM=INSTANT;
            # All of the above will be allowed with any other ALGORITHM.
            # With ALGORITHM=DEFAULT or with no ALGORITHM, the most efficient
            # available algorithm will be used.
            

            No new ROW_FORMAT keywords will be introduced. Instead, ROW_FORMAT=REDUNDANT, ROW_FORMAT=COMPACT and ROW_FORMAT=DYNAMIC will be extended at the file level.

            Limitations

            Limitations on ALTER TABLE

            Instant ADD COLUMN will not be supported on ROW_FORMAT=COMPRESSED tables. (ALTER TABLE…ADD COLUMN…ALGORITHM=NOCOPY must return an error; the operation can be executed with ALGORITHM=INPLACE or ALGORITHM=COPY just like before.)

            Instant ADD COLUMN will not allow non-constant DEFAULT values, not as part of the ADD COLUMN operation nor later on columns that were created by instant ADD COLUMN. If such DEFAULT value is used, then ALGORITHM=INSTANT or ALGORITHM=NOCOPY will return an error, but ALGORITHM=INPLACE or ALGORITHM=COPY will rebuild the table.

            Instant ADD COLUMN is only allowed when the column is added last. Adding columns between or before existing columns will be refused by ALGORITHM=INSTANT or ALGORITHM=NOCOPY, but will be executed by ALGORITHM=INPLACE or ALGORITHM=COPY just like before.

            Limitations on IMPORT and EXPORT

            ALTER TABLE…IMPORT TABLESPACE is refused after instant ADD COLUMN was used on the table. (You must do TRUNCATE TABLE and ALTER TABLE…FORCE, or just DROP TABLE and CREATE TABLE before ALTER TABLE…DISCARD TABLESPACE and ALTER TABLE…IMPORT TABLESPACE.)

            ALTER TABLE…IMPORT TABLESPACE is refused if the clustered index root page of the to-be-imported data file contains a flag indicating that instant ADD COLUMN was executed.

            FLUSH TABLES FOR EXPORT must return an error or at least a warning if instant ADD COLUMN has been used. If you want to export tables, use ALTER TABLE…FORCE first to convert it to compatible format. (This is only to prevent a surprise when the copied .ibd file would cause an error later on IMPORT TABLESPACE.)

            Data dictionary changes

            For now, we will not care about the future goal of removing the InnoDB system tables. If/when a future version of MariaDB implements that, it will perform a conversion step. This conversion step could repurpose fields in index pages in some way that we do not need to account for now. We will not need to repurpose PAGE_MAX_TRX_ID in any way now.

            Just like now, *.frm files will store the latest table definition, including the latest DEFAULT values and the latest number of columns.

            Just like now, SYS_TABLES and SYS_COLUMNS will store all column definitions and the clustered index field definitions.

            (SYS_FIELDS entries for the SYS_INDEXES entry of the clustered index only reflect the PRIMARY KEY columns, so instant ADD COLUMN would not modify SYS_FIELDS.)

            A new InnoDB data dictionary table is added for identifying instantly added columns and for storing the initial DEFAULT values of these columns. This table would be created using the InnoDB SQL interpreter at startup, just like SYS_FOREIGN and other tables. Note that this will require the addition of a BIGINT literal to the InnoDB SQL parser, because INT is only 32 bits, and SYS_TABLES.TABLE_ID is 64 bits:

            PROCEDURE CREATE_SYS_COLUMNS_ADDED_PROC () IS
            BEGIN
            CREATE TABLE SYS_COLUMNS_ADDED(
               TABLE_ID BIGINT UNSIGNED NOT NULL,
               POS INT NOT NULL,
               DEFAULT_VALUE CHAR);
            CREATE UNIQUE CLUSTERED INDEX COL_IND ON SYS_FOREIGN (TABLE_ID, POS);
            END;
            

            The view INFORMATION_SCHEMA.INNODB_SYS_COLUMNS_ADDED will be created, to allow an easy way of checking which tables require a conversion (ALTER TABLE…FORCE) for moving to MySQL or older versions of MariaDB. Example:

            SELECT name FROM INFORMATION_SCHEMA.INNODB_SYS_TABLES
            WHERE table_id IN (SELECT table_id FROM INFORMATION_SCHEMA.INNODB_SYS_COLUMNS_ADDED);
            

            If any record with SYS_COLUMNS_ADDED.TABLE_ID matches the table definition that is being loaded from SYS_TABLES and SYS_COLUMNS, then the values of SYS_COLUMNS_ADDED.POS must be in strict sequential order (no gaps allowed) and run up to the number of columns in SYS_COLUMNS.

            The value SYS_COLUMNS_ADDED.DEFAULT_VALUE = NULL is not allowed if the column is declared NOT NULL.

            When SYS_COLUMNS_ADDED.DEFAULT_VALUE is not NULL, it must match the allowed length and value of the column (it must match the fixed length of the column declared in SYS_COLUMNS, or it must not exceed the maximum length of a variable-length column.

            The code that loads the table definition must report that the table is corrupted if any inconsistency between SYS_COLUMNS and SYS_COLUMNS_ADDED is detected.

            Example of the InnoDB data dictionary changes

            CREATE TABLE t1(a INT PRIMARY KEY, b INT) ENGINE=InnoDB ROW_FORMAT=REDUNDANT;
            INSERT INTO t1 SET a=1;
            # instantly add columns
            ALTER TABLE t1 ADD COLUMN c INT, d CHAR(10) NULL DEFAULT 'foo', e INT NOT NULL DEFAULT 42;
            INSERT INTO t1 SET a=2;
            # change some DEFAULT values of old columns, and add one column
            ALTER TABLE t1 CHANGE COLUMN b b INT DEFAULT 5,
            CHANGE COLUMN c c INT DEFAULT 10,
            CHANGE COLUMN d d DEFAULT NULL, ADD COLUMN f INT UNSIGNED DEFAULT 0;
            INSERT INTO t1 SET a=3;
            # change DEFAULTs in .frm file only (nothing inside InnoDB)
            ALTER TABLE t1 CHANGE COLUMN a a INT 101, COLUMN b b INT DEFAULT 102,
            CHANGE COLUMN c c INT DEFAULT 103,
            CHANGE COLUMN d d DEFAULT 'eleventy',
            CHANGE COLUMN e e INT UNSIGNED DEFAULT 105,
            CHANGE COLUMN e e INT UNSIGNED DEFAULT 106;
            SELECT * FROM t1;
            # must return (1,NULL,NULL,'foo',42,0),(2,NULL,NULL,'foo',42,0),(3,5,10,NULL,0)
            

            (For simplicity, the ALTER TABLE statements in the example are not renaming the columns while changing defaults. Such a combination must be supported.)

            After the CREATE TABLE, we will have 2 SYS_COLUMNS entries for the table and no SYS_COLUMNS_ADDED entries.

            After the first INSERT, the clustered index will contain the following record:
            (a,DB_TRX_ID,DB_ROLL_PTR,b)=(1,…,…,NULL). The table format is still compatible with MySQL or older MariaDB versions.

            After the first ALTER TABLE, we would have 5 entries in SYS_COLUMNS and 3 entries in SYS_COLUMNS_ADDED for the user table. The SYS_COLUMNS_ADDED entries would be: (table_id,2,NULL),(table_id,3,'foo'),(table_id,4,'\200\0\0*'). The columns in SYS_COLUMNS_ADDED are numbered from 0 onwards, so the first entry will have SYS_COLUMNS_ADDED.POS equal to the number of ‘core columns’. Initially, a table must always be created with at least 1 column. The last value is the binary value corresponding to the internal representation of the INT value 42 in InnoDB.

            After the second INSERT, the latest clustered index definition will be
            (a,DB_TRX_ID,DB_ROLL_PTR,b,c,d,e) and the index will contain the following:
            (1,…,…,NULL),(2,…,…,NULL,NULL,'foo',42).

            After the second ALTER TABLE, we will not update the existing SYS_COLUMNS_ADDED entries to reflect the changed DEFAULT values. They will keep the initial DEFAULT values that were in effect during the instant ADD COLUMN. The .frm file will contain the latest DEFAULT value. Inside InnoDB, the only changes will reflect the instant ADD COLUMN e: a SYS_COLUMNS record and the SYS_COLUMNS_ADDED record (table_id,5,'\0\0\0\0').

            After the third INSERT, the latest clustered index definition will be
            (a,DB_TRX_ID,DB_ROLL_PTR,b,c,d,e,f) and the index will contain the following:
            (1,…,…,NULL),(2,…,…,NULL,NULL,'foo',42),(3,…,…,5,10,NULL,0).

            The last ALTER does nothing inside InnoDB. It is only changing the DEFAULT values of already existing columns. Those are only stored in the .frm file. The SYS_COLUMNS_ADDED table only stores the initial DEFAULT values of instantly added columns.

            The SELECT will fill in the initial DEFAULT values of the columns that are missing from the index entries. That is, for the first record, we will fill in all defaults that are stored in SYS_COLUMNS_ADDED: (c=NULL,d='foo',e=42,f=0). The second and third INSERT stored the current DEFAULT values, not the initial ones. So, for the second record, we will fill in (f=0).

            InnoDB file format changes

            Instant ADD COLUMN will not be supported for ROW_FORMAT=COMPRESSED. It will be supported for all other ROW_FORMAT (REDUNDANT, COMPACT, DYNAMIC).

            We will not introduce new ROW_FORMAT names, but instead refine the binary format in the InnoDB data files.

            Feature flag in the clustered index root page

            If instant ADD COLUMN has been used, the following flag will be set in the clustered index root page:

            /** Determine if instant ADD COLUMN was used.
            This flag is only supposed to be set on the clustered index root page.
            @param[in]	page	index page
            @return	whether instant ADD COLUMN was used
            (should be false if the page is not a clustered index root page) */
            inline
            bool
            page_is_instant(const page_t* page)
            {
                     return(page[PAGE_HEADER + PAGE_N_HEAP] & 0x40);
            }
            

            It is an error (data corruption) if page_is_instant(clust_root) does not reflect the existence of SYS_COLUMNS_ADDED records for the table.
            At the same time, the functions page_dir_set_n_heap() and page_dir_get_n_heap() must be revised to mask the 2 most significant bits of PAGE_N_HEAP. Previously, only one bit (indicating ROW_FORMAT!=REDUNDANT) was masked, starting with MySQL 5.0.3 which introduced ROW_FORMAT=COMPACT.

            It is an error (data corruption) if page_is_instant(page) holds on any other index page than a clustered index root page. The corruption must be reported by CHECK TABLE and by btr_assert_not_corrupted().

            (Why use the second-most-significant bit in PAGE_N_HEAP? The most significant bit identifies ROW_FORMAT!=REDUNDANT. The second-most-significant bit is always 0 so far, because there can never be more than 16383 records in a page. The largest possible seems to be innodb_page_size/(5+2) which is 65536/7=9362. Yes, the minimum record size would be 5+1 bytes for a secondary index leaf page record, say CREATE TABLE(a CHAR(1) PRIMARY KEY, INDEX(a)), but you cannot have more than 256 distinct values for 1-byte payload. Even with 65536/6 we get 10922, which is less than 16384.)

            Record format changes

            Any data corruption must be detected by CHECK TABLE and by btr_index_rec_validate() and similar functions.

            After instant ADD COLUMN on a ROW_FORMAT=COMPACT or ROW_FORMAT=DYNAMIC table, INSERT or UPDATE will set a new flag REC_INFO_ADDED_FLAG=0x80 in info_bits and explicitly store the number of ‘non-core’ fields in the record header, immediately before the REC_N_NEW_EXTRA_BYTES, using a variable-length encoding of 2 bytes, something like this:

            	const byte* nulls = rec - REC_N_NEW_EXTRA_BYTES;
            	ulint n_nullable = index->n_nullable;
            	if (rec[-REC_NEW_INFO_BITS] & REC_INFO_ADDED_FLAG) {
            		/* This must be ROW_FORMAT=COMPACT or ROW_FORMAT=DYNAMIC.
            		ROW_FORMAT=COMPRESSED pages cannot even store this flag! */
            		ut_ad(page_rec_is_comp(rec));
            		ut_ad(!DICT_TF_GET_ZIP_SSIZE(index->table->flags));
            		ulint n = *nulls--;
            		if (n >= 0x80) {
            			n = (n & 0x7f) << 8 | *nulls--;
            		}
            		n_fields = n_core_fields + n + 1;
            		if (n_fields >= index->n_fields) ...; // error
            		n_nullable += ...;
            	}
            	const byte* lens = nulls - UT_BITS_IN_BYTES(n_nullable);
            

            It is an error (data corruption) if REC_INFO_ADDED_FLAG is set in a ROW_FORMAT=REDUNDANT table or in a node pointer record or in secondary indexes, or when instant ADD COLUMN has not been used.

            On ROW_FORMAT=REDUNDANT pages, the records already explicitly contain the number of fields.

            It is an error (data corruption) if a record is marked as containing more fields than the table definition implies, or if a stored field does not match the table definition.

            It is an error (data corruption) if the n_fields in a ROW_FORMAT=REDUNDANT record is less than the number of index record fields derived from the number of core columns.

            When REC_INFO_ADDED_FLAG is set (COMPACT or DYNAMIC), the above-mentioned variable-length encoding of n_fields is guaranteed by design to be more than the number of ‘core fields’.

            It is an error (data corruption) if instant ADD COLUMN has not been used, but the n_fields in a ROW_FORMAT=REDUNDANT clustered index definition does not match the clustered index definition.

            It is an error (data corruption) if the n_fields in a ROW_FORMAT=REDUNDANT node pointer record or secondary index record does not match the index definition.

            The n_nullable flags in the clustered index node pointer records in ROW_FORMAT=COMPACT or ROW_FORMAT=DYNAMIC must reflect the number of ‘core columns’. These flags are useless garbage, and they are only reserved because of file format compatibility. (Clustered index node pointer records only contain the PRIMARY KEY columns, which are always NOT NULL, so we should have used n_nullable=0.) Because the size of n_nullable is not explicitly stored in the page and because we do not want to change the format of node pointer records or pages, this is our only choice.

            The size of the n_nullable flags in the ROW_FORMAT=COMPACT or ROW_FORMAT=DYNAMIC clustered index leaf page records must correspond to the number of columns and the stored n_fields in the record. (More bytes may be allocated for new records after ALTER TABLE…ADD COLUMN…NULL.)

            Redo log format changes

            Because the InnoDB redo log format (somewhat unnecessarily) is a mix of physical and logical operations, some change is needed to the way how INSERT and UPDATE operations are redo logged for ROW_FORMAT=COMPACT or ROW_FORMAT=DYNAMIC tables. Some ideas of simplifying the redo log format (and speeding up recovery) are listed in MDEV-12353.

            It is possible that actual changes to the redo log format (such as the function mlog_open_and_write_index()) are unavoidable.

            Even if the redo log format is not changed, it would seem safest to try to refuse startup with an older server version. We can do this by introducing a new redo log format identifier, similar to what was done in MDEV-11782.

            While we could theoretically retain support for crash recovery from old-format redo log, it is probably best to not try that. (The GA version of MySQL 5.7 or MariaDB 10.2 will refuse crash recovery from older server versions, and MySQL 8.0.0 will refuse crash recovery from MySQL 5.7 or older versions. It is acceptable that MariaDB 10.3 would not support crash recovery from 10.2.)

            Changes to DML operations

            If instant ADD COLUMN has not been used, the file format will not be changed. This means that any table-rebuilding ALTER operation (such as ALTER TABLE…FORCE or ALTER TABLE…ALGORITHM=COPY) will convert the table to the old format.

            If instant ADD COLUMN has been used, any INSERT or UPDATE to the clustered index leaf page must store the current number of columns in the record. (For ROW_FORMAT=COMPACT or ROW_FORMAT=DYNAMIC we must also set REC_INFO_ADDED_FLAG to signal the presence of the n_fields in the record header.)

            The REC_INFO_DELETED_FLAG of a record can be set or unset without changing other parts of the record. Usually this is done by a DELETE or an UPDATE of a PRIMARY KEY column, or the ROLLBACK of those operations.

            On SELECT (also as part of UPDATE), if the clustered index leaf page record is found to contain fewer index fields than the current table definition implies, the missing fields will be filled in from SYS_COLUMNS_ADDED.DEFAULT_VALUE.

            marko Marko Mäkelä added a comment - - edited Hi vinchen , it was a pleasure for me to see your live presentation of a revised idea that uses per-record tagging while maintaining backward compatibility with old data files. The following detailed design proposal is based on what we agreed upon today. Some consistency checks and tests could be omitted from the first prototype. Also support for ROW_FORMAT=REDUNDANT can be omitted from the first prototype. I would like to have good test coverage for this. I can try to help with tests if needed. ALTER TABLE syntax changes It would be nice to introduce new syntax to prevent nasty surprises. When an operation is expected to be quick, it could be better to return an error than to perform a disruptive (resource-intensive and time-consuming) operation. ALGORITHM=INSTANT will refuse any operation that must modify any data files. ALGORITHM=NOCOPY will refuse any operation that would rebuild the clustered index (and the whole table). With respect to the allowed operations, ALGORITHM=INSTANT is a subset of ALGORITHM=NOCOPY which is a subset of ALGORITHM=INPLACE. Example: # No data file change (instant operation) ALTER TABLE t ADD COLUMN b INT, ALGORITHM=INSTANT; # The following will change data files (ADD INDEX), but not rebuild the table: ALTER TABLE t ADD COLUMN c INT, ADD INDEX(c), ALGORITHM=NOCOPY; # The following are changing data files (not instant operation) --error ER_ALTER_OPERATION_NOT_SUPPORTED_REASON ALTER TABLE t DROP INDEX c, ALGORITHM=INSTANT; --error ER_ALTER_OPERATION_NOT_SUPPORTED_REASON ALTER TABLE t ADD COLUMN d INT, ADD INDEX(d), ALGORITHM=INSTANT; # All of the above will be allowed with any other ALGORITHM. # With ALGORITHM=DEFAULT or with no ALGORITHM, the most efficient # available algorithm will be used. No new ROW_FORMAT keywords will be introduced. Instead, ROW_FORMAT=REDUNDANT, ROW_FORMAT=COMPACT and ROW_FORMAT=DYNAMIC will be extended at the file level. Limitations Limitations on ALTER TABLE Instant ADD COLUMN will not be supported on ROW_FORMAT=COMPRESSED tables. (ALTER TABLE…ADD COLUMN…ALGORITHM=NOCOPY must return an error; the operation can be executed with ALGORITHM=INPLACE or ALGORITHM=COPY just like before.) Instant ADD COLUMN will not allow non-constant DEFAULT values, not as part of the ADD COLUMN operation nor later on columns that were created by instant ADD COLUMN. If such DEFAULT value is used, then ALGORITHM=INSTANT or ALGORITHM=NOCOPY will return an error, but ALGORITHM=INPLACE or ALGORITHM=COPY will rebuild the table. Instant ADD COLUMN is only allowed when the column is added last. Adding columns between or before existing columns will be refused by ALGORITHM=INSTANT or ALGORITHM=NOCOPY, but will be executed by ALGORITHM=INPLACE or ALGORITHM=COPY just like before. Limitations on IMPORT and EXPORT ALTER TABLE…IMPORT TABLESPACE is refused after instant ADD COLUMN was used on the table. (You must do TRUNCATE TABLE and ALTER TABLE…FORCE, or just DROP TABLE and CREATE TABLE before ALTER TABLE…DISCARD TABLESPACE and ALTER TABLE…IMPORT TABLESPACE.) ALTER TABLE…IMPORT TABLESPACE is refused if the clustered index root page of the to-be-imported data file contains a flag indicating that instant ADD COLUMN was executed. FLUSH TABLES FOR EXPORT must return an error or at least a warning if instant ADD COLUMN has been used. If you want to export tables, use ALTER TABLE…FORCE first to convert it to compatible format. (This is only to prevent a surprise when the copied .ibd file would cause an error later on IMPORT TABLESPACE.) Data dictionary changes For now, we will not care about the future goal of removing the InnoDB system tables. If/when a future version of MariaDB implements that, it will perform a conversion step. This conversion step could repurpose fields in index pages in some way that we do not need to account for now. We will not need to repurpose PAGE_MAX_TRX_ID in any way now. Just like now, *.frm files will store the latest table definition, including the latest DEFAULT values and the latest number of columns. Just like now, SYS_TABLES and SYS_COLUMNS will store all column definitions and the clustered index field definitions. (SYS_FIELDS entries for the SYS_INDEXES entry of the clustered index only reflect the PRIMARY KEY columns, so instant ADD COLUMN would not modify SYS_FIELDS.) A new InnoDB data dictionary table is added for identifying instantly added columns and for storing the initial DEFAULT values of these columns. This table would be created using the InnoDB SQL interpreter at startup, just like SYS_FOREIGN and other tables. Note that this will require the addition of a BIGINT literal to the InnoDB SQL parser, because INT is only 32 bits, and SYS_TABLES.TABLE_ID is 64 bits: PROCEDURE CREATE_SYS_COLUMNS_ADDED_PROC () IS BEGIN CREATE TABLE SYS_COLUMNS_ADDED( TABLE_ID BIGINT UNSIGNED NOT NULL, POS INT NOT NULL, DEFAULT_VALUE CHAR); CREATE UNIQUE CLUSTERED INDEX COL_IND ON SYS_FOREIGN (TABLE_ID, POS); END; The view INFORMATION_SCHEMA.INNODB_SYS_COLUMNS_ADDED will be created, to allow an easy way of checking which tables require a conversion (ALTER TABLE…FORCE) for moving to MySQL or older versions of MariaDB. Example: SELECT name FROM INFORMATION_SCHEMA.INNODB_SYS_TABLES WHERE table_id IN (SELECT table_id FROM INFORMATION_SCHEMA.INNODB_SYS_COLUMNS_ADDED); If any record with SYS_COLUMNS_ADDED.TABLE_ID matches the table definition that is being loaded from SYS_TABLES and SYS_COLUMNS, then the values of SYS_COLUMNS_ADDED.POS must be in strict sequential order (no gaps allowed) and run up to the number of columns in SYS_COLUMNS. The value SYS_COLUMNS_ADDED.DEFAULT_VALUE = NULL is not allowed if the column is declared NOT NULL. When SYS_COLUMNS_ADDED.DEFAULT_VALUE is not NULL, it must match the allowed length and value of the column (it must match the fixed length of the column declared in SYS_COLUMNS, or it must not exceed the maximum length of a variable-length column. The code that loads the table definition must report that the table is corrupted if any inconsistency between SYS_COLUMNS and SYS_COLUMNS_ADDED is detected. Example of the InnoDB data dictionary changes CREATE TABLE t1(a INT PRIMARY KEY, b INT) ENGINE=InnoDB ROW_FORMAT=REDUNDANT; INSERT INTO t1 SET a=1; # instantly add columns ALTER TABLE t1 ADD COLUMN c INT, d CHAR(10) NULL DEFAULT 'foo', e INT NOT NULL DEFAULT 42; INSERT INTO t1 SET a=2; # change some DEFAULT values of old columns, and add one column ALTER TABLE t1 CHANGE COLUMN b b INT DEFAULT 5, CHANGE COLUMN c c INT DEFAULT 10, CHANGE COLUMN d d DEFAULT NULL, ADD COLUMN f INT UNSIGNED DEFAULT 0; INSERT INTO t1 SET a=3; # change DEFAULTs in .frm file only (nothing inside InnoDB) ALTER TABLE t1 CHANGE COLUMN a a INT 101, COLUMN b b INT DEFAULT 102, CHANGE COLUMN c c INT DEFAULT 103, CHANGE COLUMN d d DEFAULT 'eleventy', CHANGE COLUMN e e INT UNSIGNED DEFAULT 105, CHANGE COLUMN e e INT UNSIGNED DEFAULT 106; SELECT * FROM t1; # must return (1,NULL,NULL,'foo',42,0),(2,NULL,NULL,'foo',42,0),(3,5,10,NULL,0) (For simplicity, the ALTER TABLE statements in the example are not renaming the columns while changing defaults. Such a combination must be supported.) After the CREATE TABLE, we will have 2 SYS_COLUMNS entries for the table and no SYS_COLUMNS_ADDED entries. After the first INSERT, the clustered index will contain the following record: (a,DB_TRX_ID,DB_ROLL_PTR,b)=(1,…,…,NULL). The table format is still compatible with MySQL or older MariaDB versions. After the first ALTER TABLE, we would have 5 entries in SYS_COLUMNS and 3 entries in SYS_COLUMNS_ADDED for the user table. The SYS_COLUMNS_ADDED entries would be: (table_id,2,NULL),(table_id,3,'foo'),(table_id,4,'\200\0\0*'). The columns in SYS_COLUMNS_ADDED are numbered from 0 onwards, so the first entry will have SYS_COLUMNS_ADDED.POS equal to the number of ‘core columns’. Initially, a table must always be created with at least 1 column. The last value is the binary value corresponding to the internal representation of the INT value 42 in InnoDB. After the second INSERT, the latest clustered index definition will be (a,DB_TRX_ID,DB_ROLL_PTR,b,c,d,e) and the index will contain the following: (1,…,…,NULL),(2,…,…,NULL,NULL,'foo',42). After the second ALTER TABLE, we will not update the existing SYS_COLUMNS_ADDED entries to reflect the changed DEFAULT values. They will keep the initial DEFAULT values that were in effect during the instant ADD COLUMN. The .frm file will contain the latest DEFAULT value. Inside InnoDB, the only changes will reflect the instant ADD COLUMN e: a SYS_COLUMNS record and the SYS_COLUMNS_ADDED record (table_id,5,'\0\0\0\0'). After the third INSERT, the latest clustered index definition will be (a,DB_TRX_ID,DB_ROLL_PTR,b,c,d,e,f) and the index will contain the following: (1,…,…,NULL),(2,…,…,NULL,NULL,'foo',42),(3,…,…,5,10,NULL,0). The last ALTER does nothing inside InnoDB. It is only changing the DEFAULT values of already existing columns. Those are only stored in the .frm file. The SYS_COLUMNS_ADDED table only stores the initial DEFAULT values of instantly added columns. The SELECT will fill in the initial DEFAULT values of the columns that are missing from the index entries. That is, for the first record, we will fill in all defaults that are stored in SYS_COLUMNS_ADDED: (c=NULL,d='foo',e=42,f=0). The second and third INSERT stored the current DEFAULT values, not the initial ones. So, for the second record, we will fill in (f=0). InnoDB file format changes Instant ADD COLUMN will not be supported for ROW_FORMAT=COMPRESSED. It will be supported for all other ROW_FORMAT (REDUNDANT, COMPACT, DYNAMIC). We will not introduce new ROW_FORMAT names, but instead refine the binary format in the InnoDB data files. Feature flag in the clustered index root page If instant ADD COLUMN has been used, the following flag will be set in the clustered index root page: /** Determine if instant ADD COLUMN was used. This flag is only supposed to be set on the clustered index root page. @param[in] page index page @return whether instant ADD COLUMN was used (should be false if the page is not a clustered index root page) */ inline bool page_is_instant(const page_t* page) { return(page[PAGE_HEADER + PAGE_N_HEAP] & 0x40); } It is an error (data corruption) if page_is_instant(clust_root) does not reflect the existence of SYS_COLUMNS_ADDED records for the table. At the same time, the functions page_dir_set_n_heap() and page_dir_get_n_heap() must be revised to mask the 2 most significant bits of PAGE_N_HEAP. Previously, only one bit (indicating ROW_FORMAT!=REDUNDANT) was masked, starting with MySQL 5.0.3 which introduced ROW_FORMAT=COMPACT. It is an error (data corruption) if page_is_instant(page) holds on any other index page than a clustered index root page. The corruption must be reported by CHECK TABLE and by btr_assert_not_corrupted(). (Why use the second-most-significant bit in PAGE_N_HEAP? The most significant bit identifies ROW_FORMAT!=REDUNDANT. The second-most-significant bit is always 0 so far, because there can never be more than 16383 records in a page. The largest possible seems to be innodb_page_size/(5+2) which is 65536/7=9362. Yes, the minimum record size would be 5+1 bytes for a secondary index leaf page record, say CREATE TABLE(a CHAR(1) PRIMARY KEY, INDEX(a)), but you cannot have more than 256 distinct values for 1-byte payload. Even with 65536/6 we get 10922, which is less than 16384.) Record format changes Any data corruption must be detected by CHECK TABLE and by btr_index_rec_validate() and similar functions. After instant ADD COLUMN on a ROW_FORMAT=COMPACT or ROW_FORMAT=DYNAMIC table, INSERT or UPDATE will set a new flag REC_INFO_ADDED_FLAG=0x80 in info_bits and explicitly store the number of ‘non-core’ fields in the record header, immediately before the REC_N_NEW_EXTRA_BYTES, using a variable-length encoding of 2 bytes, something like this: const byte* nulls = rec - REC_N_NEW_EXTRA_BYTES; ulint n_nullable = index->n_nullable; if (rec[-REC_NEW_INFO_BITS] & REC_INFO_ADDED_FLAG) { /* This must be ROW_FORMAT=COMPACT or ROW_FORMAT=DYNAMIC. ROW_FORMAT=COMPRESSED pages cannot even store this flag! */ ut_ad(page_rec_is_comp(rec)); ut_ad(!DICT_TF_GET_ZIP_SSIZE(index->table->flags)); ulint n = *nulls--; if (n >= 0x80) { n = (n & 0x7f) << 8 | *nulls--; } n_fields = n_core_fields + n + 1; if (n_fields >= index->n_fields) ...; // error n_nullable += ...; } const byte* lens = nulls - UT_BITS_IN_BYTES(n_nullable); It is an error (data corruption) if REC_INFO_ADDED_FLAG is set in a ROW_FORMAT=REDUNDANT table or in a node pointer record or in secondary indexes, or when instant ADD COLUMN has not been used. On ROW_FORMAT=REDUNDANT pages, the records already explicitly contain the number of fields. It is an error (data corruption) if a record is marked as containing more fields than the table definition implies, or if a stored field does not match the table definition. It is an error (data corruption) if the n_fields in a ROW_FORMAT=REDUNDANT record is less than the number of index record fields derived from the number of core columns. When REC_INFO_ADDED_FLAG is set (COMPACT or DYNAMIC), the above-mentioned variable-length encoding of n_fields is guaranteed by design to be more than the number of ‘core fields’. It is an error (data corruption) if instant ADD COLUMN has not been used, but the n_fields in a ROW_FORMAT=REDUNDANT clustered index definition does not match the clustered index definition. It is an error (data corruption) if the n_fields in a ROW_FORMAT=REDUNDANT node pointer record or secondary index record does not match the index definition. The n_nullable flags in the clustered index node pointer records in ROW_FORMAT=COMPACT or ROW_FORMAT=DYNAMIC must reflect the number of ‘core columns’. These flags are useless garbage, and they are only reserved because of file format compatibility. (Clustered index node pointer records only contain the PRIMARY KEY columns, which are always NOT NULL, so we should have used n_nullable=0.) Because the size of n_nullable is not explicitly stored in the page and because we do not want to change the format of node pointer records or pages, this is our only choice. The size of the n_nullable flags in the ROW_FORMAT=COMPACT or ROW_FORMAT=DYNAMIC clustered index leaf page records must correspond to the number of columns and the stored n_fields in the record. (More bytes may be allocated for new records after ALTER TABLE…ADD COLUMN…NULL.) Redo log format changes Because the InnoDB redo log format (somewhat unnecessarily) is a mix of physical and logical operations, some change is needed to the way how INSERT and UPDATE operations are redo logged for ROW_FORMAT=COMPACT or ROW_FORMAT=DYNAMIC tables. Some ideas of simplifying the redo log format (and speeding up recovery) are listed in MDEV-12353 . It is possible that actual changes to the redo log format (such as the function mlog_open_and_write_index()) are unavoidable. Even if the redo log format is not changed, it would seem safest to try to refuse startup with an older server version. We can do this by introducing a new redo log format identifier, similar to what was done in MDEV-11782 . While we could theoretically retain support for crash recovery from old-format redo log, it is probably best to not try that. (The GA version of MySQL 5.7 or MariaDB 10.2 will refuse crash recovery from older server versions, and MySQL 8.0.0 will refuse crash recovery from MySQL 5.7 or older versions. It is acceptable that MariaDB 10.3 would not support crash recovery from 10.2.) Changes to DML operations If instant ADD COLUMN has not been used, the file format will not be changed. This means that any table-rebuilding ALTER operation (such as ALTER TABLE…FORCE or ALTER TABLE…ALGORITHM=COPY) will convert the table to the old format. If instant ADD COLUMN has been used, any INSERT or UPDATE to the clustered index leaf page must store the current number of columns in the record. (For ROW_FORMAT=COMPACT or ROW_FORMAT=DYNAMIC we must also set REC_INFO_ADDED_FLAG to signal the presence of the n_fields in the record header.) The REC_INFO_DELETED_FLAG of a record can be set or unset without changing other parts of the record. Usually this is done by a DELETE or an UPDATE of a PRIMARY KEY column, or the ROLLBACK of those operations. On SELECT (also as part of UPDATE), if the clustered index leaf page record is found to contain fewer index fields than the current table definition implies, the missing fields will be filled in from SYS_COLUMNS_ADDED.DEFAULT_VALUE.
            vinchen vinchen added a comment -

            Hi Marko. I am very glad that we had a good face-to-face communication for instant add columns at MariaDB Developer Conference 2017. And we reached a consensus on it.

            Thanks for the detailed design proposal above, it is very clear for me. And most of the design had been done in TMySQL from Tencent.

            I will start to merge the feature from TMySQL to MariaDB 10.3 as soon as possible.

            vinchen vinchen added a comment - Hi Marko. I am very glad that we had a good face-to-face communication for instant add columns at MariaDB Developer Conference 2017. And we reached a consensus on it. Thanks for the detailed design proposal above, it is very clear for me. And most of the design had been done in TMySQL from Tencent. I will start to merge the feature from TMySQL to MariaDB 10.3 as soon as possible.

            To reduce the scope of this task, I filed a follow-up task:
            MDEV-13134 Introduce ALTER TABLE attributes ALGORITHM=NOCOPY and ALGORITHM=INSTANT

            marko Marko Mäkelä added a comment - To reduce the scope of this task, I filed a follow-up task: MDEV-13134 Introduce ALTER TABLE attributes ALGORITHM=NOCOPY and ALGORITHM=INSTANT
            marko Marko Mäkelä added a comment - - edited

            To avoid complicating import, export, and future data dictionary related tasks such as MDEV-11655, I am trying to prototype a design change that minimizes the format changes to the InnoDB data dictionary tables.
            Instead of introducing a new table SYS_COLUMNS_ADDED to store the default values of instantly-added columns, my revised design would introduce a special 'default row' record at the start of the clustered index. This record would carry the magic REC_INFO_MIN_REC_FLAG, which was until now only set on the leftmost node pointer record of each non-leaf level.
            We must also store the original number of columns (’core columns’) somewhere in the clustered index tree, instead of only storing it somewhere in the InnoDB data dictionary.
            There is a chicken-and-egg problem here because of my unfortunate design mistake in MySQL 5.0.3 that reserves the n_nullable bits in node pointer records for ROW_FORMAT!=REDUNDANT. It is clearly not possible to navigate to the leftmost leaf page by reading child page numbers from mach_read_from_4(rec_get_next_ptr(node_ptr_with_min_rec_flag, 1)-4) because the next-record pointer may have been changed, for example if some pages were merged and the original next-record pointer was moved to the PAGE_FREE list.

            It looks like instant ADD COLUMN should write 1+UT_BITS_IN_BYTES(n_nullable) to spare 8 bits in the clustered index root page. A good candidate would be the high-order byte of PAGE_DIRECTION, which was always written as 0 starting with MySQL 4.0.14 and 4.1.1, so always for ROW_FORMAT!=REDUNDANT. Originally, it was sometimes written as uninitialized garbage for ROW_FORMAT=REDUNDANT in the initial InnoDB version.

            marko Marko Mäkelä added a comment - - edited To avoid complicating import, export, and future data dictionary related tasks such as MDEV-11655 , I am trying to prototype a design change that minimizes the format changes to the InnoDB data dictionary tables. Instead of introducing a new table SYS_COLUMNS_ADDED to store the default values of instantly-added columns, my revised design would introduce a special 'default row' record at the start of the clustered index. This record would carry the magic REC_INFO_MIN_REC_FLAG, which was until now only set on the leftmost node pointer record of each non-leaf level. We must also store the original number of columns (’core columns’) somewhere in the clustered index tree, instead of only storing it somewhere in the InnoDB data dictionary. There is a chicken-and-egg problem here because of my unfortunate design mistake in MySQL 5.0.3 that reserves the n_nullable bits in node pointer records for ROW_FORMAT!=REDUNDANT. It is clearly not possible to navigate to the leftmost leaf page by reading child page numbers from mach_read_from_4(rec_get_next_ptr(node_ptr_with_min_rec_flag, 1)-4) because the next-record pointer may have been changed, for example if some pages were merged and the original next-record pointer was moved to the PAGE_FREE list. It looks like instant ADD COLUMN should write 1+UT_BITS_IN_BYTES(n_nullable) to spare 8 bits in the clustered index root page. A good candidate would be the high-order byte of PAGE_DIRECTION, which was always written as 0 starting with MySQL 4.0.14 and 4.1.1 , so always for ROW_FORMAT!=REDUNDANT. Originally, it was sometimes written as uninitialized garbage for ROW_FORMAT=REDUNDANT in the initial InnoDB version .

            I think that we must actually store the original number of clustered index fields in the root page, using 10 bits.
            Because the PAGE_DIRECTION field may contain garbage for ROW_FORMAT=REDUNDANT tables, we must introduce a new FIL_PAGE_TYPE to tell if the new field PAGE_INSTANT is present in the former most significant 10 bits of PAGE_DIRECTION.
            The new FIL_PAGE_TYPE on the clustered index root page will also prevent older MariaDB versions from opening or importing tables where instant ADD COLUMN has been used.

            The special ‘default row’ would contain the values of instantly added columns. When the table is opened for the first time and the clustered index root page indicates that instant ADD COLUMN may have been used, we will fetch the ‘default row’. If its field count disagrees with the field count of the clustered index in the data dictionary, we will treat the table as corrupted.

            Note: Whenever the clustered index becomes empty of user records, the root page should be reset to ‘non-instant’ form (FIL_PAGE_TYPE reset to FIL_PAGE_INDEX, and the PAGE_INSTANT bits be cleared, and the ‘default row’ deleted). This improves compatibility with old versions.

            marko Marko Mäkelä added a comment - I think that we must actually store the original number of clustered index fields in the root page, using 10 bits. Because the PAGE_DIRECTION field may contain garbage for ROW_FORMAT=REDUNDANT tables, we must introduce a new FIL_PAGE_TYPE to tell if the new field PAGE_INSTANT is present in the former most significant 10 bits of PAGE_DIRECTION. The new FIL_PAGE_TYPE on the clustered index root page will also prevent older MariaDB versions from opening or importing tables where instant ADD COLUMN has been used. The special ‘default row’ would contain the values of instantly added columns. When the table is opened for the first time and the clustered index root page indicates that instant ADD COLUMN may have been used, we will fetch the ‘default row’. If its field count disagrees with the field count of the clustered index in the data dictionary, we will treat the table as corrupted. Note: Whenever the clustered index becomes empty of user records, the root page should be reset to ‘non-instant’ form (FIL_PAGE_TYPE reset to FIL_PAGE_INDEX, and the PAGE_INSTANT bits be cleared, and the ‘default row’ deleted). This improves compatibility with old versions.

            I have now mostly implemented the revised design.
            The record header format changes will be slightly different too. We will not introduce any new info_bits.
            For ROW_FORMAT=REDUNDANT records, there is no change to the record format. For ROW_FORMAT=COMPACT and ROW_FORMAT=DYNAMIC, the status bit value REC_STATUS_COLUMNS_ADDED=4 will signal that the clustered index leaf page record has additional fields.

            One change that I did not implement yet will be that for REC_STATUS_COLUMNS_ADDED records, the ‘number of fields’ stored in the record header will be n_fields-n_core_fields-1. That is, if 1 field was added, the value 0 will be encoded. This allows up to 128 instantly added columns to be represented in 1 additional header byte.

            marko Marko Mäkelä added a comment - I have now mostly implemented the revised design. The record header format changes will be slightly different too. We will not introduce any new info_bits. For ROW_FORMAT=REDUNDANT records, there is no change to the record format. For ROW_FORMAT=COMPACT and ROW_FORMAT=DYNAMIC, the status bit value REC_STATUS_COLUMNS_ADDED=4 will signal that the clustered index leaf page record has additional fields. One change that I did not implement yet will be that for REC_STATUS_COLUMNS_ADDED records, the ‘number of fields’ stored in the record header will be n_fields-n_core_fields-1. That is, if 1 field was added, the value 0 will be encoded. This allows up to 128 instantly added columns to be represented in 1 additional header byte.
            marko Marko Mäkelä added a comment - - edited

            Changes to the undo log format

            For inserting the ‘default row’ record, we need a new undo log record type TRX_UNDO_INSERT_DEFAULT. This is because the normal type TRX_UNDO_INSERT_REC can only be used for searching by key, not by the magic REC_INFO_MIN_REC_FLAG at the start of the index.

            For updating records so that the number of clustered index fields is changing, we can use the existing undo log record types. We must merely treat the rollback of the ‘default row’ record specially. This is demonstrated by the tests innodb.instant_alter_crash and innodb.instant_alter_inject. The trick is that on rollback, we will truncate the ‘default row’ record just like we would truncate user records, as described below.

            Optimizing storage

            To optimize storage, if a record is inserted or updated in such a way that the clustered index fields corresponding to the last instantly added columns have the same value as in the instant ADD COLUMN…DEFAULT clause (same values as in the ‘default row’ record), these fields will be omitted from the record. Example:

            CREATE TABLE t1 (id INT PRIMARY KEY) ENGINE=InnoDB;
            SET @ts = @@session.timestamp;
            ALTER TABLE t1 ADD COLUMN (t TIMESTAMP NOT NULL DEFAULT current_timestamp(), i INT, v VARCHAR(20) NOT NULL DEFAULT 'abcde');
            INSERT INTO t1 SET id=1;
            

            In this example, the ‘default row’ record would store (current_timestamp(),NULL,'abcde') for the instantly added columns. The value of current_timestamp() would be the one that was evaluated during the ALTER TABLE statement. The clustered index record for the INSERT would explicitly store the values of the columns (id), or (id,t) if the default value expression current_timestamp() during the ALTER and INSERT evaluate to a different value. The values of the instantly added columns i,v would correspond to the ‘default row’ record and could be omitted. If the last instantly added column differs from the ‘default row’ record, then all fields would have to be stored in the clustered index record.
            The same rules apply to update:

            BEGIN;
            UPDATE t1 SET v='foo';
            ROLLBACK;
            ALTER TABLE t1 CHANGE t t TIMESTAMP NOT NULL DEFAULT '2000-07-31 21:10:05';
            SET timestamp=@ts;
            UPDATE t1 SET t=current_timestamp(), v=DEFAULT;
            

            The first UPDATE would store all columns (id,t,i,v) in the clustered index. The ROLLBACK would shrink the record back to (id) or (id,t). The second UPDATE would shrink the record back to (id), because the timestamp is being assigned to the same value that existed when the table was created.
            Note: the current DEFAULT value of the column t does not matter. It is the DEFAULT value that was specified during instant ADD COLUMN that matters. This value is what the ‘default row’ record will contain.

            marko Marko Mäkelä added a comment - - edited Changes to the undo log format For inserting the ‘default row’ record, we need a new undo log record type TRX_UNDO_INSERT_DEFAULT. This is because the normal type TRX_UNDO_INSERT_REC can only be used for searching by key, not by the magic REC_INFO_MIN_REC_FLAG at the start of the index. For updating records so that the number of clustered index fields is changing, we can use the existing undo log record types. We must merely treat the rollback of the ‘default row’ record specially. This is demonstrated by the tests innodb.instant_alter_crash and innodb.instant_alter_inject. The trick is that on rollback, we will truncate the ‘default row’ record just like we would truncate user records, as described below. Optimizing storage To optimize storage, if a record is inserted or updated in such a way that the clustered index fields corresponding to the last instantly added columns have the same value as in the instant ADD COLUMN…DEFAULT clause (same values as in the ‘default row’ record), these fields will be omitted from the record. Example: CREATE TABLE t1 (id INT PRIMARY KEY ) ENGINE=InnoDB; SET @ts = @@session. timestamp ; ALTER TABLE t1 ADD COLUMN (t TIMESTAMP NOT NULL DEFAULT current_timestamp (), i INT , v VARCHAR (20) NOT NULL DEFAULT 'abcde' ); INSERT INTO t1 SET id=1; In this example, the ‘default row’ record would store (current_timestamp(),NULL,'abcde') for the instantly added columns. The value of current_timestamp() would be the one that was evaluated during the ALTER TABLE statement. The clustered index record for the INSERT would explicitly store the values of the columns (id), or (id,t) if the default value expression current_timestamp() during the ALTER and INSERT evaluate to a different value. The values of the instantly added columns i,v would correspond to the ‘default row’ record and could be omitted. If the last instantly added column differs from the ‘default row’ record, then all fields would have to be stored in the clustered index record. The same rules apply to update: BEGIN ; UPDATE t1 SET v= 'foo' ; ROLLBACK ; ALTER TABLE t1 CHANGE t t TIMESTAMP NOT NULL DEFAULT '2000-07-31 21:10:05' ; SET timestamp =@ts; UPDATE t1 SET t= current_timestamp (), v= DEFAULT ; The first UPDATE would store all columns (id,t,i,v) in the clustered index. The ROLLBACK would shrink the record back to (id) or (id,t). The second UPDATE would shrink the record back to (id), because the timestamp is being assigned to the same value that existed when the table was created. Note: the current DEFAULT value of the column t does not matter. It is the DEFAULT value that was specified during instant ADD COLUMN that matters. This value is what the ‘default row’ record will contain.

            A counter will be introduced to keep track of the instant ALTER TABLE operations that affect the columns of the table:

            SELECT variable_value FROM information_schema.global_status
            WHERE variable_name = 'innodb_instant_alter_column'
            

            This same counter can be used in the follow-up task MDEV-11424. For now, it is only incremented for instant ADD COLUMN operations.

            marko Marko Mäkelä added a comment - A counter will be introduced to keep track of the instant ALTER TABLE operations that affect the columns of the table: SELECT variable_value FROM information_schema.global_status WHERE variable_name = 'innodb_instant_alter_column' This same counter can be used in the follow-up task MDEV-11424 . For now, it is only incremented for instant ADD COLUMN operations.

            People

              marko Marko Mäkelä
              monty Michael Widenius
              Votes:
              1 Vote for this issue
              Watchers:
              11 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.