Details

    • 10.2.4-5, 10.3.1-2

    Description

      Add support for instant add column for InnoDB when adding a new column with a default value last.

      We have got patches from Alibaba and TenCent for doing this. We should choose either one or merge them.

      The idea in both patches is to add a new row type to InnoDB where the number of columns
      are stored in each row in 1-2 bytes. When creating a row we store the current number of columns. This allows us to fill any extra columns with default values on read.

      Attachments

        Issue Links

          Activity

            marko Marko Mäkelä added a comment - - edited

            To avoid complicating import, export, and future data dictionary related tasks such as MDEV-11655, I am trying to prototype a design change that minimizes the format changes to the InnoDB data dictionary tables.
            Instead of introducing a new table SYS_COLUMNS_ADDED to store the default values of instantly-added columns, my revised design would introduce a special 'default row' record at the start of the clustered index. This record would carry the magic REC_INFO_MIN_REC_FLAG, which was until now only set on the leftmost node pointer record of each non-leaf level.
            We must also store the original number of columns (’core columns’) somewhere in the clustered index tree, instead of only storing it somewhere in the InnoDB data dictionary.
            There is a chicken-and-egg problem here because of my unfortunate design mistake in MySQL 5.0.3 that reserves the n_nullable bits in node pointer records for ROW_FORMAT!=REDUNDANT. It is clearly not possible to navigate to the leftmost leaf page by reading child page numbers from mach_read_from_4(rec_get_next_ptr(node_ptr_with_min_rec_flag, 1)-4) because the next-record pointer may have been changed, for example if some pages were merged and the original next-record pointer was moved to the PAGE_FREE list.

            It looks like instant ADD COLUMN should write 1+UT_BITS_IN_BYTES(n_nullable) to spare 8 bits in the clustered index root page. A good candidate would be the high-order byte of PAGE_DIRECTION, which was always written as 0 starting with MySQL 4.0.14 and 4.1.1, so always for ROW_FORMAT!=REDUNDANT. Originally, it was sometimes written as uninitialized garbage for ROW_FORMAT=REDUNDANT in the initial InnoDB version.

            marko Marko Mäkelä added a comment - - edited To avoid complicating import, export, and future data dictionary related tasks such as MDEV-11655 , I am trying to prototype a design change that minimizes the format changes to the InnoDB data dictionary tables. Instead of introducing a new table SYS_COLUMNS_ADDED to store the default values of instantly-added columns, my revised design would introduce a special 'default row' record at the start of the clustered index. This record would carry the magic REC_INFO_MIN_REC_FLAG, which was until now only set on the leftmost node pointer record of each non-leaf level. We must also store the original number of columns (’core columns’) somewhere in the clustered index tree, instead of only storing it somewhere in the InnoDB data dictionary. There is a chicken-and-egg problem here because of my unfortunate design mistake in MySQL 5.0.3 that reserves the n_nullable bits in node pointer records for ROW_FORMAT!=REDUNDANT. It is clearly not possible to navigate to the leftmost leaf page by reading child page numbers from mach_read_from_4(rec_get_next_ptr(node_ptr_with_min_rec_flag, 1)-4) because the next-record pointer may have been changed, for example if some pages were merged and the original next-record pointer was moved to the PAGE_FREE list. It looks like instant ADD COLUMN should write 1+UT_BITS_IN_BYTES(n_nullable) to spare 8 bits in the clustered index root page. A good candidate would be the high-order byte of PAGE_DIRECTION, which was always written as 0 starting with MySQL 4.0.14 and 4.1.1 , so always for ROW_FORMAT!=REDUNDANT. Originally, it was sometimes written as uninitialized garbage for ROW_FORMAT=REDUNDANT in the initial InnoDB version .

            I think that we must actually store the original number of clustered index fields in the root page, using 10 bits.
            Because the PAGE_DIRECTION field may contain garbage for ROW_FORMAT=REDUNDANT tables, we must introduce a new FIL_PAGE_TYPE to tell if the new field PAGE_INSTANT is present in the former most significant 10 bits of PAGE_DIRECTION.
            The new FIL_PAGE_TYPE on the clustered index root page will also prevent older MariaDB versions from opening or importing tables where instant ADD COLUMN has been used.

            The special ‘default row’ would contain the values of instantly added columns. When the table is opened for the first time and the clustered index root page indicates that instant ADD COLUMN may have been used, we will fetch the ‘default row’. If its field count disagrees with the field count of the clustered index in the data dictionary, we will treat the table as corrupted.

            Note: Whenever the clustered index becomes empty of user records, the root page should be reset to ‘non-instant’ form (FIL_PAGE_TYPE reset to FIL_PAGE_INDEX, and the PAGE_INSTANT bits be cleared, and the ‘default row’ deleted). This improves compatibility with old versions.

            marko Marko Mäkelä added a comment - I think that we must actually store the original number of clustered index fields in the root page, using 10 bits. Because the PAGE_DIRECTION field may contain garbage for ROW_FORMAT=REDUNDANT tables, we must introduce a new FIL_PAGE_TYPE to tell if the new field PAGE_INSTANT is present in the former most significant 10 bits of PAGE_DIRECTION. The new FIL_PAGE_TYPE on the clustered index root page will also prevent older MariaDB versions from opening or importing tables where instant ADD COLUMN has been used. The special ‘default row’ would contain the values of instantly added columns. When the table is opened for the first time and the clustered index root page indicates that instant ADD COLUMN may have been used, we will fetch the ‘default row’. If its field count disagrees with the field count of the clustered index in the data dictionary, we will treat the table as corrupted. Note: Whenever the clustered index becomes empty of user records, the root page should be reset to ‘non-instant’ form (FIL_PAGE_TYPE reset to FIL_PAGE_INDEX, and the PAGE_INSTANT bits be cleared, and the ‘default row’ deleted). This improves compatibility with old versions.

            I have now mostly implemented the revised design.
            The record header format changes will be slightly different too. We will not introduce any new info_bits.
            For ROW_FORMAT=REDUNDANT records, there is no change to the record format. For ROW_FORMAT=COMPACT and ROW_FORMAT=DYNAMIC, the status bit value REC_STATUS_COLUMNS_ADDED=4 will signal that the clustered index leaf page record has additional fields.

            One change that I did not implement yet will be that for REC_STATUS_COLUMNS_ADDED records, the ‘number of fields’ stored in the record header will be n_fields-n_core_fields-1. That is, if 1 field was added, the value 0 will be encoded. This allows up to 128 instantly added columns to be represented in 1 additional header byte.

            marko Marko Mäkelä added a comment - I have now mostly implemented the revised design. The record header format changes will be slightly different too. We will not introduce any new info_bits. For ROW_FORMAT=REDUNDANT records, there is no change to the record format. For ROW_FORMAT=COMPACT and ROW_FORMAT=DYNAMIC, the status bit value REC_STATUS_COLUMNS_ADDED=4 will signal that the clustered index leaf page record has additional fields. One change that I did not implement yet will be that for REC_STATUS_COLUMNS_ADDED records, the ‘number of fields’ stored in the record header will be n_fields-n_core_fields-1. That is, if 1 field was added, the value 0 will be encoded. This allows up to 128 instantly added columns to be represented in 1 additional header byte.
            marko Marko Mäkelä added a comment - - edited

            Changes to the undo log format

            For inserting the ‘default row’ record, we need a new undo log record type TRX_UNDO_INSERT_DEFAULT. This is because the normal type TRX_UNDO_INSERT_REC can only be used for searching by key, not by the magic REC_INFO_MIN_REC_FLAG at the start of the index.

            For updating records so that the number of clustered index fields is changing, we can use the existing undo log record types. We must merely treat the rollback of the ‘default row’ record specially. This is demonstrated by the tests innodb.instant_alter_crash and innodb.instant_alter_inject. The trick is that on rollback, we will truncate the ‘default row’ record just like we would truncate user records, as described below.

            Optimizing storage

            To optimize storage, if a record is inserted or updated in such a way that the clustered index fields corresponding to the last instantly added columns have the same value as in the instant ADD COLUMN…DEFAULT clause (same values as in the ‘default row’ record), these fields will be omitted from the record. Example:

            CREATE TABLE t1 (id INT PRIMARY KEY) ENGINE=InnoDB;
            SET @ts = @@session.timestamp;
            ALTER TABLE t1 ADD COLUMN (t TIMESTAMP NOT NULL DEFAULT current_timestamp(), i INT, v VARCHAR(20) NOT NULL DEFAULT 'abcde');
            INSERT INTO t1 SET id=1;
            

            In this example, the ‘default row’ record would store (current_timestamp(),NULL,'abcde') for the instantly added columns. The value of current_timestamp() would be the one that was evaluated during the ALTER TABLE statement. The clustered index record for the INSERT would explicitly store the values of the columns (id), or (id,t) if the default value expression current_timestamp() during the ALTER and INSERT evaluate to a different value. The values of the instantly added columns i,v would correspond to the ‘default row’ record and could be omitted. If the last instantly added column differs from the ‘default row’ record, then all fields would have to be stored in the clustered index record.
            The same rules apply to update:

            BEGIN;
            UPDATE t1 SET v='foo';
            ROLLBACK;
            ALTER TABLE t1 CHANGE t t TIMESTAMP NOT NULL DEFAULT '2000-07-31 21:10:05';
            SET timestamp=@ts;
            UPDATE t1 SET t=current_timestamp(), v=DEFAULT;
            

            The first UPDATE would store all columns (id,t,i,v) in the clustered index. The ROLLBACK would shrink the record back to (id) or (id,t). The second UPDATE would shrink the record back to (id), because the timestamp is being assigned to the same value that existed when the table was created.
            Note: the current DEFAULT value of the column t does not matter. It is the DEFAULT value that was specified during instant ADD COLUMN that matters. This value is what the ‘default row’ record will contain.

            marko Marko Mäkelä added a comment - - edited Changes to the undo log format For inserting the ‘default row’ record, we need a new undo log record type TRX_UNDO_INSERT_DEFAULT. This is because the normal type TRX_UNDO_INSERT_REC can only be used for searching by key, not by the magic REC_INFO_MIN_REC_FLAG at the start of the index. For updating records so that the number of clustered index fields is changing, we can use the existing undo log record types. We must merely treat the rollback of the ‘default row’ record specially. This is demonstrated by the tests innodb.instant_alter_crash and innodb.instant_alter_inject. The trick is that on rollback, we will truncate the ‘default row’ record just like we would truncate user records, as described below. Optimizing storage To optimize storage, if a record is inserted or updated in such a way that the clustered index fields corresponding to the last instantly added columns have the same value as in the instant ADD COLUMN…DEFAULT clause (same values as in the ‘default row’ record), these fields will be omitted from the record. Example: CREATE TABLE t1 (id INT PRIMARY KEY ) ENGINE=InnoDB; SET @ts = @@session. timestamp ; ALTER TABLE t1 ADD COLUMN (t TIMESTAMP NOT NULL DEFAULT current_timestamp (), i INT , v VARCHAR (20) NOT NULL DEFAULT 'abcde' ); INSERT INTO t1 SET id=1; In this example, the ‘default row’ record would store (current_timestamp(),NULL,'abcde') for the instantly added columns. The value of current_timestamp() would be the one that was evaluated during the ALTER TABLE statement. The clustered index record for the INSERT would explicitly store the values of the columns (id), or (id,t) if the default value expression current_timestamp() during the ALTER and INSERT evaluate to a different value. The values of the instantly added columns i,v would correspond to the ‘default row’ record and could be omitted. If the last instantly added column differs from the ‘default row’ record, then all fields would have to be stored in the clustered index record. The same rules apply to update: BEGIN ; UPDATE t1 SET v= 'foo' ; ROLLBACK ; ALTER TABLE t1 CHANGE t t TIMESTAMP NOT NULL DEFAULT '2000-07-31 21:10:05' ; SET timestamp =@ts; UPDATE t1 SET t= current_timestamp (), v= DEFAULT ; The first UPDATE would store all columns (id,t,i,v) in the clustered index. The ROLLBACK would shrink the record back to (id) or (id,t). The second UPDATE would shrink the record back to (id), because the timestamp is being assigned to the same value that existed when the table was created. Note: the current DEFAULT value of the column t does not matter. It is the DEFAULT value that was specified during instant ADD COLUMN that matters. This value is what the ‘default row’ record will contain.

            A counter will be introduced to keep track of the instant ALTER TABLE operations that affect the columns of the table:

            SELECT variable_value FROM information_schema.global_status
            WHERE variable_name = 'innodb_instant_alter_column'
            

            This same counter can be used in the follow-up task MDEV-11424. For now, it is only incremented for instant ADD COLUMN operations.

            marko Marko Mäkelä added a comment - A counter will be introduced to keep track of the instant ALTER TABLE operations that affect the columns of the table: SELECT variable_value FROM information_schema.global_status WHERE variable_name = 'innodb_instant_alter_column' This same counter can be used in the follow-up task MDEV-11424 . For now, it is only incremented for instant ADD COLUMN operations.

            People

              marko Marko Mäkelä
              monty Michael Widenius
              Votes:
              1 Vote for this issue
              Watchers:
              11 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.