Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-11424

Instant ALTER TABLE of failure-free record format changes



      Implement instant ALTER TABLE operations where the specification of non-indexed columns is changed. Example: adding or dropping columns, or changing the ordering of columns.

      In InnoDB, this would affect clustered index page leaf records only. We would repurpose some bits to represent a table definition version. The data dictionary would be extended so that we would be able to convert any record from an older version to the latest version. One possibility would be to store 'conversion recipes' for each version that exists in the table. This would allow a quick direct conversion from (say) version 1 to version 1000.

      Note that some operations will need to write to data files and cannot be instantaneous. DROP COLUMN of an indexed column will imply DROP INDEX, which is not instantaneous, but fast.

      Some operations may also fail. Examples include introducing a NOT NULL attribute without DEFAULT value, or changing the data type of a column, or ADD UNIQUE INDEX.

      Any operation that needs to check or convert the table contents (such as changing the columns in the PRIMARY KEY, or introducing a NOT NULL attribute) will continue to rebuild the table.

      We might want to introduce ALTER TABLE qualifiers to prevent negative surprises. For example, ALGORITHM=QUICK would refuse ADD INDEX, but it would allow DROP INDEX and any metadata-only changes. ALGORITHM=NOCOPY would allow ADD INDEX, but it would refuse to rebuild the whole table.

      Read and write access to the clustered index would be affected as follows:
      Writes would always use the latest dictionary version. If the page that is being written to contains other old-format records, we could convert all of them while we are modifying the page.

      Reads would have to be prepared to convert records from old format to the latest one.

      Note: because ha_innobase::commit_inplace_alter_table() will be invoked while holding MDL_EXCLUSIVE, any transactions that read or modified the table must finish before the ALTER TABLE can commit. But it is possible that some old transaction tries to do its first access to the table after the ALTER TABLE committed. Such transactions may receive an error message 'table definition changed', as noted in MySQL Bug#28432. It would be too much effort to support MVCC if a transaction after ALTER modified a record (converting it to newer dictionary version) that would otherwise be visible to the old transaction.
      Here is the scenario in SQL:

      connection con1;
      connection con2;
      UPDATE t SET b=1 WHERE a=100;
      connection con1;
      SELECT * FROM t WHERE a=1; -- might be OK, if the record is still in old format
      SELECT * FROM t WHERE a=100; -- error: record is in to new format

      For simplicity and consistency, we could always return an error to the SELECT statements (after any ALTER TABLE).

      Another example (based on the question posted in MDEV-11369):

      create table t1(a int, b int, c int); -- dictionary version 1 (t1v1)
      insert into t1 values (1,2,3);
      alter table t1 drop column b; -- t1v2(a,c)
      select * from t1;
      alter table t1 add column b int not null default 10 after a; -- t1v3(a,b,c)
      select * from t1;
      alter table t1 change column b b int not null default 20 first; --t1v4(b,a,c)
      select * from t1;
      alter table t1 change column b b int null; --t1v5(b,a,c)
      select * from t1;

      In this case, the SELECT statements will cause the record t1v1(1,2,3) to be converted to t1v2(1,3), t1v3(1,10,3), t1v4(10,1,3), t1v5(10,1,3). The DEFAULT 10 value of the added column b must be unaffected by subsequent changes to column b, to be compatible with ALGORITHM=COPY, which would materialize the first specified DEFAULT value.

      To minimize the overhead for DML operations, we would implement a rule that whenever anything in the clustered index leaf page is modified, all records will be converted to the latest format. To make the check fast, we could try to store a 'latest dictionary version' somewhere in the page. The field PAGE_MAX_TRX_ID is only used in secondary index leaf pages.

      There could also be a separate online operation for clean-up, such as OPTIMIZE INDEX PRIMARY ON t1, which would also defragment the index. This would be useful if writes to the table are rare, because reads would not write back any converted records.

      The clean-up operation could also remove old version history from the data dictionary. We could also have a background cleanup operation that would (with some granularity) remember the last processed PRIMARY KEY value for each table.

      Finally, any ALTER TABLE that would be refused by the above suggested ALGORITHM=NOCOPY (anything that rebuilds the clustered index) would drop the "instant ALTER TABLE" history. Among other things, this would include ROW_FORMAT changes and any changes to the PRIMARY KEY.

      Adding or dropping indexes or renaming columns or indexes would not need to touch the "instant ALTER history" or the 'recipes to convert from version N to M'. Those recipes should not use column names, but column positions.

      If we store recipes for converting from version 1 to latest, 2 to latest, and so on, then each instant ALTER will have to rewrite the old recipes, so that they will work from 1 to latest+1, 2 to latest+1 and so on, and finally add a recipe from latest to latest+1.


          Issue Links



              • Assignee:
                marko Marko Mäkelä
                marko Marko Mäkelä
              • Votes:
                4 Vote for this issue
                12 Start watching this issue


                • Created: