Hi vinchen, it was a pleasure for me to see your live presentation of a revised idea that uses per-record tagging while maintaining backward compatibility with old data files.
The following detailed design proposal is based on what we agreed upon today. Some consistency checks and tests could be omitted from the first prototype. Also support for ROW_FORMAT=REDUNDANT can be omitted from the first prototype. I would like to have good test coverage for this. I can try to help with tests if needed.
ALTER TABLE syntax changes
It would be nice to introduce new syntax to prevent nasty surprises. When an operation is expected to be quick, it could be better to return an error than to perform a disruptive (resource-intensive and time-consuming) operation.
ALGORITHM=INSTANT will refuse any operation that must modify any data files.
ALGORITHM=NOCOPY will refuse any operation that would rebuild the clustered index (and the whole table).
With respect to the allowed operations, ALGORITHM=INSTANT is a subset of ALGORITHM=NOCOPY which is a subset of ALGORITHM=INPLACE.
Example:
# No data file change (instant operation)
|
ALTER TABLE t ADD COLUMN b INT, ALGORITHM=INSTANT;
|
# The following will change data files (ADD INDEX), but not rebuild the table:
|
ALTER TABLE t ADD COLUMN c INT, ADD INDEX(c), ALGORITHM=NOCOPY;
|
# The following are changing data files (not instant operation)
|
--error ER_ALTER_OPERATION_NOT_SUPPORTED_REASON
|
ALTER TABLE t DROP INDEX c, ALGORITHM=INSTANT;
|
--error ER_ALTER_OPERATION_NOT_SUPPORTED_REASON
|
ALTER TABLE t ADD COLUMN d INT, ADD INDEX(d), ALGORITHM=INSTANT;
|
# All of the above will be allowed with any other ALGORITHM.
|
# With ALGORITHM=DEFAULT or with no ALGORITHM, the most efficient
|
# available algorithm will be used.
|
No new ROW_FORMAT keywords will be introduced. Instead, ROW_FORMAT=REDUNDANT, ROW_FORMAT=COMPACT and ROW_FORMAT=DYNAMIC will be extended at the file level.
Limitations
Limitations on ALTER TABLE
Instant ADD COLUMN will not be supported on ROW_FORMAT=COMPRESSED tables. (ALTER TABLE…ADD COLUMN…ALGORITHM=NOCOPY must return an error; the operation can be executed with ALGORITHM=INPLACE or ALGORITHM=COPY just like before.)
Instant ADD COLUMN will not allow non-constant DEFAULT values, not as part of the ADD COLUMN operation nor later on columns that were created by instant ADD COLUMN. If such DEFAULT value is used, then ALGORITHM=INSTANT or ALGORITHM=NOCOPY will return an error, but ALGORITHM=INPLACE or ALGORITHM=COPY will rebuild the table.
Instant ADD COLUMN is only allowed when the column is added last. Adding columns between or before existing columns will be refused by ALGORITHM=INSTANT or ALGORITHM=NOCOPY, but will be executed by ALGORITHM=INPLACE or ALGORITHM=COPY just like before.
Limitations on IMPORT and EXPORT
ALTER TABLE…IMPORT TABLESPACE is refused after instant ADD COLUMN was used on the table. (You must do TRUNCATE TABLE and ALTER TABLE…FORCE, or just DROP TABLE and CREATE TABLE before ALTER TABLE…DISCARD TABLESPACE and ALTER TABLE…IMPORT TABLESPACE.)
ALTER TABLE…IMPORT TABLESPACE is refused if the clustered index root page of the to-be-imported data file contains a flag indicating that instant ADD COLUMN was executed.
FLUSH TABLES FOR EXPORT must return an error or at least a warning if instant ADD COLUMN has been used. If you want to export tables, use ALTER TABLE…FORCE first to convert it to compatible format. (This is only to prevent a surprise when the copied .ibd file would cause an error later on IMPORT TABLESPACE.)
Data dictionary changes
For now, we will not care about the future goal of removing the InnoDB system tables. If/when a future version of MariaDB implements that, it will perform a conversion step. This conversion step could repurpose fields in index pages in some way that we do not need to account for now. We will not need to repurpose PAGE_MAX_TRX_ID in any way now.
Just like now, *.frm files will store the latest table definition, including the latest DEFAULT values and the latest number of columns.
Just like now, SYS_TABLES and SYS_COLUMNS will store all column definitions and the clustered index field definitions.
(SYS_FIELDS entries for the SYS_INDEXES entry of the clustered index only reflect the PRIMARY KEY columns, so instant ADD COLUMN would not modify SYS_FIELDS.)
A new InnoDB data dictionary table is added for identifying instantly added columns and for storing the initial DEFAULT values of these columns. This table would be created using the InnoDB SQL interpreter at startup, just like SYS_FOREIGN and other tables. Note that this will require the addition of a BIGINT literal to the InnoDB SQL parser, because INT is only 32 bits, and SYS_TABLES.TABLE_ID is 64 bits:
PROCEDURE CREATE_SYS_COLUMNS_ADDED_PROC () IS
|
BEGIN
|
CREATE TABLE SYS_COLUMNS_ADDED(
|
TABLE_ID BIGINT UNSIGNED NOT NULL,
|
POS INT NOT NULL,
|
DEFAULT_VALUE CHAR);
|
CREATE UNIQUE CLUSTERED INDEX COL_IND ON SYS_FOREIGN (TABLE_ID, POS);
|
END;
|
The view INFORMATION_SCHEMA.INNODB_SYS_COLUMNS_ADDED will be created, to allow an easy way of checking which tables require a conversion (ALTER TABLE…FORCE) for moving to MySQL or older versions of MariaDB. Example:
SELECT name FROM INFORMATION_SCHEMA.INNODB_SYS_TABLES
|
WHERE table_id IN (SELECT table_id FROM INFORMATION_SCHEMA.INNODB_SYS_COLUMNS_ADDED);
|
If any record with SYS_COLUMNS_ADDED.TABLE_ID matches the table definition that is being loaded from SYS_TABLES and SYS_COLUMNS, then the values of SYS_COLUMNS_ADDED.POS must be in strict sequential order (no gaps allowed) and run up to the number of columns in SYS_COLUMNS.
The value SYS_COLUMNS_ADDED.DEFAULT_VALUE = NULL is not allowed if the column is declared NOT NULL.
When SYS_COLUMNS_ADDED.DEFAULT_VALUE is not NULL, it must match the allowed length and value of the column (it must match the fixed length of the column declared in SYS_COLUMNS, or it must not exceed the maximum length of a variable-length column.
The code that loads the table definition must report that the table is corrupted if any inconsistency between SYS_COLUMNS and SYS_COLUMNS_ADDED is detected.
Example of the InnoDB data dictionary changes
CREATE TABLE t1(a INT PRIMARY KEY, b INT) ENGINE=InnoDB ROW_FORMAT=REDUNDANT;
|
INSERT INTO t1 SET a=1;
|
# instantly add columns
|
ALTER TABLE t1 ADD COLUMN c INT, d CHAR(10) NULL DEFAULT 'foo', e INT NOT NULL DEFAULT 42;
|
INSERT INTO t1 SET a=2;
|
# change some DEFAULT values of old columns, and add one column
|
ALTER TABLE t1 CHANGE COLUMN b b INT DEFAULT 5,
|
CHANGE COLUMN c c INT DEFAULT 10,
|
CHANGE COLUMN d d DEFAULT NULL, ADD COLUMN f INT UNSIGNED DEFAULT 0;
|
INSERT INTO t1 SET a=3;
|
# change DEFAULTs in .frm file only (nothing inside InnoDB)
|
ALTER TABLE t1 CHANGE COLUMN a a INT 101, COLUMN b b INT DEFAULT 102,
|
CHANGE COLUMN c c INT DEFAULT 103,
|
CHANGE COLUMN d d DEFAULT 'eleventy',
|
CHANGE COLUMN e e INT UNSIGNED DEFAULT 105,
|
CHANGE COLUMN e e INT UNSIGNED DEFAULT 106;
|
SELECT * FROM t1;
|
# must return (1,NULL,NULL,'foo',42,0),(2,NULL,NULL,'foo',42,0),(3,5,10,NULL,0)
|
(For simplicity, the ALTER TABLE statements in the example are not renaming the columns while changing defaults. Such a combination must be supported.)
After the CREATE TABLE, we will have 2 SYS_COLUMNS entries for the table and no SYS_COLUMNS_ADDED entries.
After the first INSERT, the clustered index will contain the following record:
(a,DB_TRX_ID,DB_ROLL_PTR,b)=(1,…,…,NULL). The table format is still compatible with MySQL or older MariaDB versions.
After the first ALTER TABLE, we would have 5 entries in SYS_COLUMNS and 3 entries in SYS_COLUMNS_ADDED for the user table. The SYS_COLUMNS_ADDED entries would be: (table_id,2,NULL),(table_id,3,'foo'),(table_id,4,'\200\0\0*'). The columns in SYS_COLUMNS_ADDED are numbered from 0 onwards, so the first entry will have SYS_COLUMNS_ADDED.POS equal to the number of ‘core columns’. Initially, a table must always be created with at least 1 column. The last value is the binary value corresponding to the internal representation of the INT value 42 in InnoDB.
After the second INSERT, the latest clustered index definition will be
(a,DB_TRX_ID,DB_ROLL_PTR,b,c,d,e) and the index will contain the following:
(1,…,…,NULL),(2,…,…,NULL,NULL,'foo',42).
After the second ALTER TABLE, we will not update the existing SYS_COLUMNS_ADDED entries to reflect the changed DEFAULT values. They will keep the initial DEFAULT values that were in effect during the instant ADD COLUMN. The .frm file will contain the latest DEFAULT value. Inside InnoDB, the only changes will reflect the instant ADD COLUMN e: a SYS_COLUMNS record and the SYS_COLUMNS_ADDED record (table_id,5,'\0\0\0\0').
After the third INSERT, the latest clustered index definition will be
(a,DB_TRX_ID,DB_ROLL_PTR,b,c,d,e,f) and the index will contain the following:
(1,…,…,NULL),(2,…,…,NULL,NULL,'foo',42),(3,…,…,5,10,NULL,0).
The last ALTER does nothing inside InnoDB. It is only changing the DEFAULT values of already existing columns. Those are only stored in the .frm file. The SYS_COLUMNS_ADDED table only stores the initial DEFAULT values of instantly added columns.
The SELECT will fill in the initial DEFAULT values of the columns that are missing from the index entries. That is, for the first record, we will fill in all defaults that are stored in SYS_COLUMNS_ADDED: (c=NULL,d='foo',e=42,f=0). The second and third INSERT stored the current DEFAULT values, not the initial ones. So, for the second record, we will fill in (f=0).
InnoDB file format changes
Instant ADD COLUMN will not be supported for ROW_FORMAT=COMPRESSED. It will be supported for all other ROW_FORMAT (REDUNDANT, COMPACT, DYNAMIC).
We will not introduce new ROW_FORMAT names, but instead refine the binary format in the InnoDB data files.
Feature flag in the clustered index root page
If instant ADD COLUMN has been used, the following flag will be set in the clustered index root page:
/** Determine if instant ADD COLUMN was used.
|
This flag is only supposed to be set on the clustered index root page.
|
@param[in] page index page
|
@return whether instant ADD COLUMN was used
|
(should be false if the page is not a clustered index root page) */
|
inline
|
bool
|
page_is_instant(const page_t* page)
|
{
|
return(page[PAGE_HEADER + PAGE_N_HEAP] & 0x40);
|
}
|
It is an error (data corruption) if page_is_instant(clust_root) does not reflect the existence of SYS_COLUMNS_ADDED records for the table.
At the same time, the functions page_dir_set_n_heap() and page_dir_get_n_heap() must be revised to mask the 2 most significant bits of PAGE_N_HEAP. Previously, only one bit (indicating ROW_FORMAT!=REDUNDANT) was masked, starting with MySQL 5.0.3 which introduced ROW_FORMAT=COMPACT.
It is an error (data corruption) if page_is_instant(page) holds on any other index page than a clustered index root page. The corruption must be reported by CHECK TABLE and by btr_assert_not_corrupted().
(Why use the second-most-significant bit in PAGE_N_HEAP? The most significant bit identifies ROW_FORMAT!=REDUNDANT. The second-most-significant bit is always 0 so far, because there can never be more than 16383 records in a page. The largest possible seems to be innodb_page_size/(5+2) which is 65536/7=9362. Yes, the minimum record size would be 5+1 bytes for a secondary index leaf page record, say CREATE TABLE(a CHAR(1) PRIMARY KEY, INDEX(a)), but you cannot have more than 256 distinct values for 1-byte payload. Even with 65536/6 we get 10922, which is less than 16384.)
Record format changes
Any data corruption must be detected by CHECK TABLE and by btr_index_rec_validate() and similar functions.
After instant ADD COLUMN on a ROW_FORMAT=COMPACT or ROW_FORMAT=DYNAMIC table, INSERT or UPDATE will set a new flag REC_INFO_ADDED_FLAG=0x80 in info_bits and explicitly store the number of ‘non-core’ fields in the record header, immediately before the REC_N_NEW_EXTRA_BYTES, using a variable-length encoding of 2 bytes, something like this:
const byte* nulls = rec - REC_N_NEW_EXTRA_BYTES;
|
ulint n_nullable = index->n_nullable;
|
if (rec[-REC_NEW_INFO_BITS] & REC_INFO_ADDED_FLAG) {
|
/* This must be ROW_FORMAT=COMPACT or ROW_FORMAT=DYNAMIC.
|
ROW_FORMAT=COMPRESSED pages cannot even store this flag! */
|
ut_ad(page_rec_is_comp(rec));
|
ut_ad(!DICT_TF_GET_ZIP_SSIZE(index->table->flags));
|
ulint n = *nulls--;
|
if (n >= 0x80) {
|
n = (n & 0x7f) << 8 | *nulls--;
|
}
|
n_fields = n_core_fields + n + 1;
|
if (n_fields >= index->n_fields) ...; // error
|
n_nullable += ...;
|
}
|
const byte* lens = nulls - UT_BITS_IN_BYTES(n_nullable);
|
It is an error (data corruption) if REC_INFO_ADDED_FLAG is set in a ROW_FORMAT=REDUNDANT table or in a node pointer record or in secondary indexes, or when instant ADD COLUMN has not been used.
On ROW_FORMAT=REDUNDANT pages, the records already explicitly contain the number of fields.
It is an error (data corruption) if a record is marked as containing more fields than the table definition implies, or if a stored field does not match the table definition.
It is an error (data corruption) if the n_fields in a ROW_FORMAT=REDUNDANT record is less than the number of index record fields derived from the number of core columns.
When REC_INFO_ADDED_FLAG is set (COMPACT or DYNAMIC), the above-mentioned variable-length encoding of n_fields is guaranteed by design to be more than the number of ‘core fields’.
It is an error (data corruption) if instant ADD COLUMN has not been used, but the n_fields in a ROW_FORMAT=REDUNDANT clustered index definition does not match the clustered index definition.
It is an error (data corruption) if the n_fields in a ROW_FORMAT=REDUNDANT node pointer record or secondary index record does not match the index definition.
The n_nullable flags in the clustered index node pointer records in ROW_FORMAT=COMPACT or ROW_FORMAT=DYNAMIC must reflect the number of ‘core columns’. These flags are useless garbage, and they are only reserved because of file format compatibility. (Clustered index node pointer records only contain the PRIMARY KEY columns, which are always NOT NULL, so we should have used n_nullable=0.) Because the size of n_nullable is not explicitly stored in the page and because we do not want to change the format of node pointer records or pages, this is our only choice.
The size of the n_nullable flags in the ROW_FORMAT=COMPACT or ROW_FORMAT=DYNAMIC clustered index leaf page records must correspond to the number of columns and the stored n_fields in the record. (More bytes may be allocated for new records after ALTER TABLE…ADD COLUMN…NULL.)
Redo log format changes
Because the InnoDB redo log format (somewhat unnecessarily) is a mix of physical and logical operations, some change is needed to the way how INSERT and UPDATE operations are redo logged for ROW_FORMAT=COMPACT or ROW_FORMAT=DYNAMIC tables. Some ideas of simplifying the redo log format (and speeding up recovery) are listed in MDEV-12353.
It is possible that actual changes to the redo log format (such as the function mlog_open_and_write_index()) are unavoidable.
Even if the redo log format is not changed, it would seem safest to try to refuse startup with an older server version. We can do this by introducing a new redo log format identifier, similar to what was done in MDEV-11782.
While we could theoretically retain support for crash recovery from old-format redo log, it is probably best to not try that. (The GA version of MySQL 5.7 or MariaDB 10.2 will refuse crash recovery from older server versions, and MySQL 8.0.0 will refuse crash recovery from MySQL 5.7 or older versions. It is acceptable that MariaDB 10.3 would not support crash recovery from 10.2.)
Changes to DML operations
If instant ADD COLUMN has not been used, the file format will not be changed. This means that any table-rebuilding ALTER operation (such as ALTER TABLE…FORCE or ALTER TABLE…ALGORITHM=COPY) will convert the table to the old format.
If instant ADD COLUMN has been used, any INSERT or UPDATE to the clustered index leaf page must store the current number of columns in the record. (For ROW_FORMAT=COMPACT or ROW_FORMAT=DYNAMIC we must also set REC_INFO_ADDED_FLAG to signal the presence of the n_fields in the record header.)
The REC_INFO_DELETED_FLAG of a record can be set or unset without changing other parts of the record. Usually this is done by a DELETE or an UPDATE of a PRIMARY KEY column, or the ROLLBACK of those operations.
On SELECT (also as part of UPDATE), if the clustered index leaf page record is found to contain fewer index fields than the current table definition implies, the missing fields will be filled in from SYS_COLUMNS_ADDED.DEFAULT_VALUE.
The original InnoDB ROW_FORMAT=REDUNDANT should support Postgres-style instant operations out of the box, because the record header explicitly stores the number of fields as well as the start address of each field. The Postgres-style operations are ADD COLUMN…LAST and DROP COLUMN (the latter is implemented by hiding columns and storing dummy columns for new records).
A simple proof of concept exists in MariaDB 10.2 and upstream MySQL 5.7: The column MERGE_THRESHOLD was appended to the InnoDB data dictionary table SYS_INDEXES without rewriting existing records. Old data dictionary entries will deliver a default value.
While ROW_FORMAT=REDUNDANT introduces some storage overhead, it should also provide faster access, because the start address of each field is explicitly stored in the index record header. No changes to applications are needed if innodb_default_row_format=redundant is set in the configuration files.
Also in this task, ALTER TABLE will not be instantaneous if it includes any data-writing operations, such as ADD [UNIQUE] INDEX. DROP COLUMN may imply DROP INDEX, but that is reasonably fast (not traversing the entire index tree).
In a later MariaDB version, we could support a wider range of instant ALTER TABLE operations by repurposing some bits or bytes in the clustered index leaf page record header for representing a data dictionary version, by keeping all versions of the table definition in the data dictionary, and by converting old-format records to the current format. Note that some operations will need to write to data files and cannot be instantaneous. Such operations may also fail. Examples include introducing a NOT NULL attribute, changing the data type of a column, or ADD UNIQUE INDEX.