[MDEV-17520] Instant ALTER TABLE for failure-free column type changes Created: 2018-10-22  Updated: 2024-01-18

Status: Stalled
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Fix Version/s: None

Type: Task Priority: Critical
Reporter: Marko Mäkelä Assignee: Marko Mäkelä
Resolution: Unresolved Votes: 0
Labels: ddl, instant

Attachments: File MDEV-17520.patch     File MDEV-17520_sec_indexes.10.4.3.diff    
Issue Links:
Blocks
is blocked by MDEV-15562 Instant DROP COLUMN or changing the o... Closed
PartOf
is part of MDEV-16291 Allow ALGORITHM=NOCOPY for most ALTER... Open
Relates
relates to MDEV-18627 Wrong result after instant size chang... Closed
Sub-Tasks:
Key
Summary
Type
Status
Assignee
MDEV-17892 Assertion `!(field->type.prtype & 256... Technical task Closed Marko Mäkelä  
MDEV-17893 Assertion `*nulls < null_mask' failed... Technical task Closed Marko Mäkelä  
MDEV-17897 Assertion `page_is_leaf(block->frame)... Technical task Open Marko Mäkelä  
MDEV-17922 Assertion `index.fields[i].col->same_... Technical task Closed Marko Mäkelä  
MDEV-17924 Failing assertion: page_is_comp(next_... Technical task Open Marko Mäkelä  
MDEV-17925 Assertion `!dfield_is_null(field)' fa... Technical task Open Marko Mäkelä  
MDEV-17927 Assertion `!field_map_it->is_dropped(... Technical task Open Marko Mäkelä  
MDEV-17931 Assertion `!(index->table)->not_redun... Technical task Open Marko Mäkelä  

 Description   

The original InnoDB ROW_FORMAT=REDUNDANT is essentially storing each field as variable-length and possibly NULL. For that format, we can trivially allow an instantaneous change of a column from NOT NULL to NULL.

The space-optimized row formats COMPACT and DYNAMIC do not allow NULL values to be represented for columns that were originally declared as NOT NULL. They also do not store any length for fixed-length columns. Because of this, some failure-free conversions that are instantaneous for ROW_FORMAT=REDUNDANT in MDEV-15563 would require the table to be rebuilt when it is in the COMPACT or DYNAMIC format.

Let us create a hybrid format that allows us to avoid rebuilding the table in the cases covered by MDEV-15563:

  1. To prevent downgrade or IMPORT TABLESPACE to older MariaDB releases, add a MDEV-15562 metadata record.
  2. For ROW_FORMAT=COMPACT or ROW_FORMAT=REDUNDANT, if the metadata record is present, allow clustered index leaf pages to be in ROW_FORMAT=REDUNDANT with the differences noted below.

The clustered index leaf pages of the table would be gradually converted into something that resembles ROW_FORMAT=REDUNDANT as a result of modifications. This will increase the size usage a little. Also, ROW_FORMAT=REDUNDANT limits the maximum in-page record size to 16,383 bytes, which for innodb_page_size=64k is less than the limit for COMPACT or DYNAMIC.

Because secondary index records would remain in the original ROW_FORMAT, secondary indexes may have to be rebuilt when an indexed column is changed. That is, changing an indexed column from NOT NULL to NULL will require the indexes to be rebuilt if ROW_FORMAT is not REDUNDANT.

Any INSERT or UPDATE after an instant ALTER that removes a NOT NULL constraint (or changes a column to a wider type later in MDEV-15563) will cause all records in the affected clustered index leaf page to be rewritten in a format that resembles ROW_FORMAT=REDUNDANT, with the following differences:

  • NULL columns will occupy 0 bytes of storage, also when fixed-length.
  • CHAR (unless NULL) will occupy n*mbminlenn*mbmaxlen bytes, instead of n*mbmaxlen.
  • For ROW_FORMAT=DYNAMIC, no local prefix of off-page columns will be stored. For ROW_FORMAT=COMPACT, we will continue to write the 768-byte prefixes.

In ROW_FORMAT=REDUNDANT, the record header will store the length of each column (including fixed-length and NULL columns), using n_fields or 2·n_fields bytes. The fixed-length record header is 6 instead of 5 bytes. This will increase the size of each record by at least 1 byte, up to 2·n_fields+1 bytes.

The metadata BLOB that was introduced in MDEV-15562 will be augmented. The flag 1U << 14 will be set in dict_instant_t::non_pk_col_map[] for ROW_FORMAT=COMPACT or ROW_FORMAT=DYNAMIC columns that were originally created as NOT NULL but no longer carry this attribute. Based on this information, we will initialize n_nullable and n_core_null_bytes for the clustered index based on the original column definition, instead of the latest one. Secondary indexes and ROW_FORMAT=REDUNDANT will use the latest definition.



 Comments   
Comment by Marko Mäkelä [ 2018-11-07 ]

I created bb-10.4-MDEV-17520 with a refactored version of what was originally developed in MDEV-15563.
According to the original plan, the remaining task in MDEV-15563 was to allow instant failure-free changes of column type (other than NOT NULL removal), in ROW_FORMAT=REDUNDANT only.

I changed the scope of this task and MDEV-15563 as follows:

  1. In MDEV-17520: Support NOT NULL removal only, but also for ROW_FORMAT=COMPACT and ROW_FORMAT=DYNAMIC.
  2. In MDEV-15563, support instant column type changes for all ROW_FORMAT, relying on the logic implemented in MDEV-17520.
Comment by Marko Mäkelä [ 2018-11-09 ]

Some code changes of this work will be needed in the final version of MDEV-15563 and MDEV-15564.

Comment by Marko Mäkelä [ 2018-11-25 ]

Currently, this only fully works for ROW_FORMAT=REDUNDANT.

For DYNAMIC and COMPRESSED, the basic case of INSERT after the instantaneous removal of NOT NULL will convert the pages to the flexible format. The UPDATE code paths have not been adjusted yet, and SELECT will misinterpret old pages, wrongly expecting there to be ‘is null’ flags in the record headers.

Due to these omissions, the functionality is only enabled for ROW_FORMAT=REDUNDANT at the moment, and MDEV-17520.patch has to be applied to test DYNAMIC or COMPACT. I applied the patch and conducted a simple performance test on a release build:

--source include/have_innodb.inc
--source include/have_sequence.inc
CREATE TABLE ti (a SERIAL, b INT NOT NULL) ENGINE=InnoDB;
CREATE TABLE tr (a SERIAL, b INT NOT NULL) ENGINE=InnoDB ROW_FORMAT=REDUNDANT;
INSERT INTO ti SET a=NULL, b=0;
ALTER TABLE ti CHANGE b b INT NULL;
DELETE FROM ti;
CREATE TABLE tf LIKE ti;
ALTER TABLE tr CHANGE b b INT NULL;
INSERT tr SELECT NULL, NULL FROM seq_1_to_10000000;
TRUNCATE tr;
 
SET profiling = 1;
INSERT tr SELECT NULL, NULL FROM seq_1_to_10000000;
DROP TABLE tr;
INSERT ti SELECT NULL, NULL FROM seq_1_to_10000000;
DROP TABLE ti;
INSERT tf SELECT NULL, NULL FROM seq_1_to_10000000;
DROP TABLE tf;
SHOW PROFILES;

To eliminate the effect of the adaptive hash index and the file system operations, we test as follows:

./mtr --mem --mysqld=--innodb-file-per-table=0 --mysqld=--skip-innodb-adaptive-hash-index innodb.instant_alter_insert_performance

bb-10.4-MDEV-17520 1883da2a7e7f1fff067b62fbfdc9174e24b93a12

SHOW PROFILES;
Query_ID	Duration	Query
1	25.88988026	INSERT tr SELECT NULL, NULL FROM seq_1_to_10000000
2	0.00508767	DROP TABLE tr
3	26.42063776	INSERT ti SELECT NULL, NULL FROM seq_1_to_10000000
4	0.00493538	DROP TABLE ti
5	27.34734312	INSERT tf SELECT NULL, NULL FROM seq_1_to_10000000
6	0.00362084	DROP TABLE tf

As expected, the INSERT performance is a little worse than with ROW_FORMAT=REDUNDANT, but a little better than with the default ROW_FORMAT=DYNAMIC.
The performance difference between the INSERT goes away if we invoke the test as

./mtr --mem --mysqld=--innodb-file-per-table=0 --mysqld=--skip-innodb-adaptive-hash-index --mysqld=--innodb-default-row-format=redundant innodb.instant_alter_insert_performance

This is because the file format is constant in that case. The INSERT duration would only vary by some tens of seconds.

Finally, here are the results with the baseline version (the latest 10.4 that was merged to the branch):

10.4 27f3329ff6cb755b600d536347669bef1a7d98b5

SHOW PROFILES;
Query_ID	Duration	Query
1	25.67807430	INSERT tr SELECT NULL, NULL FROM seq_1_to_10000000
2	0.00440439	DROP TABLE tr
3	27.33882037	INSERT ti SELECT NULL, NULL FROM seq_1_to_10000000
4	0.01391525	DROP TABLE ti
5	27.25386693	INSERT tf SELECT NULL, NULL FROM seq_1_to_10000000
6	0.00725656	DROP TABLE tf

There does not seem to be significant performance degradation, or the degradation is within noise levels (the times can vary a couple of tens of seconds when rerunning). For the table ti, the results are not comparable, because the format is different.

Comment by Marko Mäkelä [ 2018-11-25 ]

The following benchmark is showing a slight performance improvement (instead of degradation) for the branch (1883da2a7e7f1fff067b62fbfdc9174e24b93a12) compared to plain 10.4 (27f3329ff6cb755b600d536347669bef1a7d98b5):

--source include/have_innodb.inc
--source include/have_sequence.inc
CREATE TABLE t (a SERIAL, b INT NULL) ENGINE=InnoDB;
SET profiling = 1;
let $N= 10;
while ($N) {
dec $N;
INSERT t SELECT NULL, NULL FROM seq_1_to_10000000;
TRUNCATE t;
}
SHOW PROFILES;
DROP TABLE t;

./mtr --mysqld=--skip-innodb-file-per-table --mysqld=--skip-innodb-adaptive-hash-index --mysqld=--innodb-default-row-format=redundant innodb.benchmark
./mtr --mysqld=--skip-innodb-file-per-table --mysqld=--skip-innodb-adaptive-hash-index innodb.benchmark

The difference is about 2%, while the noise should be 1% or less. For the slower row_format=dynamic, the test takes 5 minutes to run on my system.

Comment by Marko Mäkelä [ 2018-11-30 ]

The latest bb-10.4-MDEV-17520 passes the full mysql-test-run. Known problems:

  1. UPDATE or DELETE after an instant NOT NULL removal could fail.
  2. INSERT will unnecessarily split pages. This one we should fix by letting btr_page_reorganize() convert the pages.
  3. btr_page_reorganize() should return a failure in case the converted payload will not fit.
  4. If ROW_FORMAT is DYNAMIC or COMPACT, we will unnecessarily rebuild the table if any indexes exist. (The proper action would be to rebuild only those secondary indexes that are defined on the columns on which the NOT NULL attribute is being removed.)
Comment by Marko Mäkelä [ 2018-12-03 ]

I reran my above mentioned benchmarks using clang 7.0.1 -O2 (instead of gcc 8.2.0 -O2, which I used last time).
The branch still seems to be slightly faster than the latest merged 10.4 revision. Only for a table that was created in ROW_FORMAT=DYNAMIC and was rebuilt (instead of doing instantaneous removal of NOT NULL), the test with INSERT, SELECT COUNT(*), TRUNCATE had the opposite performance: 10.4 completed in 386.047 seconds while the branch took 388.309 seconds.

Comment by Marko Mäkelä [ 2018-12-03 ]

The test main.function_defaults_innodb turned out to be crashing randomly, and after my fix, it is returning wrong result. That is currently the only disabled test.

Comment by Matthias Leich [ 2018-12-04 ]

I have no idea if the test which follows is expected to assert or not on 
commit c3ca982a0d5abb358f95c7d94ca14382dc81e1ee (HEAD -> bb-10.4-MDEV-17520, origin/bb-10.4-MDEV-17520)
Date:   Tue Dec 4 17:25:28 2018 +0200
(It does not fail on a 10.4 mid of November)
 
--source include/have_innodb.inc
--disable_abort_on_error
 
 
--source include/have_innodb.inc
CREATE TABLE t1 (col1 INT, col2 INT, col_int INTEGER, col_text TEXT ) ENGINE = InnoDB  ;
SET AUTOCOMMIT = 0 ;
SET @fill_amount = (@@innodb_page_size / 2 ) + 1  ;
ALTER TABLE t1 ADD PRIMARY KEY IF NOT EXISTS ( col1, col_int ), ALGORITHM = DEFAULT, LOCK = SHARED  ;
ALTER TABLE t1 DROP PRIMARY KEY, LOCK = DEFAULT  ;
INSERT INTO t1 (col1,col2, col_int, col_text) VALUES ( 217, 217, 217, REPEAT(SUBSTR(CAST( 217 AS CHAR),1,1), @fill_amount) ) ;
ALTER TABLE t1 CHANGE COLUMN col1 col1 INT, LOCK = NONE  ;
INSERT INTO t1 (col1,col2, col_int, col_text) VALUES ( 43, 43, 43 - 1, REPEAT(SUBSTR(CAST( 43 AS CHAR),1,1), @fill_amount) ) ;
ROLLBACK  ;
--enable_abort_on_error
SHOW PROCESSLIST;    # <-- Just for "convincing"  MTR that there is a crash.
 
storage/innobase/btr/btr0cur.cc:8206: void btr_rec_free_externally_stored_fields(dict_index_t*, rec_t*, const ulint*, page_zip_des_t*, bool, mtr_t*): Assertion `(index->table)->not_redundant() == !!rec_offs_comp(offsets)' failed.
...
uery (0x62b000000220): ROLLBACK
Connection ID (thread ID): 9
Status: NOT_KILLED
 
3  <signal handler called>
#4  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#5  0x00007f30acfc7f5d in __GI_abort () at abort.c:90
#6  0x00007f30acfbdf17 in __assert_fail_base (fmt=<optimized out>, assertion=assertion@entry=0x55a62957dbe0 "(index->table)->not_redundant() == !!rec_offs_comp(offsets)", file=file@entry=0x55a6295762e0 "storage/innobase/btr/btr0cur.cc", line=line@entry=8206, function=function@entry=0x55a629584860 <btr_rec_free_externally_stored_fields(dict_index_t*, unsigned char*, unsigned long const*, page_zip_des_t*, bool, mtr_t*)::__PRETTY_FUNCTION__> "void btr_rec_free_externally_stored_fields(dict_index_t*, rec_t*, const ulint*, page_zip_des_t*, bool, mtr_t*)") at assert.c:92
#7  0x00007f30acfbdfc2 in __GI___assert_fail (assertion=0x55a62957dbe0 "(index->table)->not_redundant() == !!rec_offs_comp(offsets)", file=0x55a6295762e0 "storage/innobase/btr/btr0cur.cc", line=8206, function=0x55a629584860 <btr_rec_free_externally_stored_fields(dict_index_t*, unsigned char*, unsigned long const*, page_zip_des_t*, bool, mtr_t*)::__PRETTY_FUNCTION__> "void btr_rec_free_externally_stored_fields(dict_index_t*, rec_t*, const ulint*, page_zip_des_t*, bool, mtr_t*)") at assert.c:101
#8  0x000055a6284f2f4f in btr_rec_free_externally_stored_fields (index=0x617000041408, rec=0x7f30a3b24121 "", offsets=0x61a000092b08, page_zip=0x0, rollback=true, mtr=0x7f30972f2ab0) at storage/innobase/btr/btr0cur.cc:8206
#9  0x000055a6284e9a18 in btr_cur_pessimistic_delete (err=0x7f30972f29f0, has_reserved_extents=0, cursor=0x61b000084378, flags=0, rollback=true, mtr=0x7f30972f2ab0) at storage/innobase/btr/btr0cur.cc:5888
#10 0x000055a62877f8cc in row_undo_ins_remove_clust_rec (node=0x61b000084308) at storage/innobase/row/row0uins.cc:187
#11 0x000055a628781ee6 in row_undo_ins (node=0x61b000084308, thr=0x61700004b248) at storage/innobase/row/row0uins.cc:588
#12 0x000055a628373823 in row_undo (node=0x61b000084308, thr=0x61700004b248) at storage/innobase/row/row0undo.cc:439
#13 0x000055a628373cde in row_undo_step (thr=0x61700004b248) at storage/innobase/row/row0undo.cc:499
#14 0x000055a62822362a in que_thr_step (thr=0x61700004b248) at storage/innobase/que/que0que.cc:1040
#15 0x000055a628223a4e in que_run_threads_low (thr=0x61700004b248) at storage/innobase/que/que0que.cc:1104
#16 0x000055a628223e37 in que_run_threads (thr=0x61700004b248) at storage/innobase/que/que0que.cc:1144
#17 0x000055a628435714 in trx_rollback_to_savepoint_low (trx=0x7f30a3ea1908, savept=0x0) at storage/innobase/trx/trx0roll.cc:137
#18 0x000055a628436004 in trx_rollback_for_mysql_low (trx=0x7f30a3ea1908) at storage/innobase/trx/trx0roll.cc:206
#19 0x000055a628436894 in trx_rollback_for_mysql (trx=0x7f30a3ea1908) at storage/innobase/trx/trx0roll.cc:232
#20 0x000055a627fc9ac8 in innobase_rollback (hton=0x614000002248, thd=0x62a0000de208, rollback_trx=true) at storage/innobase/handler/ha_innodb.cc:4716
#21 0x000055a627a91130 in ha_rollback_trans (thd=0x62a0000de208, all=true) at sql/handler.cc:1722
#22 0x000055a627712c75 in trans_rollback (thd=0x62a0000de208) at sql/transaction.cc:423
#23 0x000055a62735c0f8 in mysql_execute_command (thd=0x62a0000de208) at sql/sql_parse.cc:5837
#24 0x000055a62736a990 in mysql_parse (thd=0x62a0000de208, rawbuf=0x62b000000220 "ROLLBACK", length=8, parser_state=0x7f30972f5dd0, is_com_multi=false, is_next_command=false) at sql/sql_parse.cc:8092
#25 0x000055a627341df3 in dispatch_command (command=COM_QUERY, thd=0x62a0000de208, packet=0x629000271209 "ROLLBACK  ", packet_length=10, is_com_multi=false, is_next_command=false) at sql/sql_parse.cc:1850
#26 0x000055a62733ea1f in do_command (thd=0x62a0000de208) at sql/sql_parse.cc:1395
#27 0x000055a6276e10e5 in do_handle_one_connection (connect=0x6080000010a8) at sql/sql_connect.cc:1402
#28 0x000055a6276e09d7 in handle_one_connection (arg=0x6080000010a8) at sql/sql_connect.cc:1308

Comment by Marko Mäkelä [ 2018-12-05 ]

mleich, the assertions on rec_offs_comp() or page_rec_is_comp() or page_is_comp() are certainly failing due to this work. I just need a list of all such assertion failures (with stack traces), so that I can evaluate them. I have relaxed many such assertions with || index->dual_format(), and it should be applicable also in this case.
This work is introducing dual-format leaf pages for the clustered index of COMPACT or DYNAMIC tables.

Today I finished addressing the correctness-critical FIXME comments, except the one in dict_stats_analyze_index_level().

Comment by Marko Mäkelä [ 2018-12-27 ]

I rebased the work on the latest 10.4 (slightly after the 10.4.1 release) and pushed to bb-10.4-MDEV-17520-2.
This is very unlikely to be completed within the MariaDB Server 10.4 series.

The main reason why the work was stalled is the increased space usage due to the dual-format leaf pages.

For ROW_FORMAT=COMPACT and ROW_FORMAT=DYNAMIC, a better approach would be to store multiple versions of the table definition in the metadata BLOB that was introduced in MDEV-15562, and to use a new status value to indicate that a 'dictionary version number' field is present in the record header. This would be close to the original idea of MDEV-11424. The biggest unknown here is implementing the conversions between different dictionary versions. Ultimately, we would not only support NOT NULL removal, but also widening string or integer columns (MDEV-15563).

Comment by Marko Mäkelä [ 2019-04-03 ]

MDEV-17520_sec_indexes.10.4.3.diff could be useful when reviving this work. It applies cleanly to mariadb-10.4.3. The following tests would fail: main.ctype_utf8mb4_innodb innodb.instant_alter innodb.innodb-alter-nullable main.selectivity_innodb main.function_defaults_innodb

Generated at Thu Feb 08 08:37:05 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.