Engine-independent online ALTER TABLE (MDEV-16329)

[MDEV-28808] Test MDEV-16329 (ALTER ONLINE TABLE) - Core server part Created: 2022-06-12  Updated: 2023-08-12

Status: Stalled
Project: MariaDB Server
Component/s: Server, Tests
Affects Version/s: N/A
Fix Version/s: 11.2

Type: Technical task Priority: Critical
Reporter: Elena Stepanova Assignee: Elena Stepanova
Resolution: Unresolved Votes: 0
Labels: None

Attachments: Text File coverage-mtr-c29ff60b.txt     Text File missings-mtr-c29ff60b.txt     File oalter-c29ff60b.diff    
Issue Links:
Problem/Incident
causes MDEV-31906 Processlist shows stage and max stage... Open
Relates
relates to MDEV-27986 Galera testing of ALTER ONLINE TABLE Closed
relates to MDEV-28809 Test MDEV-16329 (ALTER ONLINE TABLE) ... Closed
relates to MDEV-28810 Test MDEV-16329 (ALTER ONLINE TABLE) ... Closed
relates to MDEV-28825 Server crash in binlog_online_alter_e... Closed
relates to MDEV-28942 Online alter does not support ORDER BY Closed
relates to MDEV-28943 Online alter fails under LOCK TABLE w... Closed
relates to MDEV-28944 XA assertions failing in binlog_rollb... Closed
relates to MDEV-28949 Deadlock between online alter and DML Closed
relates to MDEV-28959 Online alter ignores strict table mode Closed
relates to MDEV-28966 Assertion `row_data' failed in unpack... Closed
relates to MDEV-28967 Assertion `marked_for_write_or_comput... Closed
relates to MDEV-29007 Assertion `marked_for_write_or_comput... Closed
relates to MDEV-29013 ER_KEY_NOT_FOUND/lock timeout upon on... Closed
relates to MDEV-29038 XA assertions failing in binlog_rollb... Closed
relates to MDEV-29067 Online alter ignores check constraint... Closed
relates to MDEV-29068 Cascade foreign key updates do not ap... Closed
relates to MDEV-29069 ER_KEY_NOT_FOUND upon concurrent onli... Closed
relates to MDEV-30902 Server crash in LEX::first_lists_tabl... Closed
relates to MDEV-30924 Server crashes in MYSQL_LOG::is_open ... Closed
relates to MDEV-30925 Assertion `share->now_transactional' ... Closed
relates to MDEV-30949 Direct leak in my_register_filename /... Closed
relates to MDEV-30984 Online ALTER table is denied with non... Closed
relates to MDEV-28825 Server crash in binlog_online_alter_e... Closed
relates to MDEV-30987 main.alter_table_online times out wit... Closed
relates to MDEV-31033 ER_KEY_NOT_FOUND upon online COPY ALT... Closed
relates to MDEV-31043 ER_KEY_NOT_FOUND upon concurrent ALTE... Closed
relates to MDEV-31058 ER_KEY_NOT_FOUND upon concurrent CHAN... Closed
relates to MDEV-31059 "Slave SQL" errors upon concurrent DM... Closed
relates to MDEV-31128 Server crashes in Rows_log_event::fin... Closed
relates to MDEV-31136 Online ALTER is allowed on master but... Closed
relates to MDEV-31172 Server crash or ASAN errors in online... Closed
relates to MDEV-31563 Server crashes in MDL_ticket::downgra... Closed
relates to MDEV-31624 Online ALTER fails due to intermediat... Open
relates to MDEV-31775 Server crash in Rows_log_event::updat... Closed
relates to MDEV-31781 ALTER TABLE ENGINE=s3 fails Closed

 Description   

JIRA, Info

MDEV-16329 Engine-independent online ALTER TABLE

Extra info: MDEV-11424 - InnoDB operations which can/cannot be instant

Github

Branch: bb-11.2-oalter

Last commit in testing: c29ff60b
Baseline: e81fa345 (11.2)

Specification/documentation notes

  • Despite all implications in MDEV-16329, online ALTER TABLE is not the same as ALTER ONLINE TABLE. The explicit syntax is

    ALTER TABLE .. ALGORITHM=COPY, LOCK=NONE
    

    In many cases it should happen automatically, on ALTER TABLE without ALGORITHM/LOCK specification.

  • New flag in old_mode: LOCK_ALTER_TABLE_COPY. Big DML events in concurrency with ALTER on tables without PK is a concern. old_mode is to be set for instances which can have such scenarios.
  • Exceptions (limitations) when the new online alter is not applicable are enumerated in MDEV-16329

Per-engines exploratory observations:

  • InnoDB, MyISAM, Aria, Heap, CSV, RocksDB, Archive, Mroonga – applies
  • MERGE, Spider, Blackhole, Sphinx – skips online alter (doesn't hit debug sync point), probably because there is no real data
  • Columnstore – skips online alter (doesn't increase Binlog_bytes_written, cannot test debug , it fails there, to be checked whether it's specific to the branch)
    S3, OQGRAPH, FEDERATED, FEDERATEDX – ALTER is prohibited in general
  • PERFORMANCE_SCHEMA – cannot be ALTERed
  • Conversion to S3 – prohibited. Conversion from S3 – applies .
  • Sequences – prohibited.
  • Connect – prohibited.

Testing

Buildbot

Code review notes

Extra MTR

  • GCOV for the patch
    default set + galera set, nm-big run: Lines with zero coverage (67/2603) missings-mtr-c29ff60b.txt
  • SAN tests
    • ASAN default big / nm + ps + view
    • MSAN default big / nm + ps + view
    • UBSAN default big / nm + ps + view
    • ASAN plugins big/nm
    • ASAN galera big/nm

Manual Testing

  • By spec/exploratory
  • integration map traverse

Crash recovery

  • No new problems with stability observed
  • testing of recovery correctness in comparison to binlog is limited due to a number of legacy issues in binlog replay

Replication

  • No new problems with server (primary or replica) stability observed
  • testing of replicaton stability and correctness is limited due to a number of legacy issues in replication

Larger integration

  • Galera
    outside the scope of this item, done in MDEV-27986
  • Columnstore
    does not support online alter
  • MariaBackup
    no specific issues observed
  • Protocol
  • Connectors

Upgrade

  • Live upgrade
  • Dump upgrade
  • Package upgrade
  • MariaBackup upgrade
  • OM => NS replication upgrade
    Some old bugs fixed along with online alter development (while not directly related to the feature), the fixes may cause OM => NS replication failures. E.g. if ALGORITHM=INSTANT was silently ignored in a previous version and ALTER with it succeeded, even though not executed as INSTANT, now it is rejected, thus upon replicating such events the replication will abort.
  • Galera rolling upgrade
    out of the scope of this item

Random

  • RQG GCOV for the patch
    not satisfactory, to be further worked on after the feature release
  • RQG-100 ASAN
    50% tests fail due to unrelated issues
  • RQG-100 Release

Compatibility

  • Windows
  • ARM
  • FreeBSD
  • i686
    MDEV-31646 downgraded to non-RC-blocking

Misc attention points

  • MTR with alter-algorithm=copy.
    Should be run without debug-sync or on a release build, otherwise too many low-level debug tests fail because they are not hitting expected sync points.

Performance considerations
Big DML events in concurrency with ALTER on tables without PK is a concern. old_mode is to be set for instances which can have such scenarios.

Updates/additions to non-MTR regression tests

  • various small ALTER-related changes
  • syncpoint grammar extensions
  • binlog replay improvements
  • new old_mode in parameter presets and dynamic variables

Branch commits (feature / --- unrelated)

c29ff60b2c8 fix cleenup??
00accd5e465 read_log_event: extract a shortcut with dedicated max_allowed_packet arg
1fa0e86f3cc MDEV-31646 fixes
f5191233418 MDEV-31812 Add switch to old_mode to disable non-locking ALTER
da81aa232c8 fix slave_exec_mode initialization
dd4e5d543d5 MDEV-31804 Assertion `thd->m_transaction_psi == __null' fails
f4c5b3bc1ba Cleanup: make slave_exec_mode of its enum type and pack Log_event better
b6b0a830774 MDEV-31838 Assertion fails upon replication online alter with MINIMAL row
a88019c0e42 MDEV-31781 ALTER TABLE ENGINE=s3 fails
7e9ea6317e1 MDEV-31777 ER_GET_ERRNO upon online alter on CONNECT table
0a32e10cb68 MDEV-31631 Adding auto-increment to table with history online misbehaves
69c7c84583d MDEV-31776 Online ALTER reports the number of affected rows incorrectly
0400e0014c7 Rows_log_event: reorganize the class layout for a lower memory footprint
ed44d2f69d0 MDEV-31775 Server crash upon online alter on sequence
b749e1287f9 add to binlog_bytes_written for an initial description event
41f7852d9f1 make a proper cleanup if online_alter_binlog is failed to create
8750a2e9ee8 MDEV-31755 Replica's DML event deadlocks wit online alter table
ebc53664028 fix key detection on replica with extra columns
6e5e24eac2a MDEV-31677 Assertion failed upon online ALTER with binlog_row_image=NOBLOB
df2b8694fef MDEV-31646 untie from max_allowed_packet and opt_binlog_rows_event_max_size
ca32d2342df MDEV-31646 Online alter applies binlog cache limit to cache writes
d80cb40f132 MDEV-31601 Some ALTER TABLEs fail ... with a wrong error message
7ed19179c69 fix -Werror build
c164dd95bd4 follow-up MDEV-30430: fix versioning.rpl
73bbbc07719 fix main.mysql57_virtual, main.alter_table, innodb.alter_algorithm
8d5a9341781 MDEV-30984 Online ALTER table is denied with non-informative error messages
3b8525ac5ee Add const to get_foreign_key_list/get_parent_foreign_key_list
c3ea029832a Add const to alloc-related thd methods
d07bd4316b5 MDEV-30987 main.alter_table_online times out with view-protocol
388d2039c5d MDEV-31059 "Slave SQL" errors upon concurrent DML and erroneous ALTER
7f0b27a33ce refactor unpack_row
ed8a15d44b5 unpack_row: unpack a correct number of fields
03f806acb95 MDEV-31058 ER_KEY_NOT_FOUND upon concurrent CHANGE column autoinc and DML
d61761f2cb4 unpack_row: set the correct fields in has_value_set for online alter
b599911c92a MDEV-30949 Direct leak in binlog_online_alter_end_trans
1a5b9fcc541 MDEV-31043 ER_KEY_NOT_FOUND upon concurrent ALTER and transaction
af106e27e27 MDEV-31033 ER_KEY_NOT_FOUND upon online COPY ALTER on a partitioned table
5258c19894c MDEV-30945 RPL tests are failing with MSAN use-of-uninitialized-value
ab3b7b8e2dc clean up Rows_log_event virtual methods
057bb76d6bb MDEV-30891 Assertion `!table->versioned(VERS_TRX_ID)' failed
4819c13a496 add partition test
b609b9c6d10 MDEV-30985 Replica stops with error on ALTER ONLINE with Geometry Types
c66d605c80e MDEV-30924 Server crashes in MYSQL_LOG::is_open upon ALTER vs FUNCTION
ff0290047d9 MDEV-30925 Assertion failed in translog_write_record in ONLINE ALTER + Aria
ef1caa6263b MDEV-30902 Server crash in LEX::first_lists_tables_same
ad27cab7806 MDEV-29068 Cascade foreign key updates do not apply in online alter
6b7a16a2f97 fix main.alter_table_{online,lock}
70e385bb6c9 MDEV-29069 follow-up: improve DEFAULT rules
3694afb4e6e MDEV-29069 follow-up: support partially usable keys
069a2063c44 MDEV-29069 follow-up: allow deterministic DEFAULTs
23f7233fb4c MDEV-29069 ER_KEY_NOT_FOUND on online autoinc addition + concurrent DELETE
8292140d1d5 cleanup: cache the result of Rows_log_event::find_key()
39820b1ef87 set table->pos_in_table_list in online alter
0d8a4a8dd19 cleanup: remove rpl_group_info::get_table_data()
0495785dd4f cleanup: ifdefs
8be6788bd88 few rgi assertions. this can proof that rgi is always present
6b90a751c57 rename rpl/rpl_alter_instant -> rpl/rpl_alter_innodb
f55f6916d2c MDEV-29038 XA assertions failing in binlog_rollback and binlog_commit
35401ae1e53 log_event.h: remove junk EOL spaces
a7e3bb4521d MDEV-29013 ER_KEY_NOT_FOUND/lock timeout upon online alter with long unique
ce7c344f861 Fix write_set too
ef10d88a43a MDEV-28816 Assertion `wsrep_thd_is_applying(thd)' failed in int wsrep_ignored_error_code(Log_event*, int)
26c94583e42 rpl: check should go after defaults and vcols update
dd1bde0a035 MDEV-29067 Online alter ignores check constraint violation
7b16df60410 don't do ALTER IGNORE TABLE online
c4d8107063b MDEV-29021 add test case from MDEV-29013
295156d1a54 Do not ignore sql_mode when replicating
37fecbc237f Simplify rgi->get_table_data call
3ad859d37ae reorder RPL_TABLE_LIST fields for better packing
6df8322726c MDEV-29021 fixup
4afa0b2e844 MDEV-29021 ALTER TABLE fails when a stored virtual column is dropped+added
36764a6f278 cleanup, remove dead code
f950f4453fd MDEV-28943 Online alter fails under LOCK TABLE with ER_ALTER_OPERATION_NOT_SUPPORTED_REASON
50ea1b9f3de MDEV-28930 ALTER TABLE Deadlocks with parallel TL_WRITE
c61df2b4a10 MDEV-28967 Assertion `marked_for_write_or_computed()' failed in Field_new_decimal::store_value / online_alter_read_from_binlog`
c6d89f0a236 remove redundant warnings in RBR and online alter
e27314cf40b cleanup: whitespace, etc
ed186479fea MDEV-28959 Online alter ignores strict table mode
43a364143ef MDEV-28944 XA assertions failing in binlog_rollback and binlog_commit
2a15af9ae9a control Cache_flip_event_log lifetime with reference count
7da5157220b MDEV-28771 Assertion `table->in_use&&tdc->flushed' failed after ALTER
462bc39a120 test rename alter_table_online -> alter_table_online_debug
d560f19f813 separate online_alter_cache_data from binlog_cache_data
e968c475e7a put binlog_cache_data on a memroot
e6d69e3f952 always commit for non-trans engines
a7b11479979 savepoints
e9891c063ff don't crash if ALTER TABLE fails and a long transaction blocks MDL upgrade
62fb6a20eb9 don't use start_consistent_snapshot
997aa8885b7 don't copy stmt IO_CACHE to trx IO_CACHE at the stmt end
28cc0fcf5a4 don't do DROP SYSTEM VERSIONING online
d96ffbde842 set read_set early, before row reads
adbba1c0920 no ALTER TABLE should return ER_NO_DEFAULT_FOR_FIELD
6fb0d9dc0a3 online alter always uses ALGORITHM=COPY, LOCK=NONE
9da6f1f4687 remove handler::open_read_view()
63d43ff120b cleanup
338fe3c6962 support 'alter online table t1 page_checksum=0'
64c21c742b2 tests: move around, add new
1af3b635610 MDEV-16329 [5/5] ALTER ONLINE TABLE
c4bb6280b2d MDEV-16329 [4/5] Refactor MYSQL_BIN_LOG: extract Event_log ancestor
bf15bfb49c7 MDEV-16329 [3/5] use binlog_cache_data directly in most places
d61f467ef4e MDEV-16329 [2/5] refactor binlog and cache_mngr
2fdc269ee25 MDEV-16329 [1/5] add THD::binlog_get_cache_mngr
--- 8c8027970ba rpl: repack table_def
--- 7c25ddfd5ff Copy_field: add const to arguments
--- 40bed167bb2 rename tests
--- 2d837d42759 binlog_combinations.inc -> binlog_format_combinations.inc
--- 979f808883b cleanup: remove vcol_info->stored_in_db
--- 10561a10ed1 Fix recalculation of vcols in binlog_row_image=minimal
--- e7f8898393b cleanup: clarify ha_innobase::column_bitmaps_signal()
--- 74b94197eb2 allow random_bytes() in virtual columns


Legend:
Some work still needs to be done on the item (continue testing, re-check after fixing, etc.)
No work expected on the item
The item is a problem which is currently considered a blocker for the feature
The item cannot be worked on for the time being
Attention point but not a blocker for the feature (low-prio bug, documentation point, unfinished check, etc.)
Work on the item is considered finished



 Comments   
Comment by Elena Stepanova [ 2023-08-11 ]

Test case to demonstrate benefits of online alter

--source include/have_innodb.inc
--source include/have_sequence.inc
 
create table t (pk int primary key, a int default 0) engine=InnoDB;
insert into t select seq, seq*10 from seq_1_to_10000000;
 
--connect (con1,localhost,root,,)
 
set @dml_start= @@timestamp;
--delimiter $
send
begin not atomic
  declare n int default 0;
  declare exit handler for 1364 begin end;
  loop
    set n = n + 1;
    update t set a = default where pk = n;
  end loop;
end $
--delimiter ;
 
--connection default
set @alter_start= @@timestamp;
alter table t modify a int not null, algorithm=copy;
select @@timestamp - @alter_start as alter_duration;
 
--connection con1
--reap
select count(*) as rows_updated from t where a = 0;
 
--connection default
show status like 'Binlog_bytes_written%';
 
# Cleanup
drop table t;

Baseline

alter_duration
24.46679401397705
 
rows_updated
0

Online alter

alter_duration
28.17583394050598
 
rows_updated
711035

Comment by Elena Stepanova [ 2023-08-12 ]

Test case to demonstrate disadvantages of online alter (in limited time)

--source include/have_innodb.inc
--source include/have_sequence.inc
 
create table t (i int, a int) engine=InnoDB;
insert into t select seq, seq*10 from seq_1_to_5000000;
 
--connect (con1,localhost,root,,)
 
--delimiter $
send
begin not atomic
  declare n int default 0;
  set @dml_start= @@timestamp;
  update t set a = 0 where a != 0 limit 10000;
  set @first_dml_duration= @@timestamp - @dml_start;
  set @dml_start= @@timestamp;
  while n < 500 do
    set n = n + 1;
    update t set a = 0 where a != 0 limit 100;
  end while;
  set @dml_loop_duration= @@timestamp - @dml_start;
end $
--delimiter ;
 
--connection default
set @alter_start= @@timestamp;
alter table t modify a int not null, algorithm=copy;
select @@timestamp - @alter_start as alter_duration;
 
--connection con1
--reap
select @first_dml_duration, @dml_loop_duration;
select count(*) as rows_updated from t where a = 0;
 
--connection default
show status like 'Binlog_bytes_written%';
 
# Cleanup
drop table t;

Baseline

alter_duration
13.081230878829956
 
@first_dml_duration	@dml_loop_duration
13.098241090774536	5.994330167770386
 
rows_updated
60000

Online alter

alter_duration
516.154294013977
 
@first_dml_duration	@dml_loop_duration
0.022433042526245117	12.315052032470703
 
rows_updated
60000

old_mode=LOCK_ALTER_TABLE_COPY

alter_duration
12.442718029022217
 
@first_dml_duration	@dml_loop_duration
12.461263179779053	6.53249192237854
 
rows_updated
60000

Generated at Thu Feb 08 10:03:38 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.