[MDEV-25292] Atomic CREATE OR REPLACE TABLE Created: 2021-03-29  Updated: 2024-01-31

Status: Stalled
Project: MariaDB Server
Component/s: Data Definition - Create Table
Fix Version/s: 11.5

Type: New Feature Priority: Critical
Reporter: Aleksey Midenkov Assignee: Michael Widenius
Resolution: Unresolved Votes: 1
Labels: Preview_10.10

Issue Links:
Blocks
blocks MDEV-4259 transactional DDL Open
blocks MDEV-27180 Fully atomic partitioning DDL operations Stalled
blocks MDEV-31138 Enable the test spider/bugfix.mdev_29... Open
is blocked by MDEV-28753 Review InnoDB changes for MDEV-25292 Closed
is blocked by MDEV-28933 CREATE OR REPLACE fails to recreate s... Closed
is blocked by MDEV-29824 Galera crashes when running CoR for a... Closed
is blocked by MDEV-29831 Galera crashes when running CoR for a... Closed
PartOf
is part of MDEV-25921 Implement CREATE TABLE...SELECT in on... Open
Problem/Incident
causes MDEV-28933 CREATE OR REPLACE fails to recreate s... Closed
causes MDEV-28952 Assertion `t->s->tmp_table || thd->md... Closed
causes MDEV-28956 Locking is broken if CREATE OR REPLAC... Closed
causes MDEV-29544 SIGSEGV in HA_CREATE_INFO::finalize_l... Closed
causes MDEV-29620 Assertion `next_insert_id == 0' faile... Closed
causes MDEV-29628 Memory leak after CREATE OR REPLACE w... Closed
causes MDEV-29664 Assertion `!n_mysql_tables_in_use' fa... Closed
causes MDEV-29676 Dual thread hang in 'closing tables' ... Closed
causes MDEV-29680 ASAN heap-use-after-free in myrocks::... Open
causes MDEV-29698 Server crash or assertion failure upo... Closed
causes MDEV-29699 Assertion `!table_list->table || frm.... Closed
causes MDEV-29701 Inconsistencies and eventual failure ... Closed
causes MDEV-29709 Unexpected ER_NO_SUCH_TABLE on CREATE... Closed
causes MDEV-29758 CREATE OR REPLACE .. SELECT of Federa... Closed
causes MDEV-29767 CREATE OR REPLACE bypasses S3 restric... Closed
causes MDEV-29770 Broken table cannot be CREATE OR REPL... Closed
causes MDEV-29772 CREATE OR REPLACE not atomic for sequ... Closed
causes MDEV-29779 Unexpected ER_ERROR_ON_RENAME upon CR... Closed
causes MDEV-29781 Server crashes in ha_myisammrg::add_c... Closed
causes MDEV-29783 ER_NO_SUCH_TABLE_IN_ENGINE after fail... Closed
causes MDEV-29787 CREATE OR REPLACE does not work with ... Closed
causes MDEV-29791 ER_NO_SUCH_TABLE_IN_ENGINE or "File n... Closed
causes MDEV-29793 Assertion failure in translog_write_r... Closed
causes MDEV-29800 ERROR 1062 (23000): Duplicate entry '... Closed
causes MDEV-29802 #sql-backup tables are visible upon C... Closed
causes MDEV-29824 Galera crashes when running CoR for a... Closed
causes MDEV-29831 Galera crashes when running CoR for a... Closed
Relates
relates to MDEV-24576 Atomic CREATE TABLE Closed
relates to MDEV-25595 DROP part of failed CREATE OR REPLACE... Closed
Sub-Tasks:
Key
Summary
Type
Status
Assignee
MDEV-28753 Review InnoDB changes for MDEV-25292 Technical task Closed Marko Mäkelä  

 Description   

Atomic drop table (MDEV-23844) does not recover the replaced table if CREATE OR REPLACE TABLE fails.

This is because current, as before, CREATE OR REPLACE TABLE foo ... is implemented as:

DROP TABLE IF EXISTS foo;
CREATE TABLE foo ..

Because of that failed CREATE OR REPLACE TABLE or a crash during the create will lead to lost table, if the original table existed.

This task equips table drop with backing up the old table until the new table was successfully created.

In other words, this task ensures that if one executes CREATE OR REPLACE TABLE foo on an existing table, the table foo will either be the new or old table after the statement has executed, including in the cases the CREATE fails or the server crashes.

NOTE: this change was in a preview release for MariaDB Community Server 10.10.0, but could not be added to 10.10 RC



 Comments   
Comment by Aleksey Midenkov [ 2021-11-29 ]

Please review bb-10.7-midenok-MDEV-25292

Comment by Aleksey Midenkov [ 2022-03-16 ]

marko Please review InnoDB part of bb-10.7-midenok-MDEV-25292 in main commit MDEV-25292 Atomic CREATE OR REPLACE TABLE

Comment by Aleksey Midenkov [ 2022-03-21 ]

elenst Please test bb-10.7-MDEV-25292-atomic-replace

Comment by Marko Mäkelä [ 2022-04-29 ]

Sorry, I had missed the request to review the InnoDB changes. I found a branch from March 22 that is 12 commits ahead and 53 commits behind 10.7. I think that this needs to be based on the latest development branch.

I would appreciate a more detailed description of the changes, and the reasons for changing lower-level InnoDB code. I would suggest you to check the TRUNCATE TABLE implementation once more and consider if a smaller change is possible. If the motivation of the more extensive changes is to support the addition of FOREIGN KEY constraints with something like the following:

CREATE TABLE parent(a INT PRIMARY KEY) ENGINE=InnoDB;
CREATE TABLE child(a INT PRIMARY KEY) ENGINE=InnoDB;
CREATE OR REPLACE TABLE child (a INT PRIMARY KEY REFERENCES parent) ENGINE=InnoDB;

then it should be mentioned in the commit message and there should be such test cases in the same commit.

Comment by Aleksey Midenkov [ 2022-05-30 ]

marko "InnoDB changes" article has been added.

Any of the branches can be reviewed:
bb-10.7-midenok-MDEV-25292
preview-10.10-MDEV-25292-atomic-cor

Comment by Aleksey Midenkov [ 2022-06-04 ]

lstartseva Please test preview-10.10-MDEV-25292-atomic-cor

Comment by Marko Mäkelä [ 2022-06-06 ]

midenok, I see that you filed a separate ticket MDEV-28753 and assigned to me, to review the InnoDB changes. I do not think that it is helpful to split any code review to multiple tickets. I have added a "blocking" relation, so that this ticket cannot be completed without that review being completed.

Because no commit hashes were mentioned in earlier comments and the named branches have been overwritten, I cannot tell if any attempt to simplify the InnoDB changes as I requested was made.

Did someone review the rest of the changes before this entered testing?

Comment by Aleksey Midenkov [ 2022-06-09 ]

marko Since I made it originally I do not see how it can be more simple, the changes are pretty simple. If you know how to improve them, please do the concrete suggestion. That's the point of review: when the other people see what can't be seen by the author. Other code was reviewed and approved by Michael Widenius.

The new subtask is helpful because more than one people worked on the task simultaneously. To notify you I made a subtask. Blocking relation is not very helpful, the subtask just notifies you to do the review. I will not push that until your approval anyway.

Comment by Marko Mäkelä [ 2022-06-13 ]

I think that CREATE OR REPLACE TABLE must be implemented at as high level as possible, similar to how ha_innobase::truncate() currently implements TRUNCATE TABLE as a combination of RENAME, CREATE, DROP. The pseudocode should be something like this:

  1. Start a dictionary transaction.
  2. Try to rename a table from the specified name to a temporary name. If the table does not exist, fine.
  3. If the table existed, delete all its FOREIGN KEY constraints from SYS_FOREIGN and SYS_FOREIGN_COLS (in the same transaction).
  4. CREATE the table (in the same transaction).
  5. If the table existed on rename, drop the original table (in the same transaction).
  6. Commit the dictionary transaction.
  7. On any error, roll back the dictionary transaction. Try to avoid modifying the InnoDB dict_sys cache before the transaction was committed.

Similar to TRUNCATE TABLE, FOREIGN KEY constraints (which are maintained inside InnoDB until MDEV-16417 has been implemented) will require some additional handling. During TRUNCATE, FOREIGN KEY constraints must simply be preserved; during the RENAME part of CREATE OR REPLACE TABLE, the constraints may have to be renamed or deleted specially in order to avoid duplicated values of databasename/constraintname. Implicit names of foreign key constraints have names like databasename/tablename_ibfk_1. I think that it would be simplest to delete all constraints that were part of the table. If the dictionary transaction is rolled back, the deletion of those records will be rolled back as well. There should be no need to rename any FOREIGN KEY constraints.

I would like to see a clear description how FOREIGN KEY constraints are supposed to be handled. One commit message mentions "constraints" but does not specifically say FOREIGN KEY. For example, CHECK constraints are handled outside InnoDB. It also says something about "backup" and "tmp table", which might refer to renaming a persistent table to a generated name.

Furthermore, I would like to see a high level description how CREATE OR REPLACE TABLE…SELECT is intended to work. Currently, CREATE TABLE…SELECT is executing two transactions. First, ha_innobase::create() will create and empty table, and then in a separate transaction, something like INSERT…SELECT will be executed. On any error (deadlock, lock wait timeout, duplicate key error), the table will be dropped. What is expected to happen in the scope of this work? Should an existing table be dropped, or should it be preserved if populating the replacement table fails?

Comment by Michael Widenius [ 2022-06-13 ]

First to answer Marko:

  • Everything is done on a high level except a small change to InnoDB to inform it that we are creating
    a temporary table and a change in S3 to inform the top level that renames are very expensive.
  • The code guarantees that either you get a new table with all expected data or the old table will be restored.

The commit message explains how the code works:

Atomic replace algorithm

Two DDL chains are used for CREATE OR REPLACE:
ddl_log_state_create (C) and ddl_log_state_rm (D).

1. (C) Log CREATE_TABLE_ACTION of TMP table (drops TMP table);
2. Create new table as TMP;
3. Do everything with TMP (like insert data);

finalize_atomic_replace():
4. Link chains: (D) is executed only if (C) is closed;
5. (D) Log DROP_ACTION of BACKUP;
6. (C) Log RENAME_TABLE_ACTION from ORIG to BACKUP (replays BACKUP -> ORIG);
7. Rename ORIG to BACKUP;
8. (C) Log CREATE_TABLE_ACTION of ORIG (drops ORIG);
9. Rename TMP to ORIG;

finalize_ddl() in case of success:
10. Close (C);
11. Replay (D): BACKUP is dropped.

finalize_ddl() in case of error:
10. Close (D);
11. Replay (C):
1) ORIG is dropped (only after finalize_atomic_replace());
2) BACKUP renamed to ORIG (only after finalize_atomic_replace());
3) drop TMP.

Temporary table for CREATE OR REPLACE

Before dropping "old" table, CREATE OR REPLACE creates "tmp" table.
ddl_log_state_create holds the drop of the "tmp" table. When
everything is OK (data is inserted, "tmp" is ready) ddl_log_state_rm
is written to replace "old" with "tmp". Until ddl_log_state_create
is closed ddl_log_state_rm is not executed.

After the binlogging is done ddl_log_state_create is closed. At that
point ddl_log_state_rm is executed and "tmp" is replaced with
"old". That is: final rename is done by the DDL log.

With that important role of DDL log for CREATE OR REPLACE operation
replay of ddl_log_state_rm must fail at the first hit error and
print the error message if possible. F.ex. foreign key error is
discovered at this phase: InnoDB rejects to drop the "old" table and
returns corresponding foreign key error code.

Additional notes

  • CREATE TABLE without REPLACE is not affected by this commit.
  • Engines having HTON_EXPENSIVE_RENAME flag set are not affected by
    this commit.
  • CREATE TABLE .. SELECT XID usage is fixed and now there is no need
    to log DROP TABLE via DDL_CREATE_TABLE_PHASE_LOG (see comments in
    do_postlock()). XID is now correctly updated so it disables
    DDL_LOG_DROP_TABLE_ACTION. Note that binary log is flushed at the
    final stage when the table is ready. So if we have XID in the
    binary log we don't need to drop the table.
  • Three variations of CREATE OR REPLACE handled:

1. CREATE OR REPLACE TABLE t1 (..);
2. CREATE OR REPLACE TABLE t1 LIKE t2;
3. CREATE OR REPLACE TABLE t1 SELECT ..;

  • Test case uses 5 combinations for engines (aria, aria_notrans,
    myisam, ib, expensive_rename) and 2 combinations for binlog types
    (row, stmt). Combinations help to check differences between the
    results. Error failures are tested for the above three variations.
  • Triggers mechanism is unaffected by this change. This is tested in
    create_replace.test.
  • LOCK TABLES is affected. Lock restoration must be done after "rm"
    chain is replayed.
  • Moved ddl_log_complete() from send_eof() to finalize_ddl(). This
    checkpoint was not executed before for normal CREATE TABLE but is
    executed now.

This means that this checkpoint was not executed before for normal
CREATE TABLE but is excuted now.

Rename and drop via DDL log

We replay ddl_log_state_rm to drop the old table and rename the
temporary table. In that case we must throw the correct error
message if ddl_log_revert() fails (f.ex. on FK error).

If table is deleted earlier and not via DDL log and the crash
happened, the create chain is not closed. Linked drop chain is not
executed and the new table is not installed. But the old table is
already deleted.

ddl_log.cc changes

Now we can place action before DDL_LOG_DROP_INIT_ACTION and it will
be replayed after DDL_LOG_DROP_TABLE_ACTION.

report_error parameter for ddl_log_revert() allows to fail at first
error and print the error message if possible.
ddl_log_execute_action() now can print error message.

Since we now can handle errors from ddl_log_execute_action() (in
case of non-recovery execution) unconditional setting "error= TRUE"
is wrong (it was wrong anyway because it was overwritten at the end
of the function).

On XID usage

Like with all other atomic DDL operations XID is used to avoid
inconsistency between master and slave in the case of a crash after
binary log is written and before ddl_log_state_create is closed. On
recovery XIDs are taken from binary log and corresponding DDL log
events get disabled. That is done by
ddl_log_close_binlogged_events().

On linking two chains together

Chains are executed in the ascending order of entry_pos of execute
entries. But entry_pos assignment order is undefined: it may assign
bigger number for the first chain and then smaller number for the
second chain. So the execution order in that case will be reverse:
second chain will be executed first.

To avoid that we link one chain to another. While the base chain
(ddl_log_state_create) is active the secondary chain
(ddl_log_state_rm) is not executed. That is: only one chain can be
executed in two linked chains.

The interface ddl_log_link_chains() was done in "MDEV-22166
ddl_log_write_execute_entry() extension".

More on CREATE OR REPLACE .. SELECT

We use create_and_open_tmp_table() like in ALTER TABLE to create
temporary TABLE object (tmp_table is (NON_)TRANSACTIONAL_TMP_TABLE).

After we created such TABLE object we use create_info->tmp_table()
instead of table->s->tmp_table when we need to check for
parser-requested tmp-table.

External locking is required for temporary table created by
create_and_open_tmp_table(). F.ex. that disables logging for Aria
transactional tables and without that (when no mysql_lock_tables()
is done) it cannot work correctly.

For making external lock the patch requires Aria table to work in
non-transactional mode. That is usually done by
ha_enable_transaction(false). But we cannot disable transaction
completely because: 1. binlog rollback removes pending row events
(binlog_remove_pending_rows_event()). The row events are added
during CREATE .. SELECT data insertion phase. 2. replication slave
highly depends on transaction and cannot work without it.

So we put temporary Aria table into non-transactional mode with
"thd->transaction->on hack". See comment for on_save variable.

Note that Aria table has internal_table mode. But we cannot use it
because:

if (!internal_table)

{ mysql_mutex_lock(&THR_LOCK_myisam); old_info= test_if_reopen(name_buff); }

For internal_table test_if_reopen() is not called and we get a new
MARIA_SHARE for each file handler. In that case duplicate errors are
missed because insert and lookup in CREATE .. SELECT is done via two
different handlers (see create_lookup_handler()).

For temporary table before dropping TABLE_SHARE by
drop_temporary_table() we must do ha_reset(). ha_reset() releases
storage share. Without that the share is kept and the second CREATE
OR REPLACE .. SELECT fails with:

HA_ERR_TABLE_EXIST (156): MyISAM table '#sql-create-b5377-4-t2' is
in use (most likely by a MERGE table). Try FLUSH TABLES.

HA_EXTRA_PREPARE_FOR_DROP also removes MYISAM_SHARE, but that is
not needed as ha_reset() does the job.

ha_reset() is usually done by
mark_tmp_table_as_free_for_reuse(). But we don't need that mechanism
for our temporary table.

Atomic_info in HA_CREATE_INFO

Many functions in CREATE TABLE pass the same parameters. These
parameters are part of table creation info and should be in
HA_CREATE_INFO (or whatever). Passing parameters via single
structure is much easier for adding new data and
refactoring.

Comment by Aleksey Midenkov [ 2022-06-13 ]

marko the atomic DDL is engine independent and therefore transactions independent. It has to handle some engine peculiarities though, one of this is the requirement to preserve original foreign keys and restore them back.

Comment by Marko Mäkelä [ 2022-06-17 ]

Why exactly was the use of InnoDB persistent statistics disabled in several tests? The commit message is very vague. How is that related to this work at all? I think that it should be filed and fixed separately. Failures of those tests seem to be very seldom, and I did not see any test failures even after reverting that change.

Persistent statistics have been enabled by default since MySQL 5.6.6 or MariaDB Server 10.0. Ever since MDEV-25919 was implemented, no problems should be anticipated.

More alarming is that in the test main.create_or_replace, the use of InnoDB persistent statistics is disabled. That seems to be totally unnecessary; only the metadata locks on the InnoDB statistics tables need to be filtered out from the expected result. I did not observe any failures after not disabling persistent statistics.

I will post some suggested InnoDB code changes later.

Comment by Marko Mäkelä [ 2022-06-17 ]

I refactored dict_get_referenced_table() and its callers. I think that before this refactoring, there was a bug that CREATE OR REPLACE TABLE db.x with REFERENCES some_other_db.x was mishandled.

As far as I understand, the rename-based approach is needed so that CREATE OR REPLACE TABLE…SELECT can restore the original table in case the INSERT…SELECT part of the statement fails. I think that the following scenarios must be tested (or existing mtr tests expanded):

  • Crash recovery where CREATE OR REPLACE TABLE is replacing a table, such that the same named constraint (say, CONSTRAINT a FOREIGN KEY b(c) REFERENCES parent(c)) is specified for both the old and the replacing table
  • Rollback of CREATE OR REPLACE TABLE…SELECT when named FOREIGN KEY constraints exist
  • Crash recovery and rollback where another table is referencing the table via a named constraint
  • Schema and table names starting with #mysql50#, such that FOREIGN KEY constraints exist between tables
  • CREATE OR REPLACE TABLE that adds or removes a FOREIGN KEY constraint

I am requesting these, because I suspect that duplicate key errors for constraints like dbname/fk_constraint_name would be issued. I expect anonymous constraints to be fine; a unique internal name like dbname/tablename_ibfk_1 should be created when renaming the table.

I plan to suggest some changes to row_rename_table_for_mysql() as well. I hope to preserve the assertion ut_ad(err != DB_DUPLICATE_KEY); in some form.

Comment by Sergei Golubchik [ 2022-06-18 ]

In the branch preview-10.10-ddl
And in bb-10.10-MDEV-25292

Comment by Marko Mäkelä [ 2022-06-21 ]

I revised the RENAME changes in InnoDB and restored a removed debug assertion. Now this should be ready for testing.

Comment by Aleksey Midenkov [ 2022-06-21 ]

marko You can check cross-reference report: multi_source.multi_parallel fails at least in 10.5 - 10.9.

Comment by Aleksey Midenkov [ 2022-06-21 ]

More alarming is that in the test main.create_or_replace, the use of InnoDB persistent statistics is disabled. That seems to be totally unnecessary; only the metadata locks on the InnoDB statistics tables need to be filtered out from the expected result. I did not observe any failures after not disabling persistent statistics.

I don't like 2-line-long select in create_or_replace.test in such many places. Too much garbage out of business logic. Is it important to have persistent statististics in that test?

Comment by Marko Mäkelä [ 2022-06-23 ]

midenok, yes, it is important to test with default settings. The InnoDB persistent statistics do play a role during DDL, and we must cover it. Work-arounds like MDEV-4750 can significantly reduce coverage in our regression tests. Should there be a race condition or other glitch somewhere, sooner or later a regression test would fail on our CI system. If we disable persistent statistics, we will never catch such failures.

I amended the main commit with some clarified wording and my version of the InnoDB changes. I also confirmed that there is a regression for a case that I described 5 days ago:

--source include/have_innodb.inc
CREATE TABLE t(a INT PRIMARY KEY) ENGINE=InnoDB;
CREATE OR REPLACE TABLE u(a INT PRIMARY KEY, CONSTRAINT c FOREIGN KEY d (a) REFERENCES t (a)) ENGINE=InnoDB;
CREATE OR REPLACE TABLE u(a INT PRIMARY KEY, CONSTRAINT c FOREIGN KEY d (a) REFERENCES t (a)) ENGINE=InnoDB;
DROP TABLE u, t;

This test used to pass, but now it would fail due to a duplicate key on SYS_FOREIGN.NAME on the non-anonymous constraint name:

mysqltest: At line 4: query 'CREATE OR REPLACE TABLE u(a INT PRIMARY KEY, CONSTRAINT c FOREIGN KEY d (a) REFERENCES t (a)) ENGINE=InnoDB' failed: ER_CANT_CREATE_TABLE (1005): Can't create table `test`.`u` (errno: 121 "Duplicate key on write or update")

For anonymous constraints there is no problem, because the automatically generated names will be of the form databasename/tablename_ibfk_1 and the different table name will make the constraint name unique. With the above test case, assuming that the name of the current database is test, there will be a duplicate on test/c or test/d. I didn’t check which of the names InnoDB will use internally. If either c or d is specified in the CREATE OR REPLACE statement and the statement is duplicated, the test will fail. It will pass if neither c nor c is specified, that is, the constraint will be anonymous.

This is a regression that needs to be fixed.

Comment by Aleksey Midenkov [ 2022-06-27 ]

marko, no, if you enable some innodb functionality for all tests, you do not require to change all tests. And you do not require to add `where table_schema!='mysql' or table_name not like 'innodb_%_stats'` to every `select * from information_schema.metadata_lock_info`! Please, test everything you need in innodb suite.

Comment by Lena Startseva [ 2022-07-28 ]

Testing is complete, if marko doesn't have any more objections, I guess it can be merged.

Comment by Michael Widenius [ 2023-01-25 ]

I will review this when I am ready with MDEV-27180. Because of the extensive changes in the code, the review process does take some time.

Comment by Aleksey Midenkov [ 2023-01-26 ]

monty I already closed the bugs. But this ticket waits you. The branch is bb-11.0-midenok-MDEV-25292 (27 top commits)

Comment by Aleksey Midenkov [ 2023-09-05 ]

Can we schedule that please or assign to another person? The task waits more than half year.

Generated at Thu Feb 08 09:36:35 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.