Details

    Description

      Atomic drop table (MDEV-23844) does not recover the replaced table if CREATE OR REPLACE TABLE fails.

      This is because CREATE OR REPLACE TABLE foo ... is still implemented, as before, as:

      DROP TABLE IF EXISTS foo;
      CREATE TABLE foo ..

      Because of that, a failed CREATE OR REPLACE TABLE, or a crash during the create, will lead to a lost table if the original table existed.

      This task equips the table drop with a backup of the old table, kept until the new table has been successfully created.

      In other words, this task ensures that if one executes CREATE OR REPLACE TABLE foo on an existing table, foo will be either the new table or the old table after the statement has executed, including in cases where the CREATE fails or the server crashes.
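      The intended semantics can be illustrated with a small Python sketch (purely hypothetical code, not part of the server; the function names and the dict standing in for the data dictionary are made up for illustration):

```python
def create_or_replace_naive(tables, name, create_ok):
    """Old behaviour: DROP TABLE IF EXISTS, then CREATE."""
    tables.pop(name, None)              # drop the old table first
    if not create_ok:
        return False                    # CREATE failed: old table already lost
    tables[name] = "new"
    return True

def create_or_replace_atomic(tables, name, create_ok):
    """New behaviour: keep a backup of the old table until CREATE succeeds."""
    backup = tables.pop(name, None)     # move the old table aside, do not drop yet
    if not create_ok:
        if backup is not None:
            tables[name] = backup       # CREATE failed: restore the old table
        return False
    tables[name] = "new"                # success: the backup can now be dropped
    return True

t = {"foo": "old"}
create_or_replace_naive(t, "foo", create_ok=False)
assert "foo" not in t                   # the old table is lost

t = {"foo": "old"}
create_or_replace_atomic(t, "foo", create_ok=False)
assert t["foo"] == "old"                # the old table is restored
```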

      NOTE: this change was in a preview release for MariaDB Community Server 10.10.0, but could not be added to the 10.10 RC.


          Activity

            midenok Aleksey Midenkov added a comment - edited

            Please review bb-10.7-midenok-MDEV-25292
            midenok Aleksey Midenkov added a comment - edited

            marko Please review InnoDB part of bb-10.7-midenok-MDEV-25292 in main commit MDEV-25292 Atomic CREATE OR REPLACE TABLE

            midenok Aleksey Midenkov added a comment - edited

            elenst Please test bb-10.7-MDEV-25292-atomic-replace


            Sorry, I had missed the request to review the InnoDB changes. I found a branch from March 22 that is 12 commits ahead and 53 commits behind 10.7. I think that this needs to be based on the latest development branch.

            I would appreciate a more detailed description of the changes and the reasons for changing lower-level InnoDB code. I would suggest that you check the TRUNCATE TABLE implementation once more and consider whether a smaller change is possible. If the motivation for the more extensive changes is to support the addition of FOREIGN KEY constraints with something like the following:

            CREATE TABLE parent(a INT PRIMARY KEY) ENGINE=InnoDB;
            CREATE TABLE child(a INT PRIMARY KEY) ENGINE=InnoDB;
            CREATE OR REPLACE TABLE child (a INT PRIMARY KEY REFERENCES parent) ENGINE=InnoDB;
            

            then it should be mentioned in the commit message and there should be such test cases in the same commit.

            marko Marko Mäkelä added a comment -
            midenok Aleksey Midenkov added a comment - edited

            marko "InnoDB changes" article has been added.

            Any of the branches can be reviewed:
            bb-10.7-midenok-MDEV-25292
            preview-10.10-MDEV-25292-atomic-cor


            lstartseva Please test preview-10.10-MDEV-25292-atomic-cor

            midenok Aleksey Midenkov added a comment -

            midenok, I see that you filed a separate ticket MDEV-28753 and assigned to me, to review the InnoDB changes. I do not think that it is helpful to split any code review to multiple tickets. I have added a "blocking" relation, so that this ticket cannot be completed without that review being completed.

            Because no commit hashes were mentioned in earlier comments and the named branches have been overwritten, I cannot tell if any attempt to simplify the InnoDB changes as I requested was made.

            Did someone review the rest of the changes before this entered testing?

            marko Marko Mäkelä added a comment -

            marko Since I wrote it originally, I do not see how it can be made simpler; the changes are fairly simple. If you know how to improve them, please make a concrete suggestion. That is the point of a review: other people see what the author cannot. The other code was reviewed and approved by Michael Widenius.

            The new subtask is helpful because more than one person worked on the task simultaneously; I created the subtask to notify you. The blocking relation is not very helpful, as the subtask already notifies you to do the review. I will not push this until your approval anyway.

            midenok Aleksey Midenkov added a comment -

            I think that CREATE OR REPLACE TABLE must be implemented at as high a level as possible, similar to how ha_innobase::truncate() currently implements TRUNCATE TABLE as a combination of RENAME, CREATE, DROP. The pseudocode should be something like this:

            1. Start a dictionary transaction.
            2. Try to rename a table from the specified name to a temporary name. If the table does not exist, fine.
            3. If the table existed, delete all its FOREIGN KEY constraints from SYS_FOREIGN and SYS_FOREIGN_COLS (in the same transaction).
            4. CREATE the table (in the same transaction).
            5. If the table existed on rename, drop the original table (in the same transaction).
            6. Commit the dictionary transaction.
            7. On any error, roll back the dictionary transaction. Try to avoid modifying the InnoDB dict_sys cache before the transaction was committed.
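            The pseudocode above can be modelled as one all-or-nothing dictionary transaction. This is a hypothetical Python sketch (the real code operates on SYS_FOREIGN/SYS_FOREIGN_COLS inside InnoDB; the dict layout here is invented for illustration):

```python
import copy

def replace_table(dict_sys, name, new_def, fail_at=None):
    """Steps 1-7 as one all-or-nothing change: work on a copy (the
    'dictionary transaction') and either commit it or discard it."""
    trx = copy.deepcopy(dict_sys)                        # 1. start the transaction
    existed = name in trx["tables"]
    if existed:
        trx["tables"]["#tmp"] = trx["tables"].pop(name)  # 2. rename to a temp name
        trx["foreign"] = [fk for fk in trx["foreign"]    # 3. delete its FK records
                          if fk["table"] != name]
    if fail_at == "create":                              # 7. on error, roll back
        return dict_sys                                  #    by discarding the copy
    trx["tables"][name] = new_def                        # 4. CREATE the table
    if existed:
        del trx["tables"]["#tmp"]                        # 5. drop the original
    return trx                                           # 6. commit

d = {"tables": {"child": "old"},
     "foreign": [{"table": "child", "name": "child_ibfk_1"}]}
committed = replace_table(d, "child", "new")
assert committed["tables"] == {"child": "new"} and committed["foreign"] == []
rolled_back = replace_table(d, "child", "new", fail_at="create")
assert rolled_back == d                  # on error the dictionary is unchanged
```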

            Similar to TRUNCATE TABLE, FOREIGN KEY constraints (which are maintained inside InnoDB until MDEV-16417 has been implemented) will require some additional handling. During TRUNCATE, FOREIGN KEY constraints must simply be preserved; during the RENAME part of CREATE OR REPLACE TABLE, the constraints may have to be renamed or deleted specially in order to avoid duplicated values of databasename/constraintname. Implicit names of foreign key constraints have names like databasename/tablename_ibfk_1. I think that it would be simplest to delete all constraints that were part of the table. If the dictionary transaction is rolled back, the deletion of those records will be rolled back as well. There should be no need to rename any FOREIGN KEY constraints.

            I would like to see a clear description of how FOREIGN KEY constraints are supposed to be handled. One commit message mentions "constraints" but does not specifically say FOREIGN KEY; CHECK constraints, for example, are handled outside InnoDB. It also says something about "backup" and "tmp table", which might refer to renaming a persistent table to a generated name.

            Furthermore, I would like to see a high-level description of how CREATE OR REPLACE TABLE…SELECT is intended to work. Currently, CREATE TABLE…SELECT executes two transactions: first, ha_innobase::create() will create an empty table, and then, in a separate transaction, something like INSERT…SELECT will be executed. On any error (deadlock, lock wait timeout, duplicate key error), the table will be dropped. What is expected to happen in the scope of this work? Should an existing table be dropped, or should it be preserved if populating the replacement table fails?

            marko Marko Mäkelä added a comment -

            First to answer Marko:

            • Everything is done at a high level, except a small change to InnoDB to inform it that we are creating
              a temporary table and a change in S3 to inform the top level that renames are very expensive.
            • The code guarantees that either you get a new table with all expected data or the old table will be restored.

            The commit message explains how the code works:

            Atomic replace algorithm

            Two DDL chains are used for CREATE OR REPLACE:
            ddl_log_state_create (C) and ddl_log_state_rm (D).

            1. (C) Log CREATE_TABLE_ACTION of TMP table (drops TMP table);
            2. Create new table as TMP;
            3. Do everything with TMP (like insert data);

            finalize_atomic_replace():
            4. Link chains: (D) is executed only if (C) is closed;
            5. (D) Log DROP_ACTION of BACKUP;
            6. (C) Log RENAME_TABLE_ACTION from ORIG to BACKUP (replays BACKUP -> ORIG);
            7. Rename ORIG to BACKUP;
            8. (C) Log CREATE_TABLE_ACTION of ORIG (drops ORIG);
            9. Rename TMP to ORIG;

            finalize_ddl() in case of success:
            10. Close (C);
            11. Replay (D): BACKUP is dropped.

            finalize_ddl() in case of error:
            10. Close (D);
            11. Replay (C):
            1) ORIG is dropped (only after finalize_atomic_replace());
            2) BACKUP renamed to ORIG (only after finalize_atomic_replace());
            3) drop TMP.
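            The success and error paths of the steps above can be sketched in Python (a hypothetical model, not server code; a failure is assumed to happen before finalize_atomic_replace()):

```python
def run(fail):
    """Happy path vs. error path of the two-chain algorithm."""
    tables = {"ORIG": "old"}
    tables["TMP"] = "new"                       # steps 1-3: build the new table
    if not fail:
        # finalize_atomic_replace(), steps 6-9
        tables["BACKUP"] = tables.pop("ORIG")   # 7. rename ORIG -> BACKUP
        tables["ORIG"] = tables.pop("TMP")      # 9. rename TMP  -> ORIG
        # success: close (C), then replay (D), which drops BACKUP
        tables.pop("BACKUP")
    else:
        # error: close (D), then replay (C); since steps 6-9 never ran,
        # replaying (C) only has to drop TMP
        tables.pop("TMP")
    return tables

assert run(fail=False) == {"ORIG": "new"}       # new table installed
assert run(fail=True) == {"ORIG": "old"}        # old table kept intact
```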

            Temporary table for CREATE OR REPLACE

            Before dropping "old" table, CREATE OR REPLACE creates "tmp" table.
            ddl_log_state_create holds the drop of the "tmp" table. When
            everything is OK (data is inserted, "tmp" is ready) ddl_log_state_rm
            is written to replace "old" with "tmp". Until ddl_log_state_create
            is closed ddl_log_state_rm is not executed.

            After the binlogging is done ddl_log_state_create is closed. At that
            point ddl_log_state_rm is executed and "tmp" is replaced with
            "old". That is: final rename is done by the DDL log.

            Given the DDL log's important role in the CREATE OR REPLACE
            operation, replay of ddl_log_state_rm must fail at the first error
            hit and print the error message if possible. For example, a foreign
            key error is discovered at this phase: InnoDB refuses to drop the
            "old" table and returns the corresponding foreign key error code.

            Additional notes

            • CREATE TABLE without REPLACE is not affected by this commit.
            • Engines having HTON_EXPENSIVE_RENAME flag set are not affected by
              this commit.
            • CREATE TABLE .. SELECT XID usage is fixed and now there is no need
              to log DROP TABLE via DDL_CREATE_TABLE_PHASE_LOG (see comments in
              do_postlock()). XID is now correctly updated so it disables
              DDL_LOG_DROP_TABLE_ACTION. Note that binary log is flushed at the
              final stage when the table is ready. So if we have XID in the
              binary log we don't need to drop the table.
            • Three variations of CREATE OR REPLACE handled:

            1. CREATE OR REPLACE TABLE t1 (..);
            2. CREATE OR REPLACE TABLE t1 LIKE t2;
            3. CREATE OR REPLACE TABLE t1 SELECT ..;

            • Test case uses 5 combinations for engines (aria, aria_notrans,
              myisam, ib, expensive_rename) and 2 combinations for binlog types
              (row, stmt). Combinations help to check differences between the
              results. Error failures are tested for the above three variations.
            • The trigger mechanism is unaffected by this change. This is
              tested in create_replace.test.
            • LOCK TABLES is affected. Lock restoration must be done after the
              "rm" chain is replayed.
            • Moved ddl_log_complete() from send_eof() to finalize_ddl(). This
              checkpoint was not executed before for normal CREATE TABLE but is
              executed now.


            Rename and drop via DDL log

            We replay ddl_log_state_rm to drop the old table and rename the
            temporary table. In that case we must raise the correct error
            message if ddl_log_revert() fails (e.g. on a FK error).

            If the table was deleted earlier, not via the DDL log, and a crash
            happens, the create chain is not closed. The linked drop chain is
            then not executed and the new table is not installed, but the old
            table is already deleted.

            ddl_log.cc changes

            Now we can place an action before DDL_LOG_DROP_INIT_ACTION and it
            will be replayed after DDL_LOG_DROP_TABLE_ACTION.

            The report_error parameter of ddl_log_revert() allows it to fail at
            the first error and print the error message if possible.
            ddl_log_execute_action() can now print an error message.

            Since we can now handle errors from ddl_log_execute_action() (in the
            case of non-recovery execution), unconditionally setting
            "error= TRUE" is wrong (it was wrong anyway, because it was
            overwritten at the end of the function).

            On XID usage

            As with all other atomic DDL operations, an XID is used to avoid
            inconsistency between master and slave in the case of a crash after
            the binary log is written and before ddl_log_state_create is
            closed. On recovery, XIDs are taken from the binary log and the
            corresponding DDL log events are disabled. That is done by
            ddl_log_close_binlogged_events().

            On linking two chains together

            Chains are executed in ascending order of the entry_pos of their
            execute entries. But the order in which entry_pos values are
            assigned is undefined: the first chain may get a bigger number and
            the second chain a smaller one. In that case the execution order is
            reversed: the second chain is executed first.

            To avoid that, we link one chain to the other. While the base chain
            (ddl_log_state_create) is active, the secondary chain
            (ddl_log_state_rm) is not executed. That is: of two linked chains,
            only one is executed.

            The interface ddl_log_link_chains() was done in "MDEV-22166
            ddl_log_write_execute_entry() extension".
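            The linking behaviour described above can be sketched as follows (a hypothetical Python model of replay ordering, not the ddl_log.cc implementation):

```python
def replay(entries, links):
    """entries: entry_pos -> chain name; links: chain -> chain that must
    be closed (inactive) before this one may execute."""
    executed = []
    active = set(entries.values())
    for pos in sorted(entries):                 # ascending entry_pos order
        chain = entries[pos]
        blocker = links.get(chain)
        if blocker is not None and blocker in active:
            continue                            # base chain still active: skip
        executed.append(chain)
    return executed

# entry_pos assignment happened "backwards": the rm chain got the smaller
# position, so without linking it would be replayed first.
entries = {1: "ddl_log_state_rm", 2: "ddl_log_state_create"}
assert replay(entries, {}) == ["ddl_log_state_rm", "ddl_log_state_create"]

# With the chains linked, rm is skipped while create is still active.
links = {"ddl_log_state_rm": "ddl_log_state_create"}
assert replay(entries, links) == ["ddl_log_state_create"]
```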

            More on CREATE OR REPLACE .. SELECT

            We use create_and_open_tmp_table(), as in ALTER TABLE, to create a
            temporary TABLE object (tmp_table is (NON_)TRANSACTIONAL_TMP_TABLE).

            After we have created such a TABLE object, we use
            create_info->tmp_table() instead of table->s->tmp_table when we
            need to check for a parser-requested tmp table.

            External locking is required for the temporary table created by
            create_and_open_tmp_table(). For example, it disables logging for
            Aria transactional tables; without it (when no mysql_lock_tables()
            is done), they cannot work correctly.

            To make the external lock work, the patch requires the Aria table
            to run in non-transactional mode. That is usually done by
            ha_enable_transaction(false), but we cannot disable the transaction
            completely because: 1. binlog rollback removes pending row events
            (binlog_remove_pending_rows_event()), and the row events are added
            during the CREATE .. SELECT data insertion phase; 2. the
            replication slave depends heavily on the transaction and cannot
            work without it.

            So we put the temporary Aria table into non-transactional mode with
            the thd->transaction->on hack. See the comment for the on_save
            variable.

            Note that Aria tables have an internal_table mode, but we cannot
            use it because:

            if (!internal_table)
            {
              mysql_mutex_lock(&THR_LOCK_myisam);
              old_info= test_if_reopen(name_buff);
            }

            For internal_table, test_if_reopen() is not called and we get a new
            MARIA_SHARE for each file handler. In that case duplicate errors
            are missed, because the insert and the lookup in CREATE .. SELECT
            are done via two different handlers (see create_lookup_handler()).
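            The share-reuse point can be illustrated with a small Python model (hypothetical; the Share class and registry merely stand in for MARIA_SHARE and test_if_reopen()):

```python
class Share:
    """Stand-in for a MARIA_SHARE: per-table state, including unique keys."""
    def __init__(self):
        self.keys = set()

shares = {}                                     # open-share registry

def open_handler(name, internal_table):
    if internal_table:
        return Share()                          # internal_table: fresh share
    return shares.setdefault(name, Share())     # else reuse the open share

def insert(share, key):
    if key in share.keys:
        return "DUPLICATE"
    share.keys.add(key)
    return "OK"

# Normal mode: the insert handler and the lookup handler use one shared
# MARIA_SHARE, so the duplicate is detected.
h_insert = open_handler("t1", internal_table=False)
h_lookup = open_handler("t1", internal_table=False)
insert(h_insert, 1)
assert insert(h_lookup, 1) == "DUPLICATE"

# internal_table mode: each handler gets its own share, so the duplicate
# goes unnoticed.
g_insert = open_handler("t2", internal_table=True)
g_lookup = open_handler("t2", internal_table=True)
insert(g_insert, 1)
assert insert(g_lookup, 1) == "OK"
```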

            For the temporary table, before dropping its TABLE_SHARE via
            drop_temporary_table() we must call ha_reset(). ha_reset() releases
            the storage share. Without that, the share is kept and a second
            CREATE OR REPLACE .. SELECT fails with:

            HA_ERR_TABLE_EXIST (156): MyISAM table '#sql-create-b5377-4-t2' is
            in use (most likely by a MERGE table). Try FLUSH TABLES.

            HA_EXTRA_PREPARE_FOR_DROP also removes MYISAM_SHARE, but that is
            not needed as ha_reset() does the job.

            ha_reset() is usually called from
            mark_tmp_table_as_free_for_reuse(), but we do not need that
            mechanism for our temporary table.

            Atomic_info in HA_CREATE_INFO

            Many functions in CREATE TABLE pass around the same parameters.
            These parameters are part of the table creation info and should
            live in HA_CREATE_INFO (or a similar structure). Passing parameters
            via a single structure makes adding new data and refactoring much
            easier.

            monty Michael Widenius added a comment -

            marko the atomic DDL is engine independent and therefore transaction independent. It has to handle some engine peculiarities, though; one of these is the requirement to preserve the original foreign keys and restore them afterwards.

            midenok Aleksey Midenkov added a comment -

            Why exactly was the use of InnoDB persistent statistics disabled in several tests? The commit message is very vague. How is that related to this work at all? I think that it should be filed and fixed separately. Failures of those tests seem to be very rare, and I did not see any test failures even after reverting that change.

            Persistent statistics have been enabled by default since MySQL 5.6.6 or MariaDB Server 10.0. Ever since MDEV-25919 was implemented, no problems should be anticipated.

            More alarming is that in the test main.create_or_replace, the use of InnoDB persistent statistics is disabled. That seems to be totally unnecessary; only the metadata locks on the InnoDB statistics tables need to be filtered out of the expected result. I did not observe any failures when persistent statistics were left enabled.

            I will post some suggested InnoDB code changes later.

            marko Marko Mäkelä added a comment -

            I refactored dict_get_referenced_table() and its callers. I think that before this refactoring there was a bug: CREATE OR REPLACE TABLE db.x with REFERENCES some_other_db.x was mishandled.

            As far as I understand, the rename-based approach is needed so that CREATE OR REPLACE TABLE…SELECT can restore the original table in case the INSERT…SELECT part of the statement fails. I think that the following scenarios must be tested (or existing mtr tests expanded):

            • Crash recovery where CREATE OR REPLACE TABLE is replacing a table, such that the same named constraint (say, CONSTRAINT a FOREIGN KEY b(c) REFERENCES parent(c)) is specified for both the old and the replacing table
            • Rollback of CREATE OR REPLACE TABLE…SELECT when named FOREIGN KEY constraints exist
            • Crash recovery and rollback where another table is referencing the table via a named constraint
            • Schema and table names starting with #mysql50#, such that FOREIGN KEY constraints exist between tables
            • CREATE OR REPLACE TABLE that adds or removes a FOREIGN KEY constraint
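            The second scenario could be sketched in mtr roughly as follows (a sketch only; the table and constraint names are made up, and the failing INSERT…SELECT is simulated with a duplicate PRIMARY KEY value, which should roll the statement back and leave the original child table in place):

            --source include/have_innodb.inc
            CREATE TABLE parent (c INT PRIMARY KEY) ENGINE=InnoDB;
            CREATE TABLE child (c INT PRIMARY KEY,
              CONSTRAINT a FOREIGN KEY b (c) REFERENCES parent (c)) ENGINE=InnoDB;
            # The SELECT part produces a duplicate PRIMARY KEY value, so the
            # CREATE OR REPLACE must fail and restore the original child table.
            --error ER_DUP_ENTRY
            CREATE OR REPLACE TABLE child (c INT PRIMARY KEY,
              CONSTRAINT a FOREIGN KEY b (c) REFERENCES parent (c)) ENGINE=InnoDB
              SELECT 1 AS c UNION ALL SELECT 1 AS c;
            SHOW CREATE TABLE child;
            DROP TABLE child, parent;
            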

            I am requesting these, because I suspect that duplicate key errors for constraints like dbname/fk_constraint_name would be issued. I expect anonymous constraints to be fine; a unique internal name like dbname/tablename_ibfk_1 should be created when renaming the table.

            I plan to suggest some changes to row_rename_table_for_mysql() as well. I hope to preserve the assertion ut_ad(err != DB_DUPLICATE_KEY); in some form.

            marko Marko Mäkelä added a comment
            serg Sergei Golubchik added a comment - edited

            In the branch preview-10.10-ddl and in bb-10.10-MDEV-25292

            I revised the RENAME changes in InnoDB and restored a removed debug assertion. Now this should be ready for testing.

            marko Marko Mäkelä added a comment

            marko You can check the cross-reference report: multi_source.multi_parallel fails at least in 10.5 - 10.9.

            midenok Aleksey Midenkov added a comment

            More alarming is that in the test main.create_or_replace, the use of InnoDB persistent statistics is disabled. That seems to be totally unnecessary; only the metadata locks on the InnoDB statistics tables need to be filtered out from the expected result. I did not observe any failures after not disabling persistent statistics.

            I don't like having the 2-line-long select in create_or_replace.test in so many places. It is too much clutter outside the business logic. Is it important to have persistent statistics in that test?

            midenok Aleksey Midenkov added a comment

            midenok, yes, it is important to test with default settings. The InnoDB persistent statistics do play a role during DDL, and we must cover it. Work-arounds like MDEV-4750 can significantly reduce coverage in our regression tests. Should there be a race condition or other glitch somewhere, sooner or later a regression test would fail on our CI system. If we disable persistent statistics, we will never catch such failures.

            I amended the main commit with some clarified wording and my version of the InnoDB changes. I also confirmed that there is a regression for a case that I described 5 days ago:

            --source include/have_innodb.inc
            CREATE TABLE t(a INT PRIMARY KEY) ENGINE=InnoDB;
            CREATE OR REPLACE TABLE u(a INT PRIMARY KEY, CONSTRAINT c FOREIGN KEY d (a) REFERENCES t (a)) ENGINE=InnoDB;
            CREATE OR REPLACE TABLE u(a INT PRIMARY KEY, CONSTRAINT c FOREIGN KEY d (a) REFERENCES t (a)) ENGINE=InnoDB;
            DROP TABLE u, t;
            

            This test used to pass, but now it would fail due to a duplicate key on SYS_FOREIGN.NAME on the non-anonymous constraint name:

            mysqltest: At line 4: query 'CREATE OR REPLACE TABLE u(a INT PRIMARY KEY, CONSTRAINT c FOREIGN KEY d (a) REFERENCES t (a)) ENGINE=InnoDB' failed: ER_CANT_CREATE_TABLE (1005): Can't create table `test`.`u` (errno: 121 "Duplicate key on write or update")
            

            For anonymous constraints there is no problem, because the automatically generated names will be of the form databasename/tablename_ibfk_1 and the different table name will make the constraint name unique. With the above test case, assuming that the name of the current database is test, there will be a duplicate on test/c or test/d. I didn’t check which of the names InnoDB will use internally. If either c or d is specified in the CREATE OR REPLACE statement and the statement is repeated, the test will fail. It will pass if neither c nor d is specified, that is, if the constraint is anonymous.

            This is a regression that needs to be fixed.
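            For comparison, the anonymous-constraint variant of the same test (assuming the current database is test, so the generated names would be of the test/u_ibfk_1 form described above) should keep passing even when the statement is repeated:

            --source include/have_innodb.inc
            CREATE TABLE t(a INT PRIMARY KEY) ENGINE=InnoDB;
            CREATE OR REPLACE TABLE u(a INT PRIMARY KEY,
              FOREIGN KEY (a) REFERENCES t (a)) ENGINE=InnoDB;
            CREATE OR REPLACE TABLE u(a INT PRIMARY KEY,
              FOREIGN KEY (a) REFERENCES t (a)) ENGINE=InnoDB;
            DROP TABLE u, t;
            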

            marko Marko Mäkelä added a comment

            marko, no, if you enable some InnoDB functionality for all tests, you should not need to change all tests. And you should not have to add `where table_schema!='mysql' or table_name not like 'innodb_%_stats'` to every `select * from information_schema.metadata_lock_info`! Please test everything you need in the innodb suite.

            midenok Aleksey Midenkov added a comment

            Testing is complete; if marko doesn't have any more objections, I guess it can be merged.

            lstartseva Lena Startseva added a comment

            I will review this when I am ready with MDEV-27180. Because of the extensive changes in the code, the review process does take some time.

            monty Michael Widenius added a comment
            midenok Aleksey Midenkov added a comment - edited

            monty I already closed the bugs, but this ticket is waiting on you. The branch is bb-11.0-midenok-MDEV-25292 (27 top commits)


            Can we schedule that, please, or assign it to another person? The task has been waiting for more than half a year.

            midenok Aleksey Midenkov added a comment

            Updated branch is bb-main-midenok-MDEV-25292

            midenok Aleksey Midenkov added a comment

            I filed MDEV-35895 for an anomaly that was found by monty while he was testing this. I didn’t investigate it deeper yet, but I suspect that this anomaly might be avoided if CREATE OR REPLACE TABLE…SELECT downgraded the exclusive meta-data lock (MDL_EXCLUSIVE) on the table name for the duration of copying the data (the INSERT…SELECT part), similar to how handler::inplace_alter_table() is being executed under a non-exclusive MDL. In that way, the purge of committed transaction history could proceed on a possible old copy of the table while a new copy is being created. InnoDB can only purge history in order. Until or unless the DDL operation has been committed, the purge can’t skip any undo log records for the old copy of the table.

            marko Marko Mäkelä added a comment

            bb-11.8-monty pushed for QA testing.

            monty Michael Widenius added a comment

            bb-11.8-monty updated.
            All known things are fixed, except the MDL locks that InnoDB needs.

            monty Michael Widenius added a comment
            serg Sergei Golubchik added a comment - edited

            See commits:
            https://github.com/MariaDB/server/commit/6ff731317239
            https://github.com/MariaDB/server/commit/5ed14c2fdeff
            https://github.com/MariaDB/server/commit/a6b0394cffb4
            https://github.com/MariaDB/server/commit/59e3b88fbd8e
            https://github.com/MariaDB/server/commit/9c3281fc9679
            https://github.com/MariaDB/server/commit/c0fe20dbf24

            Please consider cherry-picking them into bb-11.8-monty.
            monty Michael Widenius added a comment - edited

            The merge of the above commits is now done. Each change has been added to its original commit and pushed to bb-11.8-monty.

            monty Michael Widenius added a comment

            bb-main-monty is now updated to latest main.


            People

              elenst Elena Stepanova
              midenok Aleksey Midenkov