There is no trace of table tt11 in either SYS_TABLES or fil_system. In SYS_TABLES, the closest neighbours are table0_innodb_int and tt17. Thus, the InnoDB error message seems appropriate (although it might be nice to suppress it when performing DDL log recovery).
Near the time the server is killed, one DDL operation is in progress:
#11 0x000055e28a7ae39b in ddl_log_sync_file () at /home/mdbe/atomic_ddl/bb-10.6-monty/sql/ddl_log.cc:190
#12 0x000055e28a7ae4c4 in ddl_log_sync_no_lock () at /home/mdbe/atomic_ddl/bb-10.6-monty/sql/ddl_log.cc:200
#13 0x000055e28a7aeea3 in ddl_log_disable_execute_entry (active_entry=0x7ffcec0161d8) at /home/mdbe/atomic_ddl/bb-10.6-monty/sql/ddl_log.cc:408
#14 0x000055e28a7b544e in ddl_log_complete (state=0x7ffcec0161d0) at /home/mdbe/atomic_ddl/bb-10.6-monty/sql/ddl_log.cc:2832
#15 0x000055e28a641447 in select_create::send_eof (this=0x7ffcec0160d0) at /home/mdbe/atomic_ddl/bb-10.6-monty/sql/sql_insert.cc:5063
#16 0x000055e28a70a5a6 in do_select (join=0x7ffcec016210, procedure=0x0) at /home/mdbe/atomic_ddl/bb-10.6-monty/sql/sql_select.cc:20593
#17 0x000055e28a6ddc51 in JOIN::exec_inner (this=0x7ffcec016210) at /home/mdbe/atomic_ddl/bb-10.6-monty/sql/sql_select.cc:4723
#18 0x000055e28a6dccd1 in JOIN::exec (this=0x7ffcec016210) at /home/mdbe/atomic_ddl/bb-10.6-monty/sql/sql_select.cc:4501
#19 0x000055e28a6de5ef in mysql_select (thd=0x7ffcec000d78, tables=0x7ffcec014fe0, fields=..., conds=0x0, og_num=0, order=0x0, group=0x0, having=0x0, proc_param=0x0, select_options=2201171004160,
    result=0x7ffcec0160d0, unit=0x7ffcec0050e8, select_lex=0x7ffcec014a08) at /home/mdbe/atomic_ddl/bb-10.6-monty/sql/sql_select.cc:4979
#20 0x000055e28a6cd9ac in handle_select (thd=0x7ffcec000d78, lex=0x7ffcec005020, result=0x7ffcec0160d0, setup_tables_done_option=0) at /home/mdbe/atomic_ddl/bb-10.6-monty/sql/sql_select.cc:552
#21 0x000055e28a793588 in Sql_cmd_create_table_like::execute (this=0x7ffcec0142c8, thd=0x7ffcec000d78) at /home/mdbe/atomic_ddl/bb-10.6-monty/sql/sql_table.cc:11662
#22 0x000055e28a68e654 in mysql_execute_command (thd=0x7ffcec000d78) at /home/mdbe/atomic_ddl/bb-10.6-monty/sql/sql_parse.cc:5987
#23 0x000055e28a694785 in mysql_parse (thd=0x7ffcec000d78, rawbuf=0x7ffcec0141d0 "CREATE /* QNO 73 CON_ID 24 */ TABLE IF NOT EXISTS tt4 AS SELECT * FROM tt17", length=75, parser_state=0x3e684a1b9510)
    at /home/mdbe/atomic_ddl/bb-10.6-monty/sql/sql_parse.cc:8019
Note: In rr replay, all threads except 1 will vanish abruptly if the traced process receives SIGKILL. It would be nicer if some other signal was sent first, such as SIGABRT.
In the data directory, there indeed is a file tt11.ibd.
During buffer pool recovery, the predecessor of the SYS_TABLES record for tt17 was modified in page 8 of the system tablespace, both for DELETE and INSERT operation. Let us check what last happened to that page not long before the server was killed.
Actually, we are lucky, and the only thread that ‘survived’ the SIGKILL in rr replay is the interesting one:
#12 0x000055e28b1e3d73 in fil_ibd_create (space_id=462, name=..., path=0x7ffcec2cd6f8 "./test/tt11.ibd", flags=21, size=4, mode=FIL_ENCRYPTION_DEFAULT, key_id=1, err=0x3e684a1b5260)
    at /home/mdbe/atomic_ddl/bb-10.6-monty/storage/innobase/fil/fil0fil.cc:2338
#13 0x000055e28b190cdc in dict_build_table_def_step (thr=0x7ffcec32e608, node=0x7ffcec097138) at /home/mdbe/atomic_ddl/bb-10.6-monty/storage/innobase/dict/dict0crea.cc:425
#14 0x000055e28b192c16 in dict_create_table_step (thr=0x7ffcec32e608) at /home/mdbe/atomic_ddl/bb-10.6-monty/storage/innobase/dict/dict0crea.cc:1075
#15 0x000055e28afb51a5 in que_thr_step (thr=0x7ffcec32e608) at /home/mdbe/atomic_ddl/bb-10.6-monty/storage/innobase/que/que0que.cc:661
#16 0x000055e28afb53af in que_run_threads_low (thr=0x7ffcec32e608) at /home/mdbe/atomic_ddl/bb-10.6-monty/storage/innobase/que/que0que.cc:709
#17 0x000055e28afb54c3 in que_run_threads (thr=0x7ffcec32e608) at /home/mdbe/atomic_ddl/bb-10.6-monty/storage/innobase/que/que0que.cc:729
#18 0x000055e28b0233db in row_create_table_for_mysql (table=0x4e2c6cf706a8, trx=0x705c19ca4648, mode=FIL_ENCRYPTION_DEFAULT, key_id=1)
    at /home/mdbe/atomic_ddl/bb-10.6-monty/storage/innobase/row/row0mysql.cc:2365
#19 0x000055e28ae8fc90 in create_table_info_t::create_table_def (this=0x3e684a1b5f30) at /home/mdbe/atomic_ddl/bb-10.6-monty/storage/innobase/handler/ha_innodb.cc:10464
#20 0x000055e28ae77cfa in create_table_info_t::create_table (this=0x3e684a1b5f30, create_fk=true) at /home/mdbe/atomic_ddl/bb-10.6-monty/storage/innobase/handler/ha_innodb.cc:12284
#21 0x000055e28ae9120b in ha_innobase::create (this=0x7ffcec32d4b0, name=0x3e684a1b78e0 "./test/tt11", form=0x3e684a1b64a0, create_info=0x3e684a1b8b50, file_per_table=true, trx=0x705c19ca4648)
    at /home/mdbe/atomic_ddl/bb-10.6-monty/storage/innobase/handler/ha_innodb.cc:12832
If, by this point, we had written undo log for creating the table and redo log for creating the file, then the rollback of the incomplete CREATE TABLE transaction would delete the file.
Alas, it looks like we had not written undo log yet, and further changes will be necessary:
#13 0x000055e28b190cdc in dict_build_table_def_step (thr=0x7ffcec32e608, node=0x7ffcec097138) at /home/mdbe/atomic_ddl/bb-10.6-monty/storage/innobase/dict/dict0crea.cc:425
425 table->space = fil_ibd_create(
(rr) p thr.graph.trx.undo_no
$1 = 0
To fix this bug, we must first write a SYS_TABLES record for the table, then write a redo log record for creating (or modifying) the file, and only after we have persistently written the log, we may create the data file. In this way, if the file exists on recovery, the buffer pool recovery is guaranteed to open the file, and the undo log recovery is guaranteed to find an incomplete DDL transaction that needs to be rolled back.
I think that fixing this bug depends on MDEV-24626, which cleans up InnoDB data file creation.
|
|
We should attempt to simplify some logic while fixing this:
- The parameter replace_new_file of dict_table_rename_in_cache() as well as the second parameter of dict_table_t::rename_tablespace() should be removed.
- The debug assertion in os_file_rename_func() must be made strict again, that is, the destination file name must never exist.
I believe that both cases of sloppiness are basically working around this bug. During the recovery of a TRUNCATE TABLE t1 transaction, we could currently have:
- space->chain.start->name="./test/#sql-ib123.ibd"
- table->name="test/t1" (or "test/#sql-ib123")
- the contents of the pre-TRUNCATE table in the file test/#sql-ib123.ibd
- an orphan file test/t1.ibd with a dummy page 0 (observed without any MDEV-24626 changes) that recovery or fil_system.spaces knows nothing about
|
|
I think that this could be essentially the same as MDEV-18518. I will try to fix this.
|
|
I now think that fixing MDEV-18518 is a prerequisite to fixing this. Basically, before creating a data file, we must write undo log for inserting a SYS_INDEXES record for the clustered index of the table that would be stored in the data file.
|
|
We will durably write the following before creating an .ibd file, to allow the recovery of MDEV-18518 to properly roll back incomplete DDL operations:
- Undo log for writing the SYS_INDEXES clustered index record
- Inserting the SYS_INDEXES clustered index record that identifies the tablespace for the rollback in dict_drop_index_tree()
- A FILE_CREATE record for creating the data file
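The ordering in the list above can be illustrated with a small toy model. This is only a sketch and not InnoDB code: `ToyLog`, `ToyServer` and `is_orphan` are hypothetical names, and the "log" is an in-memory vector with a counter that stands in for durable writes. The point it shows is that the file system is touched only after every record needed to undo the creation is durable.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy model only; none of these types exist in InnoDB.
struct ToyLog {
  std::vector<std::string> records;  // appended records, possibly only in memory
  std::size_t durable = 0;           // how many records have reached "disk"

  void append(const std::string &r) { records.push_back(r); }
  void flush() { durable = records.size(); }

  // A record survives a kill only if it was flushed.
  bool is_durable(const std::string &r) const {
    for (std::size_t i = 0; i < durable; i++)
      if (records[i] == r) return true;
    return false;
  }
};

struct ToyServer {
  ToyLog log;
  std::vector<std::string> files;    // files that exist in the file system

  // Create the data file only after the undo record, the SYS_INDEXES
  // insert and the FILE_CREATE record have all been made durable.
  void create_table(const std::string &name) {
    log.append("UNDO insert SYS_INDEXES " + name);
    log.append("INSERT SYS_INDEXES " + name);
    log.append("FILE_CREATE " + name + ".ibd");
    log.flush();                     // the durable write comes first
    files.push_back(name + ".ibd");  // only now touch the file system
  }
};

// After a kill, a file is an orphan if recovery has no durable record of it.
inline bool is_orphan(const ToyServer &s, const std::string &file) {
  return !s.log.is_durable("FILE_CREATE " + file);
}
```

With this ordering, any file that exists after a kill is known to recovery, so the rollback of the incomplete transaction can find and remove it.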
|
|
Still happens on bb-10.6-monty 387d673edb which supposedly contains the fix.
Same visible symptoms.
Data and rr profile are available.
|
|
elenst, in the rr replay trace, I observe the following. The last InnoDB redo log write before the SIGKILL was for writing a FILE_CREATE record for a CREATE TABLE tt13 that was not committed:
|
bb-10.6-monty 387d673edb5899adb31695f02032b060ed7574f7
|
Thread 2 received signal SIGKILL, Killed.
|
[Switching to Thread 24565.24575]
|
0x0000000070000002 in ?? ()
|
(rr) when
|
Current event: 794425
|
(rr) watch -l log_sys.flushed_to_disk_lsn._M_i
|
Hardware watchpoint 1: -location log_sys.flushed_to_disk_lsn._M_i
|
(rr) rc
|
Continuing.
|
…
|
(rr) bt
|
#0 0x00005631c774643f in std::__atomic_base<unsigned long>::store (
|
__m=std::memory_order_release, __i=47163374,
|
this=0x5631c915a408 <log_sys+8>)
|
at /usr/include/c++/7/bits/atomic_base.h:374
|
#1 log_t::set_flushed_lsn (this=0x5631c915a400 <log_sys>, lsn=47163374)
|
at /home/mdbe/atomic_ddl/bb-10.6-monty-for-rr/storage/innobase/include/log0log.h:656
|
#2 0x00005631c774342c in log_write_flush_to_disk_low (lsn=47163374)
|
at /home/mdbe/atomic_ddl/bb-10.6-monty-for-rr/storage/innobase/log/log0log.cc:629
|
#3 0x00005631c7743d2a in log_write_up_to (lsn=47163374, flush_to_disk=true,
|
rotate_key=false, callback=0x0)
|
at /home/mdbe/atomic_ddl/bb-10.6-monty-for-rr/storage/innobase/log/log0log.cc:830
|
#4 0x00005631c79df732 in fil_ibd_create (space_id=105, name=...,
|
path=0x232f5c32c7d0 "./test/tt13.ibd", flags=21, size=4,
|
mode=FIL_ENCRYPTION_DEFAULT, key_id=1, err=0x6673081cd60c)
|
at /home/mdbe/atomic_ddl/bb-10.6-monty-for-rr/storage/innobase/fil/fil0fil.cc:2037
|
…
|
#15 0x00005631c72e2586 in ha_create_table (thd=0x232f5c001b08,
|
path=0x6673081cf220 "./test/tt13", db=0x232f5c466570 "test",
|
table_name=0x232f5c465e60 "tt13", create_info=0x6673081d03d0,
|
frm=0x6673081cf210)
|
at /home/mdbe/atomic_ddl/bb-10.6-monty-for-rr/sql/handler.cc:5613
|
After recovery, we see the following bogus message:
|
bb-10.6-monty 387d673edb5899adb31695f02032b060ed7574f7
|
2021-05-04 19:40:28 0 [ERROR] InnoDB: Table `test`.`tt13` does not exist in the InnoDB internal data dictionary though MariaDB is trying to drop it. Have you copied the .frm file of the table to the MariaDB database directory from another database? Please refer to https://mariadb.com/kb/en/innodb-troubleshooting/ for how to resolve the issue.
|
This message needs to be suppressed, possibly in MDEV-17567. I will also post a comment there, saying that there is no point in issuing such messages while DDL log recovery is in progress:
|
bb-10.6-monty 387d673edb5899adb31695f02032b060ed7574f7
|
(rr) bt
|
#0 sql_print_error (format=0x561255a94148 "InnoDB: %s")
|
at /home/mdbe/atomic_ddl/bb-10.6-monty-for-rr/sql/log.cc:9177
|
#1 0x00005612553fb730 in ib::error::~error (this=0x7fff4e168310, __in_chrg=<optimized out>)
|
at /home/mdbe/atomic_ddl/bb-10.6-monty-for-rr/storage/innobase/ut/ut0ut.cc:508
|
#2 0x00005612551b9e9a in ha_innobase::delete_table (this=0x5612573d54c0,
|
name=0x561256ecdad2 "./test/tt13", sqlcom=SQLCOM_END)
|
at /home/mdbe/atomic_ddl/bb-10.6-monty-for-rr/storage/innobase/handler/ha_innodb.cc:13142
|
#3 0x00005612551a2616 in ha_innobase::delete_table (this=0x5612573d54c0,
|
name=0x561256ecdad2 "./test/tt13")
|
at /home/mdbe/atomic_ddl/bb-10.6-monty-for-rr/storage/innobase/handler/ha_innodb.cc:13214
|
#4 0x0000561254de755f in hton_drop_table (hton=0x561256eeed08, path=0x561256ecdad2 "./test/tt13")
|
at /home/mdbe/atomic_ddl/bb-10.6-monty-for-rr/sql/handler.cc:577
|
#5 0x0000561254bb4248 in ddl_log_execute_action (thd=0x561257109408, mem_root=0x7fff4e168db0,
|
ddl_log_entry=0x7fff4e168df0) at /home/mdbe/atomic_ddl/bb-10.6-monty-for-rr/sql/ddl_log.cc:1711
|
#6 0x0000561254bb5d67 in ddl_log_execute_entry_no_lock (thd=0x561257109408, first_entry=2)
|
at /home/mdbe/atomic_ddl/bb-10.6-monty-for-rr/sql/ddl_log.cc:2358
|
#7 0x0000561254bb679c in ddl_log_execute_recovery ()
|
at /home/mdbe/atomic_ddl/bb-10.6-monty-for-rr/sql/ddl_log.cc:2714
|
That said, the branch that you tested is missing some fixes for the InnoDB part of MDEV-25180 ALTER TABLE as well as something related to MDEV-18518.
|
|
The following fix should ensure the removal of the tablespace 105 and the file test/tt13.ibd:
diff --git a/storage/innobase/dict/dict0crea.cc b/storage/innobase/dict/dict0crea.cc
index 34d2a00083a..9a5ea6f9865 100644
--- a/storage/innobase/dict/dict0crea.cc
+++ b/storage/innobase/dict/dict0crea.cc
@@ -871,10 +871,6 @@ void dict_drop_index_tree(btr_pcur_t *pcur, trx_t *trx, dict_table_t *table,
 if (len != 4)
 goto rec_corrupted;
 
- if (root_page_no == FIL_NULL)
- /* The tree has already been freed */
- return;
-
 static_assert(FIL_NULL == 0xffffffff, "compatibility");
 static_assert(DICT_FLD__SYS_INDEXES__PAGE_NO ==
 DICT_FLD__SYS_INDEXES__SPACE + 1, "compatibility");
@@ -891,6 +887,8 @@ void dict_drop_index_tree(btr_pcur_t *pcur, trx_t *trx, dict_table_t *table,
 ut_ad(!table);
 fil_delete_tablespace(space_id, true);
 }
+ else if (root_page_no == FIL_NULL)
+ /* The tree has already been freed */;
 else if (fil_space_t*s= fil_space_t::get(space_id))
 {
 /* Ensure that the tablespace file exists
We actually do know the tablespace, but we did not bother to drop it because the clustered index root page had not been allocated yet.
|
|
In the -3 trace, the SIGKILL arrives soon after we persisted the write of a FILE_DELETE record:
|
bb-10.6-monty 387d673edb5899adb31695f02032b060ed7574f7
|
#4 0x000055f364a7d0db in fil_delete_tablespace (id=35, if_exists=false,
    detached_handles=0x7ffad51728d0)
    at /home/mdbe/atomic_ddl/bb-10.6-monty-for-rr/storage/innobase/fil/fil0fil.cc:1708
#5 0x000055f3648cd86f in row_drop_table_for_mysql (name=0x7ffad5172f10 "test/#sql-backup-a73a-26",
    trx=0x4a662d1d7918, sqlcom=SQLCOM_ALTER_TABLE, create_failed=false, nonatomic=true)
    at /home/mdbe/atomic_ddl/bb-10.6-monty-for-rr/storage/innobase/row/row0mysql.cc:3618
In dict_drop_index_tree() we would attempt to delete the tablespace, but the tablespace will not be found in fil_system. The proper fix would seem to be to delete the file earlier during recovery:
|
bb-10.6-monty 387d673edb5899adb31695f02032b060ed7574f7
|
#0 fil_name_process (name=0x66fd1444f074 "./test/#sql-backup-a73a-26.ibd", len=30, space_id=35,
|
deleted=true) at /home/mdbe/atomic_ddl/bb-10.6-monty-for-rr/storage/innobase/log/log0recv.cc:797
|
#1 0x0000559db92c0645 in recv_sys_t::parse (this=0x559dba2c18a0 <recv_sys>, checkpoint_lsn=54307,
|
store=0x7fff495fd9dc, apply=true)
|
at /home/mdbe/atomic_ddl/bb-10.6-monty-for-rr/storage/innobase/log/log0recv.cc:2202
|
#2 0x0000559db92c4d43 in recv_scan_log_recs (store=0x7fff495fd9dc, log_block=0x7e3d1ab2d400 "",
|
checkpoint_lsn=54307, start_lsn=64954880, end_lsn=64997376, contiguous_lsn=0x7fff495fda28,
|
group_scanned_lsn=0x559dbaccb608 <log_sys+520>)
|
at /home/mdbe/atomic_ddl/bb-10.6-monty-for-rr/storage/innobase/log/log0recv.cc:3146
|
#3 0x0000559db92c523d in recv_group_scan_log_recs (checkpoint_lsn=54307,
|
contiguous_lsn=0x7fff495fda28, last_phase=true)
|
at /home/mdbe/atomic_ddl/bb-10.6-monty-for-rr/storage/innobase/log/log0recv.cc:3229
|
#4 0x0000559db92c7273 in recv_recovery_from_checkpoint_start (flush_lsn=54307)
|
at /home/mdbe/atomic_ddl/bb-10.6-monty-for-rr/storage/innobase/log/log0recv.cc:3663
|
At this point in time, no fil_space_t for the tablespace is available, but we would know the file name.
I am not sure if any problem would still exist in this area. The data directory recovered in an improved server just fine:
|
bb-10.6-monty-innodb f179fdefd1d8b3c8b3d4e7961dca0144fb608c4e
|
2021-05-06 17:47:54 0 [Note] InnoDB: Deleting ./test/#sql-backup-a73a-26.ibd
|
…
|
(rr) bt
|
#0 unlink () at ../sysdeps/unix/syscall-template.S:120
|
#1 0x000055ee0db8cac4 in os_file_delete_if_exists_func (name=0x55ee10b1bc90 "./test/#sql-backup-a73a-26.ibd", exist=exist@entry=0x0) at /mariadb/bb-10.6-monty/storage/innobase/os/os0file.cc:1571
|
#2 0x000055ee0ddde184 in fil_delete_file (ibd_filepath=<optimized out>, ibd_filepath@entry=0x55ee10b1bc90 "./test/#sql-backup-a73a-26.ibd") at /mariadb/bb-10.6-monty/storage/innobase/fil/fil0fil.cc:3163
|
#3 0x000055ee0dc39290 in row_drop_table_for_mysql (name=name@entry=0x7ffd90856c20 "test/#sql-backup-a73a-26", trx=trx@entry=0x7f7371fda258, sqlcom=sqlcom@entry=SQLCOM_END,
|
create_failed=create_failed@entry=false, nonatomic=<optimized out>, nonatomic@entry=true) at /mariadb/bb-10.6-monty/storage/innobase/row/row0mysql.cc:3607
|
#4 0x000055ee0dae0e4d in ha_innobase::delete_table (this=this@entry=0x55ee10f85460, name=name@entry=0x7ffd90857800 "./test/#sql-backup-a73a-26", sqlcom=sqlcom@entry=SQLCOM_END)
|
at /mariadb/bb-10.6-monty/storage/innobase/handler/ha_innodb.cc:13093
|
#5 0x000055ee0dad4d8e in ha_innobase::delete_table (this=0x55ee10f85460, name=0x7ffd90857800 "./test/#sql-backup-a73a-26") at /mariadb/bb-10.6-monty/storage/innobase/handler/ha_innodb.cc:13216
|
#6 0x000055ee0d77abd8 in hton_drop_table (hton=<optimized out>, path=<optimized out>) at /mariadb/bb-10.6-monty/sql/handler.cc:577
|
#7 0x000055ee0d5b0c50 in ddl_log_execute_action (thd=thd@entry=0x55ee10a8af48, mem_root=mem_root@entry=0x7ffd90857a50, ddl_log_entry=ddl_log_entry@entry=0x7ffd90857a90)
|
at /mariadb/bb-10.6-monty/sql/ddl_log.cc:2111
|
#8 0x000055ee0d5b11cb in ddl_log_execute_entry_no_lock (thd=0x55ee10a8af48, first_entry=<optimized out>) at /mariadb/bb-10.6-monty/sql/ddl_log.cc:2358
|
#9 0x000055ee0d5b2303 in ddl_log_execute_recovery () at /mariadb/bb-10.6-monty/sql/ddl_log.cc:2714
|
#10 0x000055ee0d3c15a7 in mysqld_main (argc=<optimized out>, argv=<optimized out>) at /mariadb/bb-10.6-monty/sql/mysqld.cc:5682
|
#11 0x000055ee0d3b538e in main (argc=<optimized out>, argv=<optimized out>) at /mariadb/bb-10.6-monty/sql/main.cc:25
|
elenst, can you still reproduce this with that revision?
|
|
This seems to be fixed now (based on testing with MDEV-24626 fixes). Note: Without the MDEV-24626 fixes we may end up refusing recovery or may leave some orphan files behind. The test innodb_fts.crash_recovery,release works around that (and the workarounds will be removed in MDEV-24626).
|
|
We still need a third adjustment: DROP TABLE must not delete the file before the transaction has been committed. The file may be deleted by purge, or by the thread that committed the DROP TABLE transaction.
|
|
Even when the MDEV-24626 fix is present, the test innodb_fts.crash_recovery,release is displaying error messages about disappeared data files during recovery, presumably because of the premature file deletion problem.
I believe that any transaction that aims to delete files needs to follow this protocol:
- Ensure that the background statistics collection is not running on any affected table (dict_stats_stop_bg()).
- Exclusively lock the data dictionary.
- Ensure that no locks or references by other transactions or threads exist on the tables. If they do, abort the transaction.
- Memorize the tablespace ID of each table, and evict the table definitions from the data dictionary cache.
- Within the transaction, delete the persistent metadata.
- Check that FOREIGN KEY constraints are not being violated or broken.
- If no errors occurred, commit the transaction. Else, roll back.
- Release the data dictionary lock.
- If no errors occurred, invoke fil_delete_tablespace() on every memorized tablespace ID. (If the server is killed between commit and this step, then the purge of history will take care of this step after recovery.)
If the same transaction needs to drop and create a table by the same name, then the table will have to be renamed in order to avoid a file name clash on the subsequent create. If the function fts_create_common_tables() really needs to drop the tables, it should first rename them.
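A heavily simplified sketch of the protocol above follows. It is a toy model under stated assumptions, not InnoDB code: `ToyTable`, `ToyDict` and `toy_drop_tables` are hypothetical names, a `std::map` stands in for both the dictionary cache and the persistent metadata, and statistics collection and purge are left out. What it tries to show is the shape of the protocol: check everything under the exclusive lock, memorize the tablespace IDs, change metadata only if nothing failed, and hand the IDs back so that files are deleted only after the lock is released.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy model only; these types do not exist in InnoDB.
struct ToyTable {
  unsigned space_id;     // tablespace ID of the table
  unsigned ref_count;    // references held by other transactions or threads
  bool fk_referenced;    // dropping it would break a FOREIGN KEY constraint
};

struct ToyDict {
  std::map<std::string, ToyTable> cache;  // stands in for cache and metadata
  bool locked = false;                    // stands in for the exclusive lock
};

// Returns the tablespace IDs whose files the caller may delete after this
// call returns. An empty result means the transaction was rolled back.
std::vector<unsigned> toy_drop_tables(ToyDict &dict,
                                      const std::vector<std::string> &names) {
  std::vector<unsigned> memorized;
  dict.locked = true;                     // exclusively lock the dictionary
  bool ok = true;
  for (const auto &n : names) {
    auto it = dict.cache.find(n);
    if (it == dict.cache.end() || it->second.ref_count != 0 ||
        it->second.fk_referenced) {
      ok = false;                         // in use or constraint violated
      break;
    }
    memorized.push_back(it->second.space_id);
  }
  if (ok) {
    for (const auto &n : names)
      dict.cache.erase(n);                // "commit": metadata is gone
  } else {
    memorized.clear();                    // "rollback": nothing was changed
  }
  dict.locked = false;                    // release the dictionary lock
  return memorized;                       // files are deleted only after this
}
```

Because nothing is modified until all checks have passed, the rollback path in this model has nothing to undo.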
|
|
The easiest way to keep track of which tables were dropped during the execution of the transaction is trx_t::mod_tables. The field dict_table_t::to_be_dropped (and possibly later also dict_index_t::to_be_dropped) must be removed. The datafiles can be unlinked and file handles detached by a new variant of trx_t::commit(). The caller of that function would close the file handles after releasing dict_sys.mutex and dict_sys.latch, letting the file system follow the delete-on-close semantics, similar to how MDEV-8069 was solved.
|
|
I think that row_drop_table_for_mysql() must be replaced with a function that takes the table as a parameter. Perhaps already the caller should check that the table can actually be dropped. The metadata would be evicted and the table detached from other subsystems in a new member function
void trx_t::commit(std::vector<pfs_os_file_t> &deleted)
whose caller would invoke
for (auto d : deleted) os_file_close(d);
after releasing the dict_sys latches. Continuously holding an exclusive latch on dict_sys will prevent the table from being used by any other thread until we have committed or rolled back. In this way, we should avoid any ‘rollback in the cache’ activity.
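The delete-on-close idea can be sketched with plain POSIX calls. This is an illustration only: `toy_dict_latch`, `toy_commit` and `toy_drop` are hypothetical names, a `std::mutex` stands in for the dict_sys latches, and raw file descriptors stand in for pfs_os_file_t. The file is unlinked while the latch is held, but the potentially slow close() happens only after the latch has been released.

```cpp
#include <cassert>
#include <cstdlib>
#include <fcntl.h>
#include <mutex>
#include <string>
#include <sys/stat.h>
#include <unistd.h>
#include <utility>
#include <vector>

// Hypothetical stand-in for the dict_sys latches.
static std::mutex toy_dict_latch;

// Unlink each file while holding the latch, but hand the still-open
// descriptors back to the caller instead of closing them here.
void toy_commit(const std::vector<std::pair<std::string, int>> &files,
                std::vector<int> &deleted) {
  std::lock_guard<std::mutex> guard(toy_dict_latch);
  for (const auto &f : files) {
    if (unlink(f.first.c_str()) == 0)
      deleted.push_back(f.second);
  }
}

// The caller closes the descriptors after the latch has been released,
// so the file system reclaims the storage outside the critical section.
void toy_drop(const std::vector<std::pair<std::string, int>> &files) {
  std::vector<int> deleted;
  toy_commit(files, deleted);
  for (int fd : deleted)
    close(fd);
}
```

On POSIX systems an unlinked file remains usable through its open descriptors until the last one is closed, which is what makes it safe to defer the close().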
|
|
The fix of the remaining part of file system and InnoDB inconsistency will need some more days.
|
|
I have now finished the bulk of the refactoring of DROP TABLE operations, also when they are part of other operations, such as DROP DATABASE, TRUNCATE TABLE, ALTER TABLE.
- In the data dictionary cache, we will not modify anything before DROP TABLE is committed. The dict_table_t::to_be_dropped flag has been replaced by a flag in trx_t::mod_tables. This makes rollback trivial.
- Before acquiring locks on any data dictionary tables, DDL operations will first acquire exclusive locks on the to-be-dropped tables and ensure that all other threads that would access the tables have ceased doing so. The dict_sys will be exclusively locked until transaction commit.
- After writing the log record that marks the transaction committed, we will durably write a FILE_DELETE record and only then unlink the file.
- Also the purge of a delete-mark of a SYS_INDEXES record may initiate the file deletion (which could compete with the DDL thread mentioned above). Purge is what will guarantee file removal in case the server is killed after the commit.
- After the file has been unlinked, we will release latches and only then close the file handle, to have delete-on-close happen outside the critical section (MDEV-8069).
This mostly works now. Known problems include the following:
- Persistent statistics are still not being deleted as part of the same transaction.
- For the internal tables related to FULLTEXT indexes, purge will not (yet) acquire correct MDL (MDEV-16678). This causes a race condition between purge and a DROP of those tables. We need to acquire MDL for the main table, which should be guaranteed to exist in the dict_sys cache, identified by the table_id portion of the FTS_ table name.
|
|
As part of this change, I will also make the updates of InnoDB persistent statistics an atomic part of the DDL operation. That is, statistics will be dropped or renamed.
The background DROP TABLE queue will be removed, and MDEV-21602 will essentially be fixed too.
|
|
With part 3 of the fix, InnoDB no longer deletes any .ibd files before the DDL transaction has been durably committed. If the server is killed before the commit, the transaction will be rolled back without any problems. Also the persistent statistics will be dropped or renamed as part of the DDL transaction.
If the server is killed between the durable commit and file deletion, the purge of history will remove the .ibd file after recovery.
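The effect of the ordering can be checked with a tiny state model. This is a sketch under simplifying assumptions, not a description of the actual recovery code: `ToyState` and the functions below are hypothetical, the DROP consists of just two steps, and recovery is reduced to "roll back if not committed, otherwise let purge remove the file". It contrasts the commit-first ordering with the earlier unlink-first ordering.

```cpp
#include <cassert>

// Toy model only; these names do not exist in InnoDB.
struct ToyState {
  bool committed = false;       // DROP transaction durably committed
  bool dict_has_table = true;   // metadata says the table exists
  bool file_exists = true;      // the .ibd file exists
};

// New ordering: commit durably first, unlink afterwards.
// The server is "killed" after the given number of completed steps.
ToyState drop_commit_first(int steps_before_kill) {
  ToyState s;
  if (steps_before_kill >= 1) { s.committed = true; s.dict_has_table = false; }
  if (steps_before_kill >= 2) s.file_exists = false;
  return s;
}

// Old ordering (the problem): unlink before the commit.
ToyState drop_unlink_first(int steps_before_kill) {
  ToyState s;
  if (steps_before_kill >= 1) s.file_exists = false;
  if (steps_before_kill >= 2) { s.committed = true; s.dict_has_table = false; }
  return s;
}

// Recovery: an uncommitted DROP is rolled back, which restores metadata
// but cannot bring back a deleted file; a committed DROP has its file
// removed by purge if it is still there.
ToyState toy_recover(ToyState s) {
  if (s.committed)
    s.file_exists = false;
  else
    s.dict_has_table = true;
  return s;
}

// Metadata and file system agree after recovery.
bool toy_consistent(const ToyState &s) {
  return s.dict_has_table == s.file_exists;
}
```

In this model every kill point of the commit-first ordering recovers to a consistent state, while a kill between the unlink and the commit of the old ordering leaves a table whose file has disappeared.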
Orphan files were also caused by MDEV-25852, but by far the bigger cause was the fact that DDL operations inside InnoDB were not executed in a single atomic transaction.
|