[MDEV-32786] Optimization of DROP TABLE for Galera Created: 2023-11-13  Updated: 2023-12-22

Status: Open
Project: MariaDB Server
Component/s: Galera, Locking
Fix Version/s: 11.5

Type: Task Priority: Major
Reporter: Michael Widenius Assignee: Daniele Sciascia
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Relates
relates to MDEV-8069 DROP or rebuild of a large table may ... Closed
relates to MDEV-18613 Optimization for dropping table Closed
relates to MDEV-25506 Atomic DDL: .frm file is removed and ... Closed

 Description   

I got this from a MariaDB user:
----------
I executed a test on a Galera Cluster. Note the result of the test: DML operations are blocked for the duration of the DROP statement. If this were a huge table, with the drop taking over 2-3 minutes, we might have a serious issue on heavily loaded DCs.

----------------

version()

----------------

11.1.2-MariaDB

----------------

SESSION1
MariaDB [testdb]> drop table bigtable1;
Query OK, 0 rows affected (8.487 sec)

SESSION2
MariaDB [testdb]> insert into t1 values (2,'Test123');
Query OK, 1 row affected (7.282 sec)

SESSION3
MariaDB [test]> insert into t values (2,'Test');
Query OK, 1 row affected (6.177 sec)

SESSION4
MariaDB [(none)]> show processlist;
| Id | User        | Host      | db     | Command | Time | State                     | Info                                | Progress |
|----|-------------|-----------|--------|---------|------|---------------------------|-------------------------------------|----------|
|  1 | system user |           | NULL   | Sleep   |  632 | wsrep aborter idle        | NULL                                |    0.000 |
|  2 | system user |           | NULL   | Sleep   |  632 | closing tables            | NULL                                |    0.000 |
|  7 | system user |           | NULL   | Sleep   |   80 | After apply log event     | NULL                                |    0.000 |
|  9 | system user |           | NULL   | Sleep   |  631 | wsrep applier idle        | NULL                                |    0.000 |
| 10 | system user |           | NULL   | Sleep   |   69 | wsrep applier committed   | NULL                                |    0.000 |
| 12 | root        | localhost | testdb | Query   |    3 | Waiting for certification | insert into t1 values (2,'Test123') |    0.000 |
| 13 | root        | localhost | testdb | Query   |    4 | closing tables            | drop table bigtable1                |    0.000 |
| 15 | root        | localhost | test   | Query   |    1 | Waiting for certification | insert into t values (2,'Test')     |    0.000 |
| 16 | root        | localhost | NULL   | Query   |    0 | starting                  | show processlist                    |    0.000 |
9 rows in set (0.000 sec)



 Comments   
Comment by Marko Mäkelä [ 2023-11-13 ]

MDEV-8069 fixed something related in InnoDB. The problem was that the system call unlink() would be invoked while dict_sys latches were being held by InnoDB. Especially before MDEV-23484 and MDEV-16678 were fixed, this would block most InnoDB threads for the duration of deleting a large file. Deleting a large or fragmented file could take several seconds, in particular on rotational storage.

MDEV-8069 was fixed by a simple trick that relies on the POSIX delete-on-close semantics, which works also on Microsoft Windows. We retain an open handle to the being-deleted file, and close that handle after the critical section of InnoDB has completed. In this way, the potentially time-consuming part will only block the current thread that is about to return from a DDL operation such as ha_innobase::delete_table(), ha_innobase::truncate(), or ha_innobase::commit_inplace_alter_table().

I think that the problem might be that the time-consuming file deletion would still be executed while the DDL operation is protected by MDL_EXCLUSIVE.

I believe that for this to be fixed properly, we should refactor the API to mark DDL transaction boundaries. DROP TABLE is analogous to the DELETE statement. On DML, the locking works like this:

  1. DELETE will lock and delete-mark the index records of the matching rows.
  2. COMMIT will release the InnoDB locks as well as metadata locks (MDL).
  3. Some time later, once no read views remain that could see the data as it was before the COMMIT, InnoDB will purge the delete-marked records.

The simplest analogy for DROP TABLE would be like this:

  1. DROP TABLE will acquire MDL_EXCLUSIVE and invoke handler::delete_table() to unlink() the files.
  2. At the end of the statement, the MDL will be released.
  3. Something will be invoked that tells storage engines to perform post-commit cleanup (such as closing handles to the deleted files).

In MDEV-25506 or related tickets, a purge mechanism was implemented for DDL in InnoDB. Its main purpose is crash recovery, in case the server was killed after a DDL transaction was durably committed but before the unlink() was called. As noted in this blog post, we must not start destroying any data or deleting any files before commit.

Comment by Daniele Sciascia [ 2023-12-06 ]

DDLs in a Galera cluster are by default executed in so-called TOI (total order isolation) mode. The DROP TABLE operation shown in the ticket essentially uses one replication round to order the operation and execute it in isolation. Once ordered, the DDL acquires Galera's commit order critical section (preventing other operations from committing). The commit order critical section is released only after the DDL releases its MDL locks.
In that sense, Marko is right:

I think that the problem might be that the time-consuming file deletion would still be executed while the DDL operation is protected by MDL_EXCLUSIVE.

Shortening the duration of the MDL locks would therefore also shorten the commit order critical section.

Alternatively, one could rely on NBO (non-blocking operation) mode for DDL execution. NBO allows executing the DDL in a way that does not block concurrent non-conflicting operations, so that operations like INSERT into t1 can proceed while DROP TABLE bigtable1 is in progress.
NBO, however, comes at the cost of two replication rounds: one round orders the beginning of the DDL so that all necessary MDL locks can be acquired, and a second round, after DDL execution is done, orders the release of the MDL locks and the binlogging of the DDL. The overhead of one additional replication round is likely acceptable in the above scenario, given that dropping a large table may take minutes.
NBO mode is currently not supported for DROP TABLE. It should be possible to extend NBO mode to cover DROP TABLE as well, although further analysis would be required to confirm this.

Generated at Thu Feb 08 10:34:00 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.