[MDEV-32786] Support NBO for DROP TABLE in Galera - Jira

Details

Type: New Feature
Status: Open (View Workflow)
Priority: Major
Resolution: Unresolved
Fix Version/s: None
Component/s: Galera, Locking
Labels:
None

Description

I got this from a MariaDB user:

I have executed test on Galera Cluster. Notice the result of the test. DML operations are blocked for the duration of the DROP statement. If this was huge table with drop taking over 2-3min we might have serious issue on top loaded DCs.

+----------------+

| version()      |

+----------------+

| 11.1.2-MariaDB |

+----------------+

SESSION1
MariaDB [testdb]> drop table bigtable1;
Query OK, 0 rows affected (8.487 sec)

SESSION2
MariaDB [testdb]> insert into t1 values (2,'Test123');
Query OK, 1 row affected (7.282 sec)

SESSION3
MariaDB [test]> insert into t values (2,'Test');
Query OK, 1 row affected (6.177 sec)

SESSION4
MariaDB [(none)]> show processlist;
+----+-------------+-----------+--------+---------+------+---------------------------+-------------------------------------+----------+
\| Id \| User \| Host \| db \| Command \| Time \| State \| Info \| Progress \|
+----+-------------+-----------+--------+---------+------+---------------------------+-------------------------------------+----------+
\| 1 \| system user \| \| NULL \| Sleep \| 632 \| wsrep aborter idle \| NULL \| 0.000 \|
\| 2 \| system user \| \| NULL \| Sleep \| 632 \| closing tables \| NULL \| 0.000 \|
\| 7 \| system user \| \| NULL \| Sleep \| 80 \| After apply log event \| NULL \| 0.000 \|
\| 9 \| system user \| \| NULL \| Sleep \| 631 \| wsrep applier idle \| NULL \| 0.000 \|
\| 10 \| system user \| \| NULL \| Sleep \| 69 \| wsrep applier committed \| NULL \| 0.000 \|
\| 12 \| root \| localhost \| testdb \| Query \| 3 \| Waiting for certification \| insert into t1 values (2,'Test123') \| 0.000 \|
\| 13 \| root \| localhost \| testdb \| Query \| 4 \| closing tables \| drop table bigtable1 \| 0.000 \|
\| 15 \| root \| localhost \| test \| Query \| 1 \| Waiting for certification \| insert into t values (2,'Test') \| 0.000 \|
\| 16 \| root \| localhost \| NULL \| Query \| 0 \| starting \| show processlist \| 0.000 \|
+----+-------------+-----------+--------+---------+------+---------------------------+-------------------------------------+----------+
9 rows in set (0.000 sec)

Attachments

Issue Links

relates to

MDEV-32845 Improve DROP TABLE execution speed in Galera Cluster

Open

MDEV-8069 DROP or rebuild of a large table may lock up InnoDB

Closed

MDEV-18613 Optimization for dropping table

Closed

MDEV-25506 Atomic DDL: .frm file is removed and orphan InnoDB tablespace is left behind upon crash recovery

Closed

Activity

Ascending order - Click to sort in descending order

Marko Mäkelä added a comment - 2023-11-13 08:17

~~MDEV-8069~~ fixed something related in InnoDB. The problem was that the system call unlink() would be invoked while dict_sys latches were being held by InnoDB. Especially before ~~MDEV-23484~~ and ~~MDEV-16678~~ were fixed, this would block most InnoDB threads for the duration of deleting a large file. Deleting a large or fragmented file could take several seconds, in particular on rotational storage.

~~MDEV-8069~~ was fixed by a simple trick that relies on the POSIX delete-on-close semantics, which works also on Microsoft Windows. We retain an open handle to the being-deleted file, and close that handle after the critical section of InnoDB has completed. In this way, the potentially time-consuming part will only block the current thread that is about to return from a DDL operation such as ha_innobase::delete_table(), ha_innobase::truncate(), or ha_innobase::commit_inplace_alter_table().

I think that the problem might be that the time-consuming file deletion would still be executed while the DDL operation is protected by MDL_EXCLUSIVE.

I believe that for this to be fixed properly, we should refactor the API to mark DDL transaction boundaries. DROP TABLE is analogous to the DELETE statement. On DML, the locking works like this:

DELETE will lock and delete-mark the index records of the matching rows.
COMMIT will release the InnoDB locks as well as metadata locks (MDL).
Some time after any read views that could see the data as it was before the COMMIT, InnoDB will purge the delete-marked records.

The simplest analogy for DROP TABLE would be like this:

DROP TABLE will acquire MDL_EXCLUSIVE and invoke handler::delete_table() to unlink() the files.
At the end of the statement, the MDL will be released.
Something will be invoked that tells storage engines to perform post-commit cleanup (such as closing handles to the deleted files).

In ~~MDEV-25506~~ or related tickets, a purge mechanism was implemented for DDL in InnoDB. Its main purpose is crash recovery, in case the server was killed after a DDL transaction was durably committed but before the unlink() was called. As noted in this blog post, we must not start destroying any data or deleting any files before commit.

Marko Mäkelä added a comment - 2023-11-13 08:17 MDEV-8069 fixed something related in InnoDB. The problem was that the system call unlink() would be invoked while dict_sys latches were being held by InnoDB. Especially before MDEV-23484 and MDEV-16678 were fixed, this would block most InnoDB threads for the duration of deleting a large file. Deleting a large or fragmented file could take several seconds, in particular on rotational storage. MDEV-8069 was fixed by a simple trick that relies on the POSIX delete-on-close semantics, which works also on Microsoft Windows. We retain an open handle to the being-deleted file, and close that handle after the critical section of InnoDB has completed. In this way, the potentially time-consuming part will only block the current thread that is about to return from a DDL operation such as ha_innobase::delete_table() , ha_innobase::truncate() , or ha_innobase::commit_inplace_alter_table() . I think that the problem might be that the time-consuming file deletion would still be executed while the DDL operation is protected by MDL_EXCLUSIVE . I believe that for this to be fixed properly, we should refactor the API to mark DDL transaction boundaries. DROP TABLE is analogous to the DELETE statement. On DML, the locking works like this: DELETE will lock and delete-mark the index records of the matching rows. COMMIT will release the InnoDB locks as well as metadata locks (MDL). Some time after any read views that could see the data as it was before the COMMIT , InnoDB will purge the delete-marked records. The simplest analogy for DROP TABLE would be like this: DROP TABLE will acquire MDL_EXCLUSIVE and invoke handler::delete_table() to unlink() the files. At the end of the statement, the MDL will be released. Something will be invoked that tells storage engines to perform post-commit cleanup (such as closing handles to the deleted files). In MDEV-25506 or related tickets, a purge mechanism was implemented for DDL in InnoDB. Its main purpose is crash recovery, in case the server was killed after a DDL transaction was durably committed but before the unlink() was called. As noted in this blog post , we must not start destroying any data or deleting any files before commit.

Daniele Sciascia added a comment - 2023-12-06 14:24

DDLs in Galera cluster are by default executed in so called TOI (total order isolation) mode. The DROP TABLE operation shown in the ticket, essentially uses one replication round to order the operation and execute it in isolation. Once ordered, the DDL acquires Galera's commit order critical section (preventing other operations to commit). The commit order critical section is then released after the DDL releases its MDL locks.
In that sense, marko is right:

I think that the problem might be that the time-consuming file deletion would still be executed while the DDL operation is protected by MDL_EXCLUSIVE.

Shortening the duration of the MDL locks would shorten commit order critical section.

Alternatively, one could rely on NBO (non blocking operation) mode for DDL execution. NBO would allow executing the DDL, in a way that doesn't block concurrent non-conflicting operations. Thus allowing concurrent executions of operations like INSERT to t1 while DROP TABLE bigtable1 is in progress.
NBO comes however at the cost of two replication rounds: one round orders the beginning of the DDL to acquire all necessary MDL locks, and a second round is used after DDL execution is done when MDL locks are released and to order binlogging of the DDL. The overhead of one additional replication round is likely acceptable in the above scenario, given that dropping a large table may take minutes.
NBO mode is currently not supported for DROP TABLE. However it should be possible to extend NBO mode for DROP TABLE as well, although further analysis would be required to confirm this.

Daniele Sciascia added a comment - 2023-12-06 14:24 DDLs in Galera cluster are by default executed in so called TOI (total order isolation) mode. The DROP TABLE operation shown in the ticket, essentially uses one replication round to order the operation and execute it in isolation. Once ordered, the DDL acquires Galera's commit order critical section (preventing other operations to commit). The commit order critical section is then released after the DDL releases its MDL locks. In that sense, marko is right: I think that the problem might be that the time-consuming file deletion would still be executed while the DDL operation is protected by MDL_EXCLUSIVE. Shortening the duration of the MDL locks would shorten commit order critical section. Alternatively, one could rely on NBO (non blocking operation) mode for DDL execution. NBO would allow executing the DDL, in a way that doesn't block concurrent non-conflicting operations. Thus allowing concurrent executions of operations like INSERT to t1 while DROP TABLE bigtable1 is in progress. NBO comes however at the cost of two replication rounds: one round orders the beginning of the DDL to acquire all necessary MDL locks, and a second round is used after DDL execution is done when MDL locks are released and to order binlogging of the DDL. The overhead of one additional replication round is likely acceptable in the above scenario, given that dropping a large table may take minutes. NBO mode is currently not supported for DROP TABLE. However it should be possible to extend NBO mode for DROP TABLE as well, although further analysis would be required to confirm this.

MariaDB Server

Support NBO for DROP TABLE in Galera

Details

Description

Attachments

Issue Links

Activity

People

Dates

Git Integration