Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-32786

Support NBO for DROP TABLE in Galera

Details

    • New Feature
    • Status: Open (View Workflow)
    • Major
    • Resolution: Unresolved
    • None
    • Galera, Locking
    • None

    Description

      I got this from a MariaDB user:

      I have executed test on Galera Cluster. Notice the result of the test. DML operations are blocked for the duration of the DROP statement. If this was huge table with drop taking over 2-3min we might have serious issue on top loaded DCs.

      +----------------+
      | version()      |
      +----------------+
      | 11.1.2-MariaDB |
      +----------------+
      

      SESSION1

      MariaDB [testdb]> drop table bigtable1;
      Query OK, 0 rows affected (8.487 sec)
      

      SESSION2

      MariaDB [testdb]> insert into t1 values (2,'Test123');
      Query OK, 1 row affected (7.282 sec)
      

      SESSION3

      MariaDB [test]> insert into t values (2,'Test');
      Query OK, 1 row affected (6.177 sec)
      

      SESSION4

      MariaDB [(none)]> show processlist;
      +----+-------------+-----------+--------+---------+------+---------------------------+-------------------------------------+----------+
      | Id | User        | Host      | db     | Command | Time | State                     | Info                                | Progress |
      +----+-------------+-----------+--------+---------+------+---------------------------+-------------------------------------+----------+
      |  1 | system user |           | NULL   | Sleep   |  632 | wsrep aborter idle        | NULL                                |    0.000 |
      |  2 | system user |           | NULL   | Sleep   |  632 | closing tables            | NULL                                |    0.000 |
      |  7 | system user |           | NULL   | Sleep   |   80 | After apply log event     | NULL                                |    0.000 |
      |  9 | system user |           | NULL   | Sleep   |  631 | wsrep applier idle        | NULL                                |    0.000 |
      | 10 | system user |           | NULL   | Sleep   |   69 | wsrep applier committed   | NULL                                |    0.000 |
      | 12 | root        | localhost | testdb | Query   |    3 | Waiting for certification | insert into t1 values (2,'Test123') |    0.000 |
      | 13 | root        | localhost | testdb | Query   |    4 | closing tables            | drop table bigtable1                |    0.000 |
      | 15 | root        | localhost | test   | Query   |    1 | Waiting for certification | insert into t values (2,'Test')     |    0.000 |
      | 16 | root        | localhost | NULL   | Query   |    0 | starting                  | show processlist                    |    0.000 |
      +----+-------------+-----------+--------+---------+------+---------------------------+-------------------------------------+----------+
      9 rows in set (0.000 sec)

      Attachments

        Issue Links

          Activity

            MDEV-8069 fixed something related in InnoDB. The problem was that the system call unlink() would be invoked while dict_sys latches were being held by InnoDB. Especially before MDEV-23484 and MDEV-16678 were fixed, this would block most InnoDB threads for the duration of deleting a large file. Deleting a large or fragmented file could take several seconds, in particular on rotational storage.

            MDEV-8069 was fixed by a simple trick that relies on the POSIX delete-on-close semantics, which works also on Microsoft Windows. We retain an open handle to the being-deleted file, and close that handle after the critical section of InnoDB has completed. In this way, the potentially time-consuming part will only block the current thread that is about to return from a DDL operation such as ha_innobase::delete_table(), ha_innobase::truncate(), or ha_innobase::commit_inplace_alter_table().

            I think that the problem might be that the time-consuming file deletion would still be executed while the DDL operation is protected by MDL_EXCLUSIVE.

            I believe that for this to be fixed properly, we should refactor the API to mark DDL transaction boundaries. DROP TABLE is analogous to the DELETE statement. On DML, the locking works like this:

            1. DELETE will lock and delete-mark the index records of the matching rows.
            2. COMMIT will release the InnoDB locks as well as metadata locks (MDL).
            3. Some time after any read views that could see the data as it was before the COMMIT, InnoDB will purge the delete-marked records.

            The simplest analogy for DROP TABLE would be like this:

            1. DROP TABLE will acquire MDL_EXCLUSIVE and invoke handler::delete_table() to unlink() the files.
            2. At the end of the statement, the MDL will be released.
            3. Something will be invoked that tells storage engines to perform post-commit cleanup (such as closing handles to the deleted files).

            In MDEV-25506 or related tickets, a purge mechanism was implemented for DDL in InnoDB. Its main purpose is crash recovery, in case the server was killed after a DDL transaction was durably committed but before the unlink() was called. As noted in this blog post, we must not start destroying any data or deleting any files before commit.

            marko Marko Mäkelä added a comment - MDEV-8069 fixed something related in InnoDB. The problem was that the system call unlink() would be invoked while dict_sys latches were being held by InnoDB. Especially before MDEV-23484 and MDEV-16678 were fixed, this would block most InnoDB threads for the duration of deleting a large file. Deleting a large or fragmented file could take several seconds, in particular on rotational storage. MDEV-8069 was fixed by a simple trick that relies on the POSIX delete-on-close semantics, which works also on Microsoft Windows. We retain an open handle to the being-deleted file, and close that handle after the critical section of InnoDB has completed. In this way, the potentially time-consuming part will only block the current thread that is about to return from a DDL operation such as ha_innobase::delete_table() , ha_innobase::truncate() , or ha_innobase::commit_inplace_alter_table() . I think that the problem might be that the time-consuming file deletion would still be executed while the DDL operation is protected by MDL_EXCLUSIVE . I believe that for this to be fixed properly, we should refactor the API to mark DDL transaction boundaries. DROP TABLE is analogous to the DELETE statement. On DML, the locking works like this: DELETE will lock and delete-mark the index records of the matching rows. COMMIT will release the InnoDB locks as well as metadata locks (MDL). Some time after any read views that could see the data as it was before the COMMIT , InnoDB will purge the delete-marked records. The simplest analogy for DROP TABLE would be like this: DROP TABLE will acquire MDL_EXCLUSIVE and invoke handler::delete_table() to unlink() the files. At the end of the statement, the MDL will be released. Something will be invoked that tells storage engines to perform post-commit cleanup (such as closing handles to the deleted files). In MDEV-25506 or related tickets, a purge mechanism was implemented for DDL in InnoDB. Its main purpose is crash recovery, in case the server was killed after a DDL transaction was durably committed but before the unlink() was called. As noted in this blog post , we must not start destroying any data or deleting any files before commit.

            DDLs in Galera cluster are by default executed in so called TOI (total order isolation) mode. The DROP TABLE operation shown in the ticket, essentially uses one replication round to order the operation and execute it in isolation. Once ordered, the DDL acquires Galera's commit order critical section (preventing other operations to commit). The commit order critical section is then released after the DDL releases its MDL locks.
            In that sense, marko is right:

            I think that the problem might be that the time-consuming file deletion would still be executed while the DDL operation is protected by MDL_EXCLUSIVE.

            Shortening the duration of the MDL locks would shorten commit order critical section.

            Alternatively, one could rely on NBO (non blocking operation) mode for DDL execution. NBO would allow executing the DDL, in a way that doesn't block concurrent non-conflicting operations. Thus allowing concurrent executions of operations like INSERT to t1 while DROP TABLE bigtable1 is in progress.
            NBO comes however at the cost of two replication rounds: one round orders the beginning of the DDL to acquire all necessary MDL locks, and a second round is used after DDL execution is done when MDL locks are released and to order binlogging of the DDL. The overhead of one additional replication round is likely acceptable in the above scenario, given that dropping a large table may take minutes.
            NBO mode is currently not supported for DROP TABLE. However it should be possible to extend NBO mode for DROP TABLE as well, although further analysis would be required to confirm this.

            sciascid Daniele Sciascia added a comment - DDLs in Galera cluster are by default executed in so called TOI (total order isolation) mode. The DROP TABLE operation shown in the ticket, essentially uses one replication round to order the operation and execute it in isolation. Once ordered, the DDL acquires Galera's commit order critical section (preventing other operations to commit). The commit order critical section is then released after the DDL releases its MDL locks. In that sense, marko is right: I think that the problem might be that the time-consuming file deletion would still be executed while the DDL operation is protected by MDL_EXCLUSIVE. Shortening the duration of the MDL locks would shorten commit order critical section. Alternatively, one could rely on NBO (non blocking operation) mode for DDL execution. NBO would allow executing the DDL, in a way that doesn't block concurrent non-conflicting operations. Thus allowing concurrent executions of operations like INSERT to t1 while DROP TABLE bigtable1 is in progress. NBO comes however at the cost of two replication rounds: one round orders the beginning of the DDL to acquire all necessary MDL locks, and a second round is used after DDL execution is done when MDL locks are released and to order binlogging of the DDL. The overhead of one additional replication round is likely acceptable in the above scenario, given that dropping a large table may take minutes. NBO mode is currently not supported for DROP TABLE. However it should be possible to extend NBO mode for DROP TABLE as well, although further analysis would be required to confirm this.

            People

              ralf.gebhardt Ralf Gebhardt
              monty Michael Widenius
              Votes:
              0 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

                Created:
                Updated:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.