MDEV-25506 (MariaDB Server)

Atomic DDL: .frm file is removed and orphan InnoDB tablespace is left behind upon crash recovery

Details

    Description

      In the test run, a server crash was induced while CREATE TABLE IF NOT EXISTS tt11 AS SELECT ... was being executed.

      After the crash, the original datadir contained both .frm and .ibd files:

      $ ls -l workdirs/worker7/vardir1_65/data_orig/test/tt11*
      -rw-rw---- 1 mdbe mdbe   462 Apr 24 14:17 workdirs/worker7/vardir1_65/data_orig/test/tt11.frm
      -rw-rw---- 1 mdbe mdbe 65536 Apr 24 14:17 workdirs/worker7/vardir1_65/data_orig/test/tt11.ibd
      

      Upon recovery on this datadir, InnoDB complains in the error log:

      bb-10.6-monty 8a94dabc9 with trigger patch

      2021-04-24 14:17:06 0 [ERROR] InnoDB: Table `test`.`tt11` does not exist in the InnoDB internal data dictionary though MariaDB is trying to drop it. Have you copied the .frm file of the table to the MariaDB database directory from another database? Please refer to https://mariadb.com/kb/en/innodb-troubleshooting/ for how to resolve the issue.
      2021-04-24 14:17:06 0 [Note] DDL_LOG: Crash recovery executed 1 entries
      

      After recovery, the .frm file is gone, but the .ibd file is still there:

      $ ls -l workdirs/worker7/vardir1_65/data/test/tt11*
      -rw-rw---- 1 mdbe mdbe 65536 Apr 24 14:17 workdirs/worker7/vardir1_65/data/test/tt11.ibd
      

      The table can now be neither dropped nor created:

      MariaDB [test]> create table tt11 (a int);
      ERROR 1813 (HY000): Tablespace for table '`test`.`tt11`' exists. Please DISCARD the tablespace before IMPORT
      MariaDB [test]> drop table tt11;
      ERROR 1051 (42S02): Unknown table 'test.tt11'
      MariaDB [test]> create table tt11 (a int);
      ERROR 1813 (HY000): Tablespace for table '`test`.`tt11`' exists. Please DISCARD the tablespace before IMPORT
      

      The "before" and "after" datadirs, logs and rr profiles are available.

          Activity

            I think that row_drop_table_for_mysql() must be replaced with a function that takes the table as a parameter. Perhaps the caller should already check that the table can actually be dropped. The metadata would be evicted and the table detached from other subsystems in a new member function

            void trx_t::commit(std::vector<pfs_os_file_t> &deleted)
            

            whose caller would invoke

            for (auto d : deleted) os_file_close(d);
            

            after releasing the dict_sys latches. Continuously holding an exclusive latch on dict_sys will prevent the table from being used by any other thread until we have committed or rolled back. In this way, we should avoid any ‘rollback in the cache’ activity.
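
The deferred-close idea in this comment can be sketched as follows. This is only an illustrative model, not InnoDB code: `file_handle`, `DictSys`, `Trx`, `CloseLog`, `close_file()` and `drop_tables()` are hypothetical stand-ins for `pfs_os_file_t`, `dict_sys`, `trx_t`, and `os_file_close()`, and a plain `std::mutex` stands in for the dict_sys latch.

```cpp
#include <cassert>
#include <mutex>
#include <vector>

using file_handle = int;  // hypothetical stand-in for pfs_os_file_t

struct DictSys {          // hypothetical stand-in for dict_sys
  std::mutex latch;
  bool locked = false;    // tracks whether the latch is currently held
};

struct CloseLog {         // records what close_file() observed
  std::vector<file_handle> closed;
  bool closed_under_latch = false;
};

// Stand-in for os_file_close(): notes whether the latch was held.
void close_file(file_handle h, const DictSys &dict, CloseLog &log) {
  if (dict.locked) log.closed_under_latch = true;
  log.closed.push_back(h);
}

struct Trx {              // hypothetical stand-in for trx_t
  std::vector<file_handle> dropped_files;

  // Modeled on the proposed trx_t::commit(deleted): the handles of the
  // dropped tables are handed to the caller instead of being closed here.
  void commit(std::vector<file_handle> &deleted) {
    for (file_handle h : dropped_files) deleted.push_back(h);
    dropped_files.clear();
  }
};

// Commit while holding the latch, then close the files after releasing it.
CloseLog drop_tables(DictSys &dict, Trx &trx) {
  CloseLog log;
  std::vector<file_handle> deleted;
  {
    std::lock_guard<std::mutex> guard(dict.latch);
    dict.locked = true;
    trx.commit(deleted);
    dict.locked = false;
  }
  assert(!dict.locked);  // the critical section is over
  for (auto d : deleted) close_file(d, dict, log);
  return log;
}
```

The point of the design is that the potentially slow close happens outside the critical section, while no other thread can see the table before the commit.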

            Comment added by Marko Mäkelä (marko).

            Fixing the remaining part of the inconsistency between the file system and InnoDB will take a few more days.

            Comment added by Marko Mäkelä (marko).

            I have now finished the bulk of the refactoring of DROP TABLE operations, including when they are part of other operations, such as DROP DATABASE, TRUNCATE TABLE, and ALTER TABLE.

            1. In the data dictionary cache, we will not modify anything before DROP TABLE is committed. The dict_table_t::to_be_dropped flag has been replaced by a flag in trx_t::mod_tables. This makes rollback trivial.
            2. Before acquiring locks on any data dictionary tables, DDL operations will first acquire exclusive locks on the to-be-dropped tables and ensure that all other threads that would access the tables have ceased doing so. The dict_sys will be exclusively locked until transaction commit.
            3. After writing the log record that marks the transaction committed, we will durably write a FILE_DELETE record and only then unlink the file.
            4. The purge of a delete-marked SYS_INDEXES record may also initiate the file deletion (which could compete with the DDL thread mentioned above). Purge is what will guarantee file removal in case the server is killed after the commit.
            5. After the file has been unlinked, we will release latches and only then close the file handle, to have delete-on-close happen outside the critical section (MDEV-8069).
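
The ordering in step 3 can be illustrated with a small in-memory model. Everything here is an assumption made for illustration: `RedoLog`, `FileSystem`, `unlink_after_commit()` and `drop_table()` are hypothetical names, and the log records are plain strings rather than the real redo log format.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical in-memory model of a redo log; not the InnoDB format.
struct RedoLog {
  std::vector<std::string> records;
  std::size_t durable = 0;  // number of records that have reached disk

  void append(const std::string &r) { records.push_back(r); }
  void flush() { durable = records.size(); }

  // True only if the record is among those already flushed.
  bool is_durable(const std::string &r) const {
    for (std::size_t i = 0; i < durable; i++)
      if (records[i] == r) return true;
    return false;
  }
};

struct FileSystem {          // hypothetical model of the data directory
  std::set<std::string> files;
};

// Refuses to unlink unless both the commit record and the FILE_DELETE
// record are durable; returns true if the file was removed.
bool unlink_after_commit(RedoLog &log, FileSystem &fs,
                         const std::string &file) {
  if (!log.is_durable("COMMIT") ||
      !log.is_durable("FILE_DELETE " + file))
    return false;
  return fs.files.erase(file) == 1;
}

// Step 3 of the list: commit, log FILE_DELETE durably, only then unlink.
bool drop_table(RedoLog &log, FileSystem &fs, const std::string &file) {
  log.append("COMMIT");
  log.append("FILE_DELETE " + file);
  log.flush();
  return unlink_after_commit(log, fs, file);
}
```

In this model a file can never disappear while the transaction could still be rolled back, which is the property the refactoring is after.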

            This mostly works now. Known problems include the following:

            1. Persistent statistics are still not being deleted as part of the same transaction.
            2. For the internal tables related to FULLTEXT indexes, purge will not (yet) acquire the correct MDL (MDEV-16678). This causes a race condition between purge and a DROP of those tables. We need to acquire MDL for the main table, which should be guaranteed to exist in the dict_sys cache, identified by the table_id portion of the FTS_ table name.
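
Extracting the table_id from an FTS_ table name might look like the sketch below. It assumes, without confirmation from this report, that the internal tables are named `<db>/FTS_<16 hex digits>_<suffix>`, where the hex digits are the id of the main table; `fts_parent_table_id()` is a hypothetical helper, not an InnoDB function.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Parses the parent table id out of a name such as
// "test/FTS_0000000000000123_CONFIG" (assumed format, see above).
// Returns false, leaving `id` untouched, if the name does not match.
bool fts_parent_table_id(const std::string &name, std::uint64_t &id) {
  const std::string prefix = "FTS_";
  const std::size_t slash = name.find('/');
  std::size_t pos = (slash == std::string::npos) ? 0 : slash + 1;

  if (name.compare(pos, prefix.size(), prefix) != 0) return false;
  pos += prefix.size();

  // Expect exactly 16 hex digits followed by an underscore.
  if (name.size() < pos + 17 || name[pos + 16] != '_') return false;

  std::uint64_t value = 0;
  for (std::size_t i = 0; i < 16; i++) {
    const char c = name[pos + i];
    unsigned digit;
    if (c >= '0' && c <= '9') digit = static_cast<unsigned>(c - '0');
    else if (c >= 'a' && c <= 'f') digit = static_cast<unsigned>(c - 'a') + 10;
    else if (c >= 'A' && c <= 'F') digit = static_cast<unsigned>(c - 'A') + 10;
    else return false;
    value = (value << 4) | digit;
  }
  id = value;
  return true;
}
```

With the id in hand, purge could look up the main table in the dict_sys cache and acquire MDL on it before touching the internal table.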
            Comment added by Marko Mäkelä (marko).

            As part of this change, I will also make the updates of InnoDB persistent statistics an atomic part of the DDL operation. That is, statistics will be dropped or renamed.

            The background DROP TABLE queue will be removed, and MDEV-21602 will essentially be fixed too.

            Comment added by Marko Mäkelä (marko).

            With part 3 of the fix, InnoDB no longer deletes any .ibd files before the DDL transaction has been durably committed. If the server is killed before the commit, the transaction will be rolled back without any problems. The persistent statistics will also be dropped or renamed as part of the DDL transaction.

            If the server is killed between the durable commit and file deletion, the purge of history will remove the .ibd file after recovery.

            Orphan files were also caused by MDEV-25852, but by far the bigger cause was that DDL operations inside InnoDB were not executed in a single atomic transaction.
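
The recovery outcomes described in this comment can be summarized in a small model of a DROP TABLE that is interrupted by a server kill. The types `DropState` and `Recovered` and the functions `recover()` and `consistent()` are hypothetical and exist only for this illustration.

```cpp
#include <cassert>

// State on disk at the moment the server was killed (hypothetical model).
struct DropState {
  bool commit_durable;  // did the DDL commit record reach disk?
  bool ibd_exists;      // is the .ibd file still present?
};

// State after crash recovery and purge have run.
struct Recovered {
  bool table_in_dictionary;
  bool ibd_exists;
};

Recovered recover(const DropState &s) {
  if (!s.commit_durable) {
    // The fix guarantees that no file is deleted before the commit.
    assert(s.ibd_exists);
    // Rollback: the table and its file both survive.
    return {true, true};
  }
  // Committed: the table is gone from the dictionary, and the purge of
  // history removes the file if the DDL thread did not get to it.
  return {false, false};
}

// The dictionary and the file system agree: no orphan .ibd file and no
// table without a file.
bool consistent(const Recovered &r) {
  return r.table_in_dictionary == r.ibd_exists;
}
```

In each of the three possible crash points (before commit, between commit and file deletion, after file deletion) the model ends in a consistent state, unlike the situation in the original report.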

            Comment added by Marko Mäkelä (marko).

            People

              marko Marko Mäkelä
              elenst Elena Stepanova