Details

    Description

      When innodb_file_per_table=ON, each table has its own .ibd file, and the user thread has to unlink the corresponding .ibd file when DROP TABLE is executed. As a result, DROP TABLE takes a long time when the .ibd file is large, and it stalls the whole system.

      For detailed information, please refer to: https://github.com/MariaDB/server/pull/1021

          Activity

            As far as I can tell, this basically is a work-around for an operating system deficiency that blocks any concurrent usage of the file system while a large file is being deleted. To my knowledge, it is most needed on Linux, and not at all needed on Microsoft Windows.

            MDEV-8069 and MDEV-22456 will remove some other bottlenecks related to InnoDB DDL operations that affect all environments.

            Technically, if we implement a background task that piecewise shrinks a large file in order to work around the file system starvation bug, it would be preferable to do that on 10.5 or later, using the MDEV-16264 infrastructure.

            marko Marko Mäkelä added the comment above.

            Now that MDEV-8069 has been fixed, I would like to know if an ftruncate() workaround is actually needed to prevent stalls on some file systems.

            marko Marko Mäkelä added the comment above.

            How is this issue different than MDEV-8069?

            manjot Manjot Singh (Inactive) added the comment above.

            marko, in your comment in MDEV-8069 on May 20, you mention that the unlink should still be fixed. Is the truncate here a different issue or the same issue?

            Was this fixed in 8069?

            manjot Manjot Singh (Inactive) added the comment above.

            manjot, in MDEV-8069 we changed the logic so that at the time of the unlink() invocation, there will be an open handle, to prevent the unlink() from performing any actual work. In this way, holding InnoDB mutexes at that point does not matter. We would close() the file handle only after releasing the InnoDB mutexes.
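
The POSIX behaviour that this relies on can be observed with a small stand-alone program. The following is only an illustrative sketch, not MariaDB code: the function name `unlink_with_open_handle`, the struct `UnlinkObservation`, and the temporary path under `/tmp` are all made up for this example. It assumes a local POSIX file system where an unlinked but still open file reports a link count of zero.

```cpp
#include <cassert>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

// What can be observed right after unlink() while a handle is still open.
struct UnlinkObservation {
  bool name_removed;      // the path no longer resolves
  bool zero_links;        // no directory entry refers to the inode
  off_t size_via_handle;  // the data is still reachable through the handle
};

// Hypothetical demo function (not part of InnoDB): create a temporary file,
// unlink it while keeping the descriptor open, and report what fstat() sees.
UnlinkObservation unlink_with_open_handle(off_t bytes)
{
  char path[] = "/tmp/unlink_sketch_XXXXXX";
  int fd = mkstemp(path);
  assert(fd >= 0);

  int rc = ftruncate(fd, bytes);  // give the file a nonzero size
  assert(rc == 0);

  rc = unlink(path);  // removes only the name; the inode stays alive
  assert(rc == 0);

  struct stat st;
  rc = fstat(fd, &st);
  assert(rc == 0);
  (void) rc;

  UnlinkObservation o{access(path, F_OK) != 0, st.st_nlink == 0, st.st_size};

  close(fd);  // the storage is released here, not at unlink()
  return o;
}
```

Calling `unlink_with_open_handle(4096)` should report that the name is gone while the 4096 bytes are still visible through the handle. The expensive part of the deletion is thereby deferred to close(), which can run after the mutexes have been released.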

            I do not know whether any currently popular Linux file systems suffer from the problem that deleting a file (which in our case would occur at the time of the close() invocation) would prevent any concurrent operation on the file system. There are some hints that this was a problem with the ext3 file system, but not with ext4. I think that we will find it out when someone complains. I would expect the worst case to involve the deletion of large fragmented files. It might ‘help’ to fragment the files by enabling page_compressed when creating the tables.

            If some file system turns out to suffer from that problem, we could try to work around that problem by repeatedly invoking ftruncate() to shrink the file before closing the file handle.

            marko Marko Mäkelä added the comment above.

            In MDEV-25506 the DROP TABLE code was rewritten, but the basic idea of the MDEV-8069 fix was preserved: we will unlink() the file while holding both some mutexes and an open file handle. Finally, we will release the mutexes and close the file. At this point, some time may be spent in the file system driver of the operating system kernel.

            Should some file system really require a work-around to make the delete-on-close perform faster (without stalling other threads or processes that are competing for kernel resources), we could implement something that performs a piecewise ftruncate() of the file before finally closing the handle. The following just illustrates the idea; there are multiple occurrences of such code in the 10.6 server:

            diff --git a/storage/innobase/handler/ha_innodb.cc b/storage/innobase/handler/ha_innodb.cc
            index 1acb8ef5e20..ddb8422a553 100644
            --- a/storage/innobase/handler/ha_innodb.cc
            +++ b/storage/innobase/handler/ha_innodb.cc
            @@ -2050,7 +2050,7 @@ static void drop_garbage_tables_after_restore()
             
                 row_mysql_unlock_data_dictionary(trx);
                 for (pfs_os_file_t d : deleted)
            -      os_file_close(d);
            +      os_file_truncate_and_close(d);
             
                 mtr.start();
                 btr_pcur_restore_position(BTR_SEARCH_LEAF, &pcur, &mtr);
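
For illustration only, here is a minimal sketch of what a piecewise truncate-then-close helper could look like on POSIX. It is not the InnoDB `os_file_truncate_and_close()` named in the diff, which does not exist in the server. The names `truncate_and_close` and `demo_drop`, the plain `int` file descriptor in place of `pfs_os_file_t`, the chunk size parameter, and the `/tmp` path are all assumptions made for this example. Whether shrinking in steps actually avoids stalls on any given file system is untested here.

```cpp
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper (not an InnoDB function): shrink the file behind fd
// by at most `chunk` bytes per ftruncate() call, then close the handle.
// Returns the number of ftruncate() calls made, or -1 on error.
int truncate_and_close(int fd, off_t chunk)
{
  struct stat st;
  if (chunk <= 0 || fstat(fd, &st) != 0) {
    close(fd);
    return -1;
  }

  int steps = 0;
  off_t size = st.st_size;
  while (size > 0) {
    size = size > chunk ? size - chunk : 0;
    if (ftruncate(fd, size) != 0) {
      close(fd);
      return -1;
    }
    ++steps;
    // A real implementation might yield or sleep here so that other
    // threads can use the file system between steps.
  }

  // The file is empty by now, so close() has little left to free.
  return close(fd) == 0 ? steps : -1;
}

// Demo: create a sparse temporary file, unlink it while the handle is
// open (as in the MDEV-8069 approach), then shrink it piecewise.
int demo_drop(off_t file_size, off_t chunk)
{
  char path[] = "/tmp/ibd_sketch_XXXXXX";
  int fd = mkstemp(path);
  if (fd < 0)
    return -1;
  if (ftruncate(fd, file_size) != 0 || unlink(path) != 0) {
    close(fd);
    unlink(path);
    return -1;
  }
  return truncate_and_close(fd, chunk);
}
```

With a 10 MiB file and a 4 MiB chunk, `demo_drop(10 << 20, 4 << 20)` shrinks the file to 6 MiB, 2 MiB, and 0 in three ftruncate() calls before closing. The chunk size is the tuning knob: smaller chunks mean shorter individual calls but more of them.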
            

            marko Marko Mäkelä added the comment above.

            I believe that this has been fixed by MDEV-8069 in MariaDB Server 10.5.4.

            marko Marko Mäkelä added the comment above.

            People

              marko Marko Mäkelä
              musazhang musazhang