Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-16283

ALTER TABLE...DISCARD TABLESPACE still takes long on a large buffer pool

Details

    Description

      ccalender noted in MDEV-13328 that ALTER TABLE...DISCARD TABLESPACE still takes a long time in a large buffer pool.

      I tried to reproduce the issue with the following:

      source include/have_innodb.inc;
      source include/have_sequence.inc;
       
      create database p;
      use p;
      create table pad(a serial,
      b0 char(255) not null default '',
      b1 char(255) not null default '',
      b2 char(255) not null default '',
      b3 char(255) not null default '',
      b4 char(255) not null default '',
      b5 char(255) not null default '',
      b6 char(255) not null default '',
      key(b0),key(b1),key(b2),key(b3),key(b4),key(b5),key(b6)) ENGINE=InnoDB;
       
      insert into pad(a) select null from seq_1_to_1000000;
       
      let $n=1000;
      while ($n) {
      eval create table t$n(a serial) engine=innodb;
      dec $n;
      }
       
      let $n=1000;
      while ($n) {
      eval alter table t$n discard tablespace;
      dec $n;
      }
       
      drop database p;
      

      ./mtr --mysqld=--innodb-flush-log-at-trx-commit=0 --mysqld=--innodb-max-dirty-pages-pct=99.999 --mysqld=--innodb-log-file-size=2g --mysqld=--innodb-page-size=4k --mysqld=--innodb-buffer-pool-size=8g innodb.discard,xtradb
      

      I believe that the trick is to fill up the buffer pool with many dirty pages, so that the buf_pool->LRU and buf_pool->flush_list will be long. The reason for innodb_page_size=4k is to get as many pages as possible into a smaller buffer pool. The other parameters are mainly there to speed up the insert (do as little file I/O as possible). I ran the test in /dev/shm.

      The DISCARD TABLESPACE were rather fast, maybe 0.1 seconds or less. While they were being run, sudo perf top showed the following:

      10.1 bfed1bfe28980d3b1404f99a44712242a5108ef5

        93,50%  mysqld                         [.] buf_LRU_flush_or_remove_pages
         0,60%  mysqld                         [.] buf_calc_page_new_checksum
         0,21%  mysqld                         [.] ut_delay
         0,21%  libc-2.27.so                   [.] __memmove_avx_unaligned_erms
         0,09%  libshadow.so                   [.] shadowUpdatePacked
         0,09%  libc-2.27.so                   [.] __memchr_avx2
      

      Yes, we still have to scan the buf_pool->flush_list and remove the pages of any discarded tables, so that the discarded data will not be written, or that we will not get flooded by error messages that there are unwritten changes to missing files.

      I tried shortening the flush_list by adding the following before the CREATE TABLE loop:

      FLUSH TABLES pad FOR EXPORT;
      UNLOCK TABLES;
      

      Computing the page checksums for the flushing takes quite a bit of time, but to my surprise, the DISCARD TABLESPACE remained roughly as slow as before. I also tried skip_innodb_adaptive_hash_index (even though already with the previous fix, we would not touch the adaptive hash index). No change.

      Here is the culprit:

      #0  buf_LRU_flush_or_remove_pages (id=id@entry=5, trx=trx@entry=0x0, 
          drop_ahi=drop_ahi@entry=true)
          at /mariadb/10.1/storage/xtradb/buf/buf0lru.cc:745
      #1  0x0000555555dbf1dd in fil_delete_tablespace (id=id@entry=5, 
          drop_ahi=drop_ahi@entry=true)
          at /mariadb/10.1/storage/xtradb/fil/fil0fil.cc:2939
      #2  0x0000555555dbf693 in fil_discard_tablespace (id=5)
          at /mariadb/10.1/storage/xtradb/fil/fil0fil.cc:3050
      #3  0x0000555555cfba1a in row_discard_tablespace (table=0x7fffb40159c8, 
          trx=0x7fffb40126f8) at /mariadb/10.1/storage/xtradb/row/row0mysql.cc:3123
      #4  row_discard_tablespace_for_mysql (name=<optimized out>, 
          trx=0x7fffb40126f8) at /mariadb/10.1/storage/xtradb/row/row0mysql.cc:3217
      

      We are actually passing drop_ahi=true in DISCARD TABLESPACE, and calling the expensive function buf_LRU_drop_page_hash_for_tablespace() for no good reason.

      The adaptive hash index entries should do no harm here; they will have to be dropped on IMPORT TABLESPACE or DROP TABLE, but not earlier.

      It is also curious that we are searching for adaptive hash index entries to be dropped, even though skip_innodb_adaptive_hash_index is set. We should check if any AHI entries exist for any table (or even better, for any indexes of our table or partition), and only then invoke buf_LRU_drop_page_hash_for_tablespace().

      Attachments

        Issue Links

          Activity

            I don’t think that we need to drop the adaptive hash index at all on DISCARD TABLESPACE. However, we must fix DROP TABLE so that it will drop the entries, even if the tablespace has been discarded.

            On IMPORT TABLESPACE, since MDEV-13328 we would free any pages with matching page numbers, and this will also drop the adaptive hash index entries. Only if the table would shrink as a result of the DISCARD TABLESPACE and IMPORT TABLESPACE, there could be adaptive hash index entries pointing to ‘phantom’ pages in the buffer pool. These entries must be removed. As a result of this, IMPORT TABLESPACE may become a little slower when replacing a table with a smaller one.

            While developing the fix, I came across the code that emits messages like this:

            InnoDB: Error: Waited for 140 secs for hash index ref_count (1) to drop to 0
            

            I do not see any mechanism that would actively drop the adaptive hash index in this case. Because we are keeping the dictionary locked while waiting, the entire server could hang, and the InnoDB buffer pool could become idle, so that the entries are never dropped. I think that it is better to actively drop the adaptive hash index entries in this case.

            marko Marko Mäkelä added a comment - I don’t think that we need to drop the adaptive hash index at all on DISCARD TABLESPACE . However, we must fix DROP TABLE  so that it will drop the entries, even if the tablespace has been discarded. On IMPORT TABLESPACE , since MDEV-13328 we would free any pages with matching page numbers, and this will also drop the adaptive hash index entries. Only if the table would shrink as a result of the DISCARD TABLESPACE and IMPORT TABLESPACE , there could be adaptive hash index entries pointing to ‘phantom’ pages in the buffer pool. These entries must be removed. As a result of this, IMPORT TABLESPACE may become a little slower when replacing a table with a smaller one. While developing the fix, I came across the code that emits messages like this: InnoDB: Error: Waited for 140 secs for hash index ref_count (1) to drop to 0 I do not see any mechanism that would actively drop the adaptive hash index in this case. Because we are keeping the dictionary locked while waiting, the entire server could hang, and the InnoDB buffer pool could become idle, so that the entries are never dropped. I think that it is better to actively drop the adaptive hash index entries in this case.

            The original fix (MDEV-13328) was that we would not evict the pages of the discarded tablespace from the buffer pool; we would do that on a subsequent IMPORT TABLESPACE only. But, we would still scan the buffer pool for adaptive hash index entries, to drop them. It actually suffices to drop the adaptive hash index entries on IMPORT TABLESPACE or on DROP TABLE or DROP INDEX. And there is no need to scan the buffer pool if index->search_info->ref_count==0 for all indexes of the table. MDEV-16283 implemented these fixes.

            marko Marko Mäkelä added a comment - The original fix ( MDEV-13328 ) was that we would not evict the pages of the discarded tablespace from the buffer pool; we would do that on a subsequent IMPORT TABLESPACE only. But, we would still scan the buffer pool for adaptive hash index entries, to drop them. It actually suffices to drop the adaptive hash index entries on IMPORT TABLESPACE or on DROP TABLE or DROP INDEX . And there is no need to scan the buffer pool if index->search_info->ref_count==0 for all indexes of the table. MDEV-16283 implemented these fixes.
            Tim He Tim He added a comment -

            --innodb-buffer-pool-size=16G
            ------------------------------------ (change "seq_1_to_1000000" to "seq_1_to_2000000")
            let $n=1000;
            while ($n)

            { eval alter table t$n discard tablespace; dec $n; }
            ------------------------------------
            71.09 sec


            --innodb-buffer-pool-size=8G
            ------------------------------------
            let $n=1000;
            while ($n) { eval alter table t$n discard tablespace; dec $n; }

            ------------------------------------
            41.65 sec

            --innodb-buffer-pool-size=2G
            ------------------------------------
            let $n=1000;
            while ($n)

            { eval alter table t$n discard tablespace; dec $n; }
            ------------------------------------
            16.24 sec


            --innodb-buffer-pool-size=256M
            ------------------------------------
            let $n=1000;
            while ($n) { eval alter table t$n discard tablespace; dec $n; }

            ------------------------------------
            3.6 sec

            --innodb-buffer-pool-size=128M
            ------------------------------------
            let $n=1000;
            while ($n)

            { eval alter table t$n discard tablespace; dec $n; }

            ------------------------------------
            2.75 sec

            Tim He Tim He added a comment - --innodb-buffer-pool-size=16G ------------------------------------ (change "seq_1_to_1000000" to "seq_1_to_2000000") let $n=1000; while ($n) { eval alter table t$n discard tablespace; dec $n; } ------------------------------------ 71.09 sec --innodb-buffer-pool-size=8G ------------------------------------ let $n=1000; while ($n) { eval alter table t$n discard tablespace; dec $n; } ------------------------------------ 41.65 sec --innodb-buffer-pool-size=2G ------------------------------------ let $n=1000; while ($n) { eval alter table t$n discard tablespace; dec $n; } ------------------------------------ 16.24 sec --innodb-buffer-pool-size=256M ------------------------------------ let $n=1000; while ($n) { eval alter table t$n discard tablespace; dec $n; } ------------------------------------ 3.6 sec --innodb-buffer-pool-size=128M ------------------------------------ let $n=1000; while ($n) { eval alter table t$n discard tablespace; dec $n; } ------------------------------------ 2.75 sec

            Tim He, with which version of MariaDB is that? Can you post some analysis based on perf record and perf report?

            marko Marko Mäkelä added a comment - Tim He , with which version of MariaDB is that? Can you post some analysis based on perf record and perf report ?

            People

              marko Marko Mäkelä
              marko Marko Mäkelä
              Votes:
              1 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.