
MDEV-35049: fix innodb-adaptive-hash-index scalability with multiple threads


    Description

      The InnoDB adaptive hash index has performance problems when queries that use InnoDB hash indexes (at least on the same tables) are run in parallel.

      Here is an example of running the same SELECT query with 10-256 concurrent threads.
      (The query normally takes about 0.36 seconds.)

      With --innodb-adaptive-hash-index=0 we get:

      Total execution time with 10 threads: 1 second
      Total execution time with 32 threads: 1 second
      Total execution time with 64 threads: 3 seconds
      Total execution time with 128 threads: 5 seconds
      Total execution time with 256 threads: 9 seconds

      With --innodb-adaptive-hash-index=1 we get:

      Total execution time with 10 threads: 1 second
      Total execution time with 32 threads: 5 seconds
      Total execution time with 64 threads: 9 seconds
      Total execution time with 128 threads: 18 seconds
      Total execution time with 256 threads: 50 seconds

      MySQL 5.7 does not have this issue. It has roughly the same speed with and without the hash indexes.

      In the above query, there was a driving table with 400 rows, and for every row from the driving table there were 200 lookups on a secondary key in a child table.
      Using innodb-adaptive-hash-index=1 speeds up the query by 37% when it runs in a single thread, but it slows the query down drastically when running 32-256 concurrent threads.
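      The comments below attribute the regression to contention on the adaptive hash index latches. As a rough illustration only, here is a toy C++ program (written for this report; it is not InnoDB code and not the benchmark used above) in which every thread repeats the same point lookups against one shared hash table guarded by a single reader-writer lock, so every lookup has to touch the same lock word:

      // Toy model: concurrent point lookups in a shared, latch-protected hash table.
      // The map and the shared_mutex merely stand in for an adaptive hash index
      // partition and its rw-latch; nothing here is taken from the InnoDB source.
      #include <chrono>
      #include <cstdint>
      #include <cstdio>
      #include <shared_mutex>
      #include <thread>
      #include <unordered_map>
      #include <vector>

      int main()
      {
        std::unordered_map<uint64_t, uint64_t> table;   // stand-in for one AHI partition
        for (uint64_t i = 0; i < 100000; i++)
          table.emplace(i, i);
        std::shared_mutex latch;                        // stand-in for the partition latch

        for (unsigned n_threads : {1u, 10u, 32u, 64u, 128u, 256u}) {
          const auto start = std::chrono::steady_clock::now();
          std::vector<std::thread> workers;
          for (unsigned t = 0; t < n_threads; t++)
            workers.emplace_back([&] {
              uint64_t sum = 0;
              for (uint64_t i = 0; i < 200000; i++) {
                std::shared_lock<std::shared_mutex> s(latch);  // "read" latch per lookup
                sum += table.find(i % 100000)->second;
              }
              (void) sum;
            });
          for (std::thread &w : workers)
            w.join();
          const auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
              std::chrono::steady_clock::now() - start).count();
          std::printf("%3u threads: %lld ms\n", n_threads, static_cast<long long>(ms));
        }
        return 0;
      }

      How much of the real slowdown comes from shared versus exclusive latch acquisitions is exactly what the analysis in the comments below is about.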



          Activity

            marko Marko Mäkelä added a comment:

            There was one more observed ut_ad(block->n_pointers) assertion failure. It is the result of a race condition between btr_search_drop_page_hash_index() and btr_search_update_hash_ref() between two SELECT operations. The workload is also frequently rebuilding the adaptive hash index for this index, between 1 field (left side) and 2 fields (right side). It demonstrates that ideally, it should be possible to enable or disable the adaptive hash index not only on a per-index basis, but also on a per-query basis.

            I do not have a fix for this yet. Possibly, btr_search_drop_page_hash_index() needs to be made to use a single batch, by allocating a large enough buffer. Multiple batches in btr_search_build_page_hash_index() should not be a problem, because the AHI can always be a proper subset of the contents of the index.
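            To make that idea concrete, here is a minimal C++ sketch using toy types: an ordinary hash container and a std::shared_mutex stand in for the AHI partition and its latch, and rec_fold() is replaced by a plain hash function, so none of this is the actual InnoDB interface. The point is only that all fold values of the page are computed first, and the exclusive latch is then taken exactly once per page:

            // Toy sketch of the "single batch" idea for btr_search_drop_page_hash_index():
            // fold every record of the page first, then purge the hash entries under a
            // single exclusive latch acquisition. None of these types are InnoDB's own.
            #include <cstdint>
            #include <functional>
            #include <mutex>
            #include <shared_mutex>
            #include <string>
            #include <unordered_map>
            #include <vector>

            struct toy_ahi_partition {
              std::shared_mutex latch;                                // partition rw-latch
              std::unordered_multimap<uint32_t, const void *> table;  // fold -> record
            };

            // Stand-in for rec_fold(): any deterministic hash of the hashed index prefix.
            uint32_t toy_fold(const std::string &rec)
            {
              return static_cast<uint32_t>(std::hash<std::string>{}(rec));
            }

            // Drop the hash entries of one page in a single batch.
            void drop_page_hash_index(toy_ahi_partition &part,
                                      const std::vector<std::string> &page_records)
            {
              std::vector<uint32_t> folds;                // the "large enough buffer"
              folds.reserve(page_records.size());
              for (const std::string &rec : page_records)
                folds.push_back(toy_fold(rec));           // no AHI latch held while folding

              std::unique_lock<std::shared_mutex> x(part.latch);  // one exclusive acquisition
              for (uint32_t f : folds)
                part.table.erase(f);  // toy shortcut: real code would also check that the
                                      // entry actually points into the page being dropped
            }

            A real implementation would also need to bound the size of that buffer somehow, which is what the later comments on this issue turn to.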

            For performance testing, I also created the branch bb-10.11-MDEV-35049, with two variants: something that aims to be a correct merge and a deliberately incorrect merge to assess the impact of MDEV-13756.


            marko Marko Mäkelä added a comment:

            I ended up fixing the btr_search_drop_page_hash_index() problem by allocating enough stack to accommodate the maximum number of records per page. I empirically determined the maximum number of records with the following test case:

            --source include/have_innodb.inc
            --source include/have_innodb_64k.inc
            --source include/have_sequence.inc
            CREATE TABLE t(a SMALLINT PRIMARY KEY, INDEX(a)) ENGINE=InnoDB;
            INSERT INTO t SELECT * FROM seq_1_to_8189;
            

            od -Ax -t x1 -j 0x40000 -N 0x10000 var/mysqld.1/data/test/t.ibd
            

            There is a hard limit of 8191 records per page. InnoDB reserves 13 bits for the heap number in the record header, and the predefined infimum and supremum records consume one heap number each.

            So, we would have to allocate 8189*4 = 32756 bytes of stack for the rec_fold() value buffer in btr_search_drop_page_hash_index() in order to be able to compute all rec_fold() values before acquiring an exclusive latch on an adaptive hash index partition. I did consider the alternative of allocating a buffer from the heap; I am afraid that if we failed to allocate it, we’d have no other option than to crash the server.

            I just realized that there is one more option: keep computing rec_fold() for the "overflow" while holding an exclusive adaptive hash index latch. In that way, for typical pages where we have at most a few hundred records, we’d invoke rec_fold() while not blocking others. In the non-typical case, there would be a bit more blocking going on. This could be the best compromise. We know that Alpine Linux (MDEV-18462) is configuring a small stack size, which is already leading to some problems elsewhere (MDEV-35705, MDEV-35022).
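            Continuing the same toy model (again, the types, helper names and the threshold are illustrative assumptions, not the InnoDB code), the hybrid option could look roughly like this: fold values for a typical page fit into a small fixed stack buffer and are computed without any latch, and only an unusually full page keeps folding the remaining records after the exclusive latch has been taken:

            // Toy sketch of the hybrid "overflow" option: a fixed stack buffer covers
            // typical pages without holding the latch; the rare overflow records are
            // folded under the exclusive latch, trading a bit of blocking for bounded
            // stack usage. Types and the threshold are illustrative assumptions.
            #include <algorithm>
            #include <array>
            #include <cstddef>
            #include <cstdint>
            #include <functional>
            #include <mutex>
            #include <shared_mutex>
            #include <string>
            #include <unordered_map>
            #include <vector>

            struct toy_ahi_partition {
              std::shared_mutex latch;
              std::unordered_multimap<uint32_t, const void *> table;
            };

            uint32_t toy_fold(const std::string &rec)
            {
              return static_cast<uint32_t>(std::hash<std::string>{}(rec));
            }

            void drop_page_hash_index_hybrid(toy_ahi_partition &part,
                                             const std::vector<std::string> &page_records)
            {
              constexpr size_t STACK_FOLDS = 1024;        // assumed "typical page" bound
              std::array<uint32_t, STACK_FOLDS> folds;    // 4 KiB of stack, not 32756 bytes

              const size_t prefix = std::min(page_records.size(), STACK_FOLDS);
              for (size_t i = 0; i < prefix; i++)
                folds[i] = toy_fold(page_records[i]);     // latch not held: readers not blocked

              std::unique_lock<std::shared_mutex> x(part.latch);
              for (size_t i = 0; i < prefix; i++)
                part.table.erase(folds[i]);
              // Overflow: only pages with more than STACK_FOLDS records reach this loop,
              // and they pay for it by folding while the exclusive latch is held.
              for (size_t i = prefix; i < page_records.size(); i++)
                part.table.erase(toy_fold(page_records[i]));
            }

            With a threshold of a few hundred to a thousand records, the common case never folds under the exclusive latch, which matches the compromise described above while keeping the stack usage bounded.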


            marko Marko Mäkelä added a comment:

            I checked some performance test results. As expected, I see a slight improvement in INSERT for the innodb_adaptive_hash_index=OFF case, due to MDEV-35312 and the fact that we now enable that code even when the adaptive hash index is disabled.

            Curiously, for the t_oltp_pointselect_innodb test in the innodb_adaptive_hash_index=ON case, there is hardly any improvement observed. This seems to suggest that we should include something like ahi.sh in our performance regression test suite.


            mleich Matthias Leich added a comment: RQG testing etc.


            marko Marko Mäkelä added a comment:

            I created the branch bb-10.11-MDEV-35049-rebase (and tested all 10 commits in it separately) as well as pushed the 10 commits to the main (currently 11.8) branch.


            People

              marko Marko Mäkelä
              monty Michael Widenius