MariaDB Server / MDEV-34759

buf_page_get_low() is unnecessarily acquiring exclusive latch on secondary index pages

Details

    Description

      With a simple Sysbench select_random_ranges workload (1 table, 1 connection, 10,000 rows), the following code is being exercised very frequently:

      diff --git a/storage/innobase/buf/buf0buf.cc b/storage/innobase/buf/buf0buf.cc
      index 44b093e93ed..ab720dc128f 100644
      --- a/storage/innobase/buf/buf0buf.cc
      +++ b/storage/innobase/buf/buf0buf.cc
      @@ -2849,6 +2849,7 @@ buf_page_get_low(
       	    && fil_page_get_type(block->page.frame) == FIL_PAGE_INDEX
       	    && page_is_leaf(block->page.frame)) {
       		block->page.lock.x_lock();
      +		sql_print_warning("InnoDB: x_locked %p", block);
       		state = block->page.state();
       		ut_ad(state < buf_page_t::READ_FIX);
       
      

      Here is the corresponding patch for the 10.5 branch:

      diff --git a/storage/innobase/buf/buf0buf.cc b/storage/innobase/buf/buf0buf.cc
      index f37a96d5a7c..7b8aa0934d2 100644
      --- a/storage/innobase/buf/buf0buf.cc
      +++ b/storage/innobase/buf/buf0buf.cc
      @@ -3198,6 +3198,7 @@ buf_page_get_low(
       	    && fil_page_get_type(fix_block->frame) == FIL_PAGE_INDEX
       	    && page_is_leaf(fix_block->frame)) {
       		rw_lock_x_lock_inline(&fix_block->lock, 0, file, line);
      +		sql_print_warning("InnoDB: x_locked %p", fix_block);
       
       		if (fix_block->page.ibuf_exist) {
       			fix_block->page.ibuf_exist = false;
      

      When using the oltp_read_only workload, the above code would only be exercised during sysbench prepare.

      The purpose of the above code is to apply buffered changes to a secondary index leaf page if such changes exist. This logic was originally implemented in MDEV-19514. Note: the change buffer was removed in MariaDB Server 11.0 in MDEV-29694, and therefore 11.x is not affected by this.

      The problem exposed by the select_random_ranges workload is that even though only a shared latch was requested, we first acquire an exclusive latch. This could be improved in a number of ways (a sketch combining them follows the list):

      • If a buffer-fix is being requested (rw_latch=RW_NO_LATCH), skip this logic. This mode should not be used together with allow_ibuf_merge=true.
      • Acquire the latch that was requested, and if necessary, temporarily upgrade it to exclusive so that the buffered changes can be merged. Upgrading a latch used to be deadlock-prone before the locks were refactored in MDEV-24142 (MariaDB Server 10.6). After the refinement in MDEV-34178, I think the needed upgrade step is trivial to implement.
      • If ibuf.empty holds (the change buffer was found to be empty), it does not make any sense to try a change buffer merge.
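
      A minimal, self-contained sketch of how these checks could be combined, for illustration only: page_view, should_try_ibuf_merge() and the ibuf_empty flag are hypothetical stand-ins, while RW_NO_LATCH, allow_ibuf_merge, FIL_PAGE_INDEX, page_is_leaf() and ibuf.empty come from the source.

      #include <cstdint>

      // Toy model only: page_view and should_try_ibuf_merge() are
      // stand-ins for InnoDB internals, not real buf0buf.cc code.
      enum rw_latch_t { RW_NO_LATCH, RW_S_LATCH, RW_X_LATCH };
      constexpr uint16_t FIL_PAGE_INDEX = 17855; // value from fil0fil.h

      struct page_view {
        uint16_t type;    // what fil_page_get_type(frame) would return
        bool     is_leaf; // what page_is_leaf(frame) would return
      };

      // Gate the change buffer merge on all three conditions at once.
      bool should_try_ibuf_merge(const page_view &page, rw_latch_t rw_latch,
                                 bool allow_ibuf_merge, bool ibuf_empty)
      {
        if (!allow_ibuf_merge          // caller did not ask for a merge
            || rw_latch == RW_NO_LATCH // buffer-fix only: skip this logic
            || ibuf_empty)             // ibuf.empty holds: nothing to merge
          return false;
        // The second improvement would then acquire the *requested* latch
        // and only try to upgrade it to exclusive when this returns true.
        return page.type == FIL_PAGE_INDEX && page.is_leaf;
      }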

          Activity

            marko Marko Mäkelä added a comment:

            I realized that a deadlock on an upgrade from shared to exclusive latch is possible if another thread is already waiting for an exclusive latch. After MDEV-24142 and MDEV-34178, an exclusive latch would be acquired as follows:

            1. Acquire the writer mutex.
            2. Set the WRITER flag in the readers word.
            3. If the readers word was not 0 (and is now larger than WRITER), wait until all current shared latch holders have decremented readers.

            When we upgrade a latch from shared, we’d be holding one share of readers. The next logical step would be to acquire the writer mutex and then invoke u_wr_upgrade() to update the nonzero readers word to WRITER. If we wait for the writer mutex while holding a readers reference, we will easily get into deadlock with other threads.

            The solution is to employ a trylock operation: If the writer mutex cannot be acquired without waiting, we’d have to release the shared latch and acquire an exclusive one normally.
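
            For illustration, here is a self-contained toy model of the scheme just described. It is not the real MariaDB ssux_lock, and it spins where the real code would block on a futex:

            #include <atomic>
            #include <cstdint>
            #include <mutex>

            // Toy latch: a writer mutex plus a "readers" word whose top
            // bit is the WRITER flag, mirroring the three steps above.
            class toy_latch {
              static constexpr uint32_t WRITER = 1U << 31;
              std::mutex writer;                 // step 1: serializes writers
              std::atomic<uint32_t> readers{0};  // share count + WRITER flag

            public:
              void s_lock() {
                for (;;) {
                  uint32_t r = readers.load(std::memory_order_relaxed);
                  if (!(r & WRITER) &&
                      readers.compare_exchange_weak(r, r + 1,
                                                    std::memory_order_acquire))
                    return; // got a share; otherwise retry
                }
              }
              void s_unlock() { readers.fetch_sub(1, std::memory_order_release); }

              void x_lock() {
                writer.lock();                               // step 1
                if (readers.fetch_or(WRITER,                 // step 2
                                     std::memory_order_acquire))
                  while (readers.load(std::memory_order_acquire) != WRITER)
                    ;                                        // step 3
              }
              void x_unlock() {
                readers.store(0, std::memory_order_release);
                writer.unlock();
              }

              // Deadlock-free upgrade: never wait for the writer mutex
              // while still holding our own share of readers.
              bool s_x_upgrade_try() {
                if (!writer.try_lock())
                  return false; // caller must s_unlock() and x_lock() instead
                // Convert our own share into the WRITER flag in one step.
                readers.fetch_add(WRITER - 1, std::memory_order_acquire);
                while (readers.load(std::memory_order_acquire) != WRITER)
                  ; // wait only for already granted shared latches
                return true;
              }
            };

            When s_x_upgrade_try() returns false, the caller releases its shared latch and acquires the exclusive latch normally, revalidating the page state afterwards.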


            marko Marko Mäkelä added a comment:

            I filed https://github.com/MariaDB/server/pull/3467 and posted results from a quick performance test (1 table, 10000 records, 20 threads on a 2-socket, 10-thread, 2-way hyperthreading box):

            revision       QPS (select_random_ranges)
            10.6 baseline  43141.66
            patch          70529.46

            I can imagine that the difference could be even larger when employing more threads.


            mdcallag Mark Callaghan added a comment:

            On a medium server (Google Cloud c2d high-compute with 16 cores, 64G RAM, Ubuntu 22.04) running cached sysbench, I see some good results. My usage of sysbench is explained here.

            Tests are run with 12 clients.

            I compared two builds with 10.6.19, using the names from the gists linked below:

            • ma1006_bbperfpkgtest -> the bb-10.6-performance-pkgtest branch
            • ma1006_mdev34759 -> the latest from remotes/origin/10.6-MDEV-34759; version_source_revision is this

            I name things as Version.Config where Version is one of:

            • ma100619_rel_withdbg - MariaDB 10.6.19 using -DBUILD_TYPE=RelWithDebInfo
            • ma1006_bbperfpkgtest - explained above
            • ma1006_mdev34759 - explained above

            And config is one of:

            • z11a_c24r64 - has innodb_change_buffering=none
            • z11a2_c24r64 - has innodb_change_buffering=all

            Below I focus on the results for ma1006_mdev34759 compared to 10.6.19.

            With 8 tables, 10M rows/table

            • innodb_change_buffering=none - results are similar to 10.6.19 (see here)
            • innodb_change_buffering=all - results are similar to 10.6.19 (see here)

            With 1 table, 80M rows

            • innodb_change_buffering=none - results are similar to 10.6.19 (see here)
            • innodb_change_buffering=all - results are similar to 10.6.19 (see here)

            marko Marko Mäkelä added a comment:

            Thank you, Mark. The impact of this bug is that accessing secondary index leaf pages in the buffer pool is more expensive than it needs to be. Therefore, any workload that reads from secondary indexes should be affected. I would expect the impact to be largest for covering index reads when multiple connections are accessing the same index pages concurrently.


            mdcallag Mark Callaghan added a comment:

            I will repeat sysbench using the --secondary option to stress secondary indexes.


            debarun Debarun Banerjee added a comment:

            marko Thanks for the patch.

            The fix looks correct and good to me. It is a good observation that in the current design we must only try the S -> X upgrade and must not wait, which could lead to a deadlock. I have some minor points for your consideration.

            One way people implement the upgrade in a wait-queue based model is to jump the queue ahead of the other waiting locks and wait only for the already granted locks, eliminating any deadlock possibility. This is not possible here with two-level locking, where a waiting X lock could have already passed the first step and be waiting on the readers; it is not possible to jump ahead of it. We don't need to worry, though, as releasing and reacquiring should be fine here, as implemented in the patch.


            mdcallag Mark Callaghan added a comment:

            More results from a medium server (Google Cloud c2d high-compute with 16 cores, 64G RAM, Ubuntu 22.04) and cached sysbench. My usage of sysbench is explained here.

            Tests are run with 12 clients and the --secondary option for sysbench so that "id" has a secondary index instead of a PK index.

            I compared two builds with 10.6.19, using the names from the gists linked below:

            • ma1006_bbperfpkgtest -> the bb-10.6-performance-pkgtest branch
            • ma1006_mdev34759 -> the latest from remotes/origin/10.6-MDEV-34759; version_source_revision is this

            I name things as Version.Config where Version is one of:

            • ma100619_rel_withdbg - MariaDB 10.6.19 using -DBUILD_TYPE=RelWithDebInfo
            • ma1006_bbperfpkgtest - explained above
            • ma1006_mdev34759 - explained above

            And config is one of:

            • z11a_c24r64 - has innodb_change_buffering=none
            • z11a2_c24r64 - has innodb_change_buffering=all

            For all cases, the "hot-point" benchmark is much faster with both ma1006_bbperfpkgtest and ma1006_mdev34759 when compared to unchanged 10.6.19.

            Below I focus on the results for ma1006_mdev34759 compared to 10.6.19, which is "col 2" in the results:

            With 8 tables, 10M rows/table

            • innodb_change_buffering=none - results are similar to 10.6.19 (see here)
            • innodb_change_buffering=all - results are similar to 10.6.19 (see here)

            With 1 table, 80M rows

            • innodb_change_buffering=none - results are similar to 10.6.19 (see here)
            • innodb_change_buffering=all - results are similar to 10.6.19 (see here)

            marko Marko Mäkelä added a comment:

            For the record, this fix depends on MDEV-34178.


            People

              Assignee: marko Marko Mäkelä
              Reporter: marko Marko Mäkelä
              Votes: 1
              Watchers: 7

