MariaDB Server / MDEV-34515

Contention between secondary index UPDATE and purge due to large innodb_purge_batch_size

Details

    Description

      When I run sysbench oltp_update_index on current 10.6-11.6 with a very small amount of data (1 row in 1 table), performance regresses compared to 10.5.

      The easiest way for me to reproduce it is to run the server with --innodb-flush-log-at-trx-commit=2 or 0 and a single benchmark user. I also see the same effect with --innodb-flush-log-at-trx-commit=1 and multiple users.

      Benchmark setup

      server:

      mysqld --innodb-flush-log-at-trx-commit=2 #all defaults otherwise
      

      sysbench prepare

      sysbench oltp_update_index --table-size=1 --tables=1 --threads=1 --mysql-user=root --mysql-db=mysql --report-interval=1 --time=120  --mysql-socket=/tmp/mysql.sock  --histogram prepare
      

      sysbench run

      sysbench oltp_update_index --table-size=1 --tables=1 --threads=1 --mysql-user=root --mysql-db=mysql --report-interval=1 --time=120  --mysql-socket=/tmp/mysql.sock  --histogram run
      

      Results 10.5 vs 10.6

      version | tps      | latency ms (95%) | max history length | flamegraph
      10.5    | 16501.81 | 0.07             | 95                 | 10_5_flame.svg
      10.6    | 592.19   | 5.28             | 70332              | 10_6_flame.svg

      As one can see from the attached 10_6_flame.svg, in 10.6 the purge is abnormally busy, with 85% of all CPU samples in purge. However, no matter how busy, purge does not achieve its goal, as Innodb_history_list_length grows continuously up to about 70K within 2 minutes; I think no record is actually purged. Additionally, foreground threads run into some sort of buffer pool contention, with ssux_lock_impl<true>::wr_wait from buf_page_get_low being the most prominent place, and ssux_lock_impl<true>::wr_wait (from another caller) the second most prominent.
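      For reference, the purge backlog can be watched from a second session while sysbench runs; a minimal sketch (assuming the same socket and root user as in the commands above):

      # poll the purge backlog once per second during the benchmark
      while true; do
        mysql -uroot --socket=/tmp/mysql.sock -e "SHOW GLOBAL STATUS LIKE 'Innodb_history_list_length'"
        sleep 1
      done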

      In contrast, as one can see from 10_5_flame.svg, in 10.5 the CPU time is spent roughly evenly between purge and foreground processing in do_command, and purge works, keeping the history length tiny during the benchmark, with overall 25x better performance (although purge seems to have relatively high overhead waking and waiting for workers doing their tiny tasks, but this is for another day).
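      For anyone wanting to produce similar flamegraphs, one possible recipe (assuming Linux perf and Brendan Gregg's FlameGraph scripts on PATH; not necessarily the exact commands used for the attachments):

      # sample all CPUs with call stacks for 60 seconds of the benchmark
      perf record -F 99 -a -g -- sleep 60
      # fold the stacks and render an interactive SVG
      perf script | stackcollapse-perf.pl | flamegraph.pl > mariadb_flame.svg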

      What causes it, according to "git bisect"

      I ran a manual bisect, and it points to aa719b5010c (MDEV-32050: Do not copy undo records in purge), which first appeared in 10.6.16.
      I had to run the benchmark 2 or 3 times for this commit, from a cold start each time: on the first run it started slow but inexplicably showed better performance at about the middle of the benchmark; on the second run it was slow all the way.
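      For reference, the bisect loop looks roughly like this (a sketch only; the endpoint tags are illustrative, and each step means rebuilding the server and rerunning the sysbench workload above):

      git bisect start
      git bisect bad mariadb-10.6.16     # regression observed here
      git bisect good mariadb-10.6.15    # behaved well
      # rebuild, rerun the benchmark, then mark each step:
      git bisect good                    # or: git bisect bad
      git bisect reset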

      So, the results for that commit vs. the previous commit 88733282fb15c80f0bd722df0041d06ad90c26b0 are below:

      commit              | tps      | latency ms (95%) | max history length | flamegraph
      aa719b5010c (bad)   | 592.96   | 5.37             | 69069              | aa719b5010c9_flame2.svg
      88733282fb15 (good) | 19003.11 | 0.07             | 217                | 88733282fb15_flame.svg

      In fact, the "good" commit before the regression was 20% better than 10.5, and purge uses only 7% of the CPU while managing to keep the history length low.

      The "bad" flamegraph is pretty much the same as current 10.6, purge using up most of the CPU.

      Can it be reproduced differently?

      I could also reproduce it with innodb_flush_log_at_trx_commit=0.
      I can also reproduce it by increasing concurrency (--threads=10 in sysbench) and leaving innodb_flush_log_at_trx_commit at its default (1).
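      In terms of commands, these variants differ from the setup above only in the server flag and the sysbench thread count, roughly:

      # variant 1: relaxed durability, single benchmark user
      mysqld --innodb-flush-log-at-trx-commit=0
      # variant 2: default durability (innodb_flush_log_at_trx_commit=1), more users
      mysqld
      sysbench oltp_update_index --table-size=1 --tables=1 --threads=10 --mysql-user=root --mysql-db=mysql --report-interval=1 --time=120 --mysql-socket=/tmp/mysql.sock --histogram run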

      The bad effect seems to vanish once more rows are updated, though maybe I just did not find a way to reproduce it in that case.


          Activity

            monty Michael Widenius added a comment - After this change, will users that use innodb_purge_batch_size=1000 face any new issues with performance?

            marko Marko Mäkelä added a comment - monty, depending on the type of workload, configuring a larger innodb_purge_batch_size may really hurt things. With the workload given in the Description, wlad and I observed a difference of one or two orders of magnitude between using a value that corresponds to the number of rollback segments and a value that is 1 larger. This is an extreme workload, and nobody should really be defining any secondary indexes on a tiny table. With more general workloads, the impact is smaller, but it does exist.
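            (For anyone wanting to try such a value on an affected server: innodb_purge_batch_size is a dynamic variable, so it can be lowered without a restart; 127 below is illustrative, matching the value the default was later changed to.)

            # check the current setting, then lower it on a running server
            mysql -uroot -e "SELECT @@GLOBAL.innodb_purge_batch_size"
            mysql -uroot -e "SET GLOBAL innodb_purge_batch_size = 127"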

            mdcallag Mark Callaghan added a comment - Results below are from a medium server (Google Cloud c2d high-compute with 16 cores, 64G RAM, Ubuntu 22.04) and a cached sysbench workload. My usage of sysbench is explained here.

            Tests are run with 12 clients and the --secondary option for sysbench so that "id" has a secondary index instead of a PK index.
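            (With the upstream sysbench Lua scripts, --secondary takes effect at prepare time; a rough sketch of such an invocation, not necessarily the exact commands from the linked writeup:)

            # create tables whose "id" column is backed by a secondary index instead of a PRIMARY KEY
            sysbench oltp_update_index --tables=8 --table-size=10000000 --threads=12 --secondary=on --mysql-user=root prepare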

            I compared two builds with 10.6.19, using the names from the gists linked below

            • ma1006_bbperfpkgtest -> the bb-10.6-performance-pkgtest branch
            • ma1006_mdev34759 -> the latest from remotes/origin/10.6-MDEV-34759; version_source_revision is this

            I name things as Version.Config where Version is one of:

            • ma100619_rel_withdbg - MariaDB 10.6.19 using -DBUILD_TYPE=RelWithDebInfo
            • ma1006_bbperfpkgtest - explained above
            • ma1006_mdev34759 - explained above

            And config is one of:

            • z11a_c24r64 - has innodb_change_buffering=none
            • z11a2_c24r64 - has innodb_change_buffering=all

            For all cases, the "hot-point" benchmark is much faster with both ma1006_bbperfpkgtest and ma1006_mdev34759 when compared to unchanged 10.6.19. The improvement is larger in the 1 table case where both are almost 6X faster than 10.6.19. One reason for the improvement is a ~30X reduction in the context switch rate (see here)
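            (The context switch rate can be confirmed outside the server as well; one way, assuming the Linux sysstat tools and a mariadbd process name, is:)

            # whole-system context switches per second (the "cs" column)
            vmstat 1
            # per-process voluntary/involuntary context switches for the server
            pidstat -w -p $(pidof mariadbd) 1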

            Below I focus on the results for ma1006_mdev34759 compared to 10.6.19, which is "col 1" in the results:

            With 8 tables, 10M rows/table

            • innodb_change_buffering=none - results are similar to 10.6.19 (see here)
            • innodb_change_buffering=all - results are similar to 10.6.19 (see here)

            With 1 table, 80M rows

            • innodb_change_buffering=none - results are similar to 10.6.19 (see here) except for update-index which is 2.19X faster and write-only which is 1.75X faster in ma1006_bbperfpkgtest (see here)
            • innodb_change_buffering=all - results are similar to 10.6.19 (see here) except for update-index which is 2.24X faster and write-only which is 1.80X faster in ma1006_bbperfpkgtest (see here)

            marko Marko Mäkelä added a comment - This blog post about Huawei GaussDB (a MySQL fork) mentions a new parameter innodb_rds_purge_subbatch_size=150. I think that MySQL 5.7 onwards would use 95 or 96 persistent rollback segments; MDEV-12289 in MariaDB prevented that regression. I assume that they have other contention points elsewhere, so that the index page latch contention between purge and DML did not become a problem. Or maybe they did not test with a few tiny tables.

            The idea of a “purge subbatch” would seem to be incompatible with purge_sys_t::close_and_reopen() as well as the latch-free purge_sys.pages lookup in purge_sys_t::view_guard::get().

            ralf.gebhardt Ralf Gebhardt added a comment -

            This fix includes a change of the default for innodb_purge_batch_size from 1000 to 127


            People

              marko Marko Mäkelä
              wlad Vladislav Vaintroub