Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-34178

Performance regression with IO-bound insert benchmark

Details

    Description

      This is similar to regressions I found with CPU-bound sysbench (MDEV-33966), but I opened a separate issue because this occurs with IO-bound Insert Benchmark.

      I ran an IO-bound Insert Benchmark (IO-bound because the working set and database are much larger than memory) on a small server (8 cores, 16G RAM) to compare MariaDB LTS releases for 10.2, 10.3, 10.4, 10.5, 10.6, 10.11 and upcoming 11.4 with MySQL 5.6, 5.7 and 8.0.

      A result for a CPU-bound benchmark is here which covers a test that uses a smaller database that can be cached and isn't IO-bound. In the cased case, MariaDB doesn't have large regressions from 10.2 to 11.4. Here, with an IO-bound setup there are regressions.

      The IO-bound tests were run for 1 and 4 clients:

      The way I label results context on the DBMS version and my.cnf

      • ma101107_rel.cz11a_bee - MariaDB 10.11.7 with the cz11a_bee config that uses innodb_flush_method=O_DIRECT_NO_FSYNC and innodb_change_buffering=none
      • ma101107_rel.cz11b_bee - MariaDB 10.11.7 with the cz11b_bee config that uses innodb_flush_method=O_DIRECT and innodb_change_buffering=none
      • ma110401_rel.cz11b_bee - MariaDB 10.11.7 with the cz11b_bee config that uses innodb_flush_method=O_DIRECT and the InnoDB change buffer has been removed
      • my8036_rel.cz11a_bee - MySQL 8.0.36 with the cz11a_bee config that uses innodb_flush_method=O_DIRECT_NO_FSYNC and innodb_change_buffering=all
      • my8036_rel.cz11d_bee - MySQL 8.0.36 with the cz11d_bee config that uses innodb_flush_method=O_DIRECT_NO_FSYNC and innodb_change_buffering=none

      Note that the cz11d_bee config for MySQL 8.0.36 is similar to the cz11a_bee config for MariaDB.

      My claim about the regressions is based on the following

      • start with the MariaDB vs MySQL comparison for 1 client and 4 clients and the results for MySQL 8.0.36 (my8036_rel.cz11a_bee and my8036_rel.cz11d_bee) are much better than for MariaDB 10.11.7 (ma101107_rel.cz11a_bee) and 11.4.1 (ma110401_rel.cz11b_bee).
      • then look at results for MariaDB LTS releases with 1 client and 4 clients and see some regressions from ma100433 (10.4.33) to ma100524 (10.5.24) and larger regressions from ma100524 to ma100617 (10.6.17)
      • then look at the HW metrics for MariaDB LTS releases from the 1 client setup. These are values from vmstat and iostat normalized by query and insert rates to understand HW efficiency. For the write heavy benchmark steps from 10.4.33 through 10.6.17 I see a ~20% increase in context switches per insert (cspq) and ~20% decrease in CPU per insert (cpupq) – see for l.i1 and for l.i2. Most of the change is from 10.5.24 to 10.6.17. I assume this is a result of the changes in 10.6 to replace some mutexes and rw-locks from spinning to not-spinning. So there is less CPU burned, but more lock waiters are going to sleep.
      • the read-write benchmark steps also show a similar pattern as the write rate increases. See the 1 client results for range queries and point queries when the background write rate is 1000/s. Although here I see an increase in cspq (more context switches per query == more threads going to sleep) but not a large decrease in cpupq (CPU per query)

      Attachments

        1. image-2024-06-26-16-08-14-626.png
          81 kB
          Vladislav Vaintroub
        2. MDEV-34178_g1_g2.pdf
          52 kB
          Axel Schwenke
        3. MDEV-34178.pdf
          76 kB
          Axel Schwenke
        4. test_output_sudo2.txt
          0.4 kB
          Kirill Perov
        5. test_output1.txt
          3 kB
          Kirill Perov
        6. test_output3.txt
          77 kB
          Kirill Perov
        7. update_index_10.11.txt
          106 kB
          Vladislav Vaintroub
        8. update_index_10.4.txt
          103 kB
          Vladislav Vaintroub

        Issue Links

          Activity

            @kirill.perov@mariadb.com can you be more specific

            The perf reports that I publish include links to charts with QPS reported per 1-second interval. To find them grep for "Graphs for performance per 1-second interval"

            From looking at the charts now I tend to see more variance in the MariaDB results, and variance isn't good.

            mdcallag Mark Callaghan added a comment - @kirill.perov@mariadb.com can you be more specific The perf reports that I publish include links to charts with QPS reported per 1-second interval. To find them grep for "Graphs for performance per 1-second interval" From looking at the charts now I tend to see more variance in the MariaDB results, and variance isn't good.

            This is my last big batch of results from the Insert Benchmark because my contract for testing InnoDB has run out. I have results from a big server (48 cores) and a small server (8 cores).

            There are 1 to 3 concurrent connections per client and at most 2 busy connections per client. On the big server the test is run with 20 clients. On the small server the test is run with 1 and 6 clients.

            The test is also repeated with cached (database fits in InnoDB block cache) and IO-bound (working set is larger than memory) workloads.

            The my .cnf files are in subdirectories here for the small server and here for the big server. I try to make them as similar as possible – for example, the InnoDB change buffer is disabled for the older versions that still supported it.

            For the big server with a cached workload the base case is MariaDB 10.4.34 and I tested the latest version from each LTS release after that.

            • range queries (qr*) has no regression from 10.4 to 11.7
            • point queries (qp*) has a small regression (11.7.1 gets about 87% of the QPS vs 10.4.34)
            • initial load (l.i0) shows a large regression (11.7.1 gets 56% of the throughput vs 10.4.34). Most of the regression arrived in 10.5 and the root cause might be CPU overhead which increased a lot from 10.4 through 10.11 (see here
            • random inserts+deletes (l.i1, l.i2) has a large regression (11.7.1 gets <= 68^% of the throughput vs 10.4.34). Most of the regression arrived in 10.5 and CPU overhead might be the root cause (see here)

            For the big server with an IO-bound workload the base case is MariaDB 10.4.34

            • none of the versions were able to sustain the target insert rates for qr500 & qp500 (10000/s) or for qr1000 & qp1000 (20000/s), see here and look at the third table
            • range query performance (qr*) improved from 10.4 to 11.7 but that is harder to judge given the failure to sustain the target write rates
            • point query performance is also hard to judge given the target write rate failures, but for qp100 MariaDB 11.7 does the same as 10.4, and then 11.7 has regressions for qp500 and qp1000
            • the initial load (l.i0) which isn't IO-bound is similar to the result above for the cached workload. There is a large regression from 10.4 to 11.7, most of that arrived in 10.5 and CPU overheads might be the issue see cpupq here
            • the l.i1 and l.i2 steps that do random inserts+deletes have a small regression for l.i1 and no regression for l.i2

            For the small server with a cached workload and 1 client – see here the base case is MariaDB 10.2.44

            • there is a small regression for the initial load (l.i0) where 11.7.1 gets 88% of the throughput vs 10.2.44
            • there are small regressions for range queries (qr*) where 11.7.1 gets ~90% of the throughput vs 10.2.44
            • there are medium regressions for point queries (qp*) where 11.7.1 gets ~80% of the throughput vs 10.2.44
            • for random inserts+deletes (l.i1, l.i2) 11.7.1 gets ~10% more throughput than 10.2.44
            • MariaDB 11.7.1 and MySQL 8.0.40 have similar performance

            For the small server with a cached workload and 6 client – see here the base case is MariaDB 10.2.44

            • there is a small regression for the initial load (l.i0) where 11.7.1 gets 85% of the throughput vs 10.2.44
            • there are no regressions for range queries (qr*)
            • there are small regressions for point queries (qp*) where 11.7.1 gets ~90% of the throughput vs 10.2.44
            • for random inserts+deletes (l.i1, l.i2) 11.7.1 gets less throughput than 10.2.44 on l.i1 but more on l.i2
            • MariaDB 11.7.1 and MySQL 8.0.40 have similar performance except on the l.i1 and l.i2 steps (random inserts + deletes) where MariaDB is about 2X faster

            For the small server with an IO-bound workload and 1 client – see here
            the base case is MariaDB 10.2.44

            • there is a small regression for the initial load (l.i0) where 11.7.1 gets 89% of the throughput vs 10.2.44
            • there are small regressions for range queries (qr*) where 11.7.1 gets ~90% of the throughput vs 10.2.44
            • there are no regressions for point queries (qp*)
            • for random inserts+deletes (l.i1, l.i2) 11.7.1 gets about 80% of the throughput vs 10.2.44. A large part of this regression arrived in 11.4
            • for MariaDB 11.7.1 vs MySQL 8.0.40, MariaDB is faster on the initial load (l.i0), MySQL is faster on random inserts + deletes (l.i1, l.i2), MySQL is faster on range queries and they are similar on point queries

            For the small server with an IO-bound workload and 6 clients – see here
            the base case is MariaDB 10.2.44

            • MariaDB doesn't sustain the target write rates (6000/s) for the qr1000 and qp1000 steps with 10.5, and 11.4 through 11.7
            • there is a medium regression for the initial load (l.i0) where 11.7.1 gets 81% of the throughput vs 10.2.44
            • there are medium regressions for range queries (qr*) where 11.7.1 gets 91%, 81% and 82% of the throughput vs 10.2.44. Recent versions don't sustain the target write rate for qr1000.
            • there are medium regressions for point queries (qr*) where 11.7.1 gets 101%, 71% and 71% of the throughput vs 10.2.44. Recent versions don't sustain the target write rate for qp1000.
            • for random inserts+deletes (l.i1, l.i2) 11.7.1 gets 30% and 78% of the throughput vs 10.2 for l.i1 and l.i2. From the metrics for l.i1 nothing stands out as CPU/operation (cpupq) and context switches/operation (csps) don't change much. Starting in 11.4 the average context switch rate (csps) and the average CPU utilization (cpups) starting both drop a lot. I assume there is some kind of stall (perhaps mutex contention) or thread scheduling issue. I will repeat a few tests to get thread stacks with PMP.
            • for MariaDB 11.7.1 vs MySQL 8.0.40, perf is similar on the initial load (l.i0), MySQL is faster on random inserts + deletes (l.i1, l.i2), MySQL is a lot faster on point and range queries.
            mdcallag Mark Callaghan added a comment - This is my last big batch of results from the Insert Benchmark because my contract for testing InnoDB has run out. I have results from a big server (48 cores) and a small server (8 cores). There are 1 to 3 concurrent connections per client and at most 2 busy connections per client. On the big server the test is run with 20 clients. On the small server the test is run with 1 and 6 clients. The test is also repeated with cached (database fits in InnoDB block cache) and IO-bound (working set is larger than memory) workloads. The my .cnf files are in subdirectories here for the small server and here for the big server . I try to make them as similar as possible – for example, the InnoDB change buffer is disabled for the older versions that still supported it. For the big server with a cached workload the base case is MariaDB 10.4.34 and I tested the latest version from each LTS release after that. range queries (qr*) has no regression from 10.4 to 11.7 point queries (qp*) has a small regression (11.7.1 gets about 87% of the QPS vs 10.4.34) initial load (l.i0) shows a large regression (11.7.1 gets 56% of the throughput vs 10.4.34). Most of the regression arrived in 10.5 and the root cause might be CPU overhead which increased a lot from 10.4 through 10.11 ( see here random inserts+deletes (l.i1, l.i2) has a large regression (11.7.1 gets <= 68^% of the throughput vs 10.4.34). Most of the regression arrived in 10.5 and CPU overhead might be the root cause ( see here ) For the big server with an IO-bound workload the base case is MariaDB 10.4.34 none of the versions were able to sustain the target insert rates for qr500 & qp500 (10000/s) or for qr1000 & qp1000 (20000/s), see here and look at the third table range query performance (qr*) improved from 10.4 to 11.7 but that is harder to judge given the failure to sustain the target write rates point query performance is also hard to judge given the target write rate failures, but for qp100 MariaDB 11.7 does the same as 10.4, and then 11.7 has regressions for qp500 and qp1000 the initial load (l.i0) which isn't IO-bound is similar to the result above for the cached workload. There is a large regression from 10.4 to 11.7, most of that arrived in 10.5 and CPU overheads might be the issue see cpupq here the l.i1 and l.i2 steps that do random inserts+deletes have a small regression for l.i1 and no regression for l.i2 For the small server with a cached workload and 1 client – see here the base case is MariaDB 10.2.44 there is a small regression for the initial load (l.i0) where 11.7.1 gets 88% of the throughput vs 10.2.44 there are small regressions for range queries (qr*) where 11.7.1 gets ~90% of the throughput vs 10.2.44 there are medium regressions for point queries (qp*) where 11.7.1 gets ~80% of the throughput vs 10.2.44 for random inserts+deletes (l.i1, l.i2) 11.7.1 gets ~10% more throughput than 10.2.44 MariaDB 11.7.1 and MySQL 8.0.40 have similar performance For the small server with a cached workload and 6 client – see here the base case is MariaDB 10.2.44 there is a small regression for the initial load (l.i0) where 11.7.1 gets 85% of the throughput vs 10.2.44 there are no regressions for range queries (qr*) there are small regressions for point queries (qp*) where 11.7.1 gets ~90% of the throughput vs 10.2.44 for random inserts+deletes (l.i1, l.i2) 11.7.1 gets less throughput than 10.2.44 on l.i1 but more on l.i2 MariaDB 11.7.1 and MySQL 8.0.40 have similar performance except on the l.i1 and l.i2 steps (random inserts + deletes) where MariaDB is about 2X faster For the small server with an IO-bound workload and 1 client – see here the base case is MariaDB 10.2.44 there is a small regression for the initial load (l.i0) where 11.7.1 gets 89% of the throughput vs 10.2.44 there are small regressions for range queries (qr*) where 11.7.1 gets ~90% of the throughput vs 10.2.44 there are no regressions for point queries (qp*) for random inserts+deletes (l.i1, l.i2) 11.7.1 gets about 80% of the throughput vs 10.2.44. A large part of this regression arrived in 11.4 for MariaDB 11.7.1 vs MySQL 8.0.40, MariaDB is faster on the initial load (l.i0), MySQL is faster on random inserts + deletes (l.i1, l.i2), MySQL is faster on range queries and they are similar on point queries For the small server with an IO-bound workload and 6 clients – see here the base case is MariaDB 10.2.44 MariaDB doesn't sustain the target write rates (6000/s) for the qr1000 and qp1000 steps with 10.5, and 11.4 through 11.7 there is a medium regression for the initial load (l.i0) where 11.7.1 gets 81% of the throughput vs 10.2.44 there are medium regressions for range queries (qr*) where 11.7.1 gets 91%, 81% and 82% of the throughput vs 10.2.44. Recent versions don't sustain the target write rate for qr1000. there are medium regressions for point queries (qr*) where 11.7.1 gets 101%, 71% and 71% of the throughput vs 10.2.44. Recent versions don't sustain the target write rate for qp1000. for random inserts+deletes (l.i1, l.i2) 11.7.1 gets 30% and 78% of the throughput vs 10.2 for l.i1 and l.i2. From the metrics for l.i1 nothing stands out as CPU/operation (cpupq) and context switches/operation (csps) don't change much. Starting in 11.4 the average context switch rate (csps) and the average CPU utilization (cpups) starting both drop a lot. I assume there is some kind of stall (perhaps mutex contention) or thread scheduling issue. I will repeat a few tests to get thread stacks with PMP. for MariaDB 11.7.1 vs MySQL 8.0.40, perf is similar on the initial load (l.i0), MySQL is faster on random inserts + deletes (l.i1, l.i2), MySQL is a lot faster on point and range queries.

            The previous comment mentions there is a large regression for the l.i1 benchmark step that arrives in 11.4 for the IO-bound workload with 6 clients on a small server. This gist has thread stacks collected by PMP (they are aggregated over many samples to compare 11.10.10 and 11.4.4. What I see for 11.4.4 is that the most frequent stacks are from buffer pool flushing not keeping up (see here) in 11.4.4 but that doesn't show up for 10.11.10 (see here)

            mdcallag Mark Callaghan added a comment - The previous comment mentions there is a large regression for the l.i1 benchmark step that arrives in 11.4 for the IO-bound workload with 6 clients on a small server. This gist has thread stacks collected by PMP (they are aggregated over many samples to compare 11.10.10 and 11.4.4. What I see for 11.4.4 is that the most frequent stacks are from buffer pool flushing not keeping up ( see here ) in 11.4.4 but that doesn't show up for 10.11.10 ( see here )

            Thank you, mdcallag. Indeed, it looks like only 11.4.4 but not 10.11.10 (you typoed "11.10.10" once) is blocking because of running out of buffer pool. Apart from the removal of the change buffer (MDEV-29694), there shouldn’t be anything obvious in 11.4 that should make InnoDB slower. In fact, it is expected to be the opposite due to changes like MDEV-30216 and MDEV-32042.

            If you had enabled innodb_change_buffering in 10.11, that could be the simple explanation. But I saw innodb_change_buffering=none in some …/etc/my.cnf that I checked in https://github.com/mdcallag/mytools/blob/master/bench/conf/arc/oct24/c32r128/.

            If this is an apples-to-apples comparison (the change buffer is disabled in both), then we may have to experiment with some "hybrid" builds (such as applying the InnoDB from 10.11 to MariaDB Server 11.4 or vice versa) in order to narrow down the cause of this regression.

            marko Marko Mäkelä added a comment - Thank you, mdcallag . Indeed, it looks like only 11.4.4 but not 10.11.10 (you typoed "11.10.10" once) is blocking because of running out of buffer pool. Apart from the removal of the change buffer ( MDEV-29694 ), there shouldn’t be anything obvious in 11.4 that should make InnoDB slower. In fact, it is expected to be the opposite due to changes like MDEV-30216 and MDEV-32042 . If you had enabled innodb_change_buffering in 10.11, that could be the simple explanation. But I saw innodb_change_buffering=none in some …/etc/my.cnf that I checked in https://github.com/mdcallag/mytools/blob/master/bench/conf/arc/oct24/c32r128/ . If this is an apples-to-apples comparison (the change buffer is disabled in both), then we may have to experiment with some "hybrid" builds (such as applying the InnoDB from 10.11 to MariaDB Server 11.4 or vice versa) in order to narrow down the cause of this regression.

            All of the my.cnf I use have innodb_change_buffering=none unless I call it out. I try to make the configurations as similar as possible.

            I tested (IO-bound, not IO-bound) X (big server, small server). And for the small server test at 1 and 6 clients. The problem only shows up on (IO-bound, small server, 6 clients). In most cases I try to avoid over-subscribing the CPU where that means the number of busy connections is >= number of CPU cores. But in the 6 client case it is possible that I did over-subscribe which is why the problem only shows up on that case.

            Looking at the insert rate by time during l.i1 for 10.11.10 and for 11.4.4 what I see is that things go bad fast for 11.4.4.

            mdcallag Mark Callaghan added a comment - All of the my.cnf I use have innodb_change_buffering=none unless I call it out. I try to make the configurations as similar as possible. I tested (IO-bound, not IO-bound) X (big server, small server). And for the small server test at 1 and 6 clients. The problem only shows up on (IO-bound, small server, 6 clients). In most cases I try to avoid over-subscribing the CPU where that means the number of busy connections is >= number of CPU cores. But in the 6 client case it is possible that I did over-subscribe which is why the problem only shows up on that case. Looking at the insert rate by time during l.i1 for 10.11.10 and for 11.4.4 what I see is that things go bad fast for 11.4.4.

            People

              marko Marko Mäkelä
              mdcallag Mark Callaghan
              Votes:
              0 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.