MariaDB Server / MDEV-33966

sysbench performance regression with concurrent workloads

Details

    • Type: Bug
    • Status: Stalled
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 10.5, 10.6, 10.7(EOL), 10.8(EOL), 10.9(EOL), 10.10(EOL), 10.11, 11.0(EOL), 11.1(EOL), 11.2(EOL), 11.3(EOL), 11.4, 11.5(EOL), 11.6(EOL)
    • Fix Version/s: 10.6, 10.11, 11.4
    • Environment: ubuntu 22.04

    Description

      While I haven't seen significant performance regressions when comparing modern MariaDB (11.4, 10.11) with older MariaDB via sysbench with low-concurrency workloads (see https://smalldatum.blogspot.com/2024/04/sysbench-on-less-small-server-mariadb.html), I have seen perf regressions once I use workloads with some concurrency.

      This will take a few days to properly document.

      From a server with 8 cores and sysbench run with 4 threads ...

      • the numbers in the table are the throughput relative to MariaDB 10.2.44 (x.ma100244_rel.z11a_bee.pk1), where 1.0 means the same, < 1.0 means a regression and > 1.0 means an improvement (a sketch of this computation follows below)
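
      To make the relative-throughput metric concrete, here is a minimal sketch of how one such value can be computed from two sysbench QPS results; the numbers and variable names below are hypothetical, not taken from the actual runs.

        # Hypothetical example: relative throughput = QPS(new version) / QPS(MariaDB 10.2.44)
        qps_base=23000   # e.g. update-index QPS measured on 10.2.44
        qps_new=5750     # e.g. update-index QPS measured on 11.4.1
        awk -v b="$qps_base" -v n="$qps_new" 'BEGIN { printf "%.2f\n", n / b }'   # prints 0.25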

      If I use 0.8 as a cutoff, meaning some version gets less than 80% of the throughput relative to MariaDB 10.2, then from column 6 (col-6) the problem microbenchmarks are:

      • update-index_range=100, relative throughput is 0.25 in 11.4.1, problem arrives in 10.3
      • update-one_range=100, relative throughput is 0.65 in 11.4.1, problem arrives in 10.6
      • write-only_range=10000, relative throughput is 0.77 in 11.4.1, problem arrives in 10.3

      Next step for this is to get flamegraphs and maybe PMP stacks.

      The table relies on fixed-width fonts to be readable, but the "preformatted" option in JIRA doesn't do what I want it to do, so the data is here: https://gist.github.com/mdcallag/f57dce5dd1b2778d3bb1617e4ac8cae4

      Next up is a server with 2 sockets and 12 cores/socket, and the benchmark was run with 16 threads. The results are here: https://gist.github.com/mdcallag/e400b266571f3fcd65378493ecb9465f. Again, using 0.8 as a cutoff and looking at col-6 (MariaDB 11.4.1), the problem microbenchmarks are:

      • insert_range=100, relative throughput is 0.73 in 11.4.1, there are gradual regressions starting in 10.3, but the largest are from 10.11 and 11.4
      • update-index_range=100, relative throughput is 0.18 in 11.4.1, problem starts in 10.5 and 10.11->11.4 is the biggest drop
      • update-inlist_range=100, relative throughput is 0.56 in 11.4.1, problem is gradual from 10.3 through 11.4
      • update-nonindex_range=100, relative throughput is 0.69 in 11.4.1, problems arrive in 10.11 and 11.4
      • update-one_range=100, relative throughput is 0.61 in 11.4.1, problem arrives in 10.6
      • update-zipf_range=100, relative throughput is 0.75 in 11.4.1, problem arrives in 11.4
      • write-only_range=10000, relative throughput is 0.59 in 11.4.1, problems arrive in 10.11 and 11.4

      Finally, a server with 32 cores (AMD Threadripper) where the benchmark was run with 24 threads. The results are here: https://gist.github.com/mdcallag/e870bea3196cfa8e917874aede695630, and the problem microbenchmarks are:

      • points-notcovered-pk_range=100, relative throughput is 0.65 in 11.4.1, problem arrives in 10.5
      • points-notcovered-si_range=100, relative throughput is 0.77 in 11.4.1, problem arrives in 10.5
      • random-points_range=1000, relative throughput is 0.65 in 11.4.1, problem arrives in 10.5
      • random-points_range=100, relative throughput is 0.65 in 11.4.1, problem arrives in 10.5
      • range-notcovered-si_range=100, relative throughput is 0.59 in 11.4.1, problem arrives in 10.5
      • read-write_range=10, relative throughput is 0.79 in 11.4.1, problem arrives in 10.11
      • update-index_range=100, relative throughput is 0.64 in 11.4.1, problem arrives in 10.11 and 11.4
      • update-inlist_range=100, relative throughput is 0.61 in 11.4.1, problem arrives in 10.3, 10.5, 10.11
      • write-only_range=10000, relative throughput is 0.75 in 11.4.1, problem arrives in 10.11, 11.4

      At this point my hypothesis is that the problem is from a few changes to InnoDB but I need more data to confirm or deny that.

      On the 24-core server (2 sockets, 12 cores/socket) I repeated sysbench for 1, 4, 8, 12, 16 and 18 threads. And then on the 32-core server I repeated it for 1, 4, 8, 12, 16, 20 and 24 threads. The goal was to determine at which thread count the regressions become obvious. Alas, I only used a subset of the microbenchmarks to get results in less time. Another run with more microbenchmarks is in progress.
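
      A minimal sketch of that kind of thread-count sweep for one microbenchmark is below; it is only a sketch, and the connection options, table counts and run time are assumptions rather than the settings used for the results above.

        # Hypothetical sweep over client thread counts for the update-index microbenchmark.
        for t in 1 4 8 12 16 18; do
          sysbench /usr/share/sysbench/oltp_update_index.lua \
            --mysql-user=root --mysql-db=test \
            --tables=8 --table-size=10000000 --range-size=100 \
            --threads=$t --time=300 run > "update-index_range=100.dop${t}.txt"
        done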

      The results will be in comments to follow.

      Attachments

      Issue Links

      Activity

            mdcallag Mark Callaghan created issue

            mdcallag Mark Callaghan added a comment (edited)

            I continue to link to gists because JIRA formatting just doesn't seem to work. I tried both preformatted and monospace, but neither works, and I need monospace to make the tables readable.

            From the 24-core server tests I share the results for the benchmark run at 1, 4, 8, 12, 16 and 18 threads and focus on the microbenchmarks that had significant regressions as described above. The numbers are here. And the summary is:

            • For insert_range=100 the big regression starts at 8 threads
            • For update-index_range=100 the big regression starts at 8 threads
            • For update-inlist_range=100 the big regression starts at 8 threads
            • For update-nonindex_range=100 the big regression starts at 8 threads
            • Unfortunately update-one_range=100, update-zipf_range=100 and write-only_range=10000 were not run

            From the 32-core server tests I share the results for the benchmark run at 1, 4, 8, 12, 16 and 20 threads and focus on the microbenchmarks that had significant regressions as described above. The numbers are here

            • For points-notcovered-pk_range=100 the problem is gradual from 1 to 20 threads
            • For points-notcovered-si_range=100 the problem is gradual from 1 to 20 threads
            • For random-points_range=1000 the problem is gradual from 1 to 20 threads
            • For random-points_range=100 the problem is gradual from 1 to 20 threads
            • For range-notcovered-si_range=100 the problem is gradual from 1 to 8 threads and then gets bad faster at 12+ threads
            • For read-write_range=10 the problem didn't reproduce
            • For update-index_range=100 the big regressions arrive with 8+ threads
            • Unfortunately I did not include update-inlist_range=100 or write-only_range=10000 in this round of tests

            marko Marko Mäkelä added a comment

            mdcallag, I think that some flame graphs generated from perf record data would be useful. In some cases I found offcputime useful, but it is a nuisance to use due to stack unwinder limitations unless all your code is built with the frame pointer enabled. Some recent GNU/Linux distributions are supposed to do that. For MDEV-32050 I built libstdc++ and libc myself, and actually found a contention point that I was unaware of.
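
            For reference, a minimal sketch of one way to go from perf record data to a flame graph; the FlameGraph scripts come from https://github.com/brendangregg/FlameGraph, and the sampling rate, duration and paths below are assumptions:

                # Hypothetical: profile a running mariadbd for 60 seconds and render a flame graph.
                perf record -F 99 -g -p "$(pidof mariadbd)" -- sleep 60
                perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > mariadbd.svg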

            There is a known scalability bottleneck MDEV-19749; would the patch that is posted there help?

            mdcallag Mark Callaghan added a comment (edited)

            Had to restart tests on the big servers, so I don't have results from there yet.
            I do have results from a smaller server that has 8 cores and sysbench was run with 4 threads.

            Results with relative QPS are here for sysbench with 1 and then 4 threads.
            In the 4-thread results the relative QPS during the update-index microbenchmark is ~0.25 with MariaDB 10.3, 10.4, 10.6, 10.11 and 11.4, meaning these only get ~25% of the throughput relative to 10.2. For some reason, the result for 10.5 was OK. The result for update-index with 4 threads is here.

            An overview of how I run sysbench is here. The update-index microbenchmark is implemented by oltp_update_index.lua, which calls execute_index_updates; the SQL is here. Note that these update statements require index maintenance because there is a secondary index on the k column.
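
            A minimal sketch of what this microbenchmark does and one way it can be invoked; the connection options, table counts and run time are illustrative assumptions, not the settings from run.sh:

                # Each sysbench thread in oltp_update_index.lua repeatedly issues statements like
                #   UPDATE sbtest1 SET k=k+1 WHERE id=<random id>
                # which require maintenance of the secondary index on k. A hypothetical invocation:
                sysbench /usr/share/sysbench/oltp_update_index.lua \
                  --mysql-host=127.0.0.1 --mysql-user=root --mysql-db=test \
                  --tables=8 --table-size=10000000 --range-size=100 \
                  --threads=4 --time=300 --report-interval=30 \
                  run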

            Then I repeated the test after changing the sysbench command-line options to get throughput per 30 seconds via --report-checkpoints=..., while also collecting PMP stacks every 30 seconds; the results are here. While the results for 10.3 and 10.4 are OK in this case, there is still a big regression starting in 10.6.

            Then I aggregated the ~10 call stacks I got during the test; the results from that are here, and 10.6 is dominated by stalls from srw_mutex, see here.
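
            A minimal sketch of the PMP-style collection and aggregation referred to above, adapted from the well-known poor man's profiler one-liner; it assumes gdb can attach to the running mariadbd:

                # Hypothetical: capture all thread stacks once, collapse each stack to a
                # comma-separated list of frames, and count identical stacks.
                gdb -ex "set pagination 0" -ex "thread apply all bt" -batch -p "$(pidof mariadbd)" 2>/dev/null |
                awk 'BEGIN { s = "" }
                     /^Thread/ { if (s != "") print s; s = "" }
                     /^#/      { if (s != "") s = s "," $4; else s = $4 }
                     END       { if (s != "") print s }' |
                sort | uniq -c | sort -rn | head -20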

            The srw code is new to me, probably not to you. I have a vague memory from long ago about a blog post from the upstream InnoDB team explaining new custom mutex code they wrote. But I don't see the srw code in MySQL 5.7 or 8.0. I even copied some of that code into my rarely used InnoDB thread perf simulator (see innotsim here) but upstream was using things with names that start with "TTAS" like TTASMutex. But I do see SRWLOCK in non-InnoDB code (include/thr_rwlock.h).

            PMP thread stacks for 10.11 and 11.4 are here and also show that the srw code is the bottleneck.


            mdcallag Mark Callaghan added a comment

            From the thread stacks it is always srw_mutex_impl<false>, so spinning=false.
            I assume whatever this replaced (InnoDB mutex or RW-lock from upstream) did spinning.


            mdcallag Mark Callaghan added a comment

            I started to edit srw_lock.h and sux_lock.h to prevent the spinning=false templates from being used. However, I think the authors would do a better job of that. So if you give me a patch against 10.6.17 that brings back spinning, then I will repeat tests with it.
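
            A minimal sketch of applying such a patch against a 10.6.17 source tree and rebuilding; these are generic cmake steps, and the patch file name and build options are assumptions, not the configuration used for the numbers in this ticket:

                # Hypothetical: apply a posted patch to a MariaDB 10.6.17 source tree and rebuild.
                cd mariadb-10.6.17
                patch -p1 < /path/to/bring-back-spinning.patch   # or: git apply /path/to/bring-back-spinning.patch
                cmake -S . -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo
                cmake --build build -j "$(nproc)"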

            From my one attempt, the stack traces show that I avoided most of the spinning=false use cases but that did not change performance.

            Do you have results to share with me from when this code was introduced to show that it didn't hurt performance?

            mdcallag Mark Callaghan added a comment (edited)

            This has results for sysbench on a server with 24 cores – 2 sockets & 12 cores/socket.
            The server is a SuperMicro SuperWorkstation using Intel 4214R CPU – see https://smalldatum.blogspot.com/2022/10/small-servers-for-performance-testing-v4.html

            I repeated sysbench for 1, 4, 8, 12 and 16 threads. All tests used 8 tables with 10M rows/table. With 8 tables there should be some reduction in data contention vs using 1 table.
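
            A minimal sketch of the corresponding data load ("8 tables with 10M rows/table"); the user and database names are assumptions:

                # Hypothetical data-load step; sysbench creates sbtest1..sbtest8 with 10M rows each.
                sysbench /usr/share/sysbench/oltp_update_index.lua \
                  --mysql-user=root --mysql-db=test \
                  --tables=8 --table-size=10000000 --threads=8 prepare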

            For comparison, results with MySQL show a big improvement from 5.6 to 5.7 and then a drop from 5.7 to 8.0. With MySQL 5.7 the peak is ~60k updates/s and with 8.0 it is ~30k/s.

            I ran sysbench with the option to dump the throughput every 30 seconds, so there are multiple numbers per line below. Each line is the result from the update-index microbenchmark that is the worst case regression for MariaDB and the ~7 numbers per line are the QPS per 30-second interval.

            Legend:

            • DOP $X - results for sysbench with $X threads
            • my5651_rel_o2nofp.z11a - results for MySQL 5.6.51
            • my5744_rel_o2nofp.z11a - results for MySQL 5.7.44
            • my8036_rel_o2nofp.z11a - results for MySQL 8.0.36

            DOP 1
            3208 3549 3668 3716 3738 3714 3754 my5651_rel_o2nofp.z11a
            3350 3769 3918 3921 3909 3920 4071 my5744_rel_o2nofp.z11a
            2745 3071 3145 3199 3166 3219 3284 my8036_rel_o2nofp.z11a

            DOP 4
            6631 7228 7016 7478 7150 5960 7759 my5651_rel_o2nofp.z11a
            6906 7418 7498 7422 7402 7386 8222 my5744_rel_o2nofp.z11a
            6602 7064 7206 7200 7138 7108 7613 my8036_rel_o2nofp.z11a

            DOP 8
            6725 7554 7505 7251 7212 7219 8100 my5651_rel_o2nofp.z11a
            23414 42883 45865 45614 46405 43101 45782 my5744_rel_o2nofp.z11a
            12542 12572 13389 12204 16766 15023 16831 my8036_rel_o2nofp.z11a

            DOP 12
            8097 9690 9415 9260 8995 9819 9961 my5651_rel_o2nofp.z11a
            51513 57349 57307 57274 55950 51889 56187 my5744_rel_o2nofp.z11a
            35985 34732 28001 26117 28178 23432 31220 my8036_rel_o2nofp.z11a

            DOP 16
            9764 13156 13828 11918 11104 10849 12095 my5651_rel_o2nofp.z11a
            59248 59536 59470 59174 53806 59348 58711 my5744_rel_o2nofp.z11a
            37245 33280 25190 32406 32742 28922 31506 my8036_rel_o2nofp.z11a

            Results for MariaDB LTS releases show:

            • big regressions starting at 8 threads in both MariaDB 10.5.24 and 10.6.17 while 10.2.44, 10.3.39 and 10.4.33 all did OK
            • slow regressions over time from 10.2 -> 11.4 which might just be new CPU overhead
            • while upstream MySQL at DOP=16 gets ~60k updates/s in 5.7.44 and ~30k in 8.0.36, MariaDB gets 70k/s or more in 10.2 thru 10.4, ~40k/s in 10.5, ~30k/s in 10.6, ~24k/s in 10.11 and then ~15k/s in 11.4

            Legend:

            • DOP $X - results for sysbench with $X threads
            • ma100244_rel_withdbg.z11a - MariaDB 10.2.44
            • ma100339_rel_withdbg.z11a - MariaDB 10.3.39
            • ma100433_rel_withdbg.z11a - MariaDB 10.4.33
            • ma100524_rel_withdbg.z11a - MariaDB 10.5.24
            • ma100617_rel_withdbg.z11a - MariaDB 10.6.17
            • ma101107_rel_withdbg.z11a - MariaDB 10.11.7
            • ma110401_rel_withdbg.z11b - MariaDB 11.4.1

            DOP 1
            3288 3699 3730 3606 3495 3480 3637 ma100244_rel_withdbg.z11a
            2950 3387 3460 3384 3344 3313 3285 ma100339_rel_withdbg.z11a
            2658 3181 3203 3210 3151 3109 3229 ma100433_rel_withdbg.z11a
            3418 4048 4246 4118 4343 4334 4283 ma100524_rel_withdbg.z11a
            3928 4287 4382 4377 4302 4519 4584 ma100617_rel_withdbg.z11a
            3917 4343 4382 4382 4265 4539 4326 ma101107_rel_withdbg.z11a
            3613 3957 4350 4239 3952 4402 4168 ma110401_rel_withdbg.z11b

            DOP 4
            7811 8169 8111 8005 7869 7850 8662 ma100244_rel_withdbg.z11a
            6600 7119 7037 6903 6735 6911 7325 ma100339_rel_withdbg.z11a
            6371 7070 7244 6985 7005 7318 7498 ma100433_rel_withdbg.z11a
            7879 8547 7987 8741 9038 9346 10398 ma100524_rel_withdbg.z11a
            6459 7066 7079 7147 7131 7357 7616 ma100617_rel_withdbg.z11a
            10159 9990 10485 10451 10190 10723 10445 ma101107_rel_withdbg.z11a
            7315 7816 7765 7765 8448 7479 7636 ma110401_rel_withdbg.z11b

            DOP 8
            43740 53521 48133 56194 57485 58505 55838 ma100244_rel_withdbg.z11a
            15846 23131 26235 25012 24239 26333 25471 ma100339_rel_withdbg.z11a
            20699 28173 31655 29639 32049 34095 31686 ma100433_rel_withdbg.z11a
            11538 12202 12224 12374 12704 13109 14124 ma100524_rel_withdbg.z11a
            9085 9180 9261 9502 9469 9643 10219 ma100617_rel_withdbg.z11a
            12051 12302 12344 12793 12405 12936 13571 ma101107_rel_withdbg.z11a
            10086 10677 8550 11246 8561 10092 9736 ma110401_rel_withdbg.z11b

            DOP 12
            68704 71397 70615 71608 67787 69417 70290 ma100244_rel_withdbg.z11a
            60770 70039 70747 71010 70223 68461 71967 ma100339_rel_withdbg.z11a
            62047 67452 68916 68512 68586 68914 69190 ma100433_rel_withdbg.z11a
            16688 16774 19022 19517 20514 20776 21509 ma100524_rel_withdbg.z11a
            12420 12653 12927 13532 13243 14049 15365 ma100617_rel_withdbg.z11a
            13944 14154 14497 15282 14208 15521 16527 ma101107_rel_withdbg.z11a
            12328 10328 10070 9684 13131 10871 11400 ma110401_rel_withdbg.z11b

            DOP 16
            70703 70537 71731 72540 67908 70929 71118 ma100244_rel_withdbg.z11a
            81310 81704 82147 82336 82308 73299 81880 ma100339_rel_withdbg.z11a
            72227 74257 73307 74790 75554 67097 73578 ma100433_rel_withdbg.z11a
            37348 32724 33922 33405 33983 36772 40503 ma100524_rel_withdbg.z11a
            20267 22318 23933 25257 28028 26159 29722 ma100617_rel_withdbg.z11a
            18735 20002 20344 20702 21040 22142 24254 ma101107_rel_withdbg.z11a
            14298 15837 15317 12236 12149 17126 14313 ma110401_rel_withdbg.z11b


            mdcallag Mark Callaghan added a comment

            More results from the 24-core server (2 sockets, 12 cores/socket) are here for DOP=8 and DOP=16.

            From the DOP=16 results (16 client threads) I then counted the number of stacks that match certain strings.

            The most interesting one might be thread stacks that contain "btr".
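
            A minimal sketch of that kind of counting, assuming one file of collapsed PMP stacks per build; the file names are hypothetical:

                # Hypothetical: count stacks that mention a given symbol fragment, per build.
                for f in pmp.*.pk1; do
                  printf '%6d %s\n' "$(grep -c 'btr' "$f")" "$f"
                done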

            Number of stacks with "ha_commit_trans"
            87 ma100433_rel_withdbg.z11a_c24r64.pk1
            79 ma100524_rel_withdbg.z11a_c24r64.pk1
            39 ma100617_rel_withdbg.z11a_c24r64.pk1
            55 ma101107_rel_withdbg.z11a_c24r64.pk1
            68 ma110401_rel_withdbg.z11b_c24r64.pk1
            131 my5744_rel_o2nofp.z11a_c24r64.pk1
            163 my8036_rel_o2nofp.z11a_c24r64.pk1

            Number of stacks with "log_write_up_to"
            34 ma100433_rel_withdbg.z11a_c24r64.pk1
            65 ma100524_rel_withdbg.z11a_c24r64.pk1
            27 ma100617_rel_withdbg.z11a_c24r64.pk1
            36 ma101107_rel_withdbg.z11a_c24r64.pk1
            33 ma110401_rel_withdbg.z11b_c24r64.pk1
            3 my5744_rel_o2nofp.z11a_c24r64.pk1
            16 my8036_rel_o2nofp.z11a_c24r64.pk1

            Number of stacks with "trx_prepare"
            29 ma100433_rel_withdbg.z11a_c24r64.pk1
            40 ma100524_rel_withdbg.z11a_c24r64.pk1
            23 ma100617_rel_withdbg.z11a_c24r64.pk1
            25 ma101107_rel_withdbg.z11a_c24r64.pk1
            28 ma110401_rel_withdbg.z11b_c24r64.pk1
            0 my5744_rel_o2nofp.z11a_c24r64.pk1
            1 my8036_rel_o2nofp.z11a_c24r64.pk1

            Number of stacks with "sync_array", which was removed in MariaDB 10.6
            53 ma100433_rel_withdbg.z11a_c24r64.pk1
            57 ma100524_rel_withdbg.z11a_c24r64.pk1
            0 ma100617_rel_withdbg.z11a_c24r64.pk1
            0 ma101107_rel_withdbg.z11a_c24r64.pk1
            0 ma110401_rel_withdbg.z11b_c24r64.pk1
            35 my5744_rel_o2nofp.z11a_c24r64.pk1
            26 my8036_rel_o2nofp.z11a_c24r64.pk1

            Number of stacks with "ssux", which arrived in MariaDB 10.6
            83 ma100617_rel_withdbg.z11a_c24r64.pk1
            65 ma101107_rel_withdbg.z11a_c24r64.pk1
            47 ma110401_rel_withdbg.z11b_c24r64.pk1

            Number of stacks with "purge"
            57 ma100433_rel_withdbg.z11a_c24r64.pk1
            72 ma100524_rel_withdbg.z11a_c24r64.pk1
            68 ma100617_rel_withdbg.z11a_c24r64.pk1
            49 ma101107_rel_withdbg.z11a_c24r64.pk1
            34 ma110401_rel_withdbg.z11b_c24r64.pk1
            70 my5744_rel_o2nofp.z11a_c24r64.pk1
            59 my8036_rel_o2nofp.z11a_c24r64.pk1

            Number of stacks with "btr"
            69 ma100433_rel_withdbg.z11a_c24r64.pk1
            124 ma100524_rel_withdbg.z11a_c24r64.pk1
            152 ma100617_rel_withdbg.z11a_c24r64.pk1
            106 ma101107_rel_withdbg.z11a_c24r64.pk1
            83 ma110401_rel_withdbg.z11b_c24r64.pk1
            93 my5744_rel_o2nofp.z11a_c24r64.pk1
            77 my8036_rel_o2nofp.z11a_c24r64.pk1


            wlad Vladislav Vaintroub added a comment

            mdcallag, SRWLOCK in non-InnoDB code is not marko's srwlock but Windows' one, but it seems Marko liked the name or the mode of operation (Windows' one is slim: only 8 bytes, needs no initialization, fast).


            marko Marko Mäkelä added a comment

            I see that these consolidated stack traces include mtr_t::s_lock(), which is acquiring a shared latch on an index tree, mostly in btr_cur_t::search_leaf().

            I had disabled spin loops on the index tree latches because, based on some benchmarks that we ran at that time, spin loops in some cases would only end up wasting CPU with no benefit. What is better may actually depend on the CPU implementation, as we recently observed in MDEV-33515.

            Would you have a chance to test the patch that I posted to MDEV-34178? It should enable spin loops for the index tree latches.

            It would be interesting to have a unit test that is based on innotsim. The rw-lock and shared/update/exclusive (SUX) lock implementation in MariaDB Server 10.6 is my own development. I also created a stand-alone version at https://github.com/dr-m/atomic_sync/.

            mdcallag Mark Callaghan added a comment (edited)

            If you send me command lines I can try atomic_sync on my home servers. Without looking at the code, does it have an option to configure how long the lock is held? I assume an issue here is that the lock is not held long enough to justify other threads going to sleep, and that some are better off spinning, as they do in upstream.

            I will try the patch you suggest when I return home. I am at pgconf.dev this week.

            serg Sergei Golubchik made changes: Priority changed from Minor to Critical
            serg Sergei Golubchik made changes: Assignee set to Marko Mäkelä
            axel Axel Schwenke added a comment

            mdcallag I have a question regarding your workloads. I have read your description here but I don't understand the meaning of the _range=xxx suffix. update_one is supposed to update the same row over and over again. But what is the meaning of update-one_range=100?


            mdcallag Mark Callaghan added a comment

            @axel, an overview of my sysbench usage is here, and I use a number of messy shell scripts.

            The short answer is that range is ignored for some of the microbenchmarks but is always used in how I name output files. The long answer is ...

            I mention 4 of the shell scripts here and the last 4 are relevant to you:

            • r.sh is the entry point that I run from the command line
            • r.sh invokes cmp_in.sh to run things for MariaDB and MySQL
            • cmp_in.sh invokes all_small.sh which runs all of the microbenchmarks in a sequence that is useful to me
            • each microbenchmark is one invocation of sysbench and all_small.sh calls run.sh to invoke sysbench

            So "turtles all the way down" is a way to explain this, only instead of turtles it is messy shell scripts. I rely on many naming conventions, and one of them is the suffix used to name result files. The "range=" string is part of that suffix and is defined by run.sh here. However that "range=..." string doesn't mean anything for the update-one microbenchmark but the run.sh script isn't clever enough to know when to use it, so it always adds --range-size to the command line.

            The update-one microbenchmark is run from here by all_small.sh, and then run.sh uses oltp_update_non_index.lua, which uses the non_index_updates code from oltp_common.lua.
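
            A minimal sketch of the naming convention being described; this is illustrative only, the real logic lives in run.sh, and the connection options are assumptions:

                # Hypothetical: the range value always ends up in the output file name,
                # even though update-one (oltp_update_non_index.lua) ignores --range-size.
                range=100
                sfx="range=${range}"
                sysbench /usr/share/sysbench/oltp_update_non_index.lua \
                  --mysql-user=root --mysql-db=test --tables=8 --table-size=10000000 \
                  --range-size=$range --threads=4 --time=300 run > "update-one_${sfx}.txt"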


            marko Marko Mäkelä added a comment

            mdcallag, unfortunately there currently is no option either in the https://github.com/dr-m/atomic_sync/ test programs or in storage/innobase/unittest/innodb_sync-t.cc to control the duration of the critical sections. Something based on busy-waiting as well as something based on context switching (system calls) could be useful. My main emphasis was to make the code reusable in other projects.


            For the record, user space spin loops make sense especially when the critical sections are of short duration and the resource is expected to be released soon. For the dict_index_t::lock I had thought that this would be unlikely.

            While testing something else (MDEV-33894) on a SATA 3.0 HDD with ext4fs, I noticed that the top CPU user in my simple test is the Linux kernel function native_queued_spin_lock_slowpath, accounting for 6.70% of the cycle samples. I repeated the run after applying the patch from MDEV-34178. It did have an impact, but the overall throughput was reduced for this simple test (using a large enough buffer pool for the workload, but a tiny innodb_log_file_size=100m that would force rather frequent page writes):

               5,83%  mariadbd  [kernel.kallsyms]     [k] native_queued_spin_lock_slowpath
               2,39%  mariadbd  mariadbd              [.] ssux_lock_impl<true>::wr_wait(unsigned int)
               2,14%  mariadbd  mariadbd              [.] buf_page_get_low(page_id_t, unsigned long, unsigned long, buf_block_t*, unsigned long, mtr_t*, dberr_t*, bool)
               1,83%  mariadbd  mariadbd              [.] rec_get_offsets_func(unsigned char const*, dict_index_t const*, unsigned short*, unsigned long, unsigned long, mem_block_info_t**)
               1,45%  mariadbd  [kernel.kallsyms]     [k] psi_group_change
               1,22%  mariadbd  mariadbd              [.] page_cur_search_with_match(dtuple_t const*, page_cur_mode_t, unsigned long*, unsigned long*, page_cur_t*, rtr_info*)
               1,04%  mariadbd  libc.so.6             [.] __memcmp_avx2_movbe_rtm
               0,98%  mariadbd  libc.so.6             [.] pthread_mutex_lock@@GLIBC_2.2.5
               0,97%  mariadbd  [kernel.kallsyms]     [k] update_load_avg
            

            Before I enabled the spin loop, I got the following:

               6,70%  mariadbd         [kernel.kallsyms]     [k] native_queued_spin_lock_slowpath
               1,31%  mariadbd         [kernel.kallsyms]     [k] psi_group_change
               1,14%  mariadbd         libc.so.6             [.] __memcmp_avx2_movbe_rtm
               0,90%  mariadbd         libc.so.6             [.] pthread_mutex_lock@@GLIBC_2.2.5
               0,89%  mariadbd         mariadbd              [.] buf_page_get_low(page_id_t, unsigned long, unsigned long, buf_block_t*, unsigned long, mtr_t*, dberr_t*, bool)
               0,82%  mariadbd         [kernel.kallsyms]     [k] update_load_avg
            

            It seems that enabling the spin loop for dict_index_t::lock is reducing context switching. Whether the increased CPU usage (as we see for the wr_wait()) is making things worse depends on many factors. If a wider selection of benchmarks does not show any significant regression anywhere, maybe we should just unconditionally enable the spin loop, instead of doing something similar to MDEV-33515.

             I went on and applied another patch from MDEV-19749. With that, things became very ‘interesting’: the throughput would be almost 0 for part of the benchmark. I suspect that this (actually mostly unnecessary) synchronization on the SQL layer acts a little like the InnoDB throttling mechanisms that were removed in MDEV-23379. During the better-performing part of my benchmark run, the perf record output showed that there is still quite a bit of context switching going on:

               7,42%  mariadbd         [kernel.kallsyms]     [k] native_queued_spin_lock_slowpath
               2,39%  mariadbd         mariadbd              [.] ssux_lock_impl<true>::wr_wait(unsigned int)
               2,15%  mariadbd         mariadbd              [.] buf_page_get_low(page_id_t, unsigned long, unsigned long, buf_block_t*, unsigned long, mtr_t*, dberr_t*, bool)
               1,82%  mariadbd         mariadbd              [.] rec_get_offsets_func(unsigned char const*, dict_index_t const*, unsigned short*, unsigned long, unsigned long, mem_block_info_t**)
               1,41%  mariadbd         [kernel.kallsyms]     [k] psi_group_change
               1,28%  mariadbd         mariadbd              [.] page_cur_search_with_match(dtuple_t const*, page_cur_mode_t, unsigned long*, unsigned long*, page_cur_t*, rtr_info*)
               1,00%  mariadbd         libc.so.6             [.] __memcmp_avx2_movbe_rtm
               0,97%  mariadbd         [kernel.kallsyms]     [k] update_load_avg
               0,92%  mariadbd         mariadbd              [.] void rec_init_offsets_comp_ordinary<false, false>(unsigned char const*, dict_index_t const*, unsigned short*, unsigned long, dict_col_t::def_t const*, rec_leaf_format)
               0,84%  mariadbd         mariadbd              [.] cmp_dtuple_rec_with_match_low(dtuple_t const*, unsigned char const*, dict_index_t const*, unsigned short const*, unsigned long, unsigned long*)
               0,83%  mariadbd         mariadbd              [.] btr_cur_t::search_leaf(dtuple_t const*, page_cur_mode_t, btr_latch_mode, mtr_t*)
               0,80%  mariadbd         libc.so.6             [.] pthread_mutex_lock@@GLIBC_2.2.5
               0,71%  mariadbd         libc.so.6             [.] __memmove_avx_unaligned_erms_rtm
               0,70%  mariadbd         [kernel.kallsyms]     [k] select_task_rq_fair
               0,69%  mariadbd         mariadbd              [.] mysql_execute_command(THD*, bool)
               0,68%  mariadbd         libc.so.6             [.] _int_malloc
               0,67%  mariadbd         libc.so.6             [.] malloc
               0,66%  mariadbd         mariadbd              [.] cmp_data(unsigned long, unsigned long, bool, unsigned char const*, unsigned long, unsigned char const*, unsigned long)
               0,60%  mariadbd         mariadbd              [.] void mtr_t::commit_log<true>(mtr_t*, std::pair<unsigned long, mtr_t::page_flush_ahead>)
            

             I will try to see if offcputime-bpfcc would shed some light on this, like it did on MDEV-32050. That would seem to require a reboot, because I only found a linux-headers package for a newer kernel than my current 6.8.9. That’s what I get for using a rolling release.

             marko Marko Mäkelä added a comment -

            For the record, if I "store" the database in /dev/shm and only point innodb_log_group_home_dir to a HDD (to test MDEV-33894), then it will look quite different again.

               6,03%  mariadbd         mariadbd              [.] buf_page_get_low(page_id_t, unsigned long, unsigned long, buf_block_t*, unsigned long, mtr_t*, dberr_t*, bool)
               5,90%  mariadbd         mariadbd              [.] rec_get_offsets_func(unsigned char const*, dict_index_t const*, unsigned short*, unsigned long, unsigned long, mem_block_info_t**)
               4,45%  mariadbd         mariadbd              [.] page_cur_search_with_match(dtuple_t const*, page_cur_mode_t, unsigned long*, unsigned long*, page_cur_t*, rtr_info*)
               2,89%  mariadbd         mariadbd              [.] void rec_init_offsets_comp_ordinary<false, false>(unsigned char const*, dict_index_t const*, unsigned short*, unsigned long, dict_col_t::def_t const*, rec_leaf_format)
               2,86%  mariadbd         [kernel.kallsyms]     [k] native_queued_spin_lock_slowpath
               2,84%  mariadbd         mariadbd              [.] cmp_dtuple_rec_with_match_low(dtuple_t const*, unsigned char const*, dict_index_t const*, unsigned short const*, unsigned long, unsigned long*)
               2,24%  mariadbd         libc.so.6             [.] __memcmp_avx2_movbe_rtm
               2,23%  mariadbd         mariadbd              [.] cmp_data(unsigned long, unsigned long, bool, unsigned char const*, unsigned long, unsigned char const*, unsigned long)
               1,68%  mariadbd         mariadbd              [.] btr_cur_t::search_leaf(dtuple_t const*, page_cur_mode_t, btr_latch_mode, mtr_t*)
            

            This would seem to suggest that the bulk of the previously observed context switching is related to buf_flush_page_cleaner submitting writes of buffer pool pages to data files, which would conflict with application writes to the same pages in the buffer pool.

             marko Marko Mäkelä added a comment -
            julien.fritsch Julien Fritsch made changes -
            Fix Version/s 10.5 [ 23123 ]
            Fix Version/s 10.6 [ 24028 ]
            Fix Version/s 10.11 [ 27614 ]
            Fix Version/s 11.1 [ 28549 ]
            Fix Version/s 11.2 [ 28603 ]
            Fix Version/s 11.4 [ 29301 ]
            Fix Version/s 11.5 [ 29506 ]

            When doing "wide range" of benchmarks, it also makes sense to use up to 4000 or so concurrent users, not just 4 as reporter did. Then, the impact of pure spin can become more interesting. Also, it should not be around NUMA or Linux with its lousy native_queued_spin_lock_slowpath. NUMA is bad on Linux, or mutexes/futexes are bad on NUMA or Linux, and we know it, but it does not mean it is bad everywhere else.

             wlad Vladislav Vaintroub added a comment -
            mdcallag Mark Callaghan added a comment - - edited

             On my small server (8 cores) I see a regression that arrives in 10.6 and appears to come from changes to the InnoDB mutex and rw-lock code. I have been trying some of the patches from that work as part of my sysbench tests, but I use larger servers for the sysbench tests (dell32 == 1 socket with 32 cores, socket2 == 2 sockets with 12 cores/socket, 24 cores total).

            And in my sysbench tests the problem microbenchmark is update-index, but with update-index the biggest regression arrives in 10.5. Also, on the 2-socket server, the update-inlist microbenchmark has regressions in many LTS releases as far back as 10.2. Results are here for socket2 and for dell32 – scroll to the end of each page, numbers less than 1.0 == regression.

            On the 2-socket server for the update-index microbenchmark, from 10.4 to 10.5

            • CPU/update overhead (cpu/o) almost doubles
            • Context switches per update (cs/o) more than doubles

            From the 2-socket server, the top 20 call stacks for 10.4 and 10.5 during update index are here

            • at first glance they are similar. The top waits are mutex contention and/or IO waits under ha_commit_trans
            • I see many more stacks with TTASEventMutex<GenericPolicy> with 10.4 than with 10.5
            • note that I have three variants for the 10.4 result – the first with the z11a_c24r64 config which is my standard config, then one with z11abpi1_c24r64 which adds innodb_buffer_pool_instances=1 to mimic 10.5, and then one with z11aredo1_c24r64 which just uses 1 big redo log (also to mimic 10.5). But none of these encounters the perf problems that 10.5 has with the update-index microbenchmark

            I am not surprised there are regressions in 10.5 given the number of InnoDB changes, my short list is below and the first two might explain the regressions I see here:

             • many redo log files -> only 1 redo log file (innodb_log_files_in_group removed)
             • only 1 buffer pool instance (remove innodb_buffer_pool_instances, innodb_page_cleaners)
             • remove innodb_concurrency_tickets
             • remove innodb_idle_flush_pct
             • remove innodb_thread_concurrency


            What else changed?

            I am repeating tests on the 2-socket server, but collecting perf CPU samples instead of PMP and then I will have flame graphs.


             From the gists I shared above, if you scroll to the end of the link for the socket2 server (see here) then you will see that results are much worse for x.ma101107_rel_withdbg.z11a_c24r64.pk1 (MariaDB 10.11.7) and x.ma110401_rel_withdbg.z11b_c24r64.pk1 (MariaDB 11.4.1), and the obvious change is that r/o and rKB/o were 0 in MariaDB 10.6 and earlier releases but are non-zero starting in 10.11.7. The r/o column is iostat reads per operation (r/s divided by IPS) and rKB/o is iostat KB read per operation (read KB/s divided by IPS).
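
             For clarity, those columns are plain ratios of the iostat rates to the benchmark throughput. A minimal sketch with made-up numbers (the real values come from the iostat and sysbench output):

               // sketch: deriving the per-operation columns from iostat rates and throughput
               #include <cstdio>

               int main()
               {
                 const double ips      = 20000.0;  // operations (queries or updates) per second
                 const double reads_s  = 1500.0;   // iostat r/s
                 const double read_kbs = 24000.0;  // iostat rKB/s

                 const double r_per_op   = reads_s  / ips;  // "r/o"  : reads per operation
                 const double rkb_per_op = read_kbs / ips;  // "rKB/o": KB read per operation

                 std::printf("r/o=%.3f rKB/o=%.3f\n", r_per_op, rkb_per_op);
               }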

             Although this comment might belong in MDEV-33894.

             mdcallag Mark Callaghan added a comment -

            Tests on the small server with the Insert Benchmark are still running to include the latest updates to PR 3317.

             I tried the updated PR on a big server (2 sockets, 12 cores/socket) using sysbench and the problem didn't go away. On this server, though, the regression starts in 10.5, not 10.6, so the mutex/rw-lock changes are not the suspect.

            For results see here

             mdcallag Mark Callaghan added a comment -

            Thank you, I see that our efforts in MDEV-34178 are leading to some improvement.

            When it comes to the regression between 10.4 and 10.5, if it is due to InnoDB, my main suspect would be the changes to the buffer pool (MDEV-15053, MDEV-15058, MDEV-16264) and many follow-up changes. Another substantial change would be MDEV-12353 and some related changes, but I would not expect any regression there (except for ROW_FORMAT=COMPRESSED). Parsing and applying the log should be faster, but also writing should be more efficient, because some calls to rec_get_offsets() at least during INSERT were removed.

             marko Marko Mäkelä added a comment -

            Results from my socket2 server (2 sockets, 12 cores/socket) are here for sysbench with 16 clients and a cached database.

            • QPS for read-only microbenchmarks drops in 10.5 and the drop is similar to the results for 10.4 when 10.4 uses innodb_buffer_pool_instances=1
             • QPS for write-heavy microbenchmarks drops in 10.5 but the drop is larger than the results for 10.4 when 10.4 uses one large InnoDB redo log
             • QPS for update-index drops in half from 10.4 to 10.5 and I have yet to mimic that using a bad my.cnf with 10.4

             Looking at PMP stack traces for a few of the microbenchmarks that show a large regression in 10.5. First up is points-covered-pk.pre_range=100 (see here) where QPS drops by ~24% (relative throughput is 0.76 for 10.5) and there is a huge spike in contention from this (see here):
             53 futex_wait,__GI___lll_lock_wait,___pthread_mutex_lock,inline_mysql_mutex_lock,buf_page_make_young,buf_page_make_young_if_needed,buf_page_get_low,btr_cur_search_to_nth_level

             Then I checked another microbenchmark where the relative throughput is 0.73 for MariaDB 10.5 (so it gets ~27% less QPS). It is range-notcovered-si.pre, which does a non-covering secondary index scan (see here), and the problem is the same as above (buf_page_make_young).

            The throughput for these read-only microbenchmarks improves some in 10.6, but the primary contention is still under buf_page_make_young.

             This isn't a surprise: while I don't understand all of the changes that were made, limiting InnoDB to one buffer pool instance means there will be more mutex contention.

            For update-index the aggregated stack traces are here for 10.4, 10.5 and 10.6. The big regression starts in 10.5 where throughput drops in half. In 10.5 the top sources of contention are:

            • rw_lock_sx_lock_func,pfs_rw_lock_sx_lock_func,mtr_t::sx_lock,btr_cur_search_to_nth_level
            • THD::wait_for_wakeup_ready,MYSQL_BIN_LOG::write_transaction_to_binlog_events,MYSQL_BIN_LOG::write_transaction_to_binlog,binlog_flush_cache,binlog_commit_flush

            And in 10.6 they are:

            • srw_lock_impl<false>::rd_lock
            • ha_commit_trans

             mdcallag Mark Callaghan added a comment -

             Thanks for the fix for MDEV-33894. I was able to show that MariaDB is ~10% faster than MySQL on a medium server (16 cores). But those results also show a large increase in mutex contention (context switch rates) in 10.5 and 10.11 for some of the microbenchmarks. Either this or MDEV-34178 is the issue.

            See https://smalldatum.blogspot.com/2024/07/sysbench-on-medium-server-mariadb-is.html

             mdcallag Mark Callaghan added a comment -

            mdcallag, thank you! There is some rather low-hanging fruit in MDEV-34450, MDEV-34515 and MDEV-34520 waiting for me now that I am back from vacation. Furthermore, MDEV-34431 will need some work in order to avoid a regression on some workloads. (As we can see in MDEV-33515, spin loops can be good or bad, depending on the hardware as well as the level of concurrency.)

            I would also like to point out that the cache pollution due to MDEV-33508 turned out to be a huge problem for a customer. MDEV-34296 might be another case that involves cache pollution (but not only that). What I am trying to say is that sometimes the default -e cycles of perf record is not sufficient.

            I think that the near term testing (after some fixes) should concentrate on the long-term-support branches 10.6, 10.11 and 11.4. We are close to getting the quarterly releases out, so merge conflicts outside of InnoDB should hopefully not be much of an issue.

             marko Marko Mäkelä added a comment -
            marko Marko Mäkelä made changes -

            MDEV-34759 affects read operations via secondary indexes.

             marko Marko Mäkelä added a comment -
            marko Marko Mäkelä made changes -
            julien.fritsch Julien Fritsch made changes -
            Fix Version/s 11.1 [ 28549 ]
            julien.fritsch Julien Fritsch made changes -
            Fix Version/s 11.5 [ 29506 ]
            mdcallag Mark Callaghan added a comment - - edited

            The tl;dr is that QPS on read-heavy sysbench tests drops in half from MariaDB 10.4 to 10.5. The problem appears to be mutex contention because 10.5 got rid of innodb_buffer_pool_instances.

             This regression has never been undone, although in this case the regressions from 10.6 through 11.4 are not big for the tests that have a huge regression in 10.5.

            Results from a large server (48 cores, AMD SMT disabled so /proc/cpuinfo shows 48 CPUs) that I get from Hetzner.
            https://smalldatum.blogspot.com/2024/09/trying-out-dedicated-server-from-hetzner.html

            Relative QPS is here. The database versions have names like ma100244_rel_withdbg.z11a_c32r128.pk1

            • ma100244 means MariaDB 10.2.44
            • rel_withdbg has a few details on the build, in this case the CMake command line uses -DCMAKE_BUILD_TYPE=RelWithDebInfo
            • z11a_c32r128 is the my.cnf which maps to my.cnf.cz11a_c32r128 and those are in the ma10* and ma11* subdirectories here

            How I use sysbench is explained here.

            The numbers in the gist are the relative QPS, which is: (QPS for $version / QPS for MariaDB 10.2.44) and a value greater than 1.0 means that $version is faster than MariaDB 10.2.44. I tested the latest patch releases for 10.2, 10.3, 10.4, 10.5, 10.6, 10.11, 11.4, 11.5 and 11.6.

            The results are here and a common pattern is that the relative QPS values for read-heavy microbenchmarks drop from ~1.0 in col-2 to less than 0.5 (see here) in col-3 where:

            • col-2 : x.ma100434_rel_withdbg.z11a_c32r128.pk1 -> MariaDB 10.4.34
            • col-3 : x.ma100526_rel_withdbg.z11a_c32r128.pk1 -> MariaDB 10.5.26

            To be clear, QPS on many microbenchmarks drops in half from 10.4 to 10.5 on a big server.
            Note that MariaDB 10.4 supports innodb_buffer_pool_instances > 1 (I used 8) while 10.5 only supports one instance.

             Metrics from vmstat and iostat, normalized by (divided by) QPS, are here. Checking that for the result from random-points with range=1000 (1000 rows to fetch per query), the relative QPS is here and drops from 1.04 in MariaDB 10.4.34 to 0.41 in MariaDB 10.5.26 (so it loses more than half the throughput in 10.5).

            From the normalized iostat and vmstat metrics I see that in MariaDB 10.5.26 (x.ma100526_rel_withdbg.z11a_c32r128.pk1) ...

             • context switches / query increase by almost 20X. Relative to MariaDB 10.2.44 the context switch rate (cs/o) was 1.0 for MariaDB 10.4.34 and is 18.98 for MariaDB 10.5.26
             • CPU/query increases by ~2.5X. Relative to MariaDB 10.2.44 the CPU overhead (cpu/o) was 0.96 for MariaDB 10.4.34 and is 2.40 for MariaDB 10.5.26

            Then I repeated tests and used PMP to collect thread stacks. I got many samples per microbenchmark and then merged all of the samples per microbenchmark. The format for this is: <count> <stack trace> where <count> is the number of times that trace occurs in the merged samples. The results are here for 10.4.34 and for 10.5.26.
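
             The merging step is essentially a frequency count over the collapsed one-line stack traces. A rough sketch of the idea (not the actual PMP tooling):

               // sketch: count duplicate collapsed stack traces (one per line on stdin)
               // and print them as "<count> <stack trace>", most frequent first
               #include <algorithm>
               #include <iostream>
               #include <map>
               #include <string>
               #include <utility>
               #include <vector>

               int main()
               {
                 std::map<std::string, unsigned> freq;
                 std::string stack;
                 while (std::getline(std::cin, stack))
                   if (!stack.empty())
                     ++freq[stack];

                 std::vector<std::pair<std::string, unsigned>> sorted(freq.begin(), freq.end());
                 std::sort(sorted.begin(), sorted.end(),
                           [](const auto &a, const auto &b) { return a.second > b.second; });

                 for (const auto &s : sorted)
                   std::cout << s.second << ' ' << s.first << '\n';
               }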

             • for 10.4.34, of the top-10 stack traces by count, the only one that shows mutex contention is the 5th most frequent (see here)
             • for 10.5.26, of the top-10 stack traces by count, the first, third, sixth and ninth most frequent traces all show mutex contention (see here)

             mdcallag, thank you. Your results seem to indicate that buf_page_make_young() is causing contention on the single buf_pool.mutex that protects the single buf_pool.LRU list starting from 10.5. That mutex is of the mysql_mutex_t type already in 10.5.

            In recent releases of 10.6 and later, there should be fewer calls of this function. Also, thanks to MDEV-26827 there should be less contention on buf_pool.mutex and buf_pool.flush_list_mutex around page writes.

            I don’t think that a radical approach like the one in MDEV-23855 would work. There we removed fil_system.LRU altogether and implemented an alternative solution of closing not-recently-used files. We’d better retain the buf_pool.LRU roughly as is. We could exploit an old idea. In MDEV-26827 we defer the removal of clean blocks from buf_pool.flush_list.

            A ‘lazy’ solution could be that buf_page_make_young() would merely set a flag in buf_page_t without acquiring any mutex. In buf_do_LRU_batch() and buf_do_flush_list_batch(), for any encountered block where the flag is set, it would be cleared and the block moved to the most recently used end. A limitation of this approach would be that when background flushing is not enabled, the system could deteriorate to a state where most of blocks would have the flag set, and buf_do_LRU_batch() would be forced to scan the entire buf_pool.LRU list before it can make progress on a second iteration. I think that it would be worth an experiment.
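
             In code, the ‘lazy’ idea would look roughly like the following. This is only a sketch with hypothetical stand-in types (page_stub, pool_stub); the real buf_page_t, buf_do_LRU_batch() and the latching around them are more involved.

               // sketch of the proposed lazy buf_page_make_young(): the hot path only sets a
               // flag, and the flush/eviction batches (which already hold the LRU mutex) act on it
               #include <atomic>
               #include <list>
               #include <mutex>

               struct page_stub                     // hypothetical stand-in for buf_page_t
               {
                 std::atomic<bool> make_young_pending{false};
               };

               struct pool_stub                     // hypothetical stand-in for buf_pool
               {
                 std::mutex mutex;                  // protects the LRU list
                 std::list<page_stub*> LRU;         // front = most recently used
               };

               // hot path: no mutex acquisition, just an atomic flag write
               inline void lazy_make_young(page_stub &page)
               {
                 page.make_young_pending.store(true, std::memory_order_relaxed);
               }

               // would run inside buf_do_LRU_batch()/buf_do_flush_list_batch()
               void apply_pending_make_young(pool_stub &pool)
               {
                 std::lock_guard<std::mutex> g{pool.mutex};
                 for (auto it = pool.LRU.begin(); it != pool.LRU.end(); )
                 {
                   auto cur = it++;
                   if ((*cur)->make_young_pending.exchange(false, std::memory_order_relaxed))
                     pool.LRU.splice(pool.LRU.begin(), pool.LRU, cur); // move to the MRU end
                 }
               }

             The limitation mentioned above is visible here: the batch has to walk the whole list to find the flagged blocks, so if background flushing is not running, flags can pile up.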

             marko Marko Mäkelä added a comment -
            marko Marko Mäkelä made changes -
            Status Open [ 1 ] Confirmed [ 10101 ]
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            Component/s Storage Engine - InnoDB [ 10129 ]
            Component/s Server [ 13907 ]
            Fix Version/s 10.5 [ 23123 ]
            Affects Version/s 10.5 [ 23123 ]
            Affects Version/s 10.6 [ 24028 ]
            Affects Version/s 10.7 [ 24805 ]
            Affects Version/s 10.8 [ 26121 ]
            Affects Version/s 10.9 [ 26905 ]
            Affects Version/s 10.10 [ 27530 ]
            Affects Version/s 10.11 [ 27614 ]
            Affects Version/s 11.0 [ 28320 ]
            Affects Version/s 11.1 [ 28549 ]
            Affects Version/s 11.2 [ 28603 ]
            Affects Version/s 11.3 [ 28565 ]
            Affects Version/s 11.4 [ 29301 ]
            Affects Version/s 11.5 [ 29506 ]
            Affects Version/s 11.6 [ 29515 ]
            Labels performance regression
            marko Marko Mäkelä made changes -
            Status Confirmed [ 10101 ] In Progress [ 3 ]
            mdcallag Mark Callaghan added a comment - - edited

             With respect to things improving in the latest versions of 10.6: from the results here, in most cases when the performance drops in 10.5.26 (col-3) it remains bad in 10.6.19 (col-4) and 10.11.9 (col-5). From the highlighted results there were only 3 microbenchmarks where things got worse in 10.5 and then improved in 10.6 – hot-points, points-covered-si.pre and points-covered-si.

            Forgot to tag @marko


             I intend to address this in 10.6. While there may be slightly fewer calls to the function in 10.6 (especially from the purge subsystem), I do not disagree that this looks like an obvious and rather easily fixable bottleneck.

             marko Marko Mäkelä added a comment -

            https://github.com/MariaDB/server/pull/3522 is a proof of concept and work in progress. Currently, I don’t think it approximates an LRU replacement policy, because the access_time field is being ignored. The field freed_page_clock in buf_pool and in block descriptors would no longer serve any purpose. I also hope to be able to shrink the size of a block descriptor by 8 bytes. With the current version, we will waste 4 bytes to alignment.

             marko Marko Mäkelä added a comment -

             Today’s revision might be usable for some initial testing. Something is wrong regarding buf_pool.LRU_old, but I was not able to figure out the root cause of it yet. I moved some flags to a single std::atomic<uint16_t> that resides inside page_zip_des_t. Now that GCC 7 is the oldest compiler version that is present in a supported GNU/Linux distribution, single-bit fetch_or() or fetch_and() will actually translate into something smarter than loops around lock cmpxchg. Only MSVC requires the use of its intrinsics to generate sane code around a test-and-reset operation (lock btr followed by reading the Carry flag). On a saner ISA such as ARM or POWER there should be no need to worry.
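
             For reference, the kind of single-bit operations referred to look roughly like this (a generic sketch with hypothetical flag names, not the actual page_zip_des_t layout):

               // sketch: several descriptor flags packed into one std::atomic<uint16_t>;
               // recent gcc/clang compile these single-bit fetch_or/fetch_and calls into
               // plain lock or/and (or bts/btr) instead of a loop around lock cmpxchg
               #include <atomic>
               #include <cstdint>

               enum : uint16_t
               {
                 MAKE_YOUNG_PENDING = 1U << 0,      // hypothetical flag names
                 OLD_BLOCK          = 1U << 1,
                 IO_FIXED           = 1U << 2
               };

               struct flags16
               {
                 std::atomic<uint16_t> bits{0};

                 void set(uint16_t f)   { bits.fetch_or(f, std::memory_order_relaxed); }
                 void clear(uint16_t f) { bits.fetch_and(uint16_t(~f), std::memory_order_relaxed); }

                 // test-and-reset: true if the flag was set; clears it in the same operation
                 bool test_and_clear(uint16_t f)
                 { return bits.fetch_and(uint16_t(~f), std::memory_order_relaxed) & f; }
               };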

            I still want to avoid losing 4 bytes of memory in each block descriptor to alignment. I am thinking of reducing the accuracy of the access_time field. Instead of counting milliseconds, we could count something like seconds; 65536 seconds might wrap around too soon (in less than 24 hours). 2¹⁴ seconds would be close to 2³² milliseconds, which we had until now. It should be reasonably safe to steal 14 bits from modify_clock by reducing it to 50 bits. In this way, the size of the block descriptor would be shrunk by 8 bytes.
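
             The bit-stealing could be expressed with a simple bit-field; again, this is only a sketch of the layout being described, not the actual buf_page_t.

               // sketch: 50 bits of modify_clock plus a coarse access_time share one 64-bit
               // word, which is how the 8 bytes per block descriptor would be saved
               #include <cstdint>

               struct block_clock
               {
                 uint64_t modify_clock : 50;  // previously a full 64-bit counter
                 uint64_t access_time  : 14;  // coarse time units instead of 32-bit milliseconds
               };

               static_assert(sizeof(block_clock) == 8, "fits in one 64-bit word");

             For scale, 2¹⁴ units cover about 4.5 hours if the unit is one second, and about 50 days if the unit is roughly 262 seconds (2¹⁸ milliseconds, which is what 2³² milliseconds split into 2¹⁴ slots works out to).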

             marko Marko Mäkelä added a comment -

            The correctness problem was related to ROW_FORMAT=COMPRESSED pages. I also adjusted the UNIV_LRU_DEBUG logic for this refactoring.

            The only remaining part for now is revising the granularity of access_time, which is related to avoiding buffer pool pollution when a page is being loaded into the buffer pool and not going to be accessed after an initial burst of accesses. I think that it should be acceptable to lose some granularity there. The default value of innodb_old_blocks_time is 1000 milliseconds, meaning that any accesses that are within 1 second of the initial access will still count as one access, that is, the block will retain a "least recently used" status.
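
             Expressed as code, the rule being preserved is roughly the following simplified check (the real buffer pool logic is more involved):

               // sketch of the innodb_old_blocks_time rule: accesses within the window after
               // the first access do not promote the block out of the "old" part of the LRU
               #include <cstdint>

               constexpr uint32_t innodb_old_blocks_time = 1000;  // milliseconds (the default)

               // returns true if this access should make the block "young"
               inline bool should_make_young(uint32_t now_ms, uint32_t first_access_ms)
               {
                 return now_ms - first_access_ms >= innodb_old_blocks_time;
               }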

             marko Marko Mäkelä added a comment -

            I just ran some RAM disk based performance tests on my local 2-socket system.

             In some multi-threaded tests with a small buffer pool, I actually observed a performance regression. We’ll need more tests and profiling to find out the culprit. A possible explanation is that the approximated buf_pool.LRU eviction policy is nowhere near close enough to LRU. If that turns out to be the case, we should be able to address that by revising buf_page_t::make_young() in some way.

            With a single-threaded workload (1 table, 100k rows, tiny buffer pool) I observed a significant improvement. So, there is some promise.

             marko Marko Mäkelä added a comment -
            marko Marko Mäkelä made changes -
            Assignee Marko Mäkelä [ marko ] Debarun Banerjee [ JIRAUSER54513 ]
            Status In Progress [ 3 ] In Review [ 10002 ]

             The code changes look OK functionally. However, there could be an impact in different test scenarios because more work is now scheduled for the page cleaner. We need to carefully test different scenarios and then take the decision. My runs of read-only sysbench show an improvement at lower concurrency and a drop at 16 threads and higher for a CPU-oriented test with an SSD.

            As discussed, I think we need to spend more time analyzing and testing the issue.

             debarun Debarun Banerjee added a comment -
            debarun Debarun Banerjee made changes -
            Assignee Debarun Banerjee [ JIRAUSER54513 ] Marko Mäkelä [ marko ]
            Status In Review [ 10002 ] Stalled [ 10000 ]

            monty presented an alternative idea, which roughly is the following: Retain the current logic, but revise it so that buf_page_t::access_time would be updated to the current time so that buf_page_make_young() would be invoked less often. I did not assess the feasibility of that idea. Neither did I have time to analyze deeper why my patch would often introduce a performance regression.

             marko Marko Mäkelä added a comment -
            marko Marko Mäkelä made changes -

            In some critical sections of buf_pool.mutex we are performing potentially very expensive buf_pool.page_hash lookups. MDEV-35125 explains how these lookups could be performed before acquiring buf_pool.mutex.
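
             The general refactoring pattern referred to is hoisting the expensive lookup out of the critical section. A sketch with hypothetical stand-ins (page_hash map, hash_mutex, lru_mutex), not the actual buf_pool code:

               // sketch: perform the hash lookup (under its own latch) before acquiring the
               // LRU mutex, instead of doing the lookup inside the LRU critical section
               #include <mutex>
               #include <unordered_map>

               struct block { bool in_lru = true; };

               std::unordered_map<unsigned long, block*> page_hash;  // hypothetical stand-ins
               std::mutex hash_mutex;  // protects page_hash (the real code uses partitioned latches)
               std::mutex lru_mutex;   // stand-in for buf_pool.mutex

               static block *lookup(unsigned long id)
               {
                 std::lock_guard<std::mutex> g{hash_mutex};
                 auto it = page_hash.find(id);
                 return it == page_hash.end() ? nullptr : it->second;
               }

               void relocate_before(unsigned long id)       // current pattern
               {
                 std::lock_guard<std::mutex> g{lru_mutex};
                 block *b = lookup(id);                     // expensive lookup inside the critical section
                 if (b) { /* ... adjust the LRU position of b ... */ }
               }

               void relocate_after(unsigned long id)        // proposed pattern
               {
                 block *b = lookup(id);                     // lookup first, without holding lru_mutex
                 if (!b) return;
                 std::lock_guard<std::mutex> g{lru_mutex};
                 // re-check that b is still valid under the mutex, since it may have changed
                 // between the lookup and the lock acquisition
                 if (b->in_lru) { /* ... adjust the LRU position of b ... */ }
               }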

             marko Marko Mäkelä added a comment -
            marko Marko Mäkelä made changes -
            mdcallag Mark Callaghan added a comment - - edited

             Results using the latest point releases with both low concurrency (1 thread) and high concurrency (40 threads). Not much is new, other than that I have results from newer versions:

            • I still get better QPS in some cases by adding innodb_log_write_ahead_size=4k to my.cnf
            • I don't see many regressions from old to new MariaDB at low concurrency, so it has done great at avoiding CPU regressions
             • I still see big regressions from old to new MariaDB at high concurrency, so mutex contention is an issue and the changes to InnoDB seem to be the cause
            • the big regressions occur more for read-heavy than for write-heavy and in many cases modern MariaDB (10.5+) gets less than half of the QPS vs older MariaDB. The big regressions for read-heavy arrive in 10.5 and things haven't improved since then

             All of the my.cnf files I use are archived here:
             • for the low concurrency tests
             • for the high concurrency tests

            The gists I link to below use a naming pattern like x.ma110502_rel_withdbg.z11b_c8r32.pk1 and x.ma110502_rel_withdbg.z11b_lwas4k_c8r32.pk1. For now ignore the "x." at the start and the ".pk1" at the end. The "ma110502_rel_withdbg" means:

            • MariaDB ("ma")
            • 11.5.2 ("110502")
            • a build that used CMAKE_BUILD_TYPE=RelWithDebInfo
            • my.cnf is my.cnf.cz11b_lwas4k_c8r32 (see here)

            The links that follow have results in terms of relative QPS (rQPS) which is: (QPS for my version / QPS for older MariaDB). When I look at results across MariaDB major versions then the base case is 10.2.44. When I look at results only for 10.5 point releases then the base case is 10.4.34. When the relative QPS is 1.0 then QPS has not changed, when it is much less than 1.0 then there is a large regression.

             First up are the results for the latest point releases at low concurrency. For point releases that support innodb_log_write_ahead_size I include results with it set to 4k and with it not set. The results are here and modern MariaDB almost always has relative QPS >= 0.9, which is OK.

            For low concurrency with a focus on MariaDB 10.5 see here which has numbers for 10.5.0, 10.5.4, 10.5.10, 10.5.20, 10.5.24 and 10.5.27. The reason to share the results at low concurrency is to show there aren't big regressions in 10.5 at low concurrency in contrast to the results below that I share.

            And note that my focus is on 10.5 because that is where the big regressions first arrived. They are still here, but to understand the root cause I need to show where they start.

             Next up are results from the latest point releases at high concurrency (40 threads) on a server with 48 cores (real cores, AMD SMT is disabled). The relative QPS numbers are here. In the worst cases the relative QPS drops to ~0.4 in 10.5 and has remained there through 11.7. This means older MariaDB gets ~2.5X more QPS than modern MariaDB:
             • points-covered-pk - fetch 1 row by PK on a covering index
             • points-notcovered-pk - fetch 1 row by PK on a non-covering index
             • random-points with range=100 and =1000 - point queries that fetch many rows via a covering index using SELECT with an IN-list
             • range-notcovered-si - do a short range scan on a non-covering secondary index

             Results from the classic sysbench transaction are here. With them the relative QPS in modern MariaDB is 0.83 or 0.88, depending on the length of the range queries it uses. The regressions here arrived in 10.11 and not much has changed since then. If you only use classic sysbench then you will miss some of the large regressions, because the regressions here are smaller than above.

             A few write-heavy tests also have large regressions, but they aren't as bad as the results mentioned above:

            • insert - relative QPS is 0.79, these arrived in 10.11
            • update-index - relative QPS is 0.80 in 11.7, these arrived in 10.5 but perf improved a bit since then
            • update-inlist - relative QPS is 0.48 in 11.7, these arrived in 10.5, then got worse in 10.11 and have been stable (not getting better) since then

             Results from 10.5 point releases are here for high concurrency. I tried all point releases from 10.5.0 through 10.5.10, the even-numbered point releases from 10.5.12 through 10.5.26, and then 10.5.27. I won't annotate them other than to say that the big regressions for read-heavy microbenchmarks arrived prior to 10.5.10.


            I will repeat tests on the large server to collect thread stacks (PMP) and flamegraphs
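            For reference, a minimal sketch of how the PMP stacks and flamegraphs are typically collected (this assumes gdb, perf and Brendan Gregg's FlameGraph scripts are installed and on PATH; the exact commands and output names are illustrative, not taken from this report).

                # Poor Man's Profiler: dump all thread stacks of the running server once
                gdb -ex "set pagination 0" -ex "thread apply all bt" -batch \
                    -p "$(pidof mariadbd || pidof mysqld)" > pmp.raw.txt

                # On-CPU flamegraph: sample kernel+user stacks for 60s, then fold and render
                perf record -F 99 -a -g -- sleep 60
                perf script | stackcollapse-perf.pl | flamegraph.pl > mariadbd.svg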

            mdcallag Mark Callaghan added a comment -
            julien.fritsch Julien Fritsch made changes -
            Fix Version/s 11.2(EOL) [ 28603 ]
            mdcallag Mark Callaghan added a comment - - edited

            Aggregated thread stacks from PMP on the 48-core server with sysbench using 40 threads and a cached workload. All of the aggregated thread stacks ("hierarchical") are in subdirectories here.

            From random-points.range1000 for 10.4.34 and 10.5.27, the obvious problem is the most frequent thread stack for 10.5.27, which is mutex contention on the InnoDB LRU (buf_page_make_young). The problem is the same for random-points.range100, points-covered-pk, points-notcovered-pk and range-notcovered-si, and it is still there in 11.4.4.

            futex_wait,__GI___lll_lock_wait,___pthread_mutex_lock,inline_mysql_mutex_lock,buf_page_make_young,buf_page_make_young_if_needed,buf_page_get_low,btr_cur_search_to_nth_level,btr_pcur_open_with_no_init_func,row_search_mvcc,ha_innobase::index_read,handler::ha_index_read_map,handler::read_range_first,handler::multi_range_read_next,Mrr_simple_index_reader::get_next,DsMrr_impl::dsmrr_next,QUICK_RANGE_SELECT::get_next,rr_quick,READ_RECORD::read_record,sub_select,do_select,JOIN::exec_inner,JOIN::exec,mysql_select,handle_select,execute_sqlcom_select,mysql_execute_command,Prepared_statement::execute,Prepared_statement::execute_loop,Prepared_statement::execute_loop,mysql_stmt_execute_common,mysqld_stmt_execute,dispatch_command,do_command,do_handle_one_connection,handle_one_connection,pfs_spawn_thread,start_thread,clone3
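            For reference, comma-separated stacks like the one above are what the classic poor-man's-profiler aggregation produces; a minimal sketch of that aggregation, assuming the gdb output was saved to a hypothetical pmp.raw.txt:

                # Collapse each gdb backtrace into one comma-separated line of function names
                # (innermost frame first), then count how often each distinct stack occurred.
                awk '
                  BEGIN     { stack = "" }
                  /^Thread/ { if (stack != "") print stack; stack = "" }
                  /^#/      { if (stack == "") stack = $4; else stack = stack "," $4 }
                  END       { if (stack != "") print stack }
                ' pmp.raw.txt | sort | uniq -c | sort -rn | head -20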
            

            But range-notcovered-si has two thread stacks with buf_page_make_young in the top 2 – one for the lock, one for the unlock – see here for 10.5.27, and the problem remains in 11.4.4.

            For insert there is more mutex contention in 10.5 from the InnoDB redo and the binlog – the top N stacks are here for 10.4.34 and 10.5.27. The problem remains in 11.4.4.

            For update-inlist the stacks are here for 10.4.34 and 10.5.27 and there is just more mutex contention from InnoDB in 10.5.27. The problem remains in 11.4.4.


            Another round of results is here with a cached database, a small server, and sysbench using 1 thread. The numbers are the relative QPS, which is (QPS for my version / QPS for MariaDB 10.2.44). Mostly I see a few small regressions and a few small improvements. The largest improvement is from setting innodb_log_write_ahead_size for 10.11 through 11.7.
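            As a worked example of that arithmetic, assuming two hypothetical files that each hold one QPS value per line in the same microbenchmark order (base = 10.2.44, candidate = the version being compared):

                # relative QPS = candidate QPS / base QPS; values well below 1.0 are regressions
                paste qps.base.txt qps.candidate.txt | awk '{ printf "%.2f\n", $2 / $1 }'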

            This has results for:

            • ma100244_rel_withdbg.z11a_c8r32 - 10.2.44 with the z11a_c8r32 config
            • ma100339_rel_withdbg.z11a_c8r32 - 10.3.39 with the z11a_c8r32 config
            • ma100434_rel_withdbg.z11a_c8r32 - 10.4.34 with the z11a_c8r32 config
            • ma100527_rel_withdbg.z11a_c8r32 - 10.5.27 with the z11a_c8r32 config
            • ma100620_rel_withdbg.z11a_lwas4k_c8r32 - 10.6.20 with the z11a_c8r32 config
            • ma101110_rel_withdbg.z11a_lwas4k_c8r32 - 10.11.10 with the z11a_c8r32 config
            • ma110404_rel_withdbg.z11b_lwas4k_c8r32 - 11.4.4 with the z11b_c8r32 config
            • ma110502_rel_withdbg.z11b_lwas4k_c8r32 - 11.5.2 with the z11b_c8r32 config
            • ma110602_rel_withdbg.z11b_lwas4k_c8r32 - 11.6.2 with the z11b_c8r32 config
            • ma110701_rel_withdbg.z11b_lwas4k_c8r32 - 11.7.1 with the z11b_c8r32 config

            The my.cnf files are in the subdirectories here. The differences in them are:

            • z11a_c8r32 uses innodb_flush_method while z11b_c8r32 uses the new options (innodb_log_file_buffering etc)
            • the 11.6.2 and 11.7.1 config files add innodb_snapshot_isolation=OFF

            And this has results for a few more configs (see here)

            • z11a_lwas4k_c8r32 - adds innodb_log_write_ahead_size=4096, see this issue
            • z11a_si_c8r32 - adds innodb_snapshot_isolation=ON
            • z11b_lwas4k_si_c8r32 - adds innodb_log_write_ahead_size=4096 and innodb_snapshot_isolation=ON

            I don't see an impact from enabling or disabling snapshot isolation. I do see an impact from innodb_log_write_ahead_size=4096 but that is explained in issue 33894.
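            To make the config variants concrete, here is a hypothetical my.cnf excerpt along these lines (the option names are the ones discussed in this issue; the values and layout are illustrative assumptions, not the actual cz11a/cz11b files):

                [mysqld]
                # z11a-style: redo/data file I/O controlled via the legacy option
                innodb_flush_method=O_DIRECT
                # z11b-style replaces innodb_flush_method with the newer options, e.g.
                #innodb_log_file_buffering=OFF
                # the "lwas4k" variants add
                #innodb_log_write_ahead_size=4096
                # the "si" variants add this (the 11.6/11.7 base configs set it to OFF)
                #innodb_snapshot_isolation=ON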

            mdcallag Mark Callaghan added a comment -

            In https://github.com/MariaDB/server/pull/3522 I tried to address the observed contention in buf_page_make_young() by tweaking the buffer pool LRU replacement policy. It helped in one test workload but caused performance regressions in others. I think that the way forward would be to concentrate on one test case where that tweak causes a regression. I think that it had better be a read-only workload, so that it can be more deterministic. My hypothesis is that the regression occurs because we are evicting too recently used pages, causing more frequent reloading of pages.

            There are some other performance bottlenecks around dict_sys.latch (MDEV-35000, MDEV-34999, MDEV-34988, MDEV-33594), which are more straightforward to fix. Those ought to be largely independent of any buf_pool.mutex contention, though.

            marko Marko Mäkelä added a comment -

            People

              marko Marko Mäkelä
              mdcallag Mark Callaghan
              Votes: 1
              Watchers: 12

