[MDEV-30628] 10.6 performance regression with sustained high-connection write-only OLTP workload (55-80% degradation) Created: 2023-02-09  Updated: 2024-01-12

Status: Needs Feedback
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.6.11, 10.6.12
Fix Version/s: 10.6, 10.11, 11.0, 11.1, 11.2

Type: Bug Priority: Critical
Reporter: Dishon Merkhai Assignee: Marko Mäkelä
Resolution: Unresolved Votes: 5
Labels: performance, regression

Attachments: Zip Archive Logs & Flamegraphs.zip     PNG File MDEV-26827-Make-page-flushing-even-faster.png     Zip Archive MDEV-30628 Benchmarks March 24-27.zip     PNG File Revert-most-of-ea52a3e.png     Zip Archive benchmarks-fixed-hangs-commit-00d2a3c7.zip     Zip Archive benchmarks-with-metrics.zip     PNG File mariadb-10-5-run-1-tps.png     PNG File mariadb-10-6-13-first-commit-six-runs.png     PNG File mariadb-10-6-13-with-branch-average.png     PNG File mariadb-10-6-13-with-branch-four-runs.png     PNG File mariadb-10-6-run-1-tps.png     PNG File mariadb-commit-00d2a3c7-five-runs.png     File my.cnf    
Issue Links:
PartOf
includes MDEV-29401 InnoDB history list length increased ... Closed
Relates
relates to MDEV-33213 History list is not shrunk unless the... Closed
relates to MDEV-30501 MariaDB 10.6.11 performance 65% slowe... Closed
relates to MDEV-31676 Innodb history length keeps growing Closed
relates to MDEV-32050 UNDO logs still growing for write-int... Closed

 Description   

Summary

On MariaDB 10.6.12 (also present in 10.6.11), under a sustained (800-second) OLTP write-only workload at 64 benchmark threads/connections (48-core / 96-thread hardware), throughput starts high (~2200 TPS) and then gradually decreases from the 450-second mark until the end of the run (ending at ~980 TPS, a 55.4% drop in performance). Performance is stable on MariaDB 10.5.18 (no regression), which averaged 2625 TPS over the 800s benchmark with no drop-off in TPS. innodb_flush_method was set to O_DIRECT for both the MariaDB 10.6 and 10.5 benchmarks.
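
For reference, the quoted drop-off percentages follow directly from the TPS endpoints (a trivial Python check using the approximate values above):

```python
def drop_off_pct(start_tps, end_tps):
    """Percentage throughput loss from the start to the end of a run."""
    return (start_tps - end_tps) / start_tps * 100

# ~2200 -> ~980 TPS over the 800s run; ~2350 -> ~500 TPS over the 1500s run.
run_800s = drop_off_pct(2200, 980)
run_1500s = drop_off_pct(2350, 500)
```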

Additional information

  • TPS continues to decrease if more time is added to the benchmark:
    • ~2350 TPS to ~500 TPS over a 1500s benchmark (79% drop-off in performance).
  • CPUs utilized (measured by task-clock):
    • MariaDB 10.6.12: 0.732 CPUs
    • MariaDB 10.5.18: 1.490 CPUs
  • Issue seems to happen when the number of connections is greater than or equal to the number of CPU cores; the issue did not appear at 32 threads, but started appearing at >=48 threads (on a 48-core / 96-thread CPU).
  • Flamegraph results show CPU time spent on buf_page related functions, with buf_LRU_free_page blocking a lot of the on-CPU time.
  • innodb_buffer_pool_wait_free metric increases over the entire duration of the run. At ~200s, the metric is at ~40,000, but at ~420s (where the regression happens) it's at ~115,000. By the end of the run (800s) the metric is at ~300,000.
  • Performance schema shows a lot of time spent in buf_pool_mutex.
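
Counters such as innodb_buffer_pool_wait_free can be tracked by periodically capturing SHOW GLOBAL STATUS and diffing successive samples. A minimal Python sketch of the parsing step (the sample strings below are illustrative, not taken from the actual run logs):

```python
def parse_status(output):
    """Parse tab-separated `mysql -e "SHOW GLOBAL STATUS"` output into a
    dict of integer-valued counters (non-numeric values are skipped)."""
    metrics = {}
    for line in output.strip().splitlines():
        name, _, value = line.partition("\t")
        if value.isdigit():
            metrics[name] = int(value)
    return metrics

# Illustrative samples; the real run was ~40,000 at 200s and ~115,000 at 420s.
sample_t200 = "Innodb_buffer_pool_wait_free\t40000"
sample_t420 = "Innodb_buffer_pool_wait_free\t115000"
delta = (parse_status(sample_t420)["Innodb_buffer_pool_wait_free"]
         - parse_status(sample_t200)["Innodb_buffer_pool_wait_free"])
```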

Reproduce

  • See the my.cnf file attached.
  • Ran Sysbench 1.0.20, wrapped in mini-benchmark.sh, with the following modifications:
    • oltp_write_only workload.
    • 64 benchmark threads only.
    • 800 second benchmark duration.
    • Sysbench prepare:
      • sysbench oltp_write_only prepare --threads=16 --tables=20 --table-size=100000

    • Sysbench run:
      • sysbench oltp_write_only run --threads=$t --time=800 --tables=20 --table_size=100000 --db-ps-mode=disable --simple_ranges=0 --order_ranges=0 --distinct_ranges=0 --sum_ranges=0 --report-interval=10
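
To graph TPS over time (as in the attached PNGs), the interval lines produced by --report-interval=10 can be parsed out of the sysbench log. A small Python sketch (the sample lines are abbreviated; real sysbench 1.0 output carries more fields per line):

```python
import re

# A sysbench 1.0 interval line looks roughly like:
# [ 10s ] thds: 64 tps: 2201.57 qps: 13209.44 (...) lat (ms,95%): 41.10 ...
LINE = re.compile(r"\[\s*(\d+)s\s*\].*?tps:\s*([\d.]+)")

def tps_series(log_text):
    """Extract (elapsed_seconds, tps) pairs from sysbench interval output."""
    return [(int(m.group(1)), float(m.group(2)))
            for m in LINE.finditer(log_text)]

sample = ("[ 10s ] thds: 64 tps: 2201.57 qps: 13209.44\n"
          "[ 20s ] thds: 64 tps: 2189.03 qps: 13134.18")
```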

Other Details

  • Ran the benchmarks on the following hardware (virtualized environment):
    • m5.12xlarge AWS EC2 instance
      • Intel Xeon Platinum 8175 processor @ 2.5 GHz (3.1 GHz turbo, 48 cores / 96 threads)
      • 192 GB memory
    • 500 GB SSD-based storage
  • Issue was reproduced on other instances and CPUs (Intel, ARM) at high threads/connections.

"Logs & Flamegraphs" Folder

The attached Logs & Flamegraphs.zip file contains the following:

  • MariaDB 10.5 benchmark (1 run):
    • Run 1: Same configuration as the "Reproduce" section
  • MariaDB 10.6 benchmarks (3 runs):
    • Run 1: Same configuration as the "Reproduce" section
    • Run 2: Same configuration as Run 1, but with a flamegraph (perf record)
    • Run 3: Same configuration as Run 1, but with a flamegraph of only the last 330 seconds of the run (approximately when the degradation occurs).

Each run has a photo graphing the TPS over time. The graphs for run 1 of MariaDB 10.5.18 and 10.6.12 are also attached directly to this ticket.



 Comments   
Comment by Otto Kekäläinen [ 2023-02-16 ]

If you, marko, can provide some pointers here, Dishon might be able to spend a week or two trying to submit a PR to fix this.

Comment by Marko Mäkelä [ 2023-02-16 ]

otto, in fact, last week I worked on some changes for MDEV-26055 and MDEV-26827 that have passed initial correctness testing but have not yet been tested for performance. This was motivated by a discussion that I had with jgagne on February 4 at FOSDEM 2023. The idea is to retain some buffer of buf_pool.free pages, possibly by scheduling buf_flush_page_cleaner up to 10 times per second, and to remove some bottlenecks around page flushing. What I currently have still retains the 1 Hz scheduling rate of buf_flush_page_cleaner.
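
The free-page-buffer idea can be sketched as a tiny feedback rule (a toy Python model; the target size, batch cap, and policy are assumptions for illustration, not InnoDB's actual heuristics):

```python
def pages_to_flush(free_pages, free_target, max_batch=200):
    """How many pages one page-cleaner pass should write out or evict so
    that buf_pool.free stays near a target size (illustrative model).
    At 10 wakeups per second, small shortfalls can be corrected before
    user threads stall waiting for a free page."""
    shortfall = free_target - free_pages
    return max(0, min(shortfall, max_batch))
```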

I will try to prepare a branch with these changes today or tomorrow. Testing and some analysis of parameters like those in MDEV-23855 will be very helpful.

Comment by Marko Mäkelä [ 2023-02-16 ]

otto, I created a branch with some performance-oriented changes, with two commits for MDEV-26055 and a merge of MDEV-26827 (which was not helpful the last time it was tested). I have not run any performance tests on these yet. I hope that it is a step in the right direction, and I am looking forward to your feedback.

Comment by Dishon Merkhai [ 2023-02-17 ]

Marko Mäkelä, thank you for providing insight into related developments, as well as compiling performance-related changes into a branch. I benchmarked the branch you shared and found that the issue still happens, but two characteristics are different. Comparing the branch (which is based on MariaDB 10.6.13) against the baseline MariaDB 10.6.12:

1. The overall TPS prior to the performance degradation is lower with the branch: ~1800 TPS with the branch on MariaDB 10.6.13 vs. ~2300 TPS on the baseline MariaDB 10.6.12 (i.e. 21.74% slower with the branch).
2. The sudden performance drop-off happens around 50-100 seconds later with the branch on MariaDB 10.6.13 when compared against the baseline MariaDB 10.6.12.

I ran the benchmark four times using the same configuration/environment as in the original bug report. I've attached two graphs, mariadb-10-6-13-with-branch-four-runs.png and mariadb-10-6-13-with-branch-average.png. The first shows the TPS over time for each of the four runs, the second shows the average TPS across all four runs.

Note: I'm not sure whether the difference in performance is related to 10.6.12 vs. 10.6.13 or whether it's due to the changes in the branch. I can run benchmarks with MariaDB 10.6.13 as the baseline to determine that as well.

Comment by Marko Mäkelä [ 2023-02-20 ]

dishonme, thank you for running some benchmarks.

First, I need to clarify that the VERSION file in the source tree is usually updated to indicate the next minor release as soon as the latest release has been published. This "10.6.13" is actually pretty close to the mariadb-10.6.12 release tag.

It is very likely that MDEV-26827 is still causing a performance degradation instead of an improvement. My past attempts to track down I/O or mutex stalls have been somewhat hit-or-miss. The tool offcputime-bpfcc looks promising, but it would often fail to produce usable stack traces. Finding CPU bottlenecks is trivial, but waits are trickier.

The reduction of throughput over time may be due to MDEV-29401. Did you try to collect Innodb_history_list_length, Innodb_checkpoint_age, Innodb_buffer_pool_pages_free and possibly other counters during the run?

Comment by Marko Mäkelä [ 2023-02-20 ]

dishonme, I posted some quick results in MDEV-26055. In short, anything after the first commit in that branch could significantly reduce performance. I think that we need something like that fix, plus some follow-up to MDEV-26356: Either revert the throttling of purging history, or increase the checkpoint flushing efforts when that throttling is active.

Comment by Dishon Merkhai [ 2023-02-22 ]

Marko Mäkelä, I tracked only the innodb_buffer_pool_wait_free metric during the original benchmarks. In response to your comments, I'll do the following next:

I could also attempt to narrow down the exact minor/patch version of 10.6 where the issue first appears, though I'm not immediately sure how useful that would be.

Thanks as well for helping me interpret the VERSION file.

Comment by Marko Mäkelä [ 2023-02-24 ]

dishonme, thank you, I am looking forward to your results.

I see these possible follow-ups to MDEV-26356:

  • If the first commit for MDEV-26055 fixes the original issue behind MDEV-26356, that fix can be reverted.
  • Else:
    1. When the purge is being throttled, increase the checkpoint flushing activity (say, by invoking buf_flush_ahead(lsn, false) from the purge coordinator task).
    2. If the history list length grows large enough, purge at maximum speed.
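
The two-stage policy above could look something like this toy decision model (a Python sketch; the thresholds and return labels are made up for illustration, and buf_flush_ahead(lsn, false) is only referenced in a comment):

```python
def purge_policy(history_len, throttled, hard_limit=500_000):
    """Toy model of the proposed follow-up to MDEV-26356: boost checkpoint
    flushing while purge is throttled, and switch to full-speed purging
    once the history list grows past a hard limit (threshold is made up)."""
    if history_len >= hard_limit:
        return "purge_at_max_speed"
    if throttled:
        # In InnoDB terms: e.g. invoke buf_flush_ahead(lsn, false)
        # from the purge coordinator task.
        return "flush_ahead"
    return "normal"
```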
Comment by Dishon Merkhai [ 2023-02-27 ]

Marko Mäkelä, I will run benchmarks on the first commit today. I should have results in a few hours.

Comment by Dishon Merkhai [ 2023-02-27 ]

Marko Mäkelä, I have attached benchmark results (six runs) of only the first commit (improve adaptive flushing).
The performance with just the first commit is similar to the original 10.6.12 benchmarks (which were done without performance-related commits).

I still need to run benchmarks with metrics collected; I'll aim to get that done today and update when complete - same goes for the off-CPU flamegraphs.

Comment by Marko Mäkelä [ 2023-03-01 ]

dishonme, thank you. That commit made a significant improvement in our internal Sysbench TPC-C tests, but caused a regression in some other tests. I am revising it.

Did you sample the history list length during the run? While I believe that it is a plausible explanation for the regression that you are observing, we do not have any evidence of that yet.

Comment by Marko Mäkelä [ 2023-03-01 ]

dishonme, I force-pushed the branch. I did not have a chance to run any performance tests on the revised commit of MDEV-26055 yet. It is based on the same 10.6 parent commit.

Comment by Marko Mäkelä [ 2023-03-02 ]

Initial results from our internal testing (with a non-default innodb_max_dirty_pages_pct_lwm) are positive. Another test with the default innodb_max_dirty_pages_pct_lwm=0 has been started, so that the buf_flush_page_cleaner() will have a chance to write out or evict any pages from the buf_pool.LRU list.

Comment by Dishon Merkhai [ 2023-03-02 ]

Marko Mäkelä, I apologize for the lack of updates; I'll run benchmarks with metrics (history list length, etc.) today and share the results. I'll collect metrics on:

  • First commit of performance-related branch
  • Revised first commit of the performance-related branch
  • 10.6.12 branch (baseline - same environment as original benchmarks)
Comment by Dishon Merkhai [ 2023-03-03 ]

Marko Mäkelä, I have attached a zip file ( benchmarks-with-metrics.zip ) containing metrics and sysbench logs for the three binaries (first commit, revised first commit, and the baseline 10.6.12 branch).

For all three, Innodb_history_list_length steadily increases, reaching 1,420,000+ by the end of the 800s run.

Minor note: The 800s mark of the metric logs is a record of the metrics a few seconds (1-5s) after the benchmark is completed. Looking at the 790s mark should be more indicative of the metrics at the very end of the run while load is still applied.

Comment by Marko Mäkelä [ 2023-03-03 ]

dishonme, thank you. This confirms my suspicion that MDEV-29401 needs to be addressed, that is, some follow-up to MDEV-26356 is needed. I think that MDEV-26055 needs to be completed first in any case. We may also have to change the default from innodb_max_dirty_pages_pct_lwm=0 to get better performance in a wider variety of workloads.

Comment by Marko Mäkelä [ 2023-03-03 ]

For the record, I attempted to revive MDEV-26827 again, rebased on top of MDEV-26055. For now, it is split into multiple commits for testing purposes. I already noticed some occasional assertion failures on our CI system (but not locally), so it definitely needs some more work.

Comment by Marko Mäkelä [ 2023-03-10 ]

dishonme, we got some very promising performance test results. I have now rebased the branch. This commit corresponds to what showed promising results. After discussing with axel, I added the status variable INNODB_BUFFER_POOL_PAGES_SPLIT, because we suspect that increased write activity starting at the 1200-second mark of his benchmark could be related to page splits.

I would appreciate it if you could repeat the performance tests. Based on our own testing, it might be that nothing further needs to be done about MDEV-29401 and the history list length.

Comment by Marko Mäkelä [ 2023-03-10 ]

One more fixup to buf_LRU_get_free_block() is needed to prevent a hang.

Comment by Dishon Merkhai [ 2023-03-13 ]

Marko Mäkelä, that's great news - I'll run the benchmarks later tonight based on the fixup commit and share the results here by tomorrow.

Comment by Marko Mäkelä [ 2023-03-13 ]

dishonme, thank you. If possible, I would recommend testing the current head of the branch. The branch now also includes a fix of MDEV-30357.

Comment by Dishon Merkhai [ 2023-03-14 ]

Marko Mäkelä, I didn't get to run the benchmark yesterday. I see there are also new commits; I'll run the benchmark on the current head of the branch and let you know when I have the results.

Comment by Dishon Merkhai [ 2023-03-15 ]

Marko Mäkelä, I have attached a zip file ( benchmarks-fixed-hangs-commit-00d2a3c7.zip ) containing five runs based on Fix hangs in buf_LRU_get_free_block() (commit 00d2a3c7). The fourth run contains metrics (run-4-with-metrics/).

It looks like throughput still decreases over time; however, the first 500 seconds are more stable compared to past benchmarks on different commits/branches (judging by the sudden TPS dips).

Comment by Marko Mäkelä [ 2023-03-16 ]

dishonme, thank you. Sorry, it has been a moving target. There were some changes on Tuesday and Wednesday, to avoid hangs related to buffer page eviction. Could you please test 10.6 as of this commit once more?

Comment by Marko Mäkelä [ 2023-03-21 ]

dishonme, I see that during the 800 seconds of the run-4-db-metrics.log, Innodb_history_list_length grows all the way to 1,347,228 (at 790s) before starting to shrink a little during the last 10-second interval. This strongly suggests that we need to do something about MDEV-29401, that is, some follow-up to MDEV-26356. I think that the version that you tested is close enough to the final version that was pushed to 10.6, so I assume that you’d get similar results for the current 10.6 branch.

Based on our internal testing, we will not change the default value innodb_max_dirty_pages_pct_lwm=0.

Comment by Marko Mäkelä [ 2023-03-22 ]

dishonme, I created a branch that removes the throttling that was added in MDEV-26356. If that fixes this and does not cause any other regression, then it would be a simple solution.

Comment by Dishon Merkhai [ 2023-03-22 ]

Marko Mäkelä,

I see that during the 800 seconds of the run-4-db-metrics.log, Innodb_history_list_length grows all the way to 1,347,228 (at 790s) before starting to shrink a little during the last 10-second interval.

The decrease in the last 10 seconds is very likely due to the sysbench benchmark terminating (the metric log is lagging by a few seconds).

I'll benchmark the following commits tomorrow morning and share the results:

Comment by Marko Mäkelä [ 2023-03-27 ]

dishonme, I am looking forward to your updates. There is also MDEV-29593 that is work in progress. It might contribute to reducing the history list length, by allowing some more undo log page writes to be eliminated.

Comment by Dishon Merkhai [ 2023-03-27 ]

Marko Mäkelä, I have attached results for the code changes: MDEV-30628 Benchmarks March 24-27.zip. The folder contains benchmark logs, graphs, and (for some of the runs) metric files.

The results for MDEV-26827-Make-page-flushing-even-faster show throughput begin to decrease around the 480s mark.

The results for Revert-most-of-ea52a3e are mixed:

  • The good runs (1,2,6,7,8) sustain the throughput until around 560s (~100s longer than the baseline 10.6.12 branch).
  • The bad runs (3, 4, 5) had reduced throughput over the entire duration of the run (or, for run 4, starting 150s into the run). I'm not sure why runs 3, 4, 5 diverged from the usual throughput pattern; I re-ran the benchmark (runs 6, 7, 8) and could not reproduce the behaviour.

Comment by Marko Mäkelä [ 2023-03-29 ]

dishonme, thank you. axel mentioned having observed something similar, and I also remember from the past (some time in 2021 or 2022) that when I ran the Sysbench oltp_update_index benchmark for some minutes, the growth of the history list length would cause a significant slowdown.

A rule of thumb from your runs seems to be that once the history list length grows above 950,000, performance will suffer. In a "bad" run of Revert-most-of-ea52a3e, I see that the history list remains much shorter than in a "good" run that I checked. Whether that is due to less work having been done, or to something else, I do not know. It could be useful to also log Innodb_max_trx_id (export_vars.innodb_max_trx_id = trx_sys.get_max_trx_id()) to keep track of how much actual work had been done at each point.
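
As a sketch of that normalization (Python, with made-up sample values): dividing the growth of the history list by the growth of the maximum transaction ID over the same interval gives a rough work-normalized growth rate, which makes "bad" runs that simply did less work distinguishable from runs where purge genuinely fell behind.

```python
def history_per_trx(samples):
    """Given two (Innodb_history_list_length, Innodb_max_trx_id) samples
    taken at different points in a run, return the undo history growth
    per transaction ID consumed between them."""
    (h0, t0), (h1, t1) = samples
    return (h1 - h0) / (t1 - t0)

# Hypothetical samples: early in the run vs. near the end.
rate = history_per_trx([(100_000, 1_000_000), (950_000, 2_700_000)])
```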

It looks like right after we are done testing MDEV-29593, we must really fix MDEV-29401 as the next thing. Possibly there should be some adaptive algorithm to make the purge of committed transaction history more eager or aggressive when the history list grows too long. My intuition says that there may have been a chaotic or lucky balance between different subsystems of InnoDB, which could have been disrupted by the removal of the sync_array (MDEV-21452) and lock_wait_timeout_task a.k.a. lock_wait_timeout_thread (MDEV-24671).

Comment by Marko Mäkelä [ 2023-03-29 ]

To combat the growth of the history list, I have one more idea:

diff --git a/storage/innobase/row/row0purge.cc b/storage/innobase/row/row0purge.cc
index 753b42332fc..c7d42d91744 100644
--- a/storage/innobase/row/row0purge.cc
+++ b/storage/innobase/row/row0purge.cc
@@ -1081,6 +1081,14 @@ row_purge_parse_undo_rec(
 		goto err_exit;
 	}
 
+	if (trx_sys.history_exceeds(500000)) {
+		/* Acquire exclusive dict_sys.latch to throttle DML and to
+		avoid increased wait times in trx_purge_get_next_rec()
+		and trx_purge_truncate_history(). */
+		dict_sys.lock(SRW_LOCK_CALL);
+		dict_sys.unlock();
+	}
+
 already_locked:
 	ut_ad(!node->table->is_temporary());
 

Something similar (minus the if statement) was the only thing that was not reverted in Revert-most-of-ea52a3e.

Comment by Marko Mäkelä [ 2023-04-26 ]

I can reproduce the excessive history list length growth in MDEV-29401.

Comment by Marko Mäkelä [ 2023-04-27 ]

dishonme, can you please test a 10.6 branch that includes a fix of MDEV-29401? It is actually an improvement to the way the parameters innodb_max_purge_lag and innodb_max_purge_lag_delay work.

Furthermore, I wonder if 10.6 would be closer to 10.5 performance if you specified innodb_flush_method=fsync, essentially reverting MDEV-24854. For read workloads on a tiny buffer pool backed by a larger file system cache (which in my opinion is a somewhat awkward setting), not disabling the cache via innodb_flush_method=O_DIRECT seems to help. That would have to be improved in MDEV-29967. I believe that improved read-ahead logic would help more in range scans, while the purge of history only invokes key lookup operations.
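
That experiment only requires changing one line in the [mysqld] section of my.cnf (a sketch; fsync is the experimental value under discussion here, not a recommended production setting):

```ini
[mysqld]
# 10.6 default since MDEV-24854; bypasses the file system cache:
# innodb_flush_method = O_DIRECT
# Experimental: let reads go through the file system cache again
innodb_flush_method = fsync
```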

Comment by Marko Mäkelä [ 2023-07-07 ]

I’m sorry to see that this report had been closed as incomplete.

Some related news:

  • Linear read-ahead was broken until MDEV-29967 was fixed.
  • In MDEV-30986 I found out that enabling innodb_random_read_ahead=ON can significantly reduce wait times for some workloads.
  • The growth of the history list remains a problem, even after enabling the DML throttling that was improved in MDEV-29401. It may have been made worse by correctness fixes such as MDEV-31355.
Comment by Marko Mäkelä [ 2023-08-11 ]

dishonme, does your test fare any better with MariaDB Server 10.6.14, or better yet, with a recent development snapshot of 10.6 (which is close to the upcoming 10.6.15 release)?

I believe that the main remaining problem is the history list growth. It would be advisable to play with the DML rate limiting parameters that were improved in MDEV-29401.

For I/O-bound workloads (where the working set exceeds innodb_buffer_pool_size), it could be worth testing whether innodb_flush_method=fsync (essentially reverting MDEV-24854) would help. It could, among other things, reduce the latency of reads into the buffer pool by the purge worker threads. If that helps, implementing MDEV-11378 should make the 10.6 default innodb_flush_method=O_DIRECT work better.

Comment by Marko Mäkelä [ 2023-10-25 ]

dishonme, can you please test the latest 10.6 branch, which now includes a fix of MDEV-32050? If you prefer pre-built packages, they should soon appear at https://ci.mariadb.org/39814/.

Comment by Dishon Merkhai [ 2023-11-03 ]

Hi Marko Mäkelä, apologies for the lack of responses - I can test the latest revision of 10.6 this coming week and will share the results once completed.

Comment by Marko Mäkelä [ 2024-01-12 ]

I think that MDEV-33213 could address this.

Comment by Dishon Merkhai [ 2024-01-12 ]

Hi, I can test the commits resulting from MDEV-33213.

The benchmarks I ran for the previously mentioned MDEV-32050 seemed to show some improvement, but the issue described in this Jira was still present (the degradation showed up later in the benchmark, but it still occurred).

Will share results once prepared.

Generated at Thu Feb 08 10:17:41 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.