MariaDB Server / MDEV-33966

sysbench performance regression with concurrent workloads

Details

    • Type: Bug
    • Status: Stalled
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 10.5, 10.6, 10.7(EOL), 10.8(EOL), 10.9(EOL), 10.10(EOL), 10.11, 11.0(EOL), 11.1(EOL), 11.2(EOL), 11.3(EOL), 11.4, 11.5(EOL), 11.6(EOL)
    • Fix Version/s: 10.6, 10.11, 11.4
    • Environment: ubuntu 22.04

    Description

      While I haven't seen significant performance regressions when comparing modern MariaDB (11.4, 10.11) with older MariaDB via sysbench with low-concurrency workloads (see here), I have seen regressions once I use workloads with some concurrency.

      This will take a few days to properly document.

      From a server with 8 cores and sysbench run with 4 threads ...

      • the numbers in the table are the throughput relative to MariaDB 10.2.44 (x.ma100244_rel.z11a_bee.pk1) where 1.0 means the same, < 1.0 means a regression and > 1.0 means an improvement
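      As a concrete illustration of the metric, here is a minimal sketch; the QPS numbers in it are hypothetical, not values from these runs:

      # Minimal sketch of the relative-throughput metric used in the tables.
      # The QPS values below are hypothetical, only to show the arithmetic.
      BASE = "ma100244"  # MariaDB 10.2.44 is the base case

      qps = {
          "ma100244": 4000.0,   # base case (hypothetical)
          "ma110401": 1000.0,   # MariaDB 11.4.1 (hypothetical)
      }

      def relative_throughput(version: str) -> float:
          """Throughput of `version` divided by throughput of the base case."""
          return qps[version] / qps[BASE]

      for version in qps:
          r = relative_throughput(version)
          status = "regression" if r < 0.8 else "ok"   # the 0.8 cutoff used below
          print(f"{version}: relative throughput {r:.2f} ({status})")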

      If I use 0.8 as a cutoff, meaning some version gets less than 80% of the throughput relative to MariaDB 10.2, then from column 6 (col-6) the problem microbenchmarks are:

      • update-index_range=100, relative throughput is 0.25 in 11.4.1, problem arrives in 10.3
      • update-one_range=100, relative throughput is 0.65 in 11.4.1, problem arrives in 10.6
      • write-only_range=10000, relative throughput is 0.77 in 11.4.1, problem arrives in 10.3

      Next step for this is to get flamegraphs and maybe PMP stacks.

      The table relies on fixed-width fonts to be readable, but the "preformatted" option in JIRA doesn't do what I want it to do, so the data is here.

      Next up is a server with 2 sockets and 12 cores/socket and the benchmark was run with 16 threads. The results are here. Again, using 0.8 as a cutoff and looking at col-6 (MariaDB 11.4.1) the problem microbenchmarks are:

      • insert_range=100, relative throughput is 0.73 in 11.4.1, there are gradual regressions starting in 10.3, but the largest are from 10.11 and 11.4
      • update-index_range=100, relative throughput is 0.18 in 11.4.1, problem starts in 10.5 and 10.11->11.4 is the biggest drop
      • update-inlist_range=100, relative throughput is 0.56 in 11.4.1, problem is gradual from 10.3 through 11.4
      • update-nonindex_range=100, relative throughput is 0.69 in 11.4.1, problems arrive in 10.11 and 11.4
      • update-one_range=100, relative throughput is 0.61 in 11.4.1, problem arrives in 10.6
      • update-zipf_range=100, relative throughput is 0.75 in 11.4.1, problem arrives in 11.4
      • write-only_range=10000, relative throughput is 0.59 in 11.4.1, problems arrive in 10.11 and 11.4

      Finally a server with 32 cores (AMD Threadripper) and the benchmark was run with 24 threads. The results are here and the problem microbenchmarks are:

      • points-notcovered-pk_range=100, relative throughput is 0.65 in 11.4.1, problem arrives in 10.5
      • points-notcovered-si_range=100, relative throughput is 0.77 in 11.4.1, problem arrives in 10.5
      • random-points_range=1000, relative throughput is 0.65 in 11.4.1, problem arrives in 10.5
      • random-points_range=100, relative throughput is 0.65 in 11.4.1, problem arrives in 10.5
      • range-notcovered-si_range=100, relative throughput is 0.59 in 11.4.1, problem arrives in 10.5
      • read-write_range=10, relative throughput is 0.79 in 11.4.1, problem arrives in 10.11
      • update-index_range=100, relative throughput is 0.64 in 11.4.1, problem arrives in 10.11 and 11.4
      • update-inlist_range=100, relative throughput is 0.61 in 11.4.1, problem arrives in 10.3, 10.5, 10.11
      • write-only_range=10000, relative throughput is 0.75 in 11.4.1, problem arrives in 10.11, 11.4

      At this point my hypothesis is that the problem is from a few changes to InnoDB but I need more data to confirm or deny that.

      On the 24-core server (2 sockets, 12 cores/socket) I repeated sysbench for 1, 4, 8, 12, 16 and 18 threads. And then on the 32-core server I repeated it for 1, 4, 8, 12, 16, 20 and 24 threads. The goal was to determine at which thread count the regressions become obvious. Alas, I only used a subset of the microbenchmarks to get results in less time. Another run with more microbenchmarks is in progress.

      The results will be in comments to follow.


          Activity

            mdcallag Mark Callaghan added a comment - edited

            Results using the latest point releases with both low-concurrency (1 thread) and high concurrency (40 threads). Not much is new, other than I have results from new versions:

            • I still get better QPS in some cases by adding innodb_log_write_ahead_size=4k to my.cnf
            • I don't see many regressions from old to new MariaDB at low concurrency, so it has done great at avoiding CPU regressions
            • I still see big regressions from old to new MariaDB at high concurrency, so mutex contention is an issue and the changes to InnoDB seem to be the cause
            • the big regressions occur more for read-heavy than for write-heavy and in many cases modern MariaDB (10.5+) gets less than half of the QPS vs older MariaDB. The big regressions for read-heavy arrive in 10.5 and things haven't improved since then

            All of the my.cnf files I use are archived here:

            • for the low concurrency tests
            • for the high concurrency tests

            The gists I link to below use a naming pattern like x.ma110502_rel_withdbg.z11b_c8r32.pk1 and x.ma110502_rel_withdbg.z11b_lwas4k_c8r32.pk1. For now ignore the "x." at the start and the ".pk1" at the end. The "ma110502_rel_withdbg" means:

            • MariaDB ("ma")
            • 11.5.2 ("110502")
            • a build that used CMAKE_BUILD_TYPE=RelWithDebInfo
            • my.cnf is my.cnf.cz11b_lwas4k_c8r32 (see here)
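            To make the naming convention easier to decode, here is a small illustrative sketch; the parsing code is not part of my tooling, but the fields it extracts match the description above:

            # Illustrative decoder for result names like
            # "x.ma110502_rel_withdbg.z11b_lwas4k_c8r32.pk1".
            # The convention is as described above; the code itself is just a sketch.
            def parse_result_name(name: str) -> dict:
                _, build, config, _ = name.split(".")   # drop the "x." prefix and ".pk1" suffix
                digits = build[2:8]                     # "ma110502..." -> "110502"
                version = f"{int(digits[0:2])}.{int(digits[2:4])}.{int(digits[4:6])}"
                return {
                    "server": "MariaDB",                # "ma"
                    "version": version,                 # e.g. 11.5.2
                    "build": "RelWithDebInfo" if build.endswith("_rel_withdbg") else build,
                    "my_cnf": f"my.cnf.c{config}",      # e.g. my.cnf.cz11b_lwas4k_c8r32
                }

            print(parse_result_name("x.ma110502_rel_withdbg.z11b_lwas4k_c8r32.pk1"))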

            The links that follow have results in terms of relative QPS (rQPS), which is: (QPS for a given version / QPS for older MariaDB). When I look at results across MariaDB major versions, the base case is 10.2.44. When I look at results only for 10.5 point releases, the base case is 10.4.34. When the relative QPS is 1.0, QPS has not changed; when it is much less than 1.0, there is a large regression.

            First up is the results for the latest point releases at low concurrency. For point releases that support innodb_log_write_ahead_size I include results with it set to =4k and with it not set. The results are here and modern MariaDB almost always has relative QPS >= 0.9 which is OK.

            For low concurrency with a focus on MariaDB 10.5 see here, which has numbers for 10.5.0, 10.5.4, 10.5.10, 10.5.20, 10.5.24 and 10.5.27. The reason to share the results at low concurrency is to show that there aren't big regressions in 10.5 at low concurrency, in contrast to the high-concurrency results I share below.

            And note that my focus is on 10.5 because that is where the big regressions first arrived. They are still here, but to understand the root cause I need to show where they start.

            Next up are results from the latest point releases at high concurrency (40 threads) on a server with 48 cores (real cores, AMD SMT is disabled). The relative QPS numbers are here. In the worst cases the relative QPS drops to ~0.4 in 10.5 and has remained there through 11.7. This means older MariaDB gets ~2.5X more QPS than modern MariaDB:

            • points-covered-pk - fetch 1 row by PK on a covering index
            • points-notcovered-pk - fetch 1 row by PK on a non-covering index
            • random-points with range=100 and =1000 - point queries that fetch many rows via a covering index using SELECT with an IN-list
            • range-notcovered-si - do a short range scan on a non-covering secondary index

            Results from the classic sysbench transaction are here. With them the relative QPS in modern MariaDB is 0.83 or 0.88 depending on the length of the range queries it uses. The regressions here arrived in 10.11 and not much has changed since then. If you only use classic sysbench then you will miss some of the large regressions, because the regressions here are smaller than those above.

            A few write-heavy tests also have large regressions, but they aren't as bad as the results mentioned above:

            • insert - relative QPS is 0.79, these arrived in 10.11
            • update-index - relative QPS is 0.80 in 11.7, these arrived in 10.5 but perf improved a bit since then
            • update-inlist - relative QPS is 0.48 in 11.7, these arrived in 10.5, then got worse in 10.11 and have been stable (not getting better) since then

            Results from 10.5 point releases are here for high concurrency. I tried all point releases from 10.5.0 through 10.5.10, the even-numbered point releases from 10.5.12 through 10.5.26, and then 10.5.27. I won't annotate them other than to say that the big regressions for read-heavy microbenchmarks arrived prior to 10.5.10.


            mdcallag Mark Callaghan added a comment

            I will repeat tests on the large server to collect thread stacks (PMP) and flamegraphs.

            mdcallag Mark Callaghan added a comment - edited

            Aggregated thread stacks from PMP on the 48-core server with sysbench using 40 threads and a cached workload. All of the aggregated thread stacks ("hierarchical") are in subdirectories here.
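            For anyone not familiar with PMP: the aggregation just collapses each thread's backtrace into a comma-separated list of frames and counts duplicates. A rough sketch of the idea follows; this is not the actual pmp script, and the gdb-output parsing is simplified:

            # Rough sketch of PMP-style aggregation: collapse each thread's gdb
            # backtrace into a comma-separated frame list and count duplicates.
            # This is not the real pmp script; the parsing is simplified.
            from collections import Counter

            def aggregate_stacks(gdb_output: str) -> Counter:
                stacks, frames = Counter(), []
                for line in gdb_output.splitlines():
                    line = line.strip()
                    if line.startswith("Thread "):       # start of the next thread's backtrace
                        if frames:
                            stacks[",".join(frames)] += 1
                        frames = []
                    elif line.startswith("#"):           # a frame, e.g. "#3 ... in buf_page_make_young (...)"
                        if " in " in line:
                            frames.append(line.split(" in ")[1].split()[0])
                        else:
                            frames.append(line.split()[1])
                if frames:
                    stacks[",".join(frames)] += 1
                return stacks

            # Print the most common stacks first, like the one quoted below.
            # for stack, count in aggregate_stacks(open("gdb.out").read()).most_common(5):
            #     print(count, stack)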

            From random-points.range1000 for 10.4.34 and 10.5.27, the obvious problem is that the most frequent thread stack for 10.5.27 is mutex contention on the InnoDB LRU (buf_page_make_young). The problem is the same for random-points.range100, points-covered-pk, points-notcovered-pk and range-notcovered-si, and it is still there with 11.4.4:

            futex_wait,__GI___lll_lock_wait,___pthread_mutex_lock,inline_mysql_mutex_lock,buf_page_make_young,buf_page_make_young_if_needed,buf_page_get_low,btr_cur_search_to_nth_level,btr_pcur_open_with_no_init_func,row_search_mvcc,ha_innobase::index_read,handler::ha_index_read_map,handler::read_range_first,handler::multi_range_read_next,Mrr_simple_index_reader::get_next,DsMrr_impl::dsmrr_next,QUICK_RANGE_SELECT::get_next,rr_quick,READ_RECORD::read_record,sub_select,do_select,JOIN::exec_inner,JOIN::exec,mysql_select,handle_select,execute_sqlcom_select,mysql_execute_command,Prepared_statement::execute,Prepared_statement::execute_loop,Prepared_statement::execute_loop,mysql_stmt_execute_common,mysqld_stmt_execute,dispatch_command,do_command,do_handle_one_connection,handle_one_connection,pfs_spawn_thread,start_thread,clone3
            

            But range-notcovered-si has two thread stacks with buf_page_make_young in the top 2 – one for lock, one for unlock – see here for 10.5.27, and the problem remains in 11.4.4.

            For insert there is more mutex contention in 10.5 from the InnoDB redo log and the binlog – the top N stacks are here for 10.4.34 and 10.5.27. The problem remains in 11.4.4.

            For update-inlist the stacks are here for 10.4.34 and 10.5.27 and there is just more mutex contention from InnoDB in 10.5.27. The problem remains in 11.4.4.


            mdcallag Mark Callaghan added a comment

            Another round of results are here with a cached database, small server, and sysbench using 1 thread. The numbers are the relative QPS which is (QPS for my version / QPS for MariaDB 10.2.44). Mostly I see a few small regressions and a few small improvements. The largest improvement is from setting innodb_log_write_ahead_size for 10.11 through 11.7.

            This has results for:

            • ma100244_rel_withdbg.z11a_c8r32 - 10.2.44 with the z11a_c8r32 config
            • ma100339_rel_withdbg.z11a_c8r32 - 10.3.39 with the z11a_c8r32 config
            • ma100434_rel_withdbg.z11a_c8r32 - 10.4.34 with the z11a_c8r32 config
            • ma100527_rel_withdbg.z11a_c8r32 - 10.5.27 with the z11a_c8r32 config
            • ma100620_rel_withdbg.z11a_lwas4k_c8r32 - 10.6.20 with the z11a_c8r32 config
            • ma101110_rel_withdbg.z11a_lwas4k_c8r32 - 10.11.10 with the z11a_c8r32 config
            • ma110404_rel_withdbg.z11b_lwas4k_c8r32 - 11.4.4 with the z11b_c8r32 config
            • ma110502_rel_withdbg.z11b_lwas4k_c8r32 - 11.5.2 with the z11b_c8r32 config
            • ma110602_rel_withdbg.z11b_lwas4k_c8r32 - 11.6.2 with the z11b_c8r32 config
            • ma110701_rel_withdbg.z11b_lwas4k_c8r32 - 11.7.1 with the z11b_c8r32 config

            The my.cnf files are in the subdirectories here. The differences in them are:

            • z11a_c8r32 uses innodb_flush_method while z11b_c8r32 uses the new options (innodb_log_file_buffering etc)
            • the 11.6.2 and 11.7.1 config files add innodb_snapshot_isolation=OFF

            And this has results for a few more configs (see here):

            • z11a_lwas4k_c8r32 - adds innodb_log_write_ahead_size=4096, see this issue
            • z11a_si_c8r32 - adds innodb_snapshot_isolation=ON
            • z11b_lwas4k_si_c8r32 - adds innodb_log_write_ahead_size=4096 and innodb_snapshot_isolation=ON

            I don't see an impact from enabling or disabling snapshot isolation. I do see an impact from innodb_log_write_ahead_size=4096 but that is explained in issue 33894.


            marko Marko Mäkelä added a comment

            In https://github.com/MariaDB/server/pull/3522 I tried to address the observed contention in buf_page_make_young() by tweaking the buffer pool LRU replacement policy. It helped in one test workload but caused performance regressions in others. I think that the way forward would be to concentrate on one test case where that tweak causes a regression. I think that it had better be a read-only workload, so that it can be more deterministic. My hypothesis is that the regression occurs because we are evicting too recently used pages, causing more frequent reloading of pages.
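            To illustrate the contention pattern being discussed, here is a deliberately simplified model (an illustration only, not InnoDB's actual buffer pool code): every reader that needs to "make young" a page splices it to the head of a single LRU list under one global mutex, so even read-only point queries serialize on that lock at high concurrency.

            # Deliberately simplified model of the buf_page_make_young() pattern.
            # This is an illustration only, not InnoDB's actual implementation.
            from collections import OrderedDict
            from threading import Lock

            class BufferPool:
                def __init__(self) -> None:
                    self.mutex = Lock()        # stands in for buf_pool.mutex
                    self.lru = OrderedDict()   # page_id -> page; front = most recently used

                def make_young(self, page_id: int) -> None:
                    # Every caller serializes here, even for read-only page accesses.
                    with self.mutex:
                        self.lru.move_to_end(page_id, last=False)

                def get_page(self, page_id: int, make_young_if_needed: bool = True):
                    with self.mutex:
                        page = self.lru.setdefault(page_id, object())
                    if make_young_if_needed:
                        self.make_young(page_id)   # the hot path in the PMP stacks above
                    return page

            Making that move rarer or cheaper reduces the contention, but as noted above it risks evicting pages that were in fact recently used.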

            There are some other performance bottlenecks around dict_sys.latch (MDEV-35000, MDEV-34999, MDEV-34988, MDEV-33594), which are more straightforward to fix. Those ought to be largely independent of any buf_pool.mutex contention, though.


            People

              Assignee: marko Marko Mäkelä
              Reporter: mdcallag Mark Callaghan
              Votes: 1
              Watchers: 12

