Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Won't Fix
- Affects Version: 10.3 (EOL)
- Fix Version: None
Description
I noticed reduced performance when running a somewhat large update_no_index sysbench benchmark, comparing 10.2 to 10.3.
Here is the setup:
- Windows Azure VM with 16 virtual CPUs, 32GB RAM and SSD storage (I used the local temporary disk)
Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
Sockets: 1
Virtual processors: 16
- MariaDB 10.2.14 vs 10.3.7 (recent build of 4a5e23e257e229b548599133dbed5162af9df6d9)
The relevant part of the my.ini file is as follows:
[mysqld]
back_log=500
max_connections=4096
max_prepared_stmt_count=500000
table_open_cache=10000
transaction_isolation=REPEATABLE-READ
innodb_status_file=0
innodb_buffer_pool_size=20G
innodb_log_file_size=15G
innodb_log_buffer_size=1000M
innodb-io-capacity=4000
innodb-io-capacity-max=5000
innodb_doublewrite=0
innodb-page-cleaners=1
innodb-buffer-pool-instances=1
innodb_adaptive_flushing=0
innodb_adaptive_flushing_lwm=10
- sysbench 0.4 (I mention it because current versions of sysbench no longer run on Windows, but we do not need anything from the current versions)
The update-no-index test runs on a single large table (50 million rows), with the number of users ranging from 1 to 4096, in powers of 2.
Loading the table (sysbench prepare):
sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=1 prepare
sysbench run
sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=%N% run
where %N% is 1,2,4,8,16,32,64,128,256,512,1024,2048,4096
The test takes a nap of 80 seconds between runs, and also between the load and the first test; as far as I can see, that is enough for all InnoDB background IO activity to finish (MB/sec goes down to 0 on the disk where the test is run). With innodb-flush-log-at-trx-commit at its default (1), the test does not appear to be very CPU bound (sysbench and mysqld use 70% of all available 16 CPUs in the Windows measurement, i.e. over 10 CPUs are busy).
Below are the results from the runs, for 10.2.14 and the recent 10.3.7.
UPDATE: marko mentioned git commit 1a62c8a39647db00283b4e35fb6db3c7bc31e3ca as the version right before 10.3 switched to lock-free data structures, so I added a column for it too. It appears to be better than 10.3.7, but worse than 10.2.14.
Users | TPS 10.2.14 | TPS 10.3.7 | TPS 10.3-pre-lockless |
---|---|---|---|
1 | 1549.92 | 1547.75 | 1551.07 |
2 | 2373.89 | 2209.05 | 2334.80 |
4 | 4466.03 | 4276.46 | 4446.64 |
8 | 9069.82 | 8587.59 | 9098.95 |
16 | 16631.45 | 15719.70 | 16231.39 |
32 | 28989.79 | 27482.44 | 27786.32 |
64 | 36996.12 | 34843.23 | 35089.44 |
128 | 37287.05 | 35172.49 | 36223.68 |
256 | 38038.01 | 35896.52 | 36818.22 |
512 | 38360.71 | 36567.07 | 37195.47 |
1024 | 38265.25 | 36328.62 | 37253.98 |
2048 | 39990.85 | 36328.62 | 38572.81 |
4096 | 41988.71 | 39032.83 | 39606.38 |
(also see the graph https://docs.google.com/spreadsheets/d/1VqRYCwF4QATCKEKwxKuCcvRXKm2GWnICCdv4fSo4_IU/edit#gid=0)
The regression is not as big as I thought initially (somehow I counted 15%, and it is more like 5%), but it is consistent from 64 users all the way up to 4096. So it needs investigating.
It is odd because the test itself is neither CPU bound nor IO bound; the bottleneck, in my understanding, is waiting for the log flush in log_write_up_to(). I did not try to make it CPU bound by relaxing durability; that might be another exercise, although with increased TPS, purge can become more of a bottleneck and disturb the picture.
I tried to do some initial profiling (based on CPU sampling), and this is what showed up.
Below, in all places, *baseline* refers to 10.2.14 and *comparison* to 10.3.7.
Exclusive samples comparison
Individual functions (exclusive sample percentage, with a 1% difference threshold)
Function | Delta | Baseline Value | Comparison Value |
---|---|---|---|
ut_delay | 4.02 | 16.05 | 20.07 |
l_find | 2.44 | 0.16 | 2.60 |
PolicyMutex<TTASEventMutex<GenericPolicy> >::enter | 2.42 | 0.18 | 2.60 |
TTASEventMutex<GenericPolicy>::enter | -4.42 | 4.42 | 0.00 |
SleepConditionVariableCS | -5.41 | 28.70 | 23.29 |
In short, there is more busy waiting in ut_delay and less lazy waiting in SleepConditionVariableCS. There is a new, quite visible function (now the 4th most expensive individual function), l_find, which seems to come from replacing the STL hash table with a custom lock-free hash table in InnoDB.
Inclusive samples comparison
Comparing inclusive samples with a 1% difference threshold gives the table below. It gives me a vague idea that something is more expensive in purge, and again that a lazy wait was replaced with a busy wait somewhere. Background threads probably use more CPU, for example the purge "coordinator" thread, while foreground threads use less of it (as shown, for example, by the decreased time in do_command).
Function | Delta | Baseline Value | Comparison Value |
---|---|---|---|
PolicyMutex<TTASEventMutex<GenericPolicy> >::enter | 28.69 | 2.53 | 31.22 |
ut_delay | 4.02 | 16.05 | 20.07 |
trx_purge | 3.81 | 5.67 | 9.48 |
srv_do_purge | 3.80 | 5.70 | 9.49 |
row_purge | 3.66 | 0.00 | 3.67 |
row_purge_step | 3.47 | 0.71 | 4.18 |
que_thr_step | 3.31 | 0.94 | 4.25 |
srv_purge_coordinator_thread | 3.25 | 6.37 | 9.63 |
trx_sys_t::clone_oldest_view | 3.18 | 0.00 | 3.18 |
btr_cur_search_to_nth_level_func | 3.12 | 0.00 | 3.12 |
row_purge_record_func | 2.94 | 0.00 | 2.94 |
row_purge_upd_exist_or_extern_func | 2.77 | 0.00 | 2.77 |
row_purge_reset_trx_id | 2.72 | 0.00 | 2.72 |
que_run_threads_low | 2.70 | 1.73 | 4.43 |
que_run_threads | 2.69 | 1.75 | 4.45 |
l_find | 2.46 | 0.18 | 2.64 |
ReadView::snapshot | 2.30 | 0.00 | 2.30 |
rw_trx_hash_t::iterate | 2.27 | 0.00 | 2.27 |
lf_hash_iterate | 2.25 | 0.00 | 2.25 |
srv_task_execute | 2.12 | 1.95 | 4.08 |
row_purge_reposition_pcur | 1.53 | 0.00 | 1.53 |
row_search_on_row_ref | 1.33 | 0.00 | 1.33 |
btr_pcur_open_low | 1.31 | 0.00 | 1.31 |
btr_search_guess_on_hash | 1.31 | 1.54 | 2.85 |
trx_purge_wait_for_workers_to_complete | 1.10 | 1.52 | 2.62 |
trx_undo_assign_low | 1.07 | 0.00 | 1.07 |
mtr_t::commit | 1.05 | 2.57 | 3.62 |
mtr_t::Command::execute | 1.01 | 2.42 | 3.43 |
srv_worker_thread | -1.09 | 9.84 | 8.75 |
sync_array_wait_event | -1.29 | 9.69 | 8.40 |
trx_write_serialisation_history | -1.34 | 2.62 | 1.28 |
trx_commit_low | -1.39 | 3.15 | 1.76 |
innobase_commit | -1.75 | 47.36 | 45.61 |
trx_commit | -1.75 | 4.79 | 3.04 |
ha_commit_one_phase | -1.76 | 47.55 | 45.78 |
commit_one_phase_2 | -1.76 | 47.54 | 45.77 |
trans_commit_stmt | -1.80 | 48.68 | 46.88 |
ha_commit_trans | -1.80 | 48.65 | 46.84 |
btr_cur_search_to_nth_level | -1.85 | 1.85 | 0.00 |
innobase_commit_ordered_2 | -2.04 | 5.17 | 3.13 |
trx_commit_for_mysql | -2.05 | 5.10 | 3.05 |
innobase_commit_low | -2.05 | 5.11 | 3.06 |
Prepared_statement::execute_loop | -2.05 | 74.58 | 72.52 |
mysql_stmt_execute_common | -2.14 | 74.85 | 72.71 |
mysqld_stmt_execute | -2.15 | 74.89 | 72.74 |
mysql_execute_command | -2.21 | 73.03 | 70.81 |
trx_undo_assign_undo | -2.24 | 2.24 | 0.00 |
Prepared_statement::execute | -2.24 | 74.03 | 71.79 |
threadpool_process_request | -2.32 | 81.71 | 79.38 |
dispatch_command | -2.33 | 79.21 | 76.88 |
do_command | -2.34 | 81.33 | 79.00 |
tp_callback | -2.38 | 82.66 | 80.28 |
srv_resume_thread | -3.33 | 7.94 | 4.61 |
os_event::wait_low | -4.98 | 28.41 | 23.43 |
SleepConditionVariableCS | -5.41 | 28.70 | 23.29 |
TTASEventMutex<GenericPolicy>::enter | -30.49 | 30.49 | 0.00 |
Attachments
- 1bp-1pc-10.2.14.txt (29 kB)
- 1bp-1pc-10.3.7.txt (29 kB)
- winpmp_10_2_14.txt (317 kB)
- winpmp_10_3_7.txt (265 kB)
Issue Links
- is duplicated by
  - MDEV-17410 mariabackup prepare crashes in Innodb recovery (Closed)
- relates to
  - MDEV-12288 Reset DB_TRX_ID when the history is removed, to speed up MVCC (Closed)
  - MDEV-14425 Change the InnoDB redo log format to reduce write amplification (Closed)
  - MDEV-16232 Use fewer mini-transactions (Stalled)
  - MDEV-17003 service_manager_extend_timeout() being called too often (Closed)
  - MDEV-19845 Adaptive spin loops (Closed)
  - MDEV-17353 pinpoint perf regressions in early 10.3 to commits (Open)
Activity
Given that log_sys.mutex appears to be one of the contention points, I think that we should keep track of the average volume of redo log written per write transaction. More redo log writes should lead to more contention.
There are two features in MariaDB 10.3 that could increase the writes: MDEV-12288 (resetting transaction identifiers after writes) and MDEV-11369 (instant ADD COLUMN). As far as I understand, neither of these changes should matter.
Also, some mini-transactions for writing undo log were changed in the hope of simplifying crash recovery; one of the changes was essentially reverted in MDEV-15914 due to a performance regression.
I would like to compare how these fields in SHOW ENGINE INNODB STATUS proceed:
excerpt of SHOW ENGINE INNODB STATUS output
…
------------
TRANSACTIONS
------------
Trx id counter 34
…
---
LOG
---
Log sequence number 1642672
…
We should record both counts at the start and end of the test, calculate the difference and the ratio. For example, if these counts were 50 and 2000000 at the end of the test, the ratio would be (2000000-1642672)/(50-34) = 22333 bytes per transaction.
If MariaDB 10.3 is generating significantly more redo log than 10.2, that could explain the increased log_sys.mutex contention.
If purge threads are using more CPU than earlier, that could be due to svoj’s work of removing trx_sys.mutex and employing some lock-free data structures. It could be useful to test a 10.3 version right before switching to lock-free data structures.
There was also this change to ut_delay() and some related discussion. On the Intel Skylake microarchitecture, the PAUSE instruction (REPZ NOP) takes more wall-clock time than on earlier ones. Maybe we are now burning more CPU on older microarchitectures?
Re ut_delay: I think I saw svoj test PTHREAD_MUTEX_ADAPTIVE_NP and it wasn't good.
Maybe it is worth rethinking the adaptive delay in MariaDB towards a mean/std approach per mutex, rather than a straight mean. We're wasting heaps of cycles in ut_delay, so a little extra maths isn't going to hurt too badly; a rough sketch of the idea follows the link below.
https://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/pthread_mutex_lock.c;h=1519c142bd6ec5cc528c5921f28575d25c7fb664;hb=HEAD#l118
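To make the mean/std idea a bit more concrete, here is a minimal, hypothetical C++ sketch of a per-mutex spin tuner. None of these names exist in the MariaDB source; it only illustrates tracking the running mean and variance of spin rounds needed before acquisition (Welford's method) and using mean + one standard deviation as the next spin budget, instead of a single global average.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical per-mutex spin statistics; illustrative only, not MariaDB code.
// record() is called with the number of spin rounds an acquisition needed;
// next_budget() suggests how long to spin before falling back to sleeping.
class SpinTuner {
public:
  void record(uint32_t rounds) {
    // Welford's online algorithm for running mean and variance.
    ++n_;
    double delta = rounds - mean_;
    mean_ += delta / n_;
    m2_ += delta * (rounds - mean_);
  }

  uint32_t next_budget(uint32_t hard_cap = 1000) const {
    if (n_ < 2) return hard_cap;               // not enough samples yet
    double stddev = std::sqrt(m2_ / (n_ - 1));
    double budget = mean_ + stddev;            // mean + 1 sigma, per mutex
    return budget < hard_cap ? static_cast<uint32_t>(budget) : hard_cap;
  }

private:
  uint64_t n_ = 0;
  double mean_ = 0.0;
  double m2_ = 0.0;  // sum of squared deviations from the mean
};
```

A real implementation would have to make the updates thread-safe (for example, sampling only some acquisitions while the mutex is already held) and would feed the budget into the existing ut_delay() loop.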
marko, I did some calculations of bytes per trx, guided by your hint, and this is what I got (surprisingly, to me at least, the number of bytes is not an integer):
10.3.7
trxid1 = 193362180
LSN1 = 69634009621
trxid2 = 215344539
LSN2= 75578735333
bytes per trx = 270.43
10.2.14
trxid1 = 11388
LSN1= 13829598991
trxid2 = 22135718
LSN2= 19655598660
bytes per trx = 263.32
The multiuser benchmark was run between point 1 and point 2.
Hope that helps.
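As a sanity check of the 10.3.7 figure above: the value is simply the LSN delta divided by the transaction-id delta over the same interval, and since it averages over roughly 22 million transactions it need not be an integer:

$$\text{redo bytes per trx} = \frac{\mathrm{LSN}_2-\mathrm{LSN}_1}{\mathrm{TrxId}_2-\mathrm{TrxId}_1} = \frac{75578735333-69634009621}{215344539-193362180} = \frac{5944725712}{21982359} \approx 270.43$$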
To reduce the redo log volume, it looks like we have to resume generating the redo log record types MLOG_UNDO_INIT and MLOG_UNDO_HDR_REUSE. The record MLOG_UNDO_INSERT was already restored in MDEV-15914.
MLOG_UNDO_HDR_REUSE was removed by MDEV-12288. That one we cannot restore. It should be equivalent to MLOG_UNDO_HDR_CREATE in the redo log volume. Similarly, reintroducing MLOG_UNDO_ERASE_END ought to increase the redo log volume written, so we’d better not restore that.
regarding "relax CPU" number of loops for "rep nop" being 200, https://github.com/MariaDB/server/commit/b3346c2f41d0f7bd0d919826592b37ae09fa41c8#diff-e877f572e9fd77c43d5e7af075f92e74R91 , this is something from me, I measured this as optimal back in 2008, on (probably) Core2 Duo. No idea if this still holds.
The MLOG_UNDO_INIT record should occupy 3..12 bytes while the lower-level records (MLOG_2BYTES, MLOG_4BYTES) could easily consume 7 more bytes.
I have now restored the MLOG_UNDO_INIT record, which I had removed in the hope that recovery would be simpler and faster. (But I failed to notice that a small increase in the number of redo log bytes written can hurt performance.)
I developed a simple test for measuring the volume of redo log written by an update. Note that transaction IDs are incremented both at start and at commit.
Due to MDEV-12288, it is clear that there will be more writes by the purge in 10.3. Before MDEV-12288, an UPDATE would not result in any purge activity, but after it, in 10.3, all DML will eventually result in the DB_TRX_ID column being zeroed out in the clustered index record. This will improve read performance at some cost of write performance.
--source include/have_innodb.inc
--source include/have_sequence.inc

let $n_updates=1;
let $n_rows=1;

create table t1 (id serial, c2 char(120) default 'foobar') engine=innodb;
#connect stop_purge, localhost, root;
#START TRANSACTION WITH CONSISTENT SNAPSHOT;
#connection default;
eval insert into t1(id) select * from seq_1_to_$n_rows;

--replace_regex /.*Log sequence number ([0-9]+).*/\1/
let $lsn= `SHOW ENGINE INNODB STATUS`;
let $lsn= `SELECT REGEXP_REPLACE('$lsn', 'InnoDB\\t*', '')`;
--replace_regex /.*Trx id counter ([0-9]+).*/\1/
let $trx= `SHOW ENGINE INNODB STATUS`;
let $trx= `SELECT REGEXP_REPLACE('$trx', 'InnoDB\\t*', '')`;

while ($n_updates) {
  update t1 set c2='boofar';
  update t1 set c2='foobar';
  dec $n_updates;
}

--replace_regex /.*Log sequence number ([0-9]+).*/\1/
let $new_lsn= `SHOW ENGINE INNODB STATUS`;
let $new_lsn= `SELECT REGEXP_REPLACE('$new_lsn', 'InnoDB\\t*', '')`;
--replace_regex /.*Trx id counter ([0-9]+).*/\1/
let $new_trx= `SHOW ENGINE INNODB STATUS`;
let $new_trx= `SELECT REGEXP_REPLACE('$new_trx', 'InnoDB\\t*', '')`;

eval SELECT ($new_lsn-$lsn)/($new_trx-$trx) `bytes/(transaction*2)`;
drop table t1;
In the above test, you can disable purge to observe the effect on the number of redo log bytes written. The test should probably run for a longer time for the purge to kick in.
The effect of MDEV-12288 can also be disabled by removing the body of the function row_purge_reset_trx_id(). After that change, read-only workloads should be somewhat slower due to performing lookups for DB_TRX_ID values that have already been purged.
I can confirm this in a memory-based benchmark:
sysbench oltp_update_index.lua --tables=20 --table_size=100000 --rand-type=uniform --time=60 --report-interval=10 --threads=40
and
numactl --interleave=all \
$MDIR/sql/mysqld --no-defaults --log-output=file --general-log=OFF --core-file --loose-debug-sync-timeout=300 --innodb \
--datadir=$TDIR --lc-messages-dir=$MDIR/sql/share/ --socket=$MYSQL_SOCK \
--loose-innodb_file_per_table=true \
--loose-innodb-open-files=4000 \
--loose-innodb_data_file_path=ibdata1:10M:autoextend \
--loose-innodb_flush_log_at_trx_commit=0 \
--loose-innodb_flush_method=O_DIRECT_NO_FSYNC \
--loose-innodb-adaptive-hash-index-partitions=32 \
--loose-innodb_log_buffer_size=256M \
--loose-innodb_log_file_size=1900M \
--loose-innodb_buffer_pool_size=50G \
--loose-innodb-buffer-pool-instances=32 \
--loose-innodb-thread-concurrency=0 \
--loose-innodb_log_files_in_group=3 \
--loose-innodb_read_io_threads=8 \
--loose-innodb_write_io_threads=8 \
--loose-innodb_io_capacity=8000 \
--loose-innodb_flush_sync=0 \
--table-open-cache=1400 \
--table-open-cache-instances=64 \
--max-connections=200 \
--aria-checkpoint-interval=0 \
--loose-performance-schema=off
The transactions/second number is:
version | threads=1 | threads=40 |
---|---|---|
10.2 a4e7800701d0764fe4cbb85b81d7c7cb54677334 | 10031.27 | 189489.50 |
mariadb-10.3.4 | 9035.31 | 151952.76 |
54c715acca8077d9a1694521bd34d6ef1505d7e1 | 9398.94 | 173670.23 |
8d1d38f9539e2b0d30fdb9531d06b66c7662e9dd | 9424.46 | 178949.21 |
c7d0448797a2013079c3e0b7ca97f7bdc1138696 | 9489.61 | 178983.94 |
5db9c6e490ea2c6b7997dff4cbee18274496dd2c | 9593.49 | 98489.03 |
dcc09afa634b0f525d2e8e7d15ef36e5fdd5d270 | 9373.73 | 100722.44 |
af566d8a63f908c76d43957e790e861d944486b0 | 9188.53 | 104609.54 |
mariadb-10.3.5 | 9625.65 | 95901.31 |
10.3 b2f56085d4249e2cf3ed9f80d284b3ce837e9e29 | 10160.03 | 95248.29 |
patched b2f56085d4249e2cf3ed9f80d284b3ce837e9e29 | 9555.82 | 82289.68 |
Something has drastically changed between 10.3.4 and 10.3.5. Commit 5db9c6e490ea2c6b7997dff4cbee18274496dd2c is the biggest culprit. I still think that the idea of MDEV-15132 is good, but there must be something bad about this particular implementation.
I chose commit af566d8a63f908c76d43957e790e861d944486b0 because it is in the middle of some code changes by svoj. There is some visible regression between that and mariadb-10.3.5, and not much improvement since then.
The patch against the most recent 10.3 essentially disabled the resetting of the DB_TRX_ID (MDEV-12288). This resetting does create some purge activity and cause more writes, but it could also greatly help performance, as demonstrated above. (If DB_TRX_ID=0, we immediately know that there is no older version and the record is not implicitly locked.)
diff --git a/storage/innobase/row/row0purge.cc b/storage/innobase/row/row0purge.cc
index af8cd682288..8b22c90e0d1 100644
--- a/storage/innobase/row/row0purge.cc
+++ b/storage/innobase/row/row0purge.cc
@@ -671,6 +671,7 @@ static
void
row_purge_reset_trx_id(purge_node_t* node, mtr_t* mtr)
{
+#if 0
ut_ad(rw_lock_own(dict_operation_lock, RW_LOCK_S));
/* Reset DB_TRX_ID, DB_ROLL_PTR for old records. */
mtr->start();
@@ -729,6 +730,7 @@ row_purge_reset_trx_id(purge_node_t* node, mtr_t* mtr)
}

mtr->commit();
+#endif
}

/***********************************************************//**
diff --git a/storage/innobase/row/row0umod.cc b/storage/innobase/row/row0umod.cc
index 47c515f70cf..3b5f37adf50 100644
--- a/storage/innobase/row/row0umod.cc
+++ b/storage/innobase/row/row0umod.cc
@@ -385,6 +385,7 @@ row_undo_mod_clust(
false, &mtr);
ut_ad(err == DB_SUCCESS
|| err == DB_OUT_OF_FILE_SPACE);
+#if 0
} else if (!index->table->is_temporary() && node->new_trx_id) {
/* We rolled back a record so that it still exists.
We must reset the DB_TRX_ID if the history is no
@@ -430,6 +431,7 @@ row_undo_mod_clust(
sizeof reset_trx_id, &mtr);
}
}
+#endif
} else {
goto func_exit;
}
Surprisingly, removing the following supposed optimization
diff --git a/storage/innobase/trx/trx0trx.cc b/storage/innobase/trx/trx0trx.cc
index 26b857a75f3..218e913d264 100644
--- a/storage/innobase/trx/trx0trx.cc
+++ b/storage/innobase/trx/trx0trx.cc
@@ -1341,11 +1341,6 @@ trx_write_serialisation_history(
#endif
const bool update_binlog_pos = trx->mysql_log_file_name
&& *trx->mysql_log_file_name;
- if (!update_binlog_pos
-#ifdef WITH_WSREP
- && !update_wsrep
-#endif
- ) return true;

buf_block_t* block = trx_sysf_get(mtr);
#ifdef WITH_WSREP
fixes the performance regression of 5db9c6e490ea2c6b7997dff4cbee18274496dd2c. With this optimization removed, we are always acquiring and releasing the TRX_SYS page latch. Replication and Galera are disabled, after all.
This patch is not applicable to later versions, because MDEV-15158 later removed the write to the TRX_SYS page altogether. Maybe this page lock somehow served as useful throttling?
I applied a similar unnecessary acquisition of the TRX_SYS page lock to the latest 10.3, but it did reduce performance as expected:
diff --git a/storage/innobase/trx/trx0trx.cc b/storage/innobase/trx/trx0trx.cc
index 11d83c96a92..8ded9a03e3d 100644
--- a/storage/innobase/trx/trx0trx.cc
+++ b/storage/innobase/trx/trx0trx.cc
@@ -1120,6 +1120,7 @@ trx_write_serialisation_history(

mutex_exit(&rseg->mutex);

+ trx_sysf_get(mtr);
MONITOR_INC(MONITOR_TRX_COMMIT_UNDO);

trx->mysql_log_file_name = NULL;
I tried to remove a write that was added by MDEV-15158, but it did not help performance:
diff --git a/storage/innobase/trx/trx0purge.cc b/storage/innobase/trx/trx0purge.cc
index 32e050e9b91..3817ce8bcca 100644
--- a/storage/innobase/trx/trx0purge.cc
+++ b/storage/innobase/trx/trx0purge.cc
@@ -254,8 +254,10 @@ trx_purge_add_undo_to_history(const trx_t* trx, trx_undo_t*& undo, mtr_t* mtr)
rseg_header + TRX_RSEG_HISTORY_SIZE,
hist_size + undo->size, MLOG_4BYTES, mtr);

+#if 0
mlog_write_ull(rseg_header + TRX_RSEG_MAX_TRX_ID,
trx_sys.get_max_trx_id(), mtr);
+#endif
}

/* Before any transaction-generating background threads or the
This write is performed while holding rseg->mutex, but there are 128 rollback segments, so I do not think it should matter with 40 concurrent users, if each transaction is assigned a different rollback segment.
My 10.3 (b2f56085d4249e2cf3ed9f80d284b3ce837e9e29) results are not in line with what marko reported:
[ 10s ] thds: 40 tps: 246050.74 qps: 246050.94 (r/w/o: 0.00/246050.94/0.00) lat (ms,95%): 0.28 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 40 tps: 266876.11 qps: 266876.11 (r/w/o: 0.00/266876.11/0.00) lat (ms,95%): 0.24 err/s: 0.00 reconn/s: 0.00
[ 30s ] thds: 40 tps: 265013.72 qps: 265013.52 (r/w/o: 0.00/265013.52/0.00) lat (ms,95%): 0.25 err/s: 0.00 reconn/s: 0.00
[ 40s ] thds: 40 tps: 267120.70 qps: 267120.70 (r/w/o: 0.00/267120.70/0.00) lat (ms,95%): 0.23 err/s: 0.00 reconn/s: 0.00
[ 50s ] thds: 40 tps: 267445.25 qps: 267445.25 (r/w/o: 0.00/267445.25/0.00) lat (ms,95%): 0.23 err/s: 0.00 reconn/s: 0.00
[ 60s ] thds: 40 tps: 265585.03 qps: 265585.03 (r/w/o: 0.00/265585.03/0.00) lat (ms,95%): 0.24 err/s: 0.00 reconn/s: 0.00
[ 70s ] thds: 40 tps: 262178.67 qps: 262178.67 (r/w/o: 0.00/262178.67/0.00) lat (ms,95%): 0.25 err/s: 0.00 reconn/s: 0.00
[ 80s ] thds: 40 tps: 265030.96 qps: 265030.96 (r/w/o: 0.00/265030.96/0.00) lat (ms,95%): 0.23 err/s: 0.00 reconn/s: 0.00
[ 90s ] thds: 40 tps: 263325.84 qps: 263325.94 (r/w/o: 0.00/263325.94/0.00) lat (ms,95%): 0.23 err/s: 0.00 reconn/s: 0.00
[ 100s ] thds: 40 tps: 249319.73 qps: 249319.63 (r/w/o: 0.00/249319.63/0.00) lat (ms,95%): 0.28 err/s: 0.00 reconn/s: 0.00
[ 110s ] thds: 40 tps: 263994.66 qps: 263994.66 (r/w/o: 0.00/263994.66/0.00) lat (ms,95%): 0.23 err/s: 0.00 reconn/s: 0.00
[ 120s ] thds: 40 tps: 263167.54 qps: 263167.54 (r/w/o: 0.00/263167.54/0.00) lat (ms,95%): 0.23 err/s: 0.00 reconn/s: 0.00

transactions: 31452722 (262087.56 per sec.)
queries: 31452722 (262087.56 per sec.)
However, the 10.2 results are perfectly in line:
[ 10s ] thds: 40 tps: 189485.74 qps: 189485.84 (r/w/o: 0.00/189485.84/0.00) lat (ms,95%): 0.34 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 40 tps: 198558.47 qps: 198558.37 (r/w/o: 0.00/198558.37/0.00) lat (ms,95%): 0.31 err/s: 0.00 reconn/s: 0.00
[ 30s ] thds: 40 tps: 195900.53 qps: 195900.53 (r/w/o: 0.00/195900.53/0.00) lat (ms,95%): 0.31 err/s: 0.00 reconn/s: 0.00
[ 40s ] thds: 40 tps: 196844.47 qps: 196844.47 (r/w/o: 0.00/196844.47/0.00) lat (ms,95%): 0.31 err/s: 0.00 reconn/s: 0.00
[ 50s ] thds: 40 tps: 198759.66 qps: 198759.66 (r/w/o: 0.00/198759.66/0.00) lat (ms,95%): 0.32 err/s: 0.00 reconn/s: 0.00
[ 60s ] thds: 40 tps: 197423.50 qps: 197423.50 (r/w/o: 0.00/197423.50/0.00) lat (ms,95%): 0.31 err/s: 0.00 reconn/s: 0.00
[ 70s ] thds: 40 tps: 196168.50 qps: 196168.50 (r/w/o: 0.00/196168.50/0.00) lat (ms,95%): 0.31 err/s: 0.00 reconn/s: 0.00
[ 80s ] thds: 40 tps: 196777.38 qps: 196777.38 (r/w/o: 0.00/196777.38/0.00) lat (ms,95%): 0.31 err/s: 0.00 reconn/s: 0.00
[ 90s ] thds: 40 tps: 198326.91 qps: 198326.91 (r/w/o: 0.00/198326.91/0.00) lat (ms,95%): 0.31 err/s: 0.00 reconn/s: 0.00
[ 100s ] thds: 40 tps: 196600.91 qps: 196600.91 (r/w/o: 0.00/196600.91/0.00) lat (ms,95%): 0.32 err/s: 0.00 reconn/s: 0.00
[ 110s ] thds: 40 tps: 198558.49 qps: 198558.49 (r/w/o: 0.00/198558.49/0.00) lat (ms,95%): 0.31 err/s: 0.00 reconn/s: 0.00
[ 120s ] thds: 40 tps: 197079.29 qps: 197079.29 (r/w/o: 0.00/197079.29/0.00) lat (ms,95%): 0.31 err/s: 0.00 reconn/s: 0.00

transactions: 23605935 (196703.44 per sec.)
queries: 23605935 (196703.44 per sec.)
Maybe you need to run just a little bit longer, like marko and I did.
OTOH, marko ran for 2 minutes as well. I reported on 300 seconds.
I used the following build configuration for both 10.2 and 10.3:
cmake -DCONC_WITH_{UNITTEST,SSL}=OFF -DWITH_EMBEDDED_SERVER=OFF -DWITH_UNIT_TESTS=OFF -DCMAKE_BUILD_TYPE=RelWithDebInfo -DPLUGIN_{TOKUDB,MROONGA,OQGRAPH,ROCKSDB,CONNECT}=NO -DPLUGIN_PERFSCHEMA=ON -DWITH_SAFEMALLOC=OFF -DWITH_SSL=bundled ..
svoj used the following:
cmake -DBUILD_CONFIG=mysql_release -DPLUGIN_PERFSCHEMA=NO -DPLUGIN_CONNECT=NO -DPLUGIN_MROONGA=NO -DPLUGIN_TOKUDB=NO -DPLUGIN_SPIDER=NO -DPLUGIN_OQGRAPH=NO -DWITH_JEMALLOC=system ..
I retested commit 5db9c6e490ea2c6b7997dff4cbee18274496dd2c with the latter config:
build | threads=1 | threads=40 |
---|---|---|
Marko | 9593.49 | 98489.03 |
Svoj | 9272.23 | 217758.55 |
I applied the patch to remove the optimization from trx_write_serialisation_history() (which counter-intuitively improved performance from about 100ktps to 178ktps using my config):
build | threads=1 | threads=40 |
---|---|---|
Marko | 9489.61 | 178983.94 |
Svoj | 9177.91 | 237928.43 |
So, removing the optimization does help in both cases, but much more with my build configuration. This change is not relevant for solving the regression, because the code has been seriously rewritten since then, but it shows that performance is hard. Perhaps the explanation could be that the TRX_SYS page lock serves as some kind of useful throttle that reduces contention on the rseg->mutex? It would seem best to acquire and release rseg->mutex only once during commit, and to try to ensure that the transactions running in different threads do not get assigned the same rseg (of course, this is only possible for up to 128 concurrent threads).
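As an illustration of that last suggestion (purely hypothetical, not the actual InnoDB assignment code), a round-robin rollback-segment assignment with an atomic counter would spread up to 128 concurrently committing transactions over distinct rollback segments, so that each per-rseg mutex would ideally be uncontended and only need one acquisition per commit:

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical round-robin rollback-segment assignment; illustrative only.
constexpr std::size_t N_RSEGS = 128;

struct rseg_t {
  // stand-in for the real rollback segment; the real one holds a mutex,
  // a header page number, the history list, etc.
  int id;
};

static rseg_t rseg_array[N_RSEGS];

rseg_t* assign_rseg_round_robin() {
  static std::atomic<std::size_t> next{0};
  // Every transaction that needs an rseg takes the next slot, modulo 128,
  // so threads only collide on the same rseg once concurrency exceeds 128.
  return &rseg_array[next.fetch_add(1, std::memory_order_relaxed) % N_RSEGS];
}
```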
I tried modifying svoj’s config by removing -DWITH_JEMALLOC=system. Performance remained roughly the same:
Svoj’s build without jemalloc | threads=1 | threads=40 |
---|---|---|
5db9c6e490ea2c6b7997dff4cbee18274496dd2c | 9704.28 | 204079.81 |
5db9c6e490ea2c6b7997dff4cbee18274496dd2c with patch | 9355.49 | 237138.82 |
So, jemalloc does help multi-client performance a little compared to GNU libc 2.27, but not that much, and the relative performance difference remains.
A major difference between our build configurations could be that cmake/build_configurations/mysql_release.cmake uses -O3 while -DCMAKE_BUILD_TYPE=RelWithDebInfo only uses -O2. After replacing -O2 with -O3 in my configuration, I got the following:
Marko’s build with -O3 | threads=1 | threads=40 |
---|---|---|
5db9c6e490ea2c6b7997dff4cbee18274496dd2c | 10477.51 | 199604.51 |
This suggests that there could be a clear regression from 10.2 to 10.3 when compiling with -O2. I hope that most distributed packages use -O3.
Distributed packages are compiled with -DBUILD_CONFIG=mysql_release.
Still, I suggest looking at more than just threads=1 and threads=40; 40 threads might be a sweet spot on svoj's machine. I tried 1,...,4096, which is more comprehensive. I also suggest varying the durability settings (innodb_flush_log_at_trx_commit=1). innodb_flush_method=O_DIRECT_NO_FSYNC is also a trick that pretends your box has a battery-backed disk, which it probably does not. The same goes for dataset size, AHI vs no AHI, threadpool vs no threadpool, etc.
Seeing the full picture after running 2 tests of 2 minutes each is not easy. I'd really start with a full comparison across different levels of concurrency, and then vary the other parameters if that does not show a regression.
I had a systematic error, caused by not rebuilding from scratch. It looks like -DBUILD_CONFIG=mysql_release does use -O2 after all.
My observations about 5db9c6e490ea2c6b7997dff4cbee18274496dd2c and its predecessor are repeatable. And I still see a significant performance regression with 10.3.5 or the latest 10.3. Here are some rough results from 3-minute runs:
revision | threads=40 |
---|---|
5db9c6e490ea2c6b7997dff4cbee18274496dd2c | 230ktps |
af566d8a63f908c76d43957e790e861d944486b0 | 228346.51tps |
53cc9aa5be63b4dd03aba804aebaa3bd470dab9c | 186491.66tps |
c0d5d7c0efb6d2b9961da7fb813ff000a5d3e4c5 | 210410.54tps |
bc7a1dc1fbd27e6064d3b40443fe242397668af7 | 225957.50tps |
ba125eca551e3f5faee9f6d7ebed290f04cfc8fd | 143308.74tps |
b782971c58b5656820429b8ef3fae5fd82f5a0f7 | 137140.79tps |
There is some fluctuation in these numbers, so 10% variance could be ignored.
Surprisingly, 53cc9aa5be63b4dd03aba804aebaa3bd470dab9c (MDEV-15104 - Remove trx_sys_t::rw_trx_ids) reduces performance compared to its parent af566d8a63f908c76d43957e790e861d944486b0. The next InnoDB commit c0d5d7c0efb6d2b9961da7fb813ff000a5d3e4c5 (MDEV-15104 - Remove trx_sys_t::serialisation_list) is better again. This is what I see with GCC 7.3.0; @svoj sees a steady improvement for these commits with GCC 5.4.0.
Another interesting pair is bc7a1dc1fbd27e6064d3b40443fe242397668af7 and b782971c58b5656820429b8ef3fae5fd82f5a0f7. The latter commit fixed a correctness bug that was introduced by the former. There are quite a few changes in between, and there appears to be some regression already in the parent of the fix (ba125eca551e3f5faee9f6d7ebed290f04cfc8fd).
Next I will try to find the regression between bc7a1dc1fbd27e6064d3b40443fe242397668af7 and ba125eca551e3f5faee9f6d7ebed290f04cfc8fd.
revision | threads=40 |
---|---|
bc7a1dc1fbd27e6064d3b40443fe242397668af7 | 225957.50tps |
d6ed077fc82c2f4d20895db9a6b16ca295f23c33 | 212136.14tps |
217fc122c8e893aa48fa5dcfb3dbfcc99a2aa299 | 197112.85 tps |
217fc122c8e893aa48fa5dcfb3dbfcc99a2aa299 without st_mem_root::name | 214732.32tps |
eda142590f385b3d79e7b6a6e0cc13ba8fd25961 | 133523.27tps |
53476abce829f14b9d591a9bf160bc07cffaa768 | 127573.76tps |
Commit 217fc122c8e893aa48fa5dcfb3dbfcc99a2aa299 is a merge from bb-10.2-ext, most notably adding st_mem_root::name. That is clearly costing some performance compared to its parent d6ed077fc82c2f4d20895db9a6b16ca295f23c33. I tried reverting the patch:
git show d69642deddf4f8d51f8c6807361576451334f1f9 | patch -Rp1
# note: in 10.3, sql/rpl_handler.h was renamed to sql/rpl_rli.h
# note: adjust a few init_alloc_root() calls that did not exist in bb-10.2-ext
It seems that we must define st_mem_root::name in debug builds only.
Still, the additional regression between 217fc122c8e893aa48fa5dcfb3dbfcc99a2aa299 and eda142590f385b3d79e7b6a6e0cc13ba8fd25961 has to be narrowed down. It is outside InnoDB; the InnoDB differences in that range are trivial integer type mismatch warning fixes and fixes of source code comments. The only other commits in this range are CTE fixes (MDEV-15162 and MDEV-15119).
revision | threads=40 |
---|---|
217fc122c8e893aa48fa5dcfb3dbfcc99a2aa299 | 197112.85tps |
465979eabfe3acacc0566fa93402e449e1a9f476 | 229538.83tps |
f271100836d8a91a775894ec36b869a66a3145e5 | 205113.14tps |
6c279ad6a71c63cb595fde7c951aadb31c3dbebc | 123927.89tps |
It is surprising that the performance improved in the middle, even though st_mem_root::name is still present. The CTE fixes seem innocent with respect to performance.
It is also surprising that the MDEV-15091 size_t fix (commit 6c279ad6a71c63cb595fde7c951aadb31c3dbebc) is hurting so much. That seems to be the main source of regression (for my RAM-based toy benchmark).
I would suggest introducing a compile-time option for enabling st_mem_root::name (disabled by default), and revisiting MDEV-15091.
Here's what I get on 6c279ad6a71c63cb595fde7c951aadb31c3dbebc:
[ 10s ] thds: 40 tps: 245242.59 qps: 245242.69 (r/w/o: 0.00/245242.69/0.00) lat (ms,95%): 0.27 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 40 tps: 254569.26 qps: 254569.16 (r/w/o: 0.00/254569.16/0.00) lat (ms,95%): 0.25 err/s: 0.00 reconn/s: 0.00
[ 30s ] thds: 40 tps: 254972.73 qps: 254972.73 (r/w/o: 0.00/254972.73/0.00) lat (ms,95%): 0.26 err/s: 0.00 reconn/s: 0.00
[ 40s ] thds: 40 tps: 252851.43 qps: 252851.53 (r/w/o: 0.00/252851.53/0.00) lat (ms,95%): 0.25 err/s: 0.00 reconn/s: 0.00
[ 50s ] thds: 40 tps: 251944.58 qps: 251944.48 (r/w/o: 0.00/251944.48/0.00) lat (ms,95%): 0.25 err/s: 0.00 reconn/s: 0.00
[ 60s ] thds: 40 tps: 253112.17 qps: 253112.17 (r/w/o: 0.00/253112.17/0.00) lat (ms,95%): 0.25 err/s: 0.00 reconn/s: 0.00
[ 70s ] thds: 40 tps: 250768.06 qps: 250768.06 (r/w/o: 0.00/250768.06/0.00) lat (ms,95%): 0.27 err/s: 0.00 reconn/s: 0.00
[ 80s ] thds: 40 tps: 246130.98 qps: 246130.98 (r/w/o: 0.00/246130.98/0.00) lat (ms,95%): 0.26 err/s: 0.00 reconn/s: 0.00
[ 90s ] thds: 40 tps: 243231.00 qps: 243231.00 (r/w/o: 0.00/243231.00/0.00) lat (ms,95%): 0.26 err/s: 0.00 reconn/s: 0.00
[ 100s ] thds: 40 tps: 245256.26 qps: 245256.26 (r/w/o: 0.00/245256.26/0.00) lat (ms,95%): 0.26 err/s: 0.00 reconn/s: 0.00
[ 110s ] thds: 40 tps: 246891.01 qps: 246891.01 (r/w/o: 0.00/246891.01/0.00) lat (ms,95%): 0.26 err/s: 0.00 reconn/s: 0.00
[ 120s ] thds: 40 tps: 246429.67 qps: 246429.87 (r/w/o: 0.00/246429.87/0.00) lat (ms,95%): 0.25 err/s: 0.00 reconn/s: 0.00

SQL statistics:
    queries performed:
        read:  0
        write: 29916061
        other: 0
        total: 29916061
    transactions: 29916061 (249278.75 per sec.)
    queries: 29916061 (249278.75 per sec.)
Sometimes I experience similar turbulence in benchmarks, which I haven't been able to explain so far. It usually goes away by itself and I can't reliably analyse it.
I guess that if marko re-ran the benchmark against 6c279ad6a71c63cb595fde7c951aadb31c3dbebc tomorrow, for example, he would see a totally different picture.
I get a feeling that the original report is being completely ignored, and everyone is running ramdisk benchmarks now, testing CPU caches and compiler optimizations, on 40 threads, on 40-core machines, for 2 minutes.
You're right, we need to concentrate on the original report, which I couldn't reproduce locally, most probably due to known issues with Samsung SSDs being slow on heavy fsync.
I will try to look into ramdisk performance more closely, with longer run and more concurrent connections (like 1000).
I agree that we must address the I/O issue, but at the same time it could be worth looking for CPU regressions, that is, approaching repeatable regressions from multiple angles. For what it is worth, I repeated some 180-second ramdisk benchmarks today:
revision | threads=40 |
---|---|
6c279ad6a71c63cb595fde7c951aadb31c3dbebc | 130213.08tps |
6c279ad6a71c63cb595fde7c951aadb31c3dbebc (patch 1) | 134518.90tps |
7ad9b3588f53672b3d2d9f345ec9ab7335b50206 | 104287.79tps |
7ad9b3588f53672b3d2d9f345ec9ab7335b50206 (patch 2) | 112871.50tps |
7ad9b3588f53672b3d2d9f345ec9ab7335b50206 (patch 1m+2) | 109021.72tps |
7ad9b3588f53672b3d2d9f345ec9ab7335b50206 (patch 1+2) | 114488.47tps |
The baseline was 123927.89tps before the weekend, and now it is 130213.08tps, so there indeed is some turbulence in the figures even after I switched to always do a full rebuild.
Patch 1 reverts some struct or class member variables from size_t to a narrower type, such as uint or uint32. A minimal version of the patch (1m) is below:
diff --git a/sql/sql_cache.h b/sql/sql_cache.h
index 0ed45a9ed81..ca2c3a59ee7 100644
--- a/sql/sql_cache.h
+++ b/sql/sql_cache.h
@@ -350,7 +350,7 @@ class Query_cache
size_t min_allocation_unit, min_result_data_size;
uint def_query_hash_size, def_table_hash_size;

- size_t mem_bin_num, mem_bin_steps; // See at init_cache & find_bin
+ uint mem_bin_num, mem_bin_steps; // See at init_cache & find_bin

bool initialized;
In my test results above, this minimal patch appeared to cause a slight performance regression, but I think we can explain it by saying that it falls within the 5–10% fluctuation.
Patch 2 removes st_mem_root::name and the name parameter of init_alloc_root() from non-debug builds:
diff --git a/include/my_alloc.h b/include/my_alloc.h
index 3ba8a02fc48..c9879125d96 100644
--- a/include/my_alloc.h
+++ b/include/my_alloc.h
@@ -52,7 +52,9 @@ typedef struct st_mem_root
unsigned int first_block_usage;

void (*error_handler)(void);
+#ifndef DBUG_OFF
const char *name;
+#endif
} MEM_ROOT;

#ifdef __cplusplus
diff --git a/include/my_sys.h b/include/my_sys.h
index 13ab7b12320..fb18cd92e8a 100644
--- a/include/my_sys.h
+++ b/include/my_sys.h
@@ -888,9 +888,16 @@ extern void my_free_lock(void *ptr);
#define alloc_root_inited(A) ((A)->min_malloc != 0)
#define ALLOC_ROOT_MIN_BLOCK_SIZE (MALLOC_OVERHEAD + sizeof(USED_MEM) + 8)
#define clear_alloc_root(A) do { (A)->free= (A)->used= (A)->pre_alloc= 0; (A)->min_malloc=0;} while(0)
+#ifndef DBUG_OFF
extern void init_alloc_root(MEM_ROOT *mem_root, const char *name,
                            size_t block_size, size_t pre_alloc_size,
                            myf my_flags);
+#else
+extern void init_alloc_Root(MEM_ROOT *mem_root,
+                            size_t block_size, size_t pre_alloc_size,
+                            myf my_flags);
+# define init_alloc_root(m,n,b,p,f) init_alloc_Root(m,b,p,f)
+#endif
extern void *alloc_root(MEM_ROOT *mem_root, size_t Size);
extern void *multi_alloc_root(MEM_ROOT *mem_root, ...);
extern void free_root(MEM_ROOT *root, myf MyFLAGS);
diff --git a/mysys/my_alloc.c b/mysys/my_alloc.c
index 7139466be17..fddc9118312 100644
--- a/mysys/my_alloc.c
+++ b/mysys/my_alloc.c
@@ -54,9 +54,15 @@
  Because of this, we store in MY_THREAD_SPECIFIC as bit 1 in block_size
*/

+#ifndef DBUG_OFF
void init_alloc_root(MEM_ROOT *mem_root, const char *name, size_t block_size,
                     size_t pre_alloc_size __attribute__((unused)),
                     myf my_flags)
+#else
+void init_alloc_Root(MEM_ROOT *mem_root, size_t block_size,
+                     size_t pre_alloc_size __attribute__((unused)),
+                     myf my_flags)
+#endif
{
  DBUG_ENTER("init_alloc_root");
  DBUG_PRINT("enter",("root: %p name: %s prealloc: %zu", mem_root,
@@ -72,7 +78,9 @@ void init_alloc_root(MEM_ROOT *mem_root, const char *name, size_t block_size,
  mem_root->block_num= 4; /* We shift this with >>2 */
  mem_root->first_block_usage= 0;
  mem_root->total_alloc= 0;
+#ifndef DBUG_OFF
  mem_root->name= name;
+#endif

#if !(defined(HAVE_valgrind) && defined(EXTRA_DEBUG))
  if (pre_alloc_size)
I would suggest including this patch in 10.3 in some form.
With GCC 7.3.0 and the 3-minute CPU-bound ramdisk benchmark, we still appear to have a significant regression somewhere between 10.3.4 and 10.3.5. I will keep trying to trace it down while Svoj can concentrate on the originally reported regression.
I can still repeat the huge drop caused by commit 6c279ad6a71c63cb595fde7c951aadb31c3dbebc: dropping from 207119.69tps to 130029.74tps. I narrowed it down to the sql_class.h and sql_class.cc changes in that commit, but when I reverted that part of the patch in the latest 10.3, I did not see any improvement.
In my benchmark, I would see contention in lock_sys.mutex and log_sys.mutex. The situation should be improved by MDEV-16232 and MDEV-14425.
Also, at low concurrency, it seems that purge is wasting resources and creating unnecessary contention. Perhaps the number of purge threads should be automatically scaled based on the number of purgeable records?
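One possible shape for such auto-scaling (a hypothetical sketch, not an existing MariaDB interface) would be to derive the number of purge worker tasks to dispatch from the purge backlog, such as the history list length, instead of always running the configured maximum. The thresholds below are made up purely for illustration:

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical heuristic: pick how many purge worker tasks to dispatch
// from the current purge backlog (history list length), instead of always
// using the configured innodb_purge_threads. A real implementation would
// need tuning and some hysteresis to avoid oscillating up and down.
uint32_t purge_workers_for_backlog(uint64_t history_len,
                                   uint32_t configured_max) {
  if (history_len < 1000) return 0;       // nothing worth waking up for
  if (history_len < 100000) return 1;     // light backlog: a single worker
  // Scale up roughly one worker per 100k purgeable records, capped.
  uint32_t want = static_cast<uint32_t>(history_len / 100000) + 1;
  return std::min(want, configured_max);
}
```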
A bad merge of MDEV-14705 to MariaDB 10.3.6 causes frequent calls to service_manager_extend_timeout() from all purge threads. These calls were supposed to only occur during a slow shutdown. This will be fixed in MDEV-17003.
Given that the performance regression reported in this ticket was reported on Windows, where the systemd interface is not present, this probably does not explain much of the regression. Also, I probably did not have libsystemd-dev installed when compiling code for running benchmarks.
By now, we have all learned that MDEV-12288 caused this issue. It just arrived without much announcement back then.
Assigning to marko, the default InnoDB guru. svoj, you can jump in if you want to reproduce this or have any idea how to fix it. axel, same for you.