Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Won't Fix
- Affects Version: 10.3 (EOL)
- Fix Version: None
Description
I noticed reduced performance when running a somewhat large update_no_index sysbench benchmark, comparing 10.2 to 10.3.
Here is the setup:
- Windows Azure VM with 16 virtual CPUs, 32GB RAM and SSD storage (I used the local temporary disk)
Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
Sockets: 1
Virtual processors: 16
- MariaDB 10.2.14 vs 10.3.7 (recent build of 4a5e23e257e229b548599133dbed5162af9df6d9)
The relevant part of the my.ini file is as follows:
[mysqld]
back_log=500
max_connections=4096
max_prepared_stmt_count=500000
table_open_cache=10000
transaction_isolation=REPEATABLE-READ
innodb_status_file=0
innodb_buffer_pool_size=20G
innodb_log_file_size=15G
innodb_log_buffer_size=1000M
innodb-io-capacity=4000
innodb-io-capacity-max=5000
innodb_doublewrite=0
innodb-page-cleaners=1
innodb-buffer-pool-instances=1
innodb_adaptive_flushing=0
innodb_adaptive_flushing_lwm=10
- sysbench 0.4 (I mention it because current versions of sysbench no longer run on Windows, but we do not need anything from the current versions)
The update-no-index test runs on a single large table (50 million rows), with the number of users ranging from 1 to 4096, in powers of 2.
Loading the table (sysbench prepare):
sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=1 prepare
sysbench run
sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=%N% run
where %N% is 1,2,4,8,16,32,64,128,256,512,1024,2048,4096
The test takes a nap of 80 seconds between runs, and also between the load and the first test; as far as I can see, that is enough for all InnoDB background IO activity to finish (MB/sec goes down to 0 on the disk where the test is run). With innodb-flush-log-at-trx-commit at its default (1), the test does not appear to be very CPU bound (sysbench and mysqld use 70% of all available 16 CPUs in the Windows measurement, i.e. over 10 CPUs are busy).
Below are the results from the runs, for 10.2.14 and the recent 10.3.7.
UPDATE: marko mentioned git commit 1a62c8a39647db00283b4e35fb6db3c7bc31e3ca as the version right before 10.3 switched to lock-free data structures, so I added a column for it too. It appears to be better than 10.3.7, but worse than 10.2.14.
Users | TPS 10.2.14 | TPS 10.3.7 | TPS 10.3-pre-lockless |
---|---|---|---|
1 | 1549.92 | 1547.75 | 1551.07 |
2 | 2373.89 | 2209.05 | 2334.80 |
4 | 4466.03 | 4276.46 | 4446.64 |
8 | 9069.82 | 8587.59 | 9098.95 |
16 | 16631.45 | 15719.70 | 16231.39 |
32 | 28989.79 | 27482.44 | 27786.32 |
64 | 36996.12 | 34843.23 | 35089.44 |
128 | 37287.05 | 35172.49 | 36223.68 |
256 | 38038.01 | 35896.52 | 36818.22 |
512 | 38360.71 | 36567.07 | 37195.47 |
1024 | 38265.25 | 36328.62 | 37253.98 |
2048 | 39990.85 | 36328.62 | 38572.81 |
4096 | 41988.71 | 39032.83 | 39606.38 |
(also see the graph https://docs.google.com/spreadsheets/d/1VqRYCwF4QATCKEKwxKuCcvRXKm2GWnICCdv4fSo4_IU/edit#gid=0)
The regression is not as big as I thought initially (somehow I counted 15%, and it is more like 5%), but it is consistent from 64 users all the way up to 4096. So it needs investigating.
It is odd because the test itself is neither CPU bound nor IO bound; the bottleneck, in my understanding, is waiting for the log flush in log_write_up_to(). I did not try to make it CPU bound by relaxing durability; that might be another exercise, although with increased TPS, purge can become more of a bottleneck and disturb the picture.
I tried to do some initial profiling (based on CPU sampling), and this is what showed up.
Below, in all places, *baseline* refers to 10.2.14 and *comparison* to 10.3.7.
Exclusive samples comparison
Individual functions (exclusive sample percentage, with a 1% difference threshold)
Function | Delta | Baseline Value | Comparison Value |
---|---|---|---|
ut_delay | 4.02 | 16.05 | 20.07 |
l_find | 2.44 | 0.16 | 2.60 |
PolicyMutex<TTASEventMutex<GenericPolicy> >::enter | 2.42 | 0.18 | 2.60 |
TTASEventMutex<GenericPolicy>::enter | -4.42 | 4.42 | 0.00 |
SleepConditionVariableCS | -5.41 | 28.70 | 23.29 |
In short, there is more busy waiting in ut_delay and less lazy waiting in SleepConditionVariableCS. There is a new, quite visible function (now the 4th most expensive individual function), l_find, which seems to come from replacing the STL hash table with a custom lock-free hash table in InnoDB.
Inclusive samples comparison
Comparing inclusive samples with a 1% difference threshold gives the table below. It gives me a vague idea that something is more expensive in purge, and again that a lazy wait was replaced with a busy wait somewhere. Background threads probably use more CPU, for example the purge "coordinator" thread, while foreground threads use less of it (as shown, for example, by the decreased time in do_command).
Function | Delta | Baseline Value | Comparison Value |
---|---|---|---|
PolicyMutex<TTASEventMutex<GenericPolicy> >::enter | 28.69 | 2.53 | 31.22 |
ut_delay | 4.02 | 16.05 | 20.07 |
trx_purge | 3.81 | 5.67 | 9.48 |
srv_do_purge | 3.80 | 5.70 | 9.49 |
row_purge | 3.66 | 0.00 | 3.67 |
row_purge_step | 3.47 | 0.71 | 4.18 |
que_thr_step | 3.31 | 0.94 | 4.25 |
srv_purge_coordinator_thread | 3.25 | 6.37 | 9.63 |
trx_sys_t::clone_oldest_view | 3.18 | 0.00 | 3.18 |
btr_cur_search_to_nth_level_func | 3.12 | 0.00 | 3.12 |
row_purge_record_func | 2.94 | 0.00 | 2.94 |
row_purge_upd_exist_or_extern_func | 2.77 | 0.00 | 2.77 |
row_purge_reset_trx_id | 2.72 | 0.00 | 2.72 |
que_run_threads_low | 2.70 | 1.73 | 4.43 |
que_run_threads | 2.69 | 1.75 | 4.45 |
l_find | 2.46 | 0.18 | 2.64 |
ReadView::snapshot | 2.30 | 0.00 | 2.30 |
rw_trx_hash_t::iterate | 2.27 | 0.00 | 2.27 |
lf_hash_iterate | 2.25 | 0.00 | 2.25 |
srv_task_execute | 2.12 | 1.95 | 4.08 |
row_purge_reposition_pcur | 1.53 | 0.00 | 1.53 |
row_search_on_row_ref | 1.33 | 0.00 | 1.33 |
btr_pcur_open_low | 1.31 | 0.00 | 1.31 |
btr_search_guess_on_hash | 1.31 | 1.54 | 2.85 |
trx_purge_wait_for_workers_to_complete | 1.10 | 1.52 | 2.62 |
trx_undo_assign_low | 1.07 | 0.00 | 1.07 |
mtr_t::commit | 1.05 | 2.57 | 3.62 |
mtr_t::Command::execute | 1.01 | 2.42 | 3.43 |
srv_worker_thread | -1.09 | 9.84 | 8.75 |
sync_array_wait_event | -1.29 | 9.69 | 8.40 |
trx_write_serialisation_history | -1.34 | 2.62 | 1.28 |
trx_commit_low | -1.39 | 3.15 | 1.76 |
innobase_commit | -1.75 | 47.36 | 45.61 |
trx_commit | -1.75 | 4.79 | 3.04 |
ha_commit_one_phase | -1.76 | 47.55 | 45.78 |
commit_one_phase_2 | -1.76 | 47.54 | 45.77 |
trans_commit_stmt | -1.80 | 48.68 | 46.88 |
ha_commit_trans | -1.80 | 48.65 | 46.84 |
btr_cur_search_to_nth_level | -1.85 | 1.85 | 0.00 |
innobase_commit_ordered_2 | -2.04 | 5.17 | 3.13 |
trx_commit_for_mysql | -2.05 | 5.10 | 3.05 |
innobase_commit_low | -2.05 | 5.11 | 3.06 |
Prepared_statement::execute_loop | -2.05 | 74.58 | 72.52 |
mysql_stmt_execute_common | -2.14 | 74.85 | 72.71 |
mysqld_stmt_execute | -2.15 | 74.89 | 72.74 |
mysql_execute_command | -2.21 | 73.03 | 70.81 |
trx_undo_assign_undo | -2.24 | 2.24 | 0.00 |
Prepared_statement::execute | -2.24 | 74.03 | 71.79 |
threadpool_process_request | -2.32 | 81.71 | 79.38 |
dispatch_command | -2.33 | 79.21 | 76.88 |
do_command | -2.34 | 81.33 | 79.00 |
tp_callback | -2.38 | 82.66 | 80.28 |
srv_resume_thread | -3.33 | 7.94 | 4.61 |
os_event::wait_low | -4.98 | 28.41 | 23.43 |
SleepConditionVariableCS | -5.41 | 28.70 | 23.29 |
TTASEventMutex<GenericPolicy>::enter | -30.49 | 30.49 | 0.00 |
Attachments
- 1bp-1pc-10.2.14.txt (29 kB)
- 1bp-1pc-10.3.7.txt (29 kB)
- winpmp_10_2_14.txt (317 kB)
- winpmp_10_3_7.txt (265 kB)
Issue Links
- is duplicated by
  - MDEV-17410 mariabackup prepare crashes in Innodb recovery (Closed)
- relates to
  - MDEV-12288 Reset DB_TRX_ID when the history is removed, to speed up MVCC (Closed)
  - MDEV-14425 Change the InnoDB redo log format to reduce write amplification (Closed)
  - MDEV-16232 Use fewer mini-transactions (Stalled)
  - MDEV-17003 service_manager_extend_timeout() being called too often (Closed)
  - MDEV-19845 Adaptive spin loops (Closed)
  - MDEV-17353 pinpoint perf regressions in early 10.3 to commits (Open)
Activity
Given that log_sys.mutex appears to be one of the contention points, I think that we should keep track of the average volume of redo log written per write transaction. More redo log writes should lead to more contention.
There are two features in MariaDB 10.3 that could increase the writes: MDEV-12288 (resetting transaction identifiers after writes) and MDEV-11369 (instant ADD COLUMN). As far as I understand, neither of these changes should matter.
Also, some mini-transactions for writing undo log were changed in the hope of simplifying crash recovery; one of the changes was essentially reverted in MDEV-15914 due to a performance regression.
I would like to compare how these fields in SHOW ENGINE INNODB STATUS proceed:
excerpt of SHOW ENGINE INNODB STATUS output
…
------------
TRANSACTIONS
------------
Trx id counter 34
…
---
LOG
---
Log sequence number 1642672
…
We should record both counts at the start and end of the test, calculate the difference and the ratio. For example, if these counts were 50 and 2000000 at the end of the test, the ratio would be (2000000-1642672)/(50-34) = 22333 bytes per transaction.
If MariaDB 10.3 is generating significantly more redo log than 10.2, that could explain the increased log_sys.mutex contention.
If purge threads are using more CPU than earlier, that could be due to svoj’s work of removing trx_sys.mutex and employing some lock-free data structures. It could be useful to test a 10.3 version right before switching to lock-free data structures.
There was also this change to ut_delay() and some related discussion. On the Intel Skylake microarchitecture, the PAUSE instruction (REPZ NOP) takes more wall-clock time than on earlier ones. Maybe we are now burning more CPU on older microarchitectures?
Re ut_delay: I think I saw svoj test PTHREAD_MUTEX_ADAPTIVE_NP and it wasn't good.
Maybe it is worth rethinking the adaptive delay in MariaDB towards a mean/std approach per mutex, rather than a straight mean. We're wasting heaps of cycles in ut_delay, so a little extra maths isn't going to hurt too badly; a rough sketch of the idea follows the link below.
https://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/pthread_mutex_lock.c;h=1519c142bd6ec5cc528c5921f28575d25c7fb664;hb=HEAD#l118
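To make the mean/std idea a bit more concrete, here is a minimal, hypothetical C++ sketch of a per-mutex spin tuner. None of these names exist in the MariaDB source; it only illustrates tracking the running mean and variance of spin rounds needed before acquisition (Welford's method) and using mean + one standard deviation as the next spin budget, instead of a single global average.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical per-mutex spin statistics; illustrative only, not MariaDB code.
// record() is called with the number of spin rounds an acquisition needed;
// next_budget() suggests how long to spin before falling back to sleeping.
class SpinTuner {
public:
  void record(uint32_t rounds) {
    // Welford's online algorithm for running mean and variance.
    ++n_;
    double delta = rounds - mean_;
    mean_ += delta / n_;
    m2_ += delta * (rounds - mean_);
  }

  uint32_t next_budget(uint32_t hard_cap = 1000) const {
    if (n_ < 2) return hard_cap;               // not enough samples yet
    double stddev = std::sqrt(m2_ / (n_ - 1));
    double budget = mean_ + stddev;            // mean + 1 sigma, per mutex
    return budget < hard_cap ? static_cast<uint32_t>(budget) : hard_cap;
  }

private:
  uint64_t n_ = 0;
  double mean_ = 0.0;
  double m2_ = 0.0;  // sum of squared deviations from the mean
};
```

A real implementation would have to make the updates thread-safe (for example, sampling only some acquisitions while the mutex is already held) and would feed the budget into the existing ut_delay() loop.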
marko, I did some calculations of bytes per trx, guided by your hint, and this is what I got (surprisingly, to me at least, the number of bytes is not an integer):
10.3.7
trxid1 = 193362180
LSN1 = 69634009621
trxid2 = 215344539
LSN2= 75578735333
bytes per trx = 270.43
10.2.14
trxid1 = 11388
LSN1= 13829598991
trxid2 = 22135718
LSN2= 19655598660
bytes per trx = 263.32
The multiuser benchmark was run between point 1 and point 2.
Hope that helps.
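As a sanity check of the 10.3.7 figure above: the value is simply the LSN delta divided by the transaction-id delta over the same interval, and since it averages over roughly 22 million transactions it need not be an integer:

$$\text{redo bytes per trx} = \frac{\mathrm{LSN}_2-\mathrm{LSN}_1}{\mathrm{TrxId}_2-\mathrm{TrxId}_1} = \frac{75578735333-69634009621}{215344539-193362180} = \frac{5944725712}{21982359} \approx 270.43$$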
To reduce the redo log volume, it looks like we have to resume generating the redo log record types MLOG_UNDO_INIT and MLOG_UNDO_HDR_REUSE. The record MLOG_UNDO_INSERT was already restored in MDEV-15914.
MLOG_UNDO_HDR_REUSE was removed by MDEV-12288. That one we cannot restore. It should be equivalent to MLOG_UNDO_HDR_CREATE in the redo log volume. Similarly, reintroducing MLOG_UNDO_ERASE_END ought to increase the redo log volume written, so we’d better not restore that.
regarding "relax CPU" number of loops for "rep nop" being 200, https://github.com/MariaDB/server/commit/b3346c2f41d0f7bd0d919826592b37ae09fa41c8#diff-e877f572e9fd77c43d5e7af075f92e74R91 , this is something from me, I measured this as optimal back in 2008, on (probably) Core2 Duo. No idea if this still holds.
The MLOG_UNDO_INIT record should occupy 3..12 bytes while the lower-level records (MLOG_2BYTES, MLOG_4BYTES) could easily consume 7 more bytes.
I have now restored the MLOG_UNDO_INIT record, which I had removed in the hope that recovery would be simpler and faster. (But I failed to notice that a small increase in the number of redo log bytes written can hurt performance.)
I developed a simple test for measuring the volume of redo log written by an update. Note that transaction IDs are incremented both at start and at commit.
Due to MDEV-12288, it is clear that there will be more writes by the purge in 10.3. Before MDEV-12288, an UPDATE would not result in any purge activity, but after it, in 10.3, all DML will eventually result in the DB_TRX_ID column being zeroed out in the clustered index record. This will improve read performance at some cost of write performance.
--source include/have_innodb.inc
--source include/have_sequence.inc

let $n_updates=1;
let $n_rows=1;

create table t1 (id serial, c2 char(120) default 'foobar') engine=innodb;
#connect stop_purge, localhost, root;
#START TRANSACTION WITH CONSISTENT SNAPSHOT;
#connection default;
eval insert into t1(id) select * from seq_1_to_$n_rows;

--replace_regex /.*Log sequence number ([0-9]+).*/\1/
let $lsn= `SHOW ENGINE INNODB STATUS`;
let $lsn= `SELECT REGEXP_REPLACE('$lsn', 'InnoDB\\t*', '')`;
--replace_regex /.*Trx id counter ([0-9]+).*/\1/
let $trx= `SHOW ENGINE INNODB STATUS`;
let $trx= `SELECT REGEXP_REPLACE('$trx', 'InnoDB\\t*', '')`;

while ($n_updates) {
  update t1 set c2='boofar';
  update t1 set c2='foobar';
  dec $n_updates;
}

--replace_regex /.*Log sequence number ([0-9]+).*/\1/
let $new_lsn= `SHOW ENGINE INNODB STATUS`;
let $new_lsn= `SELECT REGEXP_REPLACE('$new_lsn', 'InnoDB\\t*', '')`;
--replace_regex /.*Trx id counter ([0-9]+).*/\1/
let $new_trx= `SHOW ENGINE INNODB STATUS`;
let $new_trx= `SELECT REGEXP_REPLACE('$new_trx', 'InnoDB\\t*', '')`;

eval SELECT ($new_lsn-$lsn)/($new_trx-$trx) `bytes/(transaction*2)`;
drop table t1;
In the above test, you can disable purge to observe the effect on the number of redo log bytes written. The test should probably run for a longer time for the purge to kick in.
The effect of MDEV-12288 can also be disabled by removing the body of the function row_purge_reset_trx_id(). After that change, read-only workloads should be somewhat slower due to performing lookups for DB_TRX_ID values that have already been purged.
I can confirm this in a memory-based benchmark:
sysbench oltp_update_index.lua --tables=20 --table_size=100000 --rand-type=uniform --time=60 --report-interval=10 --threads=40
and
numactl --interleave=all \
$MDIR/sql/mysqld --no-defaults --log-output=file --general-log=OFF --core-file --loose-debug-sync-timeout=300 --innodb \
--datadir=$TDIR --lc-messages-dir=$MDIR/sql/share/ --socket=$MYSQL_SOCK \
--loose-innodb_file_per_table=true \
--loose-innodb-open-files=4000 \
--loose-innodb_data_file_path=ibdata1:10M:autoextend \
--loose-innodb_flush_log_at_trx_commit=0 \
--loose-innodb_flush_method=O_DIRECT_NO_FSYNC \
--loose-innodb-adaptive-hash-index-partitions=32 \
--loose-innodb_log_buffer_size=256M \
--loose-innodb_log_file_size=1900M \
--loose-innodb_buffer_pool_size=50G \
--loose-innodb-buffer-pool-instances=32 \
--loose-innodb-thread-concurrency=0 \
--loose-innodb_log_files_in_group=3 \
--loose-innodb_read_io_threads=8 \
--loose-innodb_write_io_threads=8 \
--loose-innodb_io_capacity=8000 \
--loose-innodb_flush_sync=0 \
--table-open-cache=1400 \
--table-open-cache-instances=64 \
--max-connections=200 \
--aria-checkpoint-interval=0 \
--loose-performance-schema=off
The transactions/second number is:
version | threads=1 | threads=40 |
---|---|---|
10.2 a4e7800701d0764fe4cbb85b81d7c7cb54677334 | 10031.27 | 189489.50 |
mariadb-10.3.4 | 9035.31 | 151952.76 |
54c715acca8077d9a1694521bd34d6ef1505d7e1 | 9398.94 | 173670.23 |
8d1d38f9539e2b0d30fdb9531d06b66c7662e9dd | 9424.46 | 178949.21 |
c7d0448797a2013079c3e0b7ca97f7bdc1138696 | 9489.61 | 178983.94 |
5db9c6e490ea2c6b7997dff4cbee18274496dd2c | 9593.49 | 98489.03 |
dcc09afa634b0f525d2e8e7d15ef36e5fdd5d270 | 9373.73 | 100722.44 |
af566d8a63f908c76d43957e790e861d944486b0 | 9188.53 | 104609.54 |
mariadb-10.3.5 | 9625.65 | 95901.31 |
10.3 b2f56085d4249e2cf3ed9f80d284b3ce837e9e29 | 10160.03 | 95248.29 |
patched b2f56085d4249e2cf3ed9f80d284b3ce837e9e29 | 9555.82 | 82289.68 |
Something has drastically changed between 10.3.4 and 10.3.5. Commit 5db9c6e490ea2c6b7997dff4cbee18274496dd2c is the biggest culprit. I still think that the idea of MDEV-15132 is good, but there must be something bad about this particular implementation.
I chose commit af566d8a63f908c76d43957e790e861d944486b0 because it is in the middle of some code changes by svoj. There is some visible regression between that and mariadb-10.3.5, and not much improvement since then.
The patch against the most recent 10.3 essentially disabled the resetting of the DB_TRX_ID (MDEV-12288). This resetting does create some purge activity and cause more writes, but it could also greatly help performance, as demonstrated above. (If DB_TRX_ID=0, we immediately know that there is no older version and the record is not implicitly locked.)
diff --git a/storage/innobase/row/row0purge.cc b/storage/innobase/row/row0purge.cc
index af8cd682288..8b22c90e0d1 100644
--- a/storage/innobase/row/row0purge.cc
+++ b/storage/innobase/row/row0purge.cc
@@ -671,6 +671,7 @@ static
void
row_purge_reset_trx_id(purge_node_t* node, mtr_t* mtr)
{
+#if 0
ut_ad(rw_lock_own(dict_operation_lock, RW_LOCK_S));
/* Reset DB_TRX_ID, DB_ROLL_PTR for old records. */
mtr->start();
@@ -729,6 +730,7 @@ row_purge_reset_trx_id(purge_node_t* node, mtr_t* mtr)
}

mtr->commit();
+#endif
}

/***********************************************************//**
diff --git a/storage/innobase/row/row0umod.cc b/storage/innobase/row/row0umod.cc
index 47c515f70cf..3b5f37adf50 100644
--- a/storage/innobase/row/row0umod.cc
+++ b/storage/innobase/row/row0umod.cc
@@ -385,6 +385,7 @@ row_undo_mod_clust(
false, &mtr);
ut_ad(err == DB_SUCCESS
|| err == DB_OUT_OF_FILE_SPACE);
+#if 0
} else if (!index->table->is_temporary() && node->new_trx_id) {
/* We rolled back a record so that it still exists.
We must reset the DB_TRX_ID if the history is no
@@ -430,6 +431,7 @@ row_undo_mod_clust(
sizeof reset_trx_id, &mtr);
}
}
+#endif
} else {
goto func_exit;
}
Surprisingly, removing the following supposed optimization
diff --git a/storage/innobase/trx/trx0trx.cc b/storage/innobase/trx/trx0trx.cc
index 26b857a75f3..218e913d264 100644
--- a/storage/innobase/trx/trx0trx.cc
+++ b/storage/innobase/trx/trx0trx.cc
@@ -1341,11 +1341,6 @@ trx_write_serialisation_history(
#endif
const bool update_binlog_pos = trx->mysql_log_file_name
&& *trx->mysql_log_file_name;
- if (!update_binlog_pos
-#ifdef WITH_WSREP
- && !update_wsrep
-#endif
- ) return true;

buf_block_t* block = trx_sysf_get(mtr);
#ifdef WITH_WSREP
fixes the performance regression of 5db9c6e490ea2c6b7997dff4cbee18274496dd2c. With this optimization removed, we are always acquiring and releasing the TRX_SYS page latch. Replication and Galera are disabled, after all.
This patch is not applicable to later versions, because MDEV-15158 later removed the write to the TRX_SYS page altogether. Maybe this page lock somehow served as useful throttling?
I applied a similar unnecessary acquisition of the TRX_SYS page lock to the latest 10.3, but it did reduce performance as expected:
diff --git a/storage/innobase/trx/trx0trx.cc b/storage/innobase/trx/trx0trx.cc
index 11d83c96a92..8ded9a03e3d 100644
--- a/storage/innobase/trx/trx0trx.cc
+++ b/storage/innobase/trx/trx0trx.cc
@@ -1120,6 +1120,7 @@ trx_write_serialisation_history(

mutex_exit(&rseg->mutex);

+ trx_sysf_get(mtr);
MONITOR_INC(MONITOR_TRX_COMMIT_UNDO);

trx->mysql_log_file_name = NULL;
I tried to remove a write that was added by MDEV-15158, but it did not help performance:
diff --git a/storage/innobase/trx/trx0purge.cc b/storage/innobase/trx/trx0purge.cc
index 32e050e9b91..3817ce8bcca 100644
--- a/storage/innobase/trx/trx0purge.cc
+++ b/storage/innobase/trx/trx0purge.cc
@@ -254,8 +254,10 @@ trx_purge_add_undo_to_history(const trx_t* trx, trx_undo_t*& undo, mtr_t* mtr)
rseg_header + TRX_RSEG_HISTORY_SIZE,
hist_size + undo->size, MLOG_4BYTES, mtr);

+#if 0
mlog_write_ull(rseg_header + TRX_RSEG_MAX_TRX_ID,
trx_sys.get_max_trx_id(), mtr);
+#endif
}

/* Before any transaction-generating background threads or the
This write is performed while holding rseg->mutex, but there are 128 rollback segments, so I do not think it should matter with 40 concurrent users, if each transaction is assigned a different rollback segment.
My 10.3 (b2f56085d4249e2cf3ed9f80d284b3ce837e9e29) results are not in line with what marko reported:
[ 10s ] thds: 40 tps: 246050.74 qps: 246050.94 (r/w/o: 0.00/246050.94/0.00) lat (ms,95%): 0.28 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 40 tps: 266876.11 qps: 266876.11 (r/w/o: 0.00/266876.11/0.00) lat (ms,95%): 0.24 err/s: 0.00 reconn/s: 0.00
[ 30s ] thds: 40 tps: 265013.72 qps: 265013.52 (r/w/o: 0.00/265013.52/0.00) lat (ms,95%): 0.25 err/s: 0.00 reconn/s: 0.00
[ 40s ] thds: 40 tps: 267120.70 qps: 267120.70 (r/w/o: 0.00/267120.70/0.00) lat (ms,95%): 0.23 err/s: 0.00 reconn/s: 0.00
[ 50s ] thds: 40 tps: 267445.25 qps: 267445.25 (r/w/o: 0.00/267445.25/0.00) lat (ms,95%): 0.23 err/s: 0.00 reconn/s: 0.00
[ 60s ] thds: 40 tps: 265585.03 qps: 265585.03 (r/w/o: 0.00/265585.03/0.00) lat (ms,95%): 0.24 err/s: 0.00 reconn/s: 0.00
[ 70s ] thds: 40 tps: 262178.67 qps: 262178.67 (r/w/o: 0.00/262178.67/0.00) lat (ms,95%): 0.25 err/s: 0.00 reconn/s: 0.00
[ 80s ] thds: 40 tps: 265030.96 qps: 265030.96 (r/w/o: 0.00/265030.96/0.00) lat (ms,95%): 0.23 err/s: 0.00 reconn/s: 0.00
[ 90s ] thds: 40 tps: 263325.84 qps: 263325.94 (r/w/o: 0.00/263325.94/0.00) lat (ms,95%): 0.23 err/s: 0.00 reconn/s: 0.00
[ 100s ] thds: 40 tps: 249319.73 qps: 249319.63 (r/w/o: 0.00/249319.63/0.00) lat (ms,95%): 0.28 err/s: 0.00 reconn/s: 0.00
[ 110s ] thds: 40 tps: 263994.66 qps: 263994.66 (r/w/o: 0.00/263994.66/0.00) lat (ms,95%): 0.23 err/s: 0.00 reconn/s: 0.00
[ 120s ] thds: 40 tps: 263167.54 qps: 263167.54 (r/w/o: 0.00/263167.54/0.00) lat (ms,95%): 0.23 err/s: 0.00 reconn/s: 0.00

transactions: 31452722 (262087.56 per sec.)
queries: 31452722 (262087.56 per sec.)
However, the 10.2 results are perfectly in line:
[ 10s ] thds: 40 tps: 189485.74 qps: 189485.84 (r/w/o: 0.00/189485.84/0.00) lat (ms,95%): 0.34 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 40 tps: 198558.47 qps: 198558.37 (r/w/o: 0.00/198558.37/0.00) lat (ms,95%): 0.31 err/s: 0.00 reconn/s: 0.00
[ 30s ] thds: 40 tps: 195900.53 qps: 195900.53 (r/w/o: 0.00/195900.53/0.00) lat (ms,95%): 0.31 err/s: 0.00 reconn/s: 0.00
[ 40s ] thds: 40 tps: 196844.47 qps: 196844.47 (r/w/o: 0.00/196844.47/0.00) lat (ms,95%): 0.31 err/s: 0.00 reconn/s: 0.00
[ 50s ] thds: 40 tps: 198759.66 qps: 198759.66 (r/w/o: 0.00/198759.66/0.00) lat (ms,95%): 0.32 err/s: 0.00 reconn/s: 0.00
[ 60s ] thds: 40 tps: 197423.50 qps: 197423.50 (r/w/o: 0.00/197423.50/0.00) lat (ms,95%): 0.31 err/s: 0.00 reconn/s: 0.00
[ 70s ] thds: 40 tps: 196168.50 qps: 196168.50 (r/w/o: 0.00/196168.50/0.00) lat (ms,95%): 0.31 err/s: 0.00 reconn/s: 0.00
[ 80s ] thds: 40 tps: 196777.38 qps: 196777.38 (r/w/o: 0.00/196777.38/0.00) lat (ms,95%): 0.31 err/s: 0.00 reconn/s: 0.00
[ 90s ] thds: 40 tps: 198326.91 qps: 198326.91 (r/w/o: 0.00/198326.91/0.00) lat (ms,95%): 0.31 err/s: 0.00 reconn/s: 0.00
[ 100s ] thds: 40 tps: 196600.91 qps: 196600.91 (r/w/o: 0.00/196600.91/0.00) lat (ms,95%): 0.32 err/s: 0.00 reconn/s: 0.00
[ 110s ] thds: 40 tps: 198558.49 qps: 198558.49 (r/w/o: 0.00/198558.49/0.00) lat (ms,95%): 0.31 err/s: 0.00 reconn/s: 0.00
[ 120s ] thds: 40 tps: 197079.29 qps: 197079.29 (r/w/o: 0.00/197079.29/0.00) lat (ms,95%): 0.31 err/s: 0.00 reconn/s: 0.00

transactions: 23605935 (196703.44 per sec.)
queries: 23605935 (196703.44 per sec.)
Maybe you need to run just a little bit longer, like marko and I did.
OTOH, marko ran for 2 minutes as well. I reported on 300 seconds.
I used the following build configuration for both 10.2 and 10.3:
cmake -DCONC_WITH_{UNITTEST,SSL}=OFF -DWITH_EMBEDDED_SERVER=OFF -DWITH_UNIT_TESTS=OFF -DCMAKE_BUILD_TYPE=RelWithDebInfo -DPLUGIN_{TOKUDB,MROONGA,OQGRAPH,ROCKSDB,CONNECT}=NO -DPLUGIN_PERFSCHEMA=ON -DWITH_SAFEMALLOC=OFF -DWITH_SSL=bundled ..
svoj used the following:
cmake -DBUILD_CONFIG=mysql_release -DPLUGIN_PERFSCHEMA=NO -DPLUGIN_CONNECT=NO -DPLUGIN_MROONGA=NO -DPLUGIN_TOKUDB=NO -DPLUGIN_SPIDER=NO -DPLUGIN_OQGRAPH=NO -DWITH_JEMALLOC=system ..
I retested commit 5db9c6e490ea2c6b7997dff4cbee18274496dd2c with the latter config:
build | threads=1 | threads=40 |
---|---|---|
Marko | 9593.49 | 98489.03 |
Svoj | 9272.23 | 217758.55 |
I applied the patch to remove the optimization from trx_write_serialisation_history() (which counter-intuitively improved performance from about 100ktps to 178ktps using my config):
build | threads=1 | threads=40 |
---|---|---|
Marko | 9489.61 | 178983.94 |
Svoj | 9177.91 | 237928.43 |
So, removing the optimization does help in both cases, but much more with my build configuration. This change is not relevant for solving the regression, because the code has been seriously rewritten since then, but it shows that performance is hard. Perhaps the explanation could be that the TRX_SYS page lock serves as some kind of useful throttle that reduces contention on the rseg->mutex? It would seem best to acquire and release rseg->mutex only once during commit, and to try to ensure that the transactions running in different threads do not get assigned the same rseg (of course, this is only possible for up to 128 concurrent threads).
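As an illustration of that last suggestion (purely hypothetical, not the actual InnoDB assignment code), a round-robin rollback-segment assignment with an atomic counter would spread up to 128 concurrently committing transactions over distinct rollback segments, so that each per-rseg mutex would ideally be uncontended and only need one acquisition per commit:

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical round-robin rollback-segment assignment; illustrative only.
constexpr std::size_t N_RSEGS = 128;

struct rseg_t {
  // stand-in for the real rollback segment; the real one holds a mutex,
  // a header page number, the history list, etc.
  int id;
};

static rseg_t rseg_array[N_RSEGS];

rseg_t* assign_rseg_round_robin() {
  static std::atomic<std::size_t> next{0};
  // Every transaction that needs an rseg takes the next slot, modulo 128,
  // so threads only collide on the same rseg once concurrency exceeds 128.
  return &rseg_array[next.fetch_add(1, std::memory_order_relaxed) % N_RSEGS];
}
```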
I tried modifying svoj’s config by removing -DWITH_JEMALLOC=system. Performance remained roughly the same:
Svoj’s build without jemalloc | threads=1 | threads=40 |
---|---|---|
5db9c6e490ea2c6b7997dff4cbee18274496dd2c | 9704.28 | 204079.81 |
5db9c6e490ea2c6b7997dff4cbee18274496dd2c with patch | 9355.49 | 237138.82 |
So, jemalloc does help multi-client performance a little compared to GNU libc 2.27, but not that much, and the relative performance difference remains.
A major difference between our build configurations could be that cmake/build_configurations/mysql_release.cmake uses -O3 while -DCMAKE_BUILD_TYPE=RelWithDebInfo only uses -O2. After replacing -O2 with -O3 in my configuration, I got the following:
Marko’s build with -O3 | threads=1 | threads=40 |
---|---|---|
5db9c6e490ea2c6b7997dff4cbee18274496dd2c | 10477.51 | 199604.51 |
This suggests that there could be a clear regression from 10.2 to 10.3 when compiling with -O2. I hope that most distributed packages use -O3.
Distributed packages are compiled with -DBUILD_CONFIG=mysql_release.
Still, I suggest looking at more than just threads=1 and threads=40; 40 threads might be a sweet spot on svoj's machine. I tried 1,...,4096, which is more comprehensive. I also suggest varying the durability settings (innodb_flush_log_at_trx_commit=1). innodb_flush_method=O_DIRECT_NO_FSYNC is also a trick that pretends your box has a battery-backed disk, which it probably does not. The same goes for dataset size, AHI vs no AHI, threadpool vs no threadpool, etc.
Seeing the full picture after running 2 tests of 2 minutes each is not easy. I'd really start with a full comparison across different levels of concurrency, and then vary the other parameters if that does not show a regression.
I had a systematic error, caused by not rebuilding from scratch. It looks like -DBUILD_CONFIG=mysql_release does use -O2 after all.
My observations about 5db9c6e490ea2c6b7997dff4cbee18274496dd2c and its predecessor are repeatable. And I still see a significant performance regression with 10.3.5 or the latest 10.3. Here are some rough results from 3-minute runs:
revision | threads=40 |
---|---|
5db9c6e490ea2c6b7997dff4cbee18274496dd2c | 230ktps |
af566d8a63f908c76d43957e790e861d944486b0 | 228346.51tps |
53cc9aa5be63b4dd03aba804aebaa3bd470dab9c | 186491.66tps |
c0d5d7c0efb6d2b9961da7fb813ff000a5d3e4c5 | 210410.54tps |
bc7a1dc1fbd27e6064d3b40443fe242397668af7 | 225957.50tps |
ba125eca551e3f5faee9f6d7ebed290f04cfc8fd | 143308.74tps |
b782971c58b5656820429b8ef3fae5fd82f5a0f7 | 137140.79tps |
There is some fluctuation in these numbers, so 10% variance could be ignored.
Surprisingly, 53cc9aa5be63b4dd03aba804aebaa3bd470dab9c (MDEV-15104 - Remove trx_sys_t::rw_trx_ids) reduces performance compared to its parent af566d8a63f908c76d43957e790e861d944486b0. The next InnoDB commit c0d5d7c0efb6d2b9961da7fb813ff000a5d3e4c5 (MDEV-15104 - Remove trx_sys_t::serialisation_list) is better again. This is what I see with GCC 7.3.0; @svoj sees a steady improvement for these commits with GCC 5.4.0.
Another interesting pair is bc7a1dc1fbd27e6064d3b40443fe242397668af7 and b782971c58b5656820429b8ef3fae5fd82f5a0f7. The latter commit fixed a correctness bug that was introduced by the former. There are quite a few changes in between, and there appears to be some regression already in the parent of the fix (ba125eca551e3f5faee9f6d7ebed290f04cfc8fd).
Next I will try to find the regression between bc7a1dc1fbd27e6064d3b40443fe242397668af7 and ba125eca551e3f5faee9f6d7ebed290f04cfc8fd.
revision | threads=40 |
---|---|
bc7a1dc1fbd27e6064d3b40443fe242397668af7 | 225957.50tps |
d6ed077fc82c2f4d20895db9a6b16ca295f23c33 | 212136.14tps |
217fc122c8e893aa48fa5dcfb3dbfcc99a2aa299 | 197112.85 tps |
217fc122c8e893aa48fa5dcfb3dbfcc99a2aa299 without st_mem_root::name | 214732.32tps |
eda142590f385b3d79e7b6a6e0cc13ba8fd25961 | 133523.27tps |
53476abce829f14b9d591a9bf160bc07cffaa768 | 127573.76tps |
Commit 217fc122c8e893aa48fa5dcfb3dbfcc99a2aa299 is a merge from bb-10.2-ext, most notably adding st_mem_root::name. That is clearly costing some performance compared to its parent d6ed077fc82c2f4d20895db9a6b16ca295f23c33. I tried reverting the patch:
git show d69642deddf4f8d51f8c6807361576451334f1f9 | patch -Rp1
# note: in 10.3, sql/rpl_handler.h was renamed to sql/rpl_rli.h
# note: adjust a few init_alloc_root() calls that did not exist in bb-10.2-ext
It seems that we must define st_mem_root::name in debug builds only.
Still, the additional regression between 217fc122c8e893aa48fa5dcfb3dbfcc99a2aa299 and eda142590f385b3d79e7b6a6e0cc13ba8fd25961 has to be narrowed down. It is outside InnoDB; the InnoDB differences in that range are trivial integer type mismatch warning fixes and fixes of source code comments. The only other commits in this range are CTE fixes (MDEV-15162 and MDEV-15119).
revision | threads=40 |
---|---|
217fc122c8e893aa48fa5dcfb3dbfcc99a2aa299 | 197112.85tps |
465979eabfe3acacc0566fa93402e449e1a9f476 | 229538.83tps |
f271100836d8a91a775894ec36b869a66a3145e5 | 205113.14tps |
6c279ad6a71c63cb595fde7c951aadb31c3dbebc | 123927.89tps |
It is surprising that the performance improved in the middle, even though st_mem_root::name is still present. The CTE fixes seem innocent with respect to performance.
It is also surprising that the MDEV-15091 size_t fix (commit 6c279ad6a71c63cb595fde7c951aadb31c3dbebc) is hurting so much. That seems to be the main source of regression (for my RAM-based toy benchmark).
I would suggest introducing a compile-time option for enabling st_mem_root::name (disabled by default), and revisiting MDEV-15091.
Here's what I get on 6c279ad6a71c63cb595fde7c951aadb31c3dbebc:
[ 10s ] thds: 40 tps: 245242.59 qps: 245242.69 (r/w/o: 0.00/245242.69/0.00) lat (ms,95%): 0.27 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 40 tps: 254569.26 qps: 254569.16 (r/w/o: 0.00/254569.16/0.00) lat (ms,95%): 0.25 err/s: 0.00 reconn/s: 0.00
[ 30s ] thds: 40 tps: 254972.73 qps: 254972.73 (r/w/o: 0.00/254972.73/0.00) lat (ms,95%): 0.26 err/s: 0.00 reconn/s: 0.00
[ 40s ] thds: 40 tps: 252851.43 qps: 252851.53 (r/w/o: 0.00/252851.53/0.00) lat (ms,95%): 0.25 err/s: 0.00 reconn/s: 0.00
[ 50s ] thds: 40 tps: 251944.58 qps: 251944.48 (r/w/o: 0.00/251944.48/0.00) lat (ms,95%): 0.25 err/s: 0.00 reconn/s: 0.00
[ 60s ] thds: 40 tps: 253112.17 qps: 253112.17 (r/w/o: 0.00/253112.17/0.00) lat (ms,95%): 0.25 err/s: 0.00 reconn/s: 0.00
[ 70s ] thds: 40 tps: 250768.06 qps: 250768.06 (r/w/o: 0.00/250768.06/0.00) lat (ms,95%): 0.27 err/s: 0.00 reconn/s: 0.00
[ 80s ] thds: 40 tps: 246130.98 qps: 246130.98 (r/w/o: 0.00/246130.98/0.00) lat (ms,95%): 0.26 err/s: 0.00 reconn/s: 0.00
[ 90s ] thds: 40 tps: 243231.00 qps: 243231.00 (r/w/o: 0.00/243231.00/0.00) lat (ms,95%): 0.26 err/s: 0.00 reconn/s: 0.00
[ 100s ] thds: 40 tps: 245256.26 qps: 245256.26 (r/w/o: 0.00/245256.26/0.00) lat (ms,95%): 0.26 err/s: 0.00 reconn/s: 0.00
[ 110s ] thds: 40 tps: 246891.01 qps: 246891.01 (r/w/o: 0.00/246891.01/0.00) lat (ms,95%): 0.26 err/s: 0.00 reconn/s: 0.00
[ 120s ] thds: 40 tps: 246429.67 qps: 246429.87 (r/w/o: 0.00/246429.87/0.00) lat (ms,95%): 0.25 err/s: 0.00 reconn/s: 0.00

SQL statistics:
    queries performed:
        read:  0
        write: 29916061
        other: 0
        total: 29916061
    transactions: 29916061 (249278.75 per sec.)
    queries: 29916061 (249278.75 per sec.)
Sometimes I experience similar turbulence in benchmarks, which I haven't been able to explain so far. It usually goes away by itself and I can't reliably analyse it.
I guess that if marko re-ran the benchmark against 6c279ad6a71c63cb595fde7c951aadb31c3dbebc tomorrow, for example, he would see a totally different picture.
I get a feeling that the original report is being completely ignored, and everyone is running ramdisk benchmarks now, testing CPU caches and compiler optimizations, on 40 threads, on 40-core machines, for 2 minutes.
You're right, we need to concentrate on the original report, which I couldn't reproduce locally, most probably due to known issues with Samsung SSDs being slow on heavy fsync.
I will try to look into ramdisk performance more closely, with longer run and more concurrent connections (like 1000).
I agree that we must address the I/O issue, but at the same time it could be worth looking for CPU regressions, that is, approaching repeatable regressions from multiple angles. For what it is worth, I repeated some 180-second ramdisk benchmarks today:
revision | threads=40 |
---|---|
6c279ad6a71c63cb595fde7c951aadb31c3dbebc | 130213.08tps |
6c279ad6a71c63cb595fde7c951aadb31c3dbebc (patch 1) | 134518.90tps |
7ad9b3588f53672b3d2d9f345ec9ab7335b50206 | 104287.79tps |
7ad9b3588f53672b3d2d9f345ec9ab7335b50206 (patch 2) | 112871.50tps |
7ad9b3588f53672b3d2d9f345ec9ab7335b50206 (patch 1m+2) | 109021.72tps |
7ad9b3588f53672b3d2d9f345ec9ab7335b50206 (patch 1+2) | 114488.47tps |
The baseline was 123927.89tps before the weekend, and now it is 130213.08tps, so there indeed is some turbulence in the figures even after I switched to always do a full rebuild.
Patch 1 reverts some struct or class member variables from size_t to a narrower type, such as uint or uint32. A minimal version of the patch (1m) is below:
diff --git a/sql/sql_cache.h b/sql/sql_cache.h
index 0ed45a9ed81..ca2c3a59ee7 100644
--- a/sql/sql_cache.h
+++ b/sql/sql_cache.h
@@ -350,7 +350,7 @@ class Query_cache
size_t min_allocation_unit, min_result_data_size;
uint def_query_hash_size, def_table_hash_size;

- size_t mem_bin_num, mem_bin_steps; // See at init_cache & find_bin
+ uint mem_bin_num, mem_bin_steps; // See at init_cache & find_bin

bool initialized;
In my test results above, this minimal patch appeared to cause a slight performance regression, but I think we can explain it by saying that it falls within the 5–10% fluctuation.
Patch 2 removes st_mem_root::name and the name parameter of init_alloc_root() from non-debug builds:
diff --git a/include/my_alloc.h b/include/my_alloc.h
index 3ba8a02fc48..c9879125d96 100644
--- a/include/my_alloc.h
+++ b/include/my_alloc.h
@@ -52,7 +52,9 @@ typedef struct st_mem_root
unsigned int first_block_usage;

void (*error_handler)(void);
+#ifndef DBUG_OFF
const char *name;
+#endif
} MEM_ROOT;

#ifdef __cplusplus
diff --git a/include/my_sys.h b/include/my_sys.h
index 13ab7b12320..fb18cd92e8a 100644
--- a/include/my_sys.h
+++ b/include/my_sys.h
@@ -888,9 +888,16 @@ extern void my_free_lock(void *ptr);
#define alloc_root_inited(A) ((A)->min_malloc != 0)
#define ALLOC_ROOT_MIN_BLOCK_SIZE (MALLOC_OVERHEAD + sizeof(USED_MEM) + 8)
#define clear_alloc_root(A) do { (A)->free= (A)->used= (A)->pre_alloc= 0; (A)->min_malloc=0;} while(0)
+#ifndef DBUG_OFF
extern void init_alloc_root(MEM_ROOT *mem_root, const char *name,
                            size_t block_size, size_t pre_alloc_size,
                            myf my_flags);
+#else
+extern void init_alloc_Root(MEM_ROOT *mem_root,
+                            size_t block_size, size_t pre_alloc_size,
+                            myf my_flags);
+# define init_alloc_root(m,n,b,p,f) init_alloc_Root(m,b,p,f)
+#endif
extern void *alloc_root(MEM_ROOT *mem_root, size_t Size);
extern void *multi_alloc_root(MEM_ROOT *mem_root, ...);
extern void free_root(MEM_ROOT *root, myf MyFLAGS);
diff --git a/mysys/my_alloc.c b/mysys/my_alloc.c
index 7139466be17..fddc9118312 100644
--- a/mysys/my_alloc.c
+++ b/mysys/my_alloc.c
@@ -54,9 +54,15 @@
  Because of this, we store in MY_THREAD_SPECIFIC as bit 1 in block_size
*/

+#ifndef DBUG_OFF
void init_alloc_root(MEM_ROOT *mem_root, const char *name, size_t block_size,
                     size_t pre_alloc_size __attribute__((unused)),
                     myf my_flags)
+#else
+void init_alloc_Root(MEM_ROOT *mem_root, size_t block_size,
+                     size_t pre_alloc_size __attribute__((unused)),
+                     myf my_flags)
+#endif
{
  DBUG_ENTER("init_alloc_root");
  DBUG_PRINT("enter",("root: %p name: %s prealloc: %zu", mem_root,
@@ -72,7 +78,9 @@ void init_alloc_root(MEM_ROOT *mem_root, const char *name, size_t block_size,
  mem_root->block_num= 4; /* We shift this with >>2 */
  mem_root->first_block_usage= 0;
  mem_root->total_alloc= 0;
+#ifndef DBUG_OFF
  mem_root->name= name;
+#endif

#if !(defined(HAVE_valgrind) && defined(EXTRA_DEBUG))
  if (pre_alloc_size)
I would suggest including this patch in 10.3 in some form.
With GCC 7.3.0 and the 3-minute CPU-bound ramdisk benchmark, we still appear to have a significant regression somewhere between 10.3.4 and 10.3.5. I will keep trying to trace it down while Svoj can concentrate on the originally reported regression.
I can still repeat the huge drop caused by commit 6c279ad6a71c63cb595fde7c951aadb31c3dbebc: dropping from 207119.69tps to 130029.74tps. I narrowed it down to the sql_class.h and sql_class.cc changes in that commit, but when I reverted that part of the patch in the latest 10.3, I did not see any improvement.
In my benchmark, I would see contention in lock_sys.mutex and log_sys.mutex. The situation should be improved by MDEV-16232 and MDEV-14425.
Also, at low concurrency, it seems that purge is wasting resources and creating unnecessary contention. Perhaps the number of purge threads should be automatically scaled based on the number of purgeable records?
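One possible shape for such auto-scaling (a hypothetical sketch, not an existing MariaDB interface) would be to derive the number of purge worker tasks to dispatch from the purge backlog, such as the history list length, instead of always running the configured maximum. The thresholds below are made up purely for illustration:

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical heuristic: pick how many purge worker tasks to dispatch
// from the current purge backlog (history list length), instead of always
// using the configured innodb_purge_threads. A real implementation would
// need tuning and some hysteresis to avoid oscillating up and down.
uint32_t purge_workers_for_backlog(uint64_t history_len,
                                   uint32_t configured_max) {
  if (history_len < 1000) return 0;       // nothing worth waking up for
  if (history_len < 100000) return 1;     // light backlog: a single worker
  // Scale up roughly one worker per 100k purgeable records, capped.
  uint32_t want = static_cast<uint32_t>(history_len / 100000) + 1;
  return std::min(want, configured_max);
}
```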
A bad merge of MDEV-14705 to MariaDB 10.3.6 causes frequent calls to service_manager_extend_timeout() from all purge threads. These calls were supposed to only occur during a slow shutdown. This will be fixed in MDEV-17003.
Given that the performance regression reported in this ticket was reported on Windows, where the systemd interface is not present, this probably does not explain much of the regression. Also, I probably did not have libsystemd-dev installed when compiling code for running benchmarks.
By now, we have all learned that MDEV-12288 caused this issue. It just arrived without much announcement back then.
Assigning to marko, the default InnoDB guru. svoj, you can jump in if you want to reproduce this or have any idea how to fix it. axel, same for you.