MariaDB Server / MDEV-16168

Performance regression on sysbench write benchmarks from 10.2 to 10.3

Details

    Description

      I noticed reduced performance when running a somewhat large update_no_index sysbench benchmark, comparing 10.2 to 10.3.
      Here is the setup:

      • Windows Azure VM with 16 virtual CPUs, 32GB RAM and SSD storage (I used the local temporary disk)

      Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
      Sockets: 1
      Virtual processors: 16

      • MariaDB 10.2.14 vs 10.3.7 (recent build of 4a5e23e257e229b548599133dbed5162af9df6d9)
        The relevant part of the my.ini file is as follows:

        [mysqld]
        back_log=500
        max_connections=4096
        max_prepared_stmt_count=500000
        table_open_cache=10000
        transaction_isolation=REPEATABLE-READ
        innodb_status_file=0
        innodb_buffer_pool_size=20G
        innodb_log_file_size=15G
        innodb_log_buffer_size=1000M 
        innodb-io-capacity=4000
        innodb-io-capacity-max=5000
        innodb_doublewrite=0
        innodb-page-cleaners=1
        innodb-buffer-pool-instances=1
        innodb_adaptive_flushing=0
        innodb_adaptive_flushing_lwm=10
        

      • sysbench 0.4 (I mention it because current versions of sysbench no longer run on Windows, but we do not need anything from the current versions)

      The update_no_index test runs on a single large table (50 million rows), with the number of users ranging from 1 to 4096, in powers of 2.

      Loading table, sysbench prepare

      sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1  --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL  --num-threads=1 prepare
      

      sysbench run

      sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1  --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL  --num-threads=%N% run
      

      where %N% is 1,2,4,8,16,32,64,128,256,512,1024,2048,4096

      The test takes a nap of 80 seconds between runs, and also between the load and the first test; as far as I can see this is enough for all InnoDB background IO activity to finish (MB/sec goes down to 0 on the disk where the test is run). Because innodb-flush-log-at-trx-commit is at its default (1), the test does not appear to be very CPU bound (sysbench and mysql together use 70% of all available 16 CPUs according to Windows measurements, i.e. over 10 CPUs are busy).

      Below are the results from the runs, for 10.2.14 and the recent 10.3.7.
      UPDATE: marko mentioned git commit 1a62c8a39647db00283b4e35fb6db3c7bc31e3ca as the version right before 10.3 switched to lockless, so
      I added a column for it too. It appears to be better than 10.3.7, but worse than 10.2.14.

      Users   TPS 10.2.14   TPS 10.3.7   TPS 10.3-pre-lockless
      1          1549.92      1547.75      1551.07
      2          2373.89      2209.05      2334.80
      4          4466.03      4276.46      4446.64
      8          9069.82      8587.59      9098.95
      16        16631.45     15719.70     16231.39
      32        28989.79     27482.44     27786.32
      64        36996.12     34843.23     35089.44
      128       37287.05     35172.49     36223.68
      256       38038.01     35896.52     36818.22
      512       38360.71     36567.07     37195.47
      1024      38265.25     36328.62     37253.98
      2048      39990.85     36328.62     38572.81
      4096      41988.71     39032.83     39606.38

      (also see the graph https://docs.google.com/spreadsheets/d/1VqRYCwF4QATCKEKwxKuCcvRXKm2GWnICCdv4fSo4_IU/edit#gid=0)
      The difference is not as big as I thought initially (somehow I counted 15%, and it is more like 5%), but it is consistent starting from 64 users and persists all the way up to 4K users. So it needs investigating.
      It is odd because the test itself is neither CPU bound nor IO bound; the bottleneck, in my understanding, is waiting for the flush in log_write_up_to(). I did not try to make it CPU bound by relaxing durability; that might be another exercise, however with increased TPS purging can become more of a bottleneck and disturb the picture.
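
      To make the suspected bottleneck a bit more concrete: with innodb-flush-log-at-trx-commit=1 every commit has to wait until the redo log is flushed up to its commit LSN, and a single flush can release many waiting commits (group commit). The sketch below is only a minimal model of that wait; the names (redo_log, note_flushed, wait_for_flush) are made up for illustration and this is not the actual InnoDB code around log_write_up_to().
      {noformat}
      #include <condition_variable>
      #include <cstdint>
      #include <mutex>

      // Hypothetical, simplified model of the commit-path wait performed by
      // log_write_up_to() when innodb-flush-log-at-trx-commit=1.
      // Not the actual InnoDB implementation.
      struct redo_log {
        std::mutex mtx;
        std::condition_variable flushed_cv;
        uint64_t flushed_lsn = 0;     // the log is durable up to this LSN

        // Called after fsync() of the log file returns (by whoever did the
        // flush); one flush can satisfy many waiting commits at once.
        void note_flushed(uint64_t lsn) {
          {
            std::lock_guard<std::mutex> g(mtx);
            if (lsn > flushed_lsn) flushed_lsn = lsn;
          }
          flushed_cv.notify_all();
        }

        // Commit path: block until everything up to commit_lsn is durable.
        // This wait, rather than CPU or data-file IO, is what bounds TPS here.
        void wait_for_flush(uint64_t commit_lsn) {
          std::unique_lock<std::mutex> g(mtx);
          flushed_cv.wait(g, [&] { return flushed_lsn >= commit_lsn; });
        }
      };
      {noformat}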

      I tried to do some initial profiling (based on CPU sampling), and this is what showed up.

      Below, in all places, baseline refers to 10.2.14 and comparison to 10.3.7.

      Exclusive samples comparison
      Individual functions (exclusive sample percentage, with a 1% difference threshold)

      Function                                              Delta   Baseline (%)   Comparison (%)
      ut_delay                                               4.02          16.05            20.07
      l_find                                                 2.44           0.16             2.60
      PolicyMutex<TTASEventMutex<GenericPolicy> >::enter     2.42           0.18             2.60
      TTASEventMutex<GenericPolicy>::enter                  -4.42           4.42             0.00
      SleepConditionVariableCS                              -5.41          28.70            23.29

      In short, there is more busy wait in ut_delay and less lazy wait in SleepConditionVariableCS. There is also a new, quite visible function (now the 4th most expensive individual function), l_find, which seems to come from replacing the STL hash with a custom hash table in InnoDB.
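
      For context on the ut_delay / SleepConditionVariableCS shift: InnoDB's event mutexes spin for a bounded number of rounds (burning cycles in ut_delay) before parking the thread on an OS wait. The following is a rough, generic sketch of that spin-then-block pattern, with made-up constants and a crude stand-in for ut_delay; it is not the TTASEventMutex code itself, it only illustrates why samples move out of the blocking wait and into the busy wait when mutexes are more contended or held longer.
      {noformat}
      #include <atomic>
      #include <condition_variable>
      #include <mutex>

      // Crude stand-in for ut_delay(): burn a few cycles without yielding.
      static inline void short_busy_delay(unsigned rounds) {
        for (volatile unsigned i = 0; i < rounds * 50; ++i) { /* spin */ }
      }

      // Generic test-and-test-and-set mutex with a bounded spin phase,
      // illustrating (not reproducing) the enter() pattern: spin first,
      // then fall back to a blocking OS wait.
      class spin_then_block_mutex {
        std::atomic<bool> locked{false};
        std::mutex os_mtx;            // stand-in for the sync array / OS event
        std::condition_variable cv;

        bool try_lock() {
          // cheap "test" before the expensive atomic "test-and-set"
          return !locked.load(std::memory_order_relaxed) &&
                 !locked.exchange(true, std::memory_order_acquire);
        }

      public:
        void lock(unsigned spin_rounds = 30, unsigned delay = 6) {
          for (unsigned r = 0; r < spin_rounds; ++r) {
            if (try_lock()) return;          // acquired while spinning: "ut_delay" time
            short_busy_delay(delay);
          }
          // Spin budget exhausted: park on the OS primitive. On Windows this
          // is where SleepConditionVariableCS time shows up in the profile.
          std::unique_lock<std::mutex> g(os_mtx);
          cv.wait(g, [&] { return try_lock(); });
        }

        void unlock() {
          {
            std::lock_guard<std::mutex> g(os_mtx);
            locked.store(false, std::memory_order_release);
          }
          cv.notify_one();
        }
      };
      {noformat}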
      Inclusive samples comparison
      Comparing inclusive samples with a 1% difference threshold gives the table below. It gives me a vague idea that something is more expensive in purge, and again that a lazy wait was replaced with a busy wait somewhere; background threads probably use more CPU, for example the "coordinator" purge thread, while foreground threads use less of it (as shown by the decreased time in do_command, for example). A simplified sketch of the purge-side view-clone scan follows after the table.

      Function                              Delta   Baseline (%)   Comparison (%)
      PolicyMutex<TTASEventMutex<GenericPolicy> >::enter 28.69 2.53 31.22
      ut_delay 4.02 16.05 20.07
      trx_purge 3.81 5.67 9.48
      srv_do_purge 3.80 5.70 9.49
      row_purge 3.66 0.00 3.67
      row_purge_step 3.47 0.71 4.18
      que_thr_step 3.31 0.94 4.25
      srv_purge_coordinator_thread 3.25 6.37 9.63
      trx_sys_t::clone_oldest_view 3.18 0.00 3.18
      btr_cur_search_to_nth_level_func 3.12 0.00 3.12
      row_purge_record_func 2.94 0.00 2.94
      row_purge_upd_exist_or_extern_func 2.77 0.00 2.77
      row_purge_reset_trx_id 2.72 0.00 2.72
      que_run_threads_low 2.70 1.73 4.43
      que_run_threads 2.69 1.75 4.45
      l_find 2.46 0.18 2.64
      ReadView::snapshot 2.30 0.00 2.30
      rw_trx_hash_t::iterate 2.27 0.00 2.27
      lf_hash_iterate 2.25 0.00 2.25
      srv_task_execute 2.12 1.95 4.08
      row_purge_reposition_pcur 1.53 0.00 1.53
      row_search_on_row_ref 1.33 0.00 1.33
      btr_pcur_open_low 1.31 0.00 1.31
      btr_search_guess_on_hash 1.31 1.54 2.85
      trx_purge_wait_for_workers_to_complete 1.10 1.52 2.62
      trx_undo_assign_low 1.07 0.00 1.07
      mtr_t::commit 1.05 2.57 3.62
      mtr_t::Command::execute 1.01 2.42 3.43
      srv_worker_thread -1.09 9.84 8.75
      sync_array_wait_event -1.29 9.69 8.40
      trx_write_serialisation_history -1.34 2.62 1.28
      trx_commit_low -1.39 3.15 1.76
      innobase_commit -1.75 47.36 45.61
      trx_commit -1.75 4.79 3.04
      ha_commit_one_phase -1.76 47.55 45.78
      commit_one_phase_2 -1.76 47.54 45.77
      trans_commit_stmt -1.80 48.68 46.88
      ha_commit_trans -1.80 48.65 46.84
      btr_cur_search_to_nth_level -1.85 1.85 0.00
      innobase_commit_ordered_2 -2.04 5.17 3.13
      trx_commit_for_mysql -2.05 5.10 3.05
      innobase_commit_low -2.05 5.11 3.06
      Prepared_statement::execute_loop -2.05 74.58 72.52
      mysql_stmt_execute_common -2.14 74.85 72.71
      mysqld_stmt_execute -2.15 74.89 72.74
      mysql_execute_command -2.21 73.03 70.81
      trx_undo_assign_undo -2.24 2.24 0.00
      Prepared_statement::execute -2.24 74.03 71.79
      threadpool_process_request -2.32 81.71 79.38
      dispatch_command -2.33 79.21 76.88
      do_command -2.34 81.33 79.00
      tp_callback -2.38 82.66 80.28
      srv_resume_thread -3.33 7.94 4.61
      os_event::wait_low -4.98 28.41 23.43
      SleepConditionVariableCS -5.41 28.70 23.29
      TTASEventMutex<GenericPolicy>::enter -30.49 30.49 0.00
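
      Here is the simplified sketch of the view-clone scan mentioned above, relating to the new trx_sys_t::clone_oldest_view / rw_trx_hash_t::iterate / lf_hash_iterate entries. The real 10.3 structure is a lock-free hash (lf_hash, which is where l_find comes from) and the real ReadView is richer; a mutex plus a plain map is used here only to show the shape of the per-purge-batch work, namely a full scan of the active transactions, not its concurrency properties.
      {noformat}
      #include <cstdint>
      #include <limits>
      #include <mutex>
      #include <unordered_map>

      // Toy stand-in for the 10.3 transaction registry. The real
      // rw_trx_hash_t is a lock-free hash; a mutex-protected map is used
      // here purely to show what work the purge coordinator repeats.
      struct trx_registry {
        std::mutex mtx;
        std::unordered_map<uint64_t, uint64_t> active;  // trx id -> start id

        // Analogue of "clone the oldest view": scan every active transaction
        // to find the oldest one. The cost grows with the number of
        // concurrent transactions and is paid on every purge batch, which
        // matches trx_sys_t::clone_oldest_view showing up in the profile.
        uint64_t oldest_active_start_id() {
          std::lock_guard<std::mutex> g(mtx);
          uint64_t oldest = std::numeric_limits<uint64_t>::max();
          for (const auto &t : active)
            if (t.second < oldest) oldest = t.second;
          return oldest;
        }
      };
      {noformat}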

      Attachments

        1. 10.3.pre-lockless.txt (29 kB)
        2. 1bp-1pc-10.2.14.txt (29 kB)
        3. 1bp-1pc-10.3.7.txt (29 kB)
        4. winpmp_10_2_14.txt (317 kB)
        5. winpmp_10_3_7.txt (265 kB)

        Issue Links

          Activity

            wlad Vladislav Vaintroub created issue -
            wlad Vladislav Vaintroub made changes -
            Field Original Value New Value
            Description I noticed reduced performance numbers when running somewhat large update_no_index sysbench benchmark, when comparing 10.2 to 10.3
            Here is the setup

            * Windows Azure box with 16 virtual CPUs, 32GB RAM and SSD storage (I used Local temporary disk)
            * MariaDB 10.2.14 vs 10.3.7 (recent build)
            the my.ini file is as follows
            {noformat}
            [mysqld]
            back_log=500
            max_connections=4096
            max_connect_errors=5000
            max_prepared_stmt_count=500000
            table_open_cache=10000
            transaction_isolation=REPEATABLE-READ
            innodb_status_file=0
            innodb_buffer_pool_size=20G
            innodb_log_file_size=15G
            innodb_log_buffer_size=1000M
            innodb-io-capacity=4000
            innodb-io-capacity-max=5000
            enable-named-pipe
            character_set_server=latin1
            collation_server=latin1_swedish_ci
            skip-character-set-client-handshake
            innodb_undo_log_truncate=off
            innodb_doublewrite=0
            #innodb-max-purge-lag=1000000
            #innodb_log_files_in_group=10
            #innodb_max_purge_lag_delay=300000
            #innodb_max_purge_lag=0
            #innodb-write-io-threads=32
            #innodb-page-size=4K
            [mariadb]
            innodb-page-cleaners=1
            innodb-buffer-pool-instances=1
            innodb_adaptive_flushing=0
            #innodb-undo-tablespaces=2
            #innodb_buffer_pool_dump_pct=100
            #innodb_max_dirty_pages_pct=99
            #innodb_max_dirty_pages_pct_lwm=80
            innodb_adaptive_flushing_lwm=10

            [client]
            protocol=pipe
            * sysbench 0.4
             
            wlad Vladislav Vaintroub made changes -
            Description I noticed reduced performance numbers when running somewhat large update_no_index sysbench benchmark, when comparing 10.2 to 10.3
            Here is the setup

            * Windows Azure box with 16 virtual CPUs, 32GB RAM and SSD storage (I used Local temporary disk)
            * MariaDB 10.2.14 vs 10.3.7 (recent build)
            the my.ini file is as follows
            {noformat}
            [mysqld]
            back_log=500
            max_connections=4096
            max_connect_errors=5000
            max_prepared_stmt_count=500000
            table_open_cache=10000
            transaction_isolation=REPEATABLE-READ
            innodb_status_file=0
            innodb_buffer_pool_size=20G
            innodb_log_file_size=15G
            innodb_log_buffer_size=1000M
            innodb-io-capacity=4000
            innodb-io-capacity-max=5000
            enable-named-pipe
            character_set_server=latin1
            collation_server=latin1_swedish_ci
            skip-character-set-client-handshake
            innodb_undo_log_truncate=off
            innodb_doublewrite=0
            #innodb-max-purge-lag=1000000
            #innodb_log_files_in_group=10
            #innodb_max_purge_lag_delay=300000
            #innodb_max_purge_lag=0
            #innodb-write-io-threads=32
            #innodb-page-size=4K
            [mariadb]
            innodb-page-cleaners=1
            innodb-buffer-pool-instances=1
            innodb_adaptive_flushing=0
            #innodb-undo-tablespaces=2
            #innodb_buffer_pool_dump_pct=100
            #innodb_max_dirty_pages_pct=99
            #innodb_max_dirty_pages_pct_lwm=80
            innodb_adaptive_flushing_lwm=10

            [client]
            protocol=pipe
            * sysbench 0.4
             
            I noticed reduced performance numbers when running somewhat large update_no_index sysbench benchmark, when comparing 10.2 to 10.3
            Here is the setup

            * Windows Azure box with 16 virtual CPUs, 32GB RAM and SSD storage (I used Local temporary disk)
            * MariaDB 10.2.14 vs 10.3.7 (recent build)
            the relevant part of my.ini file is as follows
            {noformat}
            [mysqld]
            back_log=500
            max_connections=4096
            max_prepared_stmt_count=500000
            table_open_cache=10000
            transaction_isolation=REPEATABLE-READ
            innodb_status_file=0
            innodb_buffer_pool_size=20G
            innodb_log_file_size=15G
            innodb_log_buffer_size=1000M
            innodb-io-capacity=4000
            innodb-io-capacity-max=5000
            innodb_doublewrite=0
            innodb-page-cleaners=1
            innodb-buffer-pool-instances=1
            innodb_adaptive_flushing=0
            innodb_adaptive_flushing_lwm=10
            {noformat}
            * sysbench 0.4 (I mention it because current versions of sysbench do not run on Windows anymore, but we do not need anything from the current versions)

            The update-no-index run on a single large table (50 mio rows), with number of users ranging from 1 to 4096, in powers of 2

            *Loading table, sysbench prepare*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=1 prepare
            {noformat}
             
            *sysbench run*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=%N% run
            {noformat}

            where %N% is 1,2,4,8,16,32,64,128,256,512,1024,2048,4096

            The test takes a nap of 80 seconds between runs, and also between load and first test, and as far as I can see it is enough for all Innodb backgroup IO activity to finish (MB/sec goes down to 0 on a disk where test is ran)

            Below are the results from the runs, for 10.2.4 and recent 10.3.7

            ||Users||10.2.14||10.3.7
            |1|
            |2|
            |
            wlad Vladislav Vaintroub made changes -
            Description I noticed reduced performance numbers when running somewhat large update_no_index sysbench benchmark, when comparing 10.2 to 10.3
            Here is the setup

            * Windows Azure box with 16 virtual CPUs, 32GB RAM and SSD storage (I used Local temporary disk)
            * MariaDB 10.2.14 vs 10.3.7 (recent build)
            the relevant part of my.ini file is as follows
            {noformat}
            [mysqld]
            back_log=500
            max_connections=4096
            max_prepared_stmt_count=500000
            table_open_cache=10000
            transaction_isolation=REPEATABLE-READ
            innodb_status_file=0
            innodb_buffer_pool_size=20G
            innodb_log_file_size=15G
            innodb_log_buffer_size=1000M
            innodb-io-capacity=4000
            innodb-io-capacity-max=5000
            innodb_doublewrite=0
            innodb-page-cleaners=1
            innodb-buffer-pool-instances=1
            innodb_adaptive_flushing=0
            innodb_adaptive_flushing_lwm=10
            {noformat}
            * sysbench 0.4 (I mention it because current versions of sysbench do not run on Windows anymore, but we do not need anything from the current versions)

            The update-no-index run on a single large table (50 mio rows), with number of users ranging from 1 to 4096, in powers of 2

            *Loading table, sysbench prepare*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=1 prepare
            {noformat}
             
            *sysbench run*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=%N% run
            {noformat}

            where %N% is 1,2,4,8,16,32,64,128,256,512,1024,2048,4096

            The test takes a nap of 80 seconds between runs, and also between load and first test, and as far as I can see it is enough for all Innodb backgroup IO activity to finish (MB/sec goes down to 0 on a disk where test is ran)

            Below are the results from the runs, for 10.2.4 and recent 10.3.7

            ||Users||10.2.14||10.3.7
            |1|
            |2|
            |
            |I noticed reduced performance numbers when running somewhat large update_no_index sysbench benchmark, when comparing 10.2 to 10.3
            Here is the setup

            * Windows Azure box with 16 virtual CPUs, 32GB RAM and SSD storage (I used Local temporary disk)
            * MariaDB 10.2.14 vs 10.3.7 (recent build)
            the relevant part of my.ini file is as follows
            {noformat}
            [mysqld]
            back_log=500
            max_connections=4096
            max_prepared_stmt_count=500000
            table_open_cache=10000
            transaction_isolation=REPEATABLE-READ
            innodb_status_file=0
            innodb_buffer_pool_size=20G
            innodb_log_file_size=15G
            innodb_log_buffer_size=1000M
            innodb-io-capacity=4000
            innodb-io-capacity-max=5000
            innodb_doublewrite=0
            innodb-page-cleaners=1
            innodb-buffer-pool-instances=1
            innodb_adaptive_flushing=0
            innodb_adaptive_flushing_lwm=10
            {noformat}
            * sysbench 0.4 (I mention it because current versions of sysbench do not run on Windows anymore, but we do not need anything from the current versions)

            The update-no-index run on a single large table (50 mio rows), with number of users ranging from 1 to 4096, in powers of 2

            *Loading table, sysbench prepare*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=1 prepare
            {noformat}
             
            *sysbench run*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=%N% run
            {noformat}

            where %N% is 1,2,4,8,16,32,64,128,256,512,1024,2048,4096

            The test takes a nap of 80 seconds between runs, and also between load and first test, and as far as I can see it is enough for all Innodb backgroup IO activity to finish (MB/sec goes down to 0 on a disk where test is ran). Due to innodb-flush-log-at-trx-commit being default (1), the test does not appear to be very CPU bound (sysbench and mysql use 70% of all available 16 CPUs, in windows measurement, i.e over 10 CPUs are busy)

            Below are the results from the runs, for 10.2.4 and recent 10.3.7

            ||Users|| TPS 10.2.14|| TPS 10.3.7
            |1|1549.92|1547.75|
            |2|2373.89|2209.05|
            |4|4466.03|4276.46|
            |8|9069.82|8587.59|
            |16|16631.45|15719.70|
            |32|28989.79|27482.44|
            |64|36996.12|34843.23|
            |128|37287.05|35172.49|
            |256|38038.01|35896.52|
            |512|38360.71|36567.07|
            |1024|38265.25|36328.62|
            |2048|39990.85|36328.62|
            |4096|41988.71|39032.83|
            wlad Vladislav Vaintroub made changes -
            Description |I noticed reduced performance numbers when running somewhat large update_no_index sysbench benchmark, when comparing 10.2 to 10.3
            Here is the setup

            * Windows Azure box with 16 virtual CPUs, 32GB RAM and SSD storage (I used Local temporary disk)
            * MariaDB 10.2.14 vs 10.3.7 (recent build)
            the relevant part of my.ini file is as follows
            {noformat}
            [mysqld]
            back_log=500
            max_connections=4096
            max_prepared_stmt_count=500000
            table_open_cache=10000
            transaction_isolation=REPEATABLE-READ
            innodb_status_file=0
            innodb_buffer_pool_size=20G
            innodb_log_file_size=15G
            innodb_log_buffer_size=1000M
            innodb-io-capacity=4000
            innodb-io-capacity-max=5000
            innodb_doublewrite=0
            innodb-page-cleaners=1
            innodb-buffer-pool-instances=1
            innodb_adaptive_flushing=0
            innodb_adaptive_flushing_lwm=10
            {noformat}
            * sysbench 0.4 (I mention it because current versions of sysbench do not run on Windows anymore, but we do not need anything from the current versions)

            The update-no-index run on a single large table (50 mio rows), with number of users ranging from 1 to 4096, in powers of 2

            *Loading table, sysbench prepare*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=1 prepare
            {noformat}
             
            *sysbench run*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=%N% run
            {noformat}

            where %N% is 1,2,4,8,16,32,64,128,256,512,1024,2048,4096

            The test takes a nap of 80 seconds between runs, and also between load and first test, and as far as I can see it is enough for all Innodb backgroup IO activity to finish (MB/sec goes down to 0 on a disk where test is ran). Due to innodb-flush-log-at-trx-commit being default (1), the test does not appear to be very CPU bound (sysbench and mysql use 70% of all available 16 CPUs, in windows measurement, i.e over 10 CPUs are busy)

            Below are the results from the runs, for 10.2.4 and recent 10.3.7

            ||Users|| TPS 10.2.14|| TPS 10.3.7
            |1|1549.92|1547.75|
            |2|2373.89|2209.05|
            |4|4466.03|4276.46|
            |8|9069.82|8587.59|
            |16|16631.45|15719.70|
            |32|28989.79|27482.44|
            |64|36996.12|34843.23|
            |128|37287.05|35172.49|
            |256|38038.01|35896.52|
            |512|38360.71|36567.07|
            |1024|38265.25|36328.62|
            |2048|39990.85|36328.62|
            |4096|41988.71|39032.83|
            I noticed reduced performance numbers when running somewhat large update_no_index sysbench benchmark, when comparing 10.2 to 10.3
            Here is the setup

            * Windows Azure box with 16 virtual CPUs, 32GB RAM and SSD storage (I used Local temporary disk)
            * MariaDB 10.2.14 vs 10.3.7 (recent build)
            the relevant part of my.ini file is as follows
            {noformat}
            [mysqld]
            back_log=500
            max_connections=4096
            max_prepared_stmt_count=500000
            table_open_cache=10000
            transaction_isolation=REPEATABLE-READ
            innodb_status_file=0
            innodb_buffer_pool_size=20G
            innodb_log_file_size=15G
            innodb_log_buffer_size=1000M
            innodb-io-capacity=4000
            innodb-io-capacity-max=5000
            innodb_doublewrite=0
            innodb-page-cleaners=1
            innodb-buffer-pool-instances=1
            innodb_adaptive_flushing=0
            innodb_adaptive_flushing_lwm=10
            {noformat}
            * sysbench 0.4 (I mention it because current versions of sysbench do not run on Windows anymore, but we do not need anything from the current versions)

            The update-no-index run on a single large table (50 mio rows), with number of users ranging from 1 to 4096, in powers of 2

            *Loading table, sysbench prepare*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=1 prepare
            {noformat}
             
            *sysbench run*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=%N% run
            {noformat}

            where %N% is 1,2,4,8,16,32,64,128,256,512,1024,2048,4096

            The test takes a nap of 80 seconds between runs, and also between load and first test, and as far as I can see it is enough for all Innodb backgroup IO activity to finish (MB/sec goes down to 0 on a disk where test is ran). Due to innodb-flush-log-at-trx-commit being default (1), the test does not appear to be very CPU bound (sysbench and mysql use 70% of all available 16 CPUs, in windows measurement, i.e over 10 CPUs are busy)

            Below are the results from the runs, for 10.2.4 and recent 10.3.7

            ||Users|| TPS 10.2.14|| TPS 10.3.7
            |1|1549.92|1547.75|
            |2|2373.89|2209.05|
            |4|4466.03|4276.46|
            |8|9069.82|8587.59|
            |16|16631.45|15719.70|
            |32|28989.79|27482.44|
            |64|36996.12|34843.23|
            |128|37287.05|35172.49|
            |256|38038.01|35896.52|
            |512|38360.71|36567.07|
            |1024|38265.25|36328.62|
            |2048|39990.85|36328.62|
            |4096|41988.71|39032.83|

            (also see the graph https://docs.google.com/spreadsheets/d/1VqRYCwF4QATCKEKwxKuCcvRXKm2GWnICCdv4fSo4_IU/edit#gid=0)
            It is not as big as I thought initially (somehow I counted 15%, and it is more like 5%), but it is consistent starting from 64 users, and keeping until the 4K . So it needs investigating.
            It is odd because the test itself is not CPU bound, nor IO bound, the bottleneck in my understanding is waiting for flush in log_write_up_to() . I did not try to make it CPU bound by relaxing durability, this might be another exercise, however with increased TPS purging can become more of a bottleneck, and disturb the picture.

            I tried to do some initial profiling and

            wlad Vladislav Vaintroub made changes -
            Description I noticed reduced performance numbers when running somewhat large update_no_index sysbench benchmark, when comparing 10.2 to 10.3
            Here is the setup

            * Windows Azure box with 16 virtual CPUs, 32GB RAM and SSD storage (I used Local temporary disk)
            * MariaDB 10.2.14 vs 10.3.7 (recent build)
            the relevant part of my.ini file is as follows
            {noformat}
            [mysqld]
            back_log=500
            max_connections=4096
            max_prepared_stmt_count=500000
            table_open_cache=10000
            transaction_isolation=REPEATABLE-READ
            innodb_status_file=0
            innodb_buffer_pool_size=20G
            innodb_log_file_size=15G
            innodb_log_buffer_size=1000M
            innodb-io-capacity=4000
            innodb-io-capacity-max=5000
            innodb_doublewrite=0
            innodb-page-cleaners=1
            innodb-buffer-pool-instances=1
            innodb_adaptive_flushing=0
            innodb_adaptive_flushing_lwm=10
            {noformat}
            * sysbench 0.4 (I mention it because current versions of sysbench do not run on Windows anymore, but we do not need anything from the current versions)

            The update-no-index run on a single large table (50 mio rows), with number of users ranging from 1 to 4096, in powers of 2

            *Loading table, sysbench prepare*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=1 prepare
            {noformat}
             
            *sysbench run*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=%N% run
            {noformat}

            where %N% is 1,2,4,8,16,32,64,128,256,512,1024,2048,4096

            The test takes a nap of 80 seconds between runs, and also between load and first test, and as far as I can see it is enough for all Innodb backgroup IO activity to finish (MB/sec goes down to 0 on a disk where test is ran). Due to innodb-flush-log-at-trx-commit being default (1), the test does not appear to be very CPU bound (sysbench and mysql use 70% of all available 16 CPUs, in windows measurement, i.e over 10 CPUs are busy)

            Below are the results from the runs, for 10.2.4 and recent 10.3.7

            ||Users|| TPS 10.2.14|| TPS 10.3.7
            |1|1549.92|1547.75|
            |2|2373.89|2209.05|
            |4|4466.03|4276.46|
            |8|9069.82|8587.59|
            |16|16631.45|15719.70|
            |32|28989.79|27482.44|
            |64|36996.12|34843.23|
            |128|37287.05|35172.49|
            |256|38038.01|35896.52|
            |512|38360.71|36567.07|
            |1024|38265.25|36328.62|
            |2048|39990.85|36328.62|
            |4096|41988.71|39032.83|

            (also see the graph https://docs.google.com/spreadsheets/d/1VqRYCwF4QATCKEKwxKuCcvRXKm2GWnICCdv4fSo4_IU/edit#gid=0)
            It is not as big as I thought initially (somehow I counted 15%, and it is more like 5%), but it is consistent starting from 64 users, and keeping until the 4K . So it needs investigating.
            It is odd because the test itself is not CPU bound, nor IO bound, the bottleneck in my understanding is waiting for flush in log_write_up_to() . I did not try to make it CPU bound by relaxing durability, this might be another exercise, however with increased TPS purging can become more of a bottleneck, and disturb the picture.

            I tried to do some initial profiling and

            I noticed reduced performance numbers when running somewhat large update_no_index sysbench benchmark, when comparing 10.2 to 10.3
            Here is the setup

            * Windows Azure box with 16 virtual CPUs, 32GB RAM and SSD storage (I used Local temporary disk)
            * MariaDB 10.2.14 vs 10.3.7 (recent build of 4a5e23e257e229b548599133dbed5162af9df6d9)
            the relevant part of my.ini file is as follows
            {noformat}
            [mysqld]
            back_log=500
            max_connections=4096
            max_prepared_stmt_count=500000
            table_open_cache=10000
            transaction_isolation=REPEATABLE-READ
            innodb_status_file=0
            innodb_buffer_pool_size=20G
            innodb_log_file_size=15G
            innodb_log_buffer_size=1000M
            innodb-io-capacity=4000
            innodb-io-capacity-max=5000
            innodb_doublewrite=0
            innodb-page-cleaners=1
            innodb-buffer-pool-instances=1
            innodb_adaptive_flushing=0
            innodb_adaptive_flushing_lwm=10
            {noformat}
            * sysbench 0.4 (I mention it because current versions of sysbench do not run on Windows anymore, but we do not need anything from the current versions)

            The update-no-index run on a single large table (50 mio rows), with number of users ranging from 1 to 4096, in powers of 2

            *Loading table, sysbench prepare*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=1 prepare
            {noformat}
             
            *sysbench run*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=%N% run
            {noformat}

            where %N% is 1,2,4,8,16,32,64,128,256,512,1024,2048,4096

            The test takes a nap of 80 seconds between runs, and also between load and first test, and as far as I can see it is enough for all Innodb backgroup IO activity to finish (MB/sec goes down to 0 on a disk where test is ran). Due to innodb-flush-log-at-trx-commit being default (1), the test does not appear to be very CPU bound (sysbench and mysql use 70% of all available 16 CPUs, in windows measurement, i.e over 10 CPUs are busy)

            Below are the results from the runs, for 10.2.4 and recent 10.3.7

            ||Users|| TPS 10.2.14|| TPS 10.3.7
            |1|1549.92|1547.75|
            |2|2373.89|2209.05|
            |4|4466.03|4276.46|
            |8|9069.82|8587.59|
            |16|16631.45|15719.70|
            |32|28989.79|27482.44|
            |64|36996.12|34843.23|
            |128|37287.05|35172.49|
            |256|38038.01|35896.52|
            |512|38360.71|36567.07|
            |1024|38265.25|36328.62|
            |2048|39990.85|36328.62|
            |4096|41988.71|39032.83|

            (also see the graph https://docs.google.com/spreadsheets/d/1VqRYCwF4QATCKEKwxKuCcvRXKm2GWnICCdv4fSo4_IU/edit#gid=0)
            It is not as big as I thought initially (somehow I counted 15%, and it is more like 5%), but it is consistent starting from 64 users, and keeping until the 4K . So it needs investigating.
            It is odd because the test itself is not CPU bound, nor IO bound, the bottleneck in my understanding is waiting for flush in log_write_up_to() . I did not try to make it CPU bound by relaxing durability, this might be another exercise, however with increased TPS purging can become more of a bottleneck, and disturb the picture.

            I tried to do some initial profiling (based on CPU sampling, and this is what showed up)

            Individual functions
            Comparison Column Delta Baseline Value Comparison Value
            ut_delay 4.02 16.05 20.07
            l_find 2.44 0.16 2.60
            PolicyMutex<TTASEventMutex<GenericPolicy> >::enter 2.42 0.18 2.60
            TTASEventMutex<GenericPolicy>::enter -4.42 4.42 0.00
            SleepConditionVariableCS -5.41 28.70 23.29

            wlad Vladislav Vaintroub made changes -
            Description I noticed reduced performance numbers when running somewhat large update_no_index sysbench benchmark, when comparing 10.2 to 10.3
            Here is the setup

            * Windows Azure box with 16 virtual CPUs, 32GB RAM and SSD storage (I used Local temporary disk)
            * MariaDB 10.2.14 vs 10.3.7 (recent build of 4a5e23e257e229b548599133dbed5162af9df6d9)
            the relevant part of my.ini file is as follows
            {noformat}
            [mysqld]
            back_log=500
            max_connections=4096
            max_prepared_stmt_count=500000
            table_open_cache=10000
            transaction_isolation=REPEATABLE-READ
            innodb_status_file=0
            innodb_buffer_pool_size=20G
            innodb_log_file_size=15G
            innodb_log_buffer_size=1000M
            innodb-io-capacity=4000
            innodb-io-capacity-max=5000
            innodb_doublewrite=0
            innodb-page-cleaners=1
            innodb-buffer-pool-instances=1
            innodb_adaptive_flushing=0
            innodb_adaptive_flushing_lwm=10
            {noformat}
            * sysbench 0.4 (I mention it because current versions of sysbench do not run on Windows anymore, but we do not need anything from the current versions)

            The update-no-index run on a single large table (50 mio rows), with number of users ranging from 1 to 4096, in powers of 2

            *Loading table, sysbench prepare*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=1 prepare
            {noformat}
             
            *sysbench run*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=%N% run
            {noformat}

            where %N% is 1,2,4,8,16,32,64,128,256,512,1024,2048,4096

            The test takes a nap of 80 seconds between runs, and also between load and first test, and as far as I can see it is enough for all Innodb backgroup IO activity to finish (MB/sec goes down to 0 on a disk where test is ran). Due to innodb-flush-log-at-trx-commit being default (1), the test does not appear to be very CPU bound (sysbench and mysql use 70% of all available 16 CPUs, in windows measurement, i.e over 10 CPUs are busy)

            Below are the results from the runs, for 10.2.4 and recent 10.3.7

            ||Users|| TPS 10.2.14|| TPS 10.3.7
            |1|1549.92|1547.75|
            |2|2373.89|2209.05|
            |4|4466.03|4276.46|
            |8|9069.82|8587.59|
            |16|16631.45|15719.70|
            |32|28989.79|27482.44|
            |64|36996.12|34843.23|
            |128|37287.05|35172.49|
            |256|38038.01|35896.52|
            |512|38360.71|36567.07|
            |1024|38265.25|36328.62|
            |2048|39990.85|36328.62|
            |4096|41988.71|39032.83|

            (also see the graph https://docs.google.com/spreadsheets/d/1VqRYCwF4QATCKEKwxKuCcvRXKm2GWnICCdv4fSo4_IU/edit#gid=0)
            It is not as big as I thought initially (somehow I counted 15%, and it is more like 5%), but it is consistent starting from 64 users, and keeping until the 4K . So it needs investigating.
            It is odd because the test itself is not CPU bound, nor IO bound, the bottleneck in my understanding is waiting for flush in log_write_up_to() . I did not try to make it CPU bound by relaxing durability, this might be another exercise, however with increased TPS purging can become more of a bottleneck, and disturb the picture.

            I tried to do some initial profiling (based on CPU sampling, and this is what showed up)

            Individual functions
            Comparison Column Delta Baseline Value Comparison Value
            ut_delay 4.02 16.05 20.07
            l_find 2.44 0.16 2.60
            PolicyMutex<TTASEventMutex<GenericPolicy> >::enter 2.42 0.18 2.60
            TTASEventMutex<GenericPolicy>::enter -4.42 4.42 0.00
            SleepConditionVariableCS -5.41 28.70 23.29

            I noticed reduced performance numbers when running somewhat large update_no_index sysbench benchmark, when comparing 10.2 to 10.3
            Here is the setup

            * Windows Azure box with 16 virtual CPUs, 32GB RAM and SSD storage (I used Local temporary disk)
            * MariaDB 10.2.14 vs 10.3.7 (recent build of 4a5e23e257e229b548599133dbed5162af9df6d9)
            the relevant part of my.ini file is as follows
            {noformat}
            [mysqld]
            back_log=500
            max_connections=4096
            max_prepared_stmt_count=500000
            table_open_cache=10000
            transaction_isolation=REPEATABLE-READ
            innodb_status_file=0
            innodb_buffer_pool_size=20G
            innodb_log_file_size=15G
            innodb_log_buffer_size=1000M
            innodb-io-capacity=4000
            innodb-io-capacity-max=5000
            innodb_doublewrite=0
            innodb-page-cleaners=1
            innodb-buffer-pool-instances=1
            innodb_adaptive_flushing=0
            innodb_adaptive_flushing_lwm=10
            {noformat}
            * sysbench 0.4 (I mention it because current versions of sysbench do not run on Windows anymore, but we do not need anything from the current versions)

            The update-no-index run on a single large table (50 mio rows), with number of users ranging from 1 to 4096, in powers of 2

            *Loading table, sysbench prepare*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=1 prepare
            {noformat}
             
            *sysbench run*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=%N% run
            {noformat}

            where %N% is 1,2,4,8,16,32,64,128,256,512,1024,2048,4096

            The test takes a nap of 80 seconds between runs, and also between load and first test, and as far as I can see it is enough for all Innodb backgroup IO activity to finish (MB/sec goes down to 0 on a disk where test is ran). Due to innodb-flush-log-at-trx-commit being default (1), the test does not appear to be very CPU bound (sysbench and mysql use 70% of all available 16 CPUs, in windows measurement, i.e over 10 CPUs are busy)

            Below are the results from the runs, for 10.2.4 and recent 10.3.7

            ||Users|| TPS 10.2.14|| TPS 10.3.7
            |1|1549.92|1547.75|
            |2|2373.89|2209.05|
            |4|4466.03|4276.46|
            |8|9069.82|8587.59|
            |16|16631.45|15719.70|
            |32|28989.79|27482.44|
            |64|36996.12|34843.23|
            |128|37287.05|35172.49|
            |256|38038.01|35896.52|
            |512|38360.71|36567.07|
            |1024|38265.25|36328.62|
            |2048|39990.85|36328.62|
            |4096|41988.71|39032.83|

            (also see the graph https://docs.google.com/spreadsheets/d/1VqRYCwF4QATCKEKwxKuCcvRXKm2GWnICCdv4fSo4_IU/edit#gid=0)
            It is not as big as I thought initially (somehow I counted 15%, and it is more like 5%), but it is consistent starting from 64 users, and keeping until the 4K . So it needs investigating.
            It is odd because the test itself is not CPU bound, nor IO bound, the bottleneck in my understanding is waiting for flush in log_write_up_to() . I did not try to make it CPU bound by relaxing durability, this might be another exercise, however with increased TPS purging can become more of a bottleneck, and disturb the picture.

            I tried to do some initial profiling (based on CPU sampling, and this is what showed up)

            Individual functions (threshold 1% difference)

            ||Comparison Column|| Delta|| Baseline Value|| Comparison Value||
            |ut_delay| 4.02| 16.05| 20.07|
            |l_find| 2.44| 0.16 |2.60|
            |PolicyMutex<TTASEventMutex<GenericPolicy> >::enter| 2.42|0.18| 2.60|
            TTASEventMutex<GenericPolicy>::enter -4.42 4.42 0.00
            SleepConditionVariableCS -5.41 28.70 23.29

            wlad Vladislav Vaintroub made changes -
            Description I noticed reduced performance numbers when running somewhat large update_no_index sysbench benchmark, when comparing 10.2 to 10.3
            Here is the setup

            * Windows Azure box with 16 virtual CPUs, 32GB RAM and SSD storage (I used Local temporary disk)
            * MariaDB 10.2.14 vs 10.3.7 (recent build of 4a5e23e257e229b548599133dbed5162af9df6d9)
            the relevant part of my.ini file is as follows
            {noformat}
            [mysqld]
            back_log=500
            max_connections=4096
            max_prepared_stmt_count=500000
            table_open_cache=10000
            transaction_isolation=REPEATABLE-READ
            innodb_status_file=0
            innodb_buffer_pool_size=20G
            innodb_log_file_size=15G
            innodb_log_buffer_size=1000M
            innodb-io-capacity=4000
            innodb-io-capacity-max=5000
            innodb_doublewrite=0
            innodb-page-cleaners=1
            innodb-buffer-pool-instances=1
            innodb_adaptive_flushing=0
            innodb_adaptive_flushing_lwm=10
            {noformat}
            * sysbench 0.4 (I mention it because current versions of sysbench do not run on Windows anymore, but we do not need anything from the current versions)

            The update-no-index run on a single large table (50 mio rows), with number of users ranging from 1 to 4096, in powers of 2

            *Loading table, sysbench prepare*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=1 prepare
            {noformat}
             
            *sysbench run*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=%N% run
            {noformat}

            where %N% is 1,2,4,8,16,32,64,128,256,512,1024,2048,4096

            The test takes a nap of 80 seconds between runs, and also between load and first test, and as far as I can see it is enough for all Innodb backgroup IO activity to finish (MB/sec goes down to 0 on a disk where test is ran). Due to innodb-flush-log-at-trx-commit being default (1), the test does not appear to be very CPU bound (sysbench and mysql use 70% of all available 16 CPUs, in windows measurement, i.e over 10 CPUs are busy)

            Below are the results from the runs, for 10.2.4 and recent 10.3.7

            ||Users|| TPS 10.2.14|| TPS 10.3.7
            |1|1549.92|1547.75|
            |2|2373.89|2209.05|
            |4|4466.03|4276.46|
            |8|9069.82|8587.59|
            |16|16631.45|15719.70|
            |32|28989.79|27482.44|
            |64|36996.12|34843.23|
            |128|37287.05|35172.49|
            |256|38038.01|35896.52|
            |512|38360.71|36567.07|
            |1024|38265.25|36328.62|
            |2048|39990.85|36328.62|
            |4096|41988.71|39032.83|

            (also see the graph https://docs.google.com/spreadsheets/d/1VqRYCwF4QATCKEKwxKuCcvRXKm2GWnICCdv4fSo4_IU/edit#gid=0)
            It is not as big as I thought initially (somehow I counted 15%, and it is more like 5%), but it is consistent starting from 64 users, and keeping until the 4K . So it needs investigating.
            It is odd because the test itself is not CPU bound, nor IO bound, the bottleneck in my understanding is waiting for flush in log_write_up_to() . I did not try to make it CPU bound by relaxing durability, this might be another exercise, however with increased TPS purging can become more of a bottleneck, and disturb the picture.

            I tried to do some initial profiling (based on CPU sampling, and this is what showed up)

            Individual functions (threshold 1% difference)

            ||Comparison Column|| Delta|| Baseline Value|| Comparison Value||
            |ut_delay| 4.02| 16.05| 20.07|
            |l_find| 2.44| 0.16 |2.60|
            |PolicyMutex<TTASEventMutex<GenericPolicy> >::enter| 2.42|0.18| 2.60|
            TTASEventMutex<GenericPolicy>::enter -4.42 4.42 0.00
            SleepConditionVariableCS -5.41 28.70 23.29

            I noticed reduced performance numbers when running somewhat large update_no_index sysbench benchmark, when comparing 10.2 to 10.3
            Here is the setup

            * Windows Azure box with 16 virtual CPUs, 32GB RAM and SSD storage (I used Local temporary disk)
            * MariaDB 10.2.14 vs 10.3.7 (recent build of 4a5e23e257e229b548599133dbed5162af9df6d9)
            the relevant part of my.ini file is as follows
            {noformat}
            [mysqld]
            back_log=500
            max_connections=4096
            max_prepared_stmt_count=500000
            table_open_cache=10000
            transaction_isolation=REPEATABLE-READ
            innodb_status_file=0
            innodb_buffer_pool_size=20G
            innodb_log_file_size=15G
            innodb_log_buffer_size=1000M
            innodb-io-capacity=4000
            innodb-io-capacity-max=5000
            innodb_doublewrite=0
            innodb-page-cleaners=1
            innodb-buffer-pool-instances=1
            innodb_adaptive_flushing=0
            innodb_adaptive_flushing_lwm=10
            {noformat}
            * sysbench 0.4 (I mention it because current versions of sysbench do not run on Windows anymore, but we do not need anything from the current versions)

            The update-no-index run on a single large table (50 mio rows), with number of users ranging from 1 to 4096, in powers of 2

            *Loading table, sysbench prepare*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=1 prepare
            {noformat}
             
            *sysbench run*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=%N% run
            {noformat}

            where %N% is 1,2,4,8,16,32,64,128,256,512,1024,2048,4096

            The test takes a nap of 80 seconds between runs, and also between the load and the first test; as far as I can see that is enough for all InnoDB background I/O activity to finish (MB/sec goes down to 0 on the disk where the test is run). Due to innodb-flush-log-at-trx-commit being at its default (1), the test does not appear to be very CPU bound (sysbench and mysqld together use 70% of all available 16 CPUs in the Windows measurement, i.e. over 10 CPUs are busy).
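            For completeness, the whole sweep is driven by a trivial outer loop. A minimal sketch of such a driver batch file is below; the loop structure, the per-run output files and the timeout calls are my assumptions here, only the sysbench arguments are the ones actually used:
            {noformat}
            @echo off
            rem Minimal sketch of the benchmark driver (not the actual harness).
            rem One prepare, then one run per thread count, with an 80 second pause
            rem after each step so that InnoDB background I/O can settle.
            set ARGS=--test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL

            sysbench %ARGS% --num-threads=1 prepare
            timeout /t 80 /nobreak

            for %%N in (1 2 4 8 16 32 64 128 256 512 1024 2048 4096) do (
                sysbench %ARGS% --num-threads=%%N run >> results_%%N.txt
                timeout /t 80 /nobreak
            )
            {noformat}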

            Below are the results from the runs, for 10.2.14 and the recent 10.3.7 build.

            ||Users|| TPS 10.2.14|| TPS 10.3.7||
            |1|1549.92|1547.75|
            |2|2373.89|2209.05|
            |4|4466.03|4276.46|
            |8|9069.82|8587.59|
            |16|16631.45|15719.70|
            |32|28989.79|27482.44|
            |64|36996.12|34843.23|
            |128|37287.05|35172.49|
            |256|38038.01|35896.52|
            |512|38360.71|36567.07|
            |1024|38265.25|36328.62|
            |2048|39990.85|36328.62|
            |4096|41988.71|39032.83|

            (also see the graph https://docs.google.com/spreadsheets/d/1VqRYCwF4QATCKEKwxKuCcvRXKm2GWnICCdv4fSo4_IU/edit#gid=0)
            The regression is not as big as I thought initially (somehow I counted 15%, and it is more like 5%), but it is consistent starting from 64 users and stays there all the way up to 4K users. So it needs investigating.
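            For example, from the table above the drop at 64 users is (36996.12 - 34843.23) / 36996.12 ≈ 5.8%, and at 512 users (38360.71 - 36567.07) / 38360.71 ≈ 4.7%.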
            It is odd because the test itself is neither CPU bound nor I/O bound; the bottleneck, in my understanding, is waiting for the flush in log_write_up_to(). I did not try to make it CPU bound by relaxing durability (e.g. innodb_flush_log_at_trx_commit=2); that might be another exercise. However, with increased TPS, purging can become more of a bottleneck and disturb the picture.

            I tried to do some initial profiling (based on CPU sampling), and this is what showed up.

            Individual functions (exclusive sample percentage, with a 1% difference threshold):

            ||Comparison Column|| Delta|| Baseline Value|| Comparison Value||
            |ut_delay| {color:red}4.02{color}| 16.05| 20.07|
            |l_find| {color:red}2.44{color}| 0.16 |2.60|
            |PolicyMutex<TTASEventMutex<GenericPolicy> >::enter| {color:red}2.42{color}|0.18| 2.60|
            |TTASEventMutex<GenericPolicy>::enter| {color:green}-4.42{color}| 4.42| 0.00|
            |SleepConditionVariableCS|{color:green}-5.41{color}| 28.70| 23.29|

            In short, there is more busy wait (_ut_delay_) and less lazy wait (_SleepConditionVariableCS_). There is also a new, quite visible function (now the 4th most expensive individual function), l_find, which seems to come from replacing the STL-based hash table with the custom lock-free hash table in InnoDB.


            Comparing inclusive samples with a 1% difference threshold gives the table below, which gives me a vague idea that something is more expensive in purge, and again that a lazy wait was replaced with a busy wait somewhere. Background threads probably use more CPU, for example the purge "coordinator" thread, while foreground threads use less of it (as shown by the decreased time in do_command, for example).

            ||Comparison Column|| Delta|| Baseline Value|| Comparison Value||
            |PolicyMutex<TTASEventMutex<GenericPolicy> >::enter|28.69|2.53|31.22
            |ut_delay|4.02|16.05|20.07
            |trx_purge|3.81|5.67|9.48
            |srv_do_purge|3.80|5.70|9.49
            |row_purge|3.66|0.00|3.67
            |row_purge_step|3.47|0.71|4.18
            |que_thr_step|3.31|0.94|4.25
            |srv_purge_coordinator_thread|3.25|6.37|9.63
            |trx_sys_t::clone_oldest_view|3.18|0.00|3.18
            |btr_cur_search_to_nth_level_func|3.12|0.00|3.12
            |row_purge_record_func|2.94|0.00|2.94
            |row_purge_upd_exist_or_extern_func|2.77|0.00|2.77
            |row_purge_reset_trx_id|2.72|0.00|2.72
            |que_run_threads_low|2.70|1.73|4.43
            |que_run_threads|2.69|1.75|4.45
            |l_find|2.46|0.18|2.64
            |ReadView::snapshot|2.30|0.00|2.30
            |rw_trx_hash_t::iterate|2.27|0.00|2.27
            |lf_hash_iterate|2.25|0.00|2.25
            |srv_task_execute|2.12|1.95|4.08
            |row_purge_reposition_pcur|1.53|0.00|1.53
            |row_search_on_row_ref|1.33|0.00|1.33
            |btr_pcur_open_low|1.31|0.00|1.31
            |btr_search_guess_on_hash|1.31|1.54|2.85
            |trx_purge_wait_for_workers_to_complete|1.10|1.52|2.62
            |trx_undo_assign_low|1.07|0.00|1.07
            |mtr_t::commit|1.05|2.57|3.62
            |mtr_t::Command::execute|1.01|2.42|3.43
            |srv_worker_thread|-1.09|9.84|8.75
            |sync_array_wait_event|-1.29|9.69|8.40
            |trx_write_serialisation_history|-1.34|2.62|1.28
            |trx_commit_low|-1.39|3.15|1.76
            |innobase_commit|-1.75|47.36|45.61
            |trx_commit|-1.75|4.79|3.04
            |ha_commit_one_phase|-1.76|47.55|45.78
            |commit_one_phase_2|-1.76|47.54|45.77
            |trans_commit_stmt|-1.80|48.68|46.88
            |ha_commit_trans|-1.80|48.65|46.84
            |btr_cur_search_to_nth_level|-1.85|1.85|0.00
            |innobase_commit_ordered_2|-2.04|5.17|3.13
            |trx_commit_for_mysql|-2.05|5.10|3.05
            |innobase_commit_low|-2.05|5.11|3.06
            |Prepared_statement::execute_loop|-2.05|74.58|72.52
            |mysql_stmt_execute_common|-2.14|74.85|72.71
            |mysqld_stmt_execute|-2.15|74.89|72.74
            |mysql_execute_command|-2.21|73.03|70.81
            |trx_undo_assign_undo|-2.24|2.24|0.00
            |Prepared_statement::execute|-2.24|74.03|71.79
            |threadpool_process_request|-2.32|81.71|79.38
            |dispatch_command|-2.33|79.21|76.88
            |do_command|-2.34|81.33|79.00
            |tp_callback|-2.38|82.66|80.28
            |srv_resume_thread|-3.33|7.94|4.61
            |os_event::wait_low|-4.98|28.41|23.43
            |SleepConditionVariableCS|-5.41|28.70|23.29
            |TTASEventMutex<GenericPolicy>::enter|-30.49|30.49|0.00


            wlad Vladislav Vaintroub made changes -
            Summary: [Draft]Performance regression on sysbench write benchmarks from 10.2 to 10.3 → Performance regression on sysbench write benchmarks from 10.2 to 10.3
            wlad Vladislav Vaintroub made changes -
            Assignee Marko Mäkelä [ marko ]
            wlad Vladislav Vaintroub made changes -
            Description I noticed reduced performance numbers when running somewhat large update_no_index sysbench benchmark, when comparing 10.2 to 10.3
            Here is the setup

            * Windows Azure box with 16 virtual CPUs, 32GB RAM and SSD storage (I used Local temporary disk)
            * MariaDB 10.2.14 vs 10.3.7 (recent build of 4a5e23e257e229b548599133dbed5162af9df6d9)
            the relevant part of my.ini file is as follows
            {noformat}
            [mysqld]
            back_log=500
            max_connections=4096
            max_prepared_stmt_count=500000
            table_open_cache=10000
            transaction_isolation=REPEATABLE-READ
            innodb_status_file=0
            innodb_buffer_pool_size=20G
            innodb_log_file_size=15G
            innodb_log_buffer_size=1000M
            innodb-io-capacity=4000
            innodb-io-capacity-max=5000
            innodb_doublewrite=0
            innodb-page-cleaners=1
            innodb-buffer-pool-instances=1
            innodb_adaptive_flushing=0
            innodb_adaptive_flushing_lwm=10
            {noformat}
            * sysbench 0.4 (I mention it because current versions of sysbench do not run on Windows anymore, but we do not need anything from the current versions)

            The update-no-index run on a single large table (50 mio rows), with number of users ranging from 1 to 4096, in powers of 2

            *Loading table, sysbench prepare*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=1 prepare
            {noformat}
             
            *sysbench run*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=%N% run
            {noformat}

            where %N% is 1,2,4,8,16,32,64,128,256,512,1024,2048,4096

            The test takes a nap of 80 seconds between runs, and also between load and first test, and as far as I can see it is enough for all Innodb backgroup IO activity to finish (MB/sec goes down to 0 on a disk where test is ran). Due to innodb-flush-log-at-trx-commit being default (1), the test does not appear to be very CPU bound (sysbench and mysql use 70% of all available 16 CPUs, in windows measurement, i.e over 10 CPUs are busy)

            Below are the results from the runs, for 10.2.4 and recent 10.3.7

            ||Users|| TPS 10.2.14|| TPS 10.3.7
            |1|1549.92|1547.75|
            |2|2373.89|2209.05|
            |4|4466.03|4276.46|
            |8|9069.82|8587.59|
            |16|16631.45|15719.70|
            |32|28989.79|27482.44|
            |64|36996.12|34843.23|
            |128|37287.05|35172.49|
            |256|38038.01|35896.52|
            |512|38360.71|36567.07|
            |1024|38265.25|36328.62|
            |2048|39990.85|36328.62|
            |4096|41988.71|39032.83|

            (also see the graph https://docs.google.com/spreadsheets/d/1VqRYCwF4QATCKEKwxKuCcvRXKm2GWnICCdv4fSo4_IU/edit#gid=0)
            It is not as big as I thought initially (somehow I counted 15%, and it is more like 5%), but it is consistent starting from 64 users, and keeping until the 4K . So it needs investigating.
            It is odd because the test itself is not CPU bound, nor IO bound, the bottleneck in my understanding is waiting for flush in log_write_up_to() . I did not try to make it CPU bound by relaxing durability, this might be another exercise, however with increased TPS purging can become more of a bottleneck, and disturb the picture.

            I tried to do some initial profiling (based on CPU sampling, and this is what showed up)

            Individual functions (or, exclusive sample percentage with threshold 1% difference)

            ||Comparison Column|| Delta|| Baseline Value|| Comparison Value||
            |ut_delay| {color:red}4.02{color}| 16.05| 20.07|
            |l_find| {color:red}2.44{color}| 0.16 |2.60|
            |PolicyMutex<TTASEventMutex<GenericPolicy> >::enter| {color:red}2.42{color}|0.18| 2.60|
            |TTASEventMutex<GenericPolicy>::enter| {color:green}-4.42{color}| 4.42| 0.00|
            |SleepConditionVariableCS|{color:green}-5.41{color}| 28.70| 23.29|

            In short, more busy wait _ut_delay_, less lazy wait _SleepConditionVariableCS_ . There is a new quite visible function (4th most expensive individual functions now), l_find , which seems to come from replacing stl with custom hashtable in innodb.


            Comparing inclusive samples with 1% difference threshold, gives the below table, which gives me a vague idea that there something is more expensive in purge, and again that lazy wait was replaced with busy wait somewhere, background threads probably use more CPU , because foreground use less of it (decreased time in do_command for example)

            ||Comparison Column ||Delta||Baseline Value|| Comparison Value||
            |PolicyMutex<TTASEventMutex<GenericPolicy> >::enter|28.69|2.53|31.22
            |ut_delay|4.02|16.05|20.07
            |trx_purge|3.81|5.67|9.48
            |srv_do_purge|3.80|5.70|9.49
            |row_purge|3.66|0.00|3.67
            |row_purge_step|3.47|0.71|4.18
            |que_thr_step|3.31|0.94|4.25
            |srv_purge_coordinator_thread|3.25|6.37|9.63
            |trx_sys_t::clone_oldest_view|3.18|0.00|3.18
            |btr_cur_search_to_nth_level_func|3.12|0.00|3.12
            |row_purge_record_func|2.94|0.00|2.94
            |row_purge_upd_exist_or_extern_func|2.77|0.00|2.77
            |row_purge_reset_trx_id|2.72|0.00|2.72
            |que_run_threads_low|2.70|1.73|4.43
            |que_run_threads|2.69|1.75|4.45
            |l_find|2.46|0.18|2.64
            |ReadView::snapshot|2.30|0.00|2.30
            |rw_trx_hash_t::iterate|2.27|0.00|2.27
            |lf_hash_iterate|2.25|0.00|2.25
            |srv_task_execute|2.12|1.95|4.08
            |row_purge_reposition_pcur|1.53|0.00|1.53
            |row_search_on_row_ref|1.33|0.00|1.33
            |btr_pcur_open_low|1.31|0.00|1.31
            |btr_search_guess_on_hash|1.31|1.54|2.85
            |trx_purge_wait_for_workers_to_complete|1.10|1.52|2.62
            |trx_undo_assign_low|1.07|0.00|1.07
            |mtr_t::commit|1.05|2.57|3.62
            |mtr_t::Command::execute|1.01|2.42|3.43
            |srv_worker_thread|-1.09|9.84|8.75
            |sync_array_wait_event|-1.29|9.69|8.40
            |trx_write_serialisation_history|-1.34|2.62|1.28
            |trx_commit_low|-1.39|3.15|1.76
            |innobase_commit|-1.75|47.36|45.61
            |trx_commit|-1.75|4.79|3.04
            |ha_commit_one_phase|-1.76|47.55|45.78
            |commit_one_phase_2|-1.76|47.54|45.77
            |trans_commit_stmt|-1.80|48.68|46.88
            |ha_commit_trans|-1.80|48.65|46.84
            |btr_cur_search_to_nth_level|-1.85|1.85|0.00
            |innobase_commit_ordered_2|-2.04|5.17|3.13
            |trx_commit_for_mysql|-2.05|5.10|3.05
            |innobase_commit_low|-2.05|5.11|3.06
            |Prepared_statement::execute_loop|-2.05|74.58|72.52
            |mysql_stmt_execute_common|-2.14|74.85|72.71
            |mysqld_stmt_execute|-2.15|74.89|72.74
            |mysql_execute_command|-2.21|73.03|70.81
            |trx_undo_assign_undo|-2.24|2.24|0.00
            |Prepared_statement::execute|-2.24|74.03|71.79
            |threadpool_process_request|-2.32|81.71|79.38
            |dispatch_command|-2.33|79.21|76.88
            |do_command|-2.34|81.33|79.00
            |tp_callback|-2.38|82.66|80.28
            |srv_resume_thread|-3.33|7.94|4.61
            |os_event::wait_low|-4.98|28.41|23.43
            |SleepConditionVariableCS|-5.41|28.70|23.29
            |TTASEventMutex<GenericPolicy>::enter|-30.49|30.49|0.00

            I noticed reduced performance numbers when running somewhat large update_no_index sysbench benchmark, when comparing 10.2 to 10.3
            Here is the setup

            * Windows Azure box with 16 virtual CPUs, 32GB RAM and SSD storage (I used Local temporary disk)
            * MariaDB 10.2.14 vs 10.3.7 (recent build of 4a5e23e257e229b548599133dbed5162af9df6d9)
            the relevant part of my.ini file is as follows
            {noformat}
            [mysqld]
            back_log=500
            max_connections=4096
            max_prepared_stmt_count=500000
            table_open_cache=10000
            transaction_isolation=REPEATABLE-READ
            innodb_status_file=0
            innodb_buffer_pool_size=20G
            innodb_log_file_size=15G
            innodb_log_buffer_size=1000M
            innodb-io-capacity=4000
            innodb-io-capacity-max=5000
            innodb_doublewrite=0
            innodb-page-cleaners=1
            innodb-buffer-pool-instances=1
            innodb_adaptive_flushing=0
            innodb_adaptive_flushing_lwm=10
            {noformat}
            * sysbench 0.4 (I mention it because current versions of sysbench do not run on Windows anymore, but we do not need anything from the current versions)

            The update-no-index run on a single large table (50 mio rows), with number of users ranging from 1 to 4096, in powers of 2

            *Loading table, sysbench prepare*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=1 prepare
            {noformat}
             
            *sysbench run*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=%N% run
            {noformat}

            where %N% is 1,2,4,8,16,32,64,128,256,512,1024,2048,4096

            The test takes a nap of 80 seconds between runs, and also between load and first test, and as far as I can see it is enough for all Innodb backgroup IO activity to finish (MB/sec goes down to 0 on a disk where test is ran). Due to innodb-flush-log-at-trx-commit being default (1), the test does not appear to be very CPU bound (sysbench and mysql use 70% of all available 16 CPUs, in windows measurement, i.e over 10 CPUs are busy)

            Below are the results from the runs, for 10.2.4 and recent 10.3.7

            ||Users|| TPS 10.2.14|| TPS 10.3.7
            |1|1549.92|1547.75|
            |2|2373.89|2209.05|
            |4|4466.03|4276.46|
            |8|9069.82|8587.59|
            |16|16631.45|15719.70|
            |32|28989.79|27482.44|
            |64|36996.12|34843.23|
            |128|37287.05|35172.49|
            |256|38038.01|35896.52|
            |512|38360.71|36567.07|
            |1024|38265.25|36328.62|
            |2048|39990.85|36328.62|
            |4096|41988.71|39032.83|

            (also see the graph https://docs.google.com/spreadsheets/d/1VqRYCwF4QATCKEKwxKuCcvRXKm2GWnICCdv4fSo4_IU/edit#gid=0)
            It is not as big as I thought initially (somehow I counted 15%, and it is more like 5%), but it is consistent starting from 64 users, and keeping until the 4K . So it needs investigating.
            It is odd because the test itself is not CPU bound, nor IO bound, the bottleneck in my understanding is waiting for flush in log_write_up_to() . I did not try to make it CPU bound by relaxing durability, this might be another exercise, however with increased TPS purging can become more of a bottleneck, and disturb the picture.

            I tried to do some initial profiling (based on CPU sampling, and this is what showed up)

            Individual functions (or, exclusive sample percentage with threshold 1% difference)

            ||Comparison Column|| Delta|| Baseline Value|| Comparison Value||
            |ut_delay| {color:red}4.02{color}| 16.05| 20.07|
            |l_find| {color:red}2.44{color}| 0.16 |2.60|
            |PolicyMutex<TTASEventMutex<GenericPolicy> >::enter| {color:red}2.42{color}|0.18| 2.60|
            |TTASEventMutex<GenericPolicy>::enter| {color:green}-4.42{color}| 4.42| 0.00|
            |SleepConditionVariableCS|{color:green}-5.41{color}| 28.70| 23.29|

            In short, more busy wait _ut_delay_, less lazy wait _SleepConditionVariableCS_ . There is a new quite visible function (4th most expensive individual functions now), l_find , which seems to come from replacing stl with custom hashtable in innodb.


            Comparing inclusive samples with 1% difference threshold, gives the below table, which gives me a vague idea that there something is more expensive in purge, and again that lazy wait was replaced with busy wait somewhere, background threads probably use more CPU , for example "coordinator" purge thread. foreground use less of it (as shown by decreased time in do_command for example)

            ||Comparison Column ||Delta||Baseline Value|| Comparison Value||
            |PolicyMutex<TTASEventMutex<GenericPolicy> >::enter|28.69|2.53|31.22
            |ut_delay|4.02|16.05|20.07
            |trx_purge|3.81|5.67|9.48
            |srv_do_purge|3.80|5.70|9.49
            |row_purge|3.66|0.00|3.67
            |row_purge_step|3.47|0.71|4.18
            |que_thr_step|3.31|0.94|4.25
            |srv_purge_coordinator_thread|3.25|6.37|9.63
            |trx_sys_t::clone_oldest_view|3.18|0.00|3.18
            |btr_cur_search_to_nth_level_func|3.12|0.00|3.12
            |row_purge_record_func|2.94|0.00|2.94
            |row_purge_upd_exist_or_extern_func|2.77|0.00|2.77
            |row_purge_reset_trx_id|2.72|0.00|2.72
            |que_run_threads_low|2.70|1.73|4.43
            |que_run_threads|2.69|1.75|4.45
            |l_find|2.46|0.18|2.64
            |ReadView::snapshot|2.30|0.00|2.30
            |rw_trx_hash_t::iterate|2.27|0.00|2.27
            |lf_hash_iterate|2.25|0.00|2.25
            |srv_task_execute|2.12|1.95|4.08
            |row_purge_reposition_pcur|1.53|0.00|1.53
            |row_search_on_row_ref|1.33|0.00|1.33
            |btr_pcur_open_low|1.31|0.00|1.31
            |btr_search_guess_on_hash|1.31|1.54|2.85
            |trx_purge_wait_for_workers_to_complete|1.10|1.52|2.62
            |trx_undo_assign_low|1.07|0.00|1.07
            |mtr_t::commit|1.05|2.57|3.62
            |mtr_t::Command::execute|1.01|2.42|3.43
            |srv_worker_thread|-1.09|9.84|8.75
            |sync_array_wait_event|-1.29|9.69|8.40
            |trx_write_serialisation_history|-1.34|2.62|1.28
            |trx_commit_low|-1.39|3.15|1.76
            |innobase_commit|-1.75|47.36|45.61
            |trx_commit|-1.75|4.79|3.04
            |ha_commit_one_phase|-1.76|47.55|45.78
            |commit_one_phase_2|-1.76|47.54|45.77
            |trans_commit_stmt|-1.80|48.68|46.88
            |ha_commit_trans|-1.80|48.65|46.84
            |btr_cur_search_to_nth_level|-1.85|1.85|0.00
            |innobase_commit_ordered_2|-2.04|5.17|3.13
            |trx_commit_for_mysql|-2.05|5.10|3.05
            |innobase_commit_low|-2.05|5.11|3.06
            |Prepared_statement::execute_loop|-2.05|74.58|72.52
            |mysql_stmt_execute_common|-2.14|74.85|72.71
            |mysqld_stmt_execute|-2.15|74.89|72.74
            |mysql_execute_command|-2.21|73.03|70.81
            |trx_undo_assign_undo|-2.24|2.24|0.00
            |Prepared_statement::execute|-2.24|74.03|71.79
            |threadpool_process_request|-2.32|81.71|79.38
            |dispatch_command|-2.33|79.21|76.88
            |do_command|-2.34|81.33|79.00
            |tp_callback|-2.38|82.66|80.28
            |srv_resume_thread|-3.33|7.94|4.61
            |os_event::wait_low|-4.98|28.41|23.43
            |SleepConditionVariableCS|-5.41|28.70|23.29
            |TTASEventMutex<GenericPolicy>::enter|-30.49|30.49|0.00
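
            The new entries ReadView::snapshot, trx_sys_t::clone_oldest_view, rw_trx_hash_t::iterate and lf_hash_iterate seem to fit the 10.3 switch to a lock-free transaction registry: a read view now appears to be built by walking the hash of active transactions instead of copying an already maintained list under the trx_sys mutex. Below is only a rough standalone sketch of that difference (assumed shapes and names, not the real InnoDB data structures), to show where extra per-snapshot work can come from.
            {noformat}
            // Rough sketch only (assumed shapes, not the real InnoDB structures):
            // two ways a read view / oldest view can be produced.
            #include <algorithm>
            #include <cstdint>
            #include <mutex>
            #include <unordered_set>
            #include <vector>

            // 10.2-style idea: active transaction ids are maintained in a list at
            // trx start/commit; taking a snapshot is a plain copy under a mutex.
            struct TrxRegistryList {
              std::mutex m;
              std::vector<uint64_t> active;

              std::vector<uint64_t> snapshot() {
                std::lock_guard<std::mutex> g(m);
                return active;  // O(active) copy, nothing else to do
              }
            };

            // 10.3-style idea: no maintained list; every snapshot walks the whole
            // hash of active transactions (an ordinary container stands in for the
            // lock-free hash here) and orders the result itself. In the profile that
            // walk would correspond to lf_hash_iterate / l_find.
            struct TrxRegistryHash {
              std::unordered_set<uint64_t> active;  // stand-in for the lock-free hash

              std::vector<uint64_t> snapshot() {
                std::vector<uint64_t> ids(active.begin(), active.end());  // full walk
                std::sort(ids.begin(), ids.end());                        // extra work
                return ids;
              }
            };

            int main() {
              TrxRegistryHash reg;
              for (uint64_t id = 1; id <= 1000; ++id) reg.active.insert(id);
              return reg.snapshot().size() == 1000 ? 0 : 1;
            }
            {noformat}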

            wlad Vladislav Vaintroub made changes -
            Description I noticed reduced performance numbers when running somewhat large update_no_index sysbench benchmark, when comparing 10.2 to 10.3
            Here is the setup

            * Windows Azure box with 16 virtual CPUs, 32GB RAM and SSD storage (I used Local temporary disk)
            * MariaDB 10.2.14 vs 10.3.7 (recent build of 4a5e23e257e229b548599133dbed5162af9df6d9)
            the relevant part of my.ini file is as follows
            {noformat}
            [mysqld]
            back_log=500
            max_connections=4096
            max_prepared_stmt_count=500000
            table_open_cache=10000
            transaction_isolation=REPEATABLE-READ
            innodb_status_file=0
            innodb_buffer_pool_size=20G
            innodb_log_file_size=15G
            innodb_log_buffer_size=1000M
            innodb-io-capacity=4000
            innodb-io-capacity-max=5000
            innodb_doublewrite=0
            innodb-page-cleaners=1
            innodb-buffer-pool-instances=1
            innodb_adaptive_flushing=0
            innodb_adaptive_flushing_lwm=10
            {noformat}
            * sysbench 0.4 (I mention it because current versions of sysbench do not run on Windows anymore, but we do not need anything from the current versions)

            The update-no-index run on a single large table (50 mio rows), with number of users ranging from 1 to 4096, in powers of 2

            *Loading table, sysbench prepare*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=1 prepare
            {noformat}
             
            *sysbench run*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=%N% run
            {noformat}

            where %N% is 1,2,4,8,16,32,64,128,256,512,1024,2048,4096

            The test takes a nap of 80 seconds between runs, and also between load and first test, and as far as I can see it is enough for all Innodb backgroup IO activity to finish (MB/sec goes down to 0 on a disk where test is ran). Due to innodb-flush-log-at-trx-commit being default (1), the test does not appear to be very CPU bound (sysbench and mysql use 70% of all available 16 CPUs, in windows measurement, i.e over 10 CPUs are busy)

            Below are the results from the runs, for 10.2.4 and recent 10.3.7

            ||Users|| TPS 10.2.14|| TPS 10.3.7
            |1|1549.92|1547.75|
            |2|2373.89|2209.05|
            |4|4466.03|4276.46|
            |8|9069.82|8587.59|
            |16|16631.45|15719.70|
            |32|28989.79|27482.44|
            |64|36996.12|34843.23|
            |128|37287.05|35172.49|
            |256|38038.01|35896.52|
            |512|38360.71|36567.07|
            |1024|38265.25|36328.62|
            |2048|39990.85|36328.62|
            |4096|41988.71|39032.83|

            (also see the graph https://docs.google.com/spreadsheets/d/1VqRYCwF4QATCKEKwxKuCcvRXKm2GWnICCdv4fSo4_IU/edit#gid=0)
            It is not as big as I thought initially (somehow I counted 15%, and it is more like 5%), but it is consistent starting from 64 users, and keeping until the 4K . So it needs investigating.
            It is odd because the test itself is not CPU bound, nor IO bound, the bottleneck in my understanding is waiting for flush in log_write_up_to() . I did not try to make it CPU bound by relaxing durability, this might be another exercise, however with increased TPS purging can become more of a bottleneck, and disturb the picture.

            I tried to do some initial profiling (based on CPU sampling, and this is what showed up)

            Individual functions (or, exclusive sample percentage with threshold 1% difference)

            ||Comparison Column|| Delta|| Baseline Value|| Comparison Value||
            |ut_delay| {color:red}4.02{color}| 16.05| 20.07|
            |l_find| {color:red}2.44{color}| 0.16 |2.60|
            |PolicyMutex<TTASEventMutex<GenericPolicy> >::enter| {color:red}2.42{color}|0.18| 2.60|
            |TTASEventMutex<GenericPolicy>::enter| {color:green}-4.42{color}| 4.42| 0.00|
            |SleepConditionVariableCS|{color:green}-5.41{color}| 28.70| 23.29|

            In short, more busy wait _ut_delay_, less lazy wait _SleepConditionVariableCS_ . There is a new quite visible function (4th most expensive individual functions now), l_find , which seems to come from replacing stl with custom hashtable in innodb.


            Comparing inclusive samples with 1% difference threshold, gives the below table, which gives me a vague idea that there something is more expensive in purge, and again that lazy wait was replaced with busy wait somewhere, background threads probably use more CPU , for example "coordinator" purge thread. foreground use less of it (as shown by decreased time in do_command for example)

            ||Comparison Column ||Delta||Baseline Value|| Comparison Value||
            |PolicyMutex<TTASEventMutex<GenericPolicy> >::enter|28.69|2.53|31.22
            |ut_delay|4.02|16.05|20.07
            |trx_purge|3.81|5.67|9.48
            |srv_do_purge|3.80|5.70|9.49
            |row_purge|3.66|0.00|3.67
            |row_purge_step|3.47|0.71|4.18
            |que_thr_step|3.31|0.94|4.25
            |srv_purge_coordinator_thread|3.25|6.37|9.63
            |trx_sys_t::clone_oldest_view|3.18|0.00|3.18
            |btr_cur_search_to_nth_level_func|3.12|0.00|3.12
            |row_purge_record_func|2.94|0.00|2.94
            |row_purge_upd_exist_or_extern_func|2.77|0.00|2.77
            |row_purge_reset_trx_id|2.72|0.00|2.72
            |que_run_threads_low|2.70|1.73|4.43
            |que_run_threads|2.69|1.75|4.45
            |l_find|2.46|0.18|2.64
            |ReadView::snapshot|2.30|0.00|2.30
            |rw_trx_hash_t::iterate|2.27|0.00|2.27
            |lf_hash_iterate|2.25|0.00|2.25
            |srv_task_execute|2.12|1.95|4.08
            |row_purge_reposition_pcur|1.53|0.00|1.53
            |row_search_on_row_ref|1.33|0.00|1.33
            |btr_pcur_open_low|1.31|0.00|1.31
            |btr_search_guess_on_hash|1.31|1.54|2.85
            |trx_purge_wait_for_workers_to_complete|1.10|1.52|2.62
            |trx_undo_assign_low|1.07|0.00|1.07
            |mtr_t::commit|1.05|2.57|3.62
            |mtr_t::Command::execute|1.01|2.42|3.43
            |srv_worker_thread|-1.09|9.84|8.75
            |sync_array_wait_event|-1.29|9.69|8.40
            |trx_write_serialisation_history|-1.34|2.62|1.28
            |trx_commit_low|-1.39|3.15|1.76
            |innobase_commit|-1.75|47.36|45.61
            |trx_commit|-1.75|4.79|3.04
            |ha_commit_one_phase|-1.76|47.55|45.78
            |commit_one_phase_2|-1.76|47.54|45.77
            |trans_commit_stmt|-1.80|48.68|46.88
            |ha_commit_trans|-1.80|48.65|46.84
            |btr_cur_search_to_nth_level|-1.85|1.85|0.00
            |innobase_commit_ordered_2|-2.04|5.17|3.13
            |trx_commit_for_mysql|-2.05|5.10|3.05
            |innobase_commit_low|-2.05|5.11|3.06
            |Prepared_statement::execute_loop|-2.05|74.58|72.52
            |mysql_stmt_execute_common|-2.14|74.85|72.71
            |mysqld_stmt_execute|-2.15|74.89|72.74
            |mysql_execute_command|-2.21|73.03|70.81
            |trx_undo_assign_undo|-2.24|2.24|0.00
            |Prepared_statement::execute|-2.24|74.03|71.79
            |threadpool_process_request|-2.32|81.71|79.38
            |dispatch_command|-2.33|79.21|76.88
            |do_command|-2.34|81.33|79.00
            |tp_callback|-2.38|82.66|80.28
            |srv_resume_thread|-3.33|7.94|4.61
            |os_event::wait_low|-4.98|28.41|23.43
            |SleepConditionVariableCS|-5.41|28.70|23.29
            |TTASEventMutex<GenericPolicy>::enter|-30.49|30.49|0.00

            I noticed reduced performance numbers when running somewhat large update_no_index sysbench benchmark, when comparing 10.2 to 10.3
            Here is the setup

            * Windows Azure box with 16 virtual CPUs, 32GB RAM and SSD storage (I used Local temporary disk)
            * MariaDB 10.2.14 vs 10.3.7 (recent build of 4a5e23e257e229b548599133dbed5162af9df6d9)
            the relevant part of my.ini file is as follows
            {noformat}
            [mysqld]
            back_log=500
            max_connections=4096
            max_prepared_stmt_count=500000
            table_open_cache=10000
            transaction_isolation=REPEATABLE-READ
            innodb_status_file=0
            innodb_buffer_pool_size=20G
            innodb_log_file_size=15G
            innodb_log_buffer_size=1000M
            innodb-io-capacity=4000
            innodb-io-capacity-max=5000
            innodb_doublewrite=0
            innodb-page-cleaners=1
            innodb-buffer-pool-instances=1
            innodb_adaptive_flushing=0
            innodb_adaptive_flushing_lwm=10
            {noformat}
            * sysbench 0.4 (I mention it because current versions of sysbench do not run on Windows anymore, but we do not need anything from the current versions)

            The update-no-index run on a single large table (50 mio rows), with number of users ranging from 1 to 4096, in powers of 2

            *Loading table, sysbench prepare*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=1 prepare
            {noformat}
             
            *sysbench run*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=%N% run
            {noformat}

            where %N% is 1,2,4,8,16,32,64,128,256,512,1024,2048,4096

            The test takes a nap of 80 seconds between runs, and also between load and first test, and as far as I can see it is enough for all Innodb backgroup IO activity to finish (MB/sec goes down to 0 on a disk where test is ran). Due to innodb-flush-log-at-trx-commit being default (1), the test does not appear to be very CPU bound (sysbench and mysql use 70% of all available 16 CPUs, in windows measurement, i.e over 10 CPUs are busy)

            Below are the results from the runs, for 10.2.4 and recent 10.3.7

            ||Users|| TPS 10.2.14|| TPS 10.3.7
            |1|1549.92|1547.75|
            |2|2373.89|2209.05|
            |4|4466.03|4276.46|
            |8|9069.82|8587.59|
            |16|16631.45|15719.70|
            |32|28989.79|27482.44|
            |64|36996.12|34843.23|
            |128|37287.05|35172.49|
            |256|38038.01|35896.52|
            |512|38360.71|36567.07|
            |1024|38265.25|36328.62|
            |2048|39990.85|36328.62|
            |4096|41988.71|39032.83|

            (also see the graph https://docs.google.com/spreadsheets/d/1VqRYCwF4QATCKEKwxKuCcvRXKm2GWnICCdv4fSo4_IU/edit#gid=0)
            It is not as big as I thought initially (somehow I counted 15%, and it is more like 5%), but it is consistent starting from 64 users, and keeping until the 4K . So it needs investigating.
            It is odd because the test itself is not CPU bound, nor IO bound, the bottleneck in my understanding is waiting for flush in log_write_up_to() . I did not try to make it CPU bound by relaxing durability, this might be another exercise, however with increased TPS purging can become more of a bottleneck, and disturb the picture.

            I tried to do some initial profiling (based on CPU sampling), and this is what showed up.

            Below., in all places, *baseline* refers to 10.2.14, and *comparison" is 10.3.7

            Individual functions (or, exclusive sample percentage with threshold 1% difference)

            ||Comparison Column|| Delta|| Baseline Value|| Comparison Value||
            |ut_delay| {color:red}4.02{color}| 16.05| 20.07|
            |l_find| {color:red}2.44{color}| 0.16 |2.60|
            |PolicyMutex<TTASEventMutex<GenericPolicy> >::enter| {color:red}2.42{color}|0.18| 2.60|
            |TTASEventMutex<GenericPolicy>::enter| {color:green}-4.42{color}| 4.42| 0.00|
            |SleepConditionVariableCS|{color:green}-5.41{color}| 28.70| 23.29|

            In short, more busy wait _ut_delay_, less lazy wait _SleepConditionVariableCS_ . There is a new quite visible function (4th most expensive individual functions now), l_find , which seems to come from replacing stl with custom hashtable in innodb.


            Comparing inclusive samples with 1% difference threshold, gives the below table, which gives me a vague idea that there something is more expensive in purge, and again that lazy wait was replaced with busy wait somewhere, background threads probably use more CPU , for example "coordinator" purge thread. foreground use less of it (as shown by decreased time in do_command for example)

            ||Comparison Column ||Delta||Baseline Value|| Comparison Value||
            |PolicyMutex<TTASEventMutex<GenericPolicy> >::enter|28.69|2.53|31.22
            |ut_delay|4.02|16.05|20.07
            |trx_purge|3.81|5.67|9.48
            |srv_do_purge|3.80|5.70|9.49
            |row_purge|3.66|0.00|3.67
            |row_purge_step|3.47|0.71|4.18
            |que_thr_step|3.31|0.94|4.25
            |srv_purge_coordinator_thread|3.25|6.37|9.63
            |trx_sys_t::clone_oldest_view|3.18|0.00|3.18
            |btr_cur_search_to_nth_level_func|3.12|0.00|3.12
            |row_purge_record_func|2.94|0.00|2.94
            |row_purge_upd_exist_or_extern_func|2.77|0.00|2.77
            |row_purge_reset_trx_id|2.72|0.00|2.72
            |que_run_threads_low|2.70|1.73|4.43
            |que_run_threads|2.69|1.75|4.45
            |l_find|2.46|0.18|2.64
            |ReadView::snapshot|2.30|0.00|2.30
            |rw_trx_hash_t::iterate|2.27|0.00|2.27
            |lf_hash_iterate|2.25|0.00|2.25
            |srv_task_execute|2.12|1.95|4.08
            |row_purge_reposition_pcur|1.53|0.00|1.53
            |row_search_on_row_ref|1.33|0.00|1.33
            |btr_pcur_open_low|1.31|0.00|1.31
            |btr_search_guess_on_hash|1.31|1.54|2.85
            |trx_purge_wait_for_workers_to_complete|1.10|1.52|2.62
            |trx_undo_assign_low|1.07|0.00|1.07
            |mtr_t::commit|1.05|2.57|3.62
            |mtr_t::Command::execute|1.01|2.42|3.43
            |srv_worker_thread|-1.09|9.84|8.75
            |sync_array_wait_event|-1.29|9.69|8.40
            |trx_write_serialisation_history|-1.34|2.62|1.28
            |trx_commit_low|-1.39|3.15|1.76
            |innobase_commit|-1.75|47.36|45.61
            |trx_commit|-1.75|4.79|3.04
            |ha_commit_one_phase|-1.76|47.55|45.78
            |commit_one_phase_2|-1.76|47.54|45.77
            |trans_commit_stmt|-1.80|48.68|46.88
            |ha_commit_trans|-1.80|48.65|46.84
            |btr_cur_search_to_nth_level|-1.85|1.85|0.00
            |innobase_commit_ordered_2|-2.04|5.17|3.13
            |trx_commit_for_mysql|-2.05|5.10|3.05
            |innobase_commit_low|-2.05|5.11|3.06
            |Prepared_statement::execute_loop|-2.05|74.58|72.52
            |mysql_stmt_execute_common|-2.14|74.85|72.71
            |mysqld_stmt_execute|-2.15|74.89|72.74
            |mysql_execute_command|-2.21|73.03|70.81
            |trx_undo_assign_undo|-2.24|2.24|0.00
            |Prepared_statement::execute|-2.24|74.03|71.79
            |threadpool_process_request|-2.32|81.71|79.38
            |dispatch_command|-2.33|79.21|76.88
            |do_command|-2.34|81.33|79.00
            |tp_callback|-2.38|82.66|80.28
            |srv_resume_thread|-3.33|7.94|4.61
            |os_event::wait_low|-4.98|28.41|23.43
            |SleepConditionVariableCS|-5.41|28.70|23.29
            |TTASEventMutex<GenericPolicy>::enter|-30.49|30.49|0.00

            wlad Vladislav Vaintroub made changes -
            Description I noticed reduced performance numbers when running somewhat large update_no_index sysbench benchmark, when comparing 10.2 to 10.3
            Here is the setup

            * Windows Azure box with 16 virtual CPUs, 32GB RAM and SSD storage (I used Local temporary disk)
            * MariaDB 10.2.14 vs 10.3.7 (recent build of 4a5e23e257e229b548599133dbed5162af9df6d9)
            the relevant part of my.ini file is as follows
            {noformat}
            [mysqld]
            back_log=500
            max_connections=4096
            max_prepared_stmt_count=500000
            table_open_cache=10000
            transaction_isolation=REPEATABLE-READ
            innodb_status_file=0
            innodb_buffer_pool_size=20G
            innodb_log_file_size=15G
            innodb_log_buffer_size=1000M
            innodb-io-capacity=4000
            innodb-io-capacity-max=5000
            innodb_doublewrite=0
            innodb-page-cleaners=1
            innodb-buffer-pool-instances=1
            innodb_adaptive_flushing=0
            innodb_adaptive_flushing_lwm=10
            {noformat}
            * sysbench 0.4 (I mention it because current versions of sysbench do not run on Windows anymore, but we do not need anything from the current versions)

            The update-no-index run on a single large table (50 mio rows), with number of users ranging from 1 to 4096, in powers of 2

            *Loading table, sysbench prepare*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=1 prepare
            {noformat}
             
            *sysbench run*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=%N% run
            {noformat}

            where %N% is 1,2,4,8,16,32,64,128,256,512,1024,2048,4096

            The test takes a nap of 80 seconds between runs, and also between load and first test, and as far as I can see it is enough for all Innodb backgroup IO activity to finish (MB/sec goes down to 0 on a disk where test is ran). Due to innodb-flush-log-at-trx-commit being default (1), the test does not appear to be very CPU bound (sysbench and mysql use 70% of all available 16 CPUs, in windows measurement, i.e over 10 CPUs are busy)

            Below are the results from the runs, for 10.2.4 and recent 10.3.7

            ||Users|| TPS 10.2.14|| TPS 10.3.7
            |1|1549.92|1547.75|
            |2|2373.89|2209.05|
            |4|4466.03|4276.46|
            |8|9069.82|8587.59|
            |16|16631.45|15719.70|
            |32|28989.79|27482.44|
            |64|36996.12|34843.23|
            |128|37287.05|35172.49|
            |256|38038.01|35896.52|
            |512|38360.71|36567.07|
            |1024|38265.25|36328.62|
            |2048|39990.85|36328.62|
            |4096|41988.71|39032.83|

            (also see the graph https://docs.google.com/spreadsheets/d/1VqRYCwF4QATCKEKwxKuCcvRXKm2GWnICCdv4fSo4_IU/edit#gid=0)
            It is not as big as I thought initially (somehow I counted 15%, and it is more like 5%), but it is consistent starting from 64 users, and keeping until the 4K . So it needs investigating.
            It is odd because the test itself is not CPU bound, nor IO bound, the bottleneck in my understanding is waiting for flush in log_write_up_to() . I did not try to make it CPU bound by relaxing durability, this might be another exercise, however with increased TPS purging can become more of a bottleneck, and disturb the picture.

            I tried to do some initial profiling (based on CPU sampling), and this is what showed up.

            Below., in all places, *baseline* refers to 10.2.14, and *comparison" is 10.3.7

            Individual functions (or, exclusive sample percentage with threshold 1% difference)

            ||Comparison Column|| Delta|| Baseline Value|| Comparison Value||
            |ut_delay| {color:red}4.02{color}| 16.05| 20.07|
            |l_find| {color:red}2.44{color}| 0.16 |2.60|
            |PolicyMutex<TTASEventMutex<GenericPolicy> >::enter| {color:red}2.42{color}|0.18| 2.60|
            |TTASEventMutex<GenericPolicy>::enter| {color:green}-4.42{color}| 4.42| 0.00|
            |SleepConditionVariableCS|{color:green}-5.41{color}| 28.70| 23.29|

            In short, more busy wait _ut_delay_, less lazy wait _SleepConditionVariableCS_ . There is a new quite visible function (4th most expensive individual functions now), l_find , which seems to come from replacing stl with custom hashtable in innodb.


            Comparing inclusive samples with 1% difference threshold, gives the below table, which gives me a vague idea that there something is more expensive in purge, and again that lazy wait was replaced with busy wait somewhere, background threads probably use more CPU , for example "coordinator" purge thread. foreground use less of it (as shown by decreased time in do_command for example)

            ||Comparison Column ||Delta||Baseline Value|| Comparison Value||
            |PolicyMutex<TTASEventMutex<GenericPolicy> >::enter|28.69|2.53|31.22
            |ut_delay|4.02|16.05|20.07
            |trx_purge|3.81|5.67|9.48
            |srv_do_purge|3.80|5.70|9.49
            |row_purge|3.66|0.00|3.67
            |row_purge_step|3.47|0.71|4.18
            |que_thr_step|3.31|0.94|4.25
            |srv_purge_coordinator_thread|3.25|6.37|9.63
            |trx_sys_t::clone_oldest_view|3.18|0.00|3.18
            |btr_cur_search_to_nth_level_func|3.12|0.00|3.12
            |row_purge_record_func|2.94|0.00|2.94
            |row_purge_upd_exist_or_extern_func|2.77|0.00|2.77
            |row_purge_reset_trx_id|2.72|0.00|2.72
            |que_run_threads_low|2.70|1.73|4.43
            |que_run_threads|2.69|1.75|4.45
            |l_find|2.46|0.18|2.64
            |ReadView::snapshot|2.30|0.00|2.30
            |rw_trx_hash_t::iterate|2.27|0.00|2.27
            |lf_hash_iterate|2.25|0.00|2.25
            |srv_task_execute|2.12|1.95|4.08
            |row_purge_reposition_pcur|1.53|0.00|1.53
            |row_search_on_row_ref|1.33|0.00|1.33
            |btr_pcur_open_low|1.31|0.00|1.31
            |btr_search_guess_on_hash|1.31|1.54|2.85
            |trx_purge_wait_for_workers_to_complete|1.10|1.52|2.62
            |trx_undo_assign_low|1.07|0.00|1.07
            |mtr_t::commit|1.05|2.57|3.62
            |mtr_t::Command::execute|1.01|2.42|3.43
            |srv_worker_thread|-1.09|9.84|8.75
            |sync_array_wait_event|-1.29|9.69|8.40
            |trx_write_serialisation_history|-1.34|2.62|1.28
            |trx_commit_low|-1.39|3.15|1.76
            |innobase_commit|-1.75|47.36|45.61
            |trx_commit|-1.75|4.79|3.04
            |ha_commit_one_phase|-1.76|47.55|45.78
            |commit_one_phase_2|-1.76|47.54|45.77
            |trans_commit_stmt|-1.80|48.68|46.88
            |ha_commit_trans|-1.80|48.65|46.84
            |btr_cur_search_to_nth_level|-1.85|1.85|0.00
            |innobase_commit_ordered_2|-2.04|5.17|3.13
            |trx_commit_for_mysql|-2.05|5.10|3.05
            |innobase_commit_low|-2.05|5.11|3.06
            |Prepared_statement::execute_loop|-2.05|74.58|72.52
            |mysql_stmt_execute_common|-2.14|74.85|72.71
            |mysqld_stmt_execute|-2.15|74.89|72.74
            |mysql_execute_command|-2.21|73.03|70.81
            |trx_undo_assign_undo|-2.24|2.24|0.00
            |Prepared_statement::execute|-2.24|74.03|71.79
            |threadpool_process_request|-2.32|81.71|79.38
            |dispatch_command|-2.33|79.21|76.88
            |do_command|-2.34|81.33|79.00
            |tp_callback|-2.38|82.66|80.28
            |srv_resume_thread|-3.33|7.94|4.61
            |os_event::wait_low|-4.98|28.41|23.43
            |SleepConditionVariableCS|-5.41|28.70|23.29
            |TTASEventMutex<GenericPolicy>::enter|-30.49|30.49|0.00

            I noticed reduced performance numbers when running somewhat large update_no_index sysbench benchmark, when comparing 10.2 to 10.3
            Here is the setup

            * Windows Azure box with 16 virtual CPUs, 32GB RAM and SSD storage (I used Local temporary disk)
            * MariaDB 10.2.14 vs 10.3.7 (recent build of 4a5e23e257e229b548599133dbed5162af9df6d9)
            the relevant part of my.ini file is as follows
            {noformat}
            [mysqld]
            back_log=500
            max_connections=4096
            max_prepared_stmt_count=500000
            table_open_cache=10000
            transaction_isolation=REPEATABLE-READ
            innodb_status_file=0
            innodb_buffer_pool_size=20G
            innodb_log_file_size=15G
            innodb_log_buffer_size=1000M
            innodb-io-capacity=4000
            innodb-io-capacity-max=5000
            innodb_doublewrite=0
            innodb-page-cleaners=1
            innodb-buffer-pool-instances=1
            innodb_adaptive_flushing=0
            innodb_adaptive_flushing_lwm=10
            {noformat}
            * sysbench 0.4 (I mention it because current versions of sysbench do not run on Windows anymore, but we do not need anything from the current versions)

            The update-no-index run on a single large table (50 mio rows), with number of users ranging from 1 to 4096, in powers of 2

            *Loading table, sysbench prepare*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=1 prepare
            {noformat}
             
            *sysbench run*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=%N% run
            {noformat}

            where %N% is 1,2,4,8,16,32,64,128,256,512,1024,2048,4096

            The test takes a nap of 80 seconds between runs, and also between load and first test, and as far as I can see it is enough for all Innodb backgroup IO activity to finish (MB/sec goes down to 0 on a disk where test is ran). Due to innodb-flush-log-at-trx-commit being default (1), the test does not appear to be very CPU bound (sysbench and mysql use 70% of all available 16 CPUs, in windows measurement, i.e over 10 CPUs are busy)

            Below are the results from the runs, for 10.2.4 and recent 10.3.7

            ||Users|| TPS 10.2.14|| TPS 10.3.7
            |1|1549.92|1547.75|
            |2|2373.89|2209.05|
            |4|4466.03|4276.46|
            |8|9069.82|8587.59|
            |16|16631.45|15719.70|
            |32|28989.79|27482.44|
            |64|36996.12|34843.23|
            |128|37287.05|35172.49|
            |256|38038.01|35896.52|
            |512|38360.71|36567.07|
            |1024|38265.25|36328.62|
            |2048|39990.85|36328.62|
            |4096|41988.71|39032.83|

            (also see the graph https://docs.google.com/spreadsheets/d/1VqRYCwF4QATCKEKwxKuCcvRXKm2GWnICCdv4fSo4_IU/edit#gid=0)
            It is not as big as I thought initially (somehow I counted 15%, and it is more like 5%), but it is consistent starting from 64 users, and keeping until the 4K . So it needs investigating.
            It is odd because the test itself is not CPU bound, nor IO bound, the bottleneck in my understanding is waiting for flush in log_write_up_to() . I did not try to make it CPU bound by relaxing durability, this might be another exercise, however with increased TPS purging can become more of a bottleneck, and disturb the picture.

            I tried to do some initial profiling (based on CPU sampling), and this is what showed up.

            Below., in all places, *baseline* refers to 10.2.14, and *comparison" is 10.3.7

            *Exclusive samples comparison*
            Individual functions (or, exclusive sample percentage with threshold 1% difference)

            ||Comparison Column|| Delta|| Baseline Value|| Comparison Value||
            |ut_delay| {color:red}4.02{color}| 16.05| 20.07|
            |l_find| {color:red}2.44{color}| 0.16 |2.60|
            |PolicyMutex<TTASEventMutex<GenericPolicy> >::enter| {color:red}2.42{color}|0.18| 2.60|
            |TTASEventMutex<GenericPolicy>::enter| {color:green}-4.42{color}| 4.42| 0.00|
            |SleepConditionVariableCS|{color:green}-5.41{color}| 28.70| 23.29|

            In short, more busy wait _ut_delay_, less lazy wait _SleepConditionVariableCS_ . There is a new quite visible function (4th most expensive individual functions now), l_find , which seems to come from replacing stl with custom hashtable in innodb.
            *
            Inclusive samples comparison*
            Comparing inclusive samples with 1% difference threshold, gives the below table, which gives me a vague idea that there something is more expensive in purge, and again that lazy wait was replaced with busy wait somewhere, background threads probably use more CPU , for example "coordinator" purge thread. foreground use less of it (as shown by decreased time in do_command for example)

            ||Comparison Column ||Delta||Baseline Value|| Comparison Value||
            |PolicyMutex<TTASEventMutex<GenericPolicy> >::enter|28.69|2.53|31.22
            |ut_delay|4.02|16.05|20.07
            |trx_purge|3.81|5.67|9.48
            |srv_do_purge|3.80|5.70|9.49
            |row_purge|3.66|0.00|3.67
            |row_purge_step|3.47|0.71|4.18
            |que_thr_step|3.31|0.94|4.25
            |srv_purge_coordinator_thread|3.25|6.37|9.63
            |trx_sys_t::clone_oldest_view|3.18|0.00|3.18
            |btr_cur_search_to_nth_level_func|3.12|0.00|3.12
            |row_purge_record_func|2.94|0.00|2.94
            |row_purge_upd_exist_or_extern_func|2.77|0.00|2.77
            |row_purge_reset_trx_id|2.72|0.00|2.72
            |que_run_threads_low|2.70|1.73|4.43
            |que_run_threads|2.69|1.75|4.45
            |l_find|2.46|0.18|2.64
            |ReadView::snapshot|2.30|0.00|2.30
            |rw_trx_hash_t::iterate|2.27|0.00|2.27
            |lf_hash_iterate|2.25|0.00|2.25
            |srv_task_execute|2.12|1.95|4.08
            |row_purge_reposition_pcur|1.53|0.00|1.53
            |row_search_on_row_ref|1.33|0.00|1.33
            |btr_pcur_open_low|1.31|0.00|1.31
            |btr_search_guess_on_hash|1.31|1.54|2.85
            |trx_purge_wait_for_workers_to_complete|1.10|1.52|2.62
            |trx_undo_assign_low|1.07|0.00|1.07
            |mtr_t::commit|1.05|2.57|3.62
            |mtr_t::Command::execute|1.01|2.42|3.43
            |srv_worker_thread|-1.09|9.84|8.75
            |sync_array_wait_event|-1.29|9.69|8.40
            |trx_write_serialisation_history|-1.34|2.62|1.28
            |trx_commit_low|-1.39|3.15|1.76
            |innobase_commit|-1.75|47.36|45.61
            |trx_commit|-1.75|4.79|3.04
            |ha_commit_one_phase|-1.76|47.55|45.78
            |commit_one_phase_2|-1.76|47.54|45.77
            |trans_commit_stmt|-1.80|48.68|46.88
            |ha_commit_trans|-1.80|48.65|46.84
            |btr_cur_search_to_nth_level|-1.85|1.85|0.00
            |innobase_commit_ordered_2|-2.04|5.17|3.13
            |trx_commit_for_mysql|-2.05|5.10|3.05
            |innobase_commit_low|-2.05|5.11|3.06
            |Prepared_statement::execute_loop|-2.05|74.58|72.52
            |mysql_stmt_execute_common|-2.14|74.85|72.71
            |mysqld_stmt_execute|-2.15|74.89|72.74
            |mysql_execute_command|-2.21|73.03|70.81
            |trx_undo_assign_undo|-2.24|2.24|0.00
            |Prepared_statement::execute|-2.24|74.03|71.79
            |threadpool_process_request|-2.32|81.71|79.38
            |dispatch_command|-2.33|79.21|76.88
            |do_command|-2.34|81.33|79.00
            |tp_callback|-2.38|82.66|80.28
            |srv_resume_thread|-3.33|7.94|4.61
            |os_event::wait_low|-4.98|28.41|23.43
            |SleepConditionVariableCS|-5.41|28.70|23.29
            |TTASEventMutex<GenericPolicy>::enter|-30.49|30.49|0.00

            wlad Vladislav Vaintroub made changes -
            Description I noticed reduced performance numbers when running somewhat large update_no_index sysbench benchmark, when comparing 10.2 to 10.3
            Here is the setup

            * Windows Azure box with 16 virtual CPUs, 32GB RAM and SSD storage (I used Local temporary disk)
            * MariaDB 10.2.14 vs 10.3.7 (recent build of 4a5e23e257e229b548599133dbed5162af9df6d9)
            the relevant part of my.ini file is as follows
            {noformat}
            [mysqld]
            back_log=500
            max_connections=4096
            max_prepared_stmt_count=500000
            table_open_cache=10000
            transaction_isolation=REPEATABLE-READ
            innodb_status_file=0
            innodb_buffer_pool_size=20G
            innodb_log_file_size=15G
            innodb_log_buffer_size=1000M
            innodb-io-capacity=4000
            innodb-io-capacity-max=5000
            innodb_doublewrite=0
            innodb-page-cleaners=1
            innodb-buffer-pool-instances=1
            innodb_adaptive_flushing=0
            innodb_adaptive_flushing_lwm=10
            {noformat}
            * sysbench 0.4 (I mention it because current versions of sysbench do not run on Windows anymore, but we do not need anything from the current versions)

            The update-no-index run on a single large table (50 mio rows), with number of users ranging from 1 to 4096, in powers of 2

            *Loading table, sysbench prepare*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=1 prepare
            {noformat}
             
            *sysbench run*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=%N% run
            {noformat}

            where %N% is 1,2,4,8,16,32,64,128,256,512,1024,2048,4096

            The test takes a nap of 80 seconds between runs, and also between load and first test, and as far as I can see it is enough for all Innodb backgroup IO activity to finish (MB/sec goes down to 0 on a disk where test is ran). Due to innodb-flush-log-at-trx-commit being default (1), the test does not appear to be very CPU bound (sysbench and mysql use 70% of all available 16 CPUs, in windows measurement, i.e over 10 CPUs are busy)

            Below are the results from the runs, for 10.2.4 and recent 10.3.7

            ||Users|| TPS 10.2.14|| TPS 10.3.7
            |1|1549.92|1547.75|
            |2|2373.89|2209.05|
            |4|4466.03|4276.46|
            |8|9069.82|8587.59|
            |16|16631.45|15719.70|
            |32|28989.79|27482.44|
            |64|36996.12|34843.23|
            |128|37287.05|35172.49|
            |256|38038.01|35896.52|
            |512|38360.71|36567.07|
            |1024|38265.25|36328.62|
            |2048|39990.85|36328.62|
            |4096|41988.71|39032.83|

            (also see the graph https://docs.google.com/spreadsheets/d/1VqRYCwF4QATCKEKwxKuCcvRXKm2GWnICCdv4fSo4_IU/edit#gid=0)
            It is not as big as I thought initially (somehow I counted 15%, and it is more like 5%), but it is consistent starting from 64 users, and keeping until the 4K . So it needs investigating.
            It is odd because the test itself is not CPU bound, nor IO bound, the bottleneck in my understanding is waiting for flush in log_write_up_to() . I did not try to make it CPU bound by relaxing durability, this might be another exercise, however with increased TPS purging can become more of a bottleneck, and disturb the picture.

            I tried to do some initial profiling (based on CPU sampling), and this is what showed up.

            Below., in all places, *baseline* refers to 10.2.14, and *comparison" is 10.3.7

            *Exclusive samples comparison*
            Individual functions (or, exclusive sample percentage with threshold 1% difference)

            ||Comparison Column|| Delta|| Baseline Value|| Comparison Value||
            |ut_delay| {color:red}4.02{color}| 16.05| 20.07|
            |l_find| {color:red}2.44{color}| 0.16 |2.60|
            |PolicyMutex<TTASEventMutex<GenericPolicy> >::enter| {color:red}2.42{color}|0.18| 2.60|
            |TTASEventMutex<GenericPolicy>::enter| {color:green}-4.42{color}| 4.42| 0.00|
            |SleepConditionVariableCS|{color:green}-5.41{color}| 28.70| 23.29|

            In short, more busy wait _ut_delay_, less lazy wait _SleepConditionVariableCS_ . There is a new quite visible function (4th most expensive individual functions now), l_find , which seems to come from replacing stl with custom hashtable in innodb.
            *
            Inclusive samples comparison*
            Comparing inclusive samples with 1% difference threshold, gives the below table, which gives me a vague idea that there something is more expensive in purge, and again that lazy wait was replaced with busy wait somewhere, background threads probably use more CPU , for example "coordinator" purge thread. foreground use less of it (as shown by decreased time in do_command for example)

            ||Comparison Column ||Delta||Baseline Value|| Comparison Value||
            |PolicyMutex<TTASEventMutex<GenericPolicy> >::enter|28.69|2.53|31.22
            |ut_delay|4.02|16.05|20.07
            |trx_purge|3.81|5.67|9.48
            |srv_do_purge|3.80|5.70|9.49
            |row_purge|3.66|0.00|3.67
            |row_purge_step|3.47|0.71|4.18
            |que_thr_step|3.31|0.94|4.25
            |srv_purge_coordinator_thread|3.25|6.37|9.63
            |trx_sys_t::clone_oldest_view|3.18|0.00|3.18
            |btr_cur_search_to_nth_level_func|3.12|0.00|3.12
            |row_purge_record_func|2.94|0.00|2.94
            |row_purge_upd_exist_or_extern_func|2.77|0.00|2.77
            |row_purge_reset_trx_id|2.72|0.00|2.72
            |que_run_threads_low|2.70|1.73|4.43
            |que_run_threads|2.69|1.75|4.45
            |l_find|2.46|0.18|2.64
            |ReadView::snapshot|2.30|0.00|2.30
            |rw_trx_hash_t::iterate|2.27|0.00|2.27
            |lf_hash_iterate|2.25|0.00|2.25
            |srv_task_execute|2.12|1.95|4.08
            |row_purge_reposition_pcur|1.53|0.00|1.53
            |row_search_on_row_ref|1.33|0.00|1.33
            |btr_pcur_open_low|1.31|0.00|1.31
            |btr_search_guess_on_hash|1.31|1.54|2.85
            |trx_purge_wait_for_workers_to_complete|1.10|1.52|2.62
            |trx_undo_assign_low|1.07|0.00|1.07
            |mtr_t::commit|1.05|2.57|3.62
            |mtr_t::Command::execute|1.01|2.42|3.43
            |srv_worker_thread|-1.09|9.84|8.75
            |sync_array_wait_event|-1.29|9.69|8.40
            |trx_write_serialisation_history|-1.34|2.62|1.28
            |trx_commit_low|-1.39|3.15|1.76
            |innobase_commit|-1.75|47.36|45.61
            |trx_commit|-1.75|4.79|3.04
            |ha_commit_one_phase|-1.76|47.55|45.78
            |commit_one_phase_2|-1.76|47.54|45.77
            |trans_commit_stmt|-1.80|48.68|46.88
            |ha_commit_trans|-1.80|48.65|46.84
            |btr_cur_search_to_nth_level|-1.85|1.85|0.00
            |innobase_commit_ordered_2|-2.04|5.17|3.13
            |trx_commit_for_mysql|-2.05|5.10|3.05
            |innobase_commit_low|-2.05|5.11|3.06
            |Prepared_statement::execute_loop|-2.05|74.58|72.52
            |mysql_stmt_execute_common|-2.14|74.85|72.71
            |mysqld_stmt_execute|-2.15|74.89|72.74
            |mysql_execute_command|-2.21|73.03|70.81
            |trx_undo_assign_undo|-2.24|2.24|0.00
            |Prepared_statement::execute|-2.24|74.03|71.79
            |threadpool_process_request|-2.32|81.71|79.38
            |dispatch_command|-2.33|79.21|76.88
            |do_command|-2.34|81.33|79.00
            |tp_callback|-2.38|82.66|80.28
            |srv_resume_thread|-3.33|7.94|4.61
            |os_event::wait_low|-4.98|28.41|23.43
            |SleepConditionVariableCS|-5.41|28.70|23.29
            |TTASEventMutex<GenericPolicy>::enter|-30.49|30.49|0.00

            I noticed reduced performance numbers when running somewhat large update_no_index sysbench benchmark, when comparing 10.2 to 10.3
            Here is the setup

            * Windows Azure box with 16 virtual CPUs, 32GB RAM and SSD storage (I used Local temporary disk)
            * MariaDB 10.2.14 vs 10.3.7 (recent build of 4a5e23e257e229b548599133dbed5162af9df6d9)
            the relevant part of my.ini file is as follows
            {noformat}
            [mysqld]
            back_log=500
            max_connections=4096
            max_prepared_stmt_count=500000
            table_open_cache=10000
            transaction_isolation=REPEATABLE-READ
            innodb_status_file=0
            innodb_buffer_pool_size=20G
            innodb_log_file_size=15G
            innodb_log_buffer_size=1000M
            innodb-io-capacity=4000
            innodb-io-capacity-max=5000
            innodb_doublewrite=0
            innodb-page-cleaners=1
            innodb-buffer-pool-instances=1
            innodb_adaptive_flushing=0
            innodb_adaptive_flushing_lwm=10
            {noformat}
            * sysbench 0.4 (I mention it because current versions of sysbench do not run on Windows anymore, but we do not need anything from the current versions)

            The update-no-index run on a single large table (50 mio rows), with number of users ranging from 1 to 4096, in powers of 2

            *Loading table, sysbench prepare*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=1 prepare
            {noformat}
             
            *sysbench run*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=%N% run
            {noformat}

            where %N% is 1,2,4,8,16,32,64,128,256,512,1024,2048,4096

            The test takes a nap of 80 seconds between runs, and also between load and first test, and as far as I can see it is enough for all Innodb backgroup IO activity to finish (MB/sec goes down to 0 on a disk where test is ran). Due to innodb-flush-log-at-trx-commit being default (1), the test does not appear to be very CPU bound (sysbench and mysql use 70% of all available 16 CPUs, in windows measurement, i.e over 10 CPUs are busy)

            Below are the results from the runs, for 10.2.4 and recent 10.3.7

            ||Users|| TPS 10.2.14|| TPS 10.3.7
            |1|1549.92|1547.75|
            |2|2373.89|2209.05|
            |4|4466.03|4276.46|
            |8|9069.82|8587.59|
            |16|16631.45|15719.70|
            |32|28989.79|27482.44|
            |64|36996.12|34843.23|
            |128|37287.05|35172.49|
            |256|38038.01|35896.52|
            |512|38360.71|36567.07|
            |1024|38265.25|36328.62|
            |2048|39990.85|36328.62|
            |4096|41988.71|39032.83|

            (also see the graph https://docs.google.com/spreadsheets/d/1VqRYCwF4QATCKEKwxKuCcvRXKm2GWnICCdv4fSo4_IU/edit#gid=0)
            It is not as big as I thought initially (somehow I counted 15%, and it is more like 5%), but it is consistent starting from 64 users, and keeping until the 4K . So it needs investigating.
            It is odd because the test itself is not CPU bound, nor IO bound, the bottleneck in my understanding is waiting for flush in log_write_up_to() . I did not try to make it CPU bound by relaxing durability, this might be another exercise, however with increased TPS purging can become more of a bottleneck, and disturb the picture.

            I tried to do some initial profiling (based on CPU sampling), and this is what showed up.

            Below., in all places, *baseline* refers to 10.2.14, and *comparison" is 10.3.7

            *Exclusive samples comparison*
            Individual functions (or, exclusive sample percentage with threshold 1% difference)

            ||Comparison Column|| Delta|| Baseline Value|| Comparison Value||
            |ut_delay| {color:red}4.02{color}| 16.05| 20.07|
            |l_find| {color:red}2.44{color}| 0.16 |2.60|
            |PolicyMutex<TTASEventMutex<GenericPolicy> >::enter| {color:red}2.42{color}|0.18| 2.60|
            |TTASEventMutex<GenericPolicy>::enter| {color:green}-4.42{color}| 4.42| 0.00|
            |SleepConditionVariableCS|{color:green}-5.41{color}| 28.70| 23.29|

            In short, more busy wait _ut_delay_, less lazy wait _SleepConditionVariableCS_ . There is a new quite visible function (4th most expensive individual functions now), l_find , which seems to come from replacing stl with custom hashtable in innodb.
            *Inclusive samples comparison*
            Comparing inclusive samples with 1% difference threshold, gives the below table, which gives me a vague idea that there something is more expensive in purge, and again that lazy wait was replaced with busy wait somewhere, background threads probably use more CPU , for example "coordinator" purge thread. foreground use less of it (as shown by decreased time in do_command for example)

            ||Comparison Column ||Delta||Baseline Value|| Comparison Value||
            |PolicyMutex<TTASEventMutex<GenericPolicy> >::enter|28.69|2.53|31.22
            |ut_delay|4.02|16.05|20.07
            |trx_purge|3.81|5.67|9.48
            |srv_do_purge|3.80|5.70|9.49
            |row_purge|3.66|0.00|3.67
            |row_purge_step|3.47|0.71|4.18
            |que_thr_step|3.31|0.94|4.25
            |srv_purge_coordinator_thread|3.25|6.37|9.63
            |trx_sys_t::clone_oldest_view|3.18|0.00|3.18
            |btr_cur_search_to_nth_level_func|3.12|0.00|3.12
            |row_purge_record_func|2.94|0.00|2.94
            |row_purge_upd_exist_or_extern_func|2.77|0.00|2.77
            |row_purge_reset_trx_id|2.72|0.00|2.72
            |que_run_threads_low|2.70|1.73|4.43
            |que_run_threads|2.69|1.75|4.45
            |l_find|2.46|0.18|2.64
            |ReadView::snapshot|2.30|0.00|2.30
            |rw_trx_hash_t::iterate|2.27|0.00|2.27
            |lf_hash_iterate|2.25|0.00|2.25
            |srv_task_execute|2.12|1.95|4.08
            |row_purge_reposition_pcur|1.53|0.00|1.53
            |row_search_on_row_ref|1.33|0.00|1.33
            |btr_pcur_open_low|1.31|0.00|1.31
            |btr_search_guess_on_hash|1.31|1.54|2.85
            |trx_purge_wait_for_workers_to_complete|1.10|1.52|2.62
            |trx_undo_assign_low|1.07|0.00|1.07
            |mtr_t::commit|1.05|2.57|3.62
            |mtr_t::Command::execute|1.01|2.42|3.43
            |srv_worker_thread|-1.09|9.84|8.75
            |sync_array_wait_event|-1.29|9.69|8.40
            |trx_write_serialisation_history|-1.34|2.62|1.28
            |trx_commit_low|-1.39|3.15|1.76
            |innobase_commit|-1.75|47.36|45.61
            |trx_commit|-1.75|4.79|3.04
            |ha_commit_one_phase|-1.76|47.55|45.78
            |commit_one_phase_2|-1.76|47.54|45.77
            |trans_commit_stmt|-1.80|48.68|46.88
            |ha_commit_trans|-1.80|48.65|46.84
            |btr_cur_search_to_nth_level|-1.85|1.85|0.00
            |innobase_commit_ordered_2|-2.04|5.17|3.13
            |trx_commit_for_mysql|-2.05|5.10|3.05
            |innobase_commit_low|-2.05|5.11|3.06
            |Prepared_statement::execute_loop|-2.05|74.58|72.52
            |mysql_stmt_execute_common|-2.14|74.85|72.71
            |mysqld_stmt_execute|-2.15|74.89|72.74
            |mysql_execute_command|-2.21|73.03|70.81
            |trx_undo_assign_undo|-2.24|2.24|0.00
            |Prepared_statement::execute|-2.24|74.03|71.79
            |threadpool_process_request|-2.32|81.71|79.38
            |dispatch_command|-2.33|79.21|76.88
            |do_command|-2.34|81.33|79.00
            |tp_callback|-2.38|82.66|80.28
            |srv_resume_thread|-3.33|7.94|4.61
            |os_event::wait_low|-4.98|28.41|23.43
            |SleepConditionVariableCS|-5.41|28.70|23.29
            |TTASEventMutex<GenericPolicy>::enter|-30.49|30.49|0.00

            wlad Vladislav Vaintroub made changes -
            Attachment 1bp-1pc-10.2.14.txt [ 45632 ]
            Attachment 1bp-1pc-10.3.7.txt [ 45633 ]
            wlad Vladislav Vaintroub made changes -
            Attachment 10.3.pre-lockless.txt [ 45641 ]
            marko Marko Mäkelä made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            marko Marko Mäkelä made changes -
            Status In Progress [ 3 ] Stalled [ 10000 ]
            marko Marko Mäkelä made changes -
            Assignee Marko Mäkelä [ marko ] Sergey Vojtovich [ svoj ]
            wlad Vladislav Vaintroub made changes -
            Attachment winpmp_10_3_7.txt [ 45665 ]
            wlad Vladislav Vaintroub made changes -
            Attachment winpmp_10_2_14.txt [ 45666 ]
            serg Sergei Golubchik made changes -
            Affects Version/s 10.3 [ 22126 ]
            serg Sergei Golubchik made changes -
            Fix Version/s 10.3 [ 22126 ]
            serg Sergei Golubchik made changes -
            Fix Version/s 10.4 [ 22408 ]
            julien.fritsch Julien Fritsch made changes -
            Assignee Sergey Vojtovich [ svoj ] Marko Mäkelä [ marko ]
            wlad Vladislav Vaintroub made changes -
            Fix Version/s N/A [ 14700 ]
            Fix Version/s 10.3 [ 22126 ]
            Fix Version/s 10.4 [ 22408 ]
            Resolution Won't Fix [ 2 ]
            Status Stalled [ 10000 ] Closed [ 6 ]

            People

              Assignee: Marko Mäkelä
              Reporter: Vladislav Vaintroub
              Votes: 1
              Watchers: 8

