[MDEV-10064] performance regression with threadpool Created: 2016-05-13  Updated: 2017-07-25  Resolved: 2017-07-25

Status: Closed
Project: MariaDB Server
Component/s: OTHER
Affects Version/s: 10.0.25, 10.1.14
Fix Version/s: 10.1.25

Type: Bug Priority: Major
Reporter: Axel Schwenke Assignee: Axel Schwenke
Resolution: Not a Bug Votes: 0
Labels: threadpool
Environment:

Ubuntu x86_64


Attachments: Text File one_thread.txt     Text File pool.txt     File threadpool.ods     PNG File tp10.png     PNG File tp1000.png    
Sprint: 10.0.26, 10.0.28, 5.5.55, 10.0.30

 Description   

Enabling the thread pool leads to about 5% performance loss in MariaDB 10.0 and 10.1, but not in MariaDB 5.5. I tested 5.5.49 vs. 10.0.25 vs. 10.1.14.

The benchmark is sysbench OLTP read-only with 1000 point-selects per transaction. The benchmark machine has 16 cores (32 hyperthreads).

my.cnf:

[mysqld]
max_connections = 1300
table_open_cache = 2600
query_cache_type = 0
 
innodb_buffer_pool_size = 512M
innodb_buffer_pool_instances = 10
innodb_adaptive_hash_index_partitions = 20
 
thread_handling=pool-of-threads

See the attached spreadsheet for numbers.
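The original benchmark scripts were not attached to this issue, but a dry-run sketch of a sysbench 0.5 invocation matching the described workload (read-only OLTP, 1000 point-selects per transaction) might look as follows. The table count and size mirror the parameters Vladislav lists in a comment below; everything else is an assumption, so the sketch echoes the command instead of running it:

```shell
# Dry-run sketch only: echoes the command instead of executing it, since
# sysbench 0.5 and a running server would be required for a real run.
# --oltp_point_selects=1000 comes from the description; table count/size
# are taken from the comments below.
THREADS=64   # vary this per data point
cmd="sysbench-0.5 --test=lua/oltp.lua \
--oltp_tables_count=20 --oltp-table-size=50000 \
--num-threads=$THREADS --mysql-db=test \
--oltp-read-only=on --oltp_point_selects=1000 \
--oltp_simple_ranges=0 --oltp_sum_ranges=0 \
--oltp_order_ranges=0 --oltp_distinct_ranges=0 \
--max-time=120 --max-requests=0 run"
echo "$cmd"
```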



 Comments   
Comment by Vladislav Vaintroub [ 2016-05-13 ]

axel, could you share your benchmark scripts too? There are some relevant parameters, such as the number of rows and the number of tables. I'd like to reproduce the results exactly as described.

Comment by Axel Schwenke [ 2016-05-17 ]

I ran a new set of benchmarks, now going up to 4K benchmark threads (128x overloading the machine). I also did another round with only 10 selects per transaction, which is close to the original OLTP workload. In general I see degradation at 4K threads, both with and without thread pool.

Comment by Vladislav Vaintroub [ 2016-06-23 ]

I ran 10 point-selects per transaction, and the results for 10.0.26 were nowhere near those above, on a supposedly similar machine.

In my tests, thread-per-connection crumbles badly after 1K users, while pool-of-threads exceeds the numbers in tp10.png by a wide margin (reaching its max of ~400K qps at concurrency 128-512 and slowly going down to ~300K qps at concurrency 8192; sysbench 0.4 shows even better results).

Granted, I run my tests manually, since I was not able to get the automated scripts running easily, but this should make no difference. I used, I hope, the same benchmark spec: 20 tables, 50,000 rows each (1 million rows overall), with the cnf file and sysbench parameters below. All tests were run on a warm database (after cleanup/prepare).

Here is the sysbench 0.5 call, which I ran for N = 4 ... 8192:

taskset -c 24-31 sysbench-0.5 --test=lua/oltp.lua --oltp_tables_count=20 --oltp-table-size=50000 --num-threads=N --mysql-db=test --oltp-read-only=on --oltp_point_selects=10 --oltp_simple_ranges=0 --oltp_sum_ranges=0 --oltp_order_ranges=0 --oltp_distinct_ranges=0 --report-interval=3 --mysql-socket=/tmp/mysql.sock --max-time=120 --max-requests=0 --mysql-user=root  run

Here is what I have in my.ini (adapted from my.cnf.01, with a higher prepared-statement limit and a larger max_connections):

 
#####non innodb options
max_connections = 25000
table_open_cache = 2600
query_cache_type = 0
max_prepared_stmt_count=1000000
datadir=/mnt/ssd-raid1/sysbench/datadir
 
#####innodb options
innodb_buffer_pool_size = 512M
innodb_buffer_pool_instances = 10
#innodb_adaptive_hash_index_partitions = 20

I start mysqld (pool-of-threads) with

taskset -c 0-23 sql/mysqld --defaults-file=my.cnf.01 --lc-messages-dir=$PWD/sql/share/english --skip-grant-tables --thread-handling=pool-of-threads --thread-pool-size=24 --log-error=mysqld.err

I start mysqld(thread-per-connection) with

taskset -c 0-23 sql/mysqld --defaults-file=my.cnf.01 --lc-messages-dir=$PWD/sql/share/english --skip-grant-tables --log-error=mysqld.err

Taskset is used to produce the best results (and if no taskset is used, results are still about the same)

Anyway, here are the results I get

Concurrency   Pool-of-threads qps   Thread-per-connection qps
          4                 74001                       75662
          8                134601                      137843
         16                231176                      236668
         32                313854                      308241
         64                374751                      351794
        128                401685                      350908
        256                405135                      343079
        512                400087                      314862
       1024                362639                      260873
       2048                329087                       79458
       4096                314969                        4177
       8192                305018                        1884
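To put the tail of the table in perspective, a one-liner over the last rows (values copied straight from the table above) shows how far apart the two models end up at high concurrency:

```shell
# Pool-of-threads vs thread-per-connection QPS ratio at high
# concurrency, using the numbers from the table above.
awk 'BEGIN {
  printf "concurrency 2048: %.1fx\n", 329087 / 79458;  # ~4x
  printf "concurrency 4096: %.1fx\n", 314969 / 4177;   # ~75x
  printf "concurrency 8192: %.1fx\n", 305018 / 1884;   # ~160x
}'
```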

With all the information given above, it should be easily reproducible on another machine, I guess.

I will look into 10.1 tomorrow. Perhaps it was a good XtraDB merge into 10.0 that introduced that difference; there were a lot of changes in os0sync etc. Maybe this turned out to give that boost for the threadpool, I dunno.

Comment by Vladislav Vaintroub [ 2016-06-23 ]

Attached is the raw sysbench output: pool.txt (for the pool-of-threads test) and one_thread.txt (thread-per-connection).

Comment by Vladislav Vaintroub [ 2016-06-23 ]

Ah, also a perhaps relevant detail: this is a default compilation, cmake . && make, i.e. perfschema is compiled in (just in case this is relevant).

Comment by Vladislav Vaintroub [ 2016-06-28 ]

Ok, I measured some more, with and without taskset. One can see what may appear as a very slight regression if taskset is not used, specifically for the threadpool case. But this is a phantom regression. Indeed, as mentioned elsewhere (e.g. in the threadpool documentation, in the section on how to run benchmarks), the benchmark driver seems to take a bigger share of the overall CPU. Concretely, in this case in 10.1, without pinning, you can get a situation where sysbench-0.5 is using 10 CPUs out of 32 while mysqld is using 22 CPUs, as shown by "top". The idle time is 0%; all 32 CPUs are busy. However, mysqld can do more if affinitized (using up to 24 CPUs results in better throughput, but then sysbench needs to be restricted too).

In all of my affinitized tests, threadpool outperforms thread-per-connection (the latter can be affinitized or not). In all of my tests overall, threadpool continues to scale above 1024 concurrent selects.

Either there is something I am doing wrong on my end, or I'd say that the benchmarks were not run properly and the same hardware can do better: threadpool can outperform thread-per-connection in all aspects, including raw throughput, if the benchmark is run using taskset, as mentioned in the threadpool documentation. taskset really makes a visible difference.
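The pinning described above can be wrapped in a small launcher sketch. It assumes the split used in this thread (24 server cores, 8 driver cores on a 32-CPU box) and echoes the commands rather than executing them, since paths and binaries are machine-specific placeholders:

```shell
# Disjoint core ranges so the server and the benchmark driver never
# compete for the same hyperthreads (32 logical CPUs, as in this issue).
SERVER_CPUS="0-23"
DRIVER_CPUS="24-31"

# Echo instead of exec: my.cnf.01 and the binaries are placeholders
# taken from the commands earlier in this thread.
echo "taskset -c $SERVER_CPUS sql/mysqld --defaults-file=my.cnf.01" \
     "--thread-handling=pool-of-threads --thread-pool-size=24"
echo "taskset -c $DRIVER_CPUS sysbench-0.5 --test=lua/oltp.lua ... run"
```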

I shared my results in
https://docs.google.com/spreadsheets/d/12KPobxrP89BzrevPaCoGxGUPnI4kuLWRtTLjTfPJw78/edit#gid=0

axel, I'm reassigning this back. Could you please confirm my findings (in which case, I think, this MDEV can be closed), or tell me whether I am doing something wrong.

I shared details of how I run the benchmarks, including sysbench and mysqld parameters (including the taskset params), in this comment:

https://jira.mariadb.org/browse/MDEV-10064?focusedCommentId=84510&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-84510

Comment by Marko Mäkelä [ 2017-03-01 ]

If the origin of this regression is suspected to be this Percona XtraDB commit, then I presume that the condition

	} else if (free_len > max_free_len / 5) {

should be preserved intact.

Comment by Vladislav Vaintroub [ 2017-03-01 ]

marko, I think the comment belongs to MDEV-10409, not this one.

Comment by Axel Schwenke [ 2017-07-20 ]

Added a test case to the regression test suite to test thread pool behavior for all MariaDB releases, starting with 5.5.

Comment by Axel Schwenke [ 2017-07-25 ]

Could not find any regression with a 16:16 split of the hyperthreads. Performance with threadpool enabled is flat across releases, and performance at high thread counts is slightly better with threadpool enabled vs. one-thread-per-connection.

Generated at Thu Feb 08 07:39:24 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.