[MDEV-14482] Cache line contention on ut_rnd_ulint_counter() Created: 2017-11-23 Updated: 2019-12-10 Resolved: 2018-01-26 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 10.3.2 |
| Fix Version/s: | 10.3.5, 10.4.0, 10.2.31 |
| Type: | Bug | Priority: | Major |
| Reporter: | Sandeep sethia | Assignee: | Sergey Vojtovich |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Ubuntu 16.02 on Arm 64 bit platform |
||
| Issue Links: |
|
||||||||
| Epic Link: | arm64 optimization | ||||||||
| Sprint: | 10.1.31 | ||||||||
| Description |
|
Hi Sergey, I see cache coherency and interconnect traffic issue with respect to the random function generator ut_rnd_ulint_counter() while doing a backoff in spin lock. I was looking at MySQL repository and see that there is a patch https://github.com/mysql/mysql-server/commit/32b184cad3ebd68281fb0076a24a634d2f330aa1 I also see MYSQL backed off it in 5.20 because of performance reason but the changes brought some scalability issue with the patch dedc8b3d567fbb92ce912f1559fe6a08b2857045 Again on Version 8.0 this has been committed again with the patch https://github.com/mysql/mysql-server/commit/32b184cad3ebd68281fb0076a24a634d2f330aa1with some new changes. However I am trying to port this to Mariadb but there a lot of code difference so no way I can port directly and provide a patch to test. I see files my_thread_local.h being not available and if I try to create there are some merge clashes with my_pthread.h . Please let me know your opinion and if required I can share a patch soon with the above issues resolved. |
| Comments |
| Comment by Sergey Vojtovich [ 2017-11-23 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
MySQL switched from global RNG state to per-thread state using thread-local-storage (TLS). But TLS performance as such is far from perfect. If my idea explained in | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Sandeep sethia [ 2017-11-23 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
If I am correct we need to test in enter function ()
instead
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Sergey Vojtovich [ 2017-11-23 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I plan to finish coding tomorrow, then we can try workable version that doesn't confuse anybody. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Sergey Vojtovich [ 2017-11-30 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I did a few experiments with original 10.3 as of 7cb3520c0632ad912b309489ad86a90f9fc9bd0b. oltp_index_updates, 64 threads, ARM. vanilla: 143981 TPS Also there's potential false sharing between srv_spin_wait_delay/srv_n_spin_wait_rounds and statistics variables. I doubt they can contribute much to scalability, still worth checking. mutex (use pthread_mutex instead of ib mutex)
spin (just spin, never delay, never go asleep)
nospin (immediately go asleep, no delay)
nornd_X (different values to ut_delay, either hardcoded or max_delay)
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Sandeep sethia [ 2017-11-30 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Hi , Just to understand from your comment u get best numbers on nornd_1: 160053 TPS I am getting the same number for Mariadb and MySQL for same threads level for 20 tables with 1000000 records. But if put a patch for https://github.com/mysql/mysql-server/commit/32b184cad3ebd68281fb0076a24a634d2f330aa1 in MySQL I see 10-13% increases in TPS .This is consistent across 64,128,256 threads. What is vanilla in your comment? Also right now the random value is given same across all the threads . it should be based on contention of threads and not some function value based? | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Sergey Vojtovich [ 2017-11-30 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Vanilla is 10.3 as of 7cb3520c0632ad912b309489ad86a90f9fc9bd0b (10.3 branch as of today). Yes, I also wonder why I'm getting lower TPS compared to what you see (~144 000 vs ~160 000). Probably some difference in benchmark/settings/hardware? At least I have 100 000 records versus 1 000 000 that you have. Same 20 tables, 64 threads. It would be nice if you could try nornd_1 change on your side. It's not just about fixing shared variable, but also about using lower value for ut_delay specifically in mutex.
I'm afraid I don't quite get this question. Could you elaborate? You mean all the threads get the same random value? Or you mean delay should be based on mutex contention? | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Sergey Vojtovich [ 2017-11-30 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Just for the record, this article suggests that YIELD can be used in spinloop: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489f/CIHEGBBF.html I wonder if we can use it YIELD on ARM as a replacement for Intel PAUSE. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Sandeep sethia [ 2017-12-01 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I tried nornd_1 but I see scalability issue in my numbers.The tps around 74k instead of 160k which I am getting by defaults My sysbench command for run and prepare ./bin/sysbench --test=share/sysbench/oltp_update_index.lua --max-requests=0 --max-time=50 --tables=20 --table_size=100000 --report-interval=10 --mysql-user=root --mysql-db=test --mysql_storage_engine=innodb --mysql-socket=/tmp/mysql.sock --num-threads=64 run and my cnf file is All the runs are happening in tmpfs. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Sergey Vojtovich [ 2017-12-01 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Ok, so we need to be careful with this. I'll do more research on this. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Sergey Vojtovich [ 2018-01-24 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
marko, please review fix for this bug: https://github.com/MariaDB/server/commit/62eeb1c760196207ac57338237a0d58df925847f | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2018-01-24 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Looks OK to me. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Sandeep sethia [ 2018-01-25 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Sergey, In case if the atomics in arm/intel gets optimized or for the matter gets more degraded I feel this can still need to be adjusted. I think what's really needed is a measure the distribution of the number of CPU cycles the mutexes actually need to be delayed. Because "The argument gives the desired delay in microseconds on 100 MHz Pentium + Visual C++." is clearly not relevant any more | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Sergey Vojtovich [ 2018-01-26 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
ssethia, the argument is not relevant indeed. But I'm not completely sure what you're suggesting. Scalability is at least 3-dimensional thing: hardware, number of threads and workload. Usually when we improve some benchmark results that works on some crossing of these three, we inevitably make another crossing performance worse. The only way that gives non-negative result in all crossings I'm aware of is removing cache line contention (mutex, rwlock, false sharing, variables) from hot path. In the scope of this task we removed RNG variable contention, which goes perfectly inline with what I say above. If you have any specific suggestion how exactly we should tune mutex delay, I'm all for creating another JIRA task and discussing it separately. |