[MDEV-8684] UT_RELAX_CPU isn't at all relaxing Created: 2015-08-27 Updated: 2017-05-09 Resolved: 2016-04-06 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 10.0.21 |
| Fix Version/s: | 10.1.14 |
| Type: | Bug | Priority: | Major |
| Reporter: | Stewart Smith | Assignee: | Jan Lindström (Inactive) |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | None |
| Environment: | ppc64el |
| Attachments: | |
| Issue Links: | |
| Sprint: | 10.1.8-1, 10.1.8-3 |
| Description |
|
See http://bugs.mysql.com/bug.php?id=74832. The patch has languished since Nov 2014, so I'm just pasting in my original description here:

The code here has seemingly changed in 5.7.5 compared to previous versions; I'm talking about 5.7.5-m15 here. The purpose of ut_delay() is to spin for a while before InnoDB attempts to acquire a mutex again. Optimizations include calling the pause instruction inside the spin loop (on x86) and setting the thread priority to low for the duration of ut_delay (on POWER).

Here is the current (MySQL 5.7.5) implementation of ut_delay:

    UT_LOW_PRIORITY_CPU();
    j = 0;
    for (i = 0; i < delay * 50; i++) {
        j += i;
        UT_RELAX_CPU();
    }
    if (ut_always_false) {
        ut_always_false = (ibool) j;
    }
    UT_RESUME_PRIORITY_CPU();
    return(j);

There are a couple of problems with this code. But there's another problem that's a bit hidden: in ut0ut.h we have a definition of UT_RELAX_CPU() which is anything but relaxing. So, on POWER, where we have atomics but no pause instruction, we get that instead of an empty statement. This likely affects other platforms too (e.g. SPARC, MIPS, ARM, ia64, m68k - basically everything that isn't x86). What we really want here is just a compiler barrier, so that the compiler knows it cannot optimize away the loop.

Back to ut_delay: if we look at the original PowerPC assembler for it, it's rather larger than it needs to be:

    0000000000000380 <._Z8ut_delaym>:

The bits that stare at me are the sync and isync instructions. We're executing memory barriers in there! In a loop! When we're meant to be relaxing!

So, once I remove the buggy UT_RELAX_CPU() implementation and simplify ut_delay (patch attached), I end up with:

    0000000000000380 <._Z8ut_delaym>:

which is exactly what we should be doing: we go into low priority (mr r1,r1), spin for a while, then resume normal priority (mr r2,r2) and return. We also avoid doing unnecessary work, which is good. This may also have a positive performance impact on x86, as the extra math and surrounding work no longer has to be done; and IIRC modern KVM on x86 will trap the pause instruction and attempt to schedule a vcpu that may hold the lock we're spinning for.

How to repeat:

Suggested fix: |
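The compiler-barrier approach argued for above can be sketched as follows. This is an illustrative reimplementation, not the committed patch: the priority macros are stubbed out (on ppc they would be the `mr r1,r1` / `mr r2,r2` priority nops), and the `ut_always_false` dead-store trick is omitted since the barrier alone keeps the loop alive.

```c
#include <assert.h>

/* Sketch of the proposed fix: UT_RELAX_CPU() as a pure compiler
 * barrier. It stops the compiler from optimizing the spin loop away
 * but emits no memory-barrier (sync/isync) instructions. */
#if defined(__GNUC__)
# define UT_RELAX_CPU() __asm__ __volatile__ ("" ::: "memory")
#else
# define UT_RELAX_CPU() do {} while (0)
#endif

/* Stubs for illustration; the real macros set ppc thread priority. */
#define UT_LOW_PRIORITY_CPU()    do {} while (0)  /* ppc: mr r1,r1 */
#define UT_RESUME_PRIORITY_CPU() do {} while (0)  /* ppc: mr r2,r2 */

typedef unsigned long ulint;

/* Simplified ut_delay(): spin for roughly delay * 50 iterations. */
ulint
ut_delay(ulint delay)
{
	ulint i, j = 0;

	UT_LOW_PRIORITY_CPU();
	for (i = 0; i < delay * 50; i++) {
		j += i;
		UT_RELAX_CPU();
	}
	UT_RESUME_PRIORITY_CPU();
	return(j);
}
```

The empty asm with a `"memory"` clobber is the standard GCC idiom for a compiler-only barrier: the compiler must execute it each iteration and cannot hoist or delete the loop, but no barrier instruction reaches the CPU.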
| Comments |
| Comment by Stewart Smith [ 2015-08-28 ] |
|
Attached is SQL and a script to recreate the issue. This will be related to another issue I'm about to file too, but from the perf output you can clearly see that ut_delay() is sucking huge amounts of CPU time, and this isn't really ideal. We've observed this in 10.0.17 and 10.0.20 (or was it 21... I can't remember). (Pasting in from Anton below)
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Sergey Vojtovich [ 2015-08-28 ] |
|
About a year ago we also analyzed ut_delay performance (I can't remember the exact workload though). We applied a similar patch and ut_delay disappeared from the perf output, but overall system performance suffered as well. Thus it was put on a shelf and forgotten until now. We need to resume this analysis with the suggested workload. |
| Comment by Stewart Smith [ 2015-08-31 ] |
|
It's possible that the delay needs to be increased when we add this patch, as it'll do less work and thus take a lot less time (probably a LOT less, actually, as we're not forcing things out of the L1 cache). |
| Comment by Stewart Smith [ 2015-08-31 ] |
|
Using the microbenchmarks from https://github.com/stewart-ibm/microbenchmarks/commit/9210e849374f14e1ffd652ed869637f3d70ba354. Here I use the timebase register on ppc to work out how long things take in wall time (the timebase is 512 MHz and is "good enough" for this purpose).

    [stewart@p82 ~]$ gcc -O3 ut_delay_optim.c -o ut_delay_optim

So with the optimized ut_delay, we're actually delaying for less than a tenth of the time. To tune that back to approximately the same amount of wall time, we try changing the multiplication for the loop in ut_delay from *50 to *1024 (which should just optimize down to a bit shift). With this (ut_delay_optim2.c) I compare the run of the MySQL one to the optim2 one (see https://github.com/stewart-ibm/microbenchmarks/commit/fa26a1cb71b5799ae846e49b951bb54d55a6a168):

    [stewart@p82 ~]$ for i in `seq 1 10`; do ./ut_delay_optim2 ; done
    [stewart@p82 ~]$ for i in `seq 1 10`; do ./ut_delay_mysql ; done

So it's probably worth trying that implementation to see if there's still a negative performance impact. It should burn the same amount of (wall) CPU time without placing additional stress on CPU parts. |
| Comment by Sergey Vojtovich [ 2015-09-09 ] |
|
So here are OLTP RW results for 160 threads, 16 tables, 2 CPUs / 20 cores, SMT=8, DSCR=1, sysbench-0.5 + RNG fix.

Fresh 10.1 snapshot: 16688 TPS (ut_delay takes 4.74%)

Patch:
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Sergey Vojtovich [ 2015-09-09 ] |
|
I couldn't reproduce any reasonable performance impact with the "psdoit" benchmark either. stewart-ibm, did you see any effect in your benchmarks? Generally I agree with this patch, but it is hard to justify it with no visible improvement. Otherwise I'd prefer to have this problem fixed by: |
| Comment by Sergey Vojtovich [ 2015-09-16 ] |
|
jplindst, could you also share your thoughts on this? |
| Comment by Jan Lindström (Inactive) [ 2015-09-17 ] |
|
Very interesting to see that different implementations spend very different amounts of time in that single function. However, the results are not clear: why does the middle one, which spends the least time there, have significantly lower performance? Does it hit the real mutex wait sooner? Comparing the first one and the last one, there is no significant difference, so I do not see any real reason to change from the original to the last one. Do you see similar results when the number of threads is varied? |
| Comment by Stewart Smith [ 2015-09-29 ] |
|
(Sorry I've been a bit behind on this; a million other things going on.) I'm totally in favour of instead just using pthread mutexes and fixing the problem there! I'm doubly in support of doing this upstream too. I think what's going on with the first patch and the decreased performance is that we poke the cacheline with the mutex in it a lot more, and we end up sleeping a lot sooner. I don't think anyone has looked at the best way to spin like this for a while; I'll poke some people and see if we can come up with something better, as we may want to dig into various in-depth simulators/modelling tools and even just ask the chip designers. |
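The trade-off being discussed (spinning pokes the mutex cacheline more, blocking sleeps sooner) is the classic spin-then-block pattern. A minimal sketch of that middle ground on top of plain pthread mutexes, as Stewart suggests, could look like this; `SPIN_ROUNDS` and `spin_then_lock` are hypothetical names, not the InnoDB code path.

```c
#include <assert.h>
#include <pthread.h>

/* Illustrative only: try the lock a bounded number of times while
 * relaxing, then fall back to a blocking pthread_mutex_lock(). */
#define SPIN_ROUNDS 30

void
spin_then_lock(pthread_mutex_t *m)
{
	int i;

	for (i = 0; i < SPIN_ROUNDS; i++) {
		if (pthread_mutex_trylock(m) == 0)
			return;                              /* acquired while spinning */
		__asm__ __volatile__ ("" ::: "memory");  /* relax: compiler barrier */
	}
	pthread_mutex_lock(m);                       /* give up and block */
}
```

Note the tension the comment describes: each `trylock` attempt touches the mutex cacheline, so a shorter spin reduces cacheline traffic but sends threads to sleep sooner.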
| Comment by Daniel Black [ 2016-03-31 ] |
|
OK, I have just added https://github.com/MariaDB/server/pull/168 containing two patches from MySQL 5.7, Svoj's patch, and an implementation of UT_RELAX_CPU that isn't harmful outside of the CPU. Running with sysbench (sysbench --test=sysbench/tests/db/select.lua --oltp_tables_count=64 --oltp-table-size=500000 --mysql-socket=/tmp/mysql.sock --mysql-user=root --max-time=600 --max-requests=2000000000 --report-interval=20), overall the patches change the perf split between ut_delay, mutex_spin_wait and _raw_spin_lock. This is running on ppc64 with SMT=8.
I haven't done an x86 comparison yet, as really the only impact there is the removal of the maths, which the microbenchmarks show as insignificant. So we've got a 10.645 increase in TPS while using 10% less CPU time. Breakdown of perf on HEAD:
After the patch series applied:
Using the microbenchmarks, modifying Stewart's (https://github.com/grooverdan/microbenchmarks/tree/master/ut_delay): the time difference of the MATHS component was minimal on both x86 and POWER. Running on x86_64, Intel(R) Xeon(R) CPU L5640 @ 2.27GHz:
On Power:
So ut_delay is taking about twice the length of time as on x86. Removing the POWER UT_DELAY implementation:
Suggestions welcome. |
| Comment by Daniel Black [ 2016-04-06 ] |
|
Closed as per https://github.com/MariaDB/server/commit/9794cf2311c8fe86f05e046f0b96b46862219e03 |