[MDEV-19845] Adaptive spin loops Created: 2019-06-24 Updated: 2020-08-08 Resolved: 2019-06-27 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Locking, Server, Storage Engine - Aria, Storage Engine - InnoDB |
| Fix Version/s: | 10.3.17, 10.4.7, 10.5.0 |
| Type: | Task | Priority: | Major |
| Reporter: | Marko Mäkelä | Assignee: | Marko Mäkelä |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | performance, threads | ||
| Issue Links: |
|
||||||||||||||||||||||||
| Description |
|
Starting with the Intel Skylake microarchitecture, the PAUSE instruction latency is about 140 clock cycles instead of earlier 10. Some other reference suggested that on AMD, the latency could be 10 or 50 clock cycles, depending on microarchitecture. Because of this big range of latency, the assumptions around which spin loops have been written can be invalid, suggesting that some redesign may be needed. I tried to find out how to detect the microarchitecture by the CPUID instruction, but did not find an up-to-date reference for that. It might require a lookup table that needs to be updated constantly with new processor models. Intel’s article on this includes some code that claims to implement ‘exponential backoff’, but it does not look like that to me. It looks like a constant number of loops around the PAUSE instruction, similar to LF_BACKOFF() in MariaDB. svoj used ut_delay(1) in the lock-free TRX_SYS refactoring in MariaDB Server 10.3. Other invocations of this function seem to be passing the value of innodb_spin_wait_delay. It is only 4 by default, because there is an internal multiplier in ut_delay(). The current range is inadequate for making a 14-fold adjustment. Even if we changed the value from 4 to 1, there would be a 3.5-fold increase (14/4). In An easy change for MariaDB would seem to be to replace both ut_delay() and LF_BACKOFF() with something that takes a new parameter for the loop count. The old parameter innodb_spin_wait_delay could be deprecated and have no effect. |
| Comments |
| Comment by Marko Mäkelä [ 2019-06-24 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Some experiments with MySQL 8.0.16’s new parameter innodb_spin_wait_pause_multiplier (which makes the previously hard-wired multiplier of 50 in ut_delay() configurable) seem to suggest that simply configuring the PAUSE loop may not give significant improvements. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Daniel Black [ 2019-06-26 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
For CPU detection I suggest looking at target/target_clones as an IFUNC runtime determination. __builtin_cpu_is/supports can be used for branches. Windows is a bit more manual and seems to require CPUID. Is the size of mutex memory a problem? Having mutexes should already be grouped aligned to the L1 cache line (x86 is 64 bytes , 128 on arm64/ppc64). On sizes in bytes on x86_64 fc30 glibc 2.29), pthread_rwlock_t size 56 and pthread_mutex_t size 40 (unfortunately bloated (below) already for different pthread NP/Robust/elision types). Using adjacent memory on counts doesn't seem to incur additional cache misses and is probably padded in some cases. Some structures already have mutex adjacent to other structure members however extra padding of counts to a multiple of cacheline size.
On a guess at an implementation
This brings it up to 64 bytes and an initial upper/lower bound to spin cycles.
After the return we try to get a mutex and based on success/failure update stats;
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-06-26 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
With the default build options, ib_mutex_t is 48 bytes. My main concern is bloat of objects dict_index_t or buf_block_t. Anyway, let us concentrate on the immediate problem regarding to the x86 PAUSE instruction timing changes. I am afraid that CPUID will not work there. The simplest fix should be to run a short calibration loop with RDTSC and PAUSE at the startup, to initialize a global variable that would be replacing the constant in LF_BACKOFF() and ut_delay(). svoj pointed out a new Timed PAUSE instruction that is associated with a CPUID flag. That in combination with RDTSC could be a reasonable choice, and for this, danblack’s idea of target attribute should work. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-06-26 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Based on the following experiment, the TPAUSE instruction does not seem to be available on two computers that I tried, about 1½ years and 3 years old:
This returns the following bytes to me:
This is equivalent to what the cpuid instruction would return in the registers eax,ebx,ecx,edx when invoking it with eax=7 and the other 3 registers zeroed out. Notably, it returns ecx=0, and this is where the feature bit for TPAUSE should be located. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-06-26 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
svoj, please review bb-10.3-MDEV-19845 where I implemented simple calibration of the PAUSE instruction timing. Higher-level changes would be the subject of a separate task. We could also port part of this to 10.2 if there is demand. Before 10.3, LF_BACKOFF() does not invoke PAUSE, but ut_delay() does. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-06-28 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
As a loosely related follow-up task, I cleaned up my_timer_cycles() and defined it inline in MariaDB 10.4.7. Avoiding a subroutine call for a single instruction (such as rdtsc) could yield a slight performance improvement. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-07-02 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Based on benchmarks, it seems that issuing 1/2 of the amount of PAUSE instructions on Skylake and later microarchitectures yields best overall performance. Originally, the scaling factor was 1/10. The instruction latency ratio is 1/14. The adjustment of the factor will be part of the 10.3.17 and 10.4.7 releases. |