[MDEV-19845] Adaptive spin loops - Jira

Details

Type: Task
Status: Closed (View Workflow)
Priority: Major
Resolution: Fixed
Fix Version/s: 10.3.17, 10.4.7, 10.5.0
Component/s: Locking, Server, Storage Engine - Aria, Storage Engine - InnoDB
Labels:
- performance
- threads

Description

Starting with the Intel Skylake microarchitecture, the PAUSE instruction latency is about 140 clock cycles instead of earlier 10. Some other reference suggested that on AMD, the latency could be 10 or 50 clock cycles, depending on microarchitecture.

Because of this big range of latency, the assumptions around which spin loops have been written can be invalid, suggesting that some redesign may be needed.

I tried to find out how to detect the microarchitecture by the CPUID instruction, but did not find an up-to-date reference for that. It might require a lookup table that needs to be updated constantly with new processor models.

Intel’s article on this includes some code that claims to implement ‘exponential backoff’, but it does not look like that to me. It looks like a constant number of loops around the PAUSE instruction, similar to LF_BACKOFF() in MariaDB.

svoj used ut_delay(1) in the lock-free TRX_SYS refactoring in MariaDB Server 10.3. Other invocations of this function seem to be passing the value of innodb_spin_wait_delay. It is only 4 by default, because there is an internal multiplier in ut_delay(). The current range is inadequate for making a 14-fold adjustment. Even if we changed the value from 4 to 1, there would be a 3.5-fold increase (14/4).

In ~~MDEV-16168~~, danblack pointed to a mutex implementation that appears to be storing a spin count in each mutex, and using that for subsequent iterations. Something like: ‘try 2× as many spins as previous time, but no more than a given maximum’. However, storing the spin loop counts in each mutex would seem to increase the memory usage. If we tried to aggregate them (for the likes of buf_block_t::lock), updating the aggregated values could easily become a contention point.

An easy change for MariaDB would seem to be to replace both ut_delay() and LF_BACKOFF() with something that takes a new parameter for the loop count. The old parameter innodb_spin_wait_delay could be deprecated and have no effect.

Attachments

Issue Links

causes

MDEV-20233 Linking issue on Redhat 6

Closed

MDEV-23249 __builtin_readcyclecounter in include/my_rdtsc.h causes SIGILL on ARM

Closed

relates to

MDEV-19929 Add a startup message about PAUSE instruction timing

In Review

MDEV-16168 Performance regression on sysbench write benchmarks from 10.2 to 10.3

Closed

Activity

Ascending order - Click to sort in descending order

Marko Mäkelä added a comment - 2019-06-24 11:58

Some experiments with MySQL 8.0.16’s new parameter innodb_spin_wait_pause_multiplier (which makes the previously hard-wired multiplier of 50 in ut_delay() configurable) seem to suggest that simply configuring the PAUSE loop may not give significant improvements.

Marko Mäkelä added a comment - 2019-06-24 11:58 Some experiments with MySQL 8.0.16’s new parameter innodb_spin_wait_pause_multiplier (which makes the previously hard-wired multiplier of 50 in ut_delay() configurable) seem to suggest that simply configuring the PAUSE loop may not give significant improvements.

Daniel Black added a comment - 2019-06-26 04:16

For CPU detection I suggest looking at target/target_clones as an IFUNC runtime determination. __builtin_cpu_is/supports can be used for branches. Windows is a bit more manual and seems to require CPUID.

Is the size of mutex memory a problem? Having mutexes should already be grouped aligned to the L1 cache line (x86 is 64 bytes , 128 on arm64/ppc64). On sizes in bytes on x86_64 fc30 glibc 2.29), pthread_rwlock_t size 56 and pthread_mutex_t size 40 (unfortunately bloated (below) already for different pthread NP/Robust/elision types). Using adjacent memory on counts doesn't seem to incur additional cache misses and is probably padded in some cases. Some structures already have mutex adjacent to other structure members however extra padding of counts to a multiple of cacheline size.

size determination
$ cat /tmp/x.c
#include <pthread.h>
#include <stdio.h>

int main()
{
pthread_rwlock_t rw;
pthread_mutex_t l;
printf("pthread_rwlock_t size %d\npthread_mutex_t size %d\n", sizeof(pthread_rwlock_t), sizeof(pthread_mutex_t));
return 0;
}
$ gdb /tmp/sizes_pthread
GNU gdb (GDB) Fedora 8.3-3.fc30
(gdb) break main
Breakpoint 1 at 0x40112e: file /tmp/x.c, line 8.
(gdb) r
Starting program: /tmp/sizes_pthread
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.29-15.fc30.x86_64

Breakpoint 1, main () at /tmp/x.c:8
8 printf("pthread_rwlock_t size %d\npthread_mutex_t size %d\n", sizeof(pthread_rwlock_t), sizeof(pthread_mutex_t));
(gdb) p rw
$1 = {__data = {__readers = 0, __writers = 0, __wrphase_futex = 4198813, __writers_futex = 0, __pad3 = 0, __pad4 = 0, __cur_writer = 0, __shared = 0, __rwelision = 80 'P',
__pad1 = "\021@\000\000\000\000", __pad2 = 4198464, __flags = 4294958432},
__size = "\000\000\000\000\000\000\000\000\235\021@", '\000' <repeats 21 times>, "P\021@\000\000\000\000\000@\020@\000\000\000\000\000`\335\377\377\377\177\000", __align = 0}
(gdb) p l
$2 = {__data = {__lock = 4194368, __count = 0, __owner = 15775231, __nusers = 0, __kind = 194, __spins = 0, __elision = 0, __list = {__prev = 0x7fffffffdc57, __next = 0x7fffffffdc56}},
__size = "@\000@\000\000\000\000\000\377\265\360\000\000\000\000\000\302\000\000\000\000\000\000\000W\334\377\377\377\177\000\000V\334\377\377\377\177\000", __align = 4194368}

On a guess at an implementation

struct {

    mutex_t m;

    int32_t low_count;

    int32_t high_count;

   int64_t low_sum;

   int64_t high_sum;

} = { PTHREAD_MUTEX_INITIALIZER, 1, 1, 20, 40 }

This brings it up to 64 bytes and an initial upper/lower bound to spin cycles.

number of spins

int32_t ut_delay(m)
{
int32_t high_avg = (int32_t) (high_sum / high_count);
int32_t low_avg = (int32_t) (low_sum / low_count);
int64_t diff = high_avg - low_avg;
if high_count > low_count {
// try to lower the amount spins (and more likely to hit a fail)
spins = high_avg - diff / 8;
} else {
// raise spins
spins = low_avg + diff / 8;
}

for (int i ; i < spins ; i++) {
...
}
return spins;
}

After the return we try to get a mutex and based on success/failure update stats;

// likewise reverse for spin_success

void spin_fail(mutex m, int32_t spins)

  if (spins > (m.high_sum / m.high_count)) {

    // need to increase high limit

    m.high_sum += spins;

    m.high_count ++;

  m.low_count++;

  m.low_sum += spins;

  /* todo some overflow scale-down */

Daniel Black added a comment - 2019-06-26 04:16 For CPU detection I suggest looking at target/target_clones as an IFUNC runtime determination. __builtin_cpu_is/supports can be used for branches. Windows is a bit more manual and seems to require CPUID. Is the size of mutex memory a problem? Having mutexes should already be grouped aligned to the L1 cache line (x86 is 64 bytes , 128 on arm64/ppc64). On sizes in bytes on x86_64 fc30 glibc 2.29), pthread_rwlock_t size 56 and pthread_mutex_t size 40 (unfortunately bloated (below) already for different pthread NP/Robust/elision types). Using adjacent memory on counts doesn't seem to incur additional cache misses and is probably padded in some cases. Some structures already have mutex adjacent to other structure members however extra padding of counts to a multiple of cacheline size. size determination $ cat /tmp/x.c #include <pthread.h> #include <stdio.h> int main() { pthread_rwlock_t rw; pthread_mutex_t l; printf("pthread_rwlock_t size %d\npthread_mutex_t size %d\n", sizeof(pthread_rwlock_t), sizeof(pthread_mutex_t)); return 0; } $ gdb /tmp/sizes_pthread GNU gdb (GDB) Fedora 8.3-3.fc30 (gdb) break main Breakpoint 1 at 0x40112e: file /tmp/x.c, line 8. (gdb) r Starting program: /tmp/sizes_pthread Missing separate debuginfos, use: dnf debuginfo-install glibc-2.29-15.fc30.x86_64 Breakpoint 1, main () at /tmp/x.c:8 8 printf("pthread_rwlock_t size %d\npthread_mutex_t size %d\n", sizeof(pthread_rwlock_t), sizeof(pthread_mutex_t)); (gdb) p rw $1 = {__data = {__readers = 0, __writers = 0, __wrphase_futex = 4198813, __writers_futex = 0, __pad3 = 0, __pad4 = 0, __cur_writer = 0, __shared = 0, __rwelision = 80 'P', __pad1 = "\021@\000\000\000\000", __pad2 = 4198464, __flags = 4294958432}, __size = "\000\000\000\000\000\000\000\000\235\021@", '\000' <repeats 21 times>, "P\021@\000\000\000\000\000@\020@\000\000\000\000\000`\335\377\377\377\177\000", __align = 0} (gdb) p l $2 = {__data = {__lock = 4194368, __count = 0, __owner = 15775231, __nusers = 0, __kind = 194, __spins = 0, __elision = 0, __list = {__prev = 0x7fffffffdc57, __next = 0x7fffffffdc56}}, __size = "@\000@\000\000\000\000\000\377\265\360\000\000\000\000\000\302\000\000\000\000\000\000\000W\334\377\377\377\177\000\000V\334\377\377\377\177\000", __align = 4194368} On a guess at an implementation struct { mutex_t m; int32_t low_count; int32_t high_count; int64_t low_sum; int64_t high_sum; } = { PTHREAD_MUTEX_INITIALIZER, 1, 1, 20, 40 } This brings it up to 64 bytes and an initial upper/lower bound to spin cycles. number of spins int32_t ut_delay(m) { int32_t high_avg = (int32_t) (high_sum / high_count); int32_t low_avg = (int32_t) (low_sum / low_count); int64_t diff = high_avg - low_avg; if high_count > low_count { // try to lower the amount spins (and more likely to hit a fail) spins = high_avg - diff / 8; } else { // raise spins spins = low_avg + diff / 8; } for (int i ; i < spins ; i++) { ... } return spins; } After the return we try to get a mutex and based on success/failure update stats; // likewise reverse for spin_success void spin_fail(mutex m, int32_t spins) { if (spins > (m.high_sum / m.high_count)) { // need to increase high limit m.high_sum += spins; m.high_count ++; } m.low_count++; m.low_sum += spins; /* todo some overflow scale-down */ }

Marko Mäkelä added a comment - 2019-06-26 08:10

With the default build options, ib_mutex_t is 48 bytes. My main concern is bloat of objects dict_index_t or buf_block_t. ~~MDEV-18724~~ would address the latter by removing buf_block_t::mutex.

Anyway, let us concentrate on the immediate problem regarding to the x86 PAUSE instruction timing changes. I am afraid that CPUID will not work there.

The simplest fix should be to run a short calibration loop with RDTSC and PAUSE at the startup, to initialize a global variable that would be replacing the constant in LF_BACKOFF() and ut_delay().

svoj pointed out a new Timed PAUSE instruction that is associated with a CPUID flag. That in combination with RDTSC could be a reasonable choice, and for this, danblack’s idea of target attribute should work.

Marko Mäkelä added a comment - 2019-06-26 08:10 With the default build options, ib_mutex_t is 48 bytes. My main concern is bloat of objects dict_index_t or buf_block_t . MDEV-18724 would address the latter by removing buf_block_t::mutex . Anyway, let us concentrate on the immediate problem regarding to the x86 PAUSE instruction timing changes. I am afraid that CPUID will not work there. The simplest fix should be to run a short calibration loop with RDTSC and PAUSE at the startup, to initialize a global variable that would be replacing the constant in LF_BACKOFF() and ut_delay() . svoj pointed out a new Timed PAUSE instruction that is associated with a CPUID flag. That in combination with RDTSC could be a reasonable choice, and for this, danblack ’s idea of target attribute should work.

Marko Mäkelä added a comment - 2019-06-26 09:38

Based on the following experiment, the TPAUSE instruction does not seem to be available on two computers that I tried, about 1½ years and 3 years old:

sudo modprobe cpuid

sudo od -N 16 -j 0x70 -An -t x1 /dev/cpu/0/cpuid

This returns the following bytes to me:

Intel Xeon E5-2630 v4 @ 2.20GHz
00 00 00 00 bb bf 1c 02 00 00 00 00 00 04 00 9c

Intel Core i7-6500U CPU @ 2.50GHz
00 00 00 00 af 67 9c 02 00 00 00 00 00 24 00 9c

This is equivalent to what the cpuid instruction would return in the registers eax,ebx,ecx,edx when invoking it with eax=7 and the other 3 registers zeroed out. Notably, it returns ecx=0, and this is where the feature bit for TPAUSE should be located.

Marko Mäkelä added a comment - 2019-06-26 09:38 Based on the following experiment, the TPAUSE instruction does not seem to be available on two computers that I tried, about 1½ years and 3 years old: sudo modprobe cpuid sudo od -N 16 -j 0x70 -An -t x1 /dev/cpu/0/cpuid This returns the following bytes to me: Intel Xeon E5-2630 v4 @ 2.20GHz 00 00 00 00 bb bf 1c 02 00 00 00 00 00 04 00 9c Intel Core i7-6500U CPU @ 2.50GHz 00 00 00 00 af 67 9c 02 00 00 00 00 00 24 00 9c This is equivalent to what the cpuid instruction would return in the registers eax,ebx,ecx,edx when invoking it with eax=7 and the other 3 registers zeroed out. Notably, it returns ecx=0, and this is where the feature bit for TPAUSE should be located.

Marko Mäkelä added a comment - 2019-06-26 14:45

svoj, please review bb-10.3-MDEV-19845 where I implemented simple calibration of the PAUSE instruction timing. Higher-level changes would be the subject of a separate task.

We could also port part of this to 10.2 if there is demand. Before 10.3, LF_BACKOFF() does not invoke PAUSE, but ut_delay() does.

Marko Mäkelä added a comment - 2019-06-26 14:45 svoj , please review bb-10.3-MDEV-19845 where I implemented simple calibration of the PAUSE instruction timing. Higher-level changes would be the subject of a separate task. We could also port part of this to 10.2 if there is demand. Before 10.3, LF_BACKOFF() does not invoke PAUSE , but ut_delay() does.

Marko Mäkelä added a comment - 2019-06-28 16:23

As a loosely related follow-up task, I cleaned up my_timer_cycles() and defined it inline in MariaDB 10.4.7. Avoiding a subroutine call for a single instruction (such as rdtsc) could yield a slight performance improvement.

Marko Mäkelä added a comment - 2019-06-28 16:23 As a loosely related follow-up task, I cleaned up my_timer_cycles() and defined it inline in MariaDB 10.4.7. Avoiding a subroutine call for a single instruction (such as rdtsc ) could yield a slight performance improvement.

Marko Mäkelä added a comment - 2019-07-02 15:04

Based on benchmarks, it seems that issuing 1/2 of the amount of PAUSE instructions on Skylake and later microarchitectures yields best overall performance. Originally, the scaling factor was 1/10. The instruction latency ratio is 1/14. The adjustment of the factor will be part of the 10.3.17 and 10.4.7 releases.

Marko Mäkelä added a comment - 2019-07-02 15:04 Based on benchmarks, it seems that issuing 1/2 of the amount of PAUSE instructions on Skylake and later microarchitectures yields best overall performance. Originally, the scaling factor was 1/10. The instruction latency ratio is 1/14. The adjustment of the factor will be part of the 10.3.17 and 10.4.7 releases.

People

Assignee:: Marko Mäkelä

Reporter:: Marko Mäkelä

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 2019-06-24 08:56

Updated:: 2020-08-08 07:47

Resolved:: 2019-06-27 08:05

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.