[MDEV-27866] Switching log_sys.latch to use spin based variant Created: 2022-02-17  Updated: 2023-10-09  Resolved: 2022-02-23

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.8
Fix Version/s: 10.8.3

Type: Bug Priority: Major
Reporter: Krunal Bauskar Assignee: Vladislav Vaintroub
Resolution: Fixed Votes: 0
Labels: ARM, performance

Attachments: PNG File baseline, patched (lsn_lock pthread-mutex (spin-disabled) and patched (lsn_lock spinnin enabled + srw_spin_lock latch).png     File lsn_lock_is_pthread_mutex.diff     PNG File switching log_sys.latch to use spin based lock variant (vs normal).png    
Issue Links:
Relates
relates to MDEV-27774 Reduce scalability bottlenecks in mtr... Closed
relates to MDEV-32374 log_sys.lsn_lock is a performance hog Closed

 Description   
  • log_sys.mutex was replaced with log_sys.latch as part of redo log revamp efforts.
  • during the experiments it was observed that the spin-based variant of log_sys.latch
    continues to perform better on ARM, in the range of 8-15%
    (x86 didn't show any major performance difference).


 Comments   
Comment by Krunal Bauskar [ 2022-02-17 ]

Just for reference, I tried the spin variant for:
1. only the latch
2. only lsn_lock
3. latch + lsn_lock

Based on the results, I finally zeroed in on the latch being the one that needs the spin-based variant.

Comment by Marko Mäkelä [ 2022-02-17 ]

Thank you. Before applying this change, I would like to independently test the impact of enabling the spinning on AMD64.

On a related note, I tested various forms of spinning around log_sys.lsn_lock, and the spin-free srw_mutex performed best on my dual-socket Intel Haswell system. One experiment that I am planning to do is to implement log_sys.lsn_lock with something like https://gitlab.com/numa-spinlock/numa-spinlock in case the Linux kernel is not smart enough to prioritize futex wake-ups on the same NUMA node. The idea would be to let multiple mini-transactions commit on the current NUMA node before switching to the next one. It is not only about minimizing the inter-NUMA traffic on log_sys.lsn_lock itself, but also on log_sys.buf. It could also make sense to use a pure spin lock (with no system calls) because the critical section of log_sys.lsn_lock is very small, and the context switches between user and kernel address space could incur a significant overhead.

Comment by Vladislav Vaintroub [ 2022-02-17 ]

The critical section of lsn_lock is tiny, yet when a pure spinlock was used, it slowed things down everywhere: NUMA, non-NUMA, single socket, Windows. I do not have NUMA, but on 16-CPU Windows, the 256-user update_index went down from 220-230 ktps to 160 ktps, while IIRC on Marko's dual socket it went down from 200 ktps to 30 ktps in the same test.

It is not entirely obvious why this happened; I'd check the memory access pattern.

Comment by Marko Mäkelä [ 2022-02-21 ]

wlad, for what it is worth, I experimented with a more NUMA-friendly spinlock for log_sys.lsn_lock, inspired by https://gitlab.com/numa-spinlock/numa-spinlock. At best, it was almost as good as the current spin-free log_sys.lsn_lock, but at times the throughput dropped significantly. I tested it with only one number of concurrent connections (probably 64). nproc says 40; there are 10 physical cores (20 threads with hyperthreading) in each CPU. The idea is that the requests are essentially FIFO-queued, and the contention on the memory location of log_sys.lsn_lock itself is reduced.

This test was conducted on Linux kernel 5.16.7, io_uring on NVMe storage. This would seem to be one more ‘proof’ that spinning on AMD64 CPUs is not that useful.

I have not checked how NUMA-friendly the futex system calls in the Linux kernel are, or how they have evolved over time, but my common sense would suggest that they should try to group the futex wake-ups to reduce the rate of inter-NUMA-node transfers of the futex. Compared to a user-space spinlock implementation, the kernel has the advantage that it knows at all times which NUMA node or CPU core each thread is associated with, and can easily decide accordingly.

Comment by Vladislav Vaintroub [ 2022-02-21 ]

I think we might have more urgent problems than that spinlock (for example, getting rid of the lock-free transaction hash; the CPU spent in lf_finds is alarming). If it turns out to be hot, which it seems to be, I'm always for an OS mutex: pthread_mutex, or even std::mutex, which for me turns out to be a native SRWLOCK. While it does not seem to protect a large section of code, I'm not sure how often it is entered; it seems to be often, so maybe we can stay with whatever is just a normal mutex (if ARM likes spinning, there is this ADAPTIVE thing it may like). mysql_mutex_t might give an idea of how hot it is, since it is perfschema-instrumented; on the other hand, it might make it even hotter, exactly because of the perfschema overhead.

Comment by Vladislav Vaintroub [ 2022-02-21 ]

I attached the patch where lsn_lock is just a pthread_mutex (with the non-portable adaptive attribute): lsn_lock_is_pthread_mutex.diff, applied to commit bbe99cd4e2d7c83a06dd93ea88af97f2d5796810 (current 10.9).

That performs better for me: 141851.87 tps without the patch vs 152514.50 with the patch
in a 30-second update_index run, and this is on Intel. Perhaps ARM can also benefit from it.

I used 1000 clients, 8 tables x 1500000 rows, a large redo log and buffer pool, --innodb-flush-log-at-trx-commit=2 --thread-handling=pool-of-threads

And it performs well on Windows, too.

Update: However, if I run this on Linux on 2 NUMA nodes, the results are not so good: 83495.47 tps without the patch vs 65638.71 with the patch. If you notice the single-NUMA-node numbers: yes, that's how NUMA performs on that box I have; I give it 2x the CPUs, and it makes the server half as fast (therefore I almost never do any NUMA tests). That's an old 3.10 kernel, so maybe things are better for someone else.

Comment by Vladislav Vaintroub [ 2022-02-21 ]

Update 2: If, in the above patch, I replace MY_MUTEX_INIT_FAST with 0 (i.e. the default mutex), the performance on both a single NUMA node and 2 NUMA nodes is better again than srw_mutex.

NUMA nodes  srw_mutex  pthread_mutex (MY_MUTEX_INIT_FAST)  pthread_mutex (default)
1           141851.87  152514.50                           143170.42
2            83495.47   65638.71                            90748.48

Based on that, pthread_mutex_t is the winner on this particular box. It is much better on a single NUMA node and slightly better on 2 NUMA nodes than srw_mutex, and does not have the 2-NUMA-node regression which MY_MUTEX_INIT_FAST has.

Windows:

srw_spin_mutex  SRWLOCK    pthread_mutex_t  srw_mutex  CRITICAL_SECTION (spin count 10)
223972.20       217495.12  222930.06        217841.48  227981.50

All currently perform about the same, although srw_spin_mutex and pthread_mutex_t are slightly better than the others. The absolute winner, CRITICAL_SECTION with spin count 10, is pthread_mutex_t (which is a CRITICAL_SECTION on Windows) with an added SetCriticalSectionSpinCount call.

Overall, the plain (non-adaptive) pthread_mutex_t seems to be quite good in all the tests I performed, as a candidate for the "spinlock" log_sys.lsn_lock.

Comment by Krunal Bauskar [ 2022-02-22 ]

Marko/Vlad,

1. I tested the pthread_mutex_t approach from Vlad (with spinning disabled, MY_MUTEX_INIT_FAST = 0). This helped improve the performance in some cases (w.r.t. the baseline).
2. I then added the original srw_spin_lock patch. This got me mixed results.
3. Finally, I found a combination that scaled best, with a significant difference from the baseline: srw_spin_lock for the latch + pthread_mutex_t with spinning enabled for lsn_lock. This got me the best results on ARM.

Check the attached graph.

Comment by Marko Mäkelä [ 2022-02-22 ]

wlad, thank you for experimenting and benchmarking.

I checked your lsn_lock_is_pthread_mutex.diff, and I am glad to see that it will not only avoid any PERFORMANCE_SCHEMA overhead, but that sizeof(log_sys.lsn_lock) is only 40 on my system. I thought that the size was 44 bytes on IA-32 and 48 bytes on AMD64, but it looks like I was off by 8.

My main concern regarding the mutex size is that the data fields that follow the mutex should best be allocated in the same cache line (typically, 64 bytes on IA-32 or AMD64). That is definitely the case here.

Please commit that change, as well as krunalbauskar’s pull request to enable log_sys.latch spinning on ARMv8.

Generated at Thu Feb 08 09:56:11 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.