[MDEV-27866] Switching log_sys.latch to use a spin-based variant
Created: 2022-02-17 | Updated: 2023-10-09 | Resolved: 2022-02-23
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 10.8 |
| Fix Version/s: | 10.8.3 |
| Type: | Bug | Priority: | Major |
| Reporter: | Krunal Bauskar | Assignee: | Vladislav Vaintroub |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | ARM, performance |
| Attachments: | |
| Issue Links: | |
| Description | |
| Comments |
| Comment by Krunal Bauskar [ 2022-02-17 ] |

Just for reference: I tried a spin-based variant of the latch. Based on the results, I finally zeroed in on the conclusion that this latch needs a spin-based variant.
| Comment by Marko Mäkelä [ 2022-02-17 ] |

Thank you. Before applying this change, I would like to independently test the impact of enabling the spinning on AMD64.

On a related note, I tested various forms of spinning around log_sys.lsn_lock, and the spin-free srw_mutex performed best on my dual-socket Intel Haswell system. One experiment that I am planning to do is to implement log_sys.lsn_lock with something like https://gitlab.com/numa-spinlock/numa-spinlock in case the Linux kernel is not smart enough to prioritize futex wake-ups on the same NUMA node. The idea would be to let multiple mini-transactions commit on the current NUMA node before switching to the next one. It is not only about minimizing the inter-NUMA traffic on log_sys.lsn_lock itself, but also on log_sys.buf.

It could also make sense to use a pure spin lock (with no system calls), because the critical section of log_sys.lsn_lock is very small, and the context switches between user and kernel address space could incur a significant overhead.
| Comment by Vladislav Vaintroub [ 2022-02-17 ] |

The critical section of lsn_lock is tiny, yet when a pure spinlock was used, it slowed things down everywhere: NUMA, non-NUMA, single socket, Windows. I do not have NUMA, but on 16-CPU Windows, the 256-user update_index throughput went down from 220-230 ktps to 160 ktps, while IIRC on Marko's dual socket it went down from 200 ktps to 30 ktps in the same test. It is not entirely obvious why this happened; I'd check the memory access pattern.
| Comment by Marko Mäkelä [ 2022-02-21 ] |

wlad, for what it is worth, I experimented with a more NUMA-friendly spinlock for log_sys.lsn_lock, inspired by https://gitlab.com/numa-spinlock/numa-spinlock. At best, it was almost as good as the current spin-free log_sys.lsn_lock, but at times the throughput dropped significantly. I tested it only with one number of concurrent connections (probably 64). nproc says 40; there are 10 physical cores (20 threads with hyperthreading) in each CPU. The idea is that the requests are essentially FIFO queued, and the contention on the memory location of log_sys.lsn_lock itself is reduced. This test was conducted on Linux kernel 5.16.7, with io_uring on NVMe storage.

This would seem to be one more ‘proof’ that spinning on AMD64 CPUs is not that useful. I have not checked how NUMA-friendly the futex system calls in the Linux kernel are, or how they have evolved over time, but my common sense would suggest that they should try to group the futex wake-ups to reduce the rate of inter-NUMA-node transfers of the futex. Compared to a user-space spinlock implementation, the kernel has the advantage that it knows at all times which NUMA nodes or CPU cores each thread is associated with, and can decide accordingly.
| Comment by Vladislav Vaintroub [ 2022-02-21 ] |

I think we might have more urgent problems than that spinlock (for example, getting rid of the lock-free transaction hash; the CPU spent in lf_find is alarming). If it turns out to be hot, which it seems to be, I'm always for an OS mutex: pthread_mutex, or even std::mutex, which for me turns out to be a native SRWLOCK. While it does not seem to protect a large section of code, I'm not sure how often it is entered; it seems to be often, so maybe we can stay with whatever is just a normal mutex (if ARM likes spinning, there is this ADAPTIVE thing it may like). mysql_mutex_t might give an idea of how hot it is, since it is perfschema-instrumented; on the other hand, that might make it even hotter, exactly because of the perfschema overhead.
| Comment by Vladislav Vaintroub [ 2022-02-21 ] |

I attached a patch, lsn_lock_is_pthread_mutex.diff, where lsn_lock is just a pthread_mutex (with the non-portable adaptive attribute). It performs better for me: 141851.87 tps without the patch vs 152514.50 tps with the patch. I used 1000 clients, 8 tables x 1500000 rows, a large redo log and buffer pool, --innodb-flush-log-at-trx-commit=2 --thread-handling=pool-of-threads. And it performs well on Windows, too.

Update: However, if I run this on Linux on 2 NUMA nodes, the results are not so good: 83495.47 tps without the patch vs 65638.71 tps with the patch. If you notice the single-NUMA-node numbers: yes, that's how NUMA performs on the box I have; I give it 2x the CPUs, and it makes the server half as fast (therefore I almost never do any NUMA test). That's an old 3.10 kernel, so maybe things are better for someone else.
| Comment by Vladislav Vaintroub [ 2022-02-21 ] |

Update 2: If, in the above patch, I replace MY_MUTEX_INIT_FAST with 0 (so, a default mutex), the performance on both a single NUMA node and on 2 NUMA nodes is better again than srw_mutex. Based on that, pthread_mutex_t is the winner on this particular box: it is much better on a single NUMA node and slightly better on 2 NUMA nodes than srw_mutex, and it does not have the 2-NUMA-node regression that MY_MUTEX_INIT_FAST has.

Windows: all variants currently perform about the same, although srw_spin_mutex and pthread_mutex_t are slightly better than the others. The absolute winner is CRITICAL_SECTION with spin count 10 (on Windows, pthread_mutex_t == CRITICAL_SECTION, with an added SetCriticalSectionSpinCount call).

Overall, a plain (non-adaptive) pthread_mutex_t seems to be quite good in all the tests I performed, as a candidate for the "spinlock" log_sys.lsn_lock.
| Comment by Krunal Bauskar [ 2022-02-22 ] |

Marko/Vlad, 1. I tested the pthread_mutex_t approach from Vlad (with spinning disabled, MY_MUTEX_INIT_FAST = 0). This helped improve the performance in some cases (w.r.t. the baseline); check the attached graph.
| Comment by Marko Mäkelä [ 2022-02-22 ] |

wlad, thank you for experimenting and benchmarking. I checked your lsn_lock_is_pthread_mutex.diff. My main concern regarding the mutex size is that the data fields that follow the mutex should preferably be allocated in the same cache line (typically 64 bytes on IA-32 or AMD64). That is definitely the case here. Please commit that change, as well as krunalbauskar's pull request to enable log_sys.latch spinning on ARMv8.