[MDEV-26467] Unnecessary compare-and-swap loop in futex-based synchronization Created: 2021-08-24 Updated: 2023-09-01 Resolved: 2021-09-06 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 10.6.0 |
| Fix Version/s: | 10.6.5 |
| Type: | Bug | Priority: | Major |
| Reporter: | Marko Mäkelä | Assignee: | Marko Mäkelä |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | performance |
| Description |
|
In the futex-based implementation of the InnoDB synchronization primitive srw_mutex::wait_and_lock(), we are unnecessarily using a compare-and-swap loop for setting the HOLDER flag. We could use a fetch_add() (to register the current thread as a waiter) followed by a fetch_or() to set the flag. The value of a granted lock word with no waiting threads would change from HOLDER to HOLDER + 1. Replacing the compare-and-swap loop seemed to slightly improve performance in Sysbench oltp_update_index on my AMD64-based system, using a single NUMA node.
We still use a compare-and-swap loop in ssux_lock_low::rd_lock_try(). |
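The idea can be sketched as follows. This is an illustrative simplification, not the actual InnoDB source: the names srw_mutex_sketch, wait_and_lock_sketch() and the exact bit layout are assumptions; the real implementation additionally performs futex wait/wake calls that are omitted here.

```cpp
#include <atomic>
#include <cstdint>

// The lock word counts waiters in its low bits and uses the most
// significant bit as the HOLDER flag.
constexpr uint32_t HOLDER = 1U << 31;

struct srw_mutex_sketch
{
  std::atomic<uint32_t> lock{0};

  // Old approach: a compare-and-swap to set HOLDER (retried in a loop
  // in the real code).
  bool try_lock_cas()
  {
    uint32_t expected = lock.load(std::memory_order_relaxed) & ~HOLDER;
    return lock.compare_exchange_strong(expected, expected | HOLDER,
                                        std::memory_order_acquire);
  }

  // New approach from the description: fetch_add() registers this
  // thread as a waiter, then fetch_or() attempts to set HOLDER.
  // A granted lock with no other waiters now reads HOLDER + 1
  // instead of plain HOLDER.
  void wait_and_lock_sketch()
  {
    lock.fetch_add(1, std::memory_order_relaxed); // become a waiter
    for (;;)
    {
      uint32_t prev = lock.fetch_or(HOLDER, std::memory_order_acquire);
      if (!(prev & HOLDER))
        return; // we set HOLDER: the lock is granted to us
      // The real code would futex-wait here until woken; omitted.
    }
  }

  void unlock()
  {
    // Clear HOLDER and deregister ourselves as a waiter in one step.
    lock.fetch_sub(HOLDER + 1, std::memory_order_release);
    // The real code would futex-wake a waiter if any remain; omitted.
  }
};
```

In the uncontended case the new path executes two unconditional atomic read-modify-write operations instead of a potentially retried compare-and-swap.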
| Comments |
| Comment by Krunal Bauskar [ 2021-08-25 ] |
|
As requested by Marko, I tried this patch on a 2-NUMA-node ARM server and I see a regression of up to 10% with the read-write and update-index workloads. Check the attached graph. |
| Comment by Marko Mäkelä [ 2021-08-26 ] |
|
The slight improvement that I noticed could be attributed to random fluctuation; I only tested each number of connections for 2 minutes. Furthermore, I noticed that on AMD64, both GCC and MSVC would still emit a loop around lock cmpxchg for std::atomic::fetch_or(), so there should not have been much improvement on that platform after all. krunalbauskar, I hope that you can test the performance of the latest 10.6 development branch, to find out whether any regression could have been introduced since the 10.6.4 release. |
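For readers unfamiliar with this lowering: when the architecture has no direct instruction for an atomic OR that also returns the previous value, compilers expand fetch_or() into a compare-exchange retry loop, conceptually equivalent to the following hand-written sketch (the function name is illustrative, not from the MariaDB source):

```cpp
#include <atomic>
#include <cstdint>

// What a loop around LOCK CMPXCHG looks like at the C++ level:
// reload the current value and retry until the compare-exchange
// succeeds, then return the value observed before the OR.
inline uint32_t fetch_or_via_cas(std::atomic<uint32_t> &a, uint32_t mask)
{
  uint32_t old = a.load(std::memory_order_relaxed);
  while (!a.compare_exchange_weak(old, old | mask,
                                  std::memory_order_acq_rel,
                                  std::memory_order_relaxed))
  {
    // 'old' was reloaded by the failed compare_exchange_weak()
  }
  return old;
}
```

Under contention each failed iteration re-executes the locked instruction, which is why a single-instruction lowering (discussed in the later comments) is attractive.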
| Comment by Marko Mäkelä [ 2021-09-01 ] |
|
I updated the branch and also included a change that replaces ut_delay(srv_spin_wait_delay) with something that does not multiply two global variables with each other inside a tight loop. I used a simple experiment to assess the NUMA-friendliness, on a dual-CPU system (Intel® Xeon® E5-2630 v4 @ 2.20 GHz). The test runs for too short a time to be really conclusive, and obviously this is a microbenchmark that is not necessarily representative of any database workload. Under numactl (bound to a single NUMA node) the wall-clock time is rather stable at 2.6 seconds (20+6 seconds of user and system time). When using both CPUs it is 3.6 to 3.9 seconds, and the sum of user and system time varies between 75 and 95 seconds. |
| Comment by Axel Schwenke [ 2021-09-01 ] |
|
See the attached graph oltp_update_non_index.png. |
| Comment by Marko Mäkelä [ 2021-09-03 ] |
|
I thought that it might make sense to remove some spin loops, but that seemed to reduce throughput for me. Maybe index or page latches are held for a very short time often enough, and we should try more fine-grained control of that. Perhaps a spin loop is less useful around index_lock and fil_space_t::latch, but more useful around block_lock when that typically is a leaf page latch? |
| Comment by Marko Mäkelä [ 2021-09-06 ] |
|
I ended up making several changes:
The combined effect of all these changes seemed to improve performance on my system (dual Intel® Xeon® E5-2630 v4). Before removing some spin loops, I confirmed some regression, just like axel did. Removing the spin loop from buffer page latches (block_lock, buf_block_t::lock) would have reduced performance. Only at very large numbers of concurrent connections could it be better to avoid spin loops for page latches. I also tried storing srv_spin_wait_delay * my_cpu_relax_multiplier / 4 into another global variable, but that seemed to reduce throughput for me. Possibly the imul instruction helps to keep the CPU off the data bus for a little longer. Due to some uncertainty regarding purge tasks and |
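The ut_delay() change mentioned above can be illustrated with a minimal sketch. This is not the actual InnoDB code: the globals stand in for srv_spin_wait_delay and my_cpu_relax_multiplier, and a counter stands in for the PAUSE instruction executed in the real delay loop.

```cpp
#include <cstdint>

// Illustrative stand-ins for the InnoDB globals.
unsigned srv_spin_wait_delay = 4;
unsigned my_cpu_relax_multiplier = 200;
volatile uint64_t pause_count = 0; // stands in for executing PAUSE

// Problematic shape: a for-loop condition is re-evaluated on every
// iteration, so the two globals are reloaded and multiplied inside
// the tight spin loop itself.
inline void ut_delay_old(unsigned delay)
{
  for (unsigned i = 0; i < delay * my_cpu_relax_multiplier / 4; i++)
    pause_count = pause_count + 1;
}

// Reworked shape: evaluate the product once, before entering the loop,
// so the loop body is free of loads of global variables.
inline void ut_delay_new(unsigned delay)
{
  const unsigned iterations = delay * my_cpu_relax_multiplier / 4;
  for (unsigned i = 0; i < iterations; i++)
    pause_count = pause_count + 1;
}
```

Both shapes perform the same number of iterations; the difference is only where the multiplication of the globals takes place, which is what the comment about the imul instruction refers to.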
| Comment by Marko Mäkelä [ 2021-09-28 ] |
|
I pushed a few follow-up changes. It turns out that there is an efficient single-instruction implementation of a single-bit fetch_or(b) & b on IA-32 and AMD64, in the form of LOCK BTSL, but every compiler at the moment would instead translate fetch_or() into a loop around LOCK CMPXCHG. On GCC, clang and compatible compilers, the futex-based srw_mutex_impl acquisition will now use LOCK BTSL via inline assembler. |
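The inline-assembler approach can be sketched like this. The function name is illustrative and the code is a simplification of the technique, not the exact MariaDB patch; on non-x86 targets it falls back to the plain fetch_or() form that compilers lower to a compare-exchange loop.

```cpp
#include <atomic>
#include <cstdint>

// Atomically set one bit and return whether it was already set.
// On IA-32/AMD64 with GCC or clang this compiles to a single
// LOCK BTS instruction; the carry flag receives the old bit value.
inline bool bit_test_and_set(std::atomic<uint32_t> &word, uint32_t bit_no)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
  bool was_set;
  __asm__ __volatile__("lock btsl %2, %0"
                       : "+m"(word), "=@ccc"(was_set)
                       : "Ir"(bit_no)
                       : "memory");
  return was_set;
#else
  // Portable fallback: typically lowered to a LOCK CMPXCHG loop.
  const uint32_t mask = uint32_t{1} << bit_no;
  return word.fetch_or(mask, std::memory_order_acquire) & mask;
#endif
}
```

The "=@ccc" flag-output constraint (GCC 6+ and clang) lets the carry flag be consumed directly, avoiding a separate setc/test sequence.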
| Comment by Marko Mäkelä [ 2021-09-29 ] |
|
The Microsoft compiler allows LOCK BTS to be generated by _interlockedbittestandset(). So, now this part of the acquisition will use fetch_add() and fetch_or() on all systems. On IA-32 and AMD64 those would translate into the 80486 LOCK XADD and (with a little help to the compilers) the 80386 LOCK BTS. We still use compare_exchange_weak() or compare_exchange_strong() in those places where it is necessary. I see that we have a few more single-bit fetch_and() and fetch_or() operations that could be translated more efficiently into LOCK BTR or LOCK BTS. I filed |
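The single-bit fetch_and() pattern mentioned at the end is the mirror image of the fetch_or() case: clearing one bit and testing its previous value is exactly what LOCK BTR computes. A minimal sketch of the portable form (illustrative name, not MariaDB code) that a compiler could in principle lower to a single LOCK BTR:

```cpp
#include <atomic>
#include <cstdint>

// Atomically clear one bit and report whether it was previously set.
// fetch_and(~mask) & mask is the pattern that maps to LOCK BTR on
// IA-32 and AMD64; at the time of this issue, compilers emitted a
// LOCK CMPXCHG loop for it instead.
inline bool bit_test_and_reset(std::atomic<uint32_t> &word,
                               uint32_t bit_no)
{
  const uint32_t mask = uint32_t{1} << bit_no;
  return word.fetch_and(~mask, std::memory_order_acquire) & mask;
}
```

On MSVC the corresponding intrinsic would be _interlockedbittestandreset(), matching the _interlockedbittestandset() mentioned above for the set direction.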