I ended up making several changes:
- Make the spin loop wait around a read operation, ensuring that the compiler will not translate it into a loop around lock cmpxchg on AMD64 (see the sketch after this list).
- Avoid re-reading srv_spin_wait_delay inside the spin loop.
- Remove the spin loop from dict_index_t::lock as well as from srw_lock and srw_mutex, except where it is expected to be useful (srw_spin_lock, srw_spin_mutex).
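To illustrate the first two points, here is a minimal sketch of the intended loop shape, not the actual MariaDB code; `lock_word`, `spin_try_lock`, and `cpu_pause` are hypothetical names, while `srv_spin_wait_delay` is the real tuning parameter (types simplified):

```cpp
#include <atomic>
#include <cstdint>

extern std::atomic<uint32_t> lock_word;  // hypothetical stand-in for a lock word
extern unsigned srv_spin_wait_delay;     // global tuning parameter

static inline void cpu_pause() { /* e.g. the PAUSE instruction on x86 */ }

bool spin_try_lock(unsigned spin_rounds)
{
  // Read srv_spin_wait_delay once; re-reading it on every iteration
  // would add a shared-memory access to the hot loop.
  const unsigned delay = srv_spin_wait_delay;

  for (unsigned i = 0; i < spin_rounds; i++)
  {
    // Wait around a plain read: attempt the read-modify-write only when
    // the lock looks free. Spinning directly on compare_exchange would
    // compile to a loop around lock cmpxchg on AMD64, bouncing the
    // cache line between cores in exclusive state.
    if (lock_word.load(std::memory_order_relaxed) == 0)
    {
      uint32_t expected = 0;
      if (lock_word.compare_exchange_strong(expected, 1,
                                            std::memory_order_acquire))
        return true;
    }
    for (unsigned j = 0; j < delay; j++)
      cpu_pause();
  }
  return false;  // caller falls back to a blocking wait
}
```

This is the classic test-and-test-and-set pattern: the relaxed load keeps the cache line in shared state until it is actually worth attempting the atomic operation.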
The combined effect of all these changes seemed to improve performance on my system (dual Intel® Xeon® E5-2630 v4). Before removing some of the spin loops, I had confirmed a regression, just as axel did.
Removing the spin loop from the buffer page latches (block_lock, buf_block_t::lock) would have reduced performance. Only at very large numbers of concurrent connections could it be better to avoid spin loops for page latches.
I also tried storing srv_spin_wait_delay * my_cpu_relax_multiplier / 4 in another global variable, but that seemed to reduce throughput for me. Possibly the imul instruction helps keep the CPU off the data bus for a little longer.
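For context, a hedged sketch of the two variants being compared; `delay_recompute`, `delay_precomputed`, and `precomputed_delay` are hypothetical names, the inline pause is x86-only, and types are simplified:

```cpp
extern unsigned srv_spin_wait_delay;
extern unsigned my_cpu_relax_multiplier;
extern unsigned precomputed_delay;  // hypothetical: kept equal to
                                    // srv_spin_wait_delay * my_cpu_relax_multiplier / 4

// Variant A: recompute on every wait. Costs an imul, which may itself
// keep the CPU off the data bus for a few extra cycles.
static inline void delay_recompute()
{
  unsigned n = srv_spin_wait_delay * my_cpu_relax_multiplier / 4;
  while (n--)
    __asm__ __volatile__("pause");  // x86-only; for illustration
}

// Variant B: read a precomputed global. One multiply fewer, but one more
// shared global to load; this is the variant that seemed slower above.
static inline void delay_precomputed()
{
  unsigned n = precomputed_delay;
  while (n--)
    __asm__ __volatile__("pause");
}
```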
Due to some uncertainty regarding purge tasks and MDEV-24258, I conducted my final tests with Sysbench oltp_read_only, using a data set slightly larger than the buffer pool so that some page reads would occur. I got similar results with Sysbench oltp_update_non_index.
As requested by Marko, I tried this patch on a 2-NUMA-node ARM server, and I see a regression of up to 10% with the read-write and update-index workloads. Check the attached graph. (MDEV-26467 commit hash tried: 18a71918ec0ee86e837e9632d5042ab659bdec52)