MDEV-26467: Unnecessary compare-and-swap loop in futex-based synchronization

Details

    Description

      In the futex-based implementation of the InnoDB synchronization primitive srw_mutex::wait_and_lock(), we are unnecessarily using a compare-and-swap loop for setting the HOLDER flag. We could use a fetch_add() (to register the current thread as a waiter) followed by a fetch_or() to set the flag. The value of a granted lock word with no waiting threads would change from HOLDER to HOLDER + 1.
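
      A minimal sketch of the idea (not the actual srw_mutex code; the lock word layout, the memory orders and the omitted futex wait are simplifying assumptions here):

      #include <atomic>
      #include <cstdint>

      /* Sketch only: HOLDER marks the exclusive owner; the low bits count
      registered threads. */
      static constexpr uint32_t HOLDER= 1U << 31;

      struct srw_mutex_sketch
      {
        std::atomic<uint32_t> lock_word{0};

        void wait_and_lock()
        {
          /* Register this thread with a single fetch_add() instead of a
          compare-and-swap loop. */
          uint32_t lk= 1 + lock_word.fetch_add(1, std::memory_order_relaxed);
          for (;;)
          {
            if (lk & HOLDER)
              /* Held by another thread: futex wait (omitted), then reload. */
              lk= lock_word.load(std::memory_order_relaxed);
            else if (!((lk= lock_word.fetch_or(HOLDER,
                                               std::memory_order_acquire))
                       & HOLDER))
              /* Acquired: with no other waiters the lock word is now
              HOLDER + 1, matching the description above. */
              return;
          }
        }
      };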

      Replacing the compare-and-swap loop seems to slightly improve performance in Sysbench oltp_update_index on my AMD64-based system, using a single NUMA node:

      threads | unpatched transactions/s | patched transactions/s
           10 |                    91110 |                   92171
           20 |                   158399 |                  158623
           30 |                   162566 |                  163092

      We have used a compare-and-swap loop in ssux_lock_low::rd_lock_try() ever since MDEV-24271. Before that, it was a greedy fetch_add() followed by a fetch_sub() back-off, but we observed starvation due to that.
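
      A hedged sketch contrasting the two approaches (not the actual ssux_lock_low code; the WRITER flag name and the memory orders are assumptions here):

      #include <atomic>
      #include <cstdint>

      static constexpr uint32_t WRITER= 1U << 31; /* assumed flag layout */

      /* Compare-and-swap loop: a reader is only registered while no writer
      is present, so the mere attempt cannot starve a writer. */
      bool rd_lock_try_cas(std::atomic<uint32_t> &readers)
      {
        uint32_t lk= readers.load(std::memory_order_relaxed);
        do
          if (lk & WRITER)
            return false;
        while (!readers.compare_exchange_weak(lk, lk + 1,
                                              std::memory_order_acquire,
                                              std::memory_order_relaxed));
        return true;
      }

      /* Greedy variant: optimistically register, then back off. A steady
      stream of readers can keep bumping the counter and starve a writer. */
      bool rd_lock_try_greedy(std::atomic<uint32_t> &readers)
      {
        if (!(readers.fetch_add(1, std::memory_order_acquire) & WRITER))
          return true;
        readers.fetch_sub(1, std::memory_order_relaxed); /* back-off */
        return false;
      }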

      Attachments

        Issue Links

          Activity

            As requested by Marko, I tried this patch on a 2-NUMA-node ARM server and I see a regression of up to 10% with the read-write and update-index workloads. Check the attached graph. (MDEV-26467 commit hash tried: 18a71918ec0ee86e837e9632d5042ab659bdec52)

            krunalbauskar Krunal Bauskar added a comment

            The slight improvement that I noticed could be attributed to random fluctuation. I only tested each number of connections for 2 minutes. Furthermore, I noticed that on AMD64, both GCC and MSVC would still emit a loop around lock cmpxchg for std::atomic::fetch_or(), so there should not have been much improvement on that platform after all.

            krunalbauskar, I hope that you can test the performance of the latest 10.6 development branch, to find out if any regression could have been introduced since the 10.6.4 release.

            marko Marko Mäkelä added a comment

            I updated the branch and also included a change that replaces ut_delay(srv_spin_wait_delay) with something that does not multiply two global variables with each other inside a tight loop. I used a simple experiment of

            mkdir bld
            cd bld
            cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo .. # and other parameters
            cmake --build .
            time numactl --cpunodebind 1 --localalloc storage/innobase/unittest/innodb_sync-t
            time storage/innobase/unittest/innodb_sync-t
            

            to assess the NUMA-friendliness, on a dual-CPU system (Intel® Xeon® E5-2630 v4 @ 2.20 GHz). The test runs for too short a time to be really conclusive, and obviously this is a microbenchmark that is not necessarily representative of any database workload. Under numactl the wall-clock time is rather stable at 2.6 seconds (20+6 seconds of user and system time). When using both CPUs it is 3.6 to 3.9 seconds, and the sum of user and system time varies between 75 and 95 seconds.

            marko Marko Mäkelä added a comment
            axel Axel Schwenke added a comment -

            For oltp_update_non_index.png I see a regression, at least in commit f29b66be1d9.


            I thought that it might make sense to remove some spin loops, but that seemed to reduce throughput for me. Maybe index or page latches are held for a very short time often enough, and we should try a more fine-grained control of that. Perhaps a spin loop is less useful around index_lock and fil_space_t::latch but more useful around block_lock, when that typically is a leaf page latch?

            marko Marko Mäkelä added a comment

            I ended up making several changes:

            1. Make the spin loop wait around a read operation (ensuring that the compiler will not translate it into a loop around lock cmpxchg on AMD64; see the sketch after this list).
            2. Avoid re-reading srv_spin_wait_delay inside the spin loop.
            3. Remove the spin loop from dict_index_t::lock, srw_lock and srw_mutex, except when it is expected to be useful (srw_spin_lock, srw_spin_mutex).
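
            A minimal sketch of items 1 and 2 (not the committed MariaDB code; the function and parameter names are illustrative, and only srv_spin_wait_delay is an actual server variable):

            #include <atomic>
            #include <cstdint>

            extern unsigned srv_spin_wait_delay; /* server global (type simplified here) */

            static inline void cpu_pause() { /* e.g. the x86 pause instruction */ }

            /* Spin while the lock word has locked_flag set, waiting around a
            plain read so that no atomic read-modify-write is issued here. */
            bool spin_until_unlocked(const std::atomic<uint32_t> &lock_word,
                                     uint32_t locked_flag, unsigned spin_rounds)
            {
              const unsigned delay= srv_spin_wait_delay; /* read the global only once */
              while (spin_rounds--)
              {
                if (!(lock_word.load(std::memory_order_relaxed) & locked_flag))
                  return true; /* the caller may now attempt fetch_or() or similar */
                for (unsigned i= delay; i--; )
                  cpu_pause();
              }
              return false; /* give up spinning and fall back to sleeping */
            }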

            The combined effect of all these changes seemed to improve performance on my system (dual Intel® Xeon® E5-2630 v4). Before removing some spin loops, I confirmed some regression, just like axel did.

            Removing the spin loop from buffer page latches (block_lock, buf_block_t::lock) would have reduced performance. Only at very large numbers of concurrent connections could it be better to avoid spin loops for page latches.

            I also tried storing srv_spin_wait_delay * my_cpu_relax_multiplier / 4 into another global variable, but that seemed to reduce throughput for me. Possibly the imul instruction helps to keep the CPU off the data bus for a little longer.

            Due to some uncertainty regarding purge tasks and MDEV-24258, I conducted my final tests with Sysbench oltp_read_only using a data set that was slightly larger than the buffer pool, so that some page reads would occur. I did get similar results with Sysbench oltp_update_non_index.

            marko Marko Mäkelä added a comment

            I pushed a few follow-up changes. It turns out that there is an efficient single-bit fetch_or(b) & b on IA-32 and AMD64 in the form of LOCK BTSL, but every compiler at the moment would instead translate fetch_or() into a loop around LOCK CMPXCHG. On GCC, clang and compatible compilers, the futex-based srw_mutex_impl acquisition will now use LOCK BTSL via inline assembler.
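
            As an illustration only (not the committed code), this is roughly how a single-bit fetch_or() on bit 31 can be expressed as lock btsl with GCC/clang inline assembler; the fallback shows what compilers currently generate from fetch_or():

            #include <atomic>
            #include <cstdint>

            /* Atomically set bit 31 of the lock word; return whether it was already set. */
            static inline bool bit_test_and_set_31(std::atomic<uint32_t> &word)
            {
            #if defined __GNUC__ && (defined __i386__ || defined __x86_64__)
              bool was_set;
              __asm__ __volatile__("lock btsl $31, %1\n\tsetc %0"
                                   : "=q" (was_set), "+m" (word) :: "cc", "memory");
              return was_set;
            #else
              /* Portable fallback: currently compiled into a loop around
              lock cmpxchg on IA-32 and AMD64. */
              return word.fetch_or(1U << 31, std::memory_order_acquire) >> 31;
            #endif
            }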

            marko Marko Mäkelä added a comment

            The Microsoft compiler allows LOCK BTS to be generated by _interlockedbittestandset(). So, this part of the acquisition will now use fetch_add() and fetch_or() on all systems. On IA-32 and AMD64 those would translate into the 80486 LOCK XADD and (with a little help to the compilers) the 80386 LOCK BTS. We still use compare_exchange_weak() or compare_exchange_strong() in those places where it is necessary.
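
            For the MSVC side, a hedged sketch of the same single-bit operation; the wrapper name is illustrative, only the intrinsic is real:

            #if defined _MSC_VER && (defined _M_IX86 || defined _M_X64)
            # include <intrin.h>

            /* Atomically set bit 31 via LOCK BTS; return whether it was already set. */
            static inline bool bit_test_and_set_31(volatile long *word)
            {
              return _interlockedbittestandset(word, 31);
            }
            #endif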

            I see that we have a few more single-bit fetch_and() and fetch_or() operations that could be translated more efficiently into LOCK BTR or LOCK BTS. I filed MDEV-26720 to take care of them.

            marko Marko Mäkelä added a comment

            People

              Assignee: marko Marko Mäkelä
              Reporter: marko Marko Mäkelä
              Votes: 0
              Watchers: 3

