MDEV-26467: Unnecessary compare-and-swap loop in futex-based synchronization

Details

    Description

      In the futex-based implementation of the InnoDB synchronization primitive srw_mutex::wait_and_lock(), we are unnecessarily using a compare-and-swap loop for setting the HOLDER flag. We could use a fetch_add() (to register the current thread as a waiter) followed by a fetch_or() to set the flag. The value of a granted lock word with no waiting threads would change from HOLDER to HOLDER + 1.
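
      A minimal sketch of the idea (not the actual srw_mutex code; the lock word layout, the memory orders and the omitted futex wait are simplifying assumptions here):

      #include <atomic>
      #include <cstdint>

      /* Sketch only: HOLDER marks the exclusive owner; the low bits count
      registered threads. */
      static constexpr uint32_t HOLDER= 1U << 31;

      struct srw_mutex_sketch
      {
        std::atomic<uint32_t> lock_word{0};

        void wait_and_lock()
        {
          /* Register this thread with a single fetch_add() instead of a
          compare-and-swap loop. */
          uint32_t lk= 1 + lock_word.fetch_add(1, std::memory_order_relaxed);
          for (;;)
          {
            if (lk & HOLDER)
              /* Held by another thread: futex wait (omitted), then reload. */
              lk= lock_word.load(std::memory_order_relaxed);
            else if (!((lk= lock_word.fetch_or(HOLDER,
                                               std::memory_order_acquire))
                       & HOLDER))
              /* Acquired: with no other waiters the lock word is now
              HOLDER + 1, matching the description above. */
              return;
          }
        }
      };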

      Replacing the compare-and-swap loop seems to slightly improve performance in Sysbench oltp_update_index on my AMD64-based system, using a single NUMA node:

      threads | unpatched transactions/s | patched transactions/s
           10 |                    91110 |                   92171
           20 |                   158399 |                  158623
           30 |                   162566 |                  163092

      We have used a compare-and-swap loop in ssux_lock_low::rd_lock_try() ever since MDEV-24271. Before that, it was a greedy fetch_add() followed by a fetch_sub() back-off, but we observed starvation due to that.
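
      A hedged sketch contrasting the two approaches (not the actual ssux_lock_low code; the WRITER flag name and the memory orders are assumptions here):

      #include <atomic>
      #include <cstdint>

      static constexpr uint32_t WRITER= 1U << 31; /* assumed flag layout */

      /* Compare-and-swap loop: a reader is only registered while no writer
      is present, so the mere attempt cannot starve a writer. */
      bool rd_lock_try_cas(std::atomic<uint32_t> &readers)
      {
        uint32_t lk= readers.load(std::memory_order_relaxed);
        do
          if (lk & WRITER)
            return false;
        while (!readers.compare_exchange_weak(lk, lk + 1,
                                              std::memory_order_acquire,
                                              std::memory_order_relaxed));
        return true;
      }

      /* Greedy variant: optimistically register, then back off. A steady
      stream of readers can keep bumping the counter and starve a writer. */
      bool rd_lock_try_greedy(std::atomic<uint32_t> &readers)
      {
        if (!(readers.fetch_add(1, std::memory_order_acquire) & WRITER))
          return true;
        readers.fetch_sub(1, std::memory_order_relaxed); /* back-off */
        return false;
      }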

      Attachments

        Issue Links

          Activity

            As requested by Marko, I tried this patch on a 2-NUMA-node ARM server and I see a regression of up to 10% with the read-write and update-index workloads. Check the attached graph. (MDEV-26467 commit hash tried: 18a71918ec0ee86e837e9632d5042ab659bdec52)

            krunalbauskar Krunal Bauskar added a comment

            The slight improvement that I noticed could be attributed to random fluctuation. I only tested each number of connections for 2 minutes. Furthermore, I noticed that on AMD64, both GCC and MSVC would still emit a loop around lock cmpxchg for std::atomic::fetch_or(), so there should not have been much improvement on that platform after all.

            krunalbauskar, I hope that you can test the performance of the latest 10.6 development branch, to find out if any regression could have been introduced since the 10.6.4 release.

            marko Marko Mäkelä added a comment

            I updated the branch and also included a change that replaces ut_delay(srv_spin_wait_delay) with something that does not multiply two global variables with each other inside a tight loop. I used a simple experiment of

            mkdir bld
            cd bld
            cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo .. # and other parameters
            cmake --build .
            time numactl --cpunodebind 1 --localalloc storage/innobase/unittest/innodb_sync-t
            time storage/innobase/unittest/innodb_sync-t
            

            to assess the NUMA-friendliness, on a dual-CPU system (Intel® Xeon® E5-2630 v4 @ 2.20 GHz). The test runs for too short a time to be really conclusive, and obviously this is a microbenchmark that is not necessarily representative of any database workload. Under numactl the wall-clock time is rather stable at 2.6 seconds (20+6 seconds of user and system time). When using both CPUs it is 3.6 to 3.9 seconds, and the sum of user and system time varies between 75 and 95 seconds.

            marko Marko Mäkelä added a comment
            axel Axel Schwenke added a comment -

            For oltp_update_non_index.png I see a regression, at least in commit f29b66be1d9.


            I thought that it might make sense to remove some spin loops, but that seemed to reduce throughput for me. Maybe index or page latches are held for a very short time often enough, and we should try a more fine-grained control of that. Perhaps a spin loop is less useful around index_lock and fil_space_t::latch but more useful around block_lock, when that typically is a leaf page latch?

            marko Marko Mäkelä added a comment

            I ended up making several changes:

            1. Make the spin loop wait around a read operation (ensuring that the compiler will not translate it into a loop around lock cmpxchg on AMD64; see the sketch after this list).
            2. Avoid re-reading srv_spin_wait_delay inside the spin loop.
            3. Remove the spin loop from dict_index_t::lock, srw_lock and srw_mutex, except when it is expected to be useful (srw_spin_lock, srw_spin_mutex).
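
            A minimal sketch of items 1 and 2 (not the committed MariaDB code; the function and parameter names are illustrative, and only srv_spin_wait_delay is an actual server variable):

            #include <atomic>
            #include <cstdint>

            extern unsigned srv_spin_wait_delay; /* server global (type simplified here) */

            static inline void cpu_pause() { /* e.g. the x86 pause instruction */ }

            /* Spin while the lock word has locked_flag set, waiting around a
            plain read so that no atomic read-modify-write is issued here. */
            bool spin_until_unlocked(const std::atomic<uint32_t> &lock_word,
                                     uint32_t locked_flag, unsigned spin_rounds)
            {
              const unsigned delay= srv_spin_wait_delay; /* read the global only once */
              while (spin_rounds--)
              {
                if (!(lock_word.load(std::memory_order_relaxed) & locked_flag))
                  return true; /* the caller may now attempt fetch_or() or similar */
                for (unsigned i= delay; i--; )
                  cpu_pause();
              }
              return false; /* give up spinning and fall back to sleeping */
            }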

            The combined effect of all these changes seemed to improve performance on my system (dual Intel® Xeon® E5-2630 v4). Before removing some spin loops, I confirmed some regression, just like axel did.

            Removing the spin loop from buffer page latches (block_lock, buf_block_t::lock) would have reduced performance. Only at very large numbers of concurrent connections could it be better to avoid spin loops for page latches.

            I also tried storing srv_spin_wait_delay * my_cpu_relax_multiplier / 4 into another global variable, but that seemed to reduce throughput for me. Possibly the imul instruction helps to keep the CPU off the data bus for a little longer.

            Due to some uncertainty regarding purge tasks and MDEV-24258, I conducted my final tests with Sysbench oltp_read_only using a data set that was slightly larger than the buffer pool, so that some page reads would occur. I did get similar results with Sysbench oltp_update_non_index.

            marko Marko Mäkelä added a comment

            I pushed a few follow-up changes. It turns out that there is an efficient single-bit fetch_or(b) & b on IA-32 and AMD64 in the form of LOCK BTSL, but every compiler at the moment would instead translate fetch_or() into a loop around LOCK CMPXCHG. On GCC, clang and compatible compilers, the futex-based srw_mutex_impl acquisition will now use LOCK BTSL via inline assembler.
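
            As an illustration only (not the committed code), this is roughly how a single-bit fetch_or() on bit 31 can be expressed as lock btsl with GCC/clang inline assembler; the fallback shows what compilers currently generate from fetch_or():

            #include <atomic>
            #include <cstdint>

            /* Atomically set bit 31 of the lock word; return whether it was already set. */
            static inline bool bit_test_and_set_31(std::atomic<uint32_t> &word)
            {
            #if defined __GNUC__ && (defined __i386__ || defined __x86_64__)
              bool was_set;
              __asm__ __volatile__("lock btsl $31, %1\n\tsetc %0"
                                   : "=q" (was_set), "+m" (word) :: "cc", "memory");
              return was_set;
            #else
              /* Portable fallback: currently compiled into a loop around
              lock cmpxchg on IA-32 and AMD64. */
              return word.fetch_or(1U << 31, std::memory_order_acquire) >> 31;
            #endif
            }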

            marko Marko Mäkelä added a comment

            The Microsoft compiler allows LOCK BTS to be generated by _interlockedbittestandset(). So, this part of the acquisition will now use fetch_add() and fetch_or() on all systems. On IA-32 and AMD64 those would translate into the 80486 LOCK XADD and (with a little help to the compilers) the 80386 LOCK BTS. We still use compare_exchange_weak() or compare_exchange_strong() in those places where it is necessary.
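
            For the MSVC side, a hedged sketch of the same single-bit operation; the wrapper name is illustrative, only the intrinsic is real:

            #if defined _MSC_VER && (defined _M_IX86 || defined _M_X64)
            # include <intrin.h>

            /* Atomically set bit 31 via LOCK BTS; return whether it was already set. */
            static inline bool bit_test_and_set_31(volatile long *word)
            {
              return _interlockedbittestandset(word, 31);
            }
            #endif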

            I see that we have a few more single-bit fetch_and() and fetch_or() operations that could be translated more efficiently into LOCK BTR or LOCK BTS. I filed MDEV-26720 to take care of them.

            marko Marko Mäkelä added a comment

            People

              Assignee: marko Marko Mäkelä
              Reporter: marko Marko Mäkelä
              Votes: 0
              Watchers: 3

