MariaDB Server · MDEV-34983

Some std::atomic::fetch_or() or fetch_and() is being avoided unnecessarily

Details

    Description

      There are some compiler bugs around some single-bit test-and-set or test-and-reset operations, mostly affecting the most significant bit of a word.
      https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102566
      https://github.com/llvm/llvm-project/issues/37322

      That is, when the compilers generate code for the IA-32 or AMD64 ISA, they would emit a loop around the LOCK CMPXCHG instruction, instead of emitting LOCK BTS, LOCK BTR, LOCK OR or LOCK AND. Here is a test program:

      #include <atomic>
      #include <cstdint>

      uint32_t fa(std::atomic<uint32_t> &a) { return a.fetch_and(~(1U << 31)) & 1U << 31; }
      uint32_t fo(std::atomic<uint32_t> &a) { return a.fetch_or(1U << 31) & 1U << 31; }

      Starting with clang++-15 or GCC 7.1 (the first GA release of that GCC series), no LOCK CMPXCHG will be generated for the above.

      Fortunately, the oldest major GCC version that is included in any currently supported GNU/Linux distribution is 7. Also, FreeBSD 14 is using clang version 16. So, there should not be any need to worry about older versions of these compilers anymore.

      However, even the most recent version of MSVC would generate LOCK CMPXCHG for these. We would be forced to use _interlockedbittestandset() or _interlockedbittestandreset() on that platform in order to avoid a performance regression.
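A portable wrapper could dispatch on the compiler. A minimal sketch, assuming the MSVC intrinsic path is only wanted on IA-32/AMD64 (the function name set_msb and the dispatch structure are illustrative, not MariaDB code):

```cpp
#include <atomic>
#include <cstdint>
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
# include <intrin.h>
#endif

// Atomically set bit 31 and return whether it was already set.
inline bool set_msb(std::atomic<uint32_t> &a)
{
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
  // MSVC emits LOCK BTS for this intrinsic.
  return _interlockedbittestandset(
      reinterpret_cast<volatile long*>(&a), 31);
#else
  // GCC >= 7.1 and clang >= 15 compile this to LOCK BTS on IA-32/AMD64.
  return (a.fetch_or(1U << 31) & 1U << 31) != 0;
#endif
}
```

The reinterpret_cast relies on std::atomic<uint32_t> being layout-compatible with a 32-bit integer, which holds for the lock-free atomics on this platform.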


        Activity

          It turns out that only starting with GCC 13 does a test-and-set of the most significant bit actually compile efficiently in srw_lock.cc. Any version of clang would still emit a loop around lock cmpxchg.

          This can be worked around by moving the HOLDER flag from the most significant to the least significant end of the lock word; then it will work starting with clang 15 and GCC 7.
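As a sketch of the workaround (the constant and function names are illustrative, not the actual srw_lock.cc identifiers), with HOLDER at the least significant bit the acquisition path becomes a single fetch_or:

```cpp
#include <atomic>
#include <cstdint>

static constexpr uint32_t HOLDER = 1U;  // moved from 1U << 31 to bit 0

// One fetch_or, which clang >= 15 and GCC >= 7 compile to
// LOCK BTS on IA-32/AMD64; returns true if the lock was free.
inline bool try_acquire(std::atomic<uint32_t> &lock_word)
{
  return !(lock_word.fetch_or(HOLDER, std::memory_order_acquire) & HOLDER);
}

// Clearing the bit maps to LOCK BTR or LOCK AND.
inline void release(std::atomic<uint32_t> &lock_word)
{
  lock_word.fetch_and(~HOLDER, std::memory_order_release);
}
```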

          Note: Switching from uint32_t to int32_t is not an option, because then it would be UB to toggle the most significant bit of the lock word in some std::atomic::fetch_add() operations, which is what the code is doing elsewhere.


          I posted two https://godbolt.org links to https://github.com/MariaDB/server/pull/3542 that show which instructions are being generated.

          On POWER, s390x and generic ARM, all atomic operations use something similar to the x86 loops around lock cmpxchg, namely loops around load-locked/store-conditional.
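In C++ terms, the retry loop that these targets (and the buggy x86 code generation) fall back to is equivalent to this sketch:

```cpp
#include <atomic>
#include <cstdint>

// Equivalent of what a compiler emits for fetch_or when no single
// atomic-OR instruction is available: a compare-exchange retry loop
// (LOCK CMPXCHG on x86, load-locked/store-conditional elsewhere).
uint32_t fetch_or_loop(std::atomic<uint32_t> &a, uint32_t bits)
{
  uint32_t old = a.load(std::memory_order_relaxed);
  while (!a.compare_exchange_weak(old, old | bits)) {
    // On failure, old has been reloaded with the current value; retry.
  }
  return old;
}
```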

          Side note (not affected by these changes): The Large System Extensions that would be enabled by -march=armv8.1-a include one atomic instruction ldadd for fetch_add() and fetch_sub() (similar to the 80486 lock xadd), and further instructions for atomic bitwise operations that can operate on arbitrary arguments and return the full original contents. It could make sense to add some runtime detection and target-specific code for the InnoDB synchronization primitives, to reduce the overhead compared to -moutline-atomics.
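For reference, a plain fetch_add() is the operation that LSE accelerates; a minimal sketch (the code-generation notes in the comment restate the claims above and are not verified output of this snippet):

```cpp
#include <atomic>
#include <cstdint>

// With -march=armv8.1-a this compiles to a single ldadd-family
// instruction; without LSE it is an LDXR/STXR retry loop, and on
// IA-32/AMD64 it is LOCK XADD. Returns the value before the increment.
uint32_t bump(std::atomic<uint32_t> &counter)
{
  return counter.fetch_add(1);
}
```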


          People

            Assignee: Marko Mäkelä
            Reporter: Marko Mäkelä