MariaDB Server · MDEV-34983

Some std::atomic::fetch_or() or fetch_and() is being avoided unnecessarily

Details

    Description

      There are some compiler bugs around some single-bit test-and-set or test-and-reset operations, mostly affecting the most significant bit of a word.
      https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102566
      https://github.com/llvm/llvm-project/issues/37322

      That is, when the compilers generate code for the IA-32 or AMD64 ISA, they would emit a loop around the LOCK CMPXCHG instruction, instead of emitting LOCK BTS, LOCK BTR, LOCK OR or LOCK AND. Here is a test program:

      #include <atomic>
      #include <cstdint>

      uint32_t fa(std::atomic<uint32_t> &a) { return a.fetch_and(~(1U << 31)) & 1U << 31; }
      uint32_t fo(std::atomic<uint32_t> &a) { return a.fetch_or(1U << 31) & 1U << 31; }

      Starting with clang++-15 or GCC 7.1 (the first GA release of that GCC series), no LOCK CMPXCHG will be generated for the above.

      Fortunately, the oldest major GCC version that is included in any currently supported GNU/Linux distribution is 7. Also, FreeBSD 14 is using clang version 16. So, there should not be any need to worry about older versions of these compilers anymore.

      However, even the most recent version of MSVC would generate LOCK CMPXCHG for these. We would be forced to use _interlockedbittestandset() or _interlockedbittestandreset() on that platform in order to avoid a performance regression.
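A portable wrapper could dispatch on the compiler. A minimal sketch, assuming the MSVC intrinsic path is only wanted on IA-32/AMD64 (the function name set_msb and the dispatch structure are illustrative, not MariaDB code):

```cpp
#include <atomic>
#include <cstdint>
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
# include <intrin.h>
#endif

// Atomically set bit 31 and return whether it was already set.
inline bool set_msb(std::atomic<uint32_t> &a)
{
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
  // MSVC emits LOCK BTS for this intrinsic.
  return _interlockedbittestandset(
      reinterpret_cast<volatile long*>(&a), 31);
#else
  // GCC >= 7.1 and clang >= 15 compile this to LOCK BTS on IA-32/AMD64.
  return (a.fetch_or(1U << 31) & 1U << 31) != 0;
#endif
}
```

The reinterpret_cast relies on std::atomic<uint32_t> being layout-compatible with a 32-bit integer, which holds for the lock-free atomics on this platform.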


        Activity

          It turns out that only starting with GCC 13 does a test-and-set of the most significant bit actually compile efficiently in srw_lock.cc. Any version of clang would still emit a loop around lock cmpxchg.

          This can be worked around by moving the HOLDER flag from the most significant to the least significant end of the lock word; then it will work starting with clang 15 and GCC 7.
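As a sketch of the workaround (the constant and function names are illustrative, not the actual srw_lock.cc identifiers), with HOLDER at the least significant bit the acquisition path becomes a single fetch_or:

```cpp
#include <atomic>
#include <cstdint>

static constexpr uint32_t HOLDER = 1U;  // moved from 1U << 31 to bit 0

// One fetch_or, which clang >= 15 and GCC >= 7 compile to
// LOCK BTS on IA-32/AMD64; returns true if the lock was free.
inline bool try_acquire(std::atomic<uint32_t> &lock_word)
{
  return !(lock_word.fetch_or(HOLDER, std::memory_order_acquire) & HOLDER);
}

// Clearing the bit maps to LOCK BTR or LOCK AND.
inline void release(std::atomic<uint32_t> &lock_word)
{
  lock_word.fetch_and(~HOLDER, std::memory_order_release);
}
```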

          Note: Switching from uint32_t to int32_t is not an option, because then it would be UB to toggle the most significant bit of the lock word in some std::atomic::fetch_add() operations, which is what the code is doing elsewhere.


          I posted two https://godbolt.org links to https://github.com/MariaDB/server/pull/3542 that show which instructions are being generated.

          On POWER, s390x and generic ARM, all atomic operations use something similar to the x86 loops around lock cmpxchg, namely loops around load-locked/store-conditional.
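In C++ terms, the retry loop that these targets (and the buggy x86 code generation) fall back to is equivalent to this sketch:

```cpp
#include <atomic>
#include <cstdint>

// Equivalent of what a compiler emits for fetch_or when no single
// atomic-OR instruction is available: a compare-exchange retry loop
// (LOCK CMPXCHG on x86, load-locked/store-conditional elsewhere).
uint32_t fetch_or_loop(std::atomic<uint32_t> &a, uint32_t bits)
{
  uint32_t old = a.load(std::memory_order_relaxed);
  while (!a.compare_exchange_weak(old, old | bits)) {
    // On failure, old has been reloaded with the current value; retry.
  }
  return old;
}
```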

          Side note (not affected by these changes): The Large System Extensions that would be enabled by -march=armv8.1-a include one atomic instruction ldadd for fetch_add() and fetch_sub() (similar to the 80486 lock xadd), and further instructions for atomic bitwise operations that can operate on arbitrary arguments and return the full original contents. It could make sense to add some runtime detection and target-specific code for the InnoDB synchronization primitives, to reduce the overhead compared to -moutline-atomics.
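For reference, a plain fetch_add() is the operation that LSE accelerates; a minimal sketch (the code-generation notes in the comment restate the claims above and are not verified output of this snippet):

```cpp
#include <atomic>
#include <cstdint>

// With -march=armv8.1-a this compiles to a single ldadd-family
// instruction; without LSE it is an LDXR/STXR retry loop, and on
// IA-32/AMD64 it is LOCK XADD. Returns the value before the increment.
uint32_t bump(std::atomic<uint32_t> &counter)
{
  return counter.fetch_add(1);
}
```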


          People

            Assignee: Marko Mäkelä
            Reporter: Marko Mäkelä