[MDEV-23635] Add notional delay (using existing ut_delay) while spinning for registering reader in rw-locks - Jira

Details

Type: Task
Status: Open
Priority: Major
Resolution: Unresolved
Fix Version/s: None
Component/s: Storage Engine - InnoDB
Labels:
- ARM
- ARMv8
- performance

Description

Currently, when a latch is being acquired, the code spins without any wait between attempts until the latch is acquired. See the logic in rw_lock_lock_word_decr.
Introducing a short delay (using ut_delay) before trying the next iteration has been shown to have a positive effect on performance.
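
A minimal sketch of the idea, not the actual InnoDB code: a compare-and-swap retry loop that registers a reader, with a short pause after each failed attempt. The names `spin_delay` and `lock_word_decr_with_delay` and the `delay` parameter are hypothetical; `spin_delay` is only a portable stand-in for ut_delay.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for ut_delay(): burn a few iterations.
// A real implementation would issue a CPU relax hint instead.
inline void spin_delay(unsigned iterations) {
  for (volatile unsigned i = 0; i < iterations; i = i + 1) {
  }
}

// Try to subtract `amount` from the lock word while it stays above
// `threshold`. Returns true once the decrement has been applied.
// Assumption: the caller passes the number of delay iterations.
inline bool lock_word_decr_with_delay(std::atomic<int32_t>& lock_word,
                                      int32_t amount, int32_t threshold,
                                      unsigned delay) {
  assert(amount > 0);
  int32_t copy = lock_word.load(std::memory_order_relaxed);
  while (copy > threshold) {
    // On failure, compare_exchange_strong reloads `copy` with the
    // current value, so the loop condition is re-evaluated on it.
    if (lock_word.compare_exchange_strong(copy, copy - amount,
                                          std::memory_order_acquire,
                                          std::memory_order_relaxed)) {
      return true;
    }
    spin_delay(delay);  // the proposed pause before the next attempt
  }
  return false;
}
```

The only difference from a plain retry loop is the `spin_delay` call on the failure path; a suitable delay length would have to be found by benchmarking on the target CPU.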

Attachments

MDEV-23635_Analysis.pdf
724 kB
2020-09-08 06:17

Issue Links

is blocked by

MDEV-23633 MY_RELAX_CPU performs unnecessary compare-and-swap on ARM (Closed)

Activity

View 5 older comments

Marko Mäkelä added a comment - 2021-01-20 07:32

krunalbauskar, in ~~MDEV-24167~~ and ~~MDEV-24142~~ in 10.6, the old rw_lock_t was replaced with srw_lock and ssux_lock. Have you run benchmarks on that code?

My idea to make S-latch acquisition a simple fetch_add() had to be abandoned, because it caused writer starvation (~~MDEV-24271~~). The compare-and-swap loop does look ugly, but I was not able to come up with anything better.
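
For readers unfamiliar with the two approaches, here is a hedged sketch of the difference. It is not the srw_lock or ssux_lock code; the `WRITER` flag layout and both function names are assumptions made for illustration.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical layout: the top bit marks a writer, the rest counts readers.
constexpr uint32_t WRITER = 1U << 31;

// Variant A: unconditional fetch_add, undone if a writer is present.
// The lock word is modified even when the attempt fails.
inline bool read_trylock_fetch_add(std::atomic<uint32_t>& word) {
  uint32_t old = word.fetch_add(1, std::memory_order_acquire);
  if (old & WRITER) {
    word.fetch_sub(1, std::memory_order_relaxed);  // roll back
    return false;
  }
  return true;
}

// Variant B: compare-and-swap loop. The reader count is incremented
// only if no writer flag was observed in the value being replaced.
inline bool read_trylock_cas(std::atomic<uint32_t>& word) {
  uint32_t old = word.load(std::memory_order_relaxed);
  while (!(old & WRITER)) {
    if (word.compare_exchange_weak(old, old + 1,
                                   std::memory_order_acquire,
                                   std::memory_order_relaxed)) {
      return true;
    }
  }
  return false;
}
```

In variant A readers transiently change the lock word even while a writer is flagged, whereas variant B never does. Whether that is the exact mechanism behind the starvation reported in MDEV-24271 is not stated here, so treat this only as an illustration of the two shapes.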

Krunal Bauskar added a comment - 2021-01-20 07:39

@marko

Not yet. The first thing to try is a plain 10.6 benchmark, and then to explore further based on the findings.
The said patch may still help old releases like 10.4 if we plan to consider optimizing them.

Marko Mäkelä added a comment - 2021-01-20 08:42

I think that it should be fine to add an ARM-specific optimization to older releases with an ARM-specific #ifdef.
After the lesson of ~~MDEV-23475~~ (and ~~MDEV-24272~~), I would be wary of changing anything on AMD64 in GA releases (10.5 or older). It could only be done after some extensive benchmarking, with different CPU microarchitectures. We know that the latency of the PAUSE instruction (which is a critical part of MY_RELAX_CPU or ut_delay on IA-32 and AMD64) has been drastically changed by Intel in Skylake, and possibly after that as well.

Marko Mäkelä added a comment - 2021-07-30 10:36

Somewhat related to this (and possibly improving the 10.6 implementation), https://rigtorp.se/spinlock/ discusses spinlocks (mutexes) and claims that spinning on a read-modify-write instruction is less efficient than spinning on a read instruction. It is not immediately obvious to me how that could be applied to rw-locks, but maybe you could experiment with that, krunalbauskar?
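
As a rough illustration of the technique that article describes for a plain spinlock (often called test-and-test-and-set), here is a minimal sketch. The class name `TtasSpinlock` is hypothetical, and this is not code from MariaDB or a verbatim copy of the article.

```cpp
#include <atomic>

class TtasSpinlock {
  std::atomic<bool> locked_{false};

public:
  void lock() {
    for (;;) {
      // One read-modify-write attempt to take the lock.
      if (!locked_.exchange(true, std::memory_order_acquire)) {
        return;
      }
      // While the lock is held, wait using plain loads only.
      while (locked_.load(std::memory_order_relaxed)) {
        // A CPU relax hint (such as PAUSE on AMD64) would go here.
      }
    }
  }

  bool try_lock() {
    // Read first, so a held lock is not touched by a write.
    return !locked_.load(std::memory_order_relaxed) &&
           !locked_.exchange(true, std::memory_order_acquire);
  }

  void unlock() { locked_.store(false, std::memory_order_release); }
};
```

The inner loop performs only reads while the lock is busy; the read-modify-write is retried only after the lock has been observed free. How to carry this over to a rw-lock with a reader count is the open question raised above.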

Krunal Bauskar added a comment - 2021-08-02 07:51

@Marko,

I went over the article. It suggests checking the variable before attempting to write to it. Fortunately, our server and most new-generation systems use compare_exchange_strong, thereby simulating the same behavior but more efficiently at the processor level.

Just to add a note for a wider audience: compare_exchange can be looked upon as 2 ops:
- compare (if the desired value is present), then try to exchange: a RARE event while the lock is being acquired
- compare (if the desired value is absent), return immediately: the common scenario leading to the loop.
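
The two outcomes above can be observed at the language level with a small sketch. `CasResult` and `try_swap` are hypothetical helper names. The sketch shows only the C++ semantics of compare_exchange_strong; what a failed compare costs in cache-coherence traffic depends on the hardware and is not demonstrated here.

```cpp
#include <atomic>

// Hypothetical helper type: whether the exchange happened and which
// value was seen in the atomic at the time of the attempt.
struct CasResult {
  bool exchanged;
  int observed;
};

inline CasResult try_swap(std::atomic<int>& word, int expected, int desired) {
  // On success, `word` becomes `desired` and `expected` is unchanged.
  // On failure, `word` is left as-is and `expected` is overwritten
  // with the value that was actually found.
  bool ok = word.compare_exchange_strong(expected, desired);
  return {ok, expected};
}
```

In a retry loop, the `observed` value from a failed attempt is what the next iteration works with, so no separate load is needed between attempts.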

People

Assignee: Unassigned

Reporter: Krunal Bauskar

Votes: 0

Watchers: 2

Dates

Created: 2020-09-01 06:00

Updated: 2021-08-02 07:51