[MDEV-23635] Add notional delay (using existing ut_delay) while spinning for registering reader in rw-locks Created: 2020-09-01  Updated: 2021-08-02

Status: Open
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Fix Version/s: None

Type: Task Priority: Major
Reporter: Krunal Bauskar Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: ARM, ARMv8, performance

Attachments: PDF File MDEV-23635_Analysis.pdf    
Issue Links:
Blocks
is blocked by MDEV-23633 MY_RELAX_CPU performs unnecessary com... Closed

 Description   
  • Currently, when the flow acquires a latch, it spins without any delay between attempts until the latch is acquired. Check the logic in rw_lock_lock_word_decr().
  • Introducing a short delay into this logic using ut_delay() before trying the next iteration has been shown to have a positive effect on performance.
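The proposal can be sketched as follows. This is a simplified, self-contained model of the retry loop, not InnoDB's actual code; `ut_delay_sketch()` is a stand-in for the real `ut_delay()` (which uses the `MY_RELAX_CPU` pause hint), and the delay constant is illustrative only.

```cpp
#include <atomic>
#include <cstdint>

// Stand-in for InnoDB's ut_delay(): spin for roughly `n` iterations.
// The real implementation issues a CPU relaxation hint (PAUSE/YIELD)
// via MY_RELAX_CPU; the volatile access here merely keeps the
// compiler from folding the loop away.
static void ut_delay_sketch(unsigned n) {
    for (unsigned i = 0; i < n; i++) {
        volatile unsigned x = i;
        (void)x;
    }
}

// Simplified rw_lock_lock_word_decr() with the proposed back-off:
// on a failed compare-exchange (a 'fake conflict'), pause briefly
// before retrying instead of immediately hammering the cache line.
bool lock_word_decr_with_delay(std::atomic<int32_t>& lock_word,
                               int32_t amount, int32_t threshold) {
    int32_t lock_copy = lock_word.load(std::memory_order_relaxed);
    while (lock_copy > threshold) {
        if (lock_word.compare_exchange_strong(
                lock_copy, lock_copy - amount,
                std::memory_order_acquire,
                std::memory_order_relaxed)) {
            return true;  // latch registered
        }
        ut_delay_sketch(4);  // proposed delay before the next attempt
    }
    return false;  // real conflict: caller falls back to waiting
}
```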


 Comments   
Comment by Krunal Bauskar [ 2020-09-01 ]

This task is blocked by MDEV-23633, since the optimization/performance difference is only observed once ARM is corrected to use the optimal ut_delay logic.

Comment by Krunal Bauskar [ 2020-09-01 ]

The attached performance graphs were produced with the ut_delay ARM optimization (using a compiler barrier instead of CAS) plus the delay in acquiring latches.

Comment by Krunal Bauskar [ 2020-09-08 ]
  • MDEV-23633 was merged to trunk (10.4) last week.
  • Using that as a base, we evaluated the said idea and found that it has the potential to improve performance further.
  • A detailed report, along with benchmarking numbers, has been attached as a PDF in the files section.
Comment by Marko Mäkelä [ 2020-09-09 ]

Back in May, I tried a similar idea to reduce contention on the latches that protect buf_pool.page_hash. In the end, in MDEV-22871 I implemented a simpler type of rw-lock for that use case. The contention point that we saw in benchmarks was related to ‘fake conflicts’ between shared latch acquisitions, which are invoking the following code:

UNIV_INLINE
bool
rw_lock_lock_word_decr(
/*===================*/
	rw_lock_t*	lock,		/*!< in/out: rw-lock */
	int32_t		amount,		/*!< in: amount to decrement */
	int32_t		threshold)	/*!< in: threshold of judgement */
{
	int32_t lock_copy = lock->lock_word;
 
	while (lock_copy > threshold) {
		if (lock->lock_word.compare_exchange_strong(
			lock_copy,
			lock_copy - amount,
			std::memory_order_acquire,
			std::memory_order_relaxed)) {
 
			return(true);
		}
	}
	return(false);
}

Before implementing MDEV-22871, I tried adding ut_delay() to the while loop, and it seemed to help a little in some but not all benchmarks. I find the idea of delaying in the middle of the non-conflicting S-latch acquisition somewhat counter-intuitive. The compare-and-swap in the above code has two possible outcomes: either a real conflict was detected, and we should abort, or a ‘fake conflict’ was detected and we should retry. If we delay too long, we could get into a continuous stream of ‘fake conflicts’.

Side note: in rw_lock_lock_word_decr() we could avoid the initial read of lock->lock_word by starting with the constant value

	int32_t lock_copy = X_LOCK_DECR;

and let the while loop take care of correcting our ‘guess’.
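The side note above can be written out as follows. This is an illustrative sketch, not a patch: the `X_LOCK_DECR` value and function name are stand-ins. The key point is that a failed `compare_exchange_strong` stores the observed value back into `lock_copy`, so the loop corrects a wrong guess by itself.

```cpp
#include <atomic>
#include <cstdint>

// Illustrative constant; in InnoDB, X_LOCK_DECR is the lock_word
// value of a fully unlocked rw-lock.
static constexpr int32_t X_LOCK_DECR = 0x20000000;

// Start from the optimistic constant guess instead of reading
// lock_word first. If the guess is wrong, the failed CAS writes the
// actual value into lock_copy and the next iteration uses it.
bool lock_word_decr_guess(std::atomic<int32_t>& lock_word,
                          int32_t amount, int32_t threshold) {
    int32_t lock_copy = X_LOCK_DECR;  // guess: lock is uncontended
    while (lock_copy > threshold) {
        if (lock_word.compare_exchange_strong(
                lock_copy, lock_copy - amount,
                std::memory_order_acquire,
                std::memory_order_relaxed)) {
            return true;
        }
        // lock_copy now holds the value that was actually observed
    }
    return false;
}
```

This saves one plain load on the uncontended fast path at the cost of one extra CAS attempt whenever the guess is wrong.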

I would rather fix the S-latch acquisition to be a simple std::atomic::fetch_add(), with no possibility of ‘soft’ conflicts at all. This would require adjusting all of the rw_lock_t code so that the X-latch and SX-latch holders would tolerate concurrent ‘probes’ made by readers, similar to how the MDEV-22871 rw_lock::read_trylock() and rw_lock::read_lock_yield() work.
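The fetch_add() idea, in the spirit of MDEV-22871's rw_lock::read_trylock() and rw_lock::read_lock_yield(), might look like the sketch below. The struct, the `WRITER` bit layout, and the method names are illustrative stand-ins, not InnoDB's actual definitions.

```cpp
#include <atomic>
#include <cstdint>

// Writer-present flag in the lock word; illustrative layout only.
static constexpr uint32_t WRITER = 1U << 31;

struct sketch_rw_lock {
    std::atomic<uint32_t> word{0};

    // Probe: unconditionally register as a reader with a single
    // fetch_add (no CAS loop, hence no 'fake conflicts'), then check
    // whether an X/SX holder was present. The writer side must
    // tolerate these transient reader increments.
    bool read_trylock() {
        uint32_t w = word.fetch_add(1, std::memory_order_acquire);
        if (w & WRITER) {
            read_lock_yield();  // real conflict: undo the probe
            return false;
        }
        return true;
    }

    void read_lock_yield() {
        word.fetch_sub(1, std::memory_order_release);
    }

    void read_unlock() {
        word.fetch_sub(1, std::memory_order_release);
    }
};
```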

Comment by Marko Mäkelä [ 2020-09-29 ]

I think that we should first try replacing the two os_event_t in rw_lock_t with a mutex and two condition variables, and see if that would help. Yes, that code is outside the spinloop, but it could affect some memory bus traffic. Also, perhaps we should try to ensure that some of the fields are either in the same or in a different cache line.
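The os_event_t replacement suggested above might be shaped roughly like this. Everything here is an assumption for illustration: the struct name, the mapping of the two condition variables to the two events, and the placeholder predicates standing in for the real lock_word checks.

```cpp
#include <condition_variable>
#include <mutex>

// One mutex plus two condition variables in place of the two
// os_event_t in rw_lock_t: one for a waiting X/SX requester, one for
// waiting S requesters. Predicates are placeholders for the real
// lock_word conditions.
struct rw_lock_waiters {
    std::mutex mtx;
    std::condition_variable wait_ex;  // replaces wait_ex_event
    std::condition_variable wait_s;   // replaces event

    template <typename Pred>
    void wait_for_exclusive(Pred lock_is_free) {
        std::unique_lock<std::mutex> lk(mtx);
        wait_ex.wait(lk, lock_is_free);
    }

    template <typename Pred>
    void wait_for_shared(Pred no_writer) {
        std::unique_lock<std::mutex> lk(mtx);
        wait_s.wait(lk, no_writer);
    }

    void release_all() {
        // Hold the mutex while signalling to avoid lost wakeups.
        std::lock_guard<std::mutex> lk(mtx);
        wait_ex.notify_one();
        wait_s.notify_all();
    }
};
```

One possible benefit is layout control: unlike opaque os_event_t objects, these members can be placed in (or kept out of) the same cache line as the hot lock_word, which ties in with the cache-line remark above.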

Comment by Marko Mäkelä [ 2021-01-20 ]

krunalbauskar, in MDEV-24167 and MDEV-24142 in 10.6, the old rw_lock_t was replaced with srw_lock and ssux_lock. Have you run benchmarks on that code?

My idea to make S-latch acquisition a simple fetch_add() had to be abandoned, because it caused writer starvation (MDEV-24271). The compare-and-swap loop does look ugly, but I was not able to come up with anything better.

Comment by Krunal Bauskar [ 2021-01-20 ]

@marko

Not yet. The first thing to try is a plain 10.6 benchmark, and then explore based on the findings.
The said patch may still help older releases like 10.4 if we plan to consider optimizing them.

Comment by Marko Mäkelä [ 2021-01-20 ]

I think that it should be fine to add an ARM-specific optimization to older releases with an ARM-specific #ifdef.
After the lesson of MDEV-23475 (and MDEV-24272), I would be wary of changing anything on AMD64 in GA releases (10.5 or older). It could only be done after some extensive benchmarking with different CPU microarchitectures. We know that the latency of the PAUSE instruction (which is a critical part of MY_RELAX_CPU or ut_delay on IA-32 and AMD64) was drastically changed by Intel in Skylake, and possibly in later microarchitectures as well.

Comment by Marko Mäkelä [ 2021-07-30 ]

Somewhat related to this (and possibly improving the 10.6 implementation), https://rigtorp.se/spinlock/ discusses spinlocks (mutexes) and claims that spinning on a read-modify-write instruction is less efficient than spinning on a plain read instruction. It is not immediately obvious to me how that could be applied to rw-locks, but maybe you could experiment with that, krunalbauskar?
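For reference, the pattern from the linked article is the classic test-and-test-and-set spinlock. This sketch (names are illustrative) spins on a plain load, which keeps the cache line in shared state, and attempts the read-modify-write exchange() only once the lock looks free:

```cpp
#include <atomic>

struct ttas_spinlock {
    std::atomic<bool> locked{false};

    void lock() {
        for (;;) {
            // RMW attempt: only this bounces the line into
            // exclusive state on other cores.
            if (!locked.exchange(true, std::memory_order_acquire))
                return;
            // Spin on the cheap read until the lock looks free;
            // a CPU relaxation hint (PAUSE/YIELD) would go here.
            while (locked.load(std::memory_order_relaxed)) {
            }
        }
    }

    void unlock() {
        locked.store(false, std::memory_order_release);
    }
};
```

Mapping this read-before-RMW idea onto the rw-lock's decrement-based lock word is the open question raised above.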

Comment by Krunal Bauskar [ 2021-08-02 ]

@Marko,

I went over the article. It suggests checking the variable with a plain read before attempting to write it. Fortunately, our server (like most new-generation systems) uses compare_exchange_strong there, thereby simulating the same behavior, but more efficiently at the processor level.


Just to add a note for a wider audience: compare_exchange can be looked upon as 2 ops:
compare succeeds (the expected value is present), then try to exchange ... a RARE event while the lock is being acquired;
compare fails (the expected value is absent), return immediately with the observed value ... the common scenario, leading to the loop.
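The two outcomes described above can be demonstrated in a few lines; `demo()` is a hypothetical helper, not server code. Note that on a mismatch, compare_exchange_strong writes the observed value back into `expected`, which is what lets the retry loop continue without a separate load:

```cpp
#include <atomic>
#include <cstdint>

// Returns the value the word holds after the call: `desired` if the
// compare matched and the exchange happened, otherwise the value that
// was actually observed (written back into `expected` on failure).
int32_t demo(std::atomic<int32_t>& word, int32_t expected,
             int32_t desired) {
    if (word.compare_exchange_strong(expected, desired)) {
        return desired;   // compare succeeded: exchanged
    }
    return expected;      // compare failed: holds the observed value
}
```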
