[MDEV-22850] Reduce buf_pool.page_hash latch contention Created: 2020-06-10  Updated: 2020-07-15  Resolved: 2020-06-11

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.5
Fix Version/s: 10.5.4

Type: Bug Priority: Major
Reporter: Marko Mäkelä Assignee: Marko Mäkelä
Resolution: Fixed Votes: 0
Labels: Scalability, performance

Issue Links:
Relates
relates to MDEV-19929 Add a startup message about PAUSE ins... Open
relates to MDEV-22859 Possible hang in InnoDB rw_lock_lock_... Closed
relates to MDEV-22871 Contention on the buf_pool.page_hash Closed
relates to MDEV-14659 Innodb scalibility issue found in Mar... Closed
relates to MDEV-15053 Reduce buf_pool_t::mutex contention Closed

 Description   

The changes that were introduced in MDEV-15053 sometimes cause severe performance loss when there is contention on buf_pool.page_hash. At the root is the invocation of rw_lock_lock_word_decr(lock, 1, 0) in rw_lock_s_lock_low(). That function loops as follows:

UNIV_INLINE
bool
rw_lock_lock_word_decr(
/*===================*/
	rw_lock_t*	lock,		/*!< in/out: rw-lock */
	int32_t		amount,		/*!< in: amount to decrement */
	int32_t		threshold)	/*!< in: threshold of judgement */
{
	int32_t lock_copy = lock->lock_word;
 
	while (lock_copy > threshold) {
		if (lock->lock_word.compare_exchange_strong(
			lock_copy,
			lock_copy - amount,
			std::memory_order_acquire,
			std::memory_order_relaxed)) {
 
			return(true);
		}
		(void) LF_BACKOFF();
	}
	return(false);
}

Note: the LF_BACKOFF() call was added in an attempt to mitigate the problem.

It feels strange that we first load lock_copy and then insist on an exact match against that value. This could cause unnecessary contention between non-conflicting S-latch requests. The following naïve patch should alleviate that:

diff --git a/storage/innobase/include/sync0rw.ic b/storage/innobase/include/sync0rw.ic
index 18a6c69ece4..cec796c8892 100644
--- a/storage/innobase/include/sync0rw.ic
+++ b/storage/innobase/include/sync0rw.ic
@@ -209,20 +209,18 @@ rw_lock_lock_word_decr(
 	int32_t		amount,		/*!< in: amount to decrement */
 	int32_t		threshold)	/*!< in: threshold of judgement */
 {
-	int32_t lock_copy = lock->lock_word;
-
-	while (lock_copy > threshold) {
-		if (lock->lock_word.compare_exchange_strong(
-			lock_copy,
-			lock_copy - amount,
-			std::memory_order_acquire,
-			std::memory_order_relaxed)) {
+  for (;;)
+  {
+    int32_t lock_copy= lock->lock_word;
 
-			return(true);
-		}
-		(void) LF_BACKOFF();
-	}
-	return(false);
+    if (lock_copy <= threshold)
+      return false;
+    if (lock->lock_word.compare_exchange_strong(lock_copy, lock_copy - amount,
+						std::memory_order_acquire,
+						std::memory_order_relaxed))
+      return true;
+    (void) LF_BACKOFF();
+  }
 }
 
 /******************************************************************//**

For even better results, we might want to rewrite the S-latch acquisition, as in the following incomplete patch. At least rw_lock_x_unlock_func() would have to be adjusted as well:

diff --git a/storage/innobase/include/sync0rw.ic b/storage/innobase/include/sync0rw.ic
index 18a6c69ece4..32ad6497a35 100644
--- a/storage/innobase/include/sync0rw.ic
+++ b/storage/innobase/include/sync0rw.ic
@@ -239,8 +239,9 @@ rw_lock_s_lock_low(
 	const char*	file_name, /*!< in: file name where lock requested */
 	unsigned	line)	/*!< in: line where requested */
 {
-	if (!rw_lock_lock_word_decr(lock, 1, 0)) {
+	if (lock->lock_word.fetch_sub(1, std::memory_order_acquire) <= 0) {
 		/* Locking did not succeed */
+		lock->lock_word.fetch_add(1, std::memory_order_release);
 		return(FALSE);
 	}
 

Also, somewhat related to this, we might want to replace the rw_lock_s_lock(l) call in buf_pool_t::page_hash_lock() with a non-inlined function call, because that code is extremely unlikely to be executed: it only runs while the buffer pool is being resized.



 Comments   
Comment by Marko Mäkelä [ 2020-06-10 ]

Related to this, measurements on two Intel Cascade Lake based systems suggest that the PAUSE instruction latency was reverted from the Skylake microarchitecture's roughly 120 clock cycles back to pre-Skylake levels (about 12 clock cycles). To compensate for that, a larger value of my_cpu_relax_multiplier would seem useful. We should also finally implement a startup message about it (MDEV-19929).

Comment by Marko Mäkelä [ 2020-06-10 ]

I filed and rejected MDEV-22859 for the suspected hang due to failing to reload lock_copy.

My suggested patch cannot help after all: lock_copy is in fact reloaded within the loop, by the compare_exchange_strong() operation itself.

Adding a redundant load of lock_copy to the loop would hurt performance rather than improve it.

The performance regression on Intel Cascade Lake remains unresolved, and we should still review how to reduce the buf_pool.page_hash X-latch hold time, which should be the only factor contributing to this contention.

Comment by Marko Mäkelä [ 2020-06-10 ]

I believe that we can rely solely on buf_pool.mutex on some code paths where we are currently acquiring the rw-lock on buf_pool.page_hash. Currently, my fix is causing hangs especially in encryption tests.

I now believe there is no need to have a LF_BACKOFF() call in rw_lock_lock_word_decr(). After all, we are terminating the spinloop as soon as a conflict is detected.

Comment by Marko Mäkelä [ 2020-06-11 ]

I was able to reduce the buf_pool.page_hash X-latch hold time in buf_page_init_for_read() and buf_page_create(), which I hope should address the root cause of this.

That change includes the following:

diff --git a/storage/innobase/include/sync0rw.ic b/storage/innobase/include/sync0rw.ic
index 18a6c69ece4..c881c2fcdf1 100644
--- a/storage/innobase/include/sync0rw.ic
+++ b/storage/innobase/include/sync0rw.ic
@@ -220,8 +220,14 @@ rw_lock_lock_word_decr(
 
 			return(true);
 		}
-		(void) LF_BACKOFF();
+
+		/* Note that lock_copy was reloaded above. We will
+		keep trying if a spurious conflict occurred, typically
+		caused by concurrent executions of
+		rw_lock_s_lock(). */
 	}
+
+	/* A real conflict was detected. */
 	return(false);
 }
 

It remains to be seen whether having the LF_BACKOFF() could still be beneficial. After all, we could still have many concurrent S-latch requests on the buf_pool.page_hash.

A proper fix for the spurious S-latch contention (if it still remains an issue) might be to implement a simpler variant of rw_lock_t that supports neither SX-latches (we only need those for dict_index_t::lock, buf_block_t::lock and possibly fil_space_t::latch) nor recursive latches, and that implements S-latch acquisition by fetch_add(1).

Comment by Marko Mäkelä [ 2020-06-11 ]

To be on the safe side with regard to performance when the number of concurrently executing threads exceeds the number of available CPU cores, I decided to retain the LF_BACKOFF() call after all, even though krunalbauskar found it to reduce performance on one system. That call will be removed in MDEV-22871 once we have migrated buf_pool.page_hash and possibly other hot rw-locks to use a more efficient S-latch acquisition protocol.

In buf_page_init_for_read(), we must acquire the buf_pool.page_hash latch early to prevent a race condition with buf_pool_t::watch_remove(). So, basically this fix is only reducing contention in buf_page_create(), which should mostly benefit read-write use cases.
