MDEV-22850: Reduce buf_pool.page_hash latch contention


    Description

      The changes that were introduced in MDEV-15053 are sometimes causing severe performance loss when there is contention on buf_pool.page_hash. At the root is the invocation of rw_lock_lock_word_decr(lock, 1, 0) in rw_lock_s_lock_low(). That function loops:

      UNIV_INLINE
      bool
      rw_lock_lock_word_decr(
      /*===================*/
      	rw_lock_t*	lock,		/*!< in/out: rw-lock */
      	int32_t		amount,		/*!< in: amount to decrement */
      	int32_t		threshold)	/*!< in: threshold of judgement */
      {
      	int32_t lock_copy = lock->lock_word;
       
      	while (lock_copy > threshold) {
      		if (lock->lock_word.compare_exchange_strong(
      			lock_copy,
      			lock_copy - amount,
      			std::memory_order_acquire,
      			std::memory_order_relaxed)) {
       
      			return(true);
      		}
      		(void) LF_BACKOFF();
      	}
      	return(false);
      }
      

      Note: the LF_BACKOFF() call was added in an earlier attempt to mitigate the problem.

      It feels strange that we are first loading lock_copy and then insisting that we will get an exact match on that lock_copy value. It seems that this could cause unnecessary contention between non-conflicting S-latch requests. The following naïve patch should alleviate that:

      diff --git a/storage/innobase/include/sync0rw.ic b/storage/innobase/include/sync0rw.ic
      index 18a6c69ece4..cec796c8892 100644
      --- a/storage/innobase/include/sync0rw.ic
      +++ b/storage/innobase/include/sync0rw.ic
      @@ -209,20 +209,18 @@ rw_lock_lock_word_decr(
       	int32_t		amount,		/*!< in: amount to decrement */
       	int32_t		threshold)	/*!< in: threshold of judgement */
       {
      -	int32_t lock_copy = lock->lock_word;
      -
      -	while (lock_copy > threshold) {
      -		if (lock->lock_word.compare_exchange_strong(
      -			lock_copy,
      -			lock_copy - amount,
      -			std::memory_order_acquire,
      -			std::memory_order_relaxed)) {
      +  for (;;)
      +  {
      +    int32_t lock_copy= lock->lock_word;
       
      -			return(true);
      -		}
      -		(void) LF_BACKOFF();
      -	}
      -	return(false);
      +    if (lock_copy <= threshold)
      +      return false;
      +    if (lock->lock_word.compare_exchange_strong(lock_copy, lock_copy - amount,
      +						std::memory_order_acquire,
      +						std::memory_order_relaxed))
      +      return true;
      +    (void) LF_BACKOFF();
      +  }
       }
       
       /******************************************************************//**
      

      For even better results, we might want to rewrite the S-latch acquisition, as in the following incomplete patch. At least rw_lock_x_unlock_func() would have to be adjusted as well:

      diff --git a/storage/innobase/include/sync0rw.ic b/storage/innobase/include/sync0rw.ic
      index 18a6c69ece4..32ad6497a35 100644
      --- a/storage/innobase/include/sync0rw.ic
      +++ b/storage/innobase/include/sync0rw.ic
      @@ -239,8 +239,9 @@ rw_lock_s_lock_low(
       	const char*	file_name, /*!< in: file name where lock requested */
       	unsigned	line)	/*!< in: line where requested */
       {
      -	if (!rw_lock_lock_word_decr(lock, 1, 0)) {
      +	if (lock->lock_word.fetch_sub(1, std::memory_order_acquire) <= 0) {
       		/* Locking did not succeed */
      +		lock->lock_word.fetch_add(1, std::memory_order_release);
       		return(FALSE);
       	}
       
      
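      In isolation, the idea can be sketched as follows (a hedged illustration, not the final patch): acquire optimistically with fetch_sub() and roll back on failure. The transient negative excursion of lock_word on a failed attempt is exactly why other code that inspects lock_word, such as rw_lock_x_unlock_func(), would need adjusting:

      #include <atomic>
      #include <cstdint>
      
      /* Sketch of the optimistic S-lock acquisition from the patch above,
      shown in isolation. A failed attempt briefly drives lock_word below
      the value a pure reader-writer protocol would allow; any code that
      inspects lock_word concurrently must tolerate such transient blips. */
      static bool try_s_lock(std::atomic<int32_t>& lock_word)
      {
        if (lock_word.fetch_sub(1, std::memory_order_acquire) <= 0)
        {
          /* Locking did not succeed: undo the optimistic decrement. */
          lock_word.fetch_add(1, std::memory_order_release);
          return false;
        }
        return true;
      }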

      Also, somewhat related to this, we might want to replace the rw_lock_s_lock(l) call in buf_pool_t::page_hash_lock() with a non-inlined function call, because that code is extremely unlikely to be executed: it only runs while the buffer pool is being resized.
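      The shape of that change could look roughly like this (an illustrative sketch only; the struct and function names are hypothetical, not the actual MariaDB declarations, and the GCC/Clang attributes are an assumption about the toolchain):

      #include <atomic>
      #include <cstdint>
      
      struct example_page_hash_latch
      {
        std::atomic<int32_t> lock_word{0};
      
        /* Cold path, deliberately not inlined: it is only reached while
        the buffer pool is being resized, so moving it out of line keeps
        the inlined fast path of every caller small. */
        __attribute__((noinline, cold)) void s_lock_resize_path();
      
        void s_lock(bool resizing)
        {
          if (__builtin_expect(resizing, 0))
            s_lock_resize_path();     /* rare: resize in progress */
          else
            lock_word.fetch_add(1, std::memory_order_acquire); /* common case */
        }
      };
      
      void example_page_hash_latch::s_lock_resize_path()
      {
        /* Stand-in for the full rw_lock_s_lock() protocol. */
        lock_word.fetch_add(1, std::memory_order_acquire);
      }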


          Activity

            Marko Mäkelä added a comment:

            Related to this, measurements on two Intel Cascade Lake based systems suggest that the PAUSE instruction latency was reverted from the Skylake microarchitecture’s roughly 120 clock cycles to pre-Skylake levels (about 12 clock cycles). To compensate, a larger value of my_cpu_relax_multiplier would seem to be useful. We should also finally implement a startup message about it (MDEV-19929).
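            For context, a PAUSE-based backoff in the spirit of LF_BACKOFF() looks roughly like this (a generic sketch, not MariaDB’s actual definition; _mm_pause() assumes an x86 compilation target). The per-iteration cost is exactly the PAUSE latency discussed above, which is why a microarchitecture-dependent multiplier matters:

            #include <immintrin.h>  /* _mm_pause(); assumes an x86 target */
            
            /* Generic sketch of a tunable PAUSE-based backoff. One PAUSE
            costs roughly 12 cycles on pre-Skylake and Cascade Lake cores
            but about 120 cycles on Skylake, so the multiplier must be
            calibrated per CPU (cf. my_cpu_relax_multiplier). */
            static inline void cpu_backoff(unsigned multiplier)
            {
              for (unsigned i= 0; i < multiplier; i++)
                _mm_pause(); /* yield pipeline resources to the sibling thread */
            }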

            Marko Mäkelä added a comment:

            I filed and rejected MDEV-22859 for the suspected hang due to failing to reload lock_copy.

            My suggested patch cannot be helpful after all: lock_copy is actually being reloaded in the loop, by the compare_exchange_strong() operation. Adding a redundant load of lock_copy to the loop would hurt performance, not improve it.

            The performance regression on Intel Cascade Lake remains unresolved, and we should still review how we could reduce the buf_pool.page_hash X-latch hold time, which should be the only factor contributing to this contention.
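            That reload is standard std::atomic behaviour: when compare_exchange_strong() fails, it stores the freshly observed value into its expected argument, so a retry loop never needs a separate load. A minimal standalone demonstration:

            #include <atomic>
            #include <cassert>
            #include <cstdint>
            
            int main()
            {
              std::atomic<int32_t> word{5};
              int32_t expected= 7;             /* deliberately stale */
              bool ok= word.compare_exchange_strong(expected, 6);
              assert(!ok);                     /* the exchange failed... */
              assert(expected == 5);           /* ...but expected was reloaded */
              return 0;
            }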

            Marko Mäkelä added a comment:

            I believe that we can rely solely on buf_pool.mutex on some code paths where we are currently acquiring the rw-lock on buf_pool.page_hash. Currently, my fix is causing hangs, especially in encryption tests.

            I now believe that there is no need for an LF_BACKOFF() call in rw_lock_lock_word_decr(): after all, we terminate the spin loop as soon as a conflict is detected.

            Marko Mäkelä added a comment:

            I was able to reduce the buf_pool.page_hash X-latch hold time in buf_page_init_for_read() and buf_page_create(), which I hope addresses the root cause of this.

            That change includes the following:

            diff --git a/storage/innobase/include/sync0rw.ic b/storage/innobase/include/sync0rw.ic
            index 18a6c69ece4..c881c2fcdf1 100644
            --- a/storage/innobase/include/sync0rw.ic
            +++ b/storage/innobase/include/sync0rw.ic
            @@ -220,8 +220,14 @@ rw_lock_lock_word_decr(
             
             			return(true);
             		}
            -		(void) LF_BACKOFF();
            +
            +		/* Note that lock_copy was reloaded above. We will
            +		keep trying if a spurious conflict occurred, typically
            +		caused by concurrent executions of
            +		rw_lock_s_lock(). */
             	}
            +
            +	/* A real conflict was detected. */
             	return(false);
             }
             

            It remains to be seen whether keeping the LF_BACKOFF() could still be beneficial; after all, we could still have many concurrent S-latch requests on buf_pool.page_hash.

            A proper fix for the spurious S-latch contention (if it still remains an issue) might be to implement a simpler variant of rw_lock_t that supports neither SX-latches (we only need those for dict_index_t::lock, buf_block_t::lock and possibly fil_space_t::latch) nor recursive latches, and implements S-latch acquisition by fetch_add(1).
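            A hypothetical sketch of such a simplified latch (names, the writer-bit encoding, and the spin policy are all illustrative, not the eventual implementation): S-latch acquisition is an unconditional fetch_add(1), so concurrent readers can never fail against each other, while only writers need a compare-and-swap:

            #include <atomic>
            #include <cstdint>
            
            /* Illustrative only: a minimal shared/exclusive latch with no
            SX mode and no recursion, acquiring the S-latch by fetch_add(1). */
            class simple_rw_lock
            {
              static constexpr uint32_t WRITER= 1U << 31;
              std::atomic<uint32_t> word{0};
            public:
              void s_lock()
              {
                /* Unconditional increment: concurrent S-latch requests can
                never conflict with each other, unlike a CAS retry loop. */
                while (word.fetch_add(1, std::memory_order_acquire) & WRITER)
                {
                  /* A writer is active: undo the increment and wait. */
                  word.fetch_sub(1, std::memory_order_relaxed);
                  while (word.load(std::memory_order_relaxed) & WRITER) {}
                }
              }
              void s_unlock() { word.fetch_sub(1, std::memory_order_release); }
            
              void x_lock()
              {
                uint32_t expected= 0;
                /* Only the writer needs a compare-and-swap: the whole
                word must be 0 (no readers, no writer). */
                while (!word.compare_exchange_weak(expected, WRITER,
                                                   std::memory_order_acquire,
                                                   std::memory_order_relaxed))
                  expected= 0;
              }
              void x_unlock() { word.fetch_sub(WRITER, std::memory_order_release); }
            };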

            Marko Mäkelä added a comment:

            To be on the safe side with regard to performance when the number of concurrently executing threads exceeds the number of available CPU cores, I decided to retain the LF_BACKOFF() call after all, even though krunalbauskar found it to reduce performance on one system. That call will be removed in MDEV-22871 once we have migrated buf_pool.page_hash and possibly other hot rw-locks to a more efficient S-latch acquisition protocol.

            In buf_page_init_for_read(), we must acquire the buf_pool.page_hash latch early to prevent a race condition with buf_pool_t::watch_remove(). So this fix basically only reduces contention in buf_page_create(), which should mostly benefit read-write workloads.

            People

              Assignee: Marko Mäkelä
              Reporter: Marko Mäkelä