[MDEV-22871] Contention on the buf_pool.page_hash Created: 2020-06-11  Updated: 2021-10-14  Resolved: 2020-06-18

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.5
Fix Version/s: 10.5.4

Type: Bug Priority: Blocker
Reporter: Marko Mäkelä Assignee: Marko Mäkelä
Resolution: Fixed Votes: 0
Labels: Scalability, performance

Issue Links:
Blocks
blocks MDEV-19929 Add a startup message about PAUSE ins... Open
blocks MDEV-21452 Use condition variables and normal mu... Closed
is blocked by MDEV-22877 Avoid unnecessary buf_pool.page_hash ... Closed
Problem/Incident
causes MDEV-23070 Duplicated counting of MONITOR_ADAPTI... Closed
causes MDEV-23369 False sharing in page_hash_latch::rea... Closed
causes MDEV-23998 Race condition between buf_page_optim... Closed
causes MDEV-24271 rw_lock::read_lock_yield() may cause ... Closed
causes MDEV-25815 mariabackup crash or debug assert wit... Closed
causes MDEV-26033 Race condition between buf_pool.page_... Closed
causes MDEV-26828 Spinning on buf_pool.page_hash is was... Closed
Relates
relates to MDEV-24090 Optimize buf_page_optimistic_get() Open
relates to MDEV-26609 Avoid deriving ELEMENT_PER_LATCH from... Closed
relates to MDEV-26826 Duplicated computations of buf_pool.p... Closed
relates to MDEV-14659 Innodb scalibility issue found in Mar... Closed
relates to MDEV-15053 Reduce buf_pool_t::mutex contention Closed
relates to MDEV-22850 Reduce buf_pool.page_hash latch conte... Closed

 Description   

MDEV-15053 reduced some buf_pool.mutex contention, shifting some contention to the buf_pool.page_hash latches.

MDEV-22850 removed some genuine contention on the latches by reducing the X-latch hold time.

However, based on earlier perf record analysis, we seem to have some spurious contention on concurrent S-latch requests. I believe that this is due to the spin loop in rw_lock_lock_word_decr(), which rw_lock_s_lock() executes.

Spurious S-latch conflicts could be addressed by implementing a bare-bones rw-lock variant that uses std::atomic::fetch_add() for S-latch acquisition and compare_exchange_strong() for X-latch acquisition. This variant would not need to support recursive latching at all.

As far as I understand, only dict_index_t::lock, buf_block_t::lock, fil_space_t::latch, dict_sys.latch, and purge_sys.latch may truly require additional ‘bells and whistles’, such as recursive X-locks or SX-locks, instrumentation, and event-based wakeup of waiting requests.

Most other types of InnoDB rw-locks should be held only for extremely short durations, so we might best implement both S-latch and X-latch acquisition with simple spin loops (an inlined atomic operation, possibly followed by a subroutine call for the spin loop).



 Comments   
Comment by Marko Mäkelä [ 2020-06-11 ]

Rough idea for implementation:

#include <atomic>
#include <cassert>
#define ut_ad assert
#define ut_d(x)
inline int LF_BACKOFF() {return 0;}
 
class rw_lock
{
  std::atomic<uint32_t> lock;
  static constexpr uint32_t WRITER= 1U << 31;
 
  void read_lock_wait_for_write_unlock();
  void write_lock_wait_for_read_unlock();
public:
  rw_lock() : lock(0) {}
 
  void read_lock()
  {
    if (lock.fetch_add(1, std::memory_order_acquire) & WRITER)
      read_lock_wait_for_write_unlock();
  }
  void read_unlock()
  {
    ut_d(auto l=) lock.fetch_sub(1, std::memory_order_release);
    // ut_ad(l …); /* at least one S-lock must have been pending */
  }
  void write_lock()
  {
    uint32_t l= 0;
    if (lock.compare_exchange_strong(l, WRITER, std::memory_order_acquire, std::memory_order_relaxed))
      return;
    write_lock_wait_for_read_unlock();
  }
};
 
void rw_lock::read_lock_wait_for_write_unlock()
{
#if 0 // FIXME: starves write_lock_wait_for_read_unlock()
  while (lock.load(std::memory_order_relaxed) & WRITER)
    (void) LF_BACKOFF();
#else
  do
  {
    lock.fetch_sub(1, std::memory_order_relaxed);
    (void) LF_BACKOFF();
  }
  while (lock.fetch_add(1, std::memory_order_acquire) & WRITER);
#endif
}
void rw_lock::write_lock_wait_for_read_unlock()
{
  uint32_t l;
 
  do
  {
    l= 0;
    (void) LF_BACKOFF();
  }
  while (!lock.compare_exchange_strong(l, WRITER, std::memory_order_acquire, std::memory_order_relaxed));
}

The spin loops would be located in non-inlined functions in order to minimize the code footprint.

Comment by Marko Mäkelä [ 2020-06-12 ]

My prototype passed the stress testing of mleich, but according to krunalbauskar it did not yet improve performance when the number of concurrently executing threads exceeds the number of processor cores. He also reported a possible write_lock() related shutdown hang.

I have some further ideas to address the problems:

diff --git a/storage/innobase/buf/buf0buf.cc b/storage/innobase/buf/buf0buf.cc
index ddeeb7da6d3..3c06279ce0c 100644
--- a/storage/innobase/buf/buf0buf.cc
+++ b/storage/innobase/buf/buf0buf.cc
@@ -281,6 +281,8 @@ the read requests for the whole area.
 #ifndef UNIV_INNOCHECKSUM
 void rw_lock::read_lock_wait_for_write_unlock()
 {
+  ut_ad(!mutex_own(&buf_pool.mutex));
+
   auto l= lock.fetch_sub(1, std::memory_order_relaxed);
   DBUG_ASSERT(l & ~(WRITER | WRITER_WAITING));
   static_assert(UNLOCKED == 0, "compatibility");
@@ -289,6 +291,7 @@ void rw_lock::read_lock_wait_for_write_unlock()
   {
     if (!(l & (WRITER | WRITER_WAITING)))
     {
+retry:
       l= lock.fetch_add(1, std::memory_order_acquire);
       if (!(l & (WRITER | WRITER_WAITING)))
         return;
@@ -298,14 +301,26 @@ void rw_lock::read_lock_wait_for_write_unlock()
     }
     else
     {
-      (void) LF_BACKOFF();
-      l= lock.load(std::memory_order_relaxed);
+      /* Wait longer until write_unlock(), and do not starve any
+      write_lock_wait_for_unlock() */
+#if 1
+      os_thread_sleep(100);
+#else
+      /* All write_lock() holders are also holding
+      buf_pool.mutex, so we might use an event-based wait here.
+      But, do we really want to add contention to that busy mutex? */
+      mutex_enter(&buf_pool.mutex);
+      mutex_exit(&buf_pool.mutex);
+#endif
+      goto retry;
     }
   }
 }
 
 void rw_lock::write_lock_wait_for_unlock()
 {
+  ut_ad(mutex_own(&buf_pool.mutex));
+
   auto l= lock.fetch_or(WRITER_WAITING, std::memory_order_relaxed);
   do
   {

Because currently the sole user of rw_lock is the buf_pool.page_hash, such tight coupling is possible.

Extensive benchmarking with various experiments will be needed to move forward.

Comment by Marko Mäkelä [ 2020-06-13 ]

More observations and ideas:

  • We might implement some zero-overhead wrapper around the rw_lock so that each user may specify their own spin loop logic. (But we might also defer this until there are other users than the buf_pool.page_hash.)
  • For the read_lock_wait_for_write_unlock() spin loop, we might want to introduce a special primitive on buf_pool.mutex that would only wait for the mutex to be released, but not actually try acquiring it.
  • Instead of padding each rw_lock to occupy a full cache line (to avoid false sharing between independent latches), we might store some of the hash table ‘payload’ in the remainder of that line. Non-conflicting read_lock() acquisitions would invalidate each other’s cache lines (including some of those hash table elements), but maybe that is not an issue; write_lock() must invalidate those entries anyway.
  • Perhaps we would want to make the rw_lock an integral part of the hash table? That is, we could divide the hash table into partitions, each one sized and aligned to the cache line width. The first element would be the rw_lock object, and the remaining elements would be hash table payload. The srv_n_page_hash_locks would be removed.
Comment by Marko Mäkelä [ 2020-06-14 ]

It turns out that buf_pool_t::watch_set() is not holding buf_pool.mutex while acquiring an exclusive latch on buf_pool.page_hash. Therefore, we cannot use buf_pool.mutex to make read_lock() operations wait for write_unlock().

I pushed a revised version that implements most of my ideas. The spin loops can be tuned with innodb_sync_spin_loops and innodb_spin_wait_delay.

This revision seems to reduce read/write performance further. Deeper analysis will be needed to identify the remaining bottlenecks. If the problem is memory bus contention, I hope that eliminating page_hash_latch::pad and storing page_hash_latch interleaved with the hash table elements would address that.

Comment by Marko Mäkelä [ 2020-06-16 ]

To prepare for the elimination of page_hash_latch::pad, I simplified the InnoDB hash_table_t. It turns out that there were many unused data fields and unnecessarily allocated memory, such as page_hash_t::heaps. The reduction of memory footprint did not lead to notable performance improvements in my test.

The final step would be to move page_hash_latch to buf_pool.page_hash.array.

Comment by Marko Mäkelä [ 2020-06-17 ]

It seems that dedicating every 1024th element in buf_pool.page_hash.array to the page_hash_latch (which replaces the fat rw_lock_t) may have addressed the problem. Some benchmarks are still in progress.
I also removed unnecessary pointer indirection for InnoDB hash_table_t objects and cleaned up the partitioning of the adaptive hash index.

Comment by Matthias Leich [ 2020-06-17 ]

The tree
origin/bb-10.5-release 377aabb94d9a426ecd5fc31f07086c2d582b2b2d 2020-06-17T18:13:39+03:00
behaved well when running the RQG test battery for broad-range functional coverage.
No new asserts/crashes. The problems that showed up are known and not related to MDEV-22871.

Comment by Marko Mäkelä [ 2020-06-18 ]

It seems that the crucial step was to store a std::atomic<uint32_t> based rw-lock directly in buf_pool.page_hash.array, interleaved with the hash table payload.
I restructured my work into 4 commits that I tested independently, because the changes are rather extensive:
5155a300fab MDEV-22871: Reduce InnoDB buf_pool.page_hash contention
cfd3d70ccbb MDEV-22871: Remove pointer indirection for InnoDB hash_table_t
bf3c862faa8 MDEV-22871: Clean up btr_search_sys
9159b8976f7 MDEV-22871: Clean up hash_table_t

I examined some perf record traces, and the buf_pool.page_hash is not an obvious bottleneck at all. In a read-only test that I conducted myself, I did not see any waiting at all when running 50 threads against a server on a 10-core processor with 2×hyperthreading. With a smaller buffer pool and lots of page eviction going on, contention cannot be avoided, but even then it seems to be elsewhere (such as buf_pool.mutex).

Thanks to krunalbauskar and axel for the performance testing, and mleich for the stress testing.

Generated at Thu Feb 08 09:18:06 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.