MariaDB Server

MDEV-22871: Contention on the buf_pool.page_hash

Details

    Description

      MDEV-15053 reduced some buf_pool.mutex contention, shifting some contention to the buf_pool.page_hash latches.

      MDEV-22850 removed some genuine contention on the latches by reducing the X-latch hold time.

      But, based on earlier perf record analysis we seem to have some spurious contention on concurrent S-latch requests. I believe that this is due to the spin loop rw_lock_lock_word_decr() that rw_lock_s_lock() is executing.

      Spurious S-latch conflicts could be addressed by implementing a bare-bones variant of rw-locks using std::atomic::fetch_add() for the S-latch and compare_exchange_strong() for the X-latch acquisition. This variant would not need to support any recursive latches at all.

      As far as I understand, only dict_index_t::lock, buf_block_t::lock, fil_space_t::latch, dict_sys.latch, purge_sys.latch may truly require some additional ‘bells and whistles’, such as recursive X-locks or SX-locks, instrumentation, and event-based wakeup of waiting requests.

      Most other types of InnoDB rw-locks should be held for extremely short durations only, and we might best implement both S and X latch acquisition with simple spin loops (inlined atomic operation possibly followed by a subroutine call for the spin loop).
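As a minimal sketch of that shape (names are hypothetical, and the naive spin loops shown here would starve writers; the actual prototype appears in the comments below), the S-latch fast path would be a single inlined fetch_add(), the X-latch fast path a single compare_exchange_strong(), and only the contended case would call into a separate spin routine:

#include <atomic>

/* Hypothetical names; a fuller prototype is given in the comments below. */
class page_latch_sketch
{
  std::atomic<uint32_t> word{0};
  static constexpr uint32_t X= 1U << 31;

  /* Slow paths (spin loops); naive waits shown for brevity only. */
  void s_lock_contended()
  {
    while (word.load(std::memory_order_acquire) & X) {}
  }
  void x_lock_contended()
  {
    uint32_t expected;
    do { expected= 0; }
    while (!word.compare_exchange_strong(expected, X,
                                         std::memory_order_acquire,
                                         std::memory_order_relaxed));
  }
public:
  void s_lock()                 /* inlined fast path: one atomic fetch_add() */
  {
    if (word.fetch_add(1, std::memory_order_acquire) & X)
      s_lock_contended();
  }
  void s_unlock() { word.fetch_sub(1, std::memory_order_release); }
  void x_lock()                 /* inlined fast path: one CAS */
  {
    uint32_t expected= 0;
    if (!word.compare_exchange_strong(expected, X,
                                      std::memory_order_acquire,
                                      std::memory_order_relaxed))
      x_lock_contended();
  }
  void x_unlock() { word.fetch_and(~X, std::memory_order_release); }
};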

          Activity

            Rough idea for implementation:

#include <atomic>
#include <cassert>
#define ut_ad assert
#define ut_d(x)
inline int LF_BACKOFF() {return 0;}
 
class rw_lock
{
  /* Number of S-lock holders in the low 31 bits; the most significant
  bit is set while the X-lock is held. */
  std::atomic<uint32_t> lock;
  static constexpr uint32_t WRITER= 1U << 31;
 
  /* Out-of-line spin loops, so that the inlined fast paths stay small. */
  void read_lock_wait_for_write_unlock();
  void write_lock_wait_for_read_unlock();
public:
  rw_lock() : lock(0) {}
 
  void read_lock()
  {
    if (lock.fetch_add(1, std::memory_order_acquire) & WRITER)
      read_lock_wait_for_write_unlock();
  }
  void read_unlock()
  {
    ut_d(auto l=) lock.fetch_sub(1, std::memory_order_release);
    // ut_ad(l …); /* at least one S-lock must have been pending */
  }
  void write_lock()
  {
    uint32_t l= 0;
    if (lock.compare_exchange_strong(l, WRITER, std::memory_order_acquire,
                                     std::memory_order_relaxed))
      return;
    write_lock_wait_for_read_unlock();
  }
};
 
void rw_lock::read_lock_wait_for_write_unlock()
{
#if 0 // FIXME: starves write_lock_wait_for_read_unlock()
  while (lock.load(std::memory_order_relaxed) & WRITER)
    (void) LF_BACKOFF();
#else
  do
  {
    /* Undo our increment while backing off, so that a concurrent
    write_lock() can observe the lock word reaching 0. */
    lock.fetch_sub(1, std::memory_order_relaxed);
    (void) LF_BACKOFF();
  }
  while (lock.fetch_add(1, std::memory_order_acquire) & WRITER);
#endif
}
 
void rw_lock::write_lock_wait_for_read_unlock()
{
  uint32_t l= 0;
 
  do
  {
    l= 0;
    (void) LF_BACKOFF();
  }
  while (!lock.compare_exchange_strong(l, WRITER, std::memory_order_acquire,
                                       std::memory_order_relaxed));
}
            

            The spin loops would be located in non-inlined functions in order to minimize the code footprint.
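To make the sketch above more complete: it omits write_unlock(), and it can be exercised roughly as follows. The write_unlock() body, the hash_cell_t type and the lookup() function below are hypothetical additions for illustration, not part of the prototype.

/* Plausible completion, to be added inside class rw_lock above,
assuming write_unlock() merely clears the WRITER flag that
write_lock() set:

  void write_unlock()
  {
    lock.fetch_and(~WRITER, std::memory_order_release);
  }
*/

/* Illustrative use: guarding a lookup in one hash cell. */
struct hash_cell_t { void *node; };

void *lookup(rw_lock &latch, hash_cell_t &cell)
{
  latch.read_lock();            /* S-latch: concurrent readers are allowed */
  void *node= cell.node;        /* read the protected bucket contents */
  latch.read_unlock();
  return node;
}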

marko Marko Mäkelä added a comment

My prototype passed the stress testing of mleich, but according to krunalbauskar it does not yet improve performance when the number of concurrently executing threads exceeds the number of processor cores. He also reported a possible shutdown hang related to writer_lock().

            I have some further ideas to address the problems:

            diff --git a/storage/innobase/buf/buf0buf.cc b/storage/innobase/buf/buf0buf.cc
            index ddeeb7da6d3..3c06279ce0c 100644
            --- a/storage/innobase/buf/buf0buf.cc
            +++ b/storage/innobase/buf/buf0buf.cc
            @@ -281,6 +281,8 @@ the read requests for the whole area.
             #ifndef UNIV_INNOCHECKSUM
             void rw_lock::read_lock_wait_for_write_unlock()
             {
            +  ut_ad(!mutex_own(&buf_pool.mutex));
            +
               auto l= lock.fetch_sub(1, std::memory_order_relaxed);
               DBUG_ASSERT(l & ~(WRITER | WRITER_WAITING));
               static_assert(UNLOCKED == 0, "compatibility");
            @@ -289,6 +291,7 @@ void rw_lock::read_lock_wait_for_write_unlock()
               {
                 if (!(l & (WRITER | WRITER_WAITING)))
                 {
            +retry:
                   l= lock.fetch_add(1, std::memory_order_acquire);
                   if (!(l & (WRITER | WRITER_WAITING)))
                     return;
            @@ -298,14 +301,26 @@ void rw_lock::read_lock_wait_for_write_unlock()
                 }
                 else
                 {
            -      (void) LF_BACKOFF();
            -      l= lock.load(std::memory_order_relaxed);
            +      /* Wait longer until write_unlock(), and do not starve any
            +      write_lock_wait_for_unlock() */
            +#if 1
            +      os_thread_sleep(100);
            +#else
            +      /* All write_lock() holders are also holding
            +      buf_pool.mutex, so we might use an event-based wait here.
            +      But, do we really want to add contention to that busy mutex? */
            +      mutex_enter(&buf_pool.mutex);
            +      mutex_exit(&buf_pool.mutex);
            +#endif
            +      goto retry;
                 }
               }
             }
             
             void rw_lock::write_lock_wait_for_unlock()
             {
            +  ut_ad(mutex_own(&buf_pool.mutex));
            +
               auto l= lock.fetch_or(WRITER_WAITING, std::memory_order_relaxed);
               do
               {
            

            Because currently the sole user of rw_lock is the buf_pool.page_hash, such tight coupling is possible.
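Stated as a self-contained sketch (the UNLOCKED, WRITER and WRITER_WAITING constants follow the patch above, while the waiting primitive and all other details are simplified stand-ins, not the actual server code):

#include <atomic>
#include <cstdint>
#include <thread>

class rw_lock_sketch
{
  std::atomic<uint32_t> lock{0};
  static constexpr uint32_t UNLOCKED= 0;
  static constexpr uint32_t WRITER= 1U << 31;
  static constexpr uint32_t WRITER_WAITING= 1U << 30;
public:
  void read_lock()
  {
    /* Fast path: no writer is holding or waiting. */
    if (!(lock.fetch_add(1, std::memory_order_acquire)
          & (WRITER | WRITER_WAITING)))
      return;
    for (;;)
    {
      /* Undo the increment and back off until both flags are clear,
      so that a waiting writer is not starved. */
      lock.fetch_sub(1, std::memory_order_relaxed);
      while (lock.load(std::memory_order_relaxed)
             & (WRITER | WRITER_WAITING))
        std::this_thread::yield();      /* stand-in for os_thread_sleep() */
      if (!(lock.fetch_add(1, std::memory_order_acquire)
            & (WRITER | WRITER_WAITING)))
        return;
    }
  }

  void read_unlock() { lock.fetch_sub(1, std::memory_order_release); }

  void write_lock()
  {
    uint32_t l= UNLOCKED;
    if (lock.compare_exchange_strong(l, WRITER, std::memory_order_acquire,
                                     std::memory_order_relaxed))
      return;
    for (;;)
    {
      /* Announce the waiting writer; incoming readers will back off. */
      lock.fetch_or(WRITER_WAITING, std::memory_order_relaxed);
      /* Take over once only the WRITER_WAITING flag remains set. */
      l= WRITER_WAITING;
      if (lock.compare_exchange_strong(l, WRITER, std::memory_order_acquire,
                                       std::memory_order_relaxed))
        return;
      std::this_thread::yield();        /* stand-in for os_thread_sleep() */
    }
  }

  /* Clear only the WRITER flag, preserving WRITER_WAITING and any
  reader increments that raced with this X-latch holder. */
  void write_unlock() { lock.fetch_and(~WRITER, std::memory_order_release); }
};

The key difference from the first prototype is that a contended write_lock() sets WRITER_WAITING, so that incoming readers back off instead of repeatedly bumping the lock word and starving the writer.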

            Extensive benchmarking with various experiments will be needed to move forward.

marko Marko Mäkelä added a comment

            More observations and ideas:

• We might implement some zero-overhead wrapper around the rw_lock so that each user may specify their own spin loop logic. (But we might also defer this until there are users other than the buf_pool.page_hash.)
• For the read_lock_wait_for_write_unlock() spin loop, we might want to introduce a special primitive on buf_pool.mutex that would only wait for the mutex to be released, but not actually try acquiring it.
• Instead of padding each rw_lock to occupy one cache line (to avoid false sharing between independent latches), we might want to store some of the hash table ‘payload’ in the remainder of that cache line. Non-conflicting read_lock() acquisitions would invalidate each other’s cache lines (including some of those hash table elements), but maybe that is not an issue; write_lock() must invalidate those entries anyway.
• Perhaps we would want to make the rw_lock an integral part of the hash table? That is, we could divide the hash table into partitions, each one sized and aligned to the cache line width. The first element would be the rw_lock object, and the remaining elements would be hash table payload. The srv_n_page_hash_locks parameter would be removed. A rough layout sketch follows this list.
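A rough layout sketch of that last idea (assuming 64-byte cache lines and pointer-sized hash cells; the struct and field names are hypothetical):

#include <atomic>
#include <cstdint>

/* Hypothetical cache-line-sized page_hash partition: the latch word
shares the line with the hash cells it protects, instead of being
padded to a cache line of its own. */
struct alignas(64) page_hash_partition
{
  std::atomic<uint32_t> latch{0};                      /* lightweight rw-lock */
  void *cells[(64 - sizeof(std::atomic<uint32_t>)) / sizeof(void*)];
};

static_assert(sizeof(page_hash_partition) == 64,
              "one partition per cache line");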
marko Marko Mäkelä added a comment

            It turns out that buf_pool_t::watch_set() is not holding buf_pool.mutex while acquiring an exclusive latch on buf_pool.page_hash. Therefore, we cannot use buf_pool.mutex to make read_lock() operations wait for write_unlock().

            I pushed a revised version that implements most of my ideas. The spin loops can be tuned with innodb_sync_spin_loops and innodb_spin_wait_delay.
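For illustration, a generic tunable spin-then-yield wait of the kind those two settings control might look like this (spin_rounds and spin_delay are stand-ins for the configured values, and the delay loop is a placeholder for ut_delay(); this is not the actual server code):

#include <atomic>
#include <cstdint>
#include <thread>

/* Stand-ins for the values behind innodb_sync_spin_loops and
innodb_spin_wait_delay; the names and defaults are illustrative only. */
static unsigned spin_rounds= 30;
static unsigned spin_delay= 4;

/* Placeholder for ut_delay(): burn a short, tunable amount of time. */
static void spin_pause(unsigned delay)
{
  for (volatile unsigned i= 0; i < delay * 50; i++) {}
}

/* Spin up to spin_rounds times waiting for the flag bits to clear,
then fall back to yielding the CPU between polls. */
static void wait_until_clear(std::atomic<uint32_t> &word, uint32_t flags)
{
  for (unsigned i= 0; i < spin_rounds; i++)
  {
    if (!(word.load(std::memory_order_relaxed) & flags))
      return;
    spin_pause(spin_delay);
  }
  while (word.load(std::memory_order_relaxed) & flags)
    std::this_thread::yield();
}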

            This revision seems to reduce read/write performance further. Deeper analysis will be needed to identify the remaining bottlenecks. If the problem is memory bus contention, I hope that eliminating page_hash_latch::pad and storing page_hash_latch interleaved with the hash table elements would address that.

marko Marko Mäkelä added a comment - edited

            To prepare for the elimination of page_hash_latch::pad, I simplified the InnoDB hash_table_t. It turns out that there were many unused data fields and unnecessarily allocated memory, such as page_hash_t::heaps. The reduction of memory footprint did not lead to notable performance improvements in my test.

            The final step would be to move page_hash_latch to buf_pool.page_hash.array.

marko Marko Mäkelä added a comment

            It seems that dedicating every 1024th element in buf_pool.page_hash.array to the page_hash_latch (which replaces the fat rw_lock_t) may have addressed the problem. Some benchmarks are still in progress.
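For illustration, with 1024 array elements per latch, the covering latch of any cell can be found by masking the low bits of the cell index (the constant and names below are illustrative only, not the actual code):

#include <cassert>
#include <cstddef>

/* Every 1024th slot of the cell array holds a latch instead of hash
payload, so the covering latch is the first slot of each block. */
static constexpr size_t ELEMENTS_PER_LATCH= 1024;   /* must be a power of 2 */

inline size_t latch_slot(size_t cell)
{
  return cell & ~(ELEMENTS_PER_LATCH - 1);
}

int main()
{
  assert(latch_slot(0) == 0);
  assert(latch_slot(1023) == 0);
  assert(latch_slot(1024) == 1024);
  assert(latch_slot(2500) == 2048);
}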
            I also removed unnecessary pointer indirection for InnoDB hash_table_t objects and cleaned up the partitioning of the adaptive hash index.

marko Marko Mäkelä added a comment

The tree
origin/bb-10.5-release 377aabb94d9a426ecd5fc31f07086c2d582b2b2d 2020-06-17T18:13:39+03:00
behaved well when running the RQG test battery for broad-range functional coverage.
No new asserts or crashes. The problems that showed up are known and not related to MDEV-22781.

mleich Matthias Leich added a comment

            It seems that the crucial step was to store a std::atomic<uint32_t> based rw-lock directly in buf_pool.page_hash.array, interleaved with the hash table payload.
            I restructured my work into 4 commits that I tested independently, because the changes are rather extensive:
            5155a300fab MDEV-22871: Reduce InnoDB buf_pool.page_hash contention
            cfd3d70ccbb MDEV-22871: Remove pointer indirection for InnoDB hash_table_t
            bf3c862faa8 MDEV-22871: Clean up btr_search_sys
            9159b8976f7 MDEV-22871: Clean up hash_table_t

I examined some perf record traces, and the buf_pool.page_hash is not an obvious bottleneck at all. In a read-only test that I conducted myself, I did not see any waiting happening at all when running 50 threads against a server on a 10-core processor with 2-way hyperthreading. With a smaller buffer pool and lots of page eviction going on, contention cannot be avoided, but even then it seems to be somewhere else (such as buf_pool.mutex).

            Thanks to krunalbauskar and axel for the performance testing, and mleich for the stress testing.

marko Marko Mäkelä added a comment

            People

Assignee: Marko Mäkelä (marko)
Reporter: Marko Mäkelä (marko)
Votes: 0
Watchers: 6

