MariaDB Server

MDEV-22871: Contention on the buf_pool.page_hash

Details

    Description

      MDEV-15053 reduced some buf_pool.mutex contention, shifting some contention to the buf_pool.page_hash latches.

      MDEV-22850 removed some genuine contention on the latches by reducing the X-latch hold time.

      But, based on earlier perf record analysis we seem to have some spurious contention on concurrent S-latch requests. I believe that this is due to the spin loop rw_lock_lock_word_decr() that rw_lock_s_lock() is executing.

      Spurious S-latch conflicts could be addressed by implementing a bare-bones variant of rw-locks using std::atomic::fetch_add() for the S-latch and compare_exchange_strong() for the X-latch acquisition. This variant would not need to support any recursive latches at all.

      As far as I understand, only dict_index_t::lock, buf_block_t::lock, fil_space_t::latch, dict_sys.latch, purge_sys.latch may truly require some additional ‘bells and whistles’, such as recursive X-locks or SX-locks, instrumentation, and event-based wakeup of waiting requests.

      Most other types of InnoDB rw-locks should be held for extremely short durations only, and we might best implement both S and X latch acquisition with simple spin loops (inlined atomic operation possibly followed by a subroutine call for the spin loop).
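As a minimal sketch of that shape (names are hypothetical, and the naive spin loops shown here would starve writers; the actual prototype appears in the comments below), the S-latch fast path would be a single inlined fetch_add(), the X-latch fast path a single compare_exchange_strong(), and only the contended case would call into a separate spin routine:

#include <atomic>

/* Hypothetical names; a fuller prototype is given in the comments below. */
class page_latch_sketch
{
  std::atomic<uint32_t> word{0};
  static constexpr uint32_t X= 1U << 31;

  /* Slow paths (spin loops); naive waits shown for brevity only. */
  void s_lock_contended()
  {
    while (word.load(std::memory_order_acquire) & X) {}
  }
  void x_lock_contended()
  {
    uint32_t expected;
    do { expected= 0; }
    while (!word.compare_exchange_strong(expected, X,
                                         std::memory_order_acquire,
                                         std::memory_order_relaxed));
  }
public:
  void s_lock()                 /* inlined fast path: one atomic fetch_add() */
  {
    if (word.fetch_add(1, std::memory_order_acquire) & X)
      s_lock_contended();
  }
  void s_unlock() { word.fetch_sub(1, std::memory_order_release); }
  void x_lock()                 /* inlined fast path: one CAS */
  {
    uint32_t expected= 0;
    if (!word.compare_exchange_strong(expected, X,
                                      std::memory_order_acquire,
                                      std::memory_order_relaxed))
      x_lock_contended();
  }
  void x_unlock() { word.fetch_and(~X, std::memory_order_release); }
};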

          Activity

            Rough idea for implementation:

#include <atomic>
#include <cassert>
#define ut_ad assert
#define ut_d(x)
inline int LF_BACKOFF() {return 0;}
 
class rw_lock
{
  /* Number of S-lock holders in the low 31 bits; the most significant
  bit is set while the X-lock is held. */
  std::atomic<uint32_t> lock;
  static constexpr uint32_t WRITER= 1U << 31;
 
  /* Out-of-line spin loops, so that the inlined fast paths stay small. */
  void read_lock_wait_for_write_unlock();
  void write_lock_wait_for_read_unlock();
public:
  rw_lock() : lock(0) {}
 
  void read_lock()
  {
    if (lock.fetch_add(1, std::memory_order_acquire) & WRITER)
      read_lock_wait_for_write_unlock();
  }
  void read_unlock()
  {
    ut_d(auto l=) lock.fetch_sub(1, std::memory_order_release);
    // ut_ad(l …); /* at least one S-lock must have been pending */
  }
  void write_lock()
  {
    uint32_t l= 0;
    if (lock.compare_exchange_strong(l, WRITER, std::memory_order_acquire,
                                     std::memory_order_relaxed))
      return;
    write_lock_wait_for_read_unlock();
  }
};
 
void rw_lock::read_lock_wait_for_write_unlock()
{
#if 0 // FIXME: starves write_lock_wait_for_read_unlock()
  while (lock.load(std::memory_order_relaxed) & WRITER)
    (void) LF_BACKOFF();
#else
  do
  {
    /* Undo our increment while backing off, so that a concurrent
    write_lock() can observe the lock word reaching 0. */
    lock.fetch_sub(1, std::memory_order_relaxed);
    (void) LF_BACKOFF();
  }
  while (lock.fetch_add(1, std::memory_order_acquire) & WRITER);
#endif
}
 
void rw_lock::write_lock_wait_for_read_unlock()
{
  uint32_t l= 0;
 
  do
  {
    l= 0;
    (void) LF_BACKOFF();
  }
  while (!lock.compare_exchange_strong(l, WRITER, std::memory_order_acquire,
                                       std::memory_order_relaxed));
}
            

            The spin loops would be located in non-inlined functions in order to minimize the code footprint.
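To make the sketch above more complete: it omits write_unlock(), and it can be exercised roughly as follows. The write_unlock() body, the hash_cell_t type and the lookup() function below are hypothetical additions for illustration, not part of the prototype.

/* Plausible completion, to be added inside class rw_lock above,
assuming write_unlock() merely clears the WRITER flag that
write_lock() set:

  void write_unlock()
  {
    lock.fetch_and(~WRITER, std::memory_order_release);
  }
*/

/* Illustrative use: guarding a lookup in one hash cell. */
struct hash_cell_t { void *node; };

void *lookup(rw_lock &latch, hash_cell_t &cell)
{
  latch.read_lock();            /* S-latch: concurrent readers are allowed */
  void *node= cell.node;        /* read the protected bucket contents */
  latch.read_unlock();
  return node;
}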

marko Marko Mäkelä added a comment

My prototype passed the stress testing of mleich, but according to krunalbauskar it does not yet improve performance when the number of concurrently executing threads exceeds the number of processor cores. He also reported a possible shutdown hang related to writer_lock().

            I have some further ideas to address the problems:

            diff --git a/storage/innobase/buf/buf0buf.cc b/storage/innobase/buf/buf0buf.cc
            index ddeeb7da6d3..3c06279ce0c 100644
            --- a/storage/innobase/buf/buf0buf.cc
            +++ b/storage/innobase/buf/buf0buf.cc
            @@ -281,6 +281,8 @@ the read requests for the whole area.
             #ifndef UNIV_INNOCHECKSUM
             void rw_lock::read_lock_wait_for_write_unlock()
             {
            +  ut_ad(!mutex_own(&buf_pool.mutex));
            +
               auto l= lock.fetch_sub(1, std::memory_order_relaxed);
               DBUG_ASSERT(l & ~(WRITER | WRITER_WAITING));
               static_assert(UNLOCKED == 0, "compatibility");
            @@ -289,6 +291,7 @@ void rw_lock::read_lock_wait_for_write_unlock()
               {
                 if (!(l & (WRITER | WRITER_WAITING)))
                 {
            +retry:
                   l= lock.fetch_add(1, std::memory_order_acquire);
                   if (!(l & (WRITER | WRITER_WAITING)))
                     return;
            @@ -298,14 +301,26 @@ void rw_lock::read_lock_wait_for_write_unlock()
                 }
                 else
                 {
            -      (void) LF_BACKOFF();
            -      l= lock.load(std::memory_order_relaxed);
            +      /* Wait longer until write_unlock(), and do not starve any
            +      write_lock_wait_for_unlock() */
            +#if 1
            +      os_thread_sleep(100);
            +#else
            +      /* All write_lock() holders are also holding
            +      buf_pool.mutex, so we might use an event-based wait here.
            +      But, do we really want to add contention to that busy mutex? */
            +      mutex_enter(&buf_pool.mutex);
            +      mutex_exit(&buf_pool.mutex);
            +#endif
            +      goto retry;
                 }
               }
             }
             
             void rw_lock::write_lock_wait_for_unlock()
             {
            +  ut_ad(mutex_own(&buf_pool.mutex));
            +
               auto l= lock.fetch_or(WRITER_WAITING, std::memory_order_relaxed);
               do
               {
            

            Because currently the sole user of rw_lock is the buf_pool.page_hash, such tight coupling is possible.
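Stated as a self-contained sketch (the UNLOCKED, WRITER and WRITER_WAITING constants follow the patch above, while the waiting primitive and all other details are simplified stand-ins, not the actual server code):

#include <atomic>
#include <cstdint>
#include <thread>

class rw_lock_sketch
{
  std::atomic<uint32_t> lock{0};
  static constexpr uint32_t UNLOCKED= 0;
  static constexpr uint32_t WRITER= 1U << 31;
  static constexpr uint32_t WRITER_WAITING= 1U << 30;
public:
  void read_lock()
  {
    /* Fast path: no writer is holding or waiting. */
    if (!(lock.fetch_add(1, std::memory_order_acquire)
          & (WRITER | WRITER_WAITING)))
      return;
    for (;;)
    {
      /* Undo the increment and back off until both flags are clear,
      so that a waiting writer is not starved. */
      lock.fetch_sub(1, std::memory_order_relaxed);
      while (lock.load(std::memory_order_relaxed)
             & (WRITER | WRITER_WAITING))
        std::this_thread::yield();      /* stand-in for os_thread_sleep() */
      if (!(lock.fetch_add(1, std::memory_order_acquire)
            & (WRITER | WRITER_WAITING)))
        return;
    }
  }

  void read_unlock() { lock.fetch_sub(1, std::memory_order_release); }

  void write_lock()
  {
    uint32_t l= UNLOCKED;
    if (lock.compare_exchange_strong(l, WRITER, std::memory_order_acquire,
                                     std::memory_order_relaxed))
      return;
    for (;;)
    {
      /* Announce the waiting writer; incoming readers will back off. */
      lock.fetch_or(WRITER_WAITING, std::memory_order_relaxed);
      /* Take over once only the WRITER_WAITING flag remains set. */
      l= WRITER_WAITING;
      if (lock.compare_exchange_strong(l, WRITER, std::memory_order_acquire,
                                       std::memory_order_relaxed))
        return;
      std::this_thread::yield();        /* stand-in for os_thread_sleep() */
    }
  }

  /* Clear only the WRITER flag, preserving WRITER_WAITING and any
  reader increments that raced with this X-latch holder. */
  void write_unlock() { lock.fetch_and(~WRITER, std::memory_order_release); }
};

The key difference from the first prototype is that a contended write_lock() sets WRITER_WAITING, so that incoming readers back off instead of repeatedly bumping the lock word and starving the writer.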

            Extensive benchmarking with various experiments will be needed to move forward.

marko Marko Mäkelä added a comment

            More observations and ideas:

• We might implement some zero-overhead wrapper around the rw_lock so that each user may specify their own spin loop logic. (But we might also defer this until there are users other than the buf_pool.page_hash.)
• For the read_lock_wait_for_write_unlock() spin loop, we might want to introduce a special primitive on buf_pool.mutex that would only wait for the mutex to be released, but not actually try acquiring it.
• Instead of padding each rw_lock to occupy one cache line (to avoid false sharing between independent latches), we might want to store some of the hash table ‘payload’ in the remainder of that cache line. Non-conflicting read_lock() acquisitions would invalidate each other’s cache lines (including some of those hash table elements), but maybe that is not an issue; write_lock() must invalidate those entries anyway.
• Perhaps we would want to make the rw_lock an integral part of the hash table? That is, we could divide the hash table into partitions, each one sized and aligned to the cache line width. The first element would be the rw_lock object, and the remaining elements would be hash table payload. The srv_n_page_hash_locks parameter would be removed. A rough layout sketch follows this list.
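A rough layout sketch of that last idea (assuming 64-byte cache lines and pointer-sized hash cells; the struct and field names are hypothetical):

#include <atomic>
#include <cstdint>

/* Hypothetical cache-line-sized page_hash partition: the latch word
shares the line with the hash cells it protects, instead of being
padded to a cache line of its own. */
struct alignas(64) page_hash_partition
{
  std::atomic<uint32_t> latch{0};                      /* lightweight rw-lock */
  void *cells[(64 - sizeof(std::atomic<uint32_t>)) / sizeof(void*)];
};

static_assert(sizeof(page_hash_partition) == 64,
              "one partition per cache line");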
marko Marko Mäkelä added a comment

            It turns out that buf_pool_t::watch_set() is not holding buf_pool.mutex while acquiring an exclusive latch on buf_pool.page_hash. Therefore, we cannot use buf_pool.mutex to make read_lock() operations wait for write_unlock().

            I pushed a revised version that implements most of my ideas. The spin loops can be tuned with innodb_sync_spin_loops and innodb_spin_wait_delay.
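For illustration, a generic tunable spin-then-yield wait of the kind those two settings control might look like this (spin_rounds and spin_delay are stand-ins for the configured values, and the delay loop is a placeholder for ut_delay(); this is not the actual server code):

#include <atomic>
#include <cstdint>
#include <thread>

/* Stand-ins for the values behind innodb_sync_spin_loops and
innodb_spin_wait_delay; the names and defaults are illustrative only. */
static unsigned spin_rounds= 30;
static unsigned spin_delay= 4;

/* Placeholder for ut_delay(): burn a short, tunable amount of time. */
static void spin_pause(unsigned delay)
{
  for (volatile unsigned i= 0; i < delay * 50; i++) {}
}

/* Spin up to spin_rounds times waiting for the flag bits to clear,
then fall back to yielding the CPU between polls. */
static void wait_until_clear(std::atomic<uint32_t> &word, uint32_t flags)
{
  for (unsigned i= 0; i < spin_rounds; i++)
  {
    if (!(word.load(std::memory_order_relaxed) & flags))
      return;
    spin_pause(spin_delay);
  }
  while (word.load(std::memory_order_relaxed) & flags)
    std::this_thread::yield();
}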

            This revision seems to reduce read/write performance further. Deeper analysis will be needed to identify the remaining bottlenecks. If the problem is memory bus contention, I hope that eliminating page_hash_latch::pad and storing page_hash_latch interleaved with the hash table elements would address that.

marko Marko Mäkelä added a comment - edited

            To prepare for the elimination of page_hash_latch::pad, I simplified the InnoDB hash_table_t. It turns out that there were many unused data fields and unnecessarily allocated memory, such as page_hash_t::heaps. The reduction of memory footprint did not lead to notable performance improvements in my test.

            The final step would be to move page_hash_latch to buf_pool.page_hash.array.

marko Marko Mäkelä added a comment

            It seems that dedicating every 1024th element in buf_pool.page_hash.array to the page_hash_latch (which replaces the fat rw_lock_t) may have addressed the problem. Some benchmarks are still in progress.
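For illustration, with 1024 array elements per latch, the covering latch of any cell can be found by masking the low bits of the cell index (the constant and names below are illustrative only, not the actual code):

#include <cassert>
#include <cstddef>

/* Every 1024th slot of the cell array holds a latch instead of hash
payload, so the covering latch is the first slot of each block. */
static constexpr size_t ELEMENTS_PER_LATCH= 1024;   /* must be a power of 2 */

inline size_t latch_slot(size_t cell)
{
  return cell & ~(ELEMENTS_PER_LATCH - 1);
}

int main()
{
  assert(latch_slot(0) == 0);
  assert(latch_slot(1023) == 0);
  assert(latch_slot(1024) == 1024);
  assert(latch_slot(2500) == 2048);
}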
            I also removed unnecessary pointer indirection for InnoDB hash_table_t objects and cleaned up the partitioning of the adaptive hash index.

marko Marko Mäkelä added a comment

The tree
origin/bb-10.5-release 377aabb94d9a426ecd5fc31f07086c2d582b2b2d 2020-06-17T18:13:39+03:00
behaved well when running the RQG test battery for broad-range functional coverage.
No new asserts or crashes. The problems that showed up are known and not related to MDEV-22781.

mleich Matthias Leich added a comment

            It seems that the crucial step was to store a std::atomic<uint32_t> based rw-lock directly in buf_pool.page_hash.array, interleaved with the hash table payload.
            I restructured my work into 4 commits that I tested independently, because the changes are rather extensive:
            5155a300fab MDEV-22871: Reduce InnoDB buf_pool.page_hash contention
            cfd3d70ccbb MDEV-22871: Remove pointer indirection for InnoDB hash_table_t
            bf3c862faa8 MDEV-22871: Clean up btr_search_sys
            9159b8976f7 MDEV-22871: Clean up hash_table_t

I examined some perf record traces, and the buf_pool.page_hash is not an obvious bottleneck at all. In a read-only test that I conducted myself, I did not see any waiting happening at all when running 50 threads against a server on a 10-core processor with 2-way hyperthreading. With a smaller buffer pool and lots of page eviction going on, contention cannot be avoided, but even then it seems to be somewhere else (such as buf_pool.mutex).

            Thanks to krunalbauskar and axel for the performance testing, and mleich for the stress testing.

marko Marko Mäkelä added a comment

            People

Assignee: Marko Mäkelä (marko)
Reporter: Marko Mäkelä (marko)
Votes: 0
Watchers: 6

