MDEV-23369

False sharing in page_hash_latch::read_lock_wait()



    Description

      MDEV-22871 refactored the InnoDB buf_pool.page_hash to use a simple rw-lock implementation in which a non-contended read-lock request avoids any spin loop and is acquired with a single std::atomic::fetch_add().
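
      For context, here is a minimal sketch of such a latch, assuming nothing about the actual page_hash_latch code beyond what is described above: the uncontended read-lock acquisition is one std::atomic::fetch_add(), while the WRITER bit layout and the yield-based wait loop are illustrative assumptions, not the InnoDB implementation.

      #include <atomic>
      #include <cstdint>
      #include <thread>
      
      class rw_latch_sketch
      {
        std::atomic<uint32_t> word{0};
        static constexpr uint32_t WRITER= 1U << 31;
      public:
        void read_lock()
        {
          /* Fast path: an uncontended shared acquisition is one fetch_add(). */
          if (word.fetch_add(1, std::memory_order_acquire) & WRITER)
            read_lock_wait();
        }
        void read_unlock() { word.fetch_sub(1, std::memory_order_release); }
        void write_lock()
        {
          /* Wait until there is no reader and no writer, then take the latch. */
          uint32_t expected= 0;
          while (!word.compare_exchange_weak(expected, WRITER,
                                             std::memory_order_acquire,
                                             std::memory_order_relaxed))
          {
            expected= 0;
            std::this_thread::yield();
          }
        }
        void write_unlock()
        {
          /* Clear only the WRITER bit; optimistic reader increments that are
          about to be undone in read_lock_wait() must be preserved. */
          word.fetch_and(~WRITER, std::memory_order_release);
        }
      private:
        void read_lock_wait()
        {
          /* Slow path: undo the optimistic increment, wait for the writer to
          release, and retry the fetch_add(). */
          for (;;)
          {
            word.fetch_sub(1, std::memory_order_relaxed);
            while (word.load(std::memory_order_relaxed) & WRITER)
              std::this_thread::yield();
            if (!(word.fetch_add(1, std::memory_order_acquire) & WRITER))
              return;
          }
        }
      };
      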

      Alas, wlad noticed that in a write-heavy stress test on a 56-core system with 1,000 concurrent client connections, the server would appear to halt every few seconds, delivering 0 transactions per second. It is not a permanent hang; performance resumes after some time.

      I attached GDB to the server during one such apparent halt and saw 22 of the 1,033 threads waiting on the same page_hash_latch object:

      10.5 8ddebb33c28b0aeaa6550ac0e825beccd367bb2c

      #1  0x00005628f70317d5 in page_hash_latch::read_lock_wait (
          this=this@entry=0x7f2ae590d040)
          at /home/marko/server/storage/innobase/buf/buf0buf.cc:298
      

      In each of the calls, the page_id is distinct, and each invocation is for an undo log page:

      10.5 8ddebb33c28b0aeaa6550ac0e825beccd367bb2c

      Thread 5 (Thread 0x7f23a9d8f700 (LWP 296467)):
      #6  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 193651}, …)
      #7  0x00005628f6fefc91 in trx_undo_reuse_cached
      Thread 25 (Thread 0x7f23aa60e700 (LWP 296445)):
      #6  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 194308}, …)
      #7  0x00005628f6fea9b4 in trx_undo_page_get
      Thread 45 (Thread 0x7f23aadf7700 (LWP 296406)):
      #6  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 193982}, …)
      #7  0x00005628f6fefc91 in trx_undo_reuse_cached
      Thread 46 (Thread 0x7f23aae42700 (LWP 296403)):
      #6  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 193818}, …)
      #7  0x00005628f6fefc91 in trx_undo_reuse_cached
      Thread 67 (Thread 0x7f23ab70c700 (LWP 296362)):
      #6  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 194076}, …)
      #7  0x00005628f6fefc91 in trx_undo_reuse_cached
      Thread 81 (Thread 0x7f23abef5700 (LWP 296332)):
      #6  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 193656}, …)
      #7  0x00005628f6fefc91 in trx_undo_reuse_cached
      Thread 85 (Thread 0x7f23ac198700 (LWP 296324)):
      #5  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 194312}, …)
      #6  0x00005628f6fea9b4 in trx_undo_page_get
      Thread 88 (Thread 0x7f23ac2c4700 (LWP 296319)):
      #6  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 193831}, …)
      #7  0x00005628f6fefc91 in trx_undo_reuse_cached
      Thread 104 (Thread 0x7f23ac9cc700 (LWP 296286)):
      #6  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 193952}, …)
      #7  0x00005628f6fefc91 in trx_undo_reuse_cached
      Thread 106 (Thread 0x7f23acbd9700 (LWP 296283)):
      #6  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 194298}, …)
      #7  0x00005628f6fea9b4 in trx_undo_page_get
      Thread 129 (Thread 0x7f23ad61a700 (LWP 296238)):
      #5  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 194272}, …)
      #6  0x00005628f6fefc91 in trx_undo_reuse_cached
      Thread 147 (Thread 0x7f23ae010700 (LWP 296201)):
      #6  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 193899}, …)
      #7  0x00005628f6fefc91 in trx_undo_reuse_cached
      Thread 150 (Thread 0x7f23ae1d2700 (LWP 296195)):
      #6  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 194106}, …)
      #7  0x00005628f6fefc91 in trx_undo_reuse_cached
      Thread 158 (Thread 0x7f23ae5a1700 (LWP 296180)):
      #6  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 193795}, …)
      #7  0x00005628f6fefc91 in trx_undo_reuse_cached
      Thread 162 (Thread 0x7f23ae844700 (LWP 296171)):
      #6  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 193947}, …)
      #7  0x00005628f6fefc91 in trx_undo_reuse_cached
      Thread 176 (Thread 0x7f23aee6b700 (LWP 296144)):
      #6  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 194302}, …)
      #7  0x00005628f6fefc91 in trx_undo_reuse_cached
      Thread 195 (Thread 0x7f23af816700 (LWP 296105)):
      #5  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 194233}, …)
      #6  0x00005628f6fea9b4 in trx_undo_page_get
      Thread 200 (Thread 0x7f23afa6e700 (LWP 296095)):
      #5  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 194014}, …)
      #6  0x00005628f6fea9b4 in trx_undo_page_get
      Thread 207 (Thread 0x7f23afe3d700 (LWP 296082)):
      #5  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 193891}, …)
      #6  0x00005628f6fefc91 in trx_undo_reuse_cached
      Thread 212 (Thread 0x7f23b4102700 (LWP 296071)):
      #6  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 194293}, …)
      #7  0x00005628f6fea9b4 in trx_undo_page_get
      

      The false sharing is completely eliminated by the following:

      diff --git a/storage/innobase/include/buf0buf.h b/storage/innobase/include/buf0buf.h
      index 2677d42..20bf8d5 100644
      --- a/storage/innobase/include/buf0buf.h
      +++ b/storage/innobase/include/buf0buf.h
      @@ -1824,7 +1824,8 @@ class buf_pool_t
         {
           /** Number of array[] elements per page_hash_latch.
           Must be one less than a power of 2. */
      -    static constexpr size_t ELEMENTS_PER_LATCH= 1023;
      +    static constexpr size_t ELEMENTS_PER_LATCH= CPU_LEVEL1_DCACHE_LINESIZE /
      +      sizeof(void*) - 1;
       
           /** number of payload elements in array[] */
           Atomic_relaxed<ulint> n_cells;
      

      The practical minimum value of CPU_LEVEL1_DCACHE_LINESIZE appears to be 64 bytes, and the practical maximum value of sizeof(void*) is 8 bytes. Those are the exact values on the AMD64 a.k.a. Intel EM64T a.k.a. x86_64 ISA.
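
      To make the patched constant concrete, the arithmetic can be written as compile-time checks using the x86_64 values just mentioned (a sketch: the constant names merely mirror the patch, and the real CPU_LEVEL1_DCACHE_LINESIZE comes from the build system):

      #include <cstddef>
      
      /* Illustrative x86_64 values; on this ISA each latch is followed by the
      7 hash-chain pointers it protects, and the next latch starts on the next
      cache line. */
      static constexpr std::size_t CACHE_LINE= 64; /* CPU_LEVEL1_DCACHE_LINESIZE */
      static constexpr std::size_t ELEMENTS_PER_LATCH=
        CACHE_LINE / sizeof(void*) - 1;
      
      static_assert(ELEMENTS_PER_LATCH == 7,
                    "each latch covers only 7 hash cells");
      static_assert((ELEMENTS_PER_LATCH + 1) * sizeof(void*) == CACHE_LINE,
                    "a latch and the cells it covers fill exactly one cache line");
      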

      With the above fix, at most 1/8 of buf_pool.page_hash.array would be used for the page_hash_latch objects. The payload size of the array is the number of pages in the buffer pool (innodb_buffer_pool_size/innodb_page_size). This number would be rounded up to a slightly larger prime number, then, with the above patch, multiplied by 8/7 (an increase of roughly 14%), and finally multiplied by sizeof(void*).

      For example, a 50GiB buffer pool comprises at most 3,276,800 pages of 16KiB each, so the raw payload size of buf_pool.page_hash.array would be 25MiB. The above fix would increase the memory usage to about 28.6MiB, which seems acceptable to me.
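
      The arithmetic can be checked with a few lines of throwaway code (a sketch: it assumes 8-byte pointers and ignores the rounding of the cell count up to the next prime, which only adds a little):

      #include <cstdio>
      
      int main()
      {
        const double buf_pool= 50.0 * 1024 * 1024 * 1024; /* 50 GiB buffer pool */
        const double page_size= 16 * 1024;                /* innodb_page_size=16k */
        const double cells= buf_pool / page_size;         /* 3,276,800 pages */
        const double raw= cells * 8;                      /* payload pointers only */
        const double patched= raw * 8 / 7;                /* one latch per 7 cells */
        std::printf("payload %.1f MiB, with per-cache-line latches %.1f MiB\n",
                    raw / (1 << 20), patched / (1 << 20)); /* 25.0 MiB, 28.6 MiB */
        return 0;
      }
      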
