MariaDB Server / MDEV-23369

False sharing in page_hash_latch::read_lock_wait()



      Description

      MDEV-22871 refactored the InnoDB buf_pool.page_hash to use a simple rw-lock implementation that avoids a spinloop for non-contended read-lock requests: the lock is acquired with a single std::atomic::fetch_add().
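
      For illustration, here is a minimal C++ sketch of that acquisition scheme; the class and all of its details are invented for this sketch, not the actual MariaDB code. The most significant bit of an atomic word denotes the exclusive lock, so an uncontended shared-lock acquisition costs exactly one fetch_add():

      #include <atomic>
      #include <cstdint>

      class sketch_rw_latch
      {
        std::atomic<uint32_t> word{0};
        static constexpr uint32_t WRITER= 1U << 31;
      public:
        void read_lock()
        {
          /* fast path: one fetch_add(), no spinloop when uncontended */
          if (word.fetch_add(1, std::memory_order_acquire) & WRITER)
            read_lock_wait();
        }
        void read_unlock() { word.fetch_sub(1, std::memory_order_release); }
      private:
        void read_lock_wait()
        {
          /* slow path: undo the optimistic increment, then retry until
          the exclusive bit has been cleared */
          word.fetch_sub(1, std::memory_order_relaxed);
          for (;;)
          {
            uint32_t w= word.load(std::memory_order_relaxed);
            if (!(w & WRITER) &&
                word.compare_exchange_weak(w, w + 1, std::memory_order_acquire,
                                           std::memory_order_relaxed))
              return;
          }
        }
      };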

      Alas, Vladislav Vaintroub noticed that in a write-heavy stress test on a 56-core system with 1,000 concurrent client connections, the server would appear to halt every few seconds, delivering 0 transactions per second. It is not a permanent hang; throughput resumes after some time.

      I attached GDB to the server during one such apparent halt, and I saw 22 of the 1,033 threads trying to access the same object:

      10.5 8ddebb33c28b0aeaa6550ac0e825beccd367bb2c

      #1  0x00005628f70317d5 in page_hash_latch::read_lock_wait (
          this=this@entry=0x7f2ae590d040)
          at /home/marko/server/storage/innobase/buf/buf0buf.cc:298
      

      In each of the calls, the page_id is distinct, and each invocation is for an undo log page:

      10.5 8ddebb33c28b0aeaa6550ac0e825beccd367bb2c

      Thread 5 (Thread 0x7f23a9d8f700 (LWP 296467)):
      #6  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 193651}, …)
      #7  0x00005628f6fefc91 in trx_undo_reuse_cached
      Thread 25 (Thread 0x7f23aa60e700 (LWP 296445)):
      #6  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 194308}, …)
      #7  0x00005628f6fea9b4 in trx_undo_page_get
      Thread 45 (Thread 0x7f23aadf7700 (LWP 296406)):
      #6  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 193982}, …)
      #7  0x00005628f6fefc91 in trx_undo_reuse_cached
      Thread 46 (Thread 0x7f23aae42700 (LWP 296403)):
      #6  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 193818}, …)
      #7  0x00005628f6fefc91 in trx_undo_reuse_cached
      Thread 67 (Thread 0x7f23ab70c700 (LWP 296362)):
      #6  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 194076}, …)
      #7  0x00005628f6fefc91 in trx_undo_reuse_cached
      Thread 81 (Thread 0x7f23abef5700 (LWP 296332)):
      #6  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 193656}, …)
      #7  0x00005628f6fefc91 in trx_undo_reuse_cached
      Thread 85 (Thread 0x7f23ac198700 (LWP 296324)):
      #5  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 194312}, …)
      #6  0x00005628f6fea9b4 in trx_undo_page_get
      Thread 88 (Thread 0x7f23ac2c4700 (LWP 296319)):
      #6  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 193831}, …)
      #7  0x00005628f6fefc91 in trx_undo_reuse_cached
      Thread 104 (Thread 0x7f23ac9cc700 (LWP 296286)):
      #6  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 193952}, …)
      #7  0x00005628f6fefc91 in trx_undo_reuse_cached
      Thread 106 (Thread 0x7f23acbd9700 (LWP 296283)):
      #6  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 194298}, …)
      #7  0x00005628f6fea9b4 in trx_undo_page_get
      Thread 129 (Thread 0x7f23ad61a700 (LWP 296238)):
      #5  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 194272}, …)
      #6  0x00005628f6fefc91 in trx_undo_reuse_cached
      Thread 147 (Thread 0x7f23ae010700 (LWP 296201)):
      #6  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 193899}, …)
      #7  0x00005628f6fefc91 in trx_undo_reuse_cached
      Thread 150 (Thread 0x7f23ae1d2700 (LWP 296195)):
      #6  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 194106}, …)
      #7  0x00005628f6fefc91 in trx_undo_reuse_cached
      Thread 158 (Thread 0x7f23ae5a1700 (LWP 296180)):
      #6  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 193795}, …)
      #7  0x00005628f6fefc91 in trx_undo_reuse_cached
      Thread 162 (Thread 0x7f23ae844700 (LWP 296171)):
      #6  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 193947}, …)
      #7  0x00005628f6fefc91 in trx_undo_reuse_cached
      Thread 176 (Thread 0x7f23aee6b700 (LWP 296144)):
      #6  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 194302}, …)
      #7  0x00005628f6fefc91 in trx_undo_reuse_cached
      Thread 195 (Thread 0x7f23af816700 (LWP 296105)):
      #5  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 194233}, …)
      #6  0x00005628f6fea9b4 in trx_undo_page_get
      Thread 200 (Thread 0x7f23afa6e700 (LWP 296095)):
      #5  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 194014}, …)
      #6  0x00005628f6fea9b4 in trx_undo_page_get
      Thread 207 (Thread 0x7f23afe3d700 (LWP 296082)):
      #5  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 193891}, …)
      #6  0x00005628f6fefc91 in trx_undo_reuse_cached
      Thread 212 (Thread 0x7f23b4102700 (LWP 296071)):
      #6  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 194293}, …)
      #7  0x00005628f6fea9b4 in trx_undo_page_get
      

      The false sharing is completely eliminated by the following:

      diff --git a/storage/innobase/include/buf0buf.h b/storage/innobase/include/buf0buf.h
      index 2677d42..20bf8d5 100644
      --- a/storage/innobase/include/buf0buf.h
      +++ b/storage/innobase/include/buf0buf.h
      @@ -1824,7 +1824,8 @@ class buf_pool_t
         {
           /** Number of array[] elements per page_hash_latch.
           Must be one less than a power of 2. */
      -    static constexpr size_t ELEMENTS_PER_LATCH= 1023;
      +    static constexpr size_t ELEMENTS_PER_LATCH= CPU_LEVEL1_DCACHE_LINESIZE /
      +      sizeof(void*) - 1;
       
           /** number of payload elements in array[] */
           Atomic_relaxed<ulint> n_cells;
      

      The practical minimum value of CPU_LEVEL1_DCACHE_LINESIZE appears to be 64 bytes, and the practical maximum value of sizeof(void*) is 8 bytes. Those are the exact values on the AMD64 a.k.a. Intel EM64T a.k.a. x86_64 ISA.
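
      On x86_64, the new value is therefore 64/8 - 1 = 7, so each page_hash_latch shares its cache line only with the 7 hash cells that it protects, and a modification of any unrelated cell or latch can no longer invalidate the cache line that the latch resides on. A hypothetical sketch of the resulting layout (the real definitions in buf0buf.h differ in detail):

      #include <cstddef>

      /* assumed x86_64 values; in the server these come from the build
      system macro CPU_LEVEL1_DCACHE_LINESIZE and sizeof(void*) */
      static constexpr size_t CACHE_LINE= 64;
      static constexpr size_t ELEMENTS_PER_LATCH= CACHE_LINE / sizeof(void*) - 1;
      static_assert(((ELEMENTS_PER_LATCH + 1) & ELEMENTS_PER_LATCH) == 0,
                    "must be one less than a power of 2");

      /* hypothetical helper: the array slot holding the latch that
      protects hash cell i is found by masking off the low bits */
      constexpr size_t latch_slot(size_t i) { return i & ~ELEMENTS_PER_LATCH; }

      static_assert(latch_slot(0) == 0 && latch_slot(7) == 0 &&
                    latch_slot(8) == 8 && latch_slot(15) == 8,
                    "slots 0..7 share one latch; slots 8..15 the next");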

      With the above fix, at most 1/8 of buf_pool.page_hash.array would be used for page_hash_latch objects. The payload size of the array is the number of pages in the buffer pool (innodb_buffer_pool_size/innodb_page_size). This number would be rounded up to a slightly larger prime number, then, with the above patch, multiplied by 8/7 (an increase of about 14%), and finally multiplied by sizeof(void*).

      For example, a 50GiB buffer pool would comprise at most 3,276,800 pages of 16KiB each, and the raw payload size of buf_pool.page_hash.array would be 25MiB. The above fix would increase the memory usage to about 28.6MiB, which seems acceptable to me.
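
      A standalone back-of-the-envelope check of that arithmetic (hypothetical code; it ignores the rounding of the cell count up to a prime, so the real numbers are marginally larger):

      #include <cstdio>

      int main()
      {
        const unsigned long long pool= 50ULL << 30;        /* 50GiB */
        const unsigned long long pages= pool / (16 << 10); /* 16KiB pages */
        const double raw_mib= double(pages) * 8 / (1 << 20);
        /* with the fix, every 8th array element is a page_hash_latch */
        const double fixed_mib= raw_mib * 8 / 7;
        std::printf("%llu pages, %.1f MiB payload, %.1f MiB with latches\n",
                    pages, raw_mib, fixed_mib);
        return 0;
      }

      It prints "3276800 pages, 25.0 MiB payload, 28.6 MiB with latches", matching the figures above.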

People

Assignee: Marko Mäkelä (marko)
Reporter: Marko Mäkelä (marko)
