[MDEV-23369] False sharing in page_hash_latch::read_lock_wait() Created: 2020-08-02  Updated: 2020-08-03  Resolved: 2020-08-02

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.5.4
Fix Version/s: 10.5.5

Type: Bug Priority: Blocker
Reporter: Marko Mäkelä Assignee: Marko Mäkelä
Resolution: Fixed Votes: 0
Labels: performance

Issue Links:
Problem/Incident
is caused by MDEV-22871 Contention on the buf_pool.page_hash Closed
Relates
relates to MDEV-23379 Deprecate and ignore options for Inno... Closed

 Description   

MDEV-22871 refactored the InnoDB buf_pool.page_hash to use a simple rw-lock implementation that avoids any spinloop for non-contended read-lock requests: the lock is acquired with a single std::atomic::fetch_add().
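The acquisition pattern can be illustrated by a minimal sketch. The names and the slow path below are hypothetical simplifications; the real page_hash_latch in buf0buf.h differs in detail. The point is the fast path: an uncontended reader issues one fetch_add() and never loops.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Minimal sketch of a fetch_add-based rw-lock (hypothetical names and a
// simplified slow path; the real page_hash_latch in buf0buf.h differs).
class simple_rw_lock
{
  // Low 31 bits: reader count; top bit: exclusive-writer flag.
  std::atomic<uint32_t> word{0};
  static constexpr uint32_t WRITER= 1U << 31;

  void read_lock_wait()
  {
    // Slow path: undo our increment, wait out the writer, try again.
    for (;;)
    {
      word.fetch_sub(1, std::memory_order_relaxed);
      while (word.load(std::memory_order_relaxed) & WRITER)
        std::this_thread::yield();
      if (!(word.fetch_add(1, std::memory_order_acquire) & WRITER))
        return;
    }
  }

public:
  void read_lock()
  {
    // Uncontended fast path: one atomic add, no loop at all.
    if (word.fetch_add(1, std::memory_order_acquire) & WRITER)
      read_lock_wait();
  }
  void read_unlock() { word.fetch_sub(1, std::memory_order_release); }

  void write_lock()
  {
    // Claim the writer flag, then wait for existing readers to drain.
    while (word.fetch_or(WRITER, std::memory_order_acquire) & WRITER)
      std::this_thread::yield();
    while (word.load(std::memory_order_acquire) != WRITER)
      std::this_thread::yield();
  }
  void write_unlock() { word.fetch_and(~WRITER, std::memory_order_release); }
};

// Sanity check: concurrent writers protecting a plain (non-atomic) counter.
inline int stress(int threads, int iters)
{
  simple_rw_lock lock;
  int counter= 0;
  std::vector<std::thread> pool;
  for (int t= 0; t < threads; t++)
    pool.emplace_back([&] {
      for (int i= 0; i < iters; i++)
      {
        lock.write_lock();
        counter++;
        lock.write_unlock();
      }
    });
  for (auto &th : pool)
    th.join();
  return counter;
}
```

The flip side of this design is that the lock word itself becomes a write hot spot: even "shared" acquisitions dirty the cache line that holds it, which is what makes the placement of these latches in memory matter.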

Alas, wlad noticed that in a write-heavy stress test on a 56-core system with 1,000 concurrent client connections, every few seconds the server would appear to halt, delivering 0 transactions per second. It is not a permanent hang; performance resumes after some time.

I attached GDB to the server during one such apparent halt, and I saw 22 of the 1,033 threads trying to access the same object:

10.5 8ddebb33c28b0aeaa6550ac0e825beccd367bb2c

#1  0x00005628f70317d5 in page_hash_latch::read_lock_wait (
    this=this@entry=0x7f2ae590d040)
    at /home/marko/server/storage/innobase/buf/buf0buf.cc:298

In each of the calls, the page_id is distinct, and each invocation is for an undo log page:

10.5 8ddebb33c28b0aeaa6550ac0e825beccd367bb2c

Thread 5 (Thread 0x7f23a9d8f700 (LWP 296467)):
#6  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 193651}, …)
#7  0x00005628f6fefc91 in trx_undo_reuse_cached
Thread 25 (Thread 0x7f23aa60e700 (LWP 296445)):
#6  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 194308}, …)
#7  0x00005628f6fea9b4 in trx_undo_page_get
Thread 45 (Thread 0x7f23aadf7700 (LWP 296406)):
#6  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 193982}, …)
#7  0x00005628f6fefc91 in trx_undo_reuse_cached
Thread 46 (Thread 0x7f23aae42700 (LWP 296403)):
#6  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 193818}, …)
#7  0x00005628f6fefc91 in trx_undo_reuse_cached
Thread 67 (Thread 0x7f23ab70c700 (LWP 296362)):
#6  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 194076}, …)
#7  0x00005628f6fefc91 in trx_undo_reuse_cached
Thread 81 (Thread 0x7f23abef5700 (LWP 296332)):
#6  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 193656}, …)
#7  0x00005628f6fefc91 in trx_undo_reuse_cached
Thread 85 (Thread 0x7f23ac198700 (LWP 296324)):
#5  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 194312}, …)
#6  0x00005628f6fea9b4 in trx_undo_page_get
Thread 88 (Thread 0x7f23ac2c4700 (LWP 296319)):
#6  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 193831}, …)
#7  0x00005628f6fefc91 in trx_undo_reuse_cached
Thread 104 (Thread 0x7f23ac9cc700 (LWP 296286)):
#6  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 193952}, …)
#7  0x00005628f6fefc91 in trx_undo_reuse_cached
Thread 106 (Thread 0x7f23acbd9700 (LWP 296283)):
#6  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 194298}, …)
#7  0x00005628f6fea9b4 in trx_undo_page_get
Thread 129 (Thread 0x7f23ad61a700 (LWP 296238)):
#5  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 194272}, …)
#6  0x00005628f6fefc91 in trx_undo_reuse_cached
Thread 147 (Thread 0x7f23ae010700 (LWP 296201)):
#6  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 193899}, …)
#7  0x00005628f6fefc91 in trx_undo_reuse_cached
Thread 150 (Thread 0x7f23ae1d2700 (LWP 296195)):
#6  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 194106}, …)
#7  0x00005628f6fefc91 in trx_undo_reuse_cached
Thread 158 (Thread 0x7f23ae5a1700 (LWP 296180)):
#6  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 193795}, …)
#7  0x00005628f6fefc91 in trx_undo_reuse_cached
Thread 162 (Thread 0x7f23ae844700 (LWP 296171)):
#6  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 193947}, …)
#7  0x00005628f6fefc91 in trx_undo_reuse_cached
Thread 176 (Thread 0x7f23aee6b700 (LWP 296144)):
#6  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 194302}, …)
#7  0x00005628f6fefc91 in trx_undo_reuse_cached
Thread 195 (Thread 0x7f23af816700 (LWP 296105)):
#5  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 194233}, …)
#6  0x00005628f6fea9b4 in trx_undo_page_get
Thread 200 (Thread 0x7f23afa6e700 (LWP 296095)):
#5  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 194014}, …)
#6  0x00005628f6fea9b4 in trx_undo_page_get
Thread 207 (Thread 0x7f23afe3d700 (LWP 296082)):
#5  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 193891}, …)
#6  0x00005628f6fefc91 in trx_undo_reuse_cached
Thread 212 (Thread 0x7f23b4102700 (LWP 296071)):
#6  0x00005628f7037127 in buf_page_get_gen (page_id={m_id = 194293}, …)
#7  0x00005628f6fea9b4 in trx_undo_page_get

The false sharing is completely eliminated by the following:

diff --git a/storage/innobase/include/buf0buf.h b/storage/innobase/include/buf0buf.h
index 2677d42..20bf8d5 100644
--- a/storage/innobase/include/buf0buf.h
+++ b/storage/innobase/include/buf0buf.h
@@ -1824,7 +1824,8 @@ class buf_pool_t
   {
     /** Number of array[] elements per page_hash_latch.
     Must be one less than a power of 2. */
-    static constexpr size_t ELEMENTS_PER_LATCH= 1023;
+    static constexpr size_t ELEMENTS_PER_LATCH= CPU_LEVEL1_DCACHE_LINESIZE /
+      sizeof(void*) - 1;
 
     /** number of payload elements in array[] */
     Atomic_relaxed<ulint> n_cells;

The practical minimum value of CPU_LEVEL1_DCACHE_LINESIZE appears to be 64 bytes, and the practical maximum value of sizeof(void*) is 8 bytes. Those are the exact values on the AMD64 a.k.a. Intel EM64T a.k.a. x86_64 ISA.
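For that common x86_64 case, the arithmetic of the patch can be checked at compile time. CPU_LEVEL1_DCACHE_LINESIZE comes from the build system in the real tree; 64 bytes and 8-byte pointers are assumed here:

```cpp
#include <cstddef>

// Compile-time check of the patch's arithmetic, assuming the x86_64
// values quoted above: 64-byte cache lines and 8-byte pointers.
constexpr std::size_t CACHE_LINE= 64; // stand-in for CPU_LEVEL1_DCACHE_LINESIZE
constexpr std::size_t ELEMENTS_PER_LATCH= CACHE_LINE / sizeof(void*) - 1;

static_assert(ELEMENTS_PER_LATCH == 7, "64/8 - 1");
// One latch word plus its 7 payload pointers fill one cache line exactly,
// so no two latches can ever share a cache line: no more false sharing.
static_assert((ELEMENTS_PER_LATCH + 1) * sizeof(void*) == CACHE_LINE,
              "exact fit in one cache line");
// "One less than a power of 2": the latch of a cell can then be located
// by simply masking the low bits of the cell's array index.
static_assert((ELEMENTS_PER_LATCH & (ELEMENTS_PER_LATCH + 1)) == 0, "2^n - 1");
```

With the previous value of 1023, up to eight latches for adjacent hash chains could land in the same 64-byte cache line, so readers of unrelated pages (here, distinct undo pages) would keep invalidating each other's line.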

With the above fix, at most 1/8 of buf_pool.page_hash.array would be used for the page_hash_latch. The payload size of the array is the number of pages in the buffer pool (innodb_buffer_pool_size/innodb_page_size). This number would be rounded up to a slightly larger prime number, with the above patch multiplied by 8/7 (an increase of about 14%), and finally multiplied by sizeof(void*).

For example, a 50GiB buffer pool would comprise at most 3,276,800 pages of 16KiB, and the raw payload size of the buf_pool.page_hash.array would be 25MiB. The above fix would increase the memory usage to 28.6MiB, which seems acceptable to me.
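That sizing can be reproduced directly (8-byte pointers and 16KiB pages assumed; the rounding of the cell count up to a prime is ignored in this sketch):

```cpp
#include <cstddef>

// Redoing the 50GiB sizing example, assuming 8-byte pointers and 16KiB
// pages; rounding the cell count up to a prime is ignored here.
constexpr std::size_t pages= (50ULL << 30) / (16 << 10);    // 3276800 pages
constexpr std::size_t payload_bytes= pages * sizeof(void*); // 25MiB
// One latch cell is interleaved per 7 payload cells: multiply by 8/7.
constexpr std::size_t total_bytes= payload_bytes * 8 / 7;   // ~28.6MiB
```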


Generated at Thu Feb 08 09:21:53 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.