[MDEV-29445] reorganise innodb buffer pool (and remove buffer pool chunks) Created: 2022-09-02  Updated: 2023-12-21

Status: Stalled
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Fix Version/s: 11.5

Type: Task Priority: Critical
Reporter: Daniel Black Assignee: Marko Mäkelä
Resolution: Unresolved Votes: 1
Labels: energy, performance

Attachments: File MDEV-29445-sizes.gnumeric    
Issue Links:
Blocks
blocks MDEV-21203 Bad value for the variable "Buffer po... Open
blocks MDEV-28805 SET GLOBAL innodb_buffer_pool_size=12... Confirmed
is blocked by MDEV-25340 Server startup with large innodb_buff... Open
Relates
relates to MDEV-29432 innodb huge pages reclaim Open
relates to MDEV-31976 buf_pool.unzip_LRU wastes memory and CPU Stalled
relates to MDEV-32175 References to buf_page_t::frame may c... Stalled
relates to MDEV-32544 Setting innodb_buffer_pool_size to th... Open
relates to MDEV-9236 Dramatically overallocation of InnoDB... Open
relates to MDEV-25341 innodb buffer pool soft decommit of m... Closed
relates to MDEV-32339 decreasing innodb_buffer_pool_size at... Open

 Description   

copied from MDEV-25341:

  • The buf_pool.free as well as the buffer pool blocks that are backing store for the AHI or lock_sys could be doubly linked with each other via bytes allocated within the page frame itself. We do not need a dummy buf_page_t for such blocks.
  • We could allocate a contiguous virtual address range for the maximum supported size of buffer pool, and let the operating system physically allocate a subset of these addresses. The complicated logic of having multiple buffer pool chunks can be removed. On 32-bit architectures, the maximum size could be about 2GiB. On 64-bit architectures, the virtual address bus often is 48 bits (around 256 TiB). Perhaps we could shift some burden to the user and introduce a startup parameter innodb_buffer_pool_size_max.
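The reservation idea in the second bullet can be sketched with POSIX mmap on a 64-bit Linux system. This is only an illustration, not InnoDB code; reserve_pool and commit_prefix are made-up names, and the sizes stand in for innodb_buffer_pool_size_max and innodb_buffer_pool_size:

```cpp
#include <sys/mman.h>
#include <cstddef>

// Reserve a maximal contiguous virtual address range up front; the OS
// physically backs only the prefix that is later committed.
inline void *reserve_pool(size_t max_size)
{
  // PROT_NONE + MAP_NORESERVE: the addresses are reserved, but no
  // physical memory or swap space is committed yet.
  void *base = mmap(nullptr, max_size, PROT_NONE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
  return base == MAP_FAILED ? nullptr : base;
}

inline bool commit_prefix(void *base, size_t size)
{
  // Physical pages are allocated lazily on first touch; growing the pool
  // later is just another mprotect() over a larger prefix.
  return mprotect(base, size, PROT_READ | PROT_WRITE) == 0;
}
```

Shrinking would be the reverse: mprotect(PROT_NONE) or madvise(MADV_DONTNEED) on the tail returns the physical memory while keeping the address reservation, so no chunk bookkeeping is needed.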


 Comments   
Comment by Marko Mäkelä [ 2023-03-31 ]

When testing crash recovery with a 30GiB buffer pool in MDEV-29911, it was divided into 64 chunks of 480MiB each. I noticed that recv_sys_t::free() (inlined in recv_sys_t::recover_low()) is consuming a significant amount of CPU time. Having a single buffer pool chunk should make that code much faster.

Comment by Marko Mäkelä [ 2023-08-15 ]

I think that the buffer pool needs to be divided into logical chunks, with an array of buf_block_t allocated at the start of each chunk to cover the uncompressed page frames in the rest of the chunk.

To best achieve this, it would be beneficial to shrink sizeof(buf_block_t) to 128 bytes or less. Currently, on a CMAKE_BUILD_TYPE=RelWithDebInfo build of 10.6 or later (after MDEV-27058), we have sizeof(buf_page_t)=112 and sizeof(buf_block_t)=160. By replacing the data member buf_page_t::frame with a member function we could shrink each descriptor by 8 more bytes. The buf_block_t comprises the following:

struct buf_block_t {
    buf_page_t page; // page descriptor
    ut_list_node<buf_block_t> unzip_LRU; // 2*sizeof(void*), related to ROW_FORMAT=COMPRESSED
    ib_uint64_t modify_clock; // 8 bytes
    volatile uint16_t n_bytes; // 2 bytes
    volatile uint16_t n_fields; // 2 bytes
    uint16_t n_hash_helps; // 2 bytes
    volatile bool left_side; // 1 byte + 1 byte alignment loss
    unsigned int curr_n_fields : 10;
    unsigned int curr_n_bytes : 15;
    unsigned int curr_left_side : 1; // 32 bytes (including alignment loss)
    dict_index_t *index; // 8 bytes
};

All fields after unzip_LRU are related to the adaptive hash index. Their total size is 32 bytes. The adaptive hash index was disabled by default in MDEV-20487. If we introduce a pointer, say buf_block_t::ahi, which points to a structure that contains the adaptive hash index information, and at the same time remove the buf_page_t::frame pointer, we would shrink sizeof(buf_block_t) to exactly 128 or 2⁷ bytes. This should keep the arithmetic simple.
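To illustrate replacing buf_page_t::frame with a member function (the layout, names, and sizes below are assumptions for this sketch, not the actual patch): if every chunk starts with its descriptor array, the frame address is pure arithmetic on the descriptor's own address.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed layout: a 2MiB chunk whose first desc_pages frames hold an
// array of 128-byte block descriptors, followed by the data page frames
// that they describe.
constexpr size_t ps = 16384;       // innodb_page_size
constexpr size_t block_size = 128; // target sizeof(buf_block_t)
constexpr size_t desc_pages = 1;   // descriptor pages per chunk at 16k pages

// frame() as arithmetic instead of a stored pointer: the i-th descriptor
// in a chunk maps to the (desc_pages + i)-th page frame of that chunk.
constexpr uintptr_t frame_of(uintptr_t chunk_base, uintptr_t descriptor)
{
  return chunk_base
    + (desc_pages + (descriptor - chunk_base) / block_size) * ps;
}

static_assert(frame_of(0, 0) == ps, "descriptor 0 -> first data frame");
static_assert(frame_of(0, block_size) == 2 * ps, "descriptor 1 -> second frame");
```

The reverse mapping, from a frame address back to its descriptor, is the same arithmetic run backwards.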

Let us consider a few sizes, assuming sizeof(buf_block_t)=128. I calculated some more sizes in MDEV-29445-sizes.gnumeric:

hugepage size/KiB | innodb_page_size/KiB | descriptor pages | data pages | wasted space/bytes | waste %
2048 | 4 | 16 | 512-16 | 4096*16-(512-16)*128=2048 | 0.0977%
2048 | 16 | 1 | 128-1 | 16384*1-(128-1)*128=128 | 0.0061%
2048 | 64 | 1 | 32-1 | 65536*1-(32-1)*128=61568 | 2.936%
1048576 | 4 | 7944 | 262144-7944 | 4096*7944-(262144-7944)*128=1024 | 0.0000954%
1048576 | 16 | 509 | 65536-509 | 16384*509-(65536-509)*128=16000 | 0.00149%
1048576 | 64 | 32 | 16384-32 | 65536*32-(16384-32)*128=4096 | 0.000381%

When the largest hugepage size that is supported by the MMU is small, it might make sense to retain the parameter innodb_buffer_pool_chunk_size and allow it to be an integer power-of-2 multiple of the hugepage size.

Comment by Marko Mäkelä [ 2023-08-16 ]

The field buf_block_t::modify_clock is not related to the adaptive hash index after all. Its purpose is to detect when an optimistic btr_pcur_t::restore_pos() is not possible. The counter is incremented whenever a record is deleted from a page, or a page is freed or evicted from the buffer pool; that causes the comparison with btr_pcur_t::modify_clock to fail. We might add a field btr_pcur_t::page_id (to compare with what buf_page_t::id() would return when we attempt optimistic restoration) and simply store the FIL_PAGE_LSN contents of the page frame in btr_pcur_t. Replacing modify_clock with FIL_PAGE_LSN and page_id_t would make optimistic btr_pcur_t::restore_pos() less likely to succeed, because the FIL_PAGE_LSN of an index page is updated on any insert or update, not only when records are deleted or pages are evicted or freed.
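The proposed check could look roughly like this (saved_pos and can_restore_optimistically are hypothetical names invented for this sketch; the real cursor is btr_pcur_t):

```cpp
#include <cstdint>

// Sketch of the proposed optimistic restore_pos() precondition: the
// cursor remembers which page it was on and the FIL_PAGE_LSN it saw;
// if either has changed, only a pessimistic (re-search) restore works.
struct saved_pos
{
  uint64_t page_id;  // what buf_page_t::id() returned at save time
  uint64_t page_lsn; // FIL_PAGE_LSN of the frame at save time
};

constexpr bool can_restore_optimistically(const saved_pos &s,
                                          uint64_t cur_page_id,
                                          uint64_t cur_page_lsn)
{
  // Any insert, update or delete bumps FIL_PAGE_LSN, so this fails more
  // often than the old modify_clock comparison would.
  return s.page_id == cur_page_id && s.page_lsn == cur_page_lsn;
}
```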

I computed a table for three block descriptor sizes:

sizeof(buf_block_t) | scenario
152 | removing buf_page_t::frame only
136 | also moving the adaptive hash index behind a pointer
128 | also removing modify_clock

hugepage/KiB | innodb_page_size/KiB | pages/hugepage | 152-byte | 136-byte | 128-byte
2048 | 4 | 512 | 19 | 17 | 16
2048 | 8 | 256 | 5 | 5 | 4
2048 | 16 | 128 | 2 | 2 | 1
2048 | 32 | 64 | 1 | 1 | 1
2048 | 64 | 32 | 1 | 1 | 1
1048576 | 4 | 262144 | 9380 | 8425 | 7944
1048576 | 8 | 131072 | 2388 | 2141 | 2017
1048576 | 16 | 65536 | 603 | 540 | 509
1048576 | 32 | 32768 | 152 | 136 | 128
1048576 | 64 | 16384 | 38 | 34 | 32

The biggest overhead difference above occurs with 2MiB hugepages and the default innodb_page_size=16k: We would use 1/128 of the memory for 128-byte block descriptors, or 1/64 when using larger block descriptors.

I think that we can live with the current sizeof(buf_block_t), only removing buf_page_t::frame.

Comment by Marko Mäkelä [ 2023-08-18 ]

I created a constexpr function that should allow us to calculate the mappings between page frame addresses and block descriptors at compilation time, with innodb_page_size being the only run-time parameter. We might generate a small number of sets of mapping functions, one for each supported combination of innodb_page_size (5 values) and innodb_buffer_pool_chunk_size (limited to a small number of sizes), and set function pointers based on the chosen start-up parameters.

In C++11, a constexpr function body must consist of a single return statement. Both Clang and GCC limit the constexpr recursion depth to 512 by default. The following naïve attempt requires 351 recursion steps, and it works in all compilers that I tried: GCC 4.8.5 or later, clang 3.1 or later, ICC 16.0.3 or later, and not-too-old MSVC:

static constexpr size_t fix(size_t pages, size_t bs, size_t ps, size_t b)
{
  // While the b descriptor pages hold more than one page's worth of
  // spare room beyond what the (pages - b) data pages' descriptors
  // need, one descriptor page can be handed back to the data pages.
  return ((ps * b - (pages - b) * bs) > ps)
    ? fix(pages, bs, ps, b - 1)
    : b;
}
 
static constexpr size_t b(size_t pages, size_t bs, size_t ps)
{
  // Start from the upper bound ceil(pages * bs / ps) and shrink.
  return fix(pages, bs, ps, (pages * bs + (ps - 1)) / ps);
}
 
static constexpr size_t bpp(size_t hugepagesize, size_t bs, size_t ps)
{
  // Descriptor pages per hugepage; hugepagesize is in KiB.
  return b(hugepagesize * 1024 / ps, bs, ps);
}
 
constexpr size_t big = 152; // sizeof(buf_block_t)
 
constexpr static size_t sizes[] = {
  bpp(2048, big, 4096),
  bpp(2048, big, 8192),
  bpp(2048, big, 16384),
  bpp(2048, big, 32768),
  bpp(2048, big, 65536),
  bpp(1048576, big, 4096),
  bpp(1048576, big, 8192),
  bpp(1048576, big, 16384),
  bpp(1048576, big, 32768),
  bpp(1048576, big, 65536)
};
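If C++14 relaxed constexpr may be assumed, the recursion can be replaced by a loop, sidestepping the recursion-depth limit entirely. A sketch (bpp14 is a made-up name, equivalent to bpp above; the comparison is arranged without subtraction so that it cannot wrap around in unsigned arithmetic):

```cpp
#include <cstddef>

// Descriptor pages per hugepage, computed with a C++14 constexpr loop
// instead of recursion; hugepage_kib is the hugepage size in KiB.
constexpr size_t bpp14(size_t hugepage_kib, size_t bs, size_t ps)
{
  size_t pages = hugepage_kib * 1024 / ps;  // page frames per hugepage
  size_t b = (pages * bs + ps - 1) / ps;    // ceil: upper bound on descriptor pages
  while (ps * b > (pages - b) * bs + ps)    // more than one spare page?
    --b;                                    // hand a page back to the data
  return b;
}

static_assert(bpp14(2048, 152, 4096) == 19, "matches the 2MiB/4k row");
static_assert(bpp14(1048576, 152, 16384) == 603, "matches the 1GiB/16k row");
```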

Comment by Marko Mäkelä [ 2023-08-21 ]

Implementing MDEV-31976 would shrink buf_block_t by 2 pointers. If we also remove the redundant buf_page_t::frame pointer, we would end up with sizeof(buf_block_t) being 136 bytes on 64-bit systems or 100 bytes on 32-bit systems. The sizeof(buf_page_t) is unaffected by that: 80 bytes on 32-bit and 112 bytes on 64-bit.

One more thing that we can do is to replace all pointers in buf_block_t or buf_page_t with 32-bit integers, counting buffer page frame slots from the start of the contiguous buffer pool memory, divided by innodb_page_size. Null pointers can trivially be mapped to the value 0, because at the start of the memory we will always have a buf_block_t and never a valid page frame. The smallest valid nonzero value for the integer would be 2048k/16k=128, which would be equivalent to the buf_block_t starting at the first address of buffer pool memory. There is only one pointer that we cannot replace in this way: buf_block_t::index. That is, sizeof(buf_block_t) would have to be 104 (0x68) bytes on 64-bit systems.
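A minimal sketch of that encoding (encode/decode are illustrative names; the unit is one innodb_page_size frame slot, with 0 reserved for the null value):

```cpp
#include <cstddef>
#include <cstdint>

// 32-bit "pointer": a frame-slot number counted from the buffer pool
// base in units of innodb_page_size, with 0 as null (the start of the
// pool always holds block descriptors, never a valid page frame).
constexpr uint32_t nil = 0;

inline uint32_t encode(uintptr_t pool_base, uintptr_t frame, size_t ps)
{
  return frame ? static_cast<uint32_t>((frame - pool_base) / ps) : nil;
}

inline uintptr_t decode(uintptr_t pool_base, uint32_t slot, size_t ps)
{
  return slot != nil ? pool_base + uintptr_t{slot} * ps : 0;
}
```

This only shows the null handling and the slot arithmetic; pointers to objects at sub-page granularity would need a smaller unit or extra bits, as discussed for page_zip_des_t::data below.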

The pointer page_zip_des_t::data would require up to 4 extra bits (ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=1 blocks within an innodb_page_size=16k frame). We have exactly the required 2+2 spare bits available in m_end and n_blobs.

With the minimum innodb_page_size=4k (2¹² bytes), the 32-bit “pointers” would allow innodb_buffer_pool_size to reach up to 2¹²·2³²=2⁴⁴=16TiB. At the maximum innodb_page_size=64k we would reach 2⁴⁸=256TiB, which is the maximum virtual address space size of contemporary 64-bit processors.

Here is an updated table:

hugepage/KiB | innodb_page_size/KiB | pages/hugepage | 152-byte | 136-byte | 100-byte | 104-byte
2048 | 4 | 512 | 19 | 17 | 13 | 13
2048 | 8 | 256 | 5 | 5 | 4 | 4
2048 | 16 | 128 | 2 | 2 | 1 | 1
2048 | 32 | 64 | 1 | 1 | 1 | 1
2048 | 64 | 32 | 1 | 1 | 1 | 1
1048576 | 4 | 262144 | 9380 | 8425 | 6248 | 6492
1048576 | 8 | 131072 | 2388 | 2141 | 1581 | 1644
1048576 | 16 | 65536 | 603 | 540 | 398 | 414
1048576 | 32 | 32768 | 152 | 136 | 100 | 104
1048576 | 64 | 16384 | 38 | 34 | 25 | 26

The worst-case overhead of allocating block descriptors (at innodb_page_size=4k) would be 13/512=2.54% or 6492/262144=2.48%. With the default innodb_page_size=16k the overhead is 1/128=0.78% or 414/65536=0.63%.

Comment by Marko Mäkelä [ 2023-08-21 ]

The current minimum value of innodb_buffer_pool_size is 2MiB, which coincidentally is equal to the smaller IA-32 or AMD64 hugepage size. In each 2MiB segment, we would allocate the first 13 page frames (52 KiB) for block descriptors. With innodb_buffer_pool_size=3m we would reserve a total of 26*4KiB=104 KiB for page descriptors, wasting 6½*4KiB on the last 1MiB of the second segment, for which no page frames will be allocated.

When innodb_buffer_pool_chunk_size=1GiB, at every 1GiB we would use 6492*4KiB=25 MiB for innodb_page_size=4k page descriptors, or 6.5 MiB for innodb_page_size=16k, or 1.63 MiB for innodb_page_size=64k.
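The arithmetic in the two paragraphs above can be checked mechanically, using the 104-byte column of the earlier table (all sizes in KiB):

```cpp
// Descriptor memory per 2MiB segment and per 1GiB chunk, in KiB.
static_assert(13 * 4 == 52, "13 frames of 4KiB = 52 KiB per 2MiB segment");
static_assert(26 * 4 == 104, "innodb_buffer_pool_size=3m: 104 KiB of descriptors");
static_assert(6492 * 4 / 1024 == 25, "1GiB chunk, 4k pages: ~25 MiB");
static_assert(414 * 16 == 6624, "1GiB chunk, 16k pages: 6624 KiB ~ 6.5 MiB");
static_assert(26 * 64 == 1664, "1GiB chunk, 64k pages: 1664 KiB ~ 1.63 MiB");
```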

Comment by Marko Mäkelä [ 2023-08-22 ]

I realized that the trick of replacing 64-bit pointers with 32-bit integers will not work, because the buf_page_t descriptors of compressed-only ROW_FORMAT=COMPRESSED blocks in the buffer pool would be allocated by malloc(), outside the contiguous virtual address range that is associated with the buf_block_t descriptors of uncompressed pages as well as page frames. If we were to remove ROW_FORMAT=COMPRESSED support altogether (which we won’t; see MDEV-22367), then sizeof(buf_block_t) would shrink further to 88 bytes on 32-bit systems, and possibly 96 on 64-bit. By further removing the adaptive hash index we would come down to 72 bytes (on both 32-bit and 64-bit systems).

Here is an updated table that includes these hypothetical cases:

hugepage/KiB | innodb_page_size/KiB | pages/hugepage | 152-byte | 136-byte | 100-byte | 104-byte | 88-byte | 96-byte | 72-byte
2048 | 4 | 512 | 19 | 17 | 13 | 13 | 11 | 12 | 9
2048 | 8 | 256 | 5 | 5 | 4 | 4 | 3 | 3 | 3
2048 | 16 | 128 | 2 | 2 | 1 | 1 | 1 | 1 | 1
2048 | 32 | 64 | 1 | 1 | 1 | 1 | 1 | 1 | 1
2048 | 64 | 32 | 1 | 1 | 1 | 1 | 1 | 1 | 1
1048576 | 4 | 262144 | 9380 | 8425 | 6248 | 6492 | 5514 | 6004 | 4529
1048576 | 8 | 131072 | 2388 | 2141 | 1581 | 1644 | 1394 | 1519 | 1142
1048576 | 16 | 65536 | 603 | 540 | 398 | 414 | 351 | 382 | 287
1048576 | 32 | 32768 | 152 | 136 | 100 | 104 | 88 | 96 | 72
1048576 | 64 | 16384 | 38 | 34 | 25 | 26 | 22 | 24 | 18

The worst-case overhead of allocating block descriptors (at innodb_page_size=4k) would be 9/512=1.76% (instead of 17/512=3.32%) or 4529/262144=1.73% (instead of 8425/262144=3.21%). With the default innodb_page_size=16k the overhead is 1/128=0.78% or 287/65536=0.44% (or 540/65536=0.83%). Nearly halving the size of the block descriptor from 136 to 72 bytes would roughly halve the memory overhead. For now, we can only shrink the block descriptor by 3 pointers (1 if we do not implement MDEV-31976).

Generated at Thu Feb 08 10:08:38 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.