[MDEV-29445] reorganise innodb buffer pool (and remove buffer pool chunks) Created: 2022-09-02 Updated: 2023-12-21

| Status: | Stalled |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Fix Version/s: | 11.5 |
| Type: | Task | Priority: | Critical |
| Reporter: | Daniel Black | Assignee: | Marko Mäkelä |
| Resolution: | Unresolved | Votes: | 1 |
| Labels: | energy, performance |
| Attachments: | |
| Issue Links: | |

| Description |
copied from MDEV-25341:
| Comments |
| Comment by Marko Mäkelä [ 2023-03-31 ] |

When testing crash recovery with a 30GiB buffer pool in
| Comment by Marko Mäkelä [ 2023-08-15 ] |

I think that the buffer pool needs to be divided into logical chunks, with an array of buf_block_t being allocated at the start of each chunk, covering the uncompressed pages in the rest of the chunk. To best achieve this, it would be beneficial to shrink sizeof(buf_block_t) to 128 bytes or less. Currently, on a CMAKE_BUILD_TYPE=RelWithDebInfo build of 10.6 or later (after

All fields after unzip_LRU are related to the adaptive hash index. Their total size is 32 bytes. The adaptive hash index was disabled by default in an earlier release. Let us consider a few sizes, assuming sizeof(buf_block_t)=128. I calculated some more sizes in MDEV-29445-sizes.gnumeric.

When the largest hugepage size that is supported by the MMU is small, it might make sense to retain the parameter innodb_buffer_pool_chunk_size and allow it to be an integer power-of-2 multiple of the hugepage size.
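The power-of-2-multiple constraint could be validated with a couple of bit operations. A minimal sketch (the function name is hypothetical, not existing InnoDB code):

```cpp
#include <cstddef>

// Hypothetical validation sketch: `chunk` is acceptable when it is the
// hugepage size multiplied by an integer power of two (1, 2, 4, ...).
// A power of two has exactly one bit set, so x & (x - 1) == 0.
constexpr bool is_valid_chunk_size(std::size_t chunk, std::size_t hugepage_size)
{
  return chunk >= hugepage_size && chunk % hugepage_size == 0 &&
         ((chunk / hugepage_size) & (chunk / hugepage_size - 1)) == 0;
}

static_assert(is_valid_chunk_size(4UL << 20, 2UL << 20), "2 hugepages: ok");
static_assert(!is_valid_chunk_size(6UL << 20, 2UL << 20), "3x is not a power of 2");
```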
| Comment by Marko Mäkelä [ 2023-08-16 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The field buf_block_t::modify_clock is not related to the adaptive hash index after all. Its purpose is to identify that an optimistic btr_pcur_t::restore_pos() is not possible. The counter will be incremented whenever a record is deleted from a page, or a page is freed or evicted from the buffer pool. This would cause a comparison to btr_pcur_t::modify_clock to fail.

We might add the field btr_pcur_t::page_id (to compare to what buf_page_t::id() would return when we attempt optimistic restoration) and simply store the FIL_PAGE_LSN contents of the page frame in btr_pcur_t. Replacing modify_clock with FIL_PAGE_LSN and page_id_t would make the optimistic btr_pcur_t::restore_pos() less likely to succeed, because the FIL_PAGE_LSN in an index page would be updated on any insert or update, not only when records are being deleted or pages being evicted or freed.

I computed a table for three block descriptor sizes:

The biggest overhead difference above occurs with 2MiB hugepages and the default innodb_page_size=16k: we would use 1/128 of the memory for 128-byte block descriptors, or 1/64 when using larger block descriptors. I think that we can live with the current sizeof(buf_block_t), only removing buf_page_t::frame.
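As a rough illustration of the proposed replacement check (the types and names here are simplified stand-ins, not the actual btr_pcur_t or buf_page_t definitions):

```cpp
#include <cstdint>

// Simplified stand-ins for the real InnoDB types.
struct page_id_t
{
  uint32_t space, page_no;
  bool operator==(const page_id_t &o) const
  { return space == o.space && page_no == o.page_no; }
};

// What the cursor would remember at store-position time.
struct saved_pos { page_id_t page_id; uint64_t fil_page_lsn; };

// Optimistic restore would succeed only if the block still contains
// the same page and its FIL_PAGE_LSN is unchanged. Because any insert
// or update advances FIL_PAGE_LSN, this fails more often than the
// current modify_clock comparison (which only counts deletions,
// frees, and evictions).
inline bool restore_optimistically(const saved_pos &s,
                                   const page_id_t &block_id,
                                   uint64_t block_lsn)
{
  return s.page_id == block_id && s.fil_page_lsn == block_lsn;
}
```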
| Comment by Marko Mäkelä [ 2023-08-18 ] |

I created a constexpr function that should allow us to calculate the mappings between page frame addresses and block descriptors at compilation time, with the innodb_page_size being the only run-time parameter. We might generate a small number of sets of mapping functions for each supported innodb_page_size (5 values) and innodb_buffer_pool_chunk_size (limited to a small number of sizes) and set function pointers based on the chosen start-up parameters.

In C++11, a constexpr function body must consist of a single return statement. Both Clang and GCC limit the recursion depth to 512 by default. The following naïve attempt requires 351 recursion steps, and it works in all compilers that I tried: GCC 4.8.5 or later; Clang 3.1 or later; ICC 16.0.3 or later; not too old MSVC:
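The code itself is missing from this export. As a hypothetical illustration of the C++11 restriction (a single return statement, so loops become ?:-recursion), a naïve recursive constexpr function for the related problem of sizing the per-chunk descriptor array could look like this; it is NOT the actual MDEV-29445 function:

```cpp
#include <cstddef>

// Hypothetical example: compute how many page frames at the start of
// a chunk must be reserved to hold block descriptors for the rest.
// C++11 constexpr allows only one return statement, so the linear
// search is expressed as ?:-recursion; the depth stays far below the
// default compiler limit of 512 for these chunk sizes.
constexpr std::size_t desc_frames(std::size_t n_frames, std::size_t page_size,
                                  std::size_t desc_size, std::size_t d = 0)
{
  return d * page_size >= (n_frames - d) * desc_size
      ? d
      : desc_frames(n_frames, page_size, desc_size, d + 1);
}

// 2MiB chunk at innodb_page_size=4k, 104-byte descriptors: 13 frames.
static_assert(desc_frames(512, 4096, 104) == 13, "");
// 2MiB chunk at innodb_page_size=16k: a single frame suffices.
static_assert(desc_frames(128, 16384, 104) == 1, "");
```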
| Comment by Marko Mäkelä [ 2023-08-21 ] |

Implementing MDEV-31976 would shrink buf_block_t by 2 pointers. If we also remove the redundant buf_page_t::frame pointer, we would end up with sizeof(buf_block_t) being 136 bytes on 64-bit systems or 100 bytes on 32-bit systems. The sizeof(buf_page_t) is unaffected by that: 80 bytes on 32-bit and 112 bytes on 64-bit.

One more thing that we can do is to replace all pointers in buf_block_t or buf_page_t with 32-bit integers, counting buffer page frame slots from the start of the contiguous buffer pool memory, divided by innodb_page_size. Null pointers can trivially be mapped to the value 0, because at the start of the memory we will always have a buf_block_t and never a valid page frame. The smallest valid nonzero value for the integer would be 2048k/16k=128, which would be equivalent to the buf_block_t starting at the first address of buffer pool memory. There is only one pointer that we cannot replace in this way: buf_block_t::index. That is, sizeof(buf_block_t) would have to be 104 (0x68) bytes on 64-bit systems.

The pointer page_zip_des_t::data would require up to 4 extra bits (ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=1 blocks within an innodb_page_size=16k frame). We have exactly the required amount of 2+2 spare bits available in m_end and n_blobs.

With the minimum innodb_page_size=4k (2¹² bytes), the 32-bit “pointers” would allow innodb_buffer_pool_size to reach up to 2¹²·2³²=2⁴⁴=16TiB. At the maximum innodb_page_size=64k we would reach 2⁴⁸=256TiB, which is the maximum virtual address space size of contemporary 64-bit processors. Here is an updated table:

The worst-case overhead of allocating block descriptors (at innodb_page_size=4k) would be 13/512=2.54% or 6492/262144=2.48%. With the default innodb_page_size=16k the overhead is 1/128=0.78% or 414/65536=0.63%.
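The 32-bit encoding of pool-internal pointers might be sketched as follows (the struct and names are hypothetical; the real change would touch buf_block_t and buf_page_t directly):

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical sketch of 32-bit "pointers" into a contiguous buffer
// pool: the index counts innodb_page_size-sized slots from the start
// of the pool. Index 0 encodes nullptr, because offset 0 always holds
// a buf_block_t and never a valid page frame.
struct pool_sketch
{
  char *base;             // start of the contiguous buffer pool memory
  std::size_t page_size;  // innodb_page_size

  // Assumes `p` is either null or slot-aligned within the pool.
  uint32_t encode(const void *p) const
  {
    return p
      ? uint32_t((static_cast<const char *>(p) - base) / page_size)
      : 0;
  }

  void *decode(uint32_t i) const
  {
    return i ? base + std::size_t(i) * page_size : nullptr;
  }
};
```

With innodb_page_size=4k a 32-bit index spans 2¹²·2³²=2⁴⁴ bytes=16TiB of pool memory, matching the limit computed in the comment.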
| Comment by Marko Mäkelä [ 2023-08-21 ] |

The current minimum value of innodb_buffer_pool_size is 2MiB, which coincidentally is equal to the smaller IA-32 or AMD64 hugepage size. In each 2MiB segment, we would allocate the first 13 page frames (52 KiB) for block descriptors. When using innodb_buffer_pool_size=3m we would reserve a total of 26*4KiB=104 KiB for page descriptors, wasting 6½*4KiB for the last 1MiB, for which we are not going to allocate page frames. When innodb_buffer_pool_chunk_size=1GiB, at every 1GiB we would use 6492*4KiB≈25 MiB for innodb_page_size=4k page descriptors, or 6.5 MiB for innodb_page_size=16k, or 1.63 MiB for innodb_page_size=64k.
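The reservations quoted in this comment can be recomputed with a single ceiling division (a sketch; the helper name is made up):

```cpp
#include <cstddef>

// Smallest d such that d frames of `page` bytes can hold descriptors
// of `desc` bytes for the remaining n_frames - d frames, i.e.
// ceil(n_frames * desc / (page + desc)).
constexpr std::size_t desc_frames(std::size_t n_frames, std::size_t page,
                                  std::size_t desc)
{
  return (n_frames * desc + page + desc - 1) / (page + desc);
}

// 2MiB at innodb_page_size=4k, 104-byte descriptors: 13 frames = 52 KiB.
static_assert(desc_frames(512, 4096, 104) == 13, "");
// Per 1GiB chunk: 6492 4k-frames (~25 MiB), 414 16k-frames (~6.5 MiB),
// and 26 64k-frames (~1.63 MiB).
static_assert(desc_frames(262144, 4096, 104) == 6492, "");
static_assert(desc_frames(65536, 16384, 104) == 414, "");
static_assert(desc_frames(16384, 65536, 104) == 26, "");
```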
| Comment by Marko Mäkelä [ 2023-08-22 ] |

I realized that the trick of replacing 64-bit pointers with 32-bit integers will not work, because the buf_page_t descriptors of compressed-only ROW_FORMAT=COMPRESSED blocks in the buffer pool would be allocated by malloc(), outside the contiguous virtual address range that is associated with the buf_block_t descriptors of uncompressed pages as well as page frames. If we were to remove ROW_FORMAT=COMPRESSED support altogether (which we won’t; see

Here is an updated table that includes these hypothetical cases:

The worst-case overhead of allocating block descriptors (at innodb_page_size=4k) would be 9/512=1.76% (instead of 17/512=3.32%) or 4529/262144=1.73% (instead of 8425/262144=3.21%). With the default innodb_page_size=16k the overhead is 1/128=0.78% or 287/65536=0.44% (or 540/65536=0.83%). Nearly halving the size of the block descriptor from 136 to 72 bytes would roughly halve the memory overhead. For now, we can only shrink the block descriptor by 3 pointers (1 if we do not implement MDEV-31976).
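The "roughly halve" claim can be spot-checked numerically with a ceiling division over the per-chunk frame count (a sketch; the helper name is made up):

```cpp
#include <cstddef>

// Frames reserved for block descriptors per chunk:
// ceil(n_frames * desc / (page + desc)).
constexpr std::size_t desc_frames(std::size_t n_frames, std::size_t page,
                                  std::size_t desc)
{
  return (n_frames * desc + page + desc - 1) / (page + desc);
}

// innodb_page_size=4k: 72-byte descriptors need roughly half as many
// reserved frames as 136-byte descriptors.
static_assert(desc_frames(512, 4096, 72) == 9, "");       // vs 17
static_assert(desc_frames(512, 4096, 136) == 17, "");
static_assert(desc_frames(262144, 4096, 72) == 4529, ""); // vs 8425
static_assert(desc_frames(262144, 4096, 136) == 8425, "");
// innodb_page_size=16k per 1GiB: 287 vs 540 frames.
static_assert(desc_frames(65536, 16384, 72) == 287, "");
static_assert(desc_frames(65536, 16384, 136) == 540, "");
```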