MariaDB Server / MDEV-29445

reorganise innodb buffer pool (and remove buffer pool chunks)

Details

    Description

      The InnoDB buffer pool had been allocated in multiple chunks, because SET GLOBAL innodb_buffer_pool_size would extend the buffer pool in chunks. This would lead to many limitations, such as the inability to shrink the buffer pool below innodb_buffer_pool_chunk_size.

      It would be cleaner to:

      • allocate a contiguous virtual address range for a maximum supported size of buffer pool (a new parameter innodb_buffer_pool_size_max, which defaults to the initially specified innodb_buffer_pool_size)
      • allow the innodb_buffer_pool_size to be changed in increments of 1 megabyte
      • define a fixed mapping between the virtual memory addresses of buffer page descriptors and page frames, to fix bugs like MDEV-34677 and MDEV-35485
      • refactor the shrinking of the buffer pool to provide more meaningful progress output and to avoid hangs

      The complicated logic of having multiple buffer pool chunks can be removed, and the parameter innodb_buffer_pool_chunk_size will be deprecated and ignored.
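
      As an illustration of the fixed mapping, below is a minimal, hypothetical sketch (not the actual InnoDB code; the extent size, descriptor size and descriptor-frame count are assumptions for the example) of how a block descriptor could be located from a page frame address when every extent starts with its own descriptor array:

      #include <cstddef>

      // Assumed example values; the real ones depend on the build and settings.
      constexpr size_t EXTENT_SIZE = 8 << 20;  // bytes covered by one extent
      constexpr size_t PAGE_SIZE   = 16 << 10; // innodb_page_size
      constexpr size_t DESC_SIZE   = 152;      // sizeof(buf_block_t)
      constexpr size_t DESC_FRAMES = 5;        // frames reserved for descriptors

      struct block_descriptor;                 // stand-in for buf_block_t

      // Because descriptors and page frames share the same extent, the mapping
      // is plain address arithmetic; no per-chunk lookup table is needed.
      inline block_descriptor *descriptor_for(unsigned char *pool_base, void *frame)
      {
        const size_t offset   = static_cast<unsigned char*>(frame) - pool_base;
        const size_t extent   = offset / EXTENT_SIZE * EXTENT_SIZE;
        const size_t frame_no = (offset - extent) / PAGE_SIZE - DESC_FRAMES;
        return reinterpret_cast<block_descriptor *>
          (pool_base + extent + frame_no * DESC_SIZE);
      }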

    Attachments

    Issue Links

    Activity

            danblack Daniel Black created issue -
            danblack Daniel Black made changes -
            Field Original Value New Value
            danblack Daniel Black made changes -
            serg Sergei Golubchik made changes -
            Priority Major [ 3 ] Minor [ 4 ]

            When testing crash recovery with a 30GiB buffer pool in MDEV-29911, it was divided into 64 chunks of 480MiB each. I noticed that recv_sys_t::free() (inlined in recv_sys_t::recover_low()) is consuming a significant amount of CPU time. Having a single buffer pool chunk should make that code much faster.

            marko Marko Mäkelä added a comment -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            Fix Version/s 11.1 [ 28549 ]
            Assignee Marko Mäkelä [ marko ]
            Labels energy energy performance
            Priority Minor [ 4 ] Major [ 3 ]
            julien.fritsch Julien Fritsch made changes -
            Priority Major [ 3 ] Critical [ 2 ]
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            ralf.gebhardt Ralf Gebhardt made changes -
            Fix Version/s 11.2 [ 28603 ]
            Fix Version/s 11.1 [ 28549 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            ralf.gebhardt Ralf Gebhardt made changes -
            Fix Version/s 11.3 [ 28565 ]
            Fix Version/s 11.2 [ 28603 ]
            marko Marko Mäkelä made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            Attachment MDEV-29445-sizes.gnumeric [ 71618 ]

            I think that the buffer pool needs to be divided into logical chunks, with an array of buf_block_t being allocated at the start of each chunk, to cover the uncompressed pages in the rest of the chunk.

            To best achieve this, it would be beneficial to shrink sizeof(buf_block_t) to 128 bytes or less. Currently, on a CMAKE_BUILD_TYPE=RelWithDebInfo build of 10.6 or later (after MDEV-27058), we have sizeof(buf_page_t)=112 and sizeof(buf_block_t)=160. By replacing the data member buf_page_t::frame with a member function we could shrink each descriptor by 8 more bytes. The buf_block_t comprises the following:

            struct buf_block_t {
                buf_page_t page; // page descriptor
                ut_list_node<buf_block_t> unzip_LRU; // 2*sizeof(void*), related to ROW_FORMAT=COMPRESSED
                ib_uint64_t modify_clock; // 8 bytes
                volatile uint16_t n_bytes; // 2 bytes
                volatile uint16_t n_fields; // 2 bytes
                uint16_t n_hash_helps; // 2 bytes
                volatile bool left_side; // 1 byte + 1 byte alignment loss
                unsigned int curr_n_fields : 10;
                unsigned int curr_n_bytes : 15;
                unsigned int curr_left_side : 1; // 32 bytes (including alignment loss)
                dict_index_t *index; // 8 bytes
            };
            

            All fields after unzip_LRU are related to the adaptive hash index. Their total size is 32 bytes. The adaptive hash index was disabled by default in MDEV-20487. If we introduce a pointer, say, buf_block_t::ahi, which points to a structure that contains the adaptive hash index information, and at the same time remove the buf_page_t::frame pointer, we would shrink sizeof(buf_block_t) to exactly 128 or 2⁷ bytes. This should keep the arithmetic simple.
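
            As a rough sketch of the proposed layout (hypothetical stand-in types with the sizes quoted above, not the real definitions), the adaptive-hash-index state would move behind a single pointer and the frame address would no longer be stored:

            #include <cstdint>

            // Stand-ins sized per the text above; purely illustrative.
            struct buf_page_t { unsigned char opaque[104]; }; // after dropping ::frame
            template<class T> struct ut_list_node { T *prev, *next; };
            struct dict_index_t;

            // All adaptive hash index state behind one pointer.
            struct btr_search_block_t
            {
              uint64_t modify_clock;
              volatile uint16_t n_bytes;
              volatile uint16_t n_fields;
              uint16_t n_hash_helps;
              volatile bool left_side;
              unsigned curr_n_fields:10, curr_n_bytes:15, curr_left_side:1;
              dict_index_t *index;
            };

            struct buf_block_t
            {
              buf_page_t page;                     // 104 bytes
              ut_list_node<buf_block_t> unzip_LRU; // 16 bytes
              btr_search_block_t *ahi;             // 8 bytes; nullptr while AHI is off
            };                                     // 104+16+8 = 128 = 2**7 bytes

            static_assert(sizeof(buf_block_t) == 128, "descriptor should be 2**7 bytes");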

            Let us consider a few sizes, assuming sizeof(buf_block_t)=128. I calculated some more sizes in MDEV-29445-sizes.gnumeric:

            hugepage size/KiB | innodb_page_size/KiB | descriptor pages | data pages | wasted space/bytes | waste %
            2048 | 4 | 16 | 512-16 | 4096*16-(512-16)*128=2048 | 0.0977%
            2048 | 16 | 1 | 128-1 | 16384*1-(128-1)*128=128 | 0.0061%
            2048 | 64 | 1 | 32-1 | 65536*1-(32-1)*128=61568 | 2.936%
            1048576 | 4 | 7944 | 262144-7944 | 4096*7944-(262144-7944)*128=1024 | 0.0000954%
            1048576 | 16 | 509 | 65536-509 | 16384*509-(65536-509)*128=16000 | 0.00149%
            1048576 | 64 | 32 | 16384-32 | 65536*32-(16384-32)*128=4096 | 0.000381%

            When the largest hugepage size that is supported by the MMU is small, it might make sense to retain the parameter innodb_buffer_pool_chunk_size and allow it to be an integer power-of-2 multiple of the hugepage size.

            marko Marko Mäkelä added a comment -

            The field buf_block_t::modify_clock is not related to the adaptive hash index after all. Its purpose is to identify that an optimistic btr_pcur_t::restore_pos() is not possible. The counter will be incremented whenever a record is deleted from a page, or a page is freed or evicted from the buffer pool. This would cause a comparison to btr_pcur_t::modify_clock to fail. We might add the field btr_pcur_t::page_id (to compare to what buf_page_t::id() would return when we attempt optimistic restoration) and simply store the FIL_PAGE_LSN contents of the page frame in btr_pcur_t. Replacing modify_clock with FIL_PAGE_LSN and page_id_t would make the optimistic btr_pcur_t::restore_pos() less likely, because the FIL_PAGE_LSN in an index page would be updated on any insert or update, not only when records are being deleted or pages being evicted or freed.
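
            A hedged sketch of that alternative check (simplified, hypothetical types and field names; only the comparison itself is shown): the cursor would remember the page identifier and the FIL_PAGE_LSN of the frame, and optimistic restoration would proceed only when both still match:

            #include <cstdint>

            // Simplified stand-ins, not the actual InnoDB definitions.
            struct page_id_t
            {
              uint32_t space, page_no;
              bool operator==(const page_id_t &o) const
              { return space == o.space && page_no == o.page_no; }
            };

            struct latched_page_view   // what the cursor sees after re-latching
            {
              page_id_t id;            // what buf_page_t::id() would return
              uint64_t  page_lsn;      // FIL_PAGE_LSN read from the page frame
            };

            struct saved_cursor_pos    // stored instead of modify_clock
            {
              page_id_t saved_id;
              uint64_t  saved_lsn;

              // Optimistic restore only if the block still holds the same page
              // and the page has not been written to since the position was saved.
              bool can_restore(const latched_page_view &page) const
              { return page.id == saved_id && page.page_lsn == saved_lsn; }
            };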

            I computed a table for three block descriptor sizes:

            sizeof(buf_block_t) | scenario
            152 | removing buf_page_t::frame only
            136 | also moving the adaptive hash index behind a pointer
            128 | also removing modify_clock

            hugepage/KiB | innodb_page_size/KiB | pages/hugepage | 152-byte | 136-byte | 128-byte
            2048 | 4 | 512 | 19 | 17 | 16
            2048 | 8 | 256 | 5 | 5 | 4
            2048 | 16 | 128 | 2 | 2 | 1
            2048 | 32 | 64 | 1 | 1 | 1
            2048 | 64 | 32 | 1 | 1 | 1
            1048576 | 4 | 262144 | 9380 | 8425 | 7944
            1048576 | 8 | 131072 | 2388 | 2141 | 2017
            1048576 | 16 | 65536 | 603 | 540 | 509
            1048576 | 32 | 32768 | 152 | 136 | 128
            1048576 | 64 | 16384 | 38 | 34 | 32

            The biggest overhead difference above occurs with 2MiB hugepages and the default innodb_page_size=16k: We would use 1/128 of the memory for 128-byte block descriptors, or 1/64 when using larger block descriptors.

            I think that we can live with the current sizeof(buf_block_t), only removing buf_page_t::frame.

            marko Marko Mäkelä added a comment -

            I created a constexpr function that should allow us to calculate the mappings between page frame addresses and block descriptors at compilation time, with the innodb_page_size being the only run-time parameter. We might generate a small number of sets of mapping functions for each supported innodb_page_size (5 values) and innodb_buffer_pool_chunk_size (limited to a small number of sizes) and set function pointers based on the chosen start-up parameters.

            In C++11, a constexpr function body must consist of a single return statement. Both Clang and GCC limit the recursion depth to 512 by default. The following naïve attempt requires 351 recursion steps, and it works in all compilers that I tried: GCC 4.8.5 or later; clang 3.1 or later; ICC 16.0.3 or later; not too old MSVC:

            // Given the page frames per chunk (pages), the block descriptor size
            // (bs) and the page size (ps), reduce the descriptor frame count b
            // while more than one full page frame of the reserved space would
            // remain unused by the descriptors of the (pages - b) data frames.
            static constexpr size_t fix(size_t pages, size_t bs, size_t ps, size_t b)
            {
              return ((ps * b - (pages - b) * bs) > ps)
                ? fix(pages, bs, ps, b - 1)
                : b;
            }

            // Start from the upper bound ceil(pages * bs / ps), that is, enough
            // page frames to hold a descriptor for every frame in the chunk.
            static constexpr size_t b(size_t pages, size_t bs, size_t ps)
            {
              return fix(pages, bs, ps, (pages * bs + (ps - 1)) / ps);
            }

            // Descriptor page frames needed per chunk of hugepagesize KiB.
            static constexpr size_t bpp(size_t hugepagesize, size_t bs, size_t ps)
            {
              return b(hugepagesize * 1024 / ps, bs, ps);
            }

            constexpr size_t big = 152; // sizeof(buf_block_t)

            // Descriptor page frames per 2MiB and per 1GiB hugepage for each
            // supported innodb_page_size.
            constexpr static size_t sizes[] = {
              bpp(2048, big, 4096),
              bpp(2048, big, 8192),
              bpp(2048, big, 16384),
              bpp(2048, big, 32768),
              bpp(2048, big, 65536),
              bpp(1048576, big, 4096),
              bpp(1048576, big, 8192),
              bpp(1048576, big, 16384),
              bpp(1048576, big, 32768),
              bpp(1048576, big, 65536)
            };
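
            As a usage sketch, these helpers can be spot-checked at compile time against a few of the values quoted in this ticket (only a sanity check, nothing more):

            static_assert(bpp(2048, big, 4096) == 19,
                          "19 descriptor frames per 2MiB at innodb_page_size=4k");
            static_assert(bpp(2048, big, 16384) == 2,
                          "2 descriptor frames per 2MiB at innodb_page_size=16k");
            static_assert(bpp(1048576, big, 16384) == 603,
                          "603 descriptor frames per 1GiB at innodb_page_size=16k");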
            

            marko Marko Mäkelä added a comment -
            marko Marko Mäkelä made changes -

            Implementing MDEV-31976 would shrink buf_block_t by 2 pointers. If we also remove the redundant buf_page_t::frame pointer, we would end up with sizeof(buf_block_t) being 136 bytes on 64-bit systems or 100 bytes on 32-bit systems. The sizeof(buf_page_t) is unaffected by that: 80 bytes on 32-bit and 112 bytes on 64-bit.

            One more thing that we can do is to replace all pointers in buf_block_t or buf_page_t with 32-bit integers that count page-frame-sized slots from the start of the contiguous buffer pool memory (that is, the byte offset divided by innodb_page_size). Null pointers can trivially be mapped to the value 0, because at the start of the memory we will always have a buf_block_t and never a valid page frame. The smallest valid nonzero value for the integer would be 2048k/16k=128, which would be equivalent to the buf_block_t starting at the first address of buffer pool memory. There is only one pointer that we cannot replace in this way: buf_block_t::index. That is, sizeof(buf_block_t) would have to be 104 (0x68) bytes on 64-bit systems.

            The pointer page_zip_des_t::data would require up to 4 extra bits (ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=1 blocks within an innodb_page_size=16k frame). We have exactly the required amount of 2+2 spare bits available in m_end and n_blobs.

            With the minimum innodb_page_size=4k (2¹² bytes), the 32-bit “pointers” would allow innodb_buffer_pool_size to reach up to 2¹²·2³²=2⁴⁴=16TiB. At the maximum innodb_page_size=64k we would reach 2⁴⁸=256TiB, which is the maximum virtual address space size of contemporary 64-bit processors.
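
            A minimal sketch of that encoding (hypothetical helper names and assumed globals; not actual InnoDB code): the “pointer” becomes the frame number counted from the start of the contiguous buffer pool mapping, and 0 serves as the null value because offset 0 always holds a buf_block_t, never a page frame:

            #include <cstddef>
            #include <cstdint>

            // Assumed globals: the contiguous buffer pool mapping and the page size.
            extern unsigned char *buf_pool_base; // start of the reserved address range
            extern size_t         srv_page_size; // innodb_page_size

            // Encode a page frame address as a 32-bit frame number (0 = null).
            inline uint32_t frame_to_ref(const void *frame)
            {
              return frame
                ? uint32_t((static_cast<const unsigned char*>(frame) - buf_pool_base)
                           / srv_page_size)
                : 0;
            }

            // Decode a 32-bit frame number back into an address.
            inline unsigned char *ref_to_frame(uint32_t ref)
            {
              return ref ? buf_pool_base + size_t(ref) * srv_page_size : nullptr;
            }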

            Here is an updated table:

            hugepage/KiB | innodb_page_size/KiB | pages/hugepage | 152-byte | 136-byte | 100-byte | 104-byte
            2048 | 4 | 512 | 19 | 17 | 13 | 13
            2048 | 8 | 256 | 5 | 5 | 4 | 4
            2048 | 16 | 128 | 2 | 2 | 1 | 1
            2048 | 32 | 64 | 1 | 1 | 1 | 1
            2048 | 64 | 32 | 1 | 1 | 1 | 1
            1048576 | 4 | 262144 | 9380 | 8425 | 6248 | 6492
            1048576 | 8 | 131072 | 2388 | 2141 | 1581 | 1644
            1048576 | 16 | 65536 | 603 | 540 | 398 | 414
            1048576 | 32 | 32768 | 152 | 136 | 100 | 104
            1048576 | 64 | 16384 | 38 | 34 | 25 | 26

            The worst-case overhead of allocating block descriptors (at innodb_page_size=4k) would be 13/512=2.54% or 6492/262144=2.48%. With the default innodb_page_size=16k the overhead is 1/128=0.78% or 414/65536=0.63%.

            marko Marko Mäkelä added a comment -

            The current minimum value of innodb_buffer_pool_size is 2MiB, which coincidentally is equal to the smaller IA-32 or AMD64 hugepage size. In each 2MiB segment, we would allocate the first 13 page frames (52 kilobytes) for block descriptors. When using innodb_buffer_pool_size=3m we would reserve a total of 26*4KiB=104 KiB for page descriptors, wasting 6½*4KiB for the last 1MiB for which we are not going to allocate page frames.

            When innodb_buffer_pool_chunk_size=1GiB, at every 1GiB we would use 6492*4KiB=25 MiB for innodb_page_size=4k page descriptors, or 6.5 MiB for innodb_page_size=16k, or 1.63 MiB for innodb_page_size=64k.

            marko Marko Mäkelä added a comment -

            I realized that the trick of replacing 64-bit pointers with 32-bit integers will not work, because the buf_page_t descriptors of compressed-only ROW_FORMAT=COMPRESSED blocks in the buffer pool would be allocated by malloc(), outside the contiguous virtual address range that is associated with the buf_block_t descriptors of uncompressed pages as well as page frames. If we were to remove ROW_FORMAT=COMPRESSED support altogether (which we won’t; see MDEV-22367), then sizeof(buf_block_t) would be shrunk further to 88 bytes on 32-bit systems, and possibly 96 on 64-bit. By further removing the adaptive hash index we would come down to 72 bytes (on both 32-bit and 64-bit systems).

            Here is an updated table that includes these hypothetical cases:

            hugepage/KiB | innodb_page_size/KiB | pages/hugepage | 152-byte | 136-byte | 100-byte | 104-byte | 88-byte | 96-byte | 72-byte
            2048 | 4 | 512 | 19 | 17 | 13 | 13 | 11 | 12 | 9
            2048 | 8 | 256 | 5 | 5 | 4 | 4 | 3 | 3 | 3
            2048 | 16 | 128 | 2 | 2 | 1 | 1 | 1 | 1 | 1
            2048 | 32 | 64 | 1 | 1 | 1 | 1 | 1 | 1 | 1
            2048 | 64 | 32 | 1 | 1 | 1 | 1 | 1 | 1 | 1
            1048576 | 4 | 262144 | 9380 | 8425 | 6248 | 6492 | 5514 | 6004 | 4529
            1048576 | 8 | 131072 | 2388 | 2141 | 1581 | 1644 | 1394 | 1519 | 1142
            1048576 | 16 | 65536 | 603 | 540 | 398 | 414 | 351 | 382 | 287
            1048576 | 32 | 32768 | 152 | 136 | 100 | 104 | 88 | 96 | 72
            1048576 | 64 | 16384 | 38 | 34 | 25 | 26 | 22 | 24 | 18

            The worst-case overhead of allocating block descriptors (at innodb_page_size=4k) would be 9/512=1.76% (instead of 17/512=3.32%) or 4529/262144=1.73% (instead of 8425/262144=3.21%). With the default innodb_page_size=16k the overhead is 1/128=0.78% or 287/65536=0.44% (or 540/65536=0.83%). Nearly halving the size of the block descriptor from 136 to 72 bytes would roughly halve the memory overhead. For now, we can only shrink the block descriptor by 3 pointers (1 if we do not implement MDEV-31976).

            marko Marko Mäkelä added a comment -
            julien.fritsch Julien Fritsch made changes -
            marko Marko Mäkelä made changes -
            serg Sergei Golubchik made changes -
            Fix Version/s 11.4 [ 29301 ]
            Fix Version/s 11.3 [ 28565 ]
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            serg Sergei Golubchik made changes -
            Fix Version/s 11.5 [ 29506 ]
            Fix Version/s 11.4 [ 29301 ]
            julien.fritsch Julien Fritsch made changes -
            Status In Progress [ 3 ] Stalled [ 10000 ]
            marko Marko Mäkelä made changes -
            Status Stalled [ 10000 ] In Progress [ 3 ]

            I see that madvise(addr, length, MADV_FREE) is available on Linux, FreeBSD, OpenBSD, NetBSD, Dragonfly BSD and Solaris. On IBM AIX, all forms of madvise() are ignored. On macOS, we would want MADV_FREE_REUSABLE.

            I think that we will need a new start-up parameter innodb_buffer_pool_max_size that specifies the virtual address range size that will be allocated for the InnoDB buffer pool. The parameter innodb_buffer_pool_chunk_size would be deprecated and have no effect. The innodb_buffer_pool_size may be set to anything up to the predeclared maximum size. At all times, the usable size of the buffer pool (in terms of page frames) would be innodb_buffer_pool_size minus the overhead of allocating the buf_block_t descriptors.
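
            A hedged, Linux-flavoured sketch of that approach (hypothetical function names, error handling omitted): reserve the whole declared maximum address range up front without committing memory, then commit on grow and give the tail back to the kernel on shrink while keeping the reservation:

            #include <sys/mman.h>
            #include <cstddef>

            static void *pool_reserve(size_t max_size)
            {
              // Reserve the virtual address range only; no memory is committed yet.
              return mmap(nullptr, max_size, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
            }

            static void pool_grow(void *base, size_t old_size, size_t new_size)
            {
              // Make the additional range accessible; pages fault in on first use.
              mprotect(static_cast<char*>(base) + old_size, new_size - old_size,
                       PROT_READ | PROT_WRITE);
            }

            static void pool_shrink(void *base, size_t old_size, size_t new_size)
            {
              // Let the kernel reclaim the tail, but keep the address range reserved
              // so that a later grow can reuse the same virtual addresses.
              madvise(static_cast<char*>(base) + new_size, old_size - new_size,
                      MADV_FREE);
              mprotect(static_cast<char*>(base) + new_size, old_size - new_size,
                       PROT_NONE);
            }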

            marko Marko Mäkelä added a comment -
            marko Marko Mäkelä made changes -

            To start with, it could be simplest to double the granularity of innodb_buffer_pool_size from 1 to 2 megabytes, which coincidentally is the smallest hugepage size on many AMD64 implementations, and to make the mapping of block descriptors to block addresses independent of the MMU or TLB page size.

            After removing the buf_page_t::frame pointer, we have sizeof(buf_block_t) of 152 bytes in a non-debug 64-bit build, or 200 bytes in a debug build. Within each 2MiB slice of the innodb_buffer_pool_size we would have block descriptors and then the corresponding page frames. Let us look at the relevant part of the previously constructed table:

            innodb_page_size/KiB | pages/2MiB | 152-byte
            4 | 512 | 19
            8 | 256 | 5
            16 | 128 | 2
            32 | 64 | 1
            64 | 32 | 1

            The first line means that at innodb_page_size=4k, we would have 512 page frames per 2MiB. But, we will allocate the first 19 of those page frames for the 152-byte buf_block_t descriptors, that is, 19*4096/152 = 77824/152 = up to 512 block descriptors. We actually use 512-19=493 block descriptors.

            Similarly, at the default innodb_page_size=16k we would need 2 page frames = 32768 bytes for allocating the 128-2=126 block descriptors (126*152=19152 bytes). The same 2 page frames would also suffice on debug builds: 126*200=25200 bytes still falls between 16384 and 32768 bytes.

            marko Marko Mäkelä added a comment -

            If we kept doubling the size of an extent (the allocation granularity) of innodb_buffer_pool_size further, the allocation of block descriptors would incur even less overhead. Here are a few numbers, corresponding to sizeof(buf_block_t) for 64-bit and 32-bit non-debug builds:

            extent | innodb_page_size | pages/extent | 152-byte | 108-byte
            2MiB | 4KiB | 512 | 19 | 14
            2MiB | 8KiB | 256 | 5 | 4
            2MiB | 16KiB | 128 | 2 | 1
            2MiB | 32KiB | 64 | 1 | 1
            2MiB | 64KiB | 32 | 1 | 1
            4MiB | 4KiB | 1024 | 37 | 27
            4MiB | 8KiB | 512 | 10 | 7
            4MiB | 16KiB | 256 | 3 | 2
            4MiB | 32KiB | 128 | 1 | 1
            4MiB | 64KiB | 64 | 1 | 1
            8MiB | 4KiB | 2048 | 74 | 53
            8MiB | 8KiB | 1024 | 19 | 14
            8MiB | 16KiB | 512 | 5 | 4
            8MiB | 32KiB | 256 | 2 | 1
            8MiB | 64KiB | 128 | 1 | 1
            16MiB | 4KiB | 4096 | 147 | 106
            16MiB | 8KiB | 2048 | 38 | 27
            16MiB | 16KiB | 1024 | 10 | 7
            16MiB | 32KiB | 512 | 3 | 2
            16MiB | 64KiB | 256 | 1 | 1

            Each time the number of block descriptor page frames per extent is odd, halving the size of the extent would incur more overhead. But we would not want to unnecessarily increase the granularity of innodb_buffer_pool_size. Below is the above information represented in a more compact format, which may be harder to understand but easier to compare. The first and the (second) choice for each page size are highlighted:

            sizeof(buf_block_t) | extent | 4KiB | 8KiB | 16KiB | 32KiB | 64KiB
            152 | 2MiB | 19 | 5 | 2 | 1 | 1
            152 | 4MiB | (37) | 10 | (3) | (1) | 1
            152 | 8MiB | 74 | 19 | 5 | 2 | (1)
            152 | 16MiB | 147 | 38 | 10 | 3 | 1
            108 | 2MiB | 14 | 4 | (1) | 1 | 1
            108 | 4MiB | (27) | (7) | 2 | (1) | 1
            108 | 8MiB | 53 | 14 | 4 | 1 | (1)
            108 | 16MiB | 106 | 27 | 7 | 2 | 1

            At the default innodb_page_size=16k, we can see that out of these numbers, we get the minimal overhead for 64-bit systems (152-byte descriptors) with 8MiB extents, using 5*16KiB of descriptors to cover 512-5=507 pages. That is an overhead of 5/512=0.98%. For innodb_page_size=4k the overhead would be 74/2048=3.61%. By using 16MiB extents we could lower that to 147/4096=3.59%, which is not much better. At innodb_page_size=64k we would halve the overhead from 1/128=0.78% to 1/256=0.39%, but that is a small overhead to begin with and not a default page size.

            On 32-bit systems, the 8MiB extent size would be close to optimal, but I think that we should go with 2MiB extent size, doubling the previous granularity of 1MiB. For innodb_page_size=16k we would use 1 page frame to cover 128-1=127 pages, corresponding to an overhead of 1/128=0.78%. With a 16MiB extent size, the overhead would drop to 7/1024=0.68%, which is not significantly better.
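
            These per-extent figures can be spot-checked against the constexpr bpp() helper sketched earlier in this ticket (its first argument is the extent size in KiB; again this is only a sanity check of the quoted numbers):

            static_assert(bpp(8192, 152, 16384) == 5,
                          "8MiB extents, 16KiB pages, 152-byte descriptors");
            static_assert(bpp(8192, 108, 16384) == 4,
                          "8MiB extents, 16KiB pages, 108-byte descriptors");
            static_assert(bpp(2048, 108, 16384) == 1,
                          "2MiB extents, 16KiB pages, 108-byte descriptors");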

            marko Marko Mäkelä added a comment -

            Just to note - large pages on Windows are special: they can't just be reserved; they must be reserved and committed in one go via VirtualAlloc(MEM_RESERVE|MEM_COMMIT).
            From what I remember, and my memory might be a bit dated, they are always committed and locked in memory (thus the LockPagesInMemory privilege is required to allocate them).

            Anyway, it might turn out that the buffer pool extending and shrinking functionality only works with chunks if the buffer pool consists of large pages. Perhaps it is not a showstopper, but we should at least have a test for it.

            wlad Vladislav Vaintroub added a comment -
            marko Marko Mäkelä made changes -

            The minimum buffer pool size was 256*5/4 pages, or 320 pages, which corresponds to exactly 5 MiB when using the default innodb_page_size=16k. After these changes, the innodb_buffer_pool_size will include the block descriptors, and therefore the minimum will increase to innodb_buffer_pool_size=6m for that page size. The allocation granularity will remain 1 MiB. If the last extent is incomplete (not a multiple of 8 MiB on 64-bit systems), the first usable page of the last extent will be after the descriptor page frames. That is, if only 1MiB of the last extent were used, we might reserve 5*16KiB for block descriptors, and only 59 page frames at offsets 5‥63 would be available. This setup will allow the buffer pool to be resized freely, completely ignoring innodb_buffer_pool_chunk_size, to anything up to the new parameter innodb_buffer_pool_size_max.
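
            A small worked sketch of that arithmetic for the default innodb_page_size=16k, assuming 8MiB extents with 5 descriptor frames reserved per (even partial) extent as described above (the helper name is hypothetical):

            #include <cstddef>

            // Usable page frames for a given innodb_buffer_pool_size in MiB,
            // with 16KiB pages, 8MiB extents and 5 descriptor frames per extent.
            constexpr size_t usable_pages(size_t pool_mib)
            {
              return pool_mib / 8 * (512 - 5)
                + (pool_mib % 8 ? (pool_mib % 8) * 64 - 5 : 0);
            }

            static_assert(usable_pages(5) == 315, "5MiB falls below the 320-page minimum");
            static_assert(usable_pages(6) == 379, "6MiB satisfies the 320-page minimum");

            This matches the stated increase of the minimum to innodb_buffer_pool_size=6m: 320 frames at 16KiB are exactly 5 MiB, but 5 of them would now be descriptor frames, leaving only 315 usable pages.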

            On Microsoft Windows, I think that we must disable buffer pool resizing when using large_pages.

            marko Marko Mäkelä added a comment -

            I tested this with a 30-second Sysbench oltp_update_index workload, with the following statement executed right before the server shutdown:

            SET GLOBAL innodb_buffer_pool_size=10485760, innodb_fast_shutdown=0;
            

            An attempt to do this with a CMAKE_BUILD_TYPE=RelWithDebInfo server yesterday led to a crash because there were race conditions in my initial shrinking algorithm. With today’s fixes, the server does not crash or hang, and the buffer pool resizing will be aborted due to running out of space:

            2024-03-08 14:25:34 69 [Note] InnoDB: Trying to shrink innodb_buffer_pool_size=10m (630 pages) from 30720m (1946880 pages, to withdraw 119220)
            2024-03-08 14:25:36 0 [Warning] InnoDB: Could not free any blocks in the buffer pool! 1013 blocks are in use and 38 free. Consider increasing innodb_buffer_pool_size.
            2024-03-08 14:25:36 69 [ERROR] mariadbd: innodb_buffer_pool_size change aborted
            2024-03-08 14:25:36 0 [Note] /dev/shm/10.6/sql/mariadbd (initiated by: root[root] @ localhost []): Normal shutdown
            2024-03-08 14:25:36 0 [Note] InnoDB: FTS optimize thread exiting.
            2024-03-08 14:25:36 0 [Note] InnoDB: to purge 5942416 transactions
            2024-03-08 14:25:46 0 [Note] InnoDB: Starting shutdown...
            

            Before I implemented this logic, I observed a hang where the purge coordinator and 2 purge worker tasks were blocked, waiting to allocate a page frame for reading something into the buffer pool.

            marko Marko Mäkelä added a comment -

            It was tricky to get the logic around ROW_FORMAT=COMPRESSED to work when the buffer pool is being shrunk, but I think that I finally got it working today. The rewritten test innodb.innodb_buffer_pool_resize was very useful for that, along with rr record, of course. Failures were mostly observed in RelWithDebInfo, not Debug, which complicated the debugging.

            I ran a quick Sysbench based performance test to compare this to its prerequisite MDEV-33588:

            revision | throughput/tps | average latency/ms
            baseline | 198494.96 | 0.32
            MDEV-33588+baseline | 196647.46 | 0.32
            work in progress | 194563.12 | 0.33

            I think that more extensive testing is needed to see if MDEV-33588 actually introduces a performance regression. There undeniably is a clear performance regression for the current work in progress; I have also observed it earlier. I think that it can be helped by not removing the buf_page_t::frame pointer. We can initialize it lazily.

            marko Marko Mäkelä added a comment -

            By the way, before this task, InnoDB could hang if one attempted to shrink the buffer pool too much:

            10.6 4ac8c4c820ebcff3571a2c67acc4fc41510b2d33

            2024-03-14 16:34:39 69 [Note] InnoDB: Requested to resize buffer pool. (new size: 134217728 bytes)
            2024-03-14 16:34:39 0 [Note] InnoDB: Resizing buffer pool from 32212254720 to 134217728 (unit=134217728).
            2024-03-14 16:34:39 0 [Note] InnoDB: Disabling adaptive hash index.
            2024-03-14 16:34:39 0 [Note] InnoDB: Withdrawing blocks to be shrunken.
            2024-03-14 16:34:39 0 [Note] InnoDB: start to withdraw the last 1938768 blocks
            2024-03-14 16:34:39 0 [Note] /dev/shm/10.6g/sql/mariadbd (initiated by: root[root] @ localhost []): Normal shutdown
            2024-03-14 16:34:39 0 [Note] InnoDB: FTS optimize thread exiting.
            2024-03-14 16:34:39 0 [Note] InnoDB: to purge 22333135 transactions
            2024-03-14 16:34:40 0 [Warning] InnoDB: Could not free any blocks in the buffer pool! 383253 blocks are in use and 0 free. Consider increasing innodb_buffer_pool_size.
            2024-03-14 16:34:40 0 [Note] InnoDB: withdrawing blocks. (1563630/1938768)
            2024-03-14 16:34:40 0 [Note] InnoDB: withdrew 1414234 blocks from free list. Tried to relocate 0 pages (1563755/1938768)
            …
            2024-03-14 16:35:02 0 [Note] InnoDB: withdrawing blocks. (1604389/1938768)
            2024-03-14 16:35:02 0 [Note] InnoDB: withdrew 0 blocks from free list. Tried to relocate 0 pages (1604389/1938768)
            2024-03-14 16:35:02 0 [Note] InnoDB: will retry to withdraw later
            

            marko Marko Mäkelä added a comment -
            marko Marko Mäkelä made changes -
            Status In Progress [ 3 ] Stalled [ 10000 ]
            serg Sergei Golubchik made changes -
            Fix Version/s 11.6 [ 29515 ]
            Fix Version/s 11.5 [ 29506 ]
            serg Sergei Golubchik made changes -
            Fix Version/s 11.7 [ 29815 ]
            Fix Version/s 11.6 [ 29515 ]
            mariadb-jira-automation Jira Automation (IT) made changes -
            Zendesk Related Tickets 201628
            Zendesk active tickets 201628
            serg Sergei Golubchik made changes -
            Fix Version/s 11.8 [ 29921 ]
            Fix Version/s 11.7 [ 29815 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Priority Critical [ 2 ] Major [ 3 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Priority Major [ 3 ] Critical [ 2 ]
            marko Marko Mäkelä made changes -
            serg Sergei Golubchik made changes -
            Fix Version/s 11.9 [ 29945 ]
            Fix Version/s 11.8 [ 29921 ]
            marko Marko Mäkelä made changes -
            Status Stalled [ 10000 ] In Progress [ 3 ]

            I revived this work, still based on 10.6 so that if any unrelated bugs are found during testing, it will be more convenient to fix them. There currently is an issue with the innodb.doublewrite test, which I have yet to fully diagnose and fix. It might be the case that something around the doublewrite buffer is currently broken.

            This will also include MDEV-25340. The server start-up time with a large buffer pool size seems to be roughly halved from what it used to be.

            I tested a 64-thread, 64-table, 100k row Sysbench oltp_update_index with an initial 30GiB buffer pool, shrinking it to 10 MiB during the workload:

            10.6-MDEV-29445 07513d4faba65a074ce64d308f5327cd5954a324

            [ 60s ] thds: 64 tps: 188308.82 qps: 188308.82 (r/w/o: 0.00/188308.82/0.00) lat (ms,99%): 0.86 err/s: 0.00 reconn/s: 0.00
            [ 65s ] thds: 64 tps: 188916.86 qps: 188917.06 (r/w/o: 0.00/188917.06/0.00) lat (ms,99%): 0.83 err/s: 0.00 reconn/s: 0.00
            [ 70s ] thds: 64 tps: 191395.07 qps: 191394.87 (r/w/o: 0.00/191394.87/0.00) lat (ms,99%): 0.83 err/s: 0.00 reconn/s: 0.00
            [ 75s ] thds: 64 tps: 78314.97 qps: 78314.97 (r/w/o: 0.00/78314.97/0.00) lat (ms,99%): 0.86 err/s: 0.00 reconn/s: 0.00
            [ 80s ] thds: 64 tps: 772.44 qps: 772.44 (r/w/o: 0.00/772.44/0.00) lat (ms,99%): 5607.61 err/s: 0.00 reconn/s: 0.00
            [ 85s ] thds: 64 tps: 1695.37 qps: 1695.37 (r/w/o: 0.00/1695.37/0.00) lat (ms,99%): 179.94 err/s: 0.00 reconn/s: 0.00
            [ 90s ] thds: 64 tps: 1789.03 qps: 1789.03 (r/w/o: 0.00/1789.03/0.00) lat (ms,99%): 153.02 err/s: 0.00 reconn/s: 0.00
            [ 95s ] thds: 64 tps: 1742.64 qps: 1742.64 (r/w/o: 0.00/1742.64/0.00) lat (ms,99%): 170.48 err/s: 0.00 reconn/s: 0.00
            [ 100s ] thds: 64 tps: 1736.15 qps: 1736.15 (r/w/o: 0.00/1736.15/0.00) lat (ms,99%): 155.80 err/s: 0.00 reconn/s: 0.00
            [ 105s ] thds: 64 tps: 1834.52 qps: 1834.52 (r/w/o: 0.00/1834.52/0.00) lat (ms,99%): 137.35 err/s: 0.00 reconn/s: 0.00
            [ 110s ] thds: 64 tps: 1836.39 qps: 1836.39 (r/w/o: 0.00/1836.39/0.00) lat (ms,99%): 139.85 err/s: 0.00 reconn/s: 0.00
            [ 115s ] thds: 64 tps: 1744.06 qps: 1744.06 (r/w/o: 0.00/1744.06/0.00) lat (ms,99%): 176.73 err/s: 0.00 reconn/s: 0.00
            [ 120s ] thds: 64 tps: 1780.16 qps: 1780.16 (r/w/o: 0.00/1780.16/0.00) lat (ms,99%): 161.51 err/s: 0.00 reconn/s: 0.00
            

            In the server error log, I see that shutdown is hanging, so I will have something to do:

            2025-01-30 15:36:37 69 [Note] InnoDB: Trying to shrink innodb_buffer_pool_size=10m (630 pages) from 30720m (1946880 pages, to withdraw 328633)
            2025-01-30 15:36:46 69 [Note] InnoDB: Resizing hash tables
            2025-01-30 15:36:46 69 [Note] InnoDB: innodb_buffer_pool_size=10m (630 pages) resized from 30720m (1946880 pages)
            2025-01-30 15:37:25 0 [Note] /dev/shm/10.6/sql/mariadbd (initiated by: root[root] @ localhost []): Normal shutdown
            2025-01-30 15:37:25 0 [Note] InnoDB: FTS optimize thread exiting.
            2025-01-30 15:37:25 0 [Note] InnoDB: to purge 11743512 transactions
            2025-01-30 15:37:26 0 [Warning] InnoDB: Could not free any blocks in the buffer pool! 413 blocks are in use and 0 free. Consider increasing innodb_buffer_pool_size.
            

            The buf_flush_page_cleaner() thread is busy, invoking buf_flush_LRU() and unable to free any pages. There are no dirty pages in the buffer pool. I checked that all the 413 pages of buf_pool.LRU are undo log pages in state UNFIXED+1, all of them buffer-fixed by trx_purge_attach_undo_recs(). The buffer pool usage limit in that subsystem seems to be misplaced.

            marko Marko Mäkelä added a comment -

            I applied some tweaks and performed some more load testing. An attempt to shrink the buffer pool too much during a heavy write load will typically fail, because buf_flush_page_cleaner() would invoke buf_pool.LRU_warn() to notify that it is unable to free any blocks.

            marko Marko Mäkelä added a comment -

            I ran a simple performance test on RAM disk on a dual Intel® Xeon® Gold 6230R (26×2 threads per socket), with innodb_buffer_pool_size=5G and innodb_log_file_size=5G:

            sysbench oltp_update_index --tables=100 --table_size=10000 --threads=100 --time=120 --report-interval=5 --max-requests=0 run
            

            Compared to the baseline, I observed a 2% regression in the average throughput. My first suspect would be the lazy initialization of the buffer pool (MDEV-25340), which is part of this change, but I have not analyzed it further yet.

            I also tested crash recovery by killing the workload about 115 seconds into it (5 seconds before it would end), and measuring the time to recover a copy of that data directory, using two settings for innodb_buffer_pool_size: 1 GiB (requiring 2 recovery batches) and 5 GiB (682,236,800 bytes of log processed in 1 batch). The times between baseline and the patch were very similar. I will have to repeat this experiment after diagnosing and addressing the performance regression during the workload.

            marko Marko Mäkelä added a comment -

Some regression right after server startup could be expected due to the lazy initialization of the buffer pool. However, once the part of the buffer pool that corresponds to the working set of the workload has been fully initialized, we could expect that no buf_pool_t::lazy_allocate() calls will take place. According to perf record -g, only 2.13% of the total samples are recorded in buf_page_get_low(), with 0.83% of the total samples attributed to waiting for a shared buf_page_t::lock there. Every other buffer pool related function accounts for less than 0.02% of the samples, probably less than 0.2% of the total in aggregate. Because this is a 10.6 based development branch without MDEV-27774, a write-heavy benchmark such as sysbench oltp_update_index would be dominated by contention on log_sys.mutex. I will move on to a read-only benchmark.

marko Marko Mäkelä added a comment

            I am observing a small performance improvement with

            sysbench oltp_read_only --tables=100 --table_size=100000 --threads=100 --time=120 --report-interval=5 --max-requests=0 run

Both for the baseline and the patched version, the first of two consecutive runs is faster. The patch improves throughput (queries per second) by 775537.75/769316.38 − 1 ≈ 0.81% for the first run and by 761852.58/758954.99 − 1 ≈ 0.38% for the second run.

            My benchmark setup is far from reliable. To get more stable numbers, it would help to pin the mariadbd process to a single NUMA node and to disable hyperthreading as well. This can be considered at most as a sanity check before running a broader set of performance tests.

marko Marko Mäkelä added a comment
            marko Marko Mäkelä made changes -

            After some discussion with wlad, I decided to check if this actually depends on lazy buffer pool initialization (MDEV-25340), which I had implemented in my development branch. It seems that the logic in many places would be simpler if that change was reverted. I posted some performance test results to MDEV-25340. With a 96 GiB buffer pool allocated in 1 GiB MMU pages on a dual Haswell/Broadwell Xeon that has 2×64GiB of RAM, we’re talking about possibly halving the start-up time, but the starting point was less than 1 second. I think that a startup time of about 10 ms/GB should be acceptable.

            Today, I removed the lazy allocation. Two loops on startup and resize are now invoking block_descriptors_in_bytes, which makes them extremely slow. I will fix that next week. Then, hopefully, this task is practically done. In the stress testing so far, mleich has reported one mystery crash that does not reproduce under rr. It might end up having been fixed by today’s cleanup.

marko Marko Mäkelä added a comment

            Today, after fixing the loops, I retested the MDEV-25340 scenario. I observed a significantly slower but still kind-of acceptable start-up time compared to the lazy initialization.

            I believe that this should ideally target the 10.11 branch, for the following reasons:

            • This fixes a race condition in the adaptive hash index and therefore fixes MDEV-35485.
• This might make the MDEV-24670 interface obsolete; that interface has been somewhat problematic and first appeared in 10.11.
marko Marko Mäkelä added a comment
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            Assignee Marko Mäkelä [ marko ] Debarun Banerjee [ JIRAUSER54513 ]
            Status In Progress [ 3 ] In Review [ 10002 ]
            marko Marko Mäkelä made changes -
marko Marko Mäkelä added a comment - edited

            Based on some performance tests that I conducted today, I’m seeing a much smaller reduction of the resident set size than expected when adjusting the innodb_buffer_pool_size between 50GiB and 1GiB.

            Edit: I started to think if we really need to include the lazy buffer pool allocation (MDEV-25340) in this, but then I realized that the lazy allocation would only have an impact when increasing innodb_buffer_pool_size. By lazy allocation, we would avoid the immediate pollution of some pages by linking each block descriptor to the buf_pool.free list. This ought to be more prominent when using large_pages (1 GiB or 2 MiB instead of 4 KiB on x86-64).

            I figured out a way to artificially limit the available memory when not using large_pages. The following would seem to ‘retract’ 98 GiB RAM on my 128 GiB test system:

            echo 96|sudo tee /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
            echo 1024|sudo tee /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
            

            When I was testing some memory pressure related changes (MDEV-34863) in the same branch, I think that I observed some benefit from invoking madvise(MADV_FREE), but I am unsure if it has any effect on explicit huge pages on Linux.

            In pmap -x $(pgrep mariadbd) I can identify the buffer pool allocation:

            Address           Kbytes     RSS   Dirty Mode  Mapping
            00007f0ae4000000 52443136 18468352 3801088 rw---   [ anon ]
            

            The "Dirty" size is slightly smaller than the current innodb_buffer_pool_size=4m. The virtual size corresponds to the start-up parameter innodb_buffer_pool_size_max=50m. But the resident set size (RSS) is much larger than I would expect.


I found some claims that Linux treats MADV_FREE only as a hint, freeing the memory only when memory pressure is detected. If I use MADV_DONTNEED instead of MADV_FREE, then the mapping will be shrunk immediately. Indeed, I observe that the RES reported by top drops by much more than with MADV_FREE, and pmap -x $(pgrep mariadbd) reports exactly the expected innodb_buffer_pool_size=10g:

            Address           Kbytes     RSS   Dirty Mode  Mapping
            00007f729a800000 52443136 10485760 8321024 rw---   [ anon ]
            

            I will retest large_pages=1 with MADV_DONTNEED to determine the granularity of those allocations.
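
For reference, the difference can be reproduced outside the server with a minimal stand-alone sketch (not InnoDB code; the 1 GiB size and the getchar() pause are arbitrary choices for inspecting the process with pmap):

// Stand-alone sketch (not InnoDB code): fault in an anonymous mapping and
// then release half of it. With MADV_DONTNEED the RSS reported by top or
// pmap -x drops immediately; with MADV_FREE the pages linger until the
// kernel is under memory pressure.
#include <sys/mman.h>
#include <cstddef>
#include <cstring>
#include <cstdio>

int main()
{
  const size_t size = 1ULL << 30;                 // 1 GiB, an arbitrary test size
  void *p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
  if (p == MAP_FAILED) { perror("mmap"); return 1; }

  memset(p, 0x5a, size);                          // fault in every page; RSS grows to ~1 GiB

  // madvise(p, size / 2, MADV_FREE);             // deferred: RSS may stay unchanged
  if (madvise(p, size / 2, MADV_DONTNEED))        // immediate: RSS drops by ~512 MiB
    perror("madvise");

  getchar();                                      // pause here and inspect pmap -x on the process
  munmap(p, size);
  return 0;
}

Watching pmap -x while the program pauses should show the RSS dropping by about half with MADV_DONTNEED, while the commented-out MADV_FREE variant typically leaves the RSS unchanged until the kernel actually reclaims the pages.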

marko Marko Mäkelä added a comment

It turns out that when large_pages=1 is in use, the RES in top does not cover the innodb_buffer_pool_size=47g at all. During my test, it remained steady at around 5.5GiB. Also, pmap -x $(pgrep mariadbd) would only report the size of the virtual address range:

            Address           Kbytes     RSS   Dirty Mode  Mapping
            00007f1240000000 50331648       0       0 rw--- anon_hugepage (deleted)
            

The file /sys/kernel/mm/hugepages/hugepages-1048576kB/free_hugepages reported 0, even after I reduced the innodb_buffer_pool_size of the running process to 4GiB. MADV_DONTNEED does not appear to have any effect on huge page allocations, except maybe for some TLB pages. After the mariadbd process was shut down, all 48 huge pages that I had configured were reported as available.

            For the record, here is the test script that I used:

            #!/bin/bash
            : ${SRCTREE=/mariadb/10.11}
            : ${MDIR=/dev/shm/10.11}
            : ${TDIR=/dev/shm/sbtest}
            LD_LIBRARY_PATH="$MDIR/libmysql"
            MYSQL_SOCK=$TDIR/mysqld.sock
            MYSQL_USER=root
            #: ${INNODB=--innodb-log-file-size=5g --innodb-undo-tablespaces=3 --innodb-undo-log-truncate=ON}
            #: ${INNODB=--innodb-log-file-size=5g --innodb-data-home-dir=/dev/shm}
            : ${INNODB=--innodb-log-file-size=5g}
             
            SYSBENCH="sysbench oltp_update_non_index \
              --mysql-socket=$MYSQL_SOCK \
              --mysql-user=$MYSQL_USER \
              --mysql-db=test \
              --percentile=99 \
              --tables=40 \
              --table_size=1000000"
            rm -rf "$TDIR"
            cd $MDIR
            sh scripts/mariadb-install-db --user="$USERNAME" --srcdir="$SRCTREE" --builddir=. --datadir="$TDIR" --auth-root-authentication-method=normal $INNODB
            cd ../
            #numactl --cpunodebind 1 --localalloc \
            $MDIR/sql/mariadbd --no-defaults --gdb --innodb \
               --datadir="$TDIR" --socket=$MYSQL_SOCK \
               --large-pages=1 \
              $INNODB\
              --innodb_buffer_pool_size=47g \
              --innodb_buffer_pool_size_min=5g --innodb-buffer-pool-size-max=47g \
              --innodb_flush_log_at_trx_commit=0 \
              --innodb-fast-shutdown=0 \
              --max-connections=300 \
            \
              --aria-checkpoint-interval=0 > "$TDIR"/mysqld.err 2>&1 &
            timeo=600
            echo -n "waiting for server to come up "
            while [ $timeo -gt 0 ]
            do
              $MDIR/client/mariadb-admin -S $MYSQL_SOCK -u $MYSQL_USER -b -s ping && break
              echo -n "."
              timeo=$(($timeo - 1))
              sleep 1
            done
             
            if [ $timeo -eq 0 ]
            then
              echo " server not starting! Abort!"
  exit 1
            fi
             
            #numactl --cpunodebind 0 --localalloc \
            $SYSBENCH prepare --threads=40
             
            #numactl --cpunodebind 0 --localalloc \
            $SYSBENCH --rand-seed=42 --rand-type=uniform --max-requests=0 --time=2400 --report-interval=5 --threads=40 run
             
            #$SYSBENCH cleanup
            $MDIR/client/mariadb-admin -u $MYSQL_USER -S $MYSQL_SOCK shutdown
            

An interesting observation is that shrinking the buffer pool would very easily fail during the workload if the value reported by show status like 'innodb_history_list_length' was large. Sometimes set global innodb_purge_threads=32; would help; other times I had to attach a debugger to the sysbench process to pause the workload. Sometimes I also had to issue set global innodb_max_dirty_pages_pct=1; to improve the chances that shrinking the buffer pool to very small sizes would succeed. I have implemented some logic for aborting the shrinking if InnoDB seems to be running out of buffer pool. If I detached the debugger from sysbench after shrinking the buffer pool to something very small (such as 10MiB), it would report 0.0qps until I increased the buffer pool size again.

marko Marko Mäkelä added a comment

            I’m considering to replace the use of MADV_FREE with MADV_DONTNEED so that the resident set size of the mariadbd process would immediately reflect the change in innodb_buffer_pool_size. Because shrinking the InnoDB buffer pool can be an intrusive operation, invoking the more expensive madvise() variant should be acceptable. At least it should reduce confusion and related support requests.

            It turns out that IBM AIX only documents MADV_DONTNEED but not MADV_FREE. Apple macOS documents both, but the description says that MADV_FREE would free the memory immediately, while MADV_DONTNEED could defer it, which is the exact opposite of what the documentation for other systems (Linux, FreeBSD, NetBSD, OpenBSD, Dragonfly BSD) is saying. My understanding is that MADV_DONTNEED was first, and MADV_FREE was introduced later (for example, Linux 4.5, revised for swapless systems in 4.12) in order to allow a reduction of overhead in implementations of malloc(3) and free(3).
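
If the switch is made, one conceivable way to keep the call portable is a compile-time fallback along these lines (a sketch with a made-up function name, not the actual MariaDB source; per the above, macOS would deserve special handling because its MADV_DONTNEED/MADV_FREE semantics are reversed):

// Hypothetical portability shim (made-up name, not the actual MariaDB code):
// prefer MADV_DONTNEED, which releases the pages immediately on Linux and the
// BSDs, and fall back to MADV_FREE only where MADV_DONTNEED is not available.
#include <sys/mman.h>
#include <cstddef>

static int buf_release_memory(void *ptr, size_t size)
{
#if defined MADV_DONTNEED
  return madvise(ptr, size, MADV_DONTNEED);
#elif defined MADV_FREE
  return madvise(ptr, size, MADV_FREE);
#else
  (void) ptr; (void) size;
  return 0;                      // no madvise(); the pages simply remain resident
#endif
}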

marko Marko Mäkelä added a comment
            debarun Debarun Banerjee made changes -
            Assignee Debarun Banerjee [ JIRAUSER54513 ] Marko Mäkelä [ marko ]
            Status In Review [ 10002 ] Stalled [ 10000 ]
            marko Marko Mäkelä made changes -

debarun, thank you for your thorough review. I clarified some things and fixed others. You reproduced two bugs when some of the innodb_buffer_pool_size related parameters are not a multiple of the allocation extent size (8 MiB or 2 MiB). I also revised the maximum buffer pool size to be 16EiB-8MiB or 4GiB-2MiB, to simplify the innodb_init_params() logic for adjusting the current, minimum, and maximum values of the parameters with respect to each other.
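
For illustration only, the adjustment could look roughly like the following (hypothetical helper names; the real logic lives in innodb_init_params() and differs in detail):

// Hypothetical illustration of the parameter adjustment (names are made up):
// round every size down to a multiple of the allocation extent and keep
// min <= current <= max.
#include <cstdint>
#include <algorithm>

static constexpr uint64_t extent = 8 << 20;          // assuming the 64-bit extent of 8 MiB

static uint64_t round_to_extent(uint64_t size)
{
  return std::max(size & ~(extent - 1), extent);      // never below one extent
}

static void adjust_pool_sizes(uint64_t &size, uint64_t &size_min, uint64_t &size_max)
{
  size_max = round_to_extent(size_max);
  size     = std::min(round_to_extent(size), size_max);
  size_min = std::min(round_to_extent(size_min), size);
}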

marko Marko Mäkelä added a comment
            marko Marko Mäkelä made changes -
            Assignee Marko Mäkelä [ marko ] Debarun Banerjee [ JIRAUSER54513 ]
            Status Stalled [ 10000 ] In Review [ 10002 ]

            Some intermediate result of RQG testing on
            origin/10.11-MDEV-29445 4afd83b99d0a161d698f234427f9dbb2a670ff2f 2025-02-28T17:05:09+02:00
             
            # 2025-03-03T07:20:04 [827446] | mariadbd: /data/Server/10.11-MDEV-29445F/storage/innobase/handler/ha_innodb.cc:14921: int ha_innobase::info_low(uint, bool): Assertion `ib_table->stat_initialized()' failed.
            (rr) bt
            #0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
            #1  __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
            #2  __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
            #3  0x00007b987c24526e in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
            #4  0x00007b987c2288ff in __GI_abort () at ./stdlib/abort.c:79
            #5  0x00007b987c22881b in __assert_fail_base (fmt=0x7b987c3d01e8 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x561de3d83aaf "ib_table->stat_initialized()", 
                file=file@entry=0x561de3c82618 "/data/Server/10.11-MDEV-29445F/storage/innobase/handler/ha_innodb.cc", line=line@entry=14921, function=function@entry=0x561de3c87d20 "int ha_innobase::info_low(uint, bool)")
                at ./assert/assert.c:94
            #6  0x00007b987c23b507 in __assert_fail (assertion=0x561de3d83aaf "ib_table->stat_initialized()", file=0x561de3c82618 "/data/Server/10.11-MDEV-29445F/storage/innobase/handler/ha_innodb.cc", line=14921, 
                function=0x561de3c87d20 "int ha_innobase::info_low(uint, bool)") at ./assert/assert.c:103
            #7  0x0000561de37e57f9 in ha_innobase::info_low (this=0x7b98580dd618, flag=18, is_analyze=<optimized out>, is_analyze@entry=false) at /data/Server/10.11-MDEV-29445F/storage/innobase/handler/ha_innodb.cc:14921
            #8  0x0000561de37e5d38 in ha_innobase::info (this=<optimized out>, flag=<optimized out>) at /data/Server/10.11-MDEV-29445F/storage/innobase/handler/ha_innodb.cc:15199
            #9  0x0000561de3439bbd in TABLE_LIST::fetch_number_of_rows (this=this@entry=0x7b9858015b08) at /data/Server/10.11-MDEV-29445F/sql/table.cc:9955
            #10 0x0000561de33b64c7 in make_join_statistics (join=join@entry=0x7b9858016fc0, tables_list=..., keyuse_array=keyuse_array@entry=0x7b9858017318) at /data/Server/10.11-MDEV-29445F/sql/sql_select.cc:5499
            #11 0x0000561de33b916d in JOIN::optimize_inner (this=this@entry=0x7b9858016fc0) at /data/Server/10.11-MDEV-29445F/sql/sql_select.cc:2643
            #12 0x0000561de33b93fb in JOIN::optimize (this=this@entry=0x7b9858016fc0) at /data/Server/10.11-MDEV-29445F/sql/sql_select.cc:1954
            #13 0x0000561de33b94dd in mysql_select (thd=thd@entry=0x7b9858002568, tables=0x7b9858015b08, fields=..., conds=0x7b9858016400, og_num=0, order=0x0, group=0x0, having=0x0, proc_param=0x0, 
                select_options=<optimized out>, result=0x7b9858016f98, unit=0x7b9858006828, select_lex=0x7b9858015018) at /data/Server/10.11-MDEV-29445F/sql/sql_select.cc:5218
            #14 0x0000561de33b98a8 in handle_select (thd=thd@entry=0x7b9858002568, lex=lex@entry=0x7b9858006750, result=result@entry=0x7b9858016f98, setup_tables_done_option=setup_tables_done_option@entry=0)
                at /data/Server/10.11-MDEV-29445F/sql/sql_select.cc:600
            #15 0x0000561de333d5f1 in execute_sqlcom_select (thd=thd@entry=0x7b9858002568, all_tables=0x7b9858015b08) at /data/Server/10.11-MDEV-29445F/sql/sql_parse.cc:6426
            #16 0x0000561de3346d1b in mysql_execute_command (thd=thd@entry=0x7b9858002568, is_called_from_prepared_stmt=is_called_from_prepared_stmt@entry=false) at /data/Server/10.11-MDEV-29445F/sql/sql_parse.cc:4012
            #17 0x0000561de334d131 in mysql_parse (thd=thd@entry=0x7b9858002568, rawbuf=<optimized out>, length=<optimized out>, parser_state=parser_state@entry=0x7b98795b3400)
                at /data/Server/10.11-MDEV-29445F/sql/sql_parse.cc:8188
            #18 0x0000561de334e79d in dispatch_command (command=command@entry=COM_QUERY, thd=thd@entry=0x7b9858002568, 
                packet=packet@entry=0x7b985800c8d9 "SELECT `col_int_nokey` % 10 AS `col_int_nokey`, `col_int_key` % 10 AS `col_int_key` FROM a WHERE `col_int_nokey` <= 6 /* E_R Thread4 QNO 1053 CON_ID 20 */ ", 
                packet_length=packet_length@entry=155, blocking=blocking@entry=true) at /data/Server/10.11-MDEV-29445F/sql/sql_parse.cc:1905
            #19 0x0000561de334fc3b in do_command (thd=thd@entry=0x7b9858002568, blocking=blocking@entry=true) at /data/Server/10.11-MDEV-29445F/sql/sql_parse.cc:1418
            #20 0x0000561de34719af in do_handle_one_connection (connect=<optimized out>, connect@entry=0x561de6c00898, put_in_cache=put_in_cache@entry=true) at /data/Server/10.11-MDEV-29445F/sql/sql_connect.cc:1386
            #21 0x0000561de3471bc0 in handle_one_connection (arg=0x561de6c00898) at /data/Server/10.11-MDEV-29445F/sql/sql_connect.cc:1298
            #22 0x00007b987c29ca94 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
            #23 0x00007b987c329a34 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:100
            (rr) 
            sdp:/data/results/1741001176/TB-2232$ _RR_TRACE_DIR=./1/rr rr replay --mark-stdio
             
            The test fiddles with partitioned tables and the server starts with innodb_undo_log_truncate=ON.
            

mleich Matthias Leich added a comment
            marko Marko Mäkelä made changes -

            marko I am done with the review. Please check my latest comments. I think the one around LRU flush is important to think through.

debarun Debarun Banerjee added a comment
            debarun Debarun Banerjee made changes -
            Assignee Debarun Banerjee [ JIRAUSER54513 ] Marko Mäkelä [ marko ]
            Status In Review [ 10002 ] Stalled [ 10000 ]

            debarun, thank you, very much appreciated. I was actually duplicating some logic of buf_pool_t::shrink() in buf_flush_LRU_list_batch(). I reverted some changes to the latter and made the former periodically release buf_pool.mutex in order to avoid starvation. I believe that its use of buf_pool.lru_itr should not conflict with buf_LRU_free_from_common_LRU_list(). Even if it did, the worst that could happen is that a buf_pool.LRU traversal is terminated prematurely and an outer loop will eventually handle it.
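
The pattern being described is roughly the following sketch with stand-in types and names (the real code operates on buf_pool.mutex and buf_pool.LRU inside buf_pool_t::shrink()):

// Sketch with stand-in types (not the actual server code): traverse the LRU
// list from the tail, but drop and reacquire the mutex periodically so that
// user threads and the page cleaner are not starved.
#include <mutex>

struct block_t { block_t *prev; /* ... */ };

static std::mutex pool_mutex;          // stand-in for buf_pool.mutex
static block_t *lru_tail;              // stand-in for the tail of buf_pool.LRU

static void shrink_some()
{
  std::unique_lock<std::mutex> lk(pool_mutex);
  unsigned processed = 0;
  for (block_t *b = lru_tail; b; )
  {
    block_t *prev = b->prev;
    // ... try to evict or relocate *b ...
    if (++processed % 64 == 0)
    {
      lk.unlock();                     // give waiters a chance to acquire the mutex
      lk.lock();
      prev = lru_tail;                 // the list may have changed; restart from the tail
    }
    b = prev;
  }
}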

marko Marko Mäkelä added a comment

I still need to agree with wlad regarding the interface for allocating virtual address space. Based on our discussion so far, I would change the default value of the new parameter innodb_buffer_pool_size_max to a ‘reasonably large’ value on Linux and Windows, instead of defaulting to the specified innodb_buffer_pool_size. In this way, the buffer pool could still be extended at runtime, just as it could be before.

            On other operating systems such as FreeBSD, OpenBSD, NetBSD, IBM AIX, there does not appear to be a way to overcommit the virtual address space allocation. Hence, on those systems, unless you specify innodb_buffer_pool_size_max on startup, you would only be able to shrink innodb_buffer_pool_size from its initially specified value.
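
The reserve-then-commit idea behind innodb_buffer_pool_size_max can be sketched as a stand-alone Linux program (not the actual buf0buf.cc code; the sizes and the use of PROT_NONE plus mprotect() are illustrative assumptions):

// Sketch: reserve address space for innodb_buffer_pool_size_max once, and
// make only the current innodb_buffer_pool_size accessible. Growing the pool
// later is then just another mprotect() call on the already reserved range.
#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

int main()
{
  const size_t size_max = 8ULL << 30;    // e.g. innodb_buffer_pool_size_max=8g
  const size_t size     = 2ULL << 30;    // e.g. innodb_buffer_pool_size=2g

  void *pool = mmap(nullptr, size_max, PROT_NONE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
  if (pool == MAP_FAILED) { perror("mmap"); return 1; }

  if (mprotect(pool, size, PROT_READ | PROT_WRITE))   // commit the current size
  { perror("mprotect"); return 1; }

  // ... the buffer pool would live in [pool, pool + size) ...

  munmap(pool, size_max);
  return 0;
}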

marko Marko Mäkelä added a comment
            marko Marko Mäkelä made changes -
            Assignee Marko Mäkelä [ marko ] Vladislav Vaintroub [ wlad ]
            Status Stalled [ 10000 ] In Review [ 10002 ]

            mmap(MAP_NORESERVE) on Linux would still allocate MMU page tables. What would be a reasonable default value of innodb_buffer_pool_size_max? Someone could say 64 MiB, someone else 64 GiB or 64 TiB (which is a quarter of the 48-bit address space limit of many contemporary 64-bit ISA implementations). It turns out that if we allocated virtual address space for 64 TiB, we would frequently run out of memory on some workers of our CI system. The MMU page tables to cover 64 TiB of virtual address space are large, possibly in the gigabyte range when using 4096-byte pages.

So, I would revert to limiting innodb_buffer_pool_size_max to the start-up value of innodb_buffer_pool_size. If someone anticipates a need to increase innodb_buffer_pool_size while the server is running, they can explicitly specify a larger innodb_buffer_pool_size_max in the server configuration.

marko Marko Mäkelä added a comment

marko I just think you can't reserve "large pages" on Linux, and that MAP_NORESERVE does not work for them. So, large-page allocations are not resizable, and one should not attempt to reserve address space for them.

wlad Vladislav Vaintroub added a comment

HugeTLB pages are unavailable by default on Linux. You have to explicitly reserve physical memory for them in order to be able to use them:

            echo 4|sudo tee /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
            echo 1024|sudo tee /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
            

            Based on my testing, madvise(MADV_DONTNEED) cannot shrink hugepage mappings and release such mappings to the operating system.

            Maybe Microsoft Windows can defer the allocation of page mappings until a TLB miss, but Linux appears to populate the page mappings immediately.

marko Marko Mäkelä added a comment
wlad Vladislav Vaintroub added a comment - edited

So, if I understand this correctly, one can't "reserve address space" on Linux for large pages, but only allocate them immediately; among other things, MAP_NORESERVE is a no-op for them.

There is, however, a mention of madvise(MADV_HUGEPAGE), and this sounds like it could be used. It is less explicit, but if the internet and the Linux documentation do not lie, it sometimes works on some Linux versions, and thus perhaps could be used to "commit" memory.

            marko Marko Mäkelä made changes -
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            Fix Version/s 12.1 [ 29992 ]
            Fix Version/s 12.0 [ 29945 ]
            wlad Vladislav Vaintroub made changes -
            Assignee Vladislav Vaintroub [ wlad ] Marko Mäkelä [ marko ]
            marko Marko Mäkelä made changes -
            Description copied from [MDEV-25341|https://jira.mariadb.org/browse/MDEV-25341?focusedCommentId=232177&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-232177]:

            * The buf_pool.free as well as the buffer pool blocks that are backing store for the AHI or lock_sys could be doubly linked with each other via bytes allocated within the page frame itself. We do not need a dummy buf_page_t for such blocks.

            * We could allocate a contiguous virtual address range for the maximum supported size of buffer pool, and let the operating system physically allocate a subset of these addresses. The complicated logic of having multiple buffer pool chunks can be removed. On 32-bit architectures, the maximum size could be about 2GiB. On 64-bit architectures, the virtual address bus often is 48 bits (around 256 TiB). Perhaps we could shift some burden to the user and introduce a startup parameter innodb_buffer_pool_size_max.
            The InnoDB buffer pool had been allocated in multiple chunks, because {{SET GLOBAL innodb_buffer_pool_size}} would extend the buffer pool in chunks. This would lead to many limitations, such as the inability to shrink the buffer pool below {{innodb_buffer_pool_chunk_size}}.

            It would be cleaner to allocate a contiguous virtual address range for a maximum supported size of buffer pool (a new parameter {{innodb_buffer_pool_size_max}}, which defaults to the initially specified {{innodb_buffer_pool_size}}) and to allow the {{innodb_buffer_pool_size}} to be changed in increments of 1 megabyte.

            , and let the operating system physically allocate a subset of these addresses. The complicated logic of having multiple buffer pool chunks can be removed. On 32-bit architectures, the maximum size could be about 2GiB. On 64-bit architectures, the virtual address bus often is 48 bits (around 256 TiB). Perhaps we could shift some burden to the user and introduce a startup parameter innodb_buffer_pool_size_max.
            marko Marko Mäkelä made changes -
            Description The InnoDB buffer pool had been allocated in multiple chunks, because {{SET GLOBAL innodb_buffer_pool_size}} would extend the buffer pool in chunks. This would lead to many limitations, such as the inability to shrink the buffer pool below {{innodb_buffer_pool_chunk_size}}.

            It would be cleaner to allocate a contiguous virtual address range for a maximum supported size of buffer pool (a new parameter {{innodb_buffer_pool_size_max}}, which defaults to the initially specified {{innodb_buffer_pool_size}}) and to allow the {{innodb_buffer_pool_size}} to be changed in increments of 1 megabyte.

            , and let the operating system physically allocate a subset of these addresses. The complicated logic of having multiple buffer pool chunks can be removed. On 32-bit architectures, the maximum size could be about 2GiB. On 64-bit architectures, the virtual address bus often is 48 bits (around 256 TiB). Perhaps we could shift some burden to the user and introduce a startup parameter innodb_buffer_pool_size_max.
            The InnoDB buffer pool had been allocated in multiple chunks, because {{SET GLOBAL innodb_buffer_pool_size}} would extend the buffer pool in chunks. This would lead to many limitations, such as the inability to shrink the buffer pool below {{innodb_buffer_pool_chunk_size}}.

            It would be cleaner to:
            * allocate a contiguous virtual address range for a maximum supported size of buffer pool (a new parameter {{innodb_buffer_pool_size_max}}, which defaults to the initially specified {{innodb_buffer_pool_size}})
            * allow the {{innodb_buffer_pool_size}} to be changed in increments of 1 megabyte
            * define a fixed mapping between the virtual memory addresses of buffer page descriptors page frames, to fix bugs like MDEV-34677 and MDEV-35485
            * refactor the shrinking of the buffer pool to provide more meaningful progress output and to avoid hangs

            The complicated logic of having multiple buffer pool chunks can be removed, and the parameter {{innodb_buffer_pool_chunk_size}} will be deprecated and ignored.

            madvise(MADV_HUGEPAGE) is something for enabling Transparent Huge Pages (THP). When the large_pages interface is being used, we are allocating explicit huge pages with mmap(). I think that if we were to experiment with madvise(MADV_HUGEPAGE), it should be tied to a configuration parameter that is disabled by default.
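
The distinction can be illustrated with a stand-alone Linux sketch (not MariaDB code; whether either request succeeds depends entirely on the system configuration):

// Explicit HugeTLB pages are requested at mmap() time with MAP_HUGETLB (this
// is what large_pages=1 uses, and it fails unless pages were reserved via
// /sys/kernel/mm/hugepages), whereas MADV_HUGEPAGE merely hints that an
// ordinary mapping may be backed by transparent huge pages.
#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

int main()
{
  const size_t size = 1ULL << 30;

  void *explicit_hp = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
  if (explicit_hp == MAP_FAILED) perror("mmap(MAP_HUGETLB)");

  void *thp = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (thp != MAP_FAILED && madvise(thp, size, MADV_HUGEPAGE))
    perror("madvise(MADV_HUGEPAGE)");

  if (explicit_hp != MAP_FAILED) munmap(explicit_hp, size);
  if (thp != MAP_FAILED) munmap(thp, size);
  return 0;
}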

marko Marko Mäkelä added a comment
            marko Marko Mäkelä made changes -
            Fix Version/s 10.11 [ 27614 ]
            Fix Version/s 11.4 [ 29301 ]
            Fix Version/s 11.8 [ 29921 ]
            Fix Version/s 12.1 [ 29992 ]
            marko Marko Mäkelä made changes -
            Status In Review [ 10002 ] In Testing [ 10301 ]
            marko Marko Mäkelä made changes -
            Assignee Marko Mäkelä [ marko ] Matthias Leich [ mleich ]
            marko Marko Mäkelä made changes -
            issue.field.resolutiondate 2025-03-26 15:45:33.0 2025-03-26 15:45:32.995
            marko Marko Mäkelä made changes -
            Fix Version/s 10.11.12 [ 29998 ]
            Fix Version/s 11.4.6 [ 29999 ]
            Fix Version/s 11.8.2 [ 30001 ]
            Fix Version/s 10.11 [ 27614 ]
            Fix Version/s 11.4 [ 29301 ]
            Fix Version/s 11.8 [ 29921 ]
            Assignee Matthias Leich [ mleich ] Marko Mäkelä [ marko ]
            Resolution Fixed [ 1 ]
            Status In Testing [ 10301 ] Closed [ 6 ]

The changes made many crash recovery tests hang in a Valgrind environment. I was able to reproduce the problem locally. I applied a fixup that reduces the problem at least to some extent. The underlying issue is that the default Valgrind Memcheck tool uses an unfair scheduler. If a thread is waiting for other threads to do something, thread context switches must be enforced by suitable system calls.
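
The general shape of the problem and of such a fixup can be sketched with hypothetical names (this is not the actual patch): a pure spin-wait never enters the kernel, so Valgrind's serialized scheduler may keep running the waiting thread and never schedule the thread that would make progress; an occasional system call such as sched_yield() forces a context switch.

// Sketch with hypothetical names (not the actual fixup).
#include <atomic>
#include <sched.h>

static std::atomic<bool> work_done{false};   // hypothetical flag set by another thread

static void wait_for_other_threads()
{
  unsigned spins = 0;
  while (!work_done.load(std::memory_order_acquire))
    if (++spins % 100 == 0)
      sched_yield();                         // enter the kernel so that Valgrind reschedules
}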

marko Marko Mäkelä added a comment
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -

            People

              marko Marko Mäkelä
              danblack Daniel Black
Votes: 1
Watchers: 14
