MariaDB Server / MDEV-29445

reorganise innodb buffer pool (and remove buffer pool chunks)

Details

    Description

      The InnoDB buffer pool had been allocated in multiple chunks, because SET GLOBAL innodb_buffer_pool_size would extend the buffer pool in chunks. This would lead to many limitations, such as the inability to shrink the buffer pool below innodb_buffer_pool_chunk_size.

      It would be cleaner to:

      • allocate a contiguous virtual address range for a maximum supported size of buffer pool (a new parameter innodb_buffer_pool_size_max, which defaults to the initially specified innodb_buffer_pool_size)
      • allow the innodb_buffer_pool_size to be changed in increments of 1 megabyte
      • define a fixed mapping between the virtual memory addresses of buffer page descriptors and page frames, to fix bugs like MDEV-34677 and MDEV-35485
      • refactor the shrinking of the buffer pool to provide more meaningful progress output and to avoid hangs

      The complicated logic of having multiple buffer pool chunks can be removed, and the parameter innodb_buffer_pool_chunk_size will be deprecated and ignored.
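
      As an illustration of the fixed mapping, below is a minimal, hypothetical sketch (not the actual InnoDB code; the extent size, descriptor size and descriptor-frame count are assumptions for the example) of how a block descriptor could be located from a page frame address when every extent starts with its own descriptor array:

      #include <cstddef>

      // Assumed example values; the real ones depend on the build and settings.
      constexpr size_t EXTENT_SIZE = 8 << 20;  // bytes covered by one extent
      constexpr size_t PAGE_SIZE   = 16 << 10; // innodb_page_size
      constexpr size_t DESC_SIZE   = 152;      // sizeof(buf_block_t)
      constexpr size_t DESC_FRAMES = 5;        // frames reserved for descriptors

      struct block_descriptor;                 // stand-in for buf_block_t

      // Because descriptors and page frames share the same extent, the mapping
      // is plain address arithmetic; no per-chunk lookup table is needed.
      inline block_descriptor *descriptor_for(unsigned char *pool_base, void *frame)
      {
        const size_t offset   = static_cast<unsigned char*>(frame) - pool_base;
        const size_t extent   = offset / EXTENT_SIZE * EXTENT_SIZE;
        const size_t frame_no = (offset - extent) / PAGE_SIZE - DESC_FRAMES;
        return reinterpret_cast<block_descriptor *>
          (pool_base + extent + frame_no * DESC_SIZE);
      }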

    Attachments

    Issue Links

    Activity

            danblack Daniel Black created issue -
            danblack Daniel Black made changes -
            Field Original Value New Value
            danblack Daniel Black made changes -
            serg Sergei Golubchik made changes -
            Priority Major [ 3 ] Minor [ 4 ]

            When testing crash recovery with a 30GiB buffer pool in MDEV-29911, it was divided into 64 chunks of 480MiB each. I noticed that recv_sys_t::free() (inlined in recv_sys_t::recover_low()) is consuming a significant amount of CPU time. Having a single buffer pool chunk should make that code much faster.

            marko Marko Mäkelä added a comment -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            Fix Version/s 11.1 [ 28549 ]
            Assignee Marko Mäkelä [ marko ]
            Labels energy energy performance
            Priority Minor [ 4 ] Major [ 3 ]
            julien.fritsch Julien Fritsch made changes -
            Priority Major [ 3 ] Critical [ 2 ]
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            ralf.gebhardt Ralf Gebhardt made changes -
            Fix Version/s 11.2 [ 28603 ]
            Fix Version/s 11.1 [ 28549 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            ralf.gebhardt Ralf Gebhardt made changes -
            Fix Version/s 11.3 [ 28565 ]
            Fix Version/s 11.2 [ 28603 ]
            marko Marko Mäkelä made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            Attachment MDEV-29445-sizes.gnumeric [ 71618 ]

            I think that the buffer pool needs to be divided into logical chunks, with an array of buf_block_t being allocated at the start of each chunk, to cover the uncompressed pages in the rest of the chunk.

            To best achieve this, it would be beneficial to shrink sizeof(buf_block_t) to 128 bytes or less. Currently, on a CMAKE_BUILD_TYPE=RelWithDebInfo build of 10.6 or later (after MDEV-27058), we have sizeof(buf_page_t)=112 and sizeof(buf_block_t)=160. By replacing the data member buf_page_t::frame with a member function we could shrink each descriptor by 8 more bytes. The buf_block_t comprises the following:

            struct buf_block_t {
                buf_page_t page; // page descriptor
                ut_list_node<buf_block_t> unzip_LRU; // 2*sizeof(void*), related to ROW_FORMAT=COMPRESSED
                ib_uint64_t modify_clock; // 8 bytes
                volatile uint16_t n_bytes; // 2 bytes
                volatile uint16_t n_fields; // 2 bytes
                uint16_t n_hash_helps; // 2 bytes
                volatile bool left_side; // 1 byte + 1 byte alignment loss
                unsigned int curr_n_fields : 10;
                unsigned int curr_n_bytes : 15;
                unsigned int curr_left_side : 1; // 32 bytes (including alignment loss)
                dict_index_t *index; // 8 bytes
            };
            

            All fields after unzip_LRU are related to the adaptive hash index. Their total size is 32 bytes. The adaptive hash index was disabled by default in MDEV-20487. If we introduce a pointer, say, buf_block_t::ahi, which points to a structure that contains the adaptive hash index information, and at the same time remove the buf_page_t::frame pointer, we would shrink sizeof(buf_block_t) to exactly 128 or 2⁷ bytes. This should keep the arithmetic simple.
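
            As a rough sketch of the proposed layout (hypothetical stand-in types with the sizes quoted above, not the real definitions), the adaptive-hash-index state would move behind a single pointer and the frame address would no longer be stored:

            #include <cstdint>

            // Stand-ins sized per the text above; purely illustrative.
            struct buf_page_t { unsigned char opaque[104]; }; // after dropping ::frame
            template<class T> struct ut_list_node { T *prev, *next; };
            struct dict_index_t;

            // All adaptive hash index state behind one pointer.
            struct btr_search_block_t
            {
              uint64_t modify_clock;
              volatile uint16_t n_bytes;
              volatile uint16_t n_fields;
              uint16_t n_hash_helps;
              volatile bool left_side;
              unsigned curr_n_fields:10, curr_n_bytes:15, curr_left_side:1;
              dict_index_t *index;
            };

            struct buf_block_t
            {
              buf_page_t page;                     // 104 bytes
              ut_list_node<buf_block_t> unzip_LRU; // 16 bytes
              btr_search_block_t *ahi;             // 8 bytes; nullptr while AHI is off
            };                                     // 104+16+8 = 128 = 2**7 bytes

            static_assert(sizeof(buf_block_t) == 128, "descriptor should be 2**7 bytes");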

            Let us consider a few sizes, assuming sizeof(buf_block_t)=128. I calculated some more sizes in MDEV-29445-sizes.gnumeric:

            hugepage size/KiB | innodb_page_size/KiB | descriptor pages | data pages | wasted space/bytes | waste %
            2048 | 4 | 16 | 512-16 | 4096*16-(512-16)*128=2048 | 0.0977%
            2048 | 16 | 1 | 128-1 | 16384*1-(128-1)*128=128 | 0.0061%
            2048 | 64 | 1 | 32-1 | 65536*1-(32-1)*128=61568 | 2.936%
            1048576 | 4 | 7944 | 262144-7944 | 4096*7944-(262144-7944)*128=1024 | 0.0000954%
            1048576 | 16 | 509 | 65536-509 | 16384*509-(65536-509)*128=16000 | 0.00149%
            1048576 | 64 | 32 | 16384-32 | 65536*32-(16384-32)*128=4096 | 0.000381%

            When the largest hugepage size that is supported by the MMU is small, it might make sense to retain the parameter innodb_buffer_pool_chunk_size and allow it to be an integer power-of-2 multiple of the hugepage size.

            marko Marko Mäkelä added a comment -

            The field buf_block_t::modify_clock is not related to the adaptive hash index after all. Its purpose is to identify that an optimistic btr_pcur_t::restore_pos() is not possible. The counter will be incremented whenever a record is deleted from a page, or a page is freed or evicted from the buffer pool. This would cause a comparison to btr_pcur_t::modify_clock to fail. We might add the field btr_pcur_t::page_id (to compare to what buf_page_t::id() would return when we attempt optimistic restoration) and simply store the FIL_PAGE_LSN contents of the page frame in btr_pcur_t. Replacing modify_clock with FIL_PAGE_LSN and page_id_t would make the optimistic btr_pcur_t::restore_pos() less likely, because the FIL_PAGE_LSN in an index page would be updated on any insert or update, not only when records are being deleted or pages being evicted or freed.
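
            A hedged sketch of that alternative check (simplified, hypothetical types and field names; only the comparison itself is shown): the cursor would remember the page identifier and the FIL_PAGE_LSN of the frame, and optimistic restoration would proceed only when both still match:

            #include <cstdint>

            // Simplified stand-ins, not the actual InnoDB definitions.
            struct page_id_t
            {
              uint32_t space, page_no;
              bool operator==(const page_id_t &o) const
              { return space == o.space && page_no == o.page_no; }
            };

            struct latched_page_view   // what the cursor sees after re-latching
            {
              page_id_t id;            // what buf_page_t::id() would return
              uint64_t  page_lsn;      // FIL_PAGE_LSN read from the page frame
            };

            struct saved_cursor_pos    // stored instead of modify_clock
            {
              page_id_t saved_id;
              uint64_t  saved_lsn;

              // Optimistic restore only if the block still holds the same page
              // and the page has not been written to since the position was saved.
              bool can_restore(const latched_page_view &page) const
              { return page.id == saved_id && page.page_lsn == saved_lsn; }
            };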

            I computed a table for three block descriptor sizes:

            sizeof(buf_block_t) | scenario
            152 | removing buf_page_t::frame only
            136 | also moving the adaptive hash index behind a pointer
            128 | also removing modify_clock

            hugepage/KiB | innodb_page_size/KiB | pages/hugepage | 152-byte | 136-byte | 128-byte
            2048 | 4 | 512 | 19 | 17 | 16
            2048 | 8 | 256 | 5 | 5 | 4
            2048 | 16 | 128 | 2 | 2 | 1
            2048 | 32 | 64 | 1 | 1 | 1
            2048 | 64 | 32 | 1 | 1 | 1
            1048576 | 4 | 262144 | 9380 | 8425 | 7944
            1048576 | 8 | 131072 | 2388 | 2141 | 2017
            1048576 | 16 | 65536 | 603 | 540 | 509
            1048576 | 32 | 32768 | 152 | 136 | 128
            1048576 | 64 | 16384 | 38 | 34 | 32

            The biggest overhead difference above occurs with 2MiB hugepages and the default innodb_page_size=16k: We would use 1/128 of the memory for 128-byte block descriptors, or 1/64 when using larger block descriptors.

            I think that we can live with the current sizeof(buf_block_t), only removing buf_page_t::frame.

            marko Marko Mäkelä added a comment -

            I created a constexpr function that should allow us to calculate the mappings between page frame addresses and block descriptors at compilation time, with the innodb_page_size being the only run-time parameter. We might generate a small number of sets of mapping functions for each supported innodb_page_size (5 values) and innodb_buffer_pool_chunk_size (limited to a small number of sizes) and set function pointers based on the chosen start-up parameters.

            In C++11, a constexpr function body must consist of a single return statement. Both Clang and GCC limit the recursion depth to 512 by default. The following naïve attempt requires 351 recursion steps, and it works in all compilers that I tried: GCC 4.8.5 or later; clang 3.1 or later; ICC 16.0.3 or later; not too old MSVC:

            // Given the page frames per chunk (pages), the block descriptor size
            // (bs) and the page size (ps), reduce the descriptor frame count b
            // while more than one full page frame of the reserved space would
            // remain unused by the descriptors of the (pages - b) data frames.
            static constexpr size_t fix(size_t pages, size_t bs, size_t ps, size_t b)
            {
              return ((ps * b - (pages - b) * bs) > ps)
                ? fix(pages, bs, ps, b - 1)
                : b;
            }

            // Start from the upper bound ceil(pages * bs / ps), that is, enough
            // page frames to hold a descriptor for every frame in the chunk.
            static constexpr size_t b(size_t pages, size_t bs, size_t ps)
            {
              return fix(pages, bs, ps, (pages * bs + (ps - 1)) / ps);
            }

            // Descriptor page frames needed per chunk of hugepagesize KiB.
            static constexpr size_t bpp(size_t hugepagesize, size_t bs, size_t ps)
            {
              return b(hugepagesize * 1024 / ps, bs, ps);
            }

            constexpr size_t big = 152; // sizeof(buf_block_t)

            // Descriptor page frames per 2MiB and per 1GiB hugepage for each
            // supported innodb_page_size.
            constexpr static size_t sizes[] = {
              bpp(2048, big, 4096),
              bpp(2048, big, 8192),
              bpp(2048, big, 16384),
              bpp(2048, big, 32768),
              bpp(2048, big, 65536),
              bpp(1048576, big, 4096),
              bpp(1048576, big, 8192),
              bpp(1048576, big, 16384),
              bpp(1048576, big, 32768),
              bpp(1048576, big, 65536)
            };
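
            As a usage sketch, these helpers can be spot-checked at compile time against a few of the values quoted in this ticket (only a sanity check, nothing more):

            static_assert(bpp(2048, big, 4096) == 19,
                          "19 descriptor frames per 2MiB at innodb_page_size=4k");
            static_assert(bpp(2048, big, 16384) == 2,
                          "2 descriptor frames per 2MiB at innodb_page_size=16k");
            static_assert(bpp(1048576, big, 16384) == 603,
                          "603 descriptor frames per 1GiB at innodb_page_size=16k");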
            

            marko Marko Mäkelä added a comment -
            marko Marko Mäkelä made changes -

            Implementing MDEV-31976 would shrink buf_block_t by 2 pointers. If we also remove the redundant buf_page_t::frame pointer, we would end up with sizeof(buf_block_t) being 136 bytes on 64-bit systems or 100 bytes on 32-bit systems. The sizeof(buf_page_t) is unaffected by that: 80 bytes on 32-bit and 112 bytes on 64-bit.

            One more thing that we can do is to replace all pointers in buf_block_t or buf_page_t with 32-bit integers that count page-frame-sized slots from the start of the contiguous buffer pool memory (that is, the byte offset divided by innodb_page_size). Null pointers can trivially be mapped to the value 0, because at the start of the memory we will always have a buf_block_t and never a valid page frame. The smallest valid nonzero value for the integer would be 2048k/16k=128, which would be equivalent to the buf_block_t starting at the first address of buffer pool memory. There is only one pointer that we cannot replace in this way: buf_block_t::index. That is, sizeof(buf_block_t) would have to be 104 (0x68) bytes on 64-bit systems.

            The pointer page_zip_des_t::data would require up to 4 extra bits (ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=1 blocks within an innodb_page_size=16k frame). We have exactly the required amount of 2+2 spare bits available in m_end and n_blobs.

            With the minimum innodb_page_size=4k (2¹² bytes), the 32-bit “pointers” would allow innodb_buffer_pool_size to reach up to 2¹²·2³²=2⁴⁴=16TiB. At the maximum innodb_page_size=64k we would reach 2⁴⁸=256TiB, which is the maximum virtual address space size of contemporary 64-bit processors.
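
            A minimal sketch of that encoding (hypothetical helper names and assumed globals; not actual InnoDB code): the “pointer” becomes the frame number counted from the start of the contiguous buffer pool mapping, and 0 serves as the null value because offset 0 always holds a buf_block_t, never a page frame:

            #include <cstddef>
            #include <cstdint>

            // Assumed globals: the contiguous buffer pool mapping and the page size.
            extern unsigned char *buf_pool_base; // start of the reserved address range
            extern size_t         srv_page_size; // innodb_page_size

            // Encode a page frame address as a 32-bit frame number (0 = null).
            inline uint32_t frame_to_ref(const void *frame)
            {
              return frame
                ? uint32_t((static_cast<const unsigned char*>(frame) - buf_pool_base)
                           / srv_page_size)
                : 0;
            }

            // Decode a 32-bit frame number back into an address.
            inline unsigned char *ref_to_frame(uint32_t ref)
            {
              return ref ? buf_pool_base + size_t(ref) * srv_page_size : nullptr;
            }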

            Here is an updated table:

            hugepage/KiB | innodb_page_size/KiB | pages/hugepage | 152-byte | 136-byte | 100-byte | 104-byte
            2048 | 4 | 512 | 19 | 17 | 13 | 13
            2048 | 8 | 256 | 5 | 5 | 4 | 4
            2048 | 16 | 128 | 2 | 2 | 1 | 1
            2048 | 32 | 64 | 1 | 1 | 1 | 1
            2048 | 64 | 32 | 1 | 1 | 1 | 1
            1048576 | 4 | 262144 | 9380 | 8425 | 6248 | 6492
            1048576 | 8 | 131072 | 2388 | 2141 | 1581 | 1644
            1048576 | 16 | 65536 | 603 | 540 | 398 | 414
            1048576 | 32 | 32768 | 152 | 136 | 100 | 104
            1048576 | 64 | 16384 | 38 | 34 | 25 | 26

            The worst-case overhead of allocating block descriptors (at innodb_page_size=4k) would be 13/512=2.54% or 6492/262144=2.48%. With the default innodb_page_size=16k the overhead is 1/128=0.78% or 414/65536=0.63%.

            marko Marko Mäkelä added a comment -

            The current minimum value of innodb_buffer_pool_size is 2MiB, which coincidentally is equal to the smaller IA-32 or AMD64 hugepage size. In each 2MiB segment, we would allocate the first 13 page frames (52 kilobytes) for block descriptors. When using innodb_buffer_pool_size=3m we would reserve a total of 26*4KiB=104 KiB for page descriptors, wasting 6½*4KiB for the last 1MiB for which we are not going to allocate page frames.

            When innodb_buffer_pool_chunk_size=1GiB, at every 1GiB we would use 6492*4KiB=25 MiB for innodb_page_size=4k page descriptors, or 6.5 MiB for innodb_page_size=16k, or 1.63 MiB for innodb_page_size=64k.

            marko Marko Mäkelä added a comment -

            I realized that the trick of replacing 64-bit pointers with 32-bit integers will not work, because the buf_page_t descriptors of compressed-only ROW_FORMAT=COMPRESSED blocks in the buffer pool would be allocated by malloc(), outside the contiguous virtual address range that is associated with the buf_block_t descriptors of uncompressed pages as well as page frames. If we were to remove ROW_FORMAT=COMPRESSED support altogether (which we won’t; see MDEV-22367), then sizeof(buf_block_t) would be shrunk further to 88 bytes on 32-bit systems, and possibly 96 on 64-bit. By further removing the adaptive hash index we would come down to 72 bytes (on both 32-bit and 64-bit systems).

            Here is an updated table that includes these hypothetical cases:

            hugepage/KiB | innodb_page_size/KiB | pages/hugepage | 152-byte | 136-byte | 100-byte | 104-byte | 88-byte | 96-byte | 72-byte
            2048 | 4 | 512 | 19 | 17 | 13 | 13 | 11 | 12 | 9
            2048 | 8 | 256 | 5 | 5 | 4 | 4 | 3 | 3 | 3
            2048 | 16 | 128 | 2 | 2 | 1 | 1 | 1 | 1 | 1
            2048 | 32 | 64 | 1 | 1 | 1 | 1 | 1 | 1 | 1
            2048 | 64 | 32 | 1 | 1 | 1 | 1 | 1 | 1 | 1
            1048576 | 4 | 262144 | 9380 | 8425 | 6248 | 6492 | 5514 | 6004 | 4529
            1048576 | 8 | 131072 | 2388 | 2141 | 1581 | 1644 | 1394 | 1519 | 1142
            1048576 | 16 | 65536 | 603 | 540 | 398 | 414 | 351 | 382 | 287
            1048576 | 32 | 32768 | 152 | 136 | 100 | 104 | 88 | 96 | 72
            1048576 | 64 | 16384 | 38 | 34 | 25 | 26 | 22 | 24 | 18

            The worst-case overhead of allocating block descriptors (at innodb_page_size=4k) would be 9/512=1.76% (instead of 17/512=3.32%) or 4529/262144=1.73% (instead of 8425/262144=3.21%). With the default innodb_page_size=16k the overhead is 1/128=0.78% or 287/65536=0.44% (or 540/65536=0.83%). Nearly halving the size of the block descriptor from 136 to 72 bytes would roughly halve the memory overhead. For now, we can only shrink the block descriptor by 3 pointers (1 if we do not implement MDEV-31976).

            marko Marko Mäkelä added a comment -
            julien.fritsch Julien Fritsch made changes -
            marko Marko Mäkelä made changes -
            serg Sergei Golubchik made changes -
            Fix Version/s 11.4 [ 29301 ]
            Fix Version/s 11.3 [ 28565 ]
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            serg Sergei Golubchik made changes -
            Fix Version/s 11.5 [ 29506 ]
            Fix Version/s 11.4 [ 29301 ]
            julien.fritsch Julien Fritsch made changes -
            Status In Progress [ 3 ] Stalled [ 10000 ]
            marko Marko Mäkelä made changes -
            Status Stalled [ 10000 ] In Progress [ 3 ]

            I see that madvise(addr, length, MADV_FREE) is available on Linux, FreeBSD, OpenBSD, NetBSD, Dragonfly BSD and Solaris. On IBM AIX, all forms of madvise() are ignored. On macOS, we would want MADV_FREE_REUSABLE.

            I think that we will need a new start-up parameter innodb_buffer_pool_max_size that specifies the virtual address range size that will be allocated for the InnoDB buffer pool. The parameter innodb_buffer_pool_chunk_size would be deprecated and have no effect. The innodb_buffer_pool_size may be set to anything up to the predeclared maximum size. At all times, the usable size of the buffer pool (in terms of page frames) would be innodb_buffer_pool_size minus the overhead of allocating the buf_block_t descriptors.
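
            A hedged, Linux-flavoured sketch of that approach (hypothetical function names, error handling omitted): reserve the whole declared maximum address range up front without committing memory, then commit on grow and give the tail back to the kernel on shrink while keeping the reservation:

            #include <sys/mman.h>
            #include <cstddef>

            static void *pool_reserve(size_t max_size)
            {
              // Reserve the virtual address range only; no memory is committed yet.
              return mmap(nullptr, max_size, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
            }

            static void pool_grow(void *base, size_t old_size, size_t new_size)
            {
              // Make the additional range accessible; pages fault in on first use.
              mprotect(static_cast<char*>(base) + old_size, new_size - old_size,
                       PROT_READ | PROT_WRITE);
            }

            static void pool_shrink(void *base, size_t old_size, size_t new_size)
            {
              // Let the kernel reclaim the tail, but keep the address range reserved
              // so that a later grow can reuse the same virtual addresses.
              madvise(static_cast<char*>(base) + new_size, old_size - new_size,
                      MADV_FREE);
              mprotect(static_cast<char*>(base) + new_size, old_size - new_size,
                       PROT_NONE);
            }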

            marko Marko Mäkelä added a comment -
            marko Marko Mäkelä made changes -

            To start with, it could be simplest to double the granularity of innodb_buffer_pool_size from 1 to 2 megabytes, which coincidentally is the smallest hugepage size on many AMD64 implementations, and to make the mapping of block descriptors to block addresses independent of the MMU or TLB page size.

            After removing the buf_page_t::frame pointer, we have sizeof(buf_block_t) of 152 bytes in a non-debug 64-bit build, or 200 bytes in a debug build. Within each 2MiB slice of the innodb_buffer_pool_size we would have block descriptors and then the corresponding page frames. Let us look at the relevant part of the previously constructed table:

            innodb_page_size/KiB | pages/2MiB | 152-byte
            4 | 512 | 19
            8 | 256 | 5
            16 | 128 | 2
            32 | 64 | 1
            64 | 32 | 1

            The first line means that at innodb_page_size=4k, we would have 512 page frames per 2MiB. But, we will allocate the first 19 of those page frames for the 152-byte buf_block_t descriptors, that is, 19*4096/152 = 77824/152 = up to 512 block descriptors. We actually use 512-19=493 block descriptors.

            Similarly, at the default innodb_page_size=16k we would need 2 page frames = 32768 bytes for allocating the 128-2=126 block descriptors (126*152=19152 bytes). The same 2 page frames would also suffice on debug builds: 126*200=25200 bytes still falls between 16384 and 32768 bytes.

            marko Marko Mäkelä added a comment -

            If we kept doubling the size of an extent (the allocation granularity) of innodb_buffer_pool_size further, the allocation of block descriptors would incur even less overhead. Here are a few numbers, corresponding to sizeof(buf_block_t) for 64-bit and 32-bit non-debug builds:

            extent | innodb_page_size | pages/extent | 152-byte | 108-byte
            2MiB | 4KiB | 512 | 19 | 14
            2MiB | 8KiB | 256 | 5 | 4
            2MiB | 16KiB | 128 | 2 | 1
            2MiB | 32KiB | 64 | 1 | 1
            2MiB | 64KiB | 32 | 1 | 1
            4MiB | 4KiB | 1024 | 37 | 27
            4MiB | 8KiB | 512 | 10 | 7
            4MiB | 16KiB | 256 | 3 | 2
            4MiB | 32KiB | 128 | 1 | 1
            4MiB | 64KiB | 64 | 1 | 1
            8MiB | 4KiB | 2048 | 74 | 53
            8MiB | 8KiB | 1024 | 19 | 14
            8MiB | 16KiB | 512 | 5 | 4
            8MiB | 32KiB | 256 | 2 | 1
            8MiB | 64KiB | 128 | 1 | 1
            16MiB | 4KiB | 4096 | 147 | 106
            16MiB | 8KiB | 2048 | 38 | 27
            16MiB | 16KiB | 1024 | 10 | 7
            16MiB | 32KiB | 512 | 3 | 2
            16MiB | 64KiB | 256 | 1 | 1

            Each time the number of block descriptor page frames per extent is odd, halving the size of the extent would incur more overhead. But we would not want to unnecessarily increase the granularity of innodb_buffer_pool_size. Below is the above information represented in a more compact format, which may be harder to understand but easier to compare. The first and the (second) choice for each page size are highlighted:

            sizeof(buf_block_t) | extent | 4KiB | 8KiB | 16KiB | 32KiB | 64KiB
            152 | 2MiB | 19 | 5 | 2 | 1 | 1
            152 | 4MiB | (37) | 10 | (3) | (1) | 1
            152 | 8MiB | 74 | 19 | 5 | 2 | (1)
            152 | 16MiB | 147 | 38 | 10 | 3 | 1
            108 | 2MiB | 14 | 4 | (1) | 1 | 1
            108 | 4MiB | (27) | (7) | 2 | (1) | 1
            108 | 8MiB | 53 | 14 | 4 | 1 | (1)
            108 | 16MiB | 106 | 27 | 7 | 2 | 1

            At the default innodb_page_size=16k, we can see that out of these numbers, we get the minimal overhead for 64-bit systems (152-byte descriptors) with 8MiB extents, using 5*16KiB of descriptors to cover 512-5=507 pages. That is an overhead of 5/512=0.98%. For innodb_page_size=4k the overhead would be 74/2048=3.61%. By using 16MiB extents we could lower that to 147/4096=3.59%, which is not much better. At innodb_page_size=64k we would halve the overhead from 1/128=0.78% to 1/256=0.39%, but that is a small overhead to begin with and not a default page size.

            On 32-bit systems, the 8MiB extent size would be close to optimal, but I think that we should go with 2MiB extent size, doubling the previous granularity of 1MiB. For innodb_page_size=16k we would use 1 page frame to cover 128-1=127 pages, corresponding to an overhead of 1/128=0.78%. With a 16MiB extent size, the overhead would drop to 7/1024=0.68%, which is not significantly better.
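
            These per-extent figures can be spot-checked against the constexpr bpp() helper sketched earlier in this ticket (its first argument is the extent size in KiB; again this is only a sanity check of the quoted numbers):

            static_assert(bpp(8192, 152, 16384) == 5,
                          "8MiB extents, 16KiB pages, 152-byte descriptors");
            static_assert(bpp(8192, 108, 16384) == 4,
                          "8MiB extents, 16KiB pages, 108-byte descriptors");
            static_assert(bpp(2048, 108, 16384) == 1,
                          "2MiB extents, 16KiB pages, 108-byte descriptors");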

            marko Marko Mäkelä added a comment -

            Just to note - large pages on Windows are special: they can't just be reserved; they must be reserved and committed in one go via VirtualAlloc(MEM_RESERVE|MEM_COMMIT).
            From what I remember, and my memory might be a bit dated, they are always committed and locked in memory (thus the LockPagesInMemory privilege is required to allocate them).

            Anyway, it might turn out that the buffer pool extending and shrinking functionality only works with chunks if the buffer pool consists of large pages. Perhaps it is not a showstopper, but we should at least have a test for it.

            wlad Vladislav Vaintroub added a comment -
            marko Marko Mäkelä made changes -

            The minimum buffer pool size was 256*5/4 pages, or 320 pages, which corresponds to exactly 5 MiB when using the default innodb_page_size=16k. After these changes, the innodb_buffer_pool_size will include the block descriptors, and therefore the minimum will increase to innodb_buffer_pool_size=6m for that page size. The allocation granularity will remain 1 MiB. If the last extent is incomplete (not a multiple of 8 MiB on 64-bit systems), the first usable page of the last extent will be after the descriptor page frames. That is, if only 1MiB of the last extent were used, we might reserve 5*16KiB for block descriptors, and only 59 page frames at offsets 5‥63 would be available. This setup will allow the buffer pool to be resized freely, completely ignoring innodb_buffer_pool_chunk_size, to anything up to the new parameter innodb_buffer_pool_size_max.
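
            A small worked sketch of that arithmetic for the default innodb_page_size=16k, assuming 8MiB extents with 5 descriptor frames reserved per (even partial) extent as described above (the helper name is hypothetical):

            #include <cstddef>

            // Usable page frames for a given innodb_buffer_pool_size in MiB,
            // with 16KiB pages, 8MiB extents and 5 descriptor frames per extent.
            constexpr size_t usable_pages(size_t pool_mib)
            {
              return pool_mib / 8 * (512 - 5)
                + (pool_mib % 8 ? (pool_mib % 8) * 64 - 5 : 0);
            }

            static_assert(usable_pages(5) == 315, "5MiB falls below the 320-page minimum");
            static_assert(usable_pages(6) == 379, "6MiB satisfies the 320-page minimum");

            This matches the stated increase of the minimum to innodb_buffer_pool_size=6m: 320 frames at 16KiB are exactly 5 MiB, but 5 of them would now be descriptor frames, leaving only 315 usable pages.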

            On Microsoft Windows, I think that we must disable buffer pool resizing when using large_pages.

            marko Marko Mäkelä added a comment -

            I tested this with a 30-second Sysbench oltp_update_index workload, with the following statement executed right before the server shutdown:

            SET GLOBAL innodb_buffer_pool_size=10485760, innodb_fast_shutdown=0;
            

            An attempt to do this with a CMAKE_BUILD_TYPE=RelWithDebInfo server yesterday led to a crash because there were race conditions in my initial shrinking algorithm. With today’s fixes, the server does not crash or hang, and the buffer pool resizing will be aborted due to running out of space:

            2024-03-08 14:25:34 69 [Note] InnoDB: Trying to shrink innodb_buffer_pool_size=10m (630 pages) from 30720m (1946880 pages, to withdraw 119220)
            2024-03-08 14:25:36 0 [Warning] InnoDB: Could not free any blocks in the buffer pool! 1013 blocks are in use and 38 free. Consider increasing innodb_buffer_pool_size.
            2024-03-08 14:25:36 69 [ERROR] mariadbd: innodb_buffer_pool_size change aborted
            2024-03-08 14:25:36 0 [Note] /dev/shm/10.6/sql/mariadbd (initiated by: root[root] @ localhost []): Normal shutdown
            2024-03-08 14:25:36 0 [Note] InnoDB: FTS optimize thread exiting.
            2024-03-08 14:25:36 0 [Note] InnoDB: to purge 5942416 transactions
            2024-03-08 14:25:46 0 [Note] InnoDB: Starting shutdown...
            

            Before I implemented this logic, I observed a hang where the purge coordinator and 2 purge worker tasks were blocked, waiting to allocate a page frame for reading something into the buffer pool.

            marko Marko Mäkelä added a comment -

            It was tricky to get the logic around ROW_FORMAT=COMPRESSED to work when the buffer pool is being shrunk, but I think that I finally got it working today. The rewritten test innodb.innodb_buffer_pool_resize was very useful for that, along with rr record, of course. Failures were mostly observed in RelWithDebInfo, not Debug, which complicated the debugging.

            I ran a quick Sysbench based performance test to compare this to its prerequisite MDEV-33588:

            revision | throughput/tps | average latency/ms
            baseline | 198494.96 | 0.32
            MDEV-33588+baseline | 196647.46 | 0.32
            work in progress | 194563.12 | 0.33

            I think that more extensive testing is needed to see if MDEV-33588 actually introduces a performance regression. There undeniably is a clear performance regression for the current work in progress; I have also observed it earlier. I think that it can be helped by not removing the buf_page_t::frame pointer. We can initialize it lazily.

            marko Marko Mäkelä added a comment -

            By the way, before this task, InnoDB could hang if one attempted to shrink the buffer pool too much:

            10.6 4ac8c4c820ebcff3571a2c67acc4fc41510b2d33

            2024-03-14 16:34:39 69 [Note] InnoDB: Requested to resize buffer pool. (new size: 134217728 bytes)
            2024-03-14 16:34:39 0 [Note] InnoDB: Resizing buffer pool from 32212254720 to 134217728 (unit=134217728).
            2024-03-14 16:34:39 0 [Note] InnoDB: Disabling adaptive hash index.
            2024-03-14 16:34:39 0 [Note] InnoDB: Withdrawing blocks to be shrunken.
            2024-03-14 16:34:39 0 [Note] InnoDB: start to withdraw the last 1938768 blocks
            2024-03-14 16:34:39 0 [Note] /dev/shm/10.6g/sql/mariadbd (initiated by: root[root] @ localhost []): Normal shutdown
            2024-03-14 16:34:39 0 [Note] InnoDB: FTS optimize thread exiting.
            2024-03-14 16:34:39 0 [Note] InnoDB: to purge 22333135 transactions
            2024-03-14 16:34:40 0 [Warning] InnoDB: Could not free any blocks in the buffer pool! 383253 blocks are in use and 0 free. Consider increasing innodb_buffer_pool_size.
            2024-03-14 16:34:40 0 [Note] InnoDB: withdrawing blocks. (1563630/1938768)
            2024-03-14 16:34:40 0 [Note] InnoDB: withdrew 1414234 blocks from free list. Tried to relocate 0 pages (1563755/1938768)
            …
            2024-03-14 16:35:02 0 [Note] InnoDB: withdrawing blocks. (1604389/1938768)
            2024-03-14 16:35:02 0 [Note] InnoDB: withdrew 0 blocks from free list. Tried to relocate 0 pages (1604389/1938768)
            2024-03-14 16:35:02 0 [Note] InnoDB: will retry to withdraw later
            

            marko Marko Mäkelä added a comment -
            marko Marko Mäkelä made changes -
            Status In Progress [ 3 ] Stalled [ 10000 ]
            serg Sergei Golubchik made changes -
            Fix Version/s 11.6 [ 29515 ]
            Fix Version/s 11.5 [ 29506 ]
            serg Sergei Golubchik made changes -
            Fix Version/s 11.7 [ 29815 ]
            Fix Version/s 11.6 [ 29515 ]
            mariadb-jira-automation Jira Automation (IT) made changes -
            Zendesk Related Tickets 201628
            Zendesk active tickets 201628
            serg Sergei Golubchik made changes -
            Fix Version/s 11.8 [ 29921 ]
            Fix Version/s 11.7 [ 29815 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Priority Critical [ 2 ] Major [ 3 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Priority Major [ 3 ] Critical [ 2 ]
            marko Marko Mäkelä made changes -
            serg Sergei Golubchik made changes -
            Fix Version/s 11.9 [ 29945 ]
            Fix Version/s 11.8 [ 29921 ]
            marko Marko Mäkelä made changes -
            Status Stalled [ 10000 ] In Progress [ 3 ]

            I revived this work, still based on 10.6 so that if any unrelated bugs are found during testing, it will be more convenient to fix them. There currently is an issue with the innodb.doublewrite test, which I have yet to fully diagnose and fix. It might be the case that something around the doublewrite buffer is currently broken.

            This will also include MDEV-25340. The server start-up time with a large buffer pool size seems to be roughly halved from what it used to be.

            I tested a 64-thread, 64-table, 100k row Sysbench oltp_update_index with an initial 30GiB buffer pool, shrinking it to 10 MiB during the workload:

            10.6-MDEV-29445 07513d4faba65a074ce64d308f5327cd5954a324

            [ 60s ] thds: 64 tps: 188308.82 qps: 188308.82 (r/w/o: 0.00/188308.82/0.00) lat (ms,99%): 0.86 err/s: 0.00 reconn/s: 0.00
            [ 65s ] thds: 64 tps: 188916.86 qps: 188917.06 (r/w/o: 0.00/188917.06/0.00) lat (ms,99%): 0.83 err/s: 0.00 reconn/s: 0.00
            [ 70s ] thds: 64 tps: 191395.07 qps: 191394.87 (r/w/o: 0.00/191394.87/0.00) lat (ms,99%): 0.83 err/s: 0.00 reconn/s: 0.00
            [ 75s ] thds: 64 tps: 78314.97 qps: 78314.97 (r/w/o: 0.00/78314.97/0.00) lat (ms,99%): 0.86 err/s: 0.00 reconn/s: 0.00
            [ 80s ] thds: 64 tps: 772.44 qps: 772.44 (r/w/o: 0.00/772.44/0.00) lat (ms,99%): 5607.61 err/s: 0.00 reconn/s: 0.00
            [ 85s ] thds: 64 tps: 1695.37 qps: 1695.37 (r/w/o: 0.00/1695.37/0.00) lat (ms,99%): 179.94 err/s: 0.00 reconn/s: 0.00
            [ 90s ] thds: 64 tps: 1789.03 qps: 1789.03 (r/w/o: 0.00/1789.03/0.00) lat (ms,99%): 153.02 err/s: 0.00 reconn/s: 0.00
            [ 95s ] thds: 64 tps: 1742.64 qps: 1742.64 (r/w/o: 0.00/1742.64/0.00) lat (ms,99%): 170.48 err/s: 0.00 reconn/s: 0.00
            [ 100s ] thds: 64 tps: 1736.15 qps: 1736.15 (r/w/o: 0.00/1736.15/0.00) lat (ms,99%): 155.80 err/s: 0.00 reconn/s: 0.00
            [ 105s ] thds: 64 tps: 1834.52 qps: 1834.52 (r/w/o: 0.00/1834.52/0.00) lat (ms,99%): 137.35 err/s: 0.00 reconn/s: 0.00
            [ 110s ] thds: 64 tps: 1836.39 qps: 1836.39 (r/w/o: 0.00/1836.39/0.00) lat (ms,99%): 139.85 err/s: 0.00 reconn/s: 0.00
            [ 115s ] thds: 64 tps: 1744.06 qps: 1744.06 (r/w/o: 0.00/1744.06/0.00) lat (ms,99%): 176.73 err/s: 0.00 reconn/s: 0.00
            [ 120s ] thds: 64 tps: 1780.16 qps: 1780.16 (r/w/o: 0.00/1780.16/0.00) lat (ms,99%): 161.51 err/s: 0.00 reconn/s: 0.00
            

            In the server error log, I see that shutdown is hanging, so I will have something to do:

            2025-01-30 15:36:37 69 [Note] InnoDB: Trying to shrink innodb_buffer_pool_size=10m (630 pages) from 30720m (1946880 pages, to withdraw 328633)
            2025-01-30 15:36:46 69 [Note] InnoDB: Resizing hash tables
            2025-01-30 15:36:46 69 [Note] InnoDB: innodb_buffer_pool_size=10m (630 pages) resized from 30720m (1946880 pages)
            2025-01-30 15:37:25 0 [Note] /dev/shm/10.6/sql/mariadbd (initiated by: root[root] @ localhost []): Normal shutdown
            2025-01-30 15:37:25 0 [Note] InnoDB: FTS optimize thread exiting.
            2025-01-30 15:37:25 0 [Note] InnoDB: to purge 11743512 transactions
            2025-01-30 15:37:26 0 [Warning] InnoDB: Could not free any blocks in the buffer pool! 413 blocks are in use and 0 free. Consider increasing innodb_buffer_pool_size.
            

            The buf_flush_page_cleaner() thread is busy, invoking buf_flush_LRU() and unable to free any pages. There are no dirty pages in the buffer pool. I checked that all the 413 pages of buf_pool.LRU are undo log pages in state UNFIXED+1, all of them buffer-fixed by trx_purge_attach_undo_recs(). The buffer pool usage limit in that subsystem seems to be misplaced.

            marko Marko Mäkelä added a comment -

            I applied some tweaks and performed some more load testing. An attempt to shrink the buffer pool too much during a heavy write load will typically fail, because buf_flush_page_cleaner() would invoke buf_pool.LRU_warn() to notify that it is unable to free any blocks.

            marko Marko Mäkelä added a comment -

            I ran a simple performance test on RAM disk on a dual Intel® Xeon® Gold 6230R (26×2 threads per socket), with innodb_buffer_pool_size=5G and innodb_log_file_size=5G:

            sysbench oltp_update_index --tables=100 --table_size=10000 --threads=100 --time=120 --report-interval=5 --max-requests=0 run
            

            Compared to the baseline, I observed a 2% regression in the average throughput. My first suspect would be the lazy initialization of the buffer pool (MDEV-25340), which is part of this change, but I have not analyzed it further yet.

            I also tested crash recovery by killing the workload about 115 seconds into it (5 seconds before it would end), and measuring the time to recover a copy of that data directory, using two settings for innodb_buffer_pool_size: 1 GiB (requiring 2 recovery batches) and 5 GiB (682,236,800 bytes of log processed in 1 batch). The times between baseline and the patch were very similar. I will have to repeat this experiment after diagnosing and addressing the performance regression during the workload.

            marko Marko Mäkelä added a comment -

Some regression right after server startup could be expected due to the lazy initialization of the buffer pool. However, once the part of the buffer pool that corresponds to the working set of the workload has been fully initialized, we could expect that no buf_pool_t::lazy_allocate() calls will take place. According to perf record -g, only 2.13% of the total samples are recorded in buf_page_get_low(), with 0.83% of the total samples attributed to waiting for a shared buf_page_t::lock there. Every other buffer pool related function accounts for less than 0.02% of the samples, probably less than 0.2% of the total in aggregate. Because this is a 10.6 based development branch without MDEV-27774, a write-heavy benchmark such as sysbench oltp_update_index would be dominated by contention on log_sys.mutex. I will move on to a read-only benchmark.

marko Marko Mäkelä added a comment

            I am observing a small performance improvement with

            sysbench oltp_read_only --tables=100 --table_size=100000 --threads=100 --time=120 --report-interval=5 --max-requests=0 run

Both for the baseline and the patched version, the first of two consecutive runs is faster. The patch improves throughput (queries per second) by 775537.75/769316.38 − 1 ≈ 0.81% for the first run and by 761852.58/758954.99 − 1 ≈ 0.38% for the second run.

            My benchmark setup is far from reliable. To get more stable numbers, it would help to pin the mariadbd process to a single NUMA node and to disable hyperthreading as well. This can be considered at most as a sanity check before running a broader set of performance tests.

marko Marko Mäkelä added a comment
            marko Marko Mäkelä made changes -

            After some discussion with wlad, I decided to check if this actually depends on lazy buffer pool initialization (MDEV-25340), which I had implemented in my development branch. It seems that the logic in many places would be simpler if that change was reverted. I posted some performance test results to MDEV-25340. With a 96 GiB buffer pool allocated in 1 GiB MMU pages on a dual Haswell/Broadwell Xeon that has 2×64GiB of RAM, we’re talking about possibly halving the start-up time, but the starting point was less than 1 second. I think that a startup time of about 10 ms/GB should be acceptable.

            Today, I removed the lazy allocation. Two loops on startup and resize are now invoking block_descriptors_in_bytes, which makes them extremely slow. I will fix that next week. Then, hopefully, this task is practically done. In the stress testing so far, mleich has reported one mystery crash that does not reproduce under rr. It might end up having been fixed by today’s cleanup.

marko Marko Mäkelä added a comment

            Today, after fixing the loops, I retested the MDEV-25340 scenario. I observed a significantly slower but still kind-of acceptable start-up time compared to the lazy initialization.

            I believe that this should ideally target the 10.11 branch, for the following reasons:

            • This fixes a race condition in the adaptive hash index and therefore fixes MDEV-35485.
• This might make the MDEV-24670 interface obsolete; that interface has been somewhat problematic and first appeared in 10.11.
marko Marko Mäkelä added a comment
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            Assignee Marko Mäkelä [ marko ] Debarun Banerjee [ JIRAUSER54513 ]
            Status In Progress [ 3 ] In Review [ 10002 ]
            marko Marko Mäkelä made changes -
marko Marko Mäkelä added a comment - edited

            Based on some performance tests that I conducted today, I’m seeing a much smaller reduction of the resident set size than expected when adjusting the innodb_buffer_pool_size between 50GiB and 1GiB.

            Edit: I started to think if we really need to include the lazy buffer pool allocation (MDEV-25340) in this, but then I realized that the lazy allocation would only have an impact when increasing innodb_buffer_pool_size. By lazy allocation, we would avoid the immediate pollution of some pages by linking each block descriptor to the buf_pool.free list. This ought to be more prominent when using large_pages (1 GiB or 2 MiB instead of 4 KiB on x86-64).

            I figured out a way to artificially limit the available memory when not using large_pages. The following would seem to ‘retract’ 98 GiB RAM on my 128 GiB test system:

            echo 96|sudo tee /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
            echo 1024|sudo tee /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
            

            When I was testing some memory pressure related changes (MDEV-34863) in the same branch, I think that I observed some benefit from invoking madvise(MADV_FREE), but I am unsure if it has any effect on explicit huge pages on Linux.

            In pmap -x $(pgrep mariadbd) I can identify the buffer pool allocation:

            Address           Kbytes     RSS   Dirty Mode  Mapping
            00007f0ae4000000 52443136 18468352 3801088 rw---   [ anon ]
            

            The "Dirty" size is slightly smaller than the current innodb_buffer_pool_size=4m. The virtual size corresponds to the start-up parameter innodb_buffer_pool_size_max=50m. But the resident set size (RSS) is much larger than I would expect.


I found some claims that Linux treats MADV_FREE only as a hint, freeing the memory only when memory pressure is detected. If I use MADV_DONTNEED instead of MADV_FREE, then the mapping will be shrunk immediately. Indeed, I observe that the RES reported by top drops by much more than with MADV_FREE, and pmap -x $(pgrep mariadbd) reports exactly the expected innodb_buffer_pool_size=10g:

            Address           Kbytes     RSS   Dirty Mode  Mapping
            00007f729a800000 52443136 10485760 8321024 rw---   [ anon ]
            

            I will retest large_pages=1 with MADV_DONTNEED to determine the granularity of those allocations.
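
For reference, the difference can be reproduced outside the server with a minimal stand-alone sketch (not InnoDB code; the 1 GiB size and the getchar() pause are arbitrary choices for inspecting the process with pmap):

// Stand-alone sketch (not InnoDB code): fault in an anonymous mapping and
// then release half of it. With MADV_DONTNEED the RSS reported by top or
// pmap -x drops immediately; with MADV_FREE the pages linger until the
// kernel is under memory pressure.
#include <sys/mman.h>
#include <cstddef>
#include <cstring>
#include <cstdio>

int main()
{
  const size_t size = 1ULL << 30;                 // 1 GiB, an arbitrary test size
  void *p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
  if (p == MAP_FAILED) { perror("mmap"); return 1; }

  memset(p, 0x5a, size);                          // fault in every page; RSS grows to ~1 GiB

  // madvise(p, size / 2, MADV_FREE);             // deferred: RSS may stay unchanged
  if (madvise(p, size / 2, MADV_DONTNEED))        // immediate: RSS drops by ~512 MiB
    perror("madvise");

  getchar();                                      // pause here and inspect pmap -x on the process
  munmap(p, size);
  return 0;
}

Watching pmap -x while the program pauses should show the RSS dropping by about half with MADV_DONTNEED, while the commented-out MADV_FREE variant typically leaves the RSS unchanged until the kernel actually reclaims the pages.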

marko Marko Mäkelä added a comment

It turns out that when large_pages=1 is in use, the RES in top does not cover the innodb_buffer_pool_size=47g at all. During my test, it remained steady at around 5.5GiB. Also, pmap -x $(pgrep mariadbd) would only report the size of the virtual address range:

            Address           Kbytes     RSS   Dirty Mode  Mapping
            00007f1240000000 50331648       0       0 rw--- anon_hugepage (deleted)
            

The file /sys/kernel/mm/hugepages/hugepages-1048576kB/free_hugepages reported 0, even after I reduced the innodb_buffer_pool_size of the running process to 4GiB. MADV_DONTNEED does not appear to have any effect on huge page allocations, except maybe for some TLB pages. After the mariadbd process was shut down, all 48 huge pages that I had configured were reported as available.

            For the record, here is the test script that I used:

            #!/bin/bash
            : ${SRCTREE=/mariadb/10.11}
            : ${MDIR=/dev/shm/10.11}
            : ${TDIR=/dev/shm/sbtest}
            LD_LIBRARY_PATH="$MDIR/libmysql"
            MYSQL_SOCK=$TDIR/mysqld.sock
            MYSQL_USER=root
            #: ${INNODB=--innodb-log-file-size=5g --innodb-undo-tablespaces=3 --innodb-undo-log-truncate=ON}
            #: ${INNODB=--innodb-log-file-size=5g --innodb-data-home-dir=/dev/shm}
            : ${INNODB=--innodb-log-file-size=5g}
             
            SYSBENCH="sysbench oltp_update_non_index \
              --mysql-socket=$MYSQL_SOCK \
              --mysql-user=$MYSQL_USER \
              --mysql-db=test \
              --percentile=99 \
              --tables=40 \
              --table_size=1000000"
            rm -rf "$TDIR"
            cd $MDIR
            sh scripts/mariadb-install-db --user="$USERNAME" --srcdir="$SRCTREE" --builddir=. --datadir="$TDIR" --auth-root-authentication-method=normal $INNODB
            cd ../
            #numactl --cpunodebind 1 --localalloc \
            $MDIR/sql/mariadbd --no-defaults --gdb --innodb \
               --datadir="$TDIR" --socket=$MYSQL_SOCK \
               --large-pages=1 \
              $INNODB\
              --innodb_buffer_pool_size=47g \
              --innodb_buffer_pool_size_min=5g --innodb-buffer-pool-size-max=47g \
              --innodb_flush_log_at_trx_commit=0 \
              --innodb-fast-shutdown=0 \
              --max-connections=300 \
            \
              --aria-checkpoint-interval=0 > "$TDIR"/mysqld.err 2>&1 &
            timeo=600
            echo -n "waiting for server to come up "
            while [ $timeo -gt 0 ]
            do
              $MDIR/client/mariadb-admin -S $MYSQL_SOCK -u $MYSQL_USER -b -s ping && break
              echo -n "."
              timeo=$(($timeo - 1))
              sleep 1
            done
             
            if [ $timeo -eq 0 ]
            then
              echo " server not starting! Abort!"
  exit 1
            fi
             
            #numactl --cpunodebind 0 --localalloc \
            $SYSBENCH prepare --threads=40
             
            #numactl --cpunodebind 0 --localalloc \
            $SYSBENCH --rand-seed=42 --rand-type=uniform --max-requests=0 --time=2400 --report-interval=5 --threads=40 run
             
            #$SYSBENCH cleanup
            $MDIR/client/mariadb-admin -u $MYSQL_USER -S $MYSQL_SOCK shutdown
            

An interesting observation is that shrinking the buffer pool would very easily fail during the workload if the value reported by show status like 'innodb_history_list_length' was large. Sometimes set global innodb_purge_threads=32; would help; other times I had to attach a debugger to the sysbench process to pause the workload. Sometimes I also had to issue set global innodb_max_dirty_pages_pct=1; to improve the chances that shrinking the buffer pool to very small sizes would succeed. I have implemented some logic for aborting the shrinking if InnoDB seems to be running out of buffer pool. If I detached the debugger from sysbench after shrinking the buffer pool to something very small (such as 10MiB), it would report 0.0qps until I increased the buffer pool size again.

marko Marko Mäkelä added a comment

            I’m considering to replace the use of MADV_FREE with MADV_DONTNEED so that the resident set size of the mariadbd process would immediately reflect the change in innodb_buffer_pool_size. Because shrinking the InnoDB buffer pool can be an intrusive operation, invoking the more expensive madvise() variant should be acceptable. At least it should reduce confusion and related support requests.

            It turns out that IBM AIX only documents MADV_DONTNEED but not MADV_FREE. Apple macOS documents both, but the description says that MADV_FREE would free the memory immediately, while MADV_DONTNEED could defer it, which is the exact opposite of what the documentation for other systems (Linux, FreeBSD, NetBSD, OpenBSD, Dragonfly BSD) is saying. My understanding is that MADV_DONTNEED was first, and MADV_FREE was introduced later (for example, Linux 4.5, revised for swapless systems in 4.12) in order to allow a reduction of overhead in implementations of malloc(3) and free(3).
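
If the switch is made, one conceivable way to keep the call portable is a compile-time fallback along these lines (a sketch with a made-up function name, not the actual MariaDB source; per the above, macOS would deserve special handling because its MADV_DONTNEED/MADV_FREE semantics are reversed):

// Hypothetical portability shim (made-up name, not the actual MariaDB code):
// prefer MADV_DONTNEED, which releases the pages immediately on Linux and the
// BSDs, and fall back to MADV_FREE only where MADV_DONTNEED is not available.
#include <sys/mman.h>
#include <cstddef>

static int buf_release_memory(void *ptr, size_t size)
{
#if defined MADV_DONTNEED
  return madvise(ptr, size, MADV_DONTNEED);
#elif defined MADV_FREE
  return madvise(ptr, size, MADV_FREE);
#else
  (void) ptr; (void) size;
  return 0;                      // no madvise(); the pages simply remain resident
#endif
}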

marko Marko Mäkelä added a comment
            debarun Debarun Banerjee made changes -
            Assignee Debarun Banerjee [ JIRAUSER54513 ] Marko Mäkelä [ marko ]
            Status In Review [ 10002 ] Stalled [ 10000 ]
            marko Marko Mäkelä made changes -

debarun, thank you for your thorough review. I clarified some things and fixed others. You reproduced two bugs when some of the innodb_buffer_pool_size related parameters are not a multiple of the allocation extent size (8 MiB or 2 MiB). I also revised the maximum buffer pool size to be 16EiB-8MiB or 4GiB-2MiB, to simplify the innodb_init_params() logic for adjusting the current, minimum, and maximum values of the parameters with respect to each other.
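
For illustration only, the adjustment could look roughly like the following (hypothetical helper names; the real logic lives in innodb_init_params() and differs in detail):

// Hypothetical illustration of the parameter adjustment (names are made up):
// round every size down to a multiple of the allocation extent and keep
// min <= current <= max.
#include <cstdint>
#include <algorithm>

static constexpr uint64_t extent = 8 << 20;          // assuming the 64-bit extent of 8 MiB

static uint64_t round_to_extent(uint64_t size)
{
  return std::max(size & ~(extent - 1), extent);      // never below one extent
}

static void adjust_pool_sizes(uint64_t &size, uint64_t &size_min, uint64_t &size_max)
{
  size_max = round_to_extent(size_max);
  size     = std::min(round_to_extent(size), size_max);
  size_min = std::min(round_to_extent(size_min), size);
}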

marko Marko Mäkelä added a comment
            marko Marko Mäkelä made changes -
            Assignee Marko Mäkelä [ marko ] Debarun Banerjee [ JIRAUSER54513 ]
            Status Stalled [ 10000 ] In Review [ 10002 ]

            Some intermediate result of RQG testing on
            origin/10.11-MDEV-29445 4afd83b99d0a161d698f234427f9dbb2a670ff2f 2025-02-28T17:05:09+02:00
             
            # 2025-03-03T07:20:04 [827446] | mariadbd: /data/Server/10.11-MDEV-29445F/storage/innobase/handler/ha_innodb.cc:14921: int ha_innobase::info_low(uint, bool): Assertion `ib_table->stat_initialized()' failed.
            (rr) bt
            #0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
            #1  __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
            #2  __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
            #3  0x00007b987c24526e in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
            #4  0x00007b987c2288ff in __GI_abort () at ./stdlib/abort.c:79
            #5  0x00007b987c22881b in __assert_fail_base (fmt=0x7b987c3d01e8 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x561de3d83aaf "ib_table->stat_initialized()", 
                file=file@entry=0x561de3c82618 "/data/Server/10.11-MDEV-29445F/storage/innobase/handler/ha_innodb.cc", line=line@entry=14921, function=function@entry=0x561de3c87d20 "int ha_innobase::info_low(uint, bool)")
                at ./assert/assert.c:94
            #6  0x00007b987c23b507 in __assert_fail (assertion=0x561de3d83aaf "ib_table->stat_initialized()", file=0x561de3c82618 "/data/Server/10.11-MDEV-29445F/storage/innobase/handler/ha_innodb.cc", line=14921, 
                function=0x561de3c87d20 "int ha_innobase::info_low(uint, bool)") at ./assert/assert.c:103
            #7  0x0000561de37e57f9 in ha_innobase::info_low (this=0x7b98580dd618, flag=18, is_analyze=<optimized out>, is_analyze@entry=false) at /data/Server/10.11-MDEV-29445F/storage/innobase/handler/ha_innodb.cc:14921
            #8  0x0000561de37e5d38 in ha_innobase::info (this=<optimized out>, flag=<optimized out>) at /data/Server/10.11-MDEV-29445F/storage/innobase/handler/ha_innodb.cc:15199
            #9  0x0000561de3439bbd in TABLE_LIST::fetch_number_of_rows (this=this@entry=0x7b9858015b08) at /data/Server/10.11-MDEV-29445F/sql/table.cc:9955
            #10 0x0000561de33b64c7 in make_join_statistics (join=join@entry=0x7b9858016fc0, tables_list=..., keyuse_array=keyuse_array@entry=0x7b9858017318) at /data/Server/10.11-MDEV-29445F/sql/sql_select.cc:5499
            #11 0x0000561de33b916d in JOIN::optimize_inner (this=this@entry=0x7b9858016fc0) at /data/Server/10.11-MDEV-29445F/sql/sql_select.cc:2643
            #12 0x0000561de33b93fb in JOIN::optimize (this=this@entry=0x7b9858016fc0) at /data/Server/10.11-MDEV-29445F/sql/sql_select.cc:1954
            #13 0x0000561de33b94dd in mysql_select (thd=thd@entry=0x7b9858002568, tables=0x7b9858015b08, fields=..., conds=0x7b9858016400, og_num=0, order=0x0, group=0x0, having=0x0, proc_param=0x0, 
                select_options=<optimized out>, result=0x7b9858016f98, unit=0x7b9858006828, select_lex=0x7b9858015018) at /data/Server/10.11-MDEV-29445F/sql/sql_select.cc:5218
            #14 0x0000561de33b98a8 in handle_select (thd=thd@entry=0x7b9858002568, lex=lex@entry=0x7b9858006750, result=result@entry=0x7b9858016f98, setup_tables_done_option=setup_tables_done_option@entry=0)
                at /data/Server/10.11-MDEV-29445F/sql/sql_select.cc:600
            #15 0x0000561de333d5f1 in execute_sqlcom_select (thd=thd@entry=0x7b9858002568, all_tables=0x7b9858015b08) at /data/Server/10.11-MDEV-29445F/sql/sql_parse.cc:6426
            #16 0x0000561de3346d1b in mysql_execute_command (thd=thd@entry=0x7b9858002568, is_called_from_prepared_stmt=is_called_from_prepared_stmt@entry=false) at /data/Server/10.11-MDEV-29445F/sql/sql_parse.cc:4012
            #17 0x0000561de334d131 in mysql_parse (thd=thd@entry=0x7b9858002568, rawbuf=<optimized out>, length=<optimized out>, parser_state=parser_state@entry=0x7b98795b3400)
                at /data/Server/10.11-MDEV-29445F/sql/sql_parse.cc:8188
            #18 0x0000561de334e79d in dispatch_command (command=command@entry=COM_QUERY, thd=thd@entry=0x7b9858002568, 
                packet=packet@entry=0x7b985800c8d9 "SELECT `col_int_nokey` % 10 AS `col_int_nokey`, `col_int_key` % 10 AS `col_int_key` FROM a WHERE `col_int_nokey` <= 6 /* E_R Thread4 QNO 1053 CON_ID 20 */ ", 
                packet_length=packet_length@entry=155, blocking=blocking@entry=true) at /data/Server/10.11-MDEV-29445F/sql/sql_parse.cc:1905
            #19 0x0000561de334fc3b in do_command (thd=thd@entry=0x7b9858002568, blocking=blocking@entry=true) at /data/Server/10.11-MDEV-29445F/sql/sql_parse.cc:1418
            #20 0x0000561de34719af in do_handle_one_connection (connect=<optimized out>, connect@entry=0x561de6c00898, put_in_cache=put_in_cache@entry=true) at /data/Server/10.11-MDEV-29445F/sql/sql_connect.cc:1386
            #21 0x0000561de3471bc0 in handle_one_connection (arg=0x561de6c00898) at /data/Server/10.11-MDEV-29445F/sql/sql_connect.cc:1298
            #22 0x00007b987c29ca94 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
            #23 0x00007b987c329a34 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:100
            (rr) 
            sdp:/data/results/1741001176/TB-2232$ _RR_TRACE_DIR=./1/rr rr replay --mark-stdio
             
            The test fiddles with partitioned tables and the server starts with innodb_undo_log_truncate=ON.
            

mleich Matthias Leich added a comment
            marko Marko Mäkelä made changes -

            marko I am done with the review. Please check my latest comments. I think the one around LRU flush is important to think through.

debarun Debarun Banerjee added a comment
            debarun Debarun Banerjee made changes -
            Assignee Debarun Banerjee [ JIRAUSER54513 ] Marko Mäkelä [ marko ]
            Status In Review [ 10002 ] Stalled [ 10000 ]

            debarun, thank you, very much appreciated. I was actually duplicating some logic of buf_pool_t::shrink() in buf_flush_LRU_list_batch(). I reverted some changes to the latter and made the former periodically release buf_pool.mutex in order to avoid starvation. I believe that its use of buf_pool.lru_itr should not conflict with buf_LRU_free_from_common_LRU_list(). Even if it did, the worst that could happen is that a buf_pool.LRU traversal is terminated prematurely and an outer loop will eventually handle it.
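
The pattern being described is roughly the following sketch with stand-in types and names (the real code operates on buf_pool.mutex and buf_pool.LRU inside buf_pool_t::shrink()):

// Sketch with stand-in types (not the actual server code): traverse the LRU
// list from the tail, but drop and reacquire the mutex periodically so that
// user threads and the page cleaner are not starved.
#include <mutex>

struct block_t { block_t *prev; /* ... */ };

static std::mutex pool_mutex;          // stand-in for buf_pool.mutex
static block_t *lru_tail;              // stand-in for the tail of buf_pool.LRU

static void shrink_some()
{
  std::unique_lock<std::mutex> lk(pool_mutex);
  unsigned processed = 0;
  for (block_t *b = lru_tail; b; )
  {
    block_t *prev = b->prev;
    // ... try to evict or relocate *b ...
    if (++processed % 64 == 0)
    {
      lk.unlock();                     // give waiters a chance to acquire the mutex
      lk.lock();
      prev = lru_tail;                 // the list may have changed; restart from the tail
    }
    b = prev;
  }
}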

marko Marko Mäkelä added a comment

I still need to agree with wlad regarding the interface for allocating virtual address space. Based on our discussion so far, I would change the default value of the new parameter innodb_buffer_pool_size_max to a ‘reasonably large’ value on Linux and Windows, instead of defaulting to the specified innodb_buffer_pool_size. In this way, the buffer pool could still be extended at runtime, just as it could be before.

            On other operating systems such as FreeBSD, OpenBSD, NetBSD, IBM AIX, there does not appear to be a way to overcommit the virtual address space allocation. Hence, on those systems, unless you specify innodb_buffer_pool_size_max on startup, you would only be able to shrink innodb_buffer_pool_size from its initially specified value.
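
The reserve-then-commit idea behind innodb_buffer_pool_size_max can be sketched as a stand-alone Linux program (not the actual buf0buf.cc code; the sizes and the use of PROT_NONE plus mprotect() are illustrative assumptions):

// Sketch: reserve address space for innodb_buffer_pool_size_max once, and
// make only the current innodb_buffer_pool_size accessible. Growing the pool
// later is then just another mprotect() call on the already reserved range.
#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

int main()
{
  const size_t size_max = 8ULL << 30;    // e.g. innodb_buffer_pool_size_max=8g
  const size_t size     = 2ULL << 30;    // e.g. innodb_buffer_pool_size=2g

  void *pool = mmap(nullptr, size_max, PROT_NONE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
  if (pool == MAP_FAILED) { perror("mmap"); return 1; }

  if (mprotect(pool, size, PROT_READ | PROT_WRITE))   // commit the current size
  { perror("mprotect"); return 1; }

  // ... the buffer pool would live in [pool, pool + size) ...

  munmap(pool, size_max);
  return 0;
}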

marko Marko Mäkelä added a comment
            marko Marko Mäkelä made changes -
            Assignee Marko Mäkelä [ marko ] Vladislav Vaintroub [ wlad ]
            Status Stalled [ 10000 ] In Review [ 10002 ]

            mmap(MAP_NORESERVE) on Linux would still allocate MMU page tables. What would be a reasonable default value of innodb_buffer_pool_size_max? Someone could say 64 MiB, someone else 64 GiB or 64 TiB (which is a quarter of the 48-bit address space limit of many contemporary 64-bit ISA implementations). It turns out that if we allocated virtual address space for 64 TiB, we would frequently run out of memory on some workers of our CI system. The MMU page tables to cover 64 TiB of virtual address space are large, possibly in the gigabyte range when using 4096-byte pages.

So, I would revert to limiting innodb_buffer_pool_size_max to the start-up value of innodb_buffer_pool_size. If someone anticipates a need to increase innodb_buffer_pool_size while the server is running, they can explicitly specify a larger innodb_buffer_pool_size_max in the server configuration.

marko Marko Mäkelä added a comment

marko I just think you can't reserve "large pages" on Linux, and that MAP_NORESERVE does not work for them. So, large-page allocations are not resizable, and one should not attempt to reserve address space for them.

wlad Vladislav Vaintroub added a comment

HugeTLB pages are unavailable by default on Linux. You have to explicitly reserve physical memory for them in order to be able to use them:

            echo 4|sudo tee /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
            echo 1024|sudo tee /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
            

            Based on my testing, madvise(MADV_DONTNEED) cannot shrink hugepage mappings and release such mappings to the operating system.

            Maybe Microsoft Windows can defer the allocation of page mappings until a TLB miss, but Linux appears to populate the page mappings immediately.

marko Marko Mäkelä added a comment
wlad Vladislav Vaintroub added a comment - edited

So, if I understand this correctly, one can't "reserve address space" on Linux for large pages, but only allocate them immediately; among other things, MAP_NORESERVE is a no-op for them.

There is, however, a mention of madvise(MADV_HUGEPAGE), and this sounds like it could be used. It is less explicit, but if the internet and the Linux documentation do not lie, it sometimes works on some Linux versions, and thus perhaps could be used to "commit" memory.

            marko Marko Mäkelä made changes -
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            Fix Version/s 12.1 [ 29992 ]
            Fix Version/s 12.0 [ 29945 ]
            wlad Vladislav Vaintroub made changes -
            Assignee Vladislav Vaintroub [ wlad ] Marko Mäkelä [ marko ]
            marko Marko Mäkelä made changes -
            Description copied from [MDEV-25341|https://jira.mariadb.org/browse/MDEV-25341?focusedCommentId=232177&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-232177]:

            * The buf_pool.free as well as the buffer pool blocks that are backing store for the AHI or lock_sys could be doubly linked with each other via bytes allocated within the page frame itself. We do not need a dummy buf_page_t for such blocks.

            * We could allocate a contiguous virtual address range for the maximum supported size of buffer pool, and let the operating system physically allocate a subset of these addresses. The complicated logic of having multiple buffer pool chunks can be removed. On 32-bit architectures, the maximum size could be about 2GiB. On 64-bit architectures, the virtual address bus often is 48 bits (around 256 TiB). Perhaps we could shift some burden to the user and introduce a startup parameter innodb_buffer_pool_size_max.
            The InnoDB buffer pool had been allocated in multiple chunks, because {{SET GLOBAL innodb_buffer_pool_size}} would extend the buffer pool in chunks. This would lead to many limitations, such as the inability to shrink the buffer pool below {{innodb_buffer_pool_chunk_size}}.

            It would be cleaner to allocate a contiguous virtual address range for a maximum supported size of buffer pool (a new parameter {{innodb_buffer_pool_size_max}}, which defaults to the initially specified {{innodb_buffer_pool_size}}) and to allow the {{innodb_buffer_pool_size}} to be changed in increments of 1 megabyte.

            , and let the operating system physically allocate a subset of these addresses. The complicated logic of having multiple buffer pool chunks can be removed. On 32-bit architectures, the maximum size could be about 2GiB. On 64-bit architectures, the virtual address bus often is 48 bits (around 256 TiB). Perhaps we could shift some burden to the user and introduce a startup parameter innodb_buffer_pool_size_max.
            marko Marko Mäkelä made changes -
            Description The InnoDB buffer pool had been allocated in multiple chunks, because {{SET GLOBAL innodb_buffer_pool_size}} would extend the buffer pool in chunks. This would lead to many limitations, such as the inability to shrink the buffer pool below {{innodb_buffer_pool_chunk_size}}.

            It would be cleaner to allocate a contiguous virtual address range for a maximum supported size of buffer pool (a new parameter {{innodb_buffer_pool_size_max}}, which defaults to the initially specified {{innodb_buffer_pool_size}}) and to allow the {{innodb_buffer_pool_size}} to be changed in increments of 1 megabyte.

            , and let the operating system physically allocate a subset of these addresses. The complicated logic of having multiple buffer pool chunks can be removed. On 32-bit architectures, the maximum size could be about 2GiB. On 64-bit architectures, the virtual address bus often is 48 bits (around 256 TiB). Perhaps we could shift some burden to the user and introduce a startup parameter innodb_buffer_pool_size_max.
            The InnoDB buffer pool had been allocated in multiple chunks, because {{SET GLOBAL innodb_buffer_pool_size}} would extend the buffer pool in chunks. This would lead to many limitations, such as the inability to shrink the buffer pool below {{innodb_buffer_pool_chunk_size}}.

            It would be cleaner to:
            * allocate a contiguous virtual address range for a maximum supported size of buffer pool (a new parameter {{innodb_buffer_pool_size_max}}, which defaults to the initially specified {{innodb_buffer_pool_size}})
            * allow the {{innodb_buffer_pool_size}} to be changed in increments of 1 megabyte
            * define a fixed mapping between the virtual memory addresses of buffer page descriptors page frames, to fix bugs like MDEV-34677 and MDEV-35485
            * refactor the shrinking of the buffer pool to provide more meaningful progress output and to avoid hangs

            The complicated logic of having multiple buffer pool chunks can be removed, and the parameter {{innodb_buffer_pool_chunk_size}} will be deprecated and ignored.

            madvise(MADV_HUGEPAGE) is something for enabling Transparent Huge Pages (THP). When the large_pages interface is being used, we are allocating explicit huge pages with mmap(). I think that if we were to experiment with madvise(MADV_HUGEPAGE), it should be tied to a configuration parameter that is disabled by default.
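
The distinction can be illustrated with a stand-alone Linux sketch (not MariaDB code; whether either request succeeds depends entirely on the system configuration):

// Explicit HugeTLB pages are requested at mmap() time with MAP_HUGETLB (this
// is what large_pages=1 uses, and it fails unless pages were reserved via
// /sys/kernel/mm/hugepages), whereas MADV_HUGEPAGE merely hints that an
// ordinary mapping may be backed by transparent huge pages.
#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

int main()
{
  const size_t size = 1ULL << 30;

  void *explicit_hp = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
  if (explicit_hp == MAP_FAILED) perror("mmap(MAP_HUGETLB)");

  void *thp = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (thp != MAP_FAILED && madvise(thp, size, MADV_HUGEPAGE))
    perror("madvise(MADV_HUGEPAGE)");

  if (explicit_hp != MAP_FAILED) munmap(explicit_hp, size);
  if (thp != MAP_FAILED) munmap(thp, size);
  return 0;
}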

marko Marko Mäkelä added a comment
            marko Marko Mäkelä made changes -
            Fix Version/s 10.11 [ 27614 ]
            Fix Version/s 11.4 [ 29301 ]
            Fix Version/s 11.8 [ 29921 ]
            Fix Version/s 12.1 [ 29992 ]
            marko Marko Mäkelä made changes -
            Status In Review [ 10002 ] In Testing [ 10301 ]
            marko Marko Mäkelä made changes -
            Assignee Marko Mäkelä [ marko ] Matthias Leich [ mleich ]
            marko Marko Mäkelä made changes -
            issue.field.resolutiondate 2025-03-26 15:45:33.0 2025-03-26 15:45:32.995
            marko Marko Mäkelä made changes -
            Fix Version/s 10.11.12 [ 29998 ]
            Fix Version/s 11.4.6 [ 29999 ]
            Fix Version/s 11.8.2 [ 30001 ]
            Fix Version/s 10.11 [ 27614 ]
            Fix Version/s 11.4 [ 29301 ]
            Fix Version/s 11.8 [ 29921 ]
            Assignee Matthias Leich [ mleich ] Marko Mäkelä [ marko ]
            Resolution Fixed [ 1 ]
            Status In Testing [ 10301 ] Closed [ 6 ]

The changes made many crash recovery tests hang in a Valgrind environment. I was able to reproduce the problem locally. I applied a fixup that reduces the problem at least to some extent. The underlying issue is that the default Valgrind Memcheck tool uses an unfair scheduler. If a thread is waiting for other threads to do something, thread context switches must be enforced by suitable system calls.
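
The general shape of the problem and of such a fixup can be sketched with hypothetical names (this is not the actual patch): a pure spin-wait never enters the kernel, so Valgrind's serialized scheduler may keep running the waiting thread and never schedule the thread that would make progress; an occasional system call such as sched_yield() forces a context switch.

// Sketch with hypothetical names (not the actual fixup).
#include <atomic>
#include <sched.h>

static std::atomic<bool> work_done{false};   // hypothetical flag set by another thread

static void wait_for_other_threads()
{
  unsigned spins = 0;
  while (!work_done.load(std::memory_order_acquire))
    if (++spins % 100 == 0)
      sched_yield();                         // enter the kernel so that Valgrind reschedules
}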

marko Marko Mäkelä added a comment
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -

            People

              marko Marko Mäkelä
              danblack Daniel Black
Votes: 1
Watchers: 14
