The InnoDB buffer pool was allocated in multiple chunks, because SET GLOBAL innodb_buffer_pool_size would extend the buffer pool one chunk at a time. This led to several limitations, such as the inability to shrink the buffer pool below innodb_buffer_pool_chunk_size.
It would be cleaner to:

- allocate a contiguous virtual address range for the maximum supported buffer pool size (a new parameter innodb_buffer_pool_size_max, which defaults to the initially specified innodb_buffer_pool_size)
- allow innodb_buffer_pool_size to be changed in increments of 1 megabyte
- define a fixed mapping between the virtual memory addresses of buffer page descriptors and page frames, to fix bugs like MDEV-34677 and MDEV-35485
- refactor the shrinking of the buffer pool to provide more meaningful progress output and to avoid hangs
The complicated logic of having multiple buffer pool chunks can be removed, and the parameter innodb_buffer_pool_chunk_size will be deprecated and ignored.
Marko Mäkelä added a comment:

The changes made many crash recovery tests hang in a Valgrind environment. I was able to reproduce the problem locally, and I applied a fixup that reduces the problem at least to some extent. The underlying issue is that the default Valgrind Memcheck tool uses an unfair scheduler: if a thread is waiting for other threads to do something, thread context switches must be enforced by suitable system calls.
Marko Mäkelä added a comment:

madvise(MADV_HUGEPAGE) is a means of enabling Transparent Huge Pages (THP). When the large_pages interface is used, we allocate explicit huge pages with mmap(). I think that if we were to experiment with madvise(MADV_HUGEPAGE), it should be tied to a configuration parameter that is disabled by default.
Vladislav Vaintroub added a comment (edited):

So, if I understand this correctly, one can't "reserve address space" on Linux for large pages, but can only allocate them immediately; among other things, MAP_NORESERVE is a no-op for them.

There is, however, a mention of madvise(MADV_HUGEPAGE), and it sounds like it could be used. It is less explicit, but if the internet and the Linux documentation do not lie, it sometimes works on some Linux kernels, so it could perhaps be used to "commit memory".
Marko Mäkelä added a comment:

HugeTLB pages are unavailable by default on Linux. You have to explicitly reserve physical memory before they can be used:

echo 4|sudo tee /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
echo 1024|sudo tee /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

Based on my testing, madvise(MADV_DONTNEED) cannot shrink hugepage mappings and release them to the operating system.

Maybe Microsoft Windows can defer the allocation of page mappings until a TLB miss, but Linux appears to populate the page mappings immediately.
Vladislav Vaintroub added a comment:

marko I just think you can't reserve "large pages" on Linux, and that MAP_NORESERVE does not work for them. So large pages are not resizable, and one should not attempt to reserve address space for them.