[MDEV-18726] InnoDB gets confused when using large pages if pages=1G Created: 2019-02-24  Updated: 2019-03-26  Resolved: 2019-03-18

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.3.12
Fix Version/s: 10.4.4

Type: Bug Priority: Major
Reporter: Philip orleans Assignee: Marko Mäkelä
Resolution: Fixed Votes: 2
Labels: contribution, foundation, patch
Environment:

Ubuntu


Issue Links:
Relates
relates to MDEV-15685 large pages - out of memory handling ... Open
relates to MDEV-18851 modernise Linux Large Page support (m... Closed

 Description   

On older machines, large pages are 2MB, and if you have enough of them, let's say 10G, you may use in my.cnf
large_pages=1
innodb_buffer_pool_size=10G
and InnoDB correctly allocates from this faster, never-swappable memory pool.
BUT, if the machine is newer, and you booted with a kernel command line containing
hugepagesz=1G default_hugepagesz=1G
Then you only need to allocate 10 pages to get 10GB of memory. This makes memory management much faster.
However, InnoDB gets confused. If you add
innodb_buffer_pool_size=5G
it will allocate 50G from the OS, which you can verify with
cat /proc/meminfo | grep HugePages
yet internally it will think it has only 5G, the rest is wasted.

I have a box ready to show the issue to Elena if she wants to see it. I have seen the issue on many boxes.



 Comments   
Comment by Daniel Black [ 2019-02-25 ]

As noted in MDEV-15685, innodb_buffer_pool_chunk_size is the allocation size, but even this is inflated by a bit of "management overhead" (in buf_chunk_init), so with huge pages the actual allocation consumes at least two lots of innodb_buffer_pool_chunk_size. This issue shows the same behaviour.

Gets even worse with POWER hardware having 16G large pages (as an option).

Agree it's all a bit of a mess. As a first fix, I think `buf_chunk_init` should at least include the management overhead of 32M within its size.

Long term, given the availability of multiple page types (since kernel 2.6.27), we probably should drop `innodb_buffer_pool_chunk_size` and allocate by the pages available (though resizing innodb_buffer_pool_size would need to be re-implemented).

Comment by Daniel Black [ 2019-02-28 ]

I've done some x86_64 and Power8 (non-large page) tests by removing the block descriptor space reservation in buf_chunk_init (patch below). The rest of the calculations in this function seem to adequately account for the removed space. There is a small test result change in `innodb.innodb`, with a different value of innodb_buffer_pool_pages_total (500/493(debug build)), but otherwise the tests pass.

--- a/storage/innobase/buf/buf0buf.cc
+++ b/storage/innobase/buf/buf0buf.cc
@@ -1565,11 +1565,6 @@ buf_chunk_init(
        /* Round down to a multiple of page size,
        although it already should be. */
        mem_size = ut_2pow_round(mem_size, ulint(srv_page_size));
-       /* Reserve space for the block descriptors. */
-       mem_size += ut_2pow_round((mem_size >> srv_page_size_shift)
-                                 * (sizeof *block)
-                                 + (srv_page_size - 1),
-                                 ulint(srv_page_size));
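
To illustrate why a chunk yields slightly fewer page frames once the reservation is removed, here is a minimal sketch of carving the descriptors out of the chunk itself. It is a simplified model, not the code in buf0buf.cc: it assumes the allocation is page-aligned and that the descriptors sit at the start of the chunk. The function name `usable_frames`, the 16384-byte page size and the 368-byte descriptor size are assumptions for illustration only.

```cpp
#include <cstdint>

// Number of page frames left in a chunk of mem_size bytes when the block
// descriptors must also fit inside that chunk (simplified model).
constexpr std::uint64_t usable_frames(std::uint64_t mem_size,
                                      std::uint64_t page_size,
                                      std::uint64_t descriptor_size) {
    std::uint64_t frames = mem_size / page_size;
    std::uint64_t reserved = 0;  // bytes at the start given up to descriptors
    // Give up one page at a time until the descriptors for the
    // remaining frames fit in the reserved area.
    while (frames > 0 && reserved < frames * descriptor_size) {
        reserved += page_size;
        --frames;
    }
    return frames;
}

constexpr std::uint64_t kPageSize = 16384;      // assumed srv_page_size
constexpr std::uint64_t kDescriptorSize = 368;  // hypothetical sizeof(buf_block_t)

// A 2M chunk holds 128 pages; 3 of them end up holding descriptors.
static_assert(usable_frames(2097152, kPageSize, kDescriptorSize) == 125,
              "125 of 128 pages remain usable");
```

The trade-off is that the request passed to the allocator stays at exactly mem_size, so it no longer spills over into an extra large page.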

Comment by Philip orleans [ 2019-02-28 ]

If you send me an installer, I can test it.

Comment by Daniel Black [ 2019-03-01 ]

I'm not sure how to make an installer and don't really have time. The lines beginning with `-` above are the entire list of lines to remove.

I've progressed a bit with an approach that uses all the large page sizes, starting with the largest: https://github.com/grooverdan/mariadb-server/commits/10.4-large-pages-descriptors-in-page. It now requires a 3.8+ kernel (which every maintained distro that MariaDB releases on has). Resizing the buffer pool hasn't been implemented to account for these changes (a downsize would need to search each chunk and release larger pages first), which adds to an already complex implementation. Loosely tested but still WIP.

Comment by Philip orleans [ 2019-03-01 ]

I wonder if somebody can generate an update for the 10.4 branch with these improvements.

Comment by Daniel Black [ 2019-03-03 ]

The initial patch is worth testing and would resolve this issue. The more extensive patches need to be completed before they can be merged.

Recommend watching https://fosdem.org/2019/schedule/event/hugepages_databases/
Look particularly at the perf measurements for dTLB-{loads|stores} vs dTLB-{load|store}-misses. If the miss ratio on your workload is low then larger pages may not provide benefit.
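
As a small illustration of the miss ratio mentioned above, the sketch below divides a miss count by the matching access count, for example the dTLB-load-misses and dTLB-loads counts reported by perf. The function name `dtlb_miss_ratio` and the sample counts are hypothetical; what counts as a "low" ratio is left to the reader's workload.

```cpp
#include <cstdint>

// Fraction of data-TLB accesses that missed; 0 when nothing was counted.
constexpr double dtlb_miss_ratio(std::uint64_t accesses, std::uint64_t misses) {
    return accesses == 0
               ? 0.0
               : static_cast<double>(misses) / static_cast<double>(accesses);
}

// Made-up counts: 100 misses out of 400 loads is a 25% miss ratio.
static_assert(dtlb_miss_ratio(400, 100) == 0.25, "one access in four missed");
```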

Comment by Philip orleans [ 2019-03-04 ]

My database has all possible numbers in North America, 17BN plus all associated information.
Anyway, I stopped using InnoDB for the main table. It requires about 8 times the disk space of RocksDB for the same information. It may be faster, but it is inferior.

Comment by Daniel Black [ 2019-03-07 ]

Just to highlight the problem: below, a 2M chunk size is increased by 2%, and when allocating on a system with a 2M large_page_size, 4M gets allocated per chunk, of which only 51% is used.

gdb --args sql/mysqld --no-defaults --skip-networking --datadir=/tmp/datadir --log-bin=/tmp/datadir/mysqlbin --socket /tmp/s.sock --lc-messages-dir=/home/dan/repos/build-mariadb-server-10.4-upstream/sql/share --verbose --innodb-buffer-pool-size=10M --innodb-buffer-pool-instances=2 --innodb-buffer-pool-chunk-size=2M --large-pages
(gdb) break buf_chunk_init
Breakpoint 1 at 0x51d17f: buf_chunk_init. (2 locations)
(gdb) r
Thread 1 "mysqld" hit Breakpoint 1, buf_chunk_init (buf_pool=0x5555574a47e0, chunk=0x5555574a4e20, mem_size=2097152) at /home/dan/repos/mariadb-server/storage/innobase/buf/buf0buf.cc:1560
1560	{
(gdb) p my_large_page_size 
$1 = 2097152
(gdb) n
1567		mem_size = ut_2pow_round(mem_size, ulint(srv_page_size));
(gdb) 
1569		mem_size += ut_2pow_round((mem_size >> srv_page_size_shift)
(gdb) p mem_size
$2 = 2097152
(gdb) n
1576		chunk->mem = buf_pool->allocator.allocate_large(mem_size,
(gdb) p mem_size
$3 = 2146304
(gdb) p mem_size - 2097152
$4 = 49152
(gdb) p 49152 * 100 / 2097152
$5 = 2
(gdb) s
ut_allocator<unsigned char, true>::allocate_large (dontdump=true, pfx=0x5555574a4e30, n_elements=2146304, this=0x5555574a4850) at /home/dan/repos/mariadb-server/storage/innobase/include/ut0new.h:634
634		allocate_large(
(gdb) s
os_mem_alloc_large (n=0x7fffffff59c0) at /home/dan/repos/mariadb-server/storage/innobase/os/os0proc.cc:66
66	{
(gdb) n
73		if (!os_use_large_pages || !os_large_page_size) {
(gdb) 
79		size = ut_2pow_round(*n + (os_large_page_size - 1),
(gdb) 
82		shmid = shmget(IPC_PRIVATE, (size_t) size, SHM_HUGETLB | SHM_R | SHM_W);
(gdb) p size
$6 = 4194304
(gdb) n
83		if (shmid < 0) {
(gdb) 
88			ptr = shmat(shmid, NULL, 0);
(gdb) 
89			if (ptr == (void*)-1) {
(gdb) p ptr
$1 = (void *) 0x7fffe1000000

OS confirms:

$ cd /proc/`pidof mysqld` ; egrep -A 20 '/(SYS|anon_huge)' smaps | more
7fffe1000000-7fffe1400000 rw-s 00000000 00:0f 31064141                   /SYSV00000000 (deleted)
Size:               4096 kB
KernelPageSize:     2048 kB
MMUPageSize:        2048 kB
Rss:                   0 kB
Pss:                   0 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         0 kB
Referenced:            0 kB
Anonymous:             0 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
VmFlags: rd wr sh mr mw me ms de ht sd 
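
The arithmetic in the session above can be replayed with a short sketch. It mirrors the two rounding steps seen in gdb (the descriptor reservation in buf_chunk_init, then the round-up to whole large pages in os_mem_alloc_large). The helper names are made up for this example, the 16384-byte page size is an assumption consistent with the 49152-byte overhead, and the 368-byte descriptor size is a hypothetical stand-in for sizeof(buf_block_t).

```cpp
#include <cstdint>

// Round n down to a multiple of m (m a power of two), like ut_2pow_round().
constexpr std::uint64_t round_down_pow2(std::uint64_t n, std::uint64_t m) {
    return n & ~(m - 1);
}

// Bytes buf_chunk_init asks for: the chunk plus page-rounded descriptor space.
constexpr std::uint64_t chunk_request(std::uint64_t mem_size,
                                      std::uint64_t page_size,
                                      std::uint64_t descriptor_size) {
    const std::uint64_t chunk = round_down_pow2(mem_size, page_size);
    const std::uint64_t descriptors = round_down_pow2(
        (chunk / page_size) * descriptor_size + (page_size - 1), page_size);
    return chunk + descriptors;
}

// Bytes passed to shmget: the request rounded up to whole large pages.
constexpr std::uint64_t large_alloc_size(std::uint64_t n,
                                         std::uint64_t large_page_size) {
    return round_down_pow2(n + (large_page_size - 1), large_page_size);
}

constexpr std::uint64_t kPageSize = 16384;      // assumed srv_page_size
constexpr std::uint64_t kDescriptorSize = 368;  // hypothetical sizeof(buf_block_t)
constexpr std::uint64_t kLargePage = 2097152;   // my_large_page_size in the session

constexpr std::uint64_t kRequest =
    chunk_request(2097152, kPageSize, kDescriptorSize);
constexpr std::uint64_t kAllocated = large_alloc_size(kRequest, kLargePage);

static_assert(kRequest == 2146304, "matches $3 in the gdb session");
static_assert(kAllocated == 4194304, "matches $6 in the gdb session");
static_assert(kRequest * 100 / kAllocated == 51, "about 51% of the segment");
```

Because the request is only slightly larger than one large page, the round-up always adds a second, almost empty large page.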

Comment by Marko Mäkelä [ 2019-03-08 ]

This is a welcome idea, but there are a couple of minor problems with the implementation, causing a mismatch related to innodb_buffer_pool_size. Please address my review comments.

Comment by Ralf Schenk [ 2019-03-18 ]

When using 1G hugepages with 10.1.38 on Ubuntu, I get 32GB of used SHM memory when declaring innodb_buffer_pool_size=16G. I think 16GB are wasted.
In earlier days (before 10.1.38), using 2 MB hugepages with innodb_buffer_pool_size=16G and innodb_buffer_pool_instances=16, I got exactly 16 shared memory segments of 1 GB. Now I get 16 segments of 2 GB!
On 10.3.x I found no way (I tried different innodb_buffer_pool_chunk_size and innodb_buffer_pool_instances settings) to get a buffer pool of the configured size. MySQL tried to allocate several times the configured innodb_buffer_pool_size in RAM.
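
The 16 x 2 GB observation is consistent with a simple model of the allocation, sketched below. The sketch assumes each of the 16 instances makes a single request of 16G/16 = 1G, that descriptor space is added on top of it, and that the result is rounded up to whole large pages. The names `segment_size` and `round_up_pow2`, the 16384-byte page size and the 368-byte descriptor size are assumptions for illustration, not values taken from the server.

```cpp
#include <cstdint>

constexpr std::uint64_t kMiB = 1ULL << 20;
constexpr std::uint64_t kGiB = 1ULL << 30;

// Round n up to a multiple of m (m a power of two).
constexpr std::uint64_t round_up_pow2(std::uint64_t n, std::uint64_t m) {
    return (n + m - 1) & ~(m - 1);
}

// One shared memory segment: the per-instance request plus descriptor
// space, rounded up to whole large pages.
constexpr std::uint64_t segment_size(std::uint64_t per_instance,
                                     std::uint64_t page_size,
                                     std::uint64_t descriptor_size,
                                     std::uint64_t large_page_size) {
    const std::uint64_t descriptors =
        round_up_pow2((per_instance / page_size) * descriptor_size, page_size);
    return round_up_pow2(per_instance + descriptors, large_page_size);
}

constexpr std::uint64_t kPageSize = 16384;      // assumed srv_page_size
constexpr std::uint64_t kDescriptorSize = 368;  // hypothetical sizeof(buf_block_t)

// With 2 MiB pages the overhead costs a few extra pages per segment.
static_assert(segment_size(kGiB, kPageSize, kDescriptorSize, 2 * kMiB)
                  == 1048 * kMiB, "about 1 GiB per segment");
// With 1 GiB pages the same overhead costs a whole extra page per segment.
static_assert(segment_size(kGiB, kPageSize, kDescriptorSize, kGiB)
                  == 2 * kGiB, "2 GiB per segment");
static_assert(16 * segment_size(kGiB, kPageSize, kDescriptorSize, kGiB)
                  == 32 * kGiB, "32 GiB for a 16 GiB pool");
```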

Comment by Daniel Black [ 2019-03-18 ]

rs@databay.de you may be interested in MDEV-18851 too.
