The InnoDB buffer pool can fill up during operations such as a mariadb-dump.
The logical successor to MDEV-25340, for keeping only the memory InnoDB actually requires mapped in, is to purge old pages as part of the LRU if they really haven't been used in a while. Those pages would remain allocated to mariadbd, but only in a virtual, non-paged-in sense.
This can be done with madvise(MADV_FREE) on Linux and FreeBSD. Windows has VirtualFree with MEM_DECOMMIT, or VirtualAlloc with MEM_RESET.
These are proactive releases, and the mechanisms implemented can be reused to serve reactive mechanisms like MDEV-24670.
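A minimal sketch of the release path, assuming a page-aligned frame and size (this is not the committed MariaDB code; the frame/size naming merely follows the madvise(frame, srv_page_size, MADV_FREE) call referenced in the linked issues). MEM_RESET rather than MEM_DECOMMIT is used on Windows so the region stays committed:

#ifdef _WIN32
# include <windows.h>
#else
# include <sys/mman.h>  /* madvise(): Linux and FreeBSD */
#endif
#include <stddef.h>

/* Release the physical memory behind an evicted page frame while
   keeping its virtual address range mapped into the process. */
static void frame_release_physical(void *frame, size_t size)
{
#ifdef _WIN32
  /* MEM_RESET: contents may be discarded under memory pressure; the
     region stays reserved and committed. flProtect is ignored for
     MEM_RESET but must still be a valid value. */
  VirtualAlloc(frame, size, MEM_RESET, PAGE_NOACCESS);
#else
  /* MADV_FREE: the kernel may reclaim these pages lazily; the
     mapping stays valid, and a later write cancels the free. */
  madvise(frame, size, MADV_FREE);
#endif
}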
Issue Links
causes
MDEV-31953 madvise(frame, srv_page_size, MADV_FREE) is causing a performance regression (Closed)
relates to
MDEV-29429"free" innodb pages from dropped table/index/(deleted?)/table rebuild
Open
MDEV-29431 SQL interface and Storage Engine interface for reclaiming memory
Daniel Black
added a comment - Used MADV_FREE (which will fail on huge-page memory) on a page-by-page basis.
MEM_RESET appears good, but would require MEM_RESET_UNDO when the memory is allocated again. Happy to include this too.
https://docs.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-virtualalloc
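A sketch of how that MEM_RESET / MEM_RESET_UNDO pairing could look, per the VirtualAlloc documentation linked above (the helper names are hypothetical and this is not the committed code):

#include <windows.h>

/* Frame contents no longer matter: let Windows discard the pages. */
static void frame_mem_reset(void *frame, SIZE_T size)
{
  /* flProtect is ignored for MEM_RESET but must be a valid value. */
  VirtualAlloc(frame, size, MEM_RESET, PAGE_NOACCESS);
}

/* Frame is being handed out again and its contents matter. */
static void frame_mem_reset_undo(void *frame, SIZE_T size)
{
  if (!VirtualAlloc(frame, size, MEM_RESET_UNDO, PAGE_NOACCESS))
  {
    /* Failure means at least part of the range was already
       discarded, so the caller must treat the frame as empty. */
  }
}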
Marko Mäkelä
added a comment - The change so far looks OK to me. For huge pages, I think that we could add something to actively free pages. Should there be some interface, say, a SET GLOBAL variable, to prompt immediate eviction of pages that are no longer useful? The choice of actions could be something like the following:
Evict all pages that became useless after DROP TABLE or a table rebuild.
Evict not-recently-used pages that are in a clean state and not in use (see the sketch after this comment).
Evict not-recently-used pages that are not in use.
Related to the "not-recently-used" above, there probably should be "minimum buffer pool size" parameter, so that some of the buffer pool could always be preserved. The usage scenario for that could be that most of the time, the working set is small, but occasionally more memory will be needed for running a larger query.
The retained pages could be relocated together, so that there could be a chance to free a number of adjacent buffers (entire huge pages) at once. Relocation is not possible if a page latch or a reference is being held by some thread.
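A hypothetical sketch of the second option above (a cold-end LRU walk freeing clean, unpinned pages); the struct fields, function names, and the release callback are all illustrative assumptions, not InnoDB's actual data structures:

#include <stdbool.h>
#include <stddef.h>
#include <time.h>

struct page {
  struct page *lru_prev;   /* next page towards the hot end */
  bool dirty;              /* has unflushed changes */
  int pins;                /* latches/references currently held */
  time_t last_access;
};

/* Walk from the cold end of the LRU, releasing frames that are
   clean, unpinned, and idle for at least idle_secs. */
size_t evict_cold_clean(struct page *lru_tail, time_t now,
                        time_t idle_secs,
                        void (*release)(struct page *))
{
  size_t freed = 0;
  for (struct page *p = lru_tail; p; p = p->lru_prev)
  {
    if (now - p->last_access < idle_secs)
      break;               /* everything further up is more recent */
    if (p->dirty || p->pins)
      continue;            /* the third option would flush these first */
    release(p);            /* e.g. madvise(MADV_FREE) the frame */
    ++freed;
  }
  return freed;
}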
Matthias Leich
added a comment - origin/bb-10.11-danielblack-MDEV-25341-innodb-politeness 6b76c3313fc3936f1b26aea6c2d8c5b9841bb042 2022-08-14T10:12:12+10:00
performed well in RQG testing. Failures observed are in the official trees too.
Daniel Black
added a comment - Test run 1:
Test of the updated PR on a laptop with 30G RAM and 16 vCPUs:
Started in the default configuration with a 20G InnoDB buffer pool and the datadir on tmpfs.
Prepared with sysbench --mysql-user=dan --mysql-db=test --mysql-socket=/tmp/build-mariadb-server-10.11.sock --threads=16 --time=600 --report-interval=20 --tables=32 --table_size=1000000 oltp_read_write prepare, though the server was restarted after the prepare before running.
On cold start, the server had 273 maps and 5.69G mapped in:
start of test
$ while [ 1 ] ; do echo "$(date) pages: $(wc -l < /proc/$(pidof mysqld)/maps) DRS,TRS,VSZ,RSS $(ps -C mysqld -o drs,trs,vsz,rss | tail -n 1)"; sleep 1 ; done
Sat 03 Sep 2022 09:56:57 AEST pages: 273 DRS,TRS,VSZ,RSS 24585907 9156 24595064 5691376
Sat 03 Sep 2022 09:56:58 AEST pages: 273 DRS,TRS,VSZ,RSS 24585907 9156 24595064 5691376
Sat 03 Sep 2022 09:56:59 AEST pages: 273 DRS,TRS,VSZ,RSS 24585907 9156 24595064 5691376
Sat 03 Sep 2022 09:57:00 AEST pages: 273 DRS,TRS,VSZ,RSS 24585907 9156 24595064 5691376
Sat 03 Sep 2022 09:57:01 AEST pages: 273 DRS,TRS,VSZ,RSS 24585907 9156 24595064 5691376
Sat 03 Sep 2022 09:49:00 AEST $sysbench --mysql-user=dan --mysql-db=test --mysql-socket=/tmp/build-mariadb-server-10.11.sock --threads=16 --time=600 --report-interval=20 --tables=32 --table_size=1000000 oltp_read_write run
sysbench 1.0.20 (using system LuaJIT 2.1.0-beta3)
Running the test with following options:
Number of threads: 16
Report intermediate results every 20 second(s)
Initializing random number generator from current time
Initializing worker threads...
Threads started!
[ 20s ] thds: 16 tps: 9096.27 qps: 181935.92 (r/w/o: 127357.04/36385.53/18193.34) lat (ms,95%): 2.91 err/s: 0.00 reconn/s: 0.00
[ 40s ] thds: 16 tps: 8904.07 qps: 178079.45 (r/w/o: 124655.16/35616.14/17808.14) lat (ms,95%): 3.07 err/s: 0.00 reconn/s: 0.00
[ 60s ] thds: 16 tps: 8853.59 qps: 177073.10 (r/w/o: 123951.41/35414.51/17707.18) lat (ms,95%): 3.07 err/s: 0.00 reconn/s: 0.00
[ 80s ] thds: 16 tps: 8864.84 qps: 177294.25 (r/w/o: 124105.51/35459.06/17729.68) lat (ms,95%): 3.07 err/s: 0.00 reconn/s: 0.00
[ 100s ] thds: 16 tps: 9133.53 qps: 182673.78 (r/w/o: 127872.70/36534.02/18267.06) lat (ms,95%): 2.97 err/s: 0.00 reconn/s: 0.00
[ 120s ] thds: 16 tps: 9184.83 qps: 183696.48 (r/w/o: 128587.20/36739.62/18369.66) lat (ms,95%): 2.97 err/s: 0.00 reconn/s: 0.00
[ 140s ] thds: 16 tps: 9212.69 qps: 184249.44 (r/w/o: 128973.62/36850.48/18425.34) lat (ms,95%): 2.97 err/s: 0.00 reconn/s: 0.00
[ 160s ] thds: 16 tps: 9239.05 qps: 184782.91 (r/w/o: 129348.30/36956.45/18478.15) lat (ms,95%): 2.97 err/s: 0.00 reconn/s: 0.00
[ 180s ] thds: 16 tps: 9213.78 qps: 184275.92 (r/w/o: 128993.06/36855.30/18427.55) lat (ms,95%): 2.97 err/s: 0.00 reconn/s: 0.00
[ 200s ] thds: 16 tps: 9180.36 qps: 183607.59 (r/w/o: 128525.48/36721.39/18360.72) lat (ms,95%): 2.97 err/s: 0.00 reconn/s: 0.00
[ 220s ] thds: 16 tps: 9192.66 qps: 183852.76 (r/w/o: 128697.00/36770.49/18385.27) lat (ms,95%): 2.97 err/s: 0.00 reconn/s: 0.00
[ 240s ] thds: 16 tps: 9265.44 qps: 185306.56 (r/w/o: 129714.25/37061.44/18530.87) lat (ms,95%): 2.97 err/s: 0.00 reconn/s: 0.00
Sysbench started at a consistent rate.
At this point the buffer pool was:
$mysql -u dan -S /tmp/build-mariadb-server-10.11.sock -e 'show global status like "innodb_buffer_pool_pages%"'
+-----------------------------------------+---------+
| Variable_name | Value |
+-----------------------------------------+---------+
| Innodb_buffer_pool_pages_data | 442706 |
| Innodb_buffer_pool_pages_dirty | 88191 |
| Innodb_buffer_pool_pages_flushed | 2960918 |
| Innodb_buffer_pool_pages_free | 855278 |
| Innodb_buffer_pool_pages_made_not_young | 542455 |
| Innodb_buffer_pool_pages_made_young | 902580 |
| Innodb_buffer_pool_pages_misc | 0 |
| Innodb_buffer_pool_pages_old | 163400 |
| Innodb_buffer_pool_pages_total | 1297984 |
| Innodb_buffer_pool_pages_lru_flushed | 0 |
| Innodb_buffer_pool_pages_lru_freed | 0 |
+-----------------------------------------+---------+
$mysql -u dan -S /tmp/build-mariadb-server-10.11.sock -e 'set global innodb_max_dirty_pages_pct=0'
At this time the sysbench tps stayed the same:
[ 260s ] thds: 16 tps: 9234.88 qps: 184701.26 (r/w/o: 129291.76/36939.68/18469.82) lat (ms,95%): 2.97 err/s: 0.00 reconn/s: 0.00
[ 280s ] thds: 16 tps: 9167.76 qps: 183354.80 (r/w/o: 128347.84/36671.49/18335.47) lat (ms,95%): 2.97 err/s: 0.00 reconn/s: 0.00
[ 300s ] thds: 16 tps: 9199.79 qps: 183995.78 (r/w/o: 128797.35/36798.81/18399.63) lat (ms,95%): 2.97 err/s: 0.00 reconn/s: 0.00
[ 320s ] thds: 16 tps: 9085.71 qps: 181714.43 (r/w/o: 127200.09/36342.93/18171.41) lat (ms,95%): 2.97 err/s: 0.00 reconn/s: 0.00
[ 340s ] thds: 16 tps: 9198.14 qps: 183962.28 (r/w/o: 128773.36/36792.63/18396.29) lat (ms,95%): 2.97 err/s: 0.00 reconn/s: 0.00
[ 360s ] thds: 16 tps: 9170.48 qps: 183407.56 (r/w/o: 128384.71/36681.90/18340.95) lat (ms,95%): 2.97 err/s: 0.00 reconn/s: 0.00
And there were no big changes in the virtual mapping of pages or RSS memory:
Sat 03 Sep 2022 10:01:54 AEST pages: 339 DRS,TRS,VSZ,RSS 25639583 9156 25648740 8061700
Sat 03 Sep 2022 10:01:55 AEST pages: 339 DRS,TRS,VSZ,RSS 25639583 9156 25648740 8070676
Sat 03 Sep 2022 10:01:56 AEST pages: 339 DRS,TRS,VSZ,RSS 25639583 9156 25648740 8078068
Sat 03 Sep 2022 10:01:57 AEST pages: 339 DRS,TRS,VSZ,RSS 25639583 9156 25648740 8087564
Sat 03 Sep 2022 10:01:58 AEST pages: 339 DRS,TRS,VSZ,RSS 25639583 9156 25648740 8099440
Sat 03 Sep 2022 10:01:59 AEST pages: 339 DRS,TRS,VSZ,RSS 25639583 9156 25648740 8108680
Sat 03 Sep 2022 10:02:00 AEST pages: 339 DRS,TRS,VSZ,RSS 25639583 9156 25648740 8117392
Sat 03 Sep 2022 10:02:01 AEST pages: 339 DRS,TRS,VSZ,RSS 25639583 9156 25648740 8125312
Sat 03 Sep 2022 10:02:02 AEST pages: 339 DRS,TRS,VSZ,RSS 25639583 9156 25648740 8133760
Sat 03 Sep 2022 10:02:03 AEST pages: 339 DRS,TRS,VSZ,RSS 25639583 9156 25648740 8142208
Sat 03 Sep 2022 10:02:04 AEST pages: 339 DRS,TRS,VSZ,RSS 25639583 9156 25648740 8150656
Sat 03 Sep 2022 10:02:05 AEST pages: 339 DRS,TRS,VSZ,RSS 25639583 9156 25648740 8158840
Sat 03 Sep 2022 10:02:06 AEST pages: 339 DRS,TRS,VSZ,RSS 25639583 9156 25648740 8167552
Just before the next step, the InnoDB buffer pool was:
mysql -u dan -S /tmp/build-mariadb-server-10.11.sock -e 'show global status like "innodb_buffer_pool_pages%"'
+-----------------------------------------+---------+
| Variable_name | Value |
+-----------------------------------------+---------+
| Innodb_buffer_pool_pages_data | 518717 |
| Innodb_buffer_pool_pages_dirty | 88487 |
| Innodb_buffer_pool_pages_flushed | 4978087 |
| Innodb_buffer_pool_pages_free | 779267 |
| Innodb_buffer_pool_pages_made_not_young | 585320 |
| Innodb_buffer_pool_pages_made_young | 1172425 |
| Innodb_buffer_pool_pages_misc | 0 |
| Innodb_buffer_pool_pages_old | 191459 |
| Innodb_buffer_pool_pages_total | 1297984 |
| Innodb_buffer_pool_pages_lru_flushed | 0 |
| Innodb_buffer_pool_pages_lru_freed | 0 |
+-----------------------------------------+---------+
This is the point where memory was squeezed - 26G of locked memory:
$sudo ./target/release/eatmemory -m --alloc-definition=1 -s 26000
Killed
Killed by the OOM killer.
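For reference, eatmemory here is a separate squeeze tool (the flags above are its own); a minimal C equivalent of what such a tool does might look like:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(int argc, char **argv)
{
  /* Size to hold, in MiB (e.g. 26000 as in the test above). */
  size_t mib = argc > 1 ? strtoull(argv[1], NULL, 10) : 1024;
  size_t len = mib << 20;
  char *p = malloc(len);
  if (!p)
    return 1;
  mlock(p, len);      /* best effort; locking needs CAP_IPC_LOCK */
  memset(p, 1, len);  /* fault every page in */
  printf("holding %zu MiB\n", mib);
  pause();            /* hold the memory until killed */
  return 0;
}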
The server's RSS then reduced quickly until the server itself was killed:
Sat 03 Sep 2022 10:03:57 AEST pages: 339 DRS,TRS,VSZ,RSS 25639583 9156 25648740 9029708
Sat 03 Sep 2022 10:03:58 AEST pages: 339 DRS,TRS,VSZ,RSS 25639583 9156 25648740 8214668
Sat 03 Sep 2022 10:03:59 AEST pages: 339 DRS,TRS,VSZ,RSS 25639583 9156 25648740 6950240
Sat 03 Sep 2022 10:04:01 AEST pages: 339 DRS,TRS,VSZ,RSS 25639583 9156 25648740 5461480
Sat 03 Sep 2022 10:04:02 AEST pages: 339 DRS,TRS,VSZ,RSS 25639583 9156 25648740 4832316
Sat 03 Sep 2022 10:04:03 AEST pages: 339 DRS,TRS,VSZ,RSS 25639583 9156 25648740 4605364
Sat 03 Sep 2022 10:04:05 AEST pages: 339 DRS,TRS,VSZ,RSS 25639583 9156 25648740 4587420
Sat 03 Sep 2022 10:04:06 AEST pages: 339 DRS,TRS,VSZ,RSS 25639583 9156 25648740 4559304
Sat 03 Sep 2022 10:04:07 AEST pages: 339 DRS,TRS,VSZ,RSS 25639583 9156 25648740 4538528
Sat 03 Sep 2022 10:04:08 AEST pages: 339 DRS,TRS,VSZ,RSS 25639583 9156 25648740 4489348
Sat 03 Sep 2022 10:04:10 AEST pages: 339 DRS,TRS,VSZ,RSS 25639583 9156 25648740 4446184
Sat 03 Sep 2022 10:04:13 AEST pages: 339 DRS,TRS,VSZ,RSS 25639583 9156 25648740 4385052
Sat 03 Sep 2022 10:04:14 AEST pages: 339 DRS,TRS,VSZ,RSS 25639583 9156 25648740 4345644
Sat 03 Sep 2022 10:04:15 AEST pages: 339 DRS,TRS,VSZ,RSS 25639583 9156 25648740 4296496
Sat 03 Sep 2022 10:04:17 AEST pages: 339 DRS,TRS,VSZ,RSS 25639583 9156 25648740 4226108
Sat 03 Sep 2022 10:04:19 AEST pages: 339 DRS,TRS,VSZ,RSS 25639583 9156 25648740 4157748
Sat 03 Sep 2022 10:04:21 AEST pages: 339 DRS,TRS,VSZ,RSS 25639583 9156 25648740 4179292
Sat 03 Sep 2022 10:04:22 AEST pages: 339 DRS,TRS,VSZ,RSS 25639583 9156 25648740 4188908
Sat 03 Sep 2022 10:04:23 AEST pages: 339 DRS,TRS,VSZ,RSS 25639583 9156 25648740 4196520
Sat 03 Sep 2022 10:04:24 AEST pages: 339 DRS,TRS,VSZ,RSS 25639583 9156 25648740 4209808
Sat 03 Sep 2022 10:04:26 AEST pages: 339 DRS,TRS,VSZ,RSS 25639583 9156 25648740 4235948
Sat 03 Sep 2022 10:04:27 AEST pages: 339 DRS,TRS,VSZ,RSS 25639583 9156 25648740 4253008
Sat 03 Sep 2022 10:04:28 AEST pages: 339 DRS,TRS,VSZ,RSS 25639583 9156 25648740 4262616
Sat 03 Sep 2022 10:04:29 AEST pages: 339 DRS,TRS,VSZ,RSS 25639583 9156 25648740 4280436
Sat 03 Sep 2022 10:04:30 AEST pages: 339 DRS,TRS,VSZ,RSS 25639583 9156 25648740 4302776
bash: /proc//maps: No such file or directory
The expected sysbench drop-off:
[ 380s ] thds: 16 tps: 8508.37 qps: 170170.18 (r/w/o: 119120.25/34033.19/17016.74) lat (ms,95%): 3.19 err/s: 0.00 reconn/s: 0.00
[ 400s ] thds: 16 tps: 1358.79 qps: 27175.66 (r/w/o: 19022.89/5435.20/2717.58) lat (ms,95%): 44.98 err/s: 0.00 reconn/s: 0.00
FATAL: mysql_stmt_execute() returned error 2013 (Lost connection to server during query) for query 'SELECT SUM(k) FROM sbtest10 WHERE id BETWEEN ? AND ?'
perf was recording from early in the test:
$ perf record -g -p $(pidof mysqld) -F 16
[ perf record: Woken up 5732801 times to write data ]
[ perf record: Captured and wrote 18.911 MB perf.data (82199 samples) ]
Of all the samples, a few notable functions related to this PR:
0.00% 0.00% mysqld mariadbd [.] buf_LRU_get_free_only
buf_LRU_block_free_non_file_page, set_os_used, set_os_unused (didn't show up in any samples)
0.00% 0.00% mysqld libc.so.6 [.] __madvise
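Guessing at how those helpers pair up on Linux (the names set_os_used/set_os_unused come from the profile above; the bodies below are assumptions, not the patch itself): with MADV_FREE, a later write to the range cancels the pending reclaim, which would explain why the "used" direction is cheap enough to never show up in samples.

#include <stddef.h>
#include <sys/mman.h>

/* Frame evicted: let the kernel reclaim its pages lazily. */
static inline void set_os_unused(void *frame, size_t size)
{
  madvise(frame, size, MADV_FREE);
}

/* Frame reused: on Linux nothing is needed, since the first write
   after MADV_FREE revives the page (zero-filled if reclaimed). */
static inline void set_os_used(void *frame, size_t size)
{
  (void) frame;
  (void) size;
}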
OOM killer:
dmesg
[62583.817703] perf: interrupt took too long (5130 > 5031), lowering kernel.perf_event_max_sample_rate to 38000
[62594.988081] perf: interrupt took too long (6425 > 6412), lowering kernel.perf_event_max_sample_rate to 31000
[63089.747836] Purging GPU memory, 0 pages freed, 0 pages still pinned, 2112 pages left available.
[63089.747919] slack invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=200
[63089.747922] CPU: 4 PID: 3771 Comm: slack Tainted: G W ------- --- 6.0.0-0.rc2.19.fc38.x86_64 #1
[63089.747924] Hardware name: Acer Nitro AN515-57/Scala_TLS, BIOS V1.17 06/02/2022
[63089.747925] Call Trace:
[63089.747926] <TASK>
[63089.747928] dump_stack_lvl+0x44/0x5c
[63089.747932] dump_header+0x4a/0x1ff
[63089.747934] oom_kill_process.cold+0xb/0x10
[63089.747936] out_of_memory+0x1be/0x4f0
[63089.747939] __alloc_pages_slowpath.constprop.0+0xc17/0xd50
[63089.747943] __alloc_pages+0x201/0x220
[63089.747944] __folio_alloc+0x17/0x50
[63089.747946] ? policy_node+0x51/0x70
[63089.747948] vma_alloc_folio+0x88/0x320
[63089.747949] do_swap_page+0x7ac/0xb40
[63089.747952] ? psi_task_change+0x83/0x120
[63089.747955] ? __mod_node_page_state+0x72/0xc0
[63089.747957] __handle_mm_fault+0x837/0xe40
[63089.747959] handle_mm_fault+0xae/0x290
[63089.747961] do_user_addr_fault+0x1cd/0x690
[63089.747964] exc_page_fault+0x70/0x170
[63089.747966] asm_exc_page_fault+0x22/0x30
As expected, user-space faulting of a memory address caused it (eatmemory):
[63089.748021] Mem-Info:
[63089.748021] active_anon:25995 inactive_anon:28763 isolated_anon:0
active_file:1535 inactive_file:1417 isolated_file:0
unevictable:6585743 dirty:0 writeback:3
slab_reclaimable:40034 slab_unreclaimable:80365
mapped:3943 shmem:26303 pagetables:22102 bounce:0
kernel_misc_reclaimable:0
free:50492 free_pcp:63 free_cma:0
Decent reclaimable amount.
[63089.748056] 34000 total pagecache pages
[63089.748056] 4267 pages in swap cache
[63089.748057] Free swap = 2264564kB
[63089.748057] Total swap = 16502776kB
[63089.748058] 8332110 pages RAM
[63089.748058] 0 pages HighMem/MovableOnly
[63089.748058] 171770 pages reserved
[63089.748059] 0 pages cma reserved
[63089.748059] 0 pages hwpoisoned
[63089.748059] Tasks state (memory values in pages):
...
...
[63089.748296] [ 171566] 0 171566 6657181 6584482 52830208 39 200 eatmemory
[63089.748298] [ 172293] 0 172293 55367 106 69632 25 0 sleep
[63089.748299] [ 172298] 1000 172298 91841 592 368640 25168 200 perf
[63089.748300] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=user.slice,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-7b4a5622-7723-4011-9c12-8a95251f2509.scope,task=eatmemory,pid=171566,uid=0
[63089.748310] Out of memory: Killed process 171566 (eatmemory) total-vm:26628724kB, anon-rss:26336484kB, file-rss:1444kB, shmem-rss:0kB, UID:0 pgtables:51592kB oom_score_adj:200
[63091.796602] oom_reaper: reaped process 171566 (eatmemory), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
(oddly, mysqld/mariadbd didn't appear here, though the terminal reported a kill -9)
As test 1 used < 50% of the buffer pool, and most of the impacts of this change are write-based, let's look at the impacts under a memory-constrained, high-write test.
Matthias Leich
added a comment - origin/bb-10.11-danielblack-MDEV-25341-innodb-politeness e9d2ef69408ab7c8c64330f7c359433e41e61c1e 2022-08-31T13:49:33+03:00
performed well in RQG testing.