[MDEV-24670] avoid OOM by linux kernel co-operative memory management
Created: 2021-01-25  Updated: 2024-01-31  Resolved: 2023-11-18

| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Fix Version/s: | 10.11.7 |
| Type: | Task | Priority: | Critical |
| Reporter: | Daniel Black | Assignee: | Daniel Black |
| Resolution: | Fixed | Votes: | 2 |
| Labels: | energy |
| Issue Links: |
| Description |
|
Allocating enough memory, but not too much, has been a challenge for MariaDB and MySQL users for a long time: overallocation creates a risk of the OOM killer terminating the server process. The Linux kernel provides a memory pressure notification interface (PSI, /proc/pressure/memory). On receiving a memory pressure event, the server has a number of response options.

The configuration would need trigger values, "<some|full> <stall amount in us> <time window in us>", written to the kernel interface, and a response policy for "some"/"full" events.
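The trigger registration described above can be sketched as follows. This is a minimal illustration of the documented /proc/pressure/memory trigger protocol on Linux; `psi_format_trigger` and `psi_register_trigger` are hypothetical names, not MariaDB functions, and the stall/window values are examples only.

```c
/* Sketch: registering a Linux PSI memory-pressure trigger.
 * Requires a kernel built with CONFIG_PSI; the returned fd reports
 * POLLPRI via poll() whenever the stall threshold is exceeded. */
#include <assert.h>
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Build the "<some|full> <stall us> <window us>" trigger string. */
int psi_format_trigger(char *buf, size_t len, const char *kind,
                       unsigned stall_us, unsigned window_us)
{
  return snprintf(buf, len, "%s %u %u", kind, stall_us, window_us);
}

/* Open /proc/pressure/memory and arm a trigger; returns an fd to
 * poll() for POLLPRI, or -1 if PSI is unavailable. */
int psi_register_trigger(const char *kind, unsigned stall_us,
                         unsigned window_us)
{
  char trig[64];
  psi_format_trigger(trig, sizeof trig, kind, stall_us, window_us);
  int fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
  if (fd < 0)
    return -1;
  /* The kernel expects the trailing NUL to be written as well. */
  if (write(fd, trig, strlen(trig) + 1) < 0) {
    close(fd);
    return -1;
  }
  return fd;
}
```

An event loop would then poll() the fd for POLLPRI and respond by releasing caches; a trigger of "some 150000 1000000" means "notify me when tasks were stalled on memory for at least 150 ms within any 1 s window".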
| Comments |
| Comment by Marko Mäkelä [ 2021-03-25 ] |

Note that in an effort to speed up operations like DROP INDEX or DROP TABLE, InnoDB does not actively move garbage pages to buf_pool.free. This is even more so with
| Comment by Damien Ransome [ 2021-03-31 ] |

The currently observed behaviour of MariaDB is to grab memory and never release it. This issue proposes to release unused memory in the event of memory pressure, but I would like to propose that you take that further by releasing memory on a proactive basis (periodic garbage collection?). When a user segments their server into containers, MariaDB is typically in its own container. However, the underlying hardware system still benefits from being able to allocate unused RAM to other containers, or even to OS-level tasks such as cache. MariaDB should treat memory as an elastic resource to match the way containerised hardware is used, and in doing so facilitate more efficient utilisation of the underlying hardware. We have seen a similar transformation with respect to the JVM, and it would be really great to have MariaDB embrace resource management in a more "container friendly" way too.
| Comment by Daniel Black [ 2021-04-02 ] |

Fair call, damien. I was lucky enough to have temporary access to a multi-TB RAM machine, and it took several minutes (10-20 minutes, maybe) just to allocate the buffer pool (with the default chunk size). Point taken about proactive release. If you've seen any good descriptions of what the JVM does, I'm happy to take a read. What kind of tuning parameters would you like to see, from a user perspective, to trade off the overhead of memory management against static allocation?
| Comment by Damien Ransome [ 2021-04-05 ] |

danblack these may help:

I think this depends heavily on the implementation specifics - for example, the size of the overhead. It might be appropriate to enable conservative memory management at a number of different points/actions (such as the DROP INDEX and DROP TABLE cases referenced earlier in the comments), in which case a flag for each of those could be one option. And/or it might be desirable to have some form of periodic garbage collection, in which case I'd imagine some sort of gc.probability 0-100 scale?
| Comment by Daniel Black [ 2021-04-08 ] |

Nice read, thanks. Here is a MiniFest reference on the kinds of tuning knobs that Jelastic offers: https://mariadb.org/ha-minifest2021/jelastic-paas/ - 3:12:15
| Comment by Vladislav Vaintroub [ 2021-04-27 ] |

madvise-ing and VirtualAlloc-ing too often might slow the server down considerably. I have seen in the past (granted, only in a Windows environment) that changing memory attributes (in that case, VirtualAlloc with either PAGE_NOACCESS or PAGE_READWRITE) for subregions of a large memory allocation caused quite a slowdown. I'm not an expert on such things, but IIRC this was due to increased TLB access time, which is not really Windows specific.
| Comment by Daniel Black [ 2021-04-27 ] |

Makes sense. The purpose of
| Comment by Daniel Black [ 2021-06-10 ] |

I had a chat with Balbir Singh, who reviewed the Linux kernel patches around memory pressure and who, three years later, still thinks it is a good interface for MariaDB to use. He noted that because it is still asynchronous, you may still be in trouble by the time you get to processing the memory event; however, that's no worse than where we are now.
| Comment by Michael Widenius [ 2021-08-30 ] |

A couple of comments:

The only way I know of to reliably give memory areas back

One possible problem with this is that one can resize the memory to be smaller, but not bigger again. For doing it bigger, one needs to allocate
| Comment by Daniel Black [ 2021-09-07 ] |

I hadn't really considered the thread cache, but there might be opportunities there. This bug was more about trying to auto-downsize things rather than relying on scripted/manual FLUSH TABLES external intervention.

The more I think about it, the idea is to scale back the LRU time period (across a broad range of allocations) while memory pressure exists, e.g. capping cache time to a maximum of 5 minutes while "some" pressure exists, and reverting when it drops, as a pseudo goal.

madvise is generally used with mmap areas, but that's not a requirement at all. The glibc malloc uses MADV_DONTNEED internally as it shrinks structures. I agree it's all rather circumstantial what effect a malloc-based free has on the actual allocation. Looking at the code, free in the glibc malloc code will under some circumstances do a MADV_DONTNEED, and under some, I think large allocations, even munmap. Using malloc_trim is a more direct way to get MADV_DONTNEED called on some areas, and it seems to be the library call that incurs the fewest complicating side effects. Let me rebase bb-10.6-danielblack-MDEV-25424-my_multi_malloc-large-use-my_large_malloc again, as there are some conflicts, and we'll get a few more areas mmaped in preparation.

> One possible problem with this is that one can resize the memory to be smaller, but not bigger again. For doing it bigger, one needs to allocate

Not quite true. With MADV_FREE the kernel can deallocate the pages, but it keeps the virtual address mapping in place. At some point later you may end up with a page of 0s; if you do, writing to it preserves it again, at least as I understand it (there is a small ambiguity currently around page vs address/length, but there's code and people to ask). So as long as these semantics are handled on arriving at a zeroed page, the memory remains usable. MADV_DONTNEED needs more semantics, like MADV_WILLNEED, to reverse the operation.

> I am also not sure of the disadvantages of using mmap areas for the bufferpool from the cpu's point of view.

None really; they give the kernel instructions for handling an allocation, with a few flags constraining the intended usage.
| Comment by Daniel Black [ 2022-02-21 ] |

Note: for memory hogging tests - https://github.com/fritshoogland-yugabyte/eatmemory-rust
| Comment by Marko Mäkelä [ 2022-06-21 ] |

A couple more random thoughts:

When it comes to the implementation, VirtualAllocEx() on Microsoft Windows looks like it could allow the physical memory allocation of a virtual address range to be modified.
| Comment by Daniel Black [ 2022-08-13 ] |

Windows pressure events:
| Comment by Vladislav Vaintroub [ 2022-08-13 ] |

danblack, wait functions do not work with completion ports. We do have a place (the main thread) where we WaitForMultipleObjects; maybe you want to put that notification there. Alas, the original bug report is all about Linux and OOM kills (those do not happen on Windows, although paging is of course a problem).
| Comment by Daniel Black [ 2022-09-02 ] |

Marko's random thoughts above have been moved to separate MDEVs. serg, I started a proof of concept by extending the handler interface of storage engines. Other plugins are possible too; however, I was less certain about extending those. https://github.com/grooverdan/mariadb-server/commit/d2fd183689532ff92309eb724c9edd29c6237b5a It works, in that memory_pressure is triggered when memory is short. There's still a large list of TODOs in the commit message. To what extent is this concept acceptable/not acceptable?
| Comment by Sergei Golubchik [ 2022-09-11 ] |

What is this for?

So what is the use case?
| Comment by Daniel Black [ 2022-09-12 ] |

This puts MariaDB back on a comparable usability footing with Postgres, which relies on the OS page cache. Because the OS kernel implements the page cache for those using buffered IO, MariaDB with its more hands-on memory management needs equivalent handling. By adding a bit more tolerance, we handle a variety of user mistakes a little more gracefully. This isn't designed with the memory leak case in mind. A dedicated server can increase memory use in a number of cases:

A shared hosting server using MariaDB will have a significant number of users, and normally co-located PHP instances. While individual memory limits exist on the non-MariaDB elements, it's quite hard for a provider to determine the memory requirement of the absolute worst-case scenario (like a DoS on one or more of its users). A provider is likely to be more receptive to the database freeing memory to avoid the OOM.

There are still a large number of new users running their own VMs who give MariaDB 80% of available memory and gloss over the meaning of "available". They also often don't adequately reduce this amount for a high number of concurrent users, and they try to do far too much on 1-2G of RAM; so let's tolerate this and educate. By detecting memory pressure, and counting events in a global status variable, we expose the overallocation and protect the uptime of servers as well.

There's a lot we can do proactively on explicitly freeing unused items. There's still an amount of truly explicit cache in the buffer pool that is intentionally kept around in case of multiple changes to the pages; flushing on a memory pressure event allows a few more GB to be saved, only when needed. True cached elements like thread caches, query caches and InnoDB pages could have a common release-under-pressure tolerance system variable (300 seconds?); together they add up to a reasonable amount of memory with low impact. I think there's a point where we still get some performance gains, so the tolerance for pre-emptive release isn't as strong until there's a pressure event.

Hooking explicit triggers like FLUSH LOW MEMORY provides an entrypoint for VM migration/container checkpoint-restore to reduce the quantity of pages saved/moved.

I'm not too worried about desktop use, though it might be a bit beneficial to the application developer with a local DB who runs tests and doesn't want to downsize MariaDB's memory use to avoid OOM.

The ability to react to pressure by default gives us the option of raising system variables like innodb_buffer_pool_size, which hasn't changed since 8a3ea85c921c 13 years ago (8M -> 128M), to give users out-of-the-box performance everywhere, without worrying too much whether it will OOM on low-end hardware, be that a VM, a NAS, or a crowded container environment.
| Comment by Michael Widenius [ 2022-11-15 ] |

I think we are missing the big picture here.

Instead of trying to reduce page buffers (which may just hide the real problem), I would suggest that we first focus on:

Things that could be improved in the server to keep memory usage lower:
| Comment by Marko Mäkelä [ 2022-11-16 ] |

monty, I believe that this is a first step towards simplifying InnoDB buffer pool resizing. I think that we should borrow a trick from AddressSanitizer and make use of the typically 48-bit virtual address space (256 terabytes) of contemporary 64-bit processors. That is, at InnoDB startup, simply allocate the virtual memory addresses corresponding to the maximum innodb_buffer_pool_size, and then request the operating system to map physical memory corresponding to the currently requested innodb_buffer_pool_size. This task implements the necessary operating system interfaces on Linux and Microsoft Windows.

This task would also allow future changes to make InnoDB more 'polite' in 'overprovisioned' installations where many database installations are deployed on a single server, each in their own Docker container. If one of the database servers suddenly sees increased activity, the more idle servers could free up memory by shrinking their buffer pool allocation. This could allow hosting providers to make much more efficient use of resources.
| Comment by Sergei Golubchik [ 2022-11-16 ] |

In a sense, this topic fits nicely into a self-tuning server, which configures all buffers depending on the amount of available memory, and if the latter changes, reconfigures buffers as needed. But in that case you need to start from some kind of scheme or plan for how to configure all buffers given a specific amount of available memory, and then continue from there. I don't mean you actually need to implement auto-configuration of all buffers at this step. But you need to know what size each buffer should get, even if only on paper, instead of arbitrarily and ad hoc reducing random memory buffers whenever the kernel signals pressure.
| Comment by Max Mether [ 2022-11-17 ] |

Wouldn't a starting point be to not allocate more memory when we get a certain signal from the kernel? Would that be a valid first step?
| Comment by Daniel Black [ 2022-11-17 ] |

> Freeing memory to the OS in this case helps nothing and will likely make things worse as it may be hard to get back the resources when needed

No: MADV_FREE on a cache item will allow the kernel to reclaim it, but the virtual memory address stays the same, so if you get back a page of 0s, it got reclaimed and you treat it like a cache miss.

> ...allocated too big buffer or resources to various MariaDB components (heap tables, sort buffers, etc) and MariaDB uses up all system memory.

Agree. It takes fair expertise to tune this correctly.

> In this case freeing resources to system will over time just make things worse as it makes it more likely that this will happen again.

I was planning on freeing them temporarily and, like Serg said, autosizing down. We can show counts of the events and the reclaim/autoresize activities performed, and record them in the error log. Continuing to run allows a person to take notice, look at the state of a still-running server, and see what could be trimmed in size. If the server was OOM-killed, it's harder to constructively analyze what was and wasn't being used with regard to memory.

> reduce page buffers (that may just hide the real problem),

Whether this would reduce size is somewhat optional; only in an autotune case, facilitated by an autosize system variable, might an LRU part of the buffers be purged.

> I would suggest us to first focus on ...

These are the sorts of things MySQL Tuner does already. That works best on a running server with lots of state. I've submitted changes there before to give better recommendations, and that can continue.

> InnoDB more 'polite' in 'overprovisioned' installation

The hosting service provider or cloud database provider isn't the user of the database, but does share some of the impact, justified or not, of the user's "overuse" of the database. Keeping the database available in a hostile environment is a good gain.

> In a sense, this topic fits nicely into a self-tuning server,

Sure, this makes a lot of sense.

There can be a lot of smarts put into the handlers to autosize down based on use, and/or to limit the response to changing autosized variables rather than items explicitly configured by users.

> Wouldn't a starting point be not to allocate more memory when we get a certain signal from the kernel?

Sure, the same kernel interface is an indication of memory constraints. Going from serg's autosizing concept, this could be the indicator that now we know the maximum InnoDB buffer/thread pool size and stop there (and maybe take a few small steps back).
| Comment by Marko Mäkelä [ 2022-11-18 ] |

I agree that automatic dynamic scaling of buffers is challenging to implement in a heavily used system, especially when there are multiple database processes running on the same system, unaware of each other. Sometimes it could be a lesser evil to let the kernel randomly kill one of several database instances instead of letting all database instances trade memory pressure for I/O pressure: yes, it would avoid the OOM kill, but the throughput could be chaotic. The automatic scaling would definitely have to be an opt-in feature. I think that enabling it would make sense where there are lots of mostly-idle or read-mostly database instances deployed in a shared hosting environment. It would seem to depend on MDEV-19895, which is for automatic static scaling (choosing sensible buffer sizes based on one parameter). As far as I understand, once the memory pressure reported by the Linux kernel interface goes away, memory usage would eventually be restored to the normal level.

When it comes to the InnoDB buffer pool, I still think that it would be a great idea to allocate all pages in a contiguous virtual address range corresponding to the maximum supported size. This would achieve a finer granularity than the current allocation in multiple chunks. I would like to improve the memory usage of the InnoDB buffer pool; currently there is no control over it apart from SET GLOBAL innodb_buffer_pool_size.

I see that there already is a parameter innodb_old_blocks_pct, which could play a role here. Maybe we would just want one command to evict all "old" blocks, defragment buf_pool.free, and invoke madvise(MADV_FREE) on it. That command could simply be SET GLOBAL innodb_old_blocks_pct=…; that is, the heavy cleanup could be triggered on any assignment of that parameter. I think that we should implement the madvise() calls independently of the memory pressure event interface.
| Comment by Marko Mäkelä [ 2022-11-18 ] |

I see that a subset of the "light variant" in my previous comment was implemented in
| Comment by Daniel Black [ 2023-10-24 ] |

Looking to reuse the code as an alternate trigger to marko's fix in MDEV-31593.
| Comment by Matthias Leich [ 2023-11-01 ] |
| Comment by Daniel Black [ 2023-11-02 ] |

Thank you mleich for testing. For two executions of buf_pool_t::garbage_collect to overlap, something must be unlocking buf_pool.mutex. Since buf_pool_t::garbage_collect is only meant to be executed one at a time, maybe it needs its own mutex. Only a debug build can trigger this at the moment, so the mutex would be uncontended.
| Comment by Marko Mäkelä [ 2023-11-02 ] |

The assertion failure occurs in the buf_pool_t::garbage_collect() that I contributed. I checked the rr replay trace. It is invoking this function several times per second, most of the time freeing no pages. The trace involves the adaptive hash index, which was disabled by default in

At this time, the garbage collector was busy removing some adaptive hash index entries on a different block:

I had overlooked the fact that buf_LRU_free_page() is indeed releasing and reacquiring buf_pool.mutex in this code path:

Due to this, we must use buf_pool.lru_hp to safely traverse the buf_pool.LRU list, similar to how buf_flush_LRU_list_batch() does it, but with a jump to the start of the eviction loop if the hazard pointer was invalidated. I believe that this handling is only needed #ifdef BTR_CUR_HASH_ADAPT.
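The restart-on-invalidation pattern described above can be illustrated with a toy list and hazard pointer. This is only a sketch of the control flow, not InnoDB's buf_pool code; all names (`hazard`, `traverse`, `list_free_node`) are hypothetical, and the "concurrent free while the mutex is released" is simulated by a callback in a single thread.

```c
#include <assert.h>
#include <stddef.h>

struct node { struct node *prev; int freed; };

/* Hazard pointer: records the next node to visit so that, if another
 * thread frees exactly that node while the mutex is released, the
 * traversal can detect it and restart instead of following a
 * dangling pointer. */
struct hazard { struct node *at; int invalidated; };

/* Simulated concurrent free: if the victim is the hazard-pointed
 * node, mark the hazard pointer invalid rather than leaving it
 * dangling. */
void list_free_node(struct hazard *hp, struct node *victim)
{
  if (hp->at == victim)
    hp->invalidated = 1;
  victim->freed = 1;
}

/* Count live nodes walking from the tail; jump back to the start of
 * the loop whenever the hazard pointer was invalidated. */
int traverse(struct node *tail, struct hazard *hp,
             void (*during)(struct hazard *, struct node *),
             struct node *victim)
{
  int visited;
restart:
  visited = 0;
  for (struct node *n = tail; n; n = hp->at) {
    hp->at = n->prev;           /* publish before "releasing the mutex" */
    if (during && victim) {     /* simulate work done while unlocked */
      during(hp, victim);
      victim = NULL;
    }
    if (!n->freed)
      visited++;
    if (hp->invalidated) {      /* a concurrent free hit our pointer */
      hp->invalidated = 0;
      goto restart;
    }
  }
  return visited;
}
```

The key property is that the traversal never dereferences a node freed behind its back: it either holds the lock or has published the next node in the hazard pointer, and any invalidation forces a clean restart.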