[MDEV-24670] avoid OOM by linux kernel co-operative memory management Created: 2021-01-25  Updated: 2024-01-31  Resolved: 2023-11-18

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Fix Version/s: 10.11.7

Type: Task Priority: Critical
Reporter: Daniel Black Assignee: Daniel Black
Resolution: Fixed Votes: 2
Labels: energy

Issue Links:
Blocks
blocks MDEV-31953 madvise(frame, srv_page_size, MADV_FR... Closed
Problem/Incident
causes MDEV-33340 fix for MDEV-24670 causes a performan... Closed
Relates
relates to MDEV-25340 Server startup with large innodb_buff... Open
relates to MDEV-25341 innodb buffer pool soft decommit of m... Closed
relates to MDEV-29431 SQL interface and Storage Engine inte... Open
relates to MDEV-29432 innodb huge pages reclaim Open
relates to MDEV-19895 Support "autoset" in SET GLOBAL for A... Open
relates to MDEV-25342 autosize innodb_buffer_pool_chunk_size Closed
relates to MDEV-27772 Performance regression with default c... Open
relates to MDEV-30762 Memory Limit reached causes mariadb s... Open

 Description   

Allocating sufficient memory, but not too much, has been a challenge for MariaDB and MySQL users for a significant time.

Over-allocation results in a risk of the server process being OOM-killed.

The Linux kernel provides this interface:
https://www.kernel.org/doc/html/latest/accounting/psi.html

On receiving a memory pressure event, the options are:

  • madvise(MADV_DONTNEED/MADV_FREE) on buf_pool.free
  • Moving innodb buffer pool LRU chunks to free
  • Resizing down the innodb buffer pool
  • Killing off thread cache thread(s)?
  • Table caches (are these large?)

So the configuration would need trigger values, "<some|full> <stall amount in us> <time window in us>", for the kernel interface, and a response for "some"/"full".



 Comments   
Comment by Marko Mäkelä [ 2021-03-25 ]

Note that in an effort to speed up operations like DROP INDEX or DROP TABLE, InnoDB is not actively moving garbage pages to buf_pool.free. This is even more so with MDEV-22456 and MDEV-23399. Upon receiving a memory pressure event, we should actively shrink the buf_pool.LRU list.

Comment by Damien Ransome [ 2021-03-31 ]

The current observed behaviour of MariaDB is to grab memory and never release it.

This issue proposes to release unused memory in the event of memory pressure, but I would like to propose that you take that further by releasing memory on a proactive basis (periodic garbage collection?).

When a user segments their server into containers, MariaDB is typically in its own container. However, the underlying hardware system still benefits from being able to allocate unused RAM to other containers, or even to OS-level tasks such as cache. MariaDB should treat memory as an elastic resource to match the way containerised hardware is used, and in doing so facilitate more efficient utilisation of the underlying hardware.

We have seen a similar transformation with respect to JVM, and it would be really great to have MariaDB embrace resource-management in a more "container friendly" way too.

Comment by Daniel Black [ 2021-04-02 ]

Fair call, damien. I was lucky enough to have temporary access to a multi-TB RAM machine, and it took several minutes (10-20 mins maybe) even to allocate the buffer pool (on the default chunk size).

Point taken about proactive release.

If you've seen any good descriptions of what JVM does I'm happy to take a read.

What kind of tuning parameters would you like to see from a user perspective to tradeoff the overhead of memory management vs static allocation?

Comment by Damien Ransome [ 2021-04-05 ]

> If you've seen any good descriptions of what JVM does I'm happy to take a read.

danblack, these may help:

> What kind of tuning parameters would you like to see from a user perspective to tradeoff the overhead of memory management vs static allocation?

I think this depends heavily on the implementation specifics - for example the size of overhead.

It might be appropriate to enable conservative memory management at a number of different points/actions (such as the DROP INDEX and DROP TABLE cases referenced earlier in the comments), in which case a flag for each of those could be one option?

And/or, it might be desirable to have some form of periodic garbage collection, in which case I'd imagine some sort of gc.probability 0-100 scale?

Comment by Daniel Black [ 2021-04-08 ]

Nice read. Thanks.

ha-minifest reference on the kinds of tuning knobs that Jelastic offers: https://mariadb.org/ha-minifest2021/jelastic-paas/ - 3:12:15

Comment by Vladislav Vaintroub [ 2021-04-27 ]

madvise and VirtualAlloc calls done too often might slow down the server considerably. I have seen in the past (granted, only in a Windows environment) that changing memory attributes (in that case, VirtualAlloc with either PAGE_NOACCESS or PAGE_READWRITE) for subregions of a large memory allocation caused quite a slowdown. I'm not an expert on such things, but IIRC this was increased TLB access time, which is not really Windows specific.

Comment by Daniel Black [ 2021-04-27 ]

Makes sense. The purpose of MDEV-25342 was to get the chunk size up to a large but still useful size, to reduce the number of these changes that would occur and keep them meaningful.

Comment by Daniel Black [ 2021-06-10 ]

I had a chat with Balbir Singh, who reviewed the Linux kernel patches around memory pressure and, three years later, still thinks it is a good interface for MariaDB to use. He noted that because it's still asynchronous, you may already be in trouble by the time you get to processing the memory event; however, that's no worse than where we are now.

Comment by Michael Widenius [ 2021-08-30 ]

A couple of comments:

  • Threads are freed automatically from the thread cache after 10 minutes
    of inactivity. We could free these faster by adding a FLUSH THREADS
    command, but I don't think that will help much. (Note that we have
    REFRESH_THREADS defined but it does not do anything.)
  • madvise is mostly (only?) useful for mmap areas.
  • Note that 'free' will not normally release memory to the operating system.
    It is only when the whole top memory area allocated internally in malloc
    with sbrk() (or a similar call) is free that the memory can be given back
    to the operating system. In practice this means that any free() call
    is very unlikely to return memory to the operating system.

The only reliable way I know of to give memory areas back
to the operating system is to not use malloc/free for LARGE memory
areas, but instead use mmap and munmap for these.

One possible problem with this is that one can resize the memory to be
smaller, but not bigger again. To make it bigger, one needs to allocate
a new area and copy the old area into it.
I am also not sure of the disadvantages of using mmap areas for the buffer pool from the CPU's point of view.

Comment by Daniel Black [ 2021-09-07 ]

I hadn't really considered the thread cache, but there might be opportunities there. This bug was more about trying to auto-downsize things rather than relying on scripted/manual FLUSH TABLES external intervention. The more I think about it, the pseudo-goal is to scale back the LRU time period (across a broad range of allocations) while memory pressure exists, e.g. capping cache time to a maximum of 5 minutes while X pressure exists, and reverting back when it has dropped.

madvise is generally used with mmap areas; it's not a requirement at all, however. glibc's malloc uses MADV_DONTNEED internally as it shrinks structures.

I agree, the effect of a malloc-based free() on the actual allocation is all rather circumstantial. Looking at the code, free() in glibc's malloc will under some circumstances do a MADV_DONTNEED and, under some (I think large allocations), even an munmap. Using malloc_trim is a more direct way to get MADV_DONTNEED called on some areas, and it seems to be the library call that incurs the fewest complications in side effects.

Let me rebase bb-10.6-danielblack-MDEV-25424-my_multi_malloc-large-use-my_large_malloc again as there are some conflicts and we'll get a few more areas mmaped in preparation.

> One possible problem with this is that one can resize the memory to be smaller, but not bigger again. For doing it bigger, one needs to allocate
a new area and copy the old area to it.

Not quite true. With MADV_FREE the kernel can deallocate the pages, but it keeps the virtual address mapping there. At some point later you may end up with a page of 0s; if you do, writing to it preserves it again, at least as I understand it (there's a small ambiguity currently of page vs address/length, but there's code and people to ask). So as long as these semantics are handled on arriving at a zeroed page, it's possible to use. MADV_DONTNEED needs more semantics, like MADV_WILLNEED, to reverse the operation.

> I am also not sure of the disadvantages of using mmap areas for the bufferpool from the cpu's point of view.

None really; they give the kernel instructions on how to handle an allocation, with a few flags of constraint about the intended usage.

Comment by Daniel Black [ 2022-02-21 ]

note: for memory hogging tests - https://github.com/fritshoogland-yugabyte/eatmemory-rust

Comment by Marko Mäkelä [ 2022-06-21 ]

A couple more random thoughts:

  • Now that MDEV-27058 reduced sizeof(buf_page_t), it could be acceptable to malloc() block descriptors on demand, for blocks that are associated with file pages or attached to buf_pool.LRU. We could shrink sizeof(buf_block_t) further by moving all fields of the disabled-by-default adaptive hash index (MDEV-20487) behind one pointer.
  • The buf_pool.free as well as the buffer pool blocks that are backing store for the AHI or lock_sys could be doubly linked with each other via bytes allocated within the page frame itself. We do not need a dummy buf_page_t for such blocks.
  • We could allocate a contiguous virtual address range for the maximum supported size of the buffer pool, and let the operating system physically allocate a subset of these addresses. The complicated logic of having multiple buffer pool chunks could be removed. On 32-bit architectures, the maximum size could be about 2GiB. On 64-bit architectures, the virtual address space often is 48 bits (around 256 TiB). Perhaps we could shift some burden to the user and introduce a startup parameter innodb_buffer_pool_size_max.
  • If we are using huge memory pages, we would probably need to relocate blocks in order to be able to free up memory one huge page at a time. Some relocation is already implemented for ROW_FORMAT=COMPRESSED pages in the buf0buddy.cc allocator. An analogy would be defragmenting file systems on HDD or the relocation magic done by flash translation layers, to arrange erase blocks for reuse.
  • Supporting relocation for lock_sys objects (MDEV-28803) would require some effort, and the allocator may have to be replaced.
  • Some memory can be trivially thrown away on a memory pressure event. Such memory includes all pages of the adaptive hash index, as well all buffer pool pages that do not need to be written back to a file. Also some ‘dirty’ pages may actually belong to freed blocks (MDEV-15528) and can be discarded without being written.

When it comes to the implementation, VirtualAllocEx() on Microsoft Windows looks like it could allow the physical memory allocation of a virtual address range to be modified.

Comment by Daniel Black [ 2022-08-13 ]

Windows pressure events:

Comment by Vladislav Vaintroub [ 2022-08-13 ]

danblack, wait functions do not work with completion ports. We do have a place (the main thread) where we WaitForMultipleObjects; maybe you want to place that notification there. Alas, the original bug report is all about Linux and OOM kills (those do not happen on Windows, although paging is of course a problem).

Comment by Daniel Black [ 2022-09-02 ]

Marko's random thoughts above have been split into separate MDEVs.

serg, I started a proof of concept by extending the handler interface on storage engines. Other plugins are possible, however I was less certain about extending those.

https://github.com/grooverdan/mariadb-server/commit/d2fd183689532ff92309eb724c9edd29c6237b5a

It works as far as the memory_pressure event being triggered when it's short on memory. There's still a large list of TODOs in the commit message.

To what extent is this concept acceptable/not acceptable?

Comment by Sergei Golubchik [ 2022-09-11 ]

What is this for?

  • A dedicated database server doesn't need it; it can only get close to OOM if there's a memory leak or some out-of-control query allocates way more memory than it should. That is, if a dedicated db server hits OOM, it's a bug. Freeing some memory under pressure will only allow the query to allocate a bit more before crashing.
  • Desktop usage? Like a personal address book, amarok, etc? There are many applications on the desktop that need a lot of memory, but MariaDB shouldn't be one of them. On the desktop it shouldn't need large buffers, so the few MB that it'll be able to free under pressure will not help anyway.
  • Containers? As the comments above suggest, to be a good containerized citizen MariaDB needs to free memory regularly and proactively, not wait for pressure.

So what is the use case?

Comment by Daniel Black [ 2022-09-12 ]

This puts MariaDB back on comparable usability with Postgres, which relies on the page cache. Because the OS kernel implements the page cache for those using buffered IO, MariaDB with its more hands-on memory management needs its own handling. By adding a bit more tolerance we handle a variety of user mistakes a little more gracefully.

This isn't designed with the memory leak case in mind.

A dedicated server can increase memory use in a number of cases:

  • as data is added, query plans could flip to using an additional join_buffer_size, which, if common amongst a number of threads, can increase usage by a modest but not unbounded amount.
  • some script/install/unexpected cron condition on a dedicated server uses a large amount of memory for a brief time. By taking a small hit in memory, both can continue.

A shared hosting server using MariaDB will have a significant number of users and normally co-located PHP instances. While individual memory limits exist on the non-MariaDB elements, it's quite hard for a provider to gauge the memory requirement of the absolute worst-case scenario (like a DoS on one or more of its users). A provider is likely to be more receptive to the database freeing memory to avoid the OOM.

There is still a large number of new users running their own VMs who allocate 80% of available memory and gloss over the meaning of "available". They also often don't adequately reduce this amount for a high number of concurrent users. By detecting memory pressure, and counting events in a global status variable, we expose this over-allocation and protect the uptime of servers as well. They also try to do far too much on 1-2G of RAM, so let's tolerate this and educate.

There's a large amount we can do proactively by explicitly freeing unused items. There's still an amount of data in the buffer pool whose writeback is intentionally put off in case of multiple changes to the pages. Flushing these on a memory pressure event allows a few more GB to be saved, and only when needed.

True cached elements like thread caches, query caches, and innodb pages could have a common release-under-pressure tolerance system variable (300 seconds?); these add up to a reasonable amount of memory with low impact. I think there's a point where we still do get some performance gains, so the case for pre-emptive release isn't as strong until there's a pressure event.

Hooking explicit triggers like FLUSH LOW MEMORY provides an entry point for VM migration/container checkpoint-restore to reduce the quantity of pages saved/moved.

I'm not too worried about the desktop use; however, it might be a bit beneficial to the application developer with a local DB to run tests without needing to downsize MariaDB's use to avoid OOM.

The ability to react to pressure by default gives us the option of raising system variables like innodb_buffer_pool_size, which hasn't changed since 8a3ea85c921c 13 years ago (8M -> 128M), to benefit users with out-of-the-box performance everywhere without worrying too much about whether it will OOM on low-end hardware, be it a VM, a NAS, or a crowded container environment.

Comment by Michael Widenius [ 2022-11-15 ]

I think we are missing the big picture here.

  • Most 'enterprise' or 'critical' database installations are running on dedicated servers or containers. This means that there is no other process
    that competes for resources. Freeing memory to the OS in this case helps nothing and will likely make things worse, as it may be hard to get the resources back when needed (and we don't have any way to signal the process that it can now use more memory).
  • The most common cause of a MariaDB OOM is that the user has allocated too-big buffers or resources to various MariaDB components (heap tables, sort buffers, etc.) and MariaDB uses up all system memory. In this case, freeing resources to the system will over time just make things worse, as it makes it more likely that this will happen again.

Instead of trying to reduce page buffers (which may just hide the real problem), I would suggest that we first focus on:

  • Finding the real cause of the OOM and advising the user on how to fix it.
  • Creating a program that checks the MariaDB variables and gives a report on the constraints of the current setup:
      • Basic memory usage (after startup)
      • Expected memory usage per 'simple' user
      • Max memory usage for a query, with detailed information on each allocation (sort_buffer, join_buffer, temp tables, optimization of many tables, etc.)
      • Tuning recommendations

Things that could be improved in the server to keep memory usage lower:

  • Restrict buffer sizes if there are many running queries that use a lot of memory
      • sort_buffer, join_buffer, in_memory_temporary_tables, etc.

Comment by Marko Mäkelä [ 2022-11-16 ]

monty, I believe that this is a first step towards simplifying the InnoDB buffer pool resizing. I think that we should borrow a trick from AddressSanitizer and make use of the typically 48-bit virtual address space (256 terabytes) of contemporary 64-bit processors. That is, at InnoDB startup, simply allocate the virtual memory addresses corresponding to the maximum innodb_buffer_pool_size, and then request the operating system to map physical memory corresponding to the currently requested innodb_buffer_pool_size. This task implements the necessary operating system interfaces on Linux and Microsoft Windows.

This task would also allow future changes to make InnoDB more ‘polite’ in ‘overprovisioned’ installations where many database installations are deployed on a single server, each in their own Docker container. If one of the database servers suddenly sees increased activity, then the more idle servers could free up memory by shrinking the buffer pool allocation. This could allow hosting providers to make much more efficient use of resources.

Comment by Sergei Golubchik [ 2022-11-16 ]

In a sense, this topic fits nicely into a self-tuning server, where it configures all buffers depending on the amount of available memory. And if the latter changes — it reconfigures buffers as needed.

But in that case, you need to start from some kind of a scheme or plan of how to configure all buffers given a specific amount of available memory. And then you continue from there. I don't mean you actually need to implement auto-configuring of all buffers at this step. But you need to know what size each buffer should get. Even if only on paper.

Instead of arbitrarily and ad hoc reducing random memory buffers whenever the kernel signals pressure.

Comment by Max Mether [ 2022-11-17 ]

Wouldn't a starting point be not to allocate more memory when we get a certain signal from the kernel? Would that be a valid first step?

Comment by Daniel Black [ 2022-11-17 ]

> Freeing memory to the OS in this case helps nothing and will likely make things worse as it may be hard to get back the resources when needed

No, MADV_FREE on a cache item will allow the kernel to reclaim it, but the virtual memory address will be the same; so if you get a page of 0s back, it got reclaimed and you need to treat it like a cache miss.

> ...allocated too big buffer or resources to various MariaDB components (heap tables, sort buffers, etc) and MariaDB uses up all system memory.

Agree. It takes fair expertise to tune this correctly.

> In this case freeing resources to system will over time just make things worse as it makes it more likely that this will happen again.

I was planning on freeing them temporarily and, like Serg said, autosizing down. We can show counts of the events and of the reclaim/autoresize activities performed, and write to the error log. Continuing to run allows a person to take notice, look at the state of a still-running server, and see what could be trimmed in size.

If the server was OOM-killed, it's harder to constructively analyze what was and wasn't being used with regard to memory.

> reduce page buffers (that may just hide the real problem),

It's kind of optional whether it would reduce size; only in an autotune case. If facilitated by an autosize system variable, an LRU part of the buffers may be purged.

> I would suggest us to first focus on ...

These are the sort of things MySQL Tuner does already. It works best on a running server with lots of state. I've submitted changes there before to give better recommendations, and that can continue.

> InnoDB more ‘polite’ in ‘overprovisioned’ installation

The hosting service provider or cloud database provider isn't the user of the database, but does share some of the impact, justified or not, of the user's "over-use" of the database. Keeping it available in a hostile environment is a good gain.

> In a sense, this topic fits nicely into a self-tuning server,

Sure, this makes a lot of sense. There can be a lot of smarts put into the handlers to autosize down based on use, and/or limit the response to changing autosized variables rather than items explicitly configured by users.

> Wouldn't a starting point be not to allocate more memory when we get a certain signal from the kernel?
> Would that be a valid first step?

Sure, the same kernel interface is an indication of memory constraints.

Going from serg's autosizing concept, this could be the indicator that now we know the max innodb buffer/thread pool size and stop there (and maybe take a few small steps back).

Comment by Marko Mäkelä [ 2022-11-18 ]

I agree that the automatic dynamic scaling of buffers is challenging to implement in a heavily used system, especially in the case that there are multiple database processes running on the same system, unaware of each other. Sometimes it could be a lesser evil to let the kernel randomly kill one of several database instances, instead of randomly letting all database instances trade memory pressure for I/O pressure. Yes, it would avoid the OOM kill, but the throughput could be chaotic.

The automatic scaling would definitely have to be an opt-in feature. I think that enabling it would make sense in a case where there are lots of mostly-idle or read-mostly database instances deployed in a shared hosting environment. It would seem to depend on MDEV-19895, which is for automatic static scaling (choosing sensible buffer sizes based on one parameter). As far as I understand, once the memory pressure reported by the Linux kernel interface goes away, the memory usage would eventually be restored to the normal level.

When it comes to the InnoDB buffer pool, I still think that it would be a great idea to allocate all pages in a contiguous virtual address range corresponding to the maximum supported size. This would achieve a greater granularity than the current allocation in multiple chunks.

I would like to improve the memory usage of the InnoDB buffer pool. Currently there is no control over it apart from SET GLOBAL innodb_buffer_pool_size.

  1. Light variant: Try to make buf_pool.free a contiguous range of virtual addresses, aligned to the hugepage size, and invoke madvise(MADV_FREE). This could be done automatically by a periodic maintenance task, independent of any memory pressure events, if the performance impact is negligible.
  2. Medium variant: Move any data pages whose state is FREED to buf_pool.free. Immediately after a DROP TABLE, DROP INDEX or table-rebuilding DDL operations such as OPTIMIZE TABLE or TRUNCATE TABLE, we might evict useful data even though there is some true garbage close to the most recently used end of the buf_pool.LRU list. Such garbage will eventually be collected by LRU eviction or by buf_page_t::flush() in the buf_flush_page_cleaner() thread, but we could be more aggressive about it.
  3. Heavy variant: Free some least recently used pages. Ever since MDEV-23855, there no longer is any "background LRU eviction" in InnoDB. Least recently used pages will be retained in the buffer pool for possible future access, until a caller of buf_block_alloc() really needs memory.

I see that there already is a parameter innodb_old_blocks_pct, which could play a role here. Maybe we would just want to have one command, to evict all "old" blocks, to defragment buf_pool.free and invoke madvise(MADV_FREE) on it. That command could simply be SET GLOBAL innodb_old_blocks_pct=…;, that is, the heavy cleanup could be triggered on any assignment of that parameter.

I think that we should implement the madvise() calls independently of the memory pressure event interface.

Comment by Marko Mäkelä [ 2022-11-18 ]

I see that a subset of the "light variant" in my previous comment was implemented in MDEV-25341. The last half of my comment is more relevant for the open tickets linked to MDEV-25341: MDEV-29445, MDEV-29432, MDEV-29429.

Comment by Daniel Black [ 2023-10-24 ]

Looking to reuse the code as an alternate trigger to marko's fix in MDEV-31593.

Comment by Matthias Leich [ 2023-11-01 ]

Preliminary results of RQG testing
---------------------------------------------------
origin/bb-10.11-MDEV-24670-memory-pressure 7d941227f6b7031edceaa6abef90e178091ac172 2023-10-28T19:16:00+11:00
Build with debug+asan, PERFSCHEMA enabled
I got:
1. Quite frequently, when running a test based on the RQG grammar conf/mariadb/innodb_compression_encryption.yy:
mariadbd: /data/Server/bb-10.11-MDEV-24670-memory-pressure/storage/innobase/include/buf0buf.h:2031: void buf_page_t::clear_oldest_modification(): Assertion `oldest_modification()' failed.
(rr) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=65511758501440) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=65511758501440) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=65511758501440, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x000000006fe17476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x000000006fdfd7f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x000000006fdfd71b in __assert_fail_base (fmt=0x6ffb2150 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x5626d9f33680 "oldest_modification()", 
    file=0x5626d9f323a0 "/data/Server/bb-10.11-MDEV-24670-memory-pressure/storage/innobase/include/buf0buf.h", line=2031, function=<optimized out>) at ./assert/assert.c:92
#6  0x000000006fe0ee96 in __GI___assert_fail (assertion=0x5626d9f33680 "oldest_modification()", file=0x5626d9f323a0 "/data/Server/bb-10.11-MDEV-24670-memory-pressure/storage/innobase/include/buf0buf.h", line=2031, 
    function=0x5626d9f335e0 "void buf_page_t::clear_oldest_modification()") at ./assert/assert.c:101
#7  0x00005626d90408c4 in buf_page_t::clear_oldest_modification (this=this@entry=0x11696ad51700) at /data/Server/bb-10.11-MDEV-24670-memory-pressure/storage/innobase/include/buf0buf.h:2031
#8  0x00005626d902c845 in buf_pool_t::delete_from_flush_list (this=this@entry=0x5626dacadb40 <buf_pool>, bpage=bpage@entry=0x11696ad51700)
    at /data/Server/bb-10.11-MDEV-24670-memory-pressure/storage/innobase/buf/buf0flu.cc:170
#9  0x00005626d9001965 in buf_pool_t::garbage_collect (this=this@entry=0x5626dacadb40 <buf_pool>) at /data/Server/bb-10.11-MDEV-24670-memory-pressure/storage/innobase/buf/buf0buf.cc:2130
#10 0x00005626d8fe4a68 in mem_pressure::trigger_collection (this=0x5626db7c7520 <mem_pressure_obj>) at /data/Server/bb-10.11-MDEV-24670-memory-pressure/storage/innobase/buf/buf0buf.cc:906
#11 buf_resize_start () at /data/Server/bb-10.11-MDEV-24670-memory-pressure/storage/innobase/buf/buf0buf.cc:2189
#12 0x00005626d891b434 in innodb_buffer_pool_size_update (save=<optimized out>) at /data/Server/bb-10.11-MDEV-24670-memory-pressure/storage/innobase/handler/ha_innodb.cc:17430
#13 0x00005626d748cebb in sys_var_pluginvar::global_update (this=0x6210000d5488, thd=0x62c0003a0218, var=0x629002a0d4d8) at /data/Server/bb-10.11-MDEV-24670-memory-pressure/sql/sql_plugin.cc:3677
#14 0x00005626d713c16b in sys_var::update (this=0x6210000d5488, thd=<optimized out>, var=0x629002a0d4d8) at /data/Server/bb-10.11-MDEV-24670-memory-pressure/sql/set_var.cc:207
#15 0x00005626d713d4e5 in set_var::update (this=<optimized out>, thd=<optimized out>) at /data/Server/bb-10.11-MDEV-24670-memory-pressure/sql/set_var.cc:863
#16 0x00005626d7140a96 in sql_set_variables (thd=thd@entry=0x62c0003a0218, var_list=var_list@entry=0x62c0003a5630, free=free@entry=true) at /data/Server/bb-10.11-MDEV-24670-memory-pressure/sql/set_var.cc:745
#17 0x00005626d742f64e in mysql_execute_command (thd=thd@entry=0x62c0003a0218, is_called_from_prepared_stmt=is_called_from_prepared_stmt@entry=false) at /data/Server/bb-10.11-MDEV-24670-memory-pressure/sql/sql_parse.cc:5064
#18 0x00005626d74396b0 in mysql_parse (thd=thd@entry=0x62c0003a0218, rawbuf=<optimized out>, length=<optimized out>, parser_state=parser_state@entry=0x3b952517f5a0)
    at /data/Server/bb-10.11-MDEV-24670-memory-pressure/sql/sql_parse.cc:8030
#19 0x00005626d744053c in dispatch_command (command=command@entry=COM_QUERY, thd=thd@entry=0x62c0003a0218, 
    packet=packet@entry=0x6290029f9219 " SET GLOBAL innodb_buffer_pool_size=@@innodb_buffer_pool_size /* E_R Thread1 QNO 2473 CON_ID 22 */ ", packet_length=packet_length@entry=99, blocking=blocking@entry=true)
    at /data/Server/bb-10.11-MDEV-24670-memory-pressure/sql/sql_parse.cc:1894
#20 0x00005626d7445980 in do_command (thd=0x62c0003a0218, blocking=blocking@entry=true) at /data/Server/bb-10.11-MDEV-24670-memory-pressure/sql/sql_parse.cc:1407
#21 0x00005626d78dbf63 in do_handle_one_connection (connect=<optimized out>, connect@entry=0x608000003a38, put_in_cache=put_in_cache@entry=true) at /data/Server/bb-10.11-MDEV-24670-memory-pressure/sql/sql_connect.cc:1416
#22 0x00005626d78dc779 in handle_one_connection (arg=arg@entry=0x608000003a38) at /data/Server/bb-10.11-MDEV-24670-memory-pressure/sql/sql_connect.cc:1318
#23 0x00005626d87138f1 in pfs_spawn_thread (arg=0x617000007e98) at /data/Server/bb-10.11-MDEV-24670-memory-pressure/storage/perfschema/pfs.cc:2201
#24 0x000000006fe69b43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#25 0x000000006fefabb4 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:100
(rr)
sdp:/data1/results/1698783620/TBR-2080$ _RR_TRACE_DIR=./1/rr rr replay --mark-stdio
 
2. A bit less frequently, when running a test based on the RQG grammar conf/mariadb/innodb_compression_encryption.yy:
[rr 2146265 145385]mariadbd: /data/Server/bb-10.11-MDEV-24670-memory-pressure/storage/innobase/buf/buf0buf.cc:2105: void buf_pool_t::garbage_collect(): Assertion `state >= buf_page_t::FREED' failed.
(rr) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140148681741888) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140148681741888) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140148681741888, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007f771ca5b476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007f771ca417f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007f771ca4171b in __assert_fail_base (fmt=0x7f771cbf6150 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x55f5fe36f460 "state >= buf_page_t::FREED", 
    file=0x55f5fe368c80 "/data/Server/bb-10.11-MDEV-24670-memory-pressure/storage/innobase/buf/buf0buf.cc", line=2105, function=<optimized out>) at ./assert/assert.c:92
#6  0x00007f771ca52e96 in __GI___assert_fail (assertion=0x55f5fe36f460 "state >= buf_page_t::FREED", file=0x55f5fe368c80 "/data/Server/bb-10.11-MDEV-24670-memory-pressure/storage/innobase/buf/buf0buf.cc", line=2105, 
    function=0x55f5fe36f4a0 "void buf_pool_t::garbage_collect()") at ./assert/assert.c:101
#7  0x000055f5fd44c716 in buf_pool_t::garbage_collect (this=this@entry=0x55f5ff0f8b40 <buf_pool>) at /data/Server/bb-10.11-MDEV-24670-memory-pressure/storage/innobase/buf/buf0buf.cc:2105
#8  0x000055f5fd42fa68 in mem_pressure::trigger_collection (this=0x55f5ffc12520 <mem_pressure_obj>) at /data/Server/bb-10.11-MDEV-24670-memory-pressure/storage/innobase/buf/buf0buf.cc:906
#9  buf_resize_start () at /data/Server/bb-10.11-MDEV-24670-memory-pressure/storage/innobase/buf/buf0buf.cc:2189
#10 0x000055f5fcd66434 in innodb_buffer_pool_size_update (save=<optimized out>) at /data/Server/bb-10.11-MDEV-24670-memory-pressure/storage/innobase/handler/ha_innodb.cc:17430
#11 0x000055f5fb8d7ebb in sys_var_pluginvar::global_update (this=0x6210000d5488, thd=0x62c0001d0218, var=0x6290001184d8) at /data/Server/bb-10.11-MDEV-24670-memory-pressure/sql/sql_plugin.cc:3677
#12 0x000055f5fb58716b in sys_var::update (this=0x6210000d5488, thd=<optimized out>, var=0x6290001184d8) at /data/Server/bb-10.11-MDEV-24670-memory-pressure/sql/set_var.cc:207
#13 0x000055f5fb5884e5 in set_var::update (this=<optimized out>, thd=<optimized out>) at /data/Server/bb-10.11-MDEV-24670-memory-pressure/sql/set_var.cc:863
#14 0x000055f5fb58ba96 in sql_set_variables (thd=thd@entry=0x62c0001d0218, var_list=var_list@entry=0x62c0001d5630, free=free@entry=true) at /data/Server/bb-10.11-MDEV-24670-memory-pressure/sql/set_var.cc:745
#15 0x000055f5fb87a64e in mysql_execute_command (thd=thd@entry=0x62c0001d0218, is_called_from_prepared_stmt=is_called_from_prepared_stmt@entry=false) at /data/Server/bb-10.11-MDEV-24670-memory-pressure/sql/sql_parse.cc:5064
#16 0x000055f5fb8846b0 in mysql_parse (thd=thd@entry=0x62c0001d0218, rawbuf=<optimized out>, length=<optimized out>, parser_state=parser_state@entry=0x7f76e863f5a0)
    at /data/Server/bb-10.11-MDEV-24670-memory-pressure/sql/sql_parse.cc:8030
#17 0x000055f5fb88b53c in dispatch_command (command=command@entry=COM_QUERY, thd=thd@entry=0x62c0001d0218, 
    packet=packet@entry=0x6290019a5219 " SET GLOBAL innodb_buffer_pool_size=@@innodb_buffer_pool_size /* E_R Thread1 QNO 2605 CON_ID 16 */ ", packet_length=packet_length@entry=99, blocking=blocking@entry=true)
    at /data/Server/bb-10.11-MDEV-24670-memory-pressure/sql/sql_parse.cc:1894
#18 0x000055f5fb890980 in do_command (thd=0x62c0001d0218, blocking=blocking@entry=true) at /data/Server/bb-10.11-MDEV-24670-memory-pressure/sql/sql_parse.cc:1407
#19 0x000055f5fbd26f63 in do_handle_one_connection (connect=<optimized out>, connect@entry=0x608000003338, put_in_cache=put_in_cache@entry=true) at /data/Server/bb-10.11-MDEV-24670-memory-pressure/sql/sql_connect.cc:1416
#20 0x000055f5fbd27779 in handle_one_connection (arg=arg@entry=0x608000003338) at /data/Server/bb-10.11-MDEV-24670-memory-pressure/sql/sql_connect.cc:1318
#21 0x000055f5fcb5e8f1 in pfs_spawn_thread (arg=0x617000007098) at /data/Server/bb-10.11-MDEV-24670-memory-pressure/storage/perfschema/pfs.cc:2201
#22 0x00007f771caadb43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#23 0x00007f771cb3ebb4 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:100
(rr)
sdp:/data1/results/1698783620/TBR-2081$ _RR_TRACE_DIR=./1/rr rr replay --mark-stdio

Comment by Daniel Black [ 2023-11-02 ]

Thank you mleich for testing. Two concurrent executions of buf_pool_t::garbage_collect should not be possible, so something must be unlocking buf_pool.mutex.

Since buf_pool_t::garbage_collect is only meant to be executed one at a time, maybe it needs its own mutex. Only a debug build can trigger this at the moment, so the mutex would be uncongested.

Comment by Marko Mäkelä [ 2023-11-02 ]

The assertion failure occurs in the buf_pool_t::garbage_collect() that I contributed. I checked the rr replay trace. It is invoking this function several times per second, most of the time freeing no pages. The trace involves the adaptive hash index, which was disabled by default in MDEV-20487. The last change to the block state was by the page cleaner thread:

#3  buf_page_t::set_state (s=0, this=...) at /data/Server/bb-10.11-MDEV-24670-memory-pressure/storage/innobase/include/buf0buf.h:1985
#4  buf_LRU_block_free_non_file_page (block=...) at /data/Server/bb-10.11-MDEV-24670-memory-pressure/storage/innobase/buf/buf0lru.cc:1016
#5  0x000055f5fd493f6d in buf_LRU_block_free_hashed_page (block=...) at /data/Server/bb-10.11-MDEV-24670-memory-pressure/storage/innobase/buf/buf0lru.cc:132
#6  0x000055f5fd49a9ab in buf_LRU_free_page (bpage=..., zip=...) at /data/Server/bb-10.11-MDEV-24670-memory-pressure/storage/innobase/buf/buf0lru.cc:995
#7  0x000055f5fd47ac1a in buf_flush_discard_page (bpage=...) at /data/Server/bb-10.11-MDEV-24670-memory-pressure/storage/innobase/buf/buf0flu.cc:1211
#8  0x000055f5fd483394 in buf_do_flush_list_batch (max_n=..., lsn=...) at /data/Server/bb-10.11-MDEV-24670-memory-pressure/storage/innobase/buf/buf0flu.cc:1472
#9  0x000055f5fd484349 in buf_flush_list_holding_mutex (max_n=..., lsn=...) at /data/Server/bb-10.11-MDEV-24670-memory-pressure/storage/innobase/buf/buf0flu.cc:1552
#10 0x000055f5fd489c88 in buf_flush_page_cleaner () at /data/Server/bb-10.11-MDEV-24670-memory-pressure/storage/innobase/buf/buf0flu.cc:2559

At this time, the garbage collector was busy removing some adaptive hash index entries on a different block:

#0  0x000055f5fcdd6934 in ut_align_offset (ptr=..., alignment=...) at /data/Server/bb-10.11-MDEV-24670-memory-pressure/storage/innobase/include/ut0byte.h:88
#1  0x000055f5fd3fd607 in page_offset (ptr=...) at /data/Server/bb-10.11-MDEV-24670-memory-pressure/storage/innobase/include/page0page.h:216
#2  page_rec_check (rec=...) at /data/Server/bb-10.11-MDEV-24670-memory-pressure/storage/innobase/include/page0page.inl:310
#3  0x000055f5fd3ff852 in page_rec_get_next_low (rec=..., comp=...) at /data/Server/bb-10.11-MDEV-24670-memory-pressure/storage/innobase/include/page0page.inl:358
#4  0x000055f5fd406ab9 in btr_search_drop_page_hash_index (block=..., garbage_collect=...) at /data/Server/bb-10.11-MDEV-24670-memory-pressure/storage/innobase/btr/btr0sea.cc:1332
#5  0x000055f5fd49a94a in buf_LRU_free_page (bpage=..., zip=...) at /data/Server/bb-10.11-MDEV-24670-memory-pressure/storage/innobase/buf/buf0lru.cc:984
#6  0x000055f5fd44c9c7 in buf_pool_t::garbage_collect (this=...) at /data/Server/bb-10.11-MDEV-24670-memory-pressure/storage/innobase/buf/buf0buf.cc:2113
#7  0x000055f5fd42fa68 in mem_pressure::trigger_collection (this=...) at /data/Server/bb-10.11-MDEV-24670-memory-pressure/storage/innobase/buf/buf0buf.cc:906
#8  buf_resize_start () at /data/Server/bb-10.11-MDEV-24670-memory-pressure/storage/innobase/buf/buf0buf.cc:2189
#9  0x000055f5fcd66434 in innodb_buffer_pool_size_update (save=...) at /data/Server/bb-10.11-MDEV-24670-memory-pressure/storage/innobase/handler/ha_innodb.cc:17430

I had overlooked the fact that buf_LRU_free_page() is indeed releasing and reacquiring the buf_pool.mutex in this code path:

#ifdef BTR_CUR_HASH_ADAPT
	if (block->index) {
		mysql_mutex_unlock(&buf_pool.mutex);
 
		/* Remove the adaptive hash index on the page.
		The page was declared uninitialized by
		buf_LRU_block_remove_hashed().  We need to flag
		the contents of the page valid (which it still is) in
		order to avoid bogus Valgrind or MSAN warnings.*/
 
		MEM_MAKE_DEFINED(block->page.frame, srv_page_size);
		btr_search_drop_page_hash_index(block, false);
		MEM_UNDEFINED(block->page.frame, srv_page_size);
		mysql_mutex_lock(&buf_pool.mutex);
	}
#endif

Due to this, we must use buf_pool.lru_hp to safely traverse the buf_pool.LRU list, similar to how buf_flush_LRU_list_batch() does it, but with a jump to the start of the eviction loop if the hazard pointer was invalidated. I believe that this handling is only needed #ifdef BTR_CUR_HASH_ADAPT.

Generated at Thu Feb 08 09:31:46 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.