MariaDB Server

MDEV-34863: RAM Usage Changed Significantly Between 10.11 Releases

Details

    Description

      We run a pretty large database (about 5.5 TB currently), and we were running into the growing history list length issue that was well known under 10.11.6.

      We upgraded to 10.11.9 (10.11.8 first on some servers), and we immediately noticed that the memory usage was wildly different. 10.11.6 seemed very good at freeing memory, and it was great to see an accurate representation of how many resources the servers are actually using.

      With 10.11.9, you can see that the graph does a reverse hockey stick. It maximizes allocated memory based on innodb_buffer_pool_size, and then stays occupied, hardly fluctuating.

      We had seen this same kind of usage in MariaDB 10.3 and 10.6, so it was really nice when earlier 10.11 had what looked like much more robust memory handling. We wouldn't have a big problem with it except that one of our replicas unexpectedly OOMed over the weekend.

      So, what changed from 10.11.6 to 10.11.8? Is there a way to set more aggressive memory management?

      Attachments

        Issue Links

          Activity

            suwalski Pat Suwalski added a comment -

            Reviewing the changelogs for 10.11.7 and 10.11.8, all I can see is an entry for 10.11.7: Memory pressure (MDEV-24670).


            marko Marko Mäkelä added a comment -

            I’m glad to know that the history list length is not that much of a problem for you anymore. MDEV-34515 in the upcoming releases should improve that further.

            What exactly are the graphs displaying? Is it the MemAvailable in Linux /proc/meminfo? That would seem to reflect the available memory, including any portion that might be used for file system cache.

            To diagnose memory usage, it would be helpful to have more detail on the mariadbd process. Can you provide a graph that contains things like the total virtual memory size of the process as well as some global status variables, such as innodb_buffer_pool_pages_data and innodb_buffer_pool_pages_free? That should look a bit like the oltp_ts_*.png attached to MDEV-23855, partly with different counters. If those graphs look stable, then we would have to search for a memory leak or fragmentation somewhere, most likely outside InnoDB.
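
            For example, those counters can be polled on an interval with a plain status query and fed into a graphing tool; a minimal sketch, to be adjusted to your monitoring setup:

            -- Sample the InnoDB buffer pool page counters mentioned above.
            SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages%';

            -- Or select only the two counters of interest from the status tables.
            SELECT VARIABLE_NAME, VARIABLE_VALUE
              FROM information_schema.GLOBAL_STATUS
             WHERE VARIABLE_NAME IN ('Innodb_buffer_pool_pages_data',
                                     'Innodb_buffer_pool_pages_free');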

            What MDEV-24670 aims to do is to free pages from the InnoDB buffer pool in case a memory pressure event is reported by the operating system, to make it more immune to out-of-memory kills.


            wlad Vladislav Vaintroub added a comment -

            At some point, there was proactive giving back of memory to the OS with madvise() when freeing InnoDB pages.
            I think this is what suwalski was referring to as "good behavior".

            But it was found to be too expensive, as it would be a syscall for every freed page under some global buffer pool mutex. So this was changed to use the memory pressure notification.

            suwalski Pat Suwalski added a comment -

            Hello Marko and Vlad,

            Yes, the history list is much better managed now. We graph it, and it generally peaks at a few hundred thousand before the background thread can clean it up after a long operation has run. Some of our even longer operations get it into the millions, but they are promptly cleaned up now as well. In contrast, the old ibdata1 file on our master was growing at about a terabyte a month on 10.11.6, before the fix and externalized undo logs.
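
            (For anyone tracking the same metric: assuming the Innodb_history_list_length global status variable is available in this version, the purge backlog can be polled for graphing as below; it is also printed by SHOW ENGINE INNODB STATUS.)

            -- Poll the history list length (purge backlog) for graphing.
            SHOW GLOBAL STATUS LIKE 'Innodb_history_list_length';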

            Yes, your assumption about the graphs is correct. It's the Zabbix available memory item, vm.memory.size[available], which matches MemAvailable in /proc/meminfo. The system runs bare metal with basically nothing other than mariadb, so the number is highly reflective of the mariadb process.

            I can start graphing other variables, but I don't have historical data for them.

            I think the problem I had over the weekend might have actually been caused by MDEV-24670. The process didn't crash, it just got much slower, and the kswapd1 process was completely pegging one core, even though the system has no swap space whatsoever. In the past, it would have OOMed quickly, restarted, and caused much less headache overall. Here, it hobbled along slowly, with no memory being freed and no resolution in sight, and I had to restart the mariadb process to resolve it.

            I can make it release pages of memory sometimes: to avoid this problem in the future, I lowered innodb_buffer_pool_size by a few gigabytes at runtime and saw a corresponding increase in available memory.
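
            The runtime shrink itself is just a dynamic variable change, for example (illustrative size only; pick a value appropriate for the server):

            -- Shrink the buffer pool online by a few gigabytes.
            SET GLOBAL innodb_buffer_pool_size = 120 * 1024 * 1024 * 1024;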

            I think Vladislav's instinct is correct. I considered it good behaviour that the process was returning memory; it made it easy to keep track of how much RAM was actually being used at any given time. We didn't note any decrease in performance with that strategy, and it would never run out of RAM. In fact, as you can see from the graphs, it would average about half of the RAM in the system and never get low.

            Back in the day, MySQL 5.5 could be configured with about 90% of the system as innodb_buffer_pool_size and run very stably for literal years with something like 20M of free memory. 5.6 made this a little less stable, and 5.7 took it away completely. The various versions of MariaDB behave a little like MySQL 5.6 to me in this regard: there seem to be small leaks or unreturned RAM that creep up over time, so it's much harder to find optimal settings, except in that version where memory was clearly being returned aggressively.

            Perhaps you could confirm that this madvise change went away in either 10.11.7 or 10.11.8?

            Maybe this isn't a bug, but an intentional change, and I need to keep track of innodb memory variables more closely to tune the system.

            Thanks.

            wlad Vladislav Vaintroub added a comment - - edited

            That madvise I mentioned was the subject of MDEV-31953, and yes, it was removed in 10.11.7.

            It was added in MDEV-25341 (10.11.0), so overall it did not have a very long life.
            But in general, an infrequent "batched" madvise is superior to an eager, for-every-freed-page one. If the memory pressure notification does not work, we probably need to understand why, and how to make it work.

            I think recently there has been some talk about transparent huge pages on Linux, how they seem to be causing memory leaks, and whether they need to be disabled. I don't have the full picture, but perhaps marko or danblack can remember the details.


            marko Marko Mäkelä added a comment -

            There is a work-in-progress ticket MDEV-34753. I hope that danblack or knielsen can comment on whether that could play a role here.

            An excerpt of the server error log since the latest startup could help to find out whether the memory pressure notification interface was enabled or triggered. suwalski, could you provide that?

            knielsen Kristian Nielsen added a comment - - edited

            marko: MDEV-34753 relates to a feature where InnoDB listens for a notification from the kernel that the system is under memory pressure (risking OOM). When this happens, InnoDB can try to release memory to help, by calling buf_pool.garbage_collect().

            The bug reported in MDEV-34753 means that this does not work: the thread listening for notifications will always exit before trying to react to memory pressure.

            So I think MDEV-34753 is not relevant here. My understanding is that, as a consequence of this bug, the memory pressure code is simply not active. It looks like this code was introduced in 10.11.7, so the bug will prevent the new feature from making things better in 10.11.7, but it should not cause behaviour different from 10.11.6.

            Unless there was some similar memory pressure handling in 10.11.6 that I am not aware of, and it was replaced in 10.11.7 by the new code (which is not working)? If the system was previously dependent on such a mechanism for good memory behavior, and that old code has been replaced with new code that is inoperative, then maybe it could have an effect.

            Hope this helps,

            - Kristian.

            EDIT: Ok, I see the comment about madvise being removed in 10.11.7, so this does relate somewhat, since the new system listening to memory pressure events from the kernel is ineffective in the current code, AFAIK. The patch for MDEV-34753 that danblack wrote is quite simple, and is something the reporter could try to test if desired.

            suwalski Pat Suwalski added a comment - - edited

            wlad I really like this idea of a batched madvise, something that could execute on roughly the same timeframe as undo log truncation, every few minutes, etc.

            marko: I was looking through the logs for something indicating memory pressure. There are many entries for the undo log truncation (which also works very well now, after the HLL fixes!), but [Mm]emory doesn't show up anywhere in the logs. Nothing in dmesg either, in case it would be a kernel message. Perhaps there is something else I should search for? Or perhaps knielsen is correct that 10.11.9 would never have this running at all.


            marko Marko Mäkelä added a comment -

            I trust knielsen’s judgement here.

            A batched madvise() might be feasible to run in the buf_flush_page_cleaner() thread, about once per second, but I think that it would depend on implementing MDEV-29445, so that we could in a fairly straightforward way invoke madvise() on a few contiguous virtual address ranges, instead of invoking it separately for each innodb_page_size page frame, like we do in the current implementation. The current implementation is fine, provided that it is not being invoked too often.

            Should a busy server process a memory pressure event, we could in any case be in bigger performance trouble, because buf_pool_t::garbage_collect() will try to free up as many pages as possible, no matter how recent the last access was. So, the memory pressure event could soon be followed by excessive amounts of page reads to warm up the buffer pool for the workload.


            marko Marko Mäkelä added a comment -

            The rewritten InnoDB buffer pool allocation in MDEV-29445 would allow batched madvise(MADV_FREE) of larger chunks of the buffer pool at a time.


            marko Marko Mäkelä added a comment -

            This is implemented in the 10.11-MDEV-29445 branch, which you have already been reviewing.

            debarun Debarun Banerjee added a comment -

            Review done with MDEV-29445. https://github.com/MariaDB/server/pull/3826

            marko Marko Mäkelä added a comment -

            This has been implemented in the same branch as MDEV-29445.


            serg Sergei Golubchik added a comment -

            Will be fixed together with MDEV-29445, so closing.

            marko Marko Mäkelä added a comment -

            While this fix has been tested and reviewed in the same development branch with MDEV-29445, it is a separate change on its own.

            marko Marko Mäkelä added a comment -

            The memory pressure stall interface is now disabled by default. To enable it:

            SET GLOBAL innodb_buffer_pool_size_auto_min=0;
            

            or set it to a minimum value that you allow the buffer pool to be shrunk to, in response to running out of memory. Each event will attempt to reduce innodb_buffer_pool_size to (innodb_buffer_pool_size+innodb_buffer_pool_size_auto_min)/2.
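
            As a worked example with illustrative sizes: with innodb_buffer_pool_size at 64 GiB and innodb_buffer_pool_size_auto_min set to 16 GiB, the first memory pressure event would attempt to shrink the buffer pool to (64+16)/2 = 40 GiB, a second one to (40+16)/2 = 28 GiB, and so on, never going below the 16 GiB minimum:

            -- Allow memory pressure events to shrink the buffer pool, but never below
            -- 16 GiB (illustrative value; choose a minimum that suits the workload).
            SET GLOBAL innodb_buffer_pool_size_auto_min = 16 * 1024 * 1024 * 1024;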

            suwalski Pat Suwalski added a comment -

            Awesome, gentlemen. I am looking forward to trying it.


            People

              Assignee: marko Marko Mäkelä
              Reporter: suwalski Pat Suwalski