[MDEV-29097] 10.8.3 seems to be using a lot more swap memory, always increasing (every time mariabackup runs daily) Created: 2022-07-13  Updated: 2023-01-20  Resolved: 2022-11-01

Status: Closed
Project: MariaDB Server
Component/s: Server
Affects Version/s: 10.8.3
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Nuno Assignee: Daniel Black
Resolution: Not a Bug Votes: 0
Labels: None

Attachments: screenshot-1.png, screenshot-2.png, screenshot-3.png, screenshot-4.png, screenshot-5.png, screenshot-6.png, screenshot-7.png (PNG files)

 Description   

I don't remember having this issue with 10.5: although it used a fair amount of swap, I never had to do an emergency database restart because of it.

For the past few days I have been receiving alerts about too much swap being used on the server, and it has been getting worse. I had to restart MariaDB yesterday because swap was getting full.

Before the restart, this was the usage:

Resident RAM:
mariadbd – 89387556 KB (85.25 GB)

Swap:
mariadbd – 26975616 KB (25.73 GB)

And by the way, my server's sysctl has this config:

vm.swappiness=1
(which tells the kernel to swap only when necessary; note that even =0 does not disable swap, it only makes the kernel avoid swapping as much as possible)

My innodb_buffer_pool_size is 80G.

Are you aware of any reason why MariaDB would use so much swap?

There was plenty of free RAM that could have been used instead.
I understand that some swap may be used, but I don't understand why so much of it, instead of resident RAM.

Since the restart yesterday, it's already using 2GB swap, and increasing.

Thank you.



 Comments   
Comment by Marko Mäkelä [ 2022-07-14 ]

Would the memory usage tracking of performance_schema produce anything useful?
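For reference, once performance_schema is running with the memory instruments switched on (the instrument setting below goes in the server config and needs a restart), the biggest current consumers can be listed with a query along these lines; treat it as a sketch rather than a tuned recipe:

```sql
-- In my.cnf / server config (restart required):
--   performance_schema = ON
--   performance-schema-instrument = 'memory/%=ON'

-- Top 10 memory consumers, by bytes currently allocated:
SELECT EVENT_NAME,
       CURRENT_NUMBER_OF_BYTES_USED,
       HIGH_NUMBER_OF_BYTES_USED
FROM performance_schema.memory_summary_global_by_event_name
ORDER BY CURRENT_NUMBER_OF_BYTES_USED DESC
LIMIT 10;
```

The HIGH_NUMBER_OF_BYTES_USED column is the high-water mark, which is useful for spotting allocations that grew and were later freed.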

For a subset of your workload, can you try to get some more information via a heap profiler, such as the one of tcmalloc?
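In case it lowers the barrier: with gperftools, the heap profiler is usually attached by preloading the tcmalloc library and pointing it at a dump-file prefix. A minimal sketch as a systemd drop-in (the library path and file names are assumptions for your distribution):

```ini
# e.g. /etc/systemd/system/mariadb.service.d/heap-profile.conf (hypothetical path)
[Service]
Environment="LD_PRELOAD=/usr/lib64/libtcmalloc.so"
Environment="HEAPPROFILE=/tmp/mariadbd.hprof"
```

After "systemctl daemon-reload" and a server restart, the dumped /tmp/mariadbd.hprof.* files can be inspected with gperftools' pprof tool.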

Comment by Nuno [ 2022-07-14 ]

Hi marko. I have performance_schema disabled, and it looks like it's not a dynamic variable, so it requires a restart. I can look into enabling it on the next restart, if the performance impact of having it enabled is negligible.

As for using tcmalloc, that looks quite advanced, and it's not something I'm comfortable doing without some guidance or a tutorial specific to MariaDB.

I can see what I get with performance_schema first, but I need to schedule a restart.
Thank you.

Comment by Nuno [ 2022-07-27 ]

Hi marko
I'm again getting alerts because 20 GB+ of swap is being used on the server, of which 18.32 GB is from MariaDB.

I have performance_schema enabled now, but where do you recommend I look?
I can't find any memory usage history there, and I'm not sure where the useful information would be.

Thank you very much.

Comment by Nuno [ 2022-07-27 ]

marko
Oh... looking at the swap usage history ("sar" logs, transformed into a Google Chart),
I see that the swap increases consistently every day when MariaBackup runs, and then never (or rarely) gets reclaimed.

Comment by Marko Mäkelä [ 2022-08-03 ]

nunop, I only know InnoDB, and it should not allocate too much outside its buffer pool. I can’t think of anything where the InnoDB heap memory usage should have increased between 10.5 and 10.8.

I hope that serg, sanja or danblack can provide some advice on how to find the largest users of memory. The increased resident set size (and thus swap usage) is not necessarily due to a memory leak; it could also be memory fragmentation. Using an alternative memory allocation library, such as tcmalloc or jemalloc, might reduce the fragmentation.

Comment by Nuno [ 2022-08-03 ]

Thanks marko. For info, I only use InnoDB in my databases.
The database structure hasn't changed since 10.5.

I'm not 100% sure whether I started having this issue right after the upgrade, or whether it was after I changed one of the other my.cnf settings we discussed in the other issues.
I was restarting frequently while we were chatting here, to try the new suggestions, so it's hard to know.

I can see what happens on 10.8.4, once I revert some of those configs.

Will wait anyway for the feedback/suggestions from the others. Thanks everyone!

Comment by Marko Mäkelä [ 2022-08-03 ]

nunop, in 10.6 there were some extensive changes to class Item, which implements all subexpression types in the SQL parser. The memory allocations and copying related to strings were supposed to be optimized. Without knowing more specific details, that would be my main suspect.

Comment by Nuno [ 2022-08-10 ]

Just noting here,

I still see the swap increasing slowly, but it also slowly comes back down, which keeps things stable for longer (though still slowly increasing over time), so I no longer need to restart MariaDB every 3-4 days.

I think what helped was that I increased the log file from 5 GB to 24 GB the last time I restarted, 10 days ago (I read in the documentation that it's safe to have huge log files now).
It also seems to slow down how quickly the InnoDB buffer pool fills up.

(but still, this wasn't an issue in 10.5)

Thanks marko for pointing some suspicions of what the cause could be.

Comment by Nuno [ 2022-08-13 ]

I've been monitoring this. Swap has actually been pretty stable around 17 GB, slowly increasing as I said, but it's not the end of the world.

When MariaBackup ran today (it takes 1-2 minutes, as it does every day), swap increased a bit as usual, but strangely, over the next 2 hours it rose by another 3 GB, so now over 20 GB of swap is in use overall...

MariaDB is using 15.56 GB of swap.
The next-largest swapped processes are using 239.16 MB and 158.53 MB, and everything else is below 50 MB.

Now, I don't know if this matters much, but when I started getting alerts about swap > 20 GB (at 8:25, an hour and a half after MariaBackup ran), the first alert showed 925 MB of free RAM. But it's always like that, since most of that is just cached RAM; the actual available RAM (which is what I always look at) is 36 GB.
So there is plenty of RAM available to use, at all times.

MariaDB's InnoDB buffer pool is currently at 79.68% of 95 GB.
The process is using 70464808 KB of resident RAM right now.

It's still a mystery to me what caused MariaDB to push several GB into swap during those 2 hours...

Comment by Nuno [ 2022-09-19 ]

2 days ago I had another huge swap jump (again, at the time MariaBackup runs).

It seemed to be decreasing until I restarted MariaDB later that day.

Yesterday and today it also increased a bit, but not much at all.

Comment by Marko Mäkelä [ 2022-09-22 ]

nunop, if the problem is memory fragmentation in the allocator, using a different memory allocator might help. For GNU libc, you can find some environment variables documented in man mallopt.

Alternative allocators such as tcmalloc or jemalloc might also provide better diagnostics. https://smalldatum.blogspot.com/2022/09/understanding-some-jemalloc-stats-for.html may be worth a read. I am afraid that without having more details, it will be difficult to fix anything. We do run tests with AddressSanitizer. There are a few open bug reports that mention LeakSanitizer. Could you check if you might be hitting one of them?
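To make the mallopt(3) route concrete: glibc reads a few MALLOC_* environment variables at startup, and capping the arena count is the usual first experiment against fragmentation. A sketch as a systemd drop-in (the path and the value 2 are illustrative assumptions, not tuned recommendations):

```ini
# e.g. /etc/systemd/system/mariadb.service.d/malloc.conf (hypothetical path)
[Service]
# Cap the number of glibc malloc arenas to reduce per-thread fragmentation.
Environment="MALLOC_ARENA_MAX=2"
# Alternative experiment: preload jemalloc instead (library path varies):
# Environment="LD_PRELOAD=/usr/lib64/libjemalloc.so.2"
```

Either way, the change only takes effect after "systemctl daemon-reload" and a server restart.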

Comment by Nuno [ 2022-09-23 ]

Hey! Thanks a lot for the tips. I'll read through them very soon.

I just want to add a note: I believe I've just realized that the swap growth is caused by the "rsync" to the other server(s).

This also matches the times I was manually rsyncing in recent days while testing the backups with 10.8.4.

I just don't understand why "rsync" would cause MariaDB to swap, though... (the files I'm rsyncing are the backup copies, not the live database files)

Today was the first day that rsync ran twice, once to the HDD and once to the SSD. It took 1 hour, which matches the period during which the swap kept increasing:

Yesterday I did an rsync at this time too:

Very strange... I'll continue to investigate based on this.

Comment by Nuno [ 2022-10-30 ]

marko

I think I've just figured out why this happens....

That's just how RHEL 8 / AlmaLinux 8 works.

Relevant links with the actual explanation:
https://access.redhat.com/solutions/6785021
https://access.redhat.com/solutions/6954667

And I can confirm that even though my "vm.swappiness=1", most processes are still using the default value of 60, because they inherit that default before "sysctl" tunes the system.

grep . /sys/fs/cgroup/memory/*/memory.swappiness
grep . /sys/fs/cgroup/memory/*/*/memory.swappiness

RHEL 8.7 (still in beta) brings a new sysctl parameter, "vm.force_cgroup_v2_swappiness=1", which will resolve this issue.

Until then, the workaround is to have a script that runs after boot and updates all the "memory.swappiness" files.
I'll see what I can do, without breaking the whole thing!
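A sketch of such a boot-time workaround, assuming the RHEL 8 cgroup v1 layout (memory.swappiness files one or two levels below /sys/fs/cgroup/memory); it would be run as root, e.g. from a oneshot systemd unit:

```shell
# set_cgroup_swappiness ROOT VALUE: write VALUE into every
# memory.swappiness file found one or two levels below ROOT.
set_cgroup_swappiness() {
    root=$1
    value=$2
    for f in "$root"/*/memory.swappiness "$root"/*/*/memory.swappiness; do
        # Unmatched globs stay literal; the -f test skips them.
        [ -f "$f" ] && echo "$value" > "$f"
    done
    return 0
}

# On a real system, propagate the sysctl value into the cgroups:
#   set_cgroup_swappiness /sys/fs/cgroup/memory "$(sysctl -n vm.swappiness)"
```

Note that this only covers cgroups that exist at the time the script runs; cgroups created later still inherit the default.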

In conclusion, I believe this issue can be closed, as it's not a problem in MariaDB itself.

Comment by Marko Mäkelä [ 2022-10-31 ]

nunop, it is great that you were able to figure it out. I leave it to danblack to decide if anything could be improved in our default configuration files or documentation.

Comment by Daniel Black [ 2022-11-01 ]

I'm thinking that if Red Hat are well on the way to providing a solution, anything we document would quickly become obsolete.

Was the swap usage having a negative impact on QPS? Were the swap-in/swap-out rates rather high?

Maybe manual limits can be applied in the interim - https://www.freedesktop.org/software/systemd/man/systemd.resource-control.html#MemorySwapMax=bytes
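For the interim, such a limit could look like this (the 2G figure is an arbitrary example, not a recommendation):

```ini
# Added via "systemctl edit mariadb.service" (creates an override drop-in):
[Service]
MemorySwapMax=2G
```

The same property can also be changed on the running unit, without a restart, via "systemctl set-property mariadb.service MemorySwapMax=2G".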

Comment by Nuno [ 2022-11-01 ]

Hey danblack

Thank you for your reply.

Not sure if I understand what you mean with your first sentence, sorry!

In relation to QPS (queries per second?), I don't think I noticed any performance impact, but that might be because the swap is on an NVMe disk.

However, it's quite stressful/frightening to see swap getting close to (or sometimes even reaching) 100%, when there is still 35-40% of RAM available... (on a server with 128 GB of RAM)
(not to mention all the email alerts I keep getting because swap is almost full)

The strange thing to me is that swap increases a lot while "rsync" is running to transfer the backup to another server, while there's no evidence that rsync itself is using a lot of RAM.
This doesn't make any sense to me!
But every day, consistently, during the hour while rsync runs, the swap keeps increasing. After that, it takes the rest of the day for the swap usage to slowly come back down (e.g. to ~60% used).
Every day, however, the amount of swap that stays permanently in use builds up more and more.
That's why I feel the need to keep restarting MariaDB once a week or so, afraid that the server could go down because of this.
That's "totally unacceptable". But hopefully this will be resolved once the new AlmaLinux 8.7 is available, hopefully within the next 2 weeks, based on their release history.

Thank you for the tip about MemorySwapMax - I'll see if it can be useful in the meantime!

Have a very great day.

Comment by Daniel Black [ 2022-11-01 ]

First sentence: someday soon, Red Hat seem likely to deploy a solution. If I documented a workaround, it might not be compatible with the Red Hat one and might leave users in a worse situation.

Thanks for clarifying QPS, that was the right question.

I'd certainly hope that an OOM kill wasn't the next outcome of reaching 100% swap while free RAM remained, but I appreciate the stress of it.

I'd assume rsync is just filling the page cache with all the files it reads and writes, and somehow the MariaDB memory had a lower priority. I don't understand the logic in swappiness that leads to this.

With MemorySwapMax, you can write a value to the memory.swap.max cgroup file for mariadb at runtime, without a restart.

Have a good day too.

Comment by Daniel Black [ 2022-11-01 ]

Closing as "Not a bug", meaning not our bug.

Comment by Nuno [ 2022-11-01 ]

Cheers!

Yeah, based on this link (from one of my previous replies) - https://access.redhat.com/solutions/6785021

They say that the "right/best" thing to do is to start using cgroup v2.
Only later, in the second link I sent, do they mention the new sysctl option.

But yeah, I agree that this is a bug/issue with Red Hat, and not MariaDB, so I'm happy for you not to have to document anything, as it's an OS issue, and quite specific.
Eventually they should make cgroup v2 the default on new versions of RHEL, so...

Thanks!

Comment by Marko Mäkelä [ 2022-11-02 ]

nunop:

The strange thing to me is that swap increases a lot while "rsync" is running

Adding some calls to posix_fadvise() could help the Linux kernel to avoid polluting the file system cache with large files that are not going to be accessed any time soon. I encountered https://bugzilla.redhat.com/show_bug.cgi?id=841076 but did not check the current rsync source code.

There might also be an option (some LD_PRELOAD library "shim" similar to libeatmydata.so) that would inject some posix_fadvise() calls at suitable places. Yet another option might be to patch rsync to use O_DIRECT, but that would require all file accesses and memory buffers to be aligned with the underlying physical block size (typically 512 or 4096 bytes).
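As an aside, such an LD_PRELOAD shim already exists: the "nocache" utility (packaged in Debian and Fedora/EPEL) wraps a command and issues posix_fadvise(POSIX_FADV_DONTNEED) on the files it touches, and GNU dd can do the same for its own I/O. A sketch under those assumptions (the backup paths are hypothetical):

```shell
# drop_cache_copy SRC DST: copy SRC to DST while asking the kernel
# (via GNU dd's iflag/oflag=nocache, which uses posix_fadvise) not to
# keep the copied pages in the page cache.
drop_cache_copy() {
    dd if="$1" of="$2" bs=1M iflag=nocache oflag=nocache status=none
}

# For rsync itself, the "nocache" LD_PRELOAD shim achieves the same, e.g.:
#   nocache rsync -a /backups/mariadb/ backuphost:/backups/mariadb/
```

This avoids patching rsync or dealing with the alignment requirements of O_DIRECT.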

Comment by Richard Stracke [ 2023-01-20 ]

Another idea:

Transparent hugepages (THP, visible as AnonHugePages in /proc/meminfo) are applied to applications automatically, without any configuration, and they are enabled by default:
https://access.redhat.com/solutions/46111

THP is "intended to bring hugepage support automatically to applications, without requiring custom configuration. Transparent hugepage support works by scanning memory mappings in the background (via the 'khugepaged' kernel thread), attempting to find or create (by moving memory around) contiguous 2MB ranges of 4KB mappings, that can be replaced with a single hugepage."

But this can sometimes backfire:

"If an application maps a large range but only touches the first few bytes, it would traditionally consume only a single 4KB page of physical memory. With THP enabled, khugepaged can come and extend that 4KB page into a 2MB page, effectively bloating memory usage by 512x. (An example reproducer on this bug report actually demonstrates the 512x worst case!)"

https://blog.nelhage.com/post/transparent-hugepages/
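To rule THP in or out, the active mode can be read through sysfs; a quick sketch (the madvise switch at the end requires root and is shown commented out):

```shell
# Show the current THP mode; the value in [brackets] is the active one,
# e.g. "always madvise [never]". Falls back gracefully off-Linux.
thp_mode() {
    cat /sys/kernel/mm/transparent_hugepage/enabled 2>/dev/null \
        || echo "THP not available"
}
thp_mode

# To restrict THP to applications that opt in via madvise() (as root):
#   echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
```

Note this runtime change does not persist across reboots; a boot parameter (transparent_hugepage=madvise) would be needed for that.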

Comment by Nuno [ 2023-01-20 ]

Guys,
This Issue can probably be closed.

Since I'm using vm.force_cgroup_v2_swappiness=1 (added in the latest version of RHEL 8 / AlmaLinux 8), this is no longer an issue for me.

RAM usage does still eventually get high, but at least it takes months to get there, rather than 1-2 weeks!
Also, I'm likely "overusing" the available RAM anyway (in terms of the calculated maximum possible usage), and the server runs a lot more than just MariaDB, so it's likely not MariaDB's fault here.

As I said, with the sysctl option above, I'm no longer having this issue, so I'm happy!!

Thank you very much!

Comment by Marko Mäkelä [ 2023-01-20 ]

nunop, this ticket has already been closed as "not a (MariaDB) bug". Thank you for your update.

Generated at Thu Feb 08 10:05:52 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.