[MDEV-20455] memory leak in 10.4 series Created: 2019-08-30 Updated: 2021-10-07 Resolved: 2020-11-29 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Galera, Server |
| Affects Version/s: | 10.4.6, 10.4.7, 10.4.8 |
| Fix Version/s: | 10.4.14 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Matthias Merz | Assignee: | Oleksandr Byelkin |
| Resolution: | Fixed | Votes: | 6 |
| Labels: | Memory_leak, wsrep |
| Environment: | Docker image from docker-hub, 10.4.6-bionic; 3-node Galera setup on a Debian 10 host. |
| Attachments: |
| Issue Links: |
| Description |
|
MariaDB is configured with a 64 GB InnoDB buffer pool, which should lead to approx. 70-80 GB of memory consumption. Over time, consumption increases, sometimes in larger steps, sometimes gradually. After 47h of "uptime" we are currently at:
Unfortunately, this will eventually lead to an OOM condition within a few days, and after the OOM kill, the Galera IST will not work.
Please find the config file attached. The memory consumption is probably triggered by client access: if we redirect our loadbalancer to the next backend, memory grows there instead. OTOH, memory usage does not decrease when a node receives no queries, even after days. (We had to cut that experiment short after 3 days, because node2/node3 were threatening to break down.) |
| Comments |
| Comment by Matthias Merz [ 2019-10-07 ] |
|
I was also able to reproduce this memory leak without any Galera plugin loaded, so the problem seems to come from MariaDB itself. |
| Comment by dbteam [ 2020-07-02 ] |
|
Does this happen with MariaDB 10.4.13? |
| Comment by Matthias Merz [ 2020-07-02 ] |
|
Yes, it still happens with 10.4.13, but unfortunately I still cannot reproduce it on demand. It just happens under some rare access patterns, so it is probably some sort of concurrency issue (we have many short-lived connections mixed with some larger transactions).
This does not seem to match an RSS of 140 GB, or am I completely mistaken?
|
| Comment by Oleksandr Byelkin [ 2020-07-10 ] |
|
Is there any more specific info about the queries and so on? Right now I do not see anything to work with (no way to repeat). |
| Comment by Matthias Merz [ 2020-07-13 ] |
|
That was my original question. I know this is very hard to debug as long as we cannot reproduce it. It seems to be triggered by some access pattern in our live workloads; there is nothing to reproduce on a test system. That is why I had mainly asked how to find out in which allocation area the memory might be consumed. As you can see, the buffer pool looks OK, but there is a huge amount of RAM not showing up in the status variables I checked. So where in the server status can that info be retrieved? |
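For reference on the question above: MariaDB (unlike the MySQL-5.6-era performance_schema shipped in 10.4, which has no memory instrumentation) aggregates its own tracked allocations into global status variables. A sketch of the check; the variable names are real MariaDB status variables, and comparing them against the OS-level RSS shows how much memory lies outside the server's own accounting:

```sql
-- Total bytes currently allocated through the server's tracked allocators:
SHOW GLOBAL STATUS LIKE 'Memory_used';
-- Baseline allocated at startup, for comparison:
SHOW GLOBAL STATUS LIKE 'Memory_used_initial';
```

If RSS greatly exceeds Memory_used plus the InnoDB buffer pool, the growth is happening in allocations the server does not track, or in allocator-level fragmentation.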
| Comment by Mattias Bergvall [ 2020-07-29 ] |
|
Hello. I have this issue with MariaDB-server-10.4.13-1.el7.centos.x86_64 (May 20 16:37:35 Updated: galera-25.3.29-1.rhel7.el7.centos.x86_64). Same history, except minor differences in the timings, on all 5 nodes. I'm not using MaxScale (the application didn't like it, since it sometimes did reads immediately after a write and thus found inconsistencies).
What I've observed is that since May 25th/26th, all machines but one are increasing their swap usage until there is none left. And it is the nodes that don't get any query traffic, the ones that are just replicating data, that are "leaking".
Attaching a graph showing the swap usage in percent before and after the upgrade, where the change in behavior is obvious. Also supplying a 14-day view of the same metrics, where it is obvious that the node that gets hit by queries (the green line) behaves differently. This is the "top" output on the node that is "the master/active":
And this is the same from one of the others:
Hope this helps! |
| Comment by Oleksandr Byelkin [ 2020-07-29 ] |
|
Sorry, but it does not help much. I have no doubt that the problem exists (most of the above tries to prove its existence); I need something to localize the problem: what engine you use, whether you use replication, what kind of load there is, and so on. Status variables before, during, and at the end of the period may help... |
| Comment by Mattias Bergvall [ 2020-07-29 ] |
|
This is a 4-month graph of free swap in my 5-node cluster. The upgrade to 10.4 was made on May 25th-26th. The spikes correspond to node reboots. The green line, since most of June, is the main node that has gotten all the traffic for the last 55+ days. All others have been rebooted regularly. |
| Comment by Mattias Bergvall [ 2020-07-29 ] |
|
The storage engine in use is InnoDB only. Snapshot from the "main" node:
And from a "passive" node:
/etc/my.cnf.d/server (empty lines and comments removed; names and such replaced by "-----"):
|
| Comment by LuborJ [ 2020-09-07 ] |
|
I have a similar setup with the same problem. We have only a 10 GB innodb_buffer_pool_size on 128 GB RAM hardware, and MariaDB uses all of the memory every few days. |
| Comment by Mattias Bergvall [ 2020-11-25 ] |
|
This issue seems to have been resolved in MariaDB-server-10.4.14-1.el7.centos.x86_64 |
| Comment by Sergei Golubchik [ 2020-11-29 ] |
|
Thanks, I'll close it then. Given that we don't have enough information to reproduce the issue reliably, let's assume it was fixed, as you no longer experience it. If anyone still sees this, please comment and we'll reopen the issue. |
| Comment by Andrew Bierbaum [ 2021-03-30 ] |
|
We are seeing this same issue on 10.4.15 after a recent upgrade from an older version. These hosts run mariadb 10.4.15 on bionic, installed via apt-get using puppet.
We have two large hosts that are nearly identical in their setup. One runs a heavy automated load and the other is used for more ad-hoc queries; both are set to innodb_buffer_pool_size = 22G with innodb_buffer_pool_instances = 22. The one with light ad-hoc queries is steady at 24 GB of RAM usage. The one with heavy automated load is currently using 49 GB of RAM and slowly growing (see attached memory usage). InnoDB does not seem to be aware of much of the RAM usage.
Digging through variables and status has also been unfruitful in diagnosing the memory usage. Running "FLUSH TABLES" doesn't appear to affect memory usage either. We set innodb_buffer_pool_size to a fairly low 22 GB of the total 80 GB of memory on the host to try to slow and stabilize the issue. Memory usage may have started to level out, but it is far beyond what was estimated from innodb_buffer_pool_size = 22G. I can provide more details if that would be helpful, but I'm not sure what would help diagnose the issue.
We are seeing very similar behavior in our smaller and more easily changed test environments as well. On that host, clear memory steps occur when a timed, automated query load hits every 6 hours, with OOM reaping happening about every 2 days as currently configured. That host uses innodb_buffer_pool_size = 1456M with innodb_buffer_pool_instances = 2 on 8 GB of host memory. All these hosts carry out fairly heavy multi-channel replication. Thanks |
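One MariaDB-specific place worth checking when global status variables are unfruitful: information_schema.PROCESSLIST carries per-connection memory columns (a MariaDB extension, present in 10.4), which can show whether the growth is attributable to sessions at all. A sketch:

```sql
-- Top sessions by memory currently attributed to them (MariaDB extension
-- columns MEMORY_USED / MAX_MEMORY_USED, in bytes):
SELECT id, user, db, memory_used, max_memory_used
FROM information_schema.PROCESSLIST
ORDER BY memory_used DESC
LIMIT 10;
```

If the per-session totals stay small while RSS keeps climbing, the leak is in memory the server does not attribute to connections.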
| Comment by Maurice Gasco [ 2021-04-06 ] |
|
I'm having the same issue on 10.4.18, CentOS 8, with LimitNOFILE=200000. With 1200 active connections, memory grows to 140 GB in less than 24 hours. |
| Comment by Andrew Bierbaum [ 2021-04-30 ] |
|
I'm currently testing in lower environments, but it appears that switching memory allocators may solve the issue for us. I found a few mentions of tcmalloc on other tickets on jira.mariadb.org as well as a few other sites (see https://jira.mariadb.org/browse/MDEV-21372?focusedCommentId=144092&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-144092 and https://dba.stackexchange.com/a/273834).
Place the override in /etc/systemd/system/mariadb.service.d/override.conf, then reload and restart the service; this produces the expected allocator change.
Memory use appears stable on Ubuntu 18.04 with either mariadb 10.4.15 or 10.5.9. |
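The override file content itself was not preserved in this export. Based on the linked comments, a typical drop-in looks like the following sketch; the library path is an assumption and varies by distro (on CentOS/RHEL the tcmalloc library ships in gperftools-libs under a different path):

```ini
# /etc/systemd/system/mariadb.service.d/override.conf
# Sketch: preload tcmalloc instead of glibc malloc for mysqld.
# The .so path below is an assumption; verify it for your distribution.
[Service]
Environment="LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4"
```

After `systemctl daemon-reload` and `systemctl restart mariadb`, the active allocator can be verified from SQL with `SHOW GLOBAL VARIABLES LIKE 'version_malloc_library';`.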
| Comment by Eugene [ 2021-05-03 ] |
|
Hit the very same issue: mariadb 10.4.15 with galera-26.4.5 on Gentoo, Linux kernel 5.10.31.
Used memory of mysqld grew from 160 to 240 GB (within two weeks) with the system malloc library. |
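When comparing reports like this, it helps to record which allocator each affected server is actually running. MariaDB exposes this as a read-only system variable (it reports e.g. "system" for glibc malloc):

```sql
-- Which malloc implementation this mysqld is using:
SHOW GLOBAL VARIABLES LIKE 'version_malloc_library';
```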
| Comment by Roel Van de Paar [ 2021-09-29 ] |
|
Can anyone seeing this issue confirm whether they are NOT using replication? |
| Comment by Matthias Merz [ 2021-09-29 ] |
|
In my original report, I had also tried 10.4.6 without any replication and also observed memory consumption going up, so at least back then this was not replication-related. Also, I see the memory growth only on the node(s) receiving SQL requests, not on the replicating nodes without interactive sessions. So I'd assume the original bug is not related to Galera and, unfortunately, not entirely gone. OTOH it's practically impossible to reproduce this behaviour in our environment. It feels like it has gotten better (currently we are on 10.4.21), but it never went away completely. And another "single" instance on Debian buster (version 10.3.29) currently takes a 196 GB resident set with innodb_buffer_pool_size=64G (and no MyISAM or other table types in any mentionable amount). |
| Comment by Roel Van de Paar [ 2021-09-29 ] |
|
Thank you mmerz for the input, much appreciated. |
| Comment by shield [ 2021-10-07 ] |
|
Adding some information on the topic of NOT using replication. Memory usage initially increases to what one expects, as assigned to the service. Then, over a time frame of 2 to 4 weeks, memory usage increases very slowly until memory is completely depleted.
Configuration: most of these servers have 32 GB RAM with a swappiness of 1, with settings:
Calculating predicted RAM usage based on this query: ...a server with 32 GB is configured to use 16422.6 MB. If you need more information, I'll gladly provide it. |
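The query referenced above was not preserved in this export. A commonly circulated heuristic of the same shape (global buffers plus max_connections times per-session buffers) looks like the sketch below; all names are real server variables, but the formula is a rough configured ceiling, not an exact prediction:

```sql
-- Rough upper-bound estimate of configured memory, in MB:
SELECT ( @@innodb_buffer_pool_size
       + @@innodb_log_buffer_size
       + @@key_buffer_size
       + @@query_cache_size
       + @@max_connections * ( @@sort_buffer_size
                             + @@read_buffer_size
                             + @@read_rnd_buffer_size
                             + @@join_buffer_size
                             + @@thread_stack
                             + @@binlog_cache_size )
       ) / 1024 / 1024 AS predicted_max_mb;
```

A sustained RSS far above a figure like 16422.6 MB on a 32 GB server points at allocations outside these configured buffers (or allocator fragmentation) rather than at the settings themselves.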