[MDEV-20698] Master slowly running out of memory and gets killed by oom-killer Created: 2019-09-30 Updated: 2020-09-30 |
|
| Status: | Open |
| Project: | MariaDB Server |
| Component/s: | Server |
| Affects Version/s: | 10.3.17 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Matthias | Assignee: | Unassigned |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | crash, need_verification | ||
| Environment: |
CentOS 7 (3.10.0-1062.1.1.el7.x86_64) |
||
| Description |
|
Hi guys, we have been experiencing a serious problem since we upgraded from MariaDB 10.1.25 to MariaDB 10.3.17 about two weeks ago. The memory consumption of our master server slowly increases over time for no obvious reason until the service gets killed by the oom-killer. This takes only about one day! For example, today at 9:00 AM the memory consumption of mysqld was 1,952 MB, while about two hours later (at 11:13 AM) it hit 3,484 MB (that's about 60% of available RAM), even though innodb_buffer_pool_size is limited to 1 GB. What's weird is that before the upgrade, the cluster consisting of one master and one slave worked like a charm. Even though we have a cluster controlled by MaxScale, which performs automatic failover if the master gets killed, this is not very pleasant if it happens regularly while people are working. Does anybody have an idea where this could suddenly come from or how to find out what's going on? Is the database corrupt? If you need more information or details, I will provide you with as much info as I can. I really hope that you guys can help me, as I have completely run out of ideas. Thank you in advance! Regards, |
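For readers who want to quantify this kind of growth, here is a minimal Python sketch that reads the resident set size of a process from `/proc/<pid>/status` on Linux and turns two readings into a growth rate. The function names are hypothetical, and the report does not say how its figures were measured, so treating them as resident memory is an assumption.

```python
import re


def parse_vmrss_kb(status_text):
    """Extract the VmRSS value (in kB) from the text of /proc/<pid>/status."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("no VmRSS line found in status text")
    return int(match.group(1))


def read_rss_mb(pid):
    """Read the current resident set size of a process in MB (Linux only)."""
    with open(f"/proc/{pid}/status") as fh:
        return parse_vmrss_kb(fh.read()) / 1024.0


def growth_mb_per_hour(mb_start, mb_end, elapsed_minutes):
    """Average memory growth between two readings, in MB per hour."""
    if elapsed_minutes <= 0:
        raise ValueError("elapsed_minutes must be positive")
    return (mb_end - mb_start) / (elapsed_minutes / 60.0)


if __name__ == "__main__":
    # The two readings quoted in the report: 9:00 AM and 11:13 AM (133 minutes).
    rate = growth_mb_per_hour(1952, 3484, 133)
    print(f"average growth: {rate:.0f} MB/hour")
```

Calling `read_rss_mb` at a fixed interval (for example from cron) and logging the result would show whether the growth follows the daytime workload.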
| Comments |
| Comment by Eugene Kosov (Inactive) [ 2019-09-30 ] |
|
Hi. Just a quick question. Do you use triggers? |
| Comment by Matthias [ 2019-09-30 ] |
|
Hi Eugene, thank you for your answer! Yes, we use triggers. And we use a lot of stored procedures to perform write operations to the database (so no "direct" INSERT, UPDATE and DELETE operations). We also have some events running, but usually not during the day (only at night). Can this cause those problems? What has changed since version 10.1.25, where we never had a problem for over 9 months...? Thanks! |
| Comment by Eugene Kosov (Inactive) [ 2019-10-02 ] |
|
Sorry for the long answer. Honestly, I have no idea what memory can leak, and I don't think I can guess. Finding out the hard way requires significant effort from you. I see several ways: 2) It looks like a regression. A LOT of stuff changed between 10.1.25 and 10.3.17. It's unthinkable for me to go through the list of all commits and find the relevant one; I don't even know which part of the code to look at. Maybe you can try your workflow with different releases? Like, try some 10.2.xx and see whether you can reproduce your issue or not. Ideally I'd like to get versions 10.x.y and 10.x.(y+1) between which the regression appeared. Well, maybe (y+3) or something. Then it would be reasonable to go through the list of commits and see what changed. 3) Try to use some tool like the tcmalloc heap profiler or the bcc heap profiler I mentioned in https://jira.mariadb.org/browse/MDEV-19287. I want function names, not just their addresses. That may require installing an additional package with debug symbols. I've tried simple queries in simple tests but had no luck. Sorry, I can't spend my time guessing which query may leak when I have bugs with reproducible test cases. Triggers were just my guess. If you could help with a reasonably simple test case, I would be very glad. |
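The release-narrowing idea in point 2 amounts to a bisection over an ordered list of releases. Below is a minimal Python sketch of that search. It assumes the oldest release in the list is known to be good, the newest is known to leak, and the leak persists in every release after it first appears. The `leaks` callback stands for a manual test run on a staging server, and the version list is purely illustrative.

```python
def narrow_regression(releases, leaks):
    """Return the adjacent (last_good, first_bad) pair of releases.

    releases: version strings ordered oldest to newest; the first is assumed
              good and the last is assumed bad.
    leaks:    callable taking a version string and returning True if the
              workload leaks memory on that version (hypothetical helper,
              in practice a manual test on a staging server).
    """
    if len(releases) < 2:
        raise ValueError("need at least one good and one bad release")
    good, bad = 0, len(releases) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if leaks(releases[mid]):
            bad = mid   # regression is at mid or earlier
        else:
            good = mid  # regression is after mid
    return releases[good], releases[bad]


if __name__ == "__main__":
    # Illustrative candidates only; pick the releases actually available to you.
    candidates = ["10.1.25", "10.2.10", "10.2.20", "10.3.10", "10.3.17"]
    # Stand-in predicate for demonstration: pretend the leak starts in 10.3.10.
    pretend_leaky = {"10.3.10", "10.3.17"}
    print(narrow_regression(candidates, lambda v: v in pretend_leaky))
```

Each call to `leaks` may mean a day of running the real workload, so halving the range at every step keeps the number of test runs small.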
| Comment by Elena Stepanova [ 2019-10-06 ] |
|
matze, Is |
| Comment by Matthias [ 2019-10-10 ] |
|
@Eugene: thank you for your answer! We will try to set up a staging server with different versions (10.2.x, 10.4.x) and run some tests. But in the end, I'm afraid we have to set up the whole cluster again (which would be the best and cleanest, I think). After some research we found out that what causes memory to leak might be our plugins (user-defined functions). We have one that logs messages to syslog, one that sends messages to a RabbitMQ broker and one (self-written) that can be used to send emails (using libcurl internally). Although those UDFs worked great for more than 9 months without any problem (or leaking memory), as we updated MariaDB we also updated the operating system (CentOS 7) and its libraries (like glibc and others). But we didn't recompile the plugins and just left them as-is. We're not 100% sure yet, but this seems to be the cause of the memory leak, as those functions get called very often during the day (causing the memory consumption to rise) and almost never during the night (keeping the consumed memory almost "constant"). Regards, |
| Comment by Matthias [ 2019-10-10 ] |
|
@Elena, the issue you mentioned is not really related to this one. In this issue, memory leaks in small chunks until the server runs out of memory. The other issue is triggered only by dumping the stored procedures (specifically by the command SHOW CREATE PROCEDURE), even when it is called from within MySQL Workbench. During normal operation, a lot of stored procedures are called very often, because the whole business logic is made of stored procedures. I'm not sure if "CALL sp_..." and "SHOW CREATE PROCEDURE" are somehow related? |
| Comment by Eugene Kosov (Inactive) [ 2019-10-10 ] |
|
matze, hi. If you have simple memory leaks from malloc(), or from indirect calls to malloc() by libcurl or something similar, then you can easily catch those without recompiling MariaDB or your plugins by using TCMalloc by Google. You can plug it in through LD_PRELOAD and use it as described here: http://goog-perftools.sourceforge.net/doc/heap_checker.html In general, I doubt that such ubiquitous libraries as glibc or libcurl have any memory leaks. MariaDB doesn't have any 'simple' memory leaks which can be detected by automatic tools. Different memory arenas exist in MariaDB. For example, triggers, stored procedures and prepared statements are stored in such arenas, and there is no simple way to detect leaks in them. I think you may have such a leak. To find it, you need to run MariaDB for a sufficient time and gather statistics on memory allocations from different function calls. Again, TCMalloc can collect such statistics. I think --inuse_space from https://gperftools.github.io/gperftools/heapprofile.html best suits our needs. I have no idea how long you need to collect memory statistics to get enough of them. You may check the TCMalloc output yourself until you find something suspicious. |
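As a rough illustration of the LD_PRELOAD approach described above, the following Python sketch builds the environment for a profiled run and the `pprof` command line for reading a heap dump. `LD_PRELOAD`, the gperftools `HEAPPROFILE` variable and the `--inuse_space` option are documented; the library path, binary path, dump prefix and function names are assumptions that will differ per system.

```python
import os


def heap_profile_env(tcmalloc_path, profile_prefix, base_env=None):
    """Return a copy of the environment with TCMalloc heap profiling enabled.

    tcmalloc_path:  path to the full libtcmalloc shared object (assumed to be
                    installed by the gperftools package).
    profile_prefix: prefix for the heap dump files, set via HEAPPROFILE.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["LD_PRELOAD"] = tcmalloc_path
    env["HEAPPROFILE"] = profile_prefix
    return env


def pprof_inuse_command(binary, heap_file):
    """Build a pprof command that lists live allocations by call site."""
    return ["pprof", "--inuse_space", "--text", binary, heap_file]


if __name__ == "__main__":
    # All paths below are assumptions for a CentOS 7 style layout.
    env = heap_profile_env("/usr/lib64/libtcmalloc.so.4", "/tmp/mysqld.hprof")
    print("LD_PRELOAD=" + env["LD_PRELOAD"], "HEAPPROFILE=" + env["HEAPPROFILE"])
    print(" ".join(pprof_inuse_command("/usr/sbin/mysqld",
                                       "/tmp/mysqld.hprof.0001.heap")))
```

If mysqld is started by a service manager rather than by hand, the two variables would have to be set in the service's own environment instead of the calling shell.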
| Comment by Matthias [ 2019-10-25 ] |
|
Hi Eugene, thank you for your answer. I'm going to try out the tools you mentioned as soon as possible. Thanks and regards, |
| Comment by Eugene Kosov (Inactive) [ 2019-10-28 ] |
|
As I understand binary compatibility, it means that you can use any minor version of a library with the same compiled application. And if there are no bugs in your application or the library, there should be no memory leaks. Well, maybe it's possible to use libc functions in a non-standard way and depend on undocumented behaviour, but I don't expect that's common. Anyway, "given enough use, there is no such thing as a private implementation". |
| Comment by Matthias [ 2019-10-30 ] |
|
Hi Eugene, what you said about binary compatibility sounds right. Still, I recompiled two of our three plugins, and as of today it seems like nothing has changed. I don't think I use libc functions in a non-standard way, as I tried to be very careful writing this one plugin. But if you like, I can attach the source code of the two plugins I recompiled. The third plugin is hosted on GitHub (https://github.com/ssimicro/lib_mysqludf_amqp). Yesterday, I installed TCMalloc (gperftools) on our test system and did some tests with it to get a feeling for this tool. I hope this helps a little bit. Regards, |
| Comment by Daniel Black [ 2020-09-30 ] |
|
Suggest looking at the bcc-tools memleak (example usage and output in MDEV-22809). |