MariaDB Server / MDEV-20698

Master slowly running out of memory and gets killed by oom-killer

Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version: 10.3.17
    • Fix Version: None
    • Component: Server
    • Environment: CentOS 7 (3.10.0-1062.1.1.el7.x86_64)

    Description

      Hi guys,

      We have been experiencing a serious problem since we upgraded from MariaDB 10.1.25 to MariaDB 10.3.17 about two weeks ago.

      The memory consumption of our master server slowly increases over time for no obvious reason, until the service gets killed by the oom-killer. This takes only about one day! For example, today at 9:00 AM the memory consumption of mysqld was 1,952 MB, while about two hours later (at 11:13 AM) it hit 3,484 MB (about 60% of the available RAM), even though innodb_buffer_pool_size is limited to 1 GB.
      To prevent the master from getting killed by the oom-killer, I manually restart the MariaDB service every evening.

      What's weird is that before the upgrade, the cluster consisting of one master and one slave worked like a charm. We have a cluster controlled by MaxScale, which performs automatic failover if the master gets killed, but that is not very pleasing if it happens regularly while people are working.

      Does anybody have an idea where this could suddenly come from, or how to find out what's going on? Could the database be corrupt?

      If you need more information or details, I will provide you with as much info as I can.

      I really hope that you guys can help me, as I have absolutely run out of ideas.

      Thank you in advance!

      Regards,
      matze

      Attachments

        Issue Links

          Activity

            matze Matthias created issue -

            kevg Eugene Kosov (Inactive) added a comment -

            Hi. Just a quick question: do you use triggers?
            matze Matthias added a comment -

            Hi Eugene,

            thank you for your answer!

            Yes, we use triggers, and we use a lot of stored procedures to perform write operations on the database (so no "direct" INSERT, UPDATE and DELETE operations). We also have some events running, but usually not during the day (only at night).

            Can this cause those problems? What has changed since version 10.1.25, where we never had a problem for over 9 months?

            Thanks!

            matze Matthias made changes -
            Priority: Major [ 3 ] → Critical [ 2 ]

            kevg Eugene Kosov (Inactive) added a comment -

            Sorry for the long answer.

            Honestly, I have no idea what memory can leak, and I don't think I can guess.

            The hard way requires significant effort from you. I see several ways:

            1) You can reproduce the memory leak with (a slice of) your data on some staging server and exclude different queries one by one, so we will ideally know which query leaks.

            2) It looks like a regression. A LOT of stuff changed between 10.1.25 and 10.3.17. It's unthinkable for me to go through the list of all commits and find the relevant one, and I don't even know what part of the code to look at. Maybe you can try your workload with different releases? For example, try some 10.2.xx and see whether you can reproduce your issue or not. Ideally I'd like to get versions 10.x.y and 10.x.(y+1) between which the regression appeared; well, maybe (y+3) or something. Then it would be reasonable to go through the list of commits between them and see what changed.

            3) Try a tool like the tcmalloc heap profiler or the bcc heap profiler I mentioned in https://jira.mariadb.org/browse/MDEV-19287. I want function names, not just their addresses. That may require installing an additional package with debug symbols.

            I've tried simple queries in simple tests, but no luck. Sorry, I can't spend my time guessing which query may leak while I have bugs with reproducible test cases. Triggers were just my guess. If you could help with a reasonably simple test case, I would be very glad.

            elenst Elena Stepanova added a comment -

            matze, is MDEV-20699 a new development of the same problem, or is it unrelated? Could it be the same issue, only mysqldump executes the "guilty" statements quickly, while during normal operation the same statements occur more rarely and hence the OOM takes more time?
            elenst Elena Stepanova made changes -
            Labels: crash → crash need_feedback
            matze Matthias added a comment -

            @Eugene: thank you for your answer!

            We will try to set up a staging server with different versions (10.2.x, 10.4.x) and run some tests. But in the end, I'm afraid we will have to set up the whole cluster again (which would be the best and cleanest option, I think).

            After some research, we found that what causes the memory leak might be our plugins (user-defined functions). We have one that logs messages to syslog, one that sends messages to a RabbitMQ broker, and one (self-written) that can be used to send emails (using libcurl internally). Those UDFs worked great for more than 9 months without any problem (or leaking memory), but when we updated MariaDB we also updated the operating system (CentOS 7) and its libraries (glibc and others), and we didn't recompile the plugins, just left them as-is. We're not 100% sure yet, but this seems to be the cause of the memory leak, as those functions get called very often during the day (causing the memory consumption to rise) and almost never during the night (keeping the consumed memory almost constant).
            What do you think? Is it possible that it's the plugins that now cause those memory leaks (in combination with new versions of e.g. glibc)?
            Unfortunately, I'm on vacation next week, so I can't recompile the plugins and run some tests, but I'll be back on October 21st.
            Until then, the master server automatically gets restarted every morning and then has enough memory to survive one more day...

            Regards,
            Matthias

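            (Editor's note: a common way UDF plugins like the ones described above leak is memory allocated in the plugin's init function, or per call, that is never released in the matching deinit. A minimal sketch of that pattern, with the server's UDF interface reduced to a hypothetical stand-in struct for illustration; the real UDF_INIT and function signatures live in mysql.h and carry more fields and arguments.)

            ```c
            #include <assert.h>
            #include <stdio.h>
            #include <stdlib.h>

            /* Hypothetical stand-in for the server's UDF_INIT struct,
             * reduced to the one field relevant here. */
            typedef struct { char *ptr; } UDF_INIT;

            /* init: called once per statement. A common pattern is to
             * allocate a scratch buffer and stash it in initid->ptr. */
            int sendmail_init(UDF_INIT *initid) {
                initid->ptr = malloc(1024);
                return initid->ptr == NULL;  /* non-zero signals init failure */
            }

            /* deinit: called when the statement ends. If this free() is
             * missing -- or deinit is never registered -- every call of the
             * UDF leaks 1 KiB, matching the "grows during the busy day,
             * flat at night" pattern described in the comment above. */
            void sendmail_deinit(UDF_INIT *initid) {
                free(initid->ptr);
                initid->ptr = NULL;
            }

            int main(void) {
                UDF_INIT initid = { NULL };
                assert(sendmail_init(&initid) == 0);
                sendmail_deinit(&initid);
                assert(initid.ptr == NULL);
                puts("ok");
                return 0;
            }
            ```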
            matze Matthias added a comment -

            @Elena, the issue you mentioned is not really related to this one. In this issue, memory leaks in little chunks until the server runs out of memory. The other issue is caused by dumping the stored procedures only (specifically by the command SHOW CREATE PROCEDURE), even when it is called from MySQL Workbench. During normal operation, a lot of stored procedures are called very often, because the whole business logic is made of stored procedures. I'm not sure whether "CALL sp_..." and "SHOW CREATE PROCEDURE" are somehow related.


            kevg Eugene Kosov (Inactive) added a comment -

            matze, hi. If you have simple memory leaks from malloc(), or indirect calls to malloc() by libcurl or something similar, then you can easily catch those without recompiling MariaDB or your plugins by using TCMalloc by Google. You can plug it in through LD_PRELOAD and use it as described here: http://goog-perftools.sourceforge.net/doc/heap_checker.html

            In general, I doubt that such ubiquitous libraries as glibc or libcurl have any memory leaks.

            MariaDB doesn't have any 'simple' memory leaks which can be detected by automatic tools. Different memory arenas exist in MariaDB; for example, triggers, stored procedures and prepared statements are stored in such arenas, and there is no simple way to detect leaks in them. I think you may have such a leak. To find it, you need to run MariaDB for a sufficient time and gather statistics on memory allocations from different function calls. Again, TCMalloc can collect such statistics. I think --inuse_space from https://gperftools.github.io/gperftools/heapprofile.html best suits our needs.

            I have no idea for how long you need to collect memory statistics to get enough of them. You may check the TCMalloc output yourself until you find something suspicious.
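            (Editor's note: the LD_PRELOAD approach suggested above can be sketched roughly as follows; this is a command sketch, not a tested recipe, and the library path and profile location are assumptions that vary by distribution.)

            ```sh
            # Run mysqld with the gperftools heap profiler preloaded.
            export LD_PRELOAD=/usr/lib64/libtcmalloc.so.4
            export HEAPPROFILE=/var/tmp/mysqld.prof   # dumps land in /var/tmp/mysqld.prof.NNNN.heap
            /usr/sbin/mysqld --user=mysql &

            # Later, inspect allocations still in use with pprof:
            pprof --inuse_space --text /usr/sbin/mysqld /var/tmp/mysqld.prof.0001.heap
            ```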
            matze Matthias added a comment -

            Hi Eugene,

            thank you for your answer. I'm going to try out the tools you mentioned as soon as possible.
            But this weekend, I'm going to recompile our plugins, drop and recreate the functions, and observe the memory consumption on Monday.
            I also absolutely doubt that glibc and libcurl themselves have memory leaks. But could it be possible that after the update of glibc and libcurl, our plugins (which at that point were still compiled against the former versions of these libraries) leak memory because something inside glibc (like function entry points) changed that our plugins aren't aware of?
            This is just a guess...

            Thanks and regards,
            Matthias


            kevg Eugene Kosov (Inactive) added a comment -

            As I understand binary compatibility, it means that you can use any minor version of a library with the same compiled application. And if there are no bugs in your application or the library, there should be no memory leaks. Well, maybe it's possible to use libc functions in a non-standard way and depend on undocumented behaviour, but I don't expect that's common.

            Anyway: "Given enough use, there is no such thing as a private implementation."
            This is a quote from https://www.hyrumslaw.com/
            matze Matthias made changes -
            Attachment mysqld.prof.zip [ 49328 ]
            Attachment mariadb_tcmalloc.service [ 49329 ]
            Attachment mysqld.prof.svg [ 49330 ]
            matze Matthias added a comment -

            Hi Eugene,

            what you said about binary compatibility sounds right. Still, I recompiled two of our three plugins, and as of today it seems like nothing has changed. I don't think I use libc functions in a non-standard way, as I tried to be very careful when writing that one plugin. But if you like, I can attach the source code of the two plugins I recompiled. The third plugin is hosted on GitHub (https://github.com/ssimicro/lib_mysqludf_amqp).

            Yesterday, I installed TCMalloc (gperftools) on our test system and did some tests with it to get a feeling for the tool.
            Because our test system, unlike our production server, does not seem to be affected by this memory leak, I tried something else to produce high memory consumption.
            As I described in MDEV-20699 (https://jira.mariadb.org/browse/MDEV-20699), memory also increases very fast when calling SHOW CREATE PROCEDURE, leading to the server getting killed by the oom-killer sooner or later.
            I will attach the profiling files and an SVG created with pprof. Unfortunately, there are no function names but only addresses, so I can't tell which part of mysqld uses how much memory. What am I doing wrong? Do you know how to get the function names? Do I need the debug build of MariaDB?
            Please also see the attached systemd service definition that sets the corresponding environment variables and preloads libtcmalloc.

            I hope this helps a little bit.

            Regards,
            Matthias

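            (Editor's note: a systemd drop-in along these lines is one way to set the environment described above; this is an untested configuration sketch, and the file paths and the debuginfo package name are assumptions. On CentOS 7, pprof can usually only resolve addresses into function names once the server's debug symbols are installed from a separate debuginfo package.)

            ```sh
            # Hypothetical drop-in, e.g. /etc/systemd/system/mariadb.service.d/tcmalloc.conf:
            #
            #   [Service]
            #   Environment="LD_PRELOAD=/usr/lib64/libtcmalloc.so.4"
            #   Environment="HEAPPROFILE=/var/tmp/mysqld.prof"
            #
            # Install debug symbols so pprof can show function names instead of addresses:
            debuginfo-install MariaDB-server   # package name is an assumption

            pprof --inuse_space --svg /usr/sbin/mysqld /var/tmp/mysqld.prof.0001.heap > mysqld.prof.svg
            ```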
            elenst Elena Stepanova made changes -
            Labels: crash need_feedback → crash
            julien.fritsch Julien Fritsch made changes -
            Labels: crash → crash need_verification
            danblack Daniel Black added a comment -

            I suggest looking at the bcc-tools memleak (example usage and output in MDEV-22809).

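            (Editor's note: the bcc memleak tool mentioned above can be attached to a running server roughly like this; a command sketch only, and the tool path is distribution-specific — on CentOS it typically comes from the bcc-tools package.)

            ```sh
            # Trace outstanding (not-yet-freed) allocations of the running mysqld,
            # printing user-space stack traces every 10 seconds:
            /usr/share/bcc/tools/memleak -p "$(pidof mysqld)" 10
            ```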
            serg Sergei Golubchik made changes -
            Workflow: MariaDB v3 [ 100044 ] → MariaDB v4 [ 141567 ]

            People

              Assignee: Unassigned
              Reporter: matze Matthias
              Votes: 0
              Watchers: 5
