  1. MariaDB ColumnStore
  2. MCOL-5285

SkySQL OOM Crash? Memory not being released? testing

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 6.3.1
    • Fix Version/s: 23.02.2
    • Component/s: PrimProc
    • Labels: None
    • Environment: SkySQL AWS 32x 128 single node

    Description

      Currently there is a customer whose memory appears not to be released in SkySQL.

      The current workaround is that RDBA/SRE has to manually run mcsShutdown and mcsStart every couple of days. However, the customer often has to file a ticket saying it has crashed and asking for a restart before the scheduled stop/start that clears memory.

      Link to logs and configs in comment below

          Activity

            allen.herrera Allen Herrera created issue -
            toddstoffel Todd Stoffel (Inactive) made changes -
            Rank Ranked higher
            toddstoffel Todd Stoffel (Inactive) made changes -
            Fix Version/s Icebox [ 22302 ]
            alexey.vorovich alexey vorovich (Inactive) made changes -
            Assignee David Hall [ david.hall ]
            toddstoffel Todd Stoffel (Inactive) made changes -
            Rank Ranked lower
            David.Hall David Hall (Inactive) made changes -
            Affects Version/s 6.3.1 [ 25801 ]
            allen.herrera Allen Herrera made changes -
            Attachment columnstoreMetrics.txt [ 66487 ]

            alexey.vorovich alexey vorovich (Inactive) added a comment -

            Yeah..

            The query log shows a very rich set of OLTP + queries. I don't think it is possible to try each one in-house and see the leak.
            However, we do see that the ExeMgr process's share of memory grows by 5-6% of total memory every 24h.

            We also know that ExeMgr was refactored and merged into PrimProc in 22.08.x. It is unknown whether this change fixes the problem.
            The question is whether we should try now or wait until Sky upgrades the image.

            toddstoffel allen.herrera gdorman
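The daily growth figure quoted above can be tracked with a small sampler. This is a minimal sketch, assuming a Linux host with /proc mounted; the helper names `rss_kb` and `sample_rss` are illustrative and not part of any ColumnStore tooling.

```shell
#!/bin/sh
# Print the resident set size (RSS) of a process in kB, read from /proc.
# Assumes Linux; rss_kb is a hypothetical helper name.
rss_kb() {
    awk '/^VmRSS:/ { print $2 }' "/proc/$1/status"
}

# Print one "epoch_seconds rss_kb" sample for the given pid.
sample_rss() {
    printf '%s %s\n' "$(date +%s)" "$(rss_kb "$1")"
}
```

One possible use is an hourly loop such as `while sleep 3600; do sample_rss "$(pidof ExeMgr)"; done >> exemgr_rss.log`, assuming `pidof` is available and the process is named ExeMgr on the affected version.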
            alexey.vorovich alexey vorovich (Inactive) made changes -
            Fix Version/s 22.11.1 [ 28458 ]
            Fix Version/s Icebox [ 22302 ]
            alexey.vorovich alexey vorovich (Inactive) made changes -
            Assignee David Hall [ david.hall ] Leonid Fedorov [ JIRAUSER48443 ]
            alexey.vorovich alexey vorovich (Inactive) made changes -
            Fix Version/s 23.02 [ 28209 ]
            Fix Version/s 23.03.1 [ 28458 ]
            alexey.vorovich alexey vorovich (Inactive) made changes -
            Labels triage mcs_cs_datamesh
            allen.herrera Allen Herrera made changes -
            Attachment columnstoreMetrics-11-16-2022.txt [ 67247 ]
            alexey.vorovich alexey vorovich (Inactive) made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            toddstoffel Todd Stoffel (Inactive) made changes -
            Rank Ranked higher
            leonid.fedorov Leonid Fedorov added a comment - - edited

            I built the profiling allocator shared object as follows:

            wget https://github.com/jemalloc/jemalloc/releases/download/5.3.0/jemalloc-5.3.0.tar.bz2
            tar xjf jemalloc-5.3.0.tar.bz2
            cd jemalloc-5.3.0
            ./configure --disable-fill --with-jemalloc-prefix="" --enable-shared --enable-prof
            make


            It is attached to the issue and can be downloaded here: jemalloc

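Before preloading a built or downloaded copy of the allocator, it may be worth confirming that the file really is an ELF object, since a failed download can leave an HTML error page under the same name. A minimal sketch; `is_elf` is a hypothetical helper, not part of jemalloc or ColumnStore.

```shell
#!/bin/sh
# Return success if the file starts with the 4-byte ELF magic (0x7f 'E' 'L' 'F').
# is_elf is a hypothetical helper name.
is_elf() {
    # Dump the first four bytes as hex and strip spaces and newlines.
    magic=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
    [ "$magic" = "7f454c46" ]
}
```

For example, `is_elf ./lib/libjemalloc.so.2 || echo "not an ELF file" >&2`, where the path is whatever location the shared object ended up in.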
            leonid.fedorov Leonid Fedorov made changes -
            Attachment libjemalloc.so-1.2 [ 67567 ]
            leonid.fedorov Leonid Fedorov made changes -
            Attachment libjemalloc.so-1.2 [ 67567 ]
            leonid.fedorov Leonid Fedorov made changes -
            Attachment libjemalloc.so-1.2 [ 67568 ]
            leonid.fedorov Leonid Fedorov added a comment - - edited

            This profiling allocator should be installed on one node with these steps:

            mkdir /heap_profile
            # put the shared object in this directory, under the name jemalloc.so
            chmod 777 /heap_profile
            

            then edit

            /usr/lib/systemd/system/mcs-primproc.service 
            

            and replace line

            ExecStart=/usr/bin/env bash -c "LD_PRELOAD=$(ldconfig -p | grep -m1 libjemalloc | awk '{print $1}') exec /usr/bin/PrimProc" 
            

            with

            ExecStart=/usr/bin/env bash -c "MALLOC_CONF=prof:true,prof_leak:true,lg_prof_sample:19,prof_final:true,stats_print:true,abort:false,abort_conf:false,prof_prefix:/heap_profile/PrimProc_heap_profile LD_PRELOAD=/heap_profile/jemalloc.so exec /usr/bin/PrimProc"
            

            reload the systemd configuration

            systemctl daemon-reload
            

            and restart primproc service

            service mcs-primproc restart
            

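            After the service has handled some workload, heap profile files should appear in /heap_profile/ with names starting with PrimProc_heap_profile (jemalloc appends .<pid>.<seq>.<type>.heap to the prefix). They contain heap usage information, and we want them for inspection.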

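The lg_ options in MALLOC_CONF are base-2 logarithms of byte counts, so it can help to expand them when choosing values. A small sketch; `lg_to_bytes`, `list_opts` and `opt_value` are hypothetical helper names, and the option string is copied from the ExecStart line above.

```shell
#!/bin/sh
# MALLOC_CONF value from the ExecStart line in this ticket:
#   prof / prof_leak / prof_final  enable profiling, leak report and a final dump at exit
#   lg_prof_sample                 log2 of the average number of bytes between samples
#   stats_print                    print allocator statistics at exit
#   abort / abort_conf             do not abort on warnings or invalid options
CONF='prof:true,prof_leak:true,lg_prof_sample:19,prof_final:true,stats_print:true,abort:false,abort_conf:false,prof_prefix:/heap_profile/PrimProc_heap_profile'

# Convert a jemalloc lg_* value (log base 2) to bytes.
lg_to_bytes() {
    echo $((1 << $1))
}

# Print one option per line for easier reading.
list_opts() {
    printf '%s\n' "$1" | tr ',' '\n'
}

# Extract the value of a single option by name.
opt_value() {
    list_opts "$1" | awk -F: -v k="$2" '$1 == k { print $2 }'
}

# Sampling interval in bytes for the configuration above (2^19).
lg_to_bytes "$(opt_value "$CONF" lg_prof_sample)"
```

With lg_prof_sample:19 the average sampling interval is 524288 bytes (512 KiB). The container variant later in this ticket adds lg_prof_interval:33, which by the same arithmetic is 8589934592 bytes (8 GiB) of allocation activity between interval dumps.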

            alexey.vorovich alexey vorovich (Inactive) added a comment -

            alan.mologorsky let's convert the instructions above from leonid.fedorov into a version applicable to an existing Docker container, which does NOT have systemd.

            Rough outline, which I am asking you to expand and try:

            • run a 6.3.x container
            • mcs cluster stop
            • stop cmapi-server and mariadb
            • instructions for setting MALLOC_CONF
            • start cmapi and mariadb
            • mcs cluster start
            • run workload
            • collect profile

            Everyone understands that this is a non-persistent setup and will not survive a pod restart. This is just the first step.

            leonid.fedorov please edit your instructions to note the location of the shared object. Maybe create a jmalloc_test folder on https://cspkg.s3.amazonaws.com/
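The last step of the outline, collecting the profile, could look roughly like the following. This is a sketch only: the helper name `collect_profiles` and the choice of a plain tar archive are assumptions, and the file name prefix is the prof_prefix value used elsewhere in this ticket.

```shell
#!/bin/sh
# Bundle heap profile dumps into a tar archive for upload.
# Usage: collect_profiles <profile_dir> <output.tar>
# collect_profiles is a hypothetical helper; it matches files on the
# PrimProc_heap_profile prefix rather than on a specific suffix.
collect_profiles() {
    dir=$1
    out=$2
    # Fail early if no dump has been written yet.
    set -- "$dir"/PrimProc_heap_profile*
    [ -e "$1" ] || { echo "no profiles in $dir" >&2; return 1; }
    # Store paths relative to the profile directory.
    ( cd "$dir" && tar -cf - PrimProc_heap_profile* ) > "$out"
}
```

For example, `collect_profiles /heap_profile /tmp/primproc_profiles.tar` followed by copying the archive out of the container before the pod restarts, since the setup is non-persistent.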

            leonid.fedorov Leonid Fedorov added a comment -

            mkdir /heap_profile
            chmod 777 /heap_profile
            curl -o /heap_profile/jemalloc.so https://jira.mariadb.org/secure/attachment/67568/libjemalloc.so-1.2
            sed -i 's@$MCS_INSTALL_BIN/PrimProc@MALLOC_CONF=prof:true,prof_leak:true,lg_prof_sample:19,lg_prof_interval:33,prof_final:true,stats_print:true,abort:false,abort_conf:false,prof_prefix:/heap_profile/PrimProc_heap_profile LD_PRELOAD=/heap_profile/jemalloc.so $MCS_INSTALL_BIN/PrimProc@g' /usr/share/columnstore/cmapi/mcs_node_control/custom_dispatchers/container.sh
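The sed replacement text contains the same string it searches for, so running the command a second time would insert the prefix twice. A guarded variant might look like this; it is a sketch under that observation, with `patch_launcher` as a hypothetical helper name and the option values copied from the command above.

```shell
#!/bin/sh
# Prefix added in front of the PrimProc launch command (values from this ticket).
PRELOAD='LD_PRELOAD=/heap_profile/jemalloc.so'
CONF='MALLOC_CONF=prof:true,prof_leak:true,lg_prof_sample:19,lg_prof_interval:33,prof_final:true,stats_print:true,abort:false,abort_conf:false,prof_prefix:/heap_profile/PrimProc_heap_profile'

# Patch a launcher script once and keep a .orig backup next to it.
# patch_launcher is a hypothetical helper name.
patch_launcher() {
    file=$1
    # Skip files that already carry the preload prefix.
    if grep -qF "$PRELOAD" "$file"; then
        echo "already patched: $file"
        return 0
    fi
    cp -p "$file" "$file.orig" || return 1
    # A '$' that is not at the end of a basic regex is a literal dollar sign,
    # so this matches the text $MCS_INSTALL_BIN/PrimProc as written in the script.
    sed "s@\$MCS_INSTALL_BIN/PrimProc@$CONF $PRELOAD \$MCS_INSTALL_BIN/PrimProc@g" \
        "$file.orig" > "$file"
}
```

It would be called as `patch_launcher /usr/share/columnstore/cmapi/mcs_node_control/custom_dispatchers/container.sh`; restoring the `.orig` copy undoes the change.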
            alexey.vorovich alexey vorovich (Inactive) made changes -
            Fix Version/s 23.02.1 [ 28701 ]
            Fix Version/s 23.02 [ 28209 ]
            toddstoffel Todd Stoffel (Inactive) made changes -
            Rank Ranked higher
            alexey.vorovich alexey vorovich (Inactive) made changes -
            Fix Version/s 23.02 [ 28209 ]
            Fix Version/s 23.02.1 [ 28701 ]
            alexey.vorovich alexey vorovich (Inactive) made changes -
            Summary SkySQL OOM Crash? Memory not being released? SkySQL OOM Crash? Memory not being released? testing
            alexey.vorovich alexey vorovich (Inactive) made changes -
            Component/s PrimProc [ 13700 ]
            Resolution Fixed [ 1 ]
            Status In Progress [ 3 ] Closed [ 6 ]
            alexey.vorovich alexey vorovich (Inactive) made changes -
            Fix Version/s 23.02.2 [ 28713 ]
            Fix Version/s 23.02 [ 28209 ]
            julien.fritsch Julien Fritsch made changes -
            Labels mcs_cs_datamesh

            People

              leonid.fedorov Leonid Fedorov
              allen.herrera Allen Herrera