MariaDB Server / MDEV-14460

Memory leak with only SELECT statements

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 10.2.9
    • Fix Version/s: 10.2.15
    • None
    • Environment: Debian 8. 16 GB RAM. 8 CPU (Xeon E5-1620 v2).

    Description

      The same IO-bound scenario was run on 2 identical machines, one with RocksDB and the other with TokuDB. In a Zope application (ERP5, https://www.erp5.org), using NEO (https://neo.nexedi.com/) with MariaDB as storage, we filled the disks with a few data streams. The memory leak happened when reading a single whole data stream.

      The result is that ERP5/NEO does 2 queries in a loop. Here is an example from the general log:

      SELECT tid, compression, data.hash, value, value_tid FROM obj FORCE INDEX(`partition`) LEFT JOIN data ON (obj.data_id = data.id) WHERE `partition` = 6 AND oid = 94294686 AND tid < 271213171275484455 ORDER BY tid DESC LIMIT 1
      SELECT tid FROM obj FORCE INDEX(`partition`) WHERE `partition`=6 AND oid=94294686 AND tid>271177614835290848 ORDER BY tid LIMIT 1
      

      More about NEO's architecture is described at https://www.nexedi.com/blog/NXD-Document.Blog.Optimising.MariaDB.Big.Data

      The data stream is 1.5TB uncompressed, taking about 650MB on-disk. RocksDB got OOM-killed when 1TB was read. TokuDB respected its memory constraints until the end.

      Each machine has 2 disks of 2TB, each one with a MariaDB DB that holds half of the Zope DB. In other words, we have 2 RocksDB databases of about 2TB, with 8GB of RAM for the 2 mysqld processes (rocksdb_block_cache_size = 2G).

      Attachments

        Issue Links

          Activity

            jmuchemb Julien Muchembled created issue -

            monty Michael Widenius added a comment -

            Can you please make a full test case that we can reproduce?
            This includes creating the needed tables, having a script that fills them with data, and running the queries that cause the memory overrun.

            An alternative is to link MariaDB with tcmalloc and run the test on your machines. This would give us
            a log of the memory usage that could help us locate the problem:

            https://mariadb.com/kb/en/library/debugging-a-running-server-on-linux/
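            For reference, here is a minimal standalone sketch of the kind of heap profiling the linked KB article relies on, using the gperftools (tcmalloc) heap-profiler API directly. This is not MariaDB code; the program name, output prefix and allocation pattern are made up for illustration. In practice one would normally start mysqld with tcmalloc preloaded and the HEAPPROFILE environment variable set, then feed the resulting .heap files to pprof, as done later in this issue.

            {code}
            // Hypothetical standalone example: produce tcmalloc heap profiles that pprof can read.
            // Build (assuming gperftools is installed): g++ heap_demo.cc -o heap_demo -ltcmalloc
            #include <gperftools/heap-profiler.h>
            #include <vector>

            int main() {
              // Profiles are written as /tmp/heap_demo.0001.heap, /tmp/heap_demo.0002.heap, ...
              HeapProfilerStart("/tmp/heap_demo");

              std::vector<char*> blocks;
              for (int i = 0; i < 1000; ++i) {
                blocks.push_back(new char[64 * 1024]);   // never freed: simulates a leak
                if (i % 250 == 0)
                  HeapProfilerDump("checkpoint");        // force an intermediate dump
              }

              HeapProfilerStop();
              // Analyze with e.g.: pprof --svg ./heap_demo /tmp/heap_demo.0004.heap > heap.svg
              return 0;
            }
            {code}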
            monty Michael Widenius made changes -
            Field Original Value New Value
            Assignee Sergei Petrunia [ psergey ]
            jmuchemb Julien Muchembled made changes -
            Attachment mysqld.svg [ 44753 ]
            jmuchemb Julien Muchembled added a comment - edited

            If the pattern (when filling or reading tables) matters, writing a test case can be a lot of work. For the moment, I prefer the other way.

            We use jemalloc. Something must be missing on their wiki because I was unable to get any profiling data. So I switched to tcmalloc.

            [edit: removed, too many mistakes from me]

            jmuchemb Julien Muchembled made changes -
            Attachment mysqld.svg [ 44753 ]
            jmuchemb Julien Muchembled made changes -
            Attachment mysqld.svg [ 44754 ]

            jmuchemb Julien Muchembled added a comment -

            Reuploaded. Apparently, I must call pprof with a single .heap file.

            jmuchemb Julien Muchembled added a comment -

            I don't know if it matters, but I've just realized that all the SELECTs are done in a single transaction.

            psergei Sergei Petrunia added a comment -

            Looking at positions where memory is leaked:

            Case1   0x00007f0b31ab6ab0
            Case2   0x00007f0b31ab6ab0
            Case3   0x00007f0b31abb3eb
            Case4   0x00007f0b31d011ca  (biggest; called from Case3)

            Case4   0x00007f0b32ca2d78  (called from ha_initialize_handlerton)

            ha_toku start 7f0b32c48000
            ha_toku   end 7f0b32feb000

            So, most of the leaks are outside ha_tokudb.so, it seems?
            jmuchemb, can you share:

            • which binaries you are using?
            • the output of nm --demangle --numeric-sort -S /path/to/ha_rocksdb.so? This should make it possible for us to resolve the addresses.

            jmuchemb Julien Muchembled added a comment -

            Sorry for the confusion. I edited my first reply to remove parts that aren't pertinent anymore. Can you please check the svg I reuploaded? All addresses are resolved.

            psergei Sergei Petrunia added a comment -

            (jmuchemb, thanks for the re-uploaded file. It helped.)

            Discussed this with Herman Lee.

            Discussion results:

            • The memory is consumed by rocksdb::TransactionBaseImpl::SetSavePoint.
            • Debugging, we can see that this scenario

            begin;
            select #1;
            select #2;
            ...
            select #n;

            will create a new savepoint for each SELECT. A transaction doing lots of SELECTs will end up using a lot of memory.

            This can be fixed, for example, by not creating a savepoint when the statement is a SELECT statement. We can see this in the ha_rocksdb::external_lock() call.

            That call is done for each table; for statements that do both reads and writes (e.g. INSERT ... SELECT), the first call is external_lock(F_WRLCK) for the table that we're going to write to. Note: if we haven't allocated a savepoint at statement start, we should not try to remove it at statement end. A sketch of this idea follows below.
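            To illustrate the idea only (this is not the actual MyRocks patch, and the class and function names below are made up): a self-contained model of a per-statement savepoint policy where a savepoint is created only for statements that take a write lock, and statement rollback only pops a savepoint that was actually created. The real fix lives in ha_rocksdb.cc and the RocksDB transaction API.

            {code}
            // Hypothetical model of the proposed policy (not MariaDB/MyRocks code).
            #include <cstdio>
            #include <vector>

            // Stand-in for a rocksdb-style transaction that keeps state per savepoint.
            struct Txn {
              std::vector<size_t> savepoints;          // grows with every SetSavePoint()
              void SetSavePoint()        { savepoints.push_back(savepoints.size()); }
              void RollbackToSavePoint() { if (!savepoints.empty()) savepoints.pop_back(); }
            };

            // Per-statement state, mirroring what external_lock(F_WRLCK/F_RDLCK) could track.
            struct StatementScope {
              Txn& txn;
              bool created_savepoint = false;

              StatementScope(Txn& t, bool may_write) : txn(t) {
                if (may_write) {                       // only write statements need a savepoint
                  txn.SetSavePoint();
                  created_savepoint = true;
                }
              }
              // Called only if the statement fails and its changes must be undone.
              void rollback_stmt() {
                if (created_savepoint)                 // never pop a savepoint we never created
                  txn.RollbackToSavePoint();
              }
            };

            int main() {
              Txn txn;                                                // begin;
              for (int i = 0; i < 1000000; ++i) {
                StatementScope stmt(txn, /*may_write=*/false);        // select #i;
                // ... read rows ...
              }
              // With a savepoint created unconditionally per statement, this transaction
              // would accumulate ~1M savepoints; with the guarded policy it stays constant.
              std::printf("savepoints kept: %zu\n", txn.savepoints.size());
              return 0;
            }
            {code}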
            julien.fritsch Julien Fritsch made changes -
            julien.fritsch Julien Fritsch made changes -
            elenst Elena Stepanova made changes -
            Fix Version/s 10.2 [ 14601 ]
            serg Sergei Golubchik made changes -
            Priority Major [ 3 ] Critical [ 2 ]
            julien.fritsch Julien Fritsch made changes -
            Description edited (the stored original and new values differ only in the ERP5 link protocol, https vs http; the full text is omitted here as it duplicates the Description above)

            psergei Sergei Petrunia added a comment -

            Started an attempt to implement it and encountered this: https://github.com/facebook/mysql-5.6/issues/802 (not related to memory consumption as such, but it shows that both we and upstream have the RollbackToSavepoint() call in the wrong place).
            psergei Sergei Petrunia added a comment - edited

            The patch for issue #802 here: https://github.com/spetrunia/mysql-5.6/commit/612ec75814f30076f53d294cd0267533c807022f is also a fix for this MDEV.

            That tree is based on facebook/mysql-5.6, but the patch should apply to MariaDB as well.

            Next steps: I'll discuss this with the upstream and then cherry-pick the fix into MariaDB (so that we get it in MariaDB sooner than the merge from upstream happens).


            psergei Sergei Petrunia added a comment -

            Applied the patch for upstream issue #802. It fixes this issue as well.
            psergei Sergei Petrunia made changes -
            Fix Version/s 10.2.15 [ 23006 ]
            Fix Version/s 10.2 [ 14601 ]
            Resolution Fixed [ 1 ]
            Status Open [ 1 ] Closed [ 6 ]
            psergei Sergei Petrunia made changes -
            serg Sergei Golubchik made changes -
            Workflow MariaDB v3 [ 83924 ] MariaDB v4 [ 153223 ]

            People

              Assignee: psergei Sergei Petrunia
              Reporter: jmuchemb Julien Muchembled
              Votes: 0
              Watchers: 6

              Dates

                Created:
                Updated:
                Resolved:
