[MDEV-14460] Memory leak with only SELECT statements Created: 2017-11-21  Updated: 2018-04-19  Resolved: 2018-04-12

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - RocksDB
Affects Version/s: 10.2.9
Fix Version/s: 10.2.15

Type: Bug Priority: Critical
Reporter: Julien Muchembled Assignee: Sergei Petrunia
Resolution: Fixed Votes: 0
Labels: None
Environment:

Debian 8. 16 GB RAM. 8 CPU (Xeon E5-1620 v2).


Attachments: File mysqld.svg    
Issue Links:
Relates
relates to MDEV-13739 Huge slowness at inserting rows, CPU-... Confirmed

 Description   

The same IO-bound scenario was run on 2 identical machines, one with RocksDB and the other with TokuDB. In a Zope application (ERP5), using NEO/MariaDB as storage, we filled the disks with a few data streams. The memory leak happened when reading a single whole data stream.

The result is that ERP5/NEO does 2 queries in a loop. Here is an example from the general log:

SELECT tid, compression, data.hash, value, value_tid FROM obj FORCE INDEX(`partition`) LEFT JOIN data ON (obj.data_id = data.id) WHERE `partition` = 6 AND oid = 94294686 AND tid < 271213171275484455 ORDER BY tid DESC LIMIT 1
SELECT tid FROM obj FORCE INDEX(`partition`) WHERE `partition`=6 AND oid=94294686 AND tid>271177614835290848 ORDER BY tid LIMIT 1

More about NEO's architecture is described at https://www.nexedi.com/blog/NXD-Document.Blog.Optimising.MariaDB.Big.Data

The data stream is 1.5TB uncompressed, taking about 650MB on-disk. RocksDB got OOM-killed when 1TB was read. TokuDB respected its memory constraints until the end.

Each machine has 2 disks of 2TB, each one with a MariaDB DB that holds half of the Zope DB. IOW, we have 2 RocksDB databases of about 2TB. 8GB of RAM for the 2 mysqld (rocksdb_block_cache_size = 2G).



 Comments   
Comment by Michael Widenius [ 2017-12-05 ]

Can you please provide a full test case that we can use to reproduce this?
This includes creating the needed tables, a script that fills them with data, and running the queries that cause the memory overrun.

An alternative is to link MariaDB with tcmalloc and run the test on your machines. This would give us
a log of the memory usage that could help us locate the problem:

https://mariadb.com/kb/en/library/debugging-a-running-server-on-linux/

Comment by Julien Muchembled [ 2017-12-07 ]

If the pattern (when filling or reading tables) matters, writing a test case can be a lot of work. For the moment, I prefer the other way.

We use jemalloc. Something must be missing on their wiki because I was unable to get any profiling data. So I switched to tcmalloc.

[edit: removed, too many mistakes from me]

Comment by Julien Muchembled [ 2017-12-07 ]

Reuploaded. Apparently, I must call pprof with a single .heap file.

Comment by Julien Muchembled [ 2017-12-07 ]

I don't know if it matters, but I've just realized that all SELECTs are done in a single transaction.

Comment by Sergei Petrunia [ 2017-12-07 ]

Looking at positions where memory is leaked:

Case 1   0x00007f0b31ab6ab0
Case 2   0x00007f0b31ab6ab0
Case 3   0x00007f0b31abb3eb
Case 4   0x00007f0b31d011ca  Biggest. Called from Case 3.

Case 5   0x00007f0b32ca2d78  (called from ha_initialize_handlerton)
 
ha_toku start 7f0b32c48000
ha_toku   end 7f0b32feb000

So, most of the leaks are outside ha_tokudb.so, it seems?
jmuchemb, can you share:

  • which binaries you are using?
  • the output of nm --demangle --numeric-sort -S /path/to/ha_rocksdb.so? This should make it possible for us to resolve the addresses.

Comment by Julien Muchembled [ 2017-12-07 ]

Sorry for the confusion. I edited my first reply to remove the parts that are no longer pertinent. Can you please check the svg I reuploaded? All addresses are resolved.

Comment by Sergei Petrunia [ 2017-12-11 ]

(jmuchemb, thanks for the re-uploaded file. It helped)

Discussed this with Herman Lee.

Discussion results:

  • The memory is consumed by rocksdb::TransactionBaseImpl::SetSavePoint
  • While debugging, we can see that this scenario

begin;
select #1;
select #2;
... 
select #n;

will create a new savepoint for each SELECT. A transaction doing lots of SELECTs will end up using a lot of memory.

This can be fixed, for example, by not creating a savepoint when the statement is a SELECT. This can be done in the ha_rocksdb::external_lock() call.

That call is done for each table; for statements that do both reads and writes (e.g. INSERT ... SELECT), the first call is external_lock(F_WRLCK) for the table that we're going to write to. Note: if we haven't allocated a savepoint at statement start, we should not try to remove it at statement end.

Comment by Sergei Petrunia [ 2018-03-30 ]

Started an attempt to implement it and encountered this: https://github.com/facebook/mysql-5.6/issues/802 (not related to memory consumption as such, but it shows that both we and upstream have the RollbackToSavepoint() call in the wrong place).

Comment by Sergei Petrunia [ 2018-04-05 ]

The patch for issue #802 here: https://github.com/spetrunia/mysql-5.6/commit/612ec75814f30076f53d294cd0267533c807022f is also a fix for this MDEV.

That tree is based on facebook/mysql-5.6, but the patch should apply to MariaDB as well.

Next steps: I'll discuss this with upstream and then cherry-pick the patch into MariaDB (so that the fix lands in MariaDB before the next merge from upstream).

Comment by Sergei Petrunia [ 2018-04-12 ]

Applied the patch for upstream issue #802. It fixes this issue as well.

Generated at Thu Feb 08 08:13:43 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.