MariaDB Server / MDEV-14460

Memory leak with only SELECT statements

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 10.2.9
    • Fix Version/s: 10.2.15
    • None
    • Environment: Debian 8. 16 GB RAM. 8 CPU (Xeon E5-1620 v2).

    Description

      The same IO-bound scenario was run on 2 identical machines, one with RocksDB and the other with TokuDB. In a Zope application (ERP5, https://www.erp5.org), using NEO (https://neo.nexedi.com/) with MariaDB as storage, we filled the disks with a few data streams. The memory leak happened when reading a single whole data stream.

      The result is that ERP5/NEO does 2 queries in a loop. Here is an example from the general log:

      SELECT tid, compression, data.hash, value, value_tid FROM obj FORCE INDEX(`partition`) LEFT JOIN data ON (obj.data_id = data.id) WHERE `partition` = 6 AND oid = 94294686 AND tid < 271213171275484455 ORDER BY tid DESC LIMIT 1
      SELECT tid FROM obj FORCE INDEX(`partition`) WHERE `partition`=6 AND oid=94294686 AND tid>271177614835290848 ORDER BY tid LIMIT 1
      

      More about NEO's architecture is described at https://www.nexedi.com/blog/NXD-Document.Blog.Optimising.MariaDB.Big.Data

      The data stream is 1.5TB uncompressed, taking about 650MB on-disk. RocksDB got OOM-killed when 1TB was read. TokuDB respected its memory constraints until the end.

      Each machine has 2 disks of 2TB, each one with a MariaDB DB that holds half of the Zope DB. In other words, we have 2 RocksDB databases of about 2TB, with 8GB of RAM for the 2 mysqld processes (rocksdb_block_cache_size = 2G).

      Attachments

        Issue Links

          Activity

            jmuchemb Julien Muchembled created issue -

            monty Michael Widenius added a comment -

            Can you please make a full test case that we can reproduce?
            This includes creating the needed tables, having a script that fills them with data, and running the queries that cause the memory overrun.

            An alternative is to link MariaDB with tcmalloc and run the test on your machines. This would give us
            a log of the memory usage that could help us locate the problem:

            https://mariadb.com/kb/en/library/debugging-a-running-server-on-linux/
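            For reference, here is a minimal standalone sketch of the kind of heap profiling the linked KB article relies on, using the gperftools (tcmalloc) heap-profiler API directly. This is not MariaDB code; the program name, output prefix and allocation pattern are made up for illustration. In practice one would normally start mysqld with tcmalloc preloaded and the HEAPPROFILE environment variable set, then feed the resulting .heap files to pprof, as done later in this issue.

            {code}
            // Hypothetical standalone example: produce tcmalloc heap profiles that pprof can read.
            // Build (assuming gperftools is installed): g++ heap_demo.cc -o heap_demo -ltcmalloc
            #include <gperftools/heap-profiler.h>
            #include <vector>

            int main() {
              // Profiles are written as /tmp/heap_demo.0001.heap, /tmp/heap_demo.0002.heap, ...
              HeapProfilerStart("/tmp/heap_demo");

              std::vector<char*> blocks;
              for (int i = 0; i < 1000; ++i) {
                blocks.push_back(new char[64 * 1024]);   // never freed: simulates a leak
                if (i % 250 == 0)
                  HeapProfilerDump("checkpoint");        // force an intermediate dump
              }

              HeapProfilerStop();
              // Analyze with e.g.: pprof --svg ./heap_demo /tmp/heap_demo.0004.heap > heap.svg
              return 0;
            }
            {code}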
            monty Michael Widenius made changes -
            Field Original Value New Value
            Assignee Sergei Petrunia [ psergey ]
            jmuchemb Julien Muchembled made changes -
            Attachment mysqld.svg [ 44753 ]
            jmuchemb Julien Muchembled added a comment - edited

            If the pattern (when filling or reading tables) matters, writing a test case can be a lot of work. For the moment, I prefer the other way.

            We use jemalloc. Something must be missing on their wiki because I was unable to get any profiling data. So I switched to tcmalloc.

            [edit: removed, too many mistakes from me]

            jmuchemb Julien Muchembled made changes -
            Attachment mysqld.svg [ 44753 ]
            jmuchemb Julien Muchembled made changes -
            Attachment mysqld.svg [ 44754 ]

            jmuchemb Julien Muchembled added a comment -

            Reuploaded. Apparently, I must call pprof with a single .heap file.

            jmuchemb Julien Muchembled added a comment -

            I don't know if it matters, but I've just realized that all the SELECTs are done in a single transaction.

            psergei Sergei Petrunia added a comment -

            Looking at positions where memory is leaked:

            Case1   0x00007f0b31ab6ab0
            Case2   0x00007f0b31ab6ab0
            Case3   0x00007f0b31abb3eb
            Case4   0x00007f0b31d011ca  (biggest; called from Case3)

            Case4   0x00007f0b32ca2d78  (called from ha_initialize_handlerton)

            ha_toku start 7f0b32c48000
            ha_toku   end 7f0b32feb000

            So, most of the leaks are outside ha_tokudb.so, it seems?
            jmuchemb, can you share:

            • which binaries you are using?
            • the output of nm --demangle --numeric-sort -S /path/to/ha_rocksdb.so? This should make it possible for us to resolve the addresses.

            jmuchemb Julien Muchembled added a comment -

            Sorry for the confusion. I edited my first reply to remove parts that aren't pertinent anymore. Can you please check the svg I reuploaded? All addresses are resolved.

            psergei Sergei Petrunia added a comment -

            (jmuchemb, thanks for the re-uploaded file. It helped.)

            Discussed this with Herman Lee.

            Discussion results:

            • The memory is consumed by rocksdb::TransactionBaseImpl::SetSavePoint.
            • Debugging, we can see that this scenario

            begin;
            select #1;
            select #2;
            ...
            select #n;

            will create a new savepoint for each SELECT. A transaction doing lots of SELECTs will end up using a lot of memory.

            This can be fixed, for example, by not creating a savepoint when the statement is a SELECT statement. We can see this in the ha_rocksdb::external_lock() call.

            That call is done for each table; for statements that do both reads and writes (e.g. INSERT ... SELECT), the first call is external_lock(F_WRLCK) for the table that we're going to write to. Note: if we haven't allocated a savepoint at statement start, we should not try to remove it at statement end. A sketch of this idea follows below.
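            To illustrate the idea only (this is not the actual MyRocks patch, and the class and function names below are made up): a self-contained model of a per-statement savepoint policy where a savepoint is created only for statements that take a write lock, and statement rollback only pops a savepoint that was actually created. The real fix lives in ha_rocksdb.cc and the RocksDB transaction API.

            {code}
            // Hypothetical model of the proposed policy (not MariaDB/MyRocks code).
            #include <cstdio>
            #include <vector>

            // Stand-in for a rocksdb-style transaction that keeps state per savepoint.
            struct Txn {
              std::vector<size_t> savepoints;          // grows with every SetSavePoint()
              void SetSavePoint()        { savepoints.push_back(savepoints.size()); }
              void RollbackToSavePoint() { if (!savepoints.empty()) savepoints.pop_back(); }
            };

            // Per-statement state, mirroring what external_lock(F_WRLCK/F_RDLCK) could track.
            struct StatementScope {
              Txn& txn;
              bool created_savepoint = false;

              StatementScope(Txn& t, bool may_write) : txn(t) {
                if (may_write) {                       // only write statements need a savepoint
                  txn.SetSavePoint();
                  created_savepoint = true;
                }
              }
              // Called only if the statement fails and its changes must be undone.
              void rollback_stmt() {
                if (created_savepoint)                 // never pop a savepoint we never created
                  txn.RollbackToSavePoint();
              }
            };

            int main() {
              Txn txn;                                                // begin;
              for (int i = 0; i < 1000000; ++i) {
                StatementScope stmt(txn, /*may_write=*/false);        // select #i;
                // ... read rows ...
              }
              // With a savepoint created unconditionally per statement, this transaction
              // would accumulate ~1M savepoints; with the guarded policy it stays constant.
              std::printf("savepoints kept: %zu\n", txn.savepoints.size());
              return 0;
            }
            {code}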
            julien.fritsch Julien Fritsch made changes -
            julien.fritsch Julien Fritsch made changes -
            elenst Elena Stepanova made changes -
            Fix Version/s 10.2 [ 14601 ]
            serg Sergei Golubchik made changes -
            Priority Major [ 3 ] Critical [ 2 ]
            julien.fritsch Julien Fritsch made changes -
            Description edited (the stored original and new values differ only in the ERP5 link protocol, https vs http; the full text is omitted here as it duplicates the Description above)

            psergei Sergei Petrunia added a comment -

            Started an attempt to implement it and encountered this: https://github.com/facebook/mysql-5.6/issues/802 (not related to memory consumption as such, but it shows that both we and upstream have the RollbackToSavepoint() call in the wrong place).
            psergei Sergei Petrunia added a comment - edited

            The patch for issue #802 here: https://github.com/spetrunia/mysql-5.6/commit/612ec75814f30076f53d294cd0267533c807022f is also a fix for this MDEV.

            That tree is based on facebook/mysql-5.6, but the patch should apply to MariaDB as well.

            Next steps: I'll discuss this with the upstream and then cherry-pick the fix into MariaDB (so that we get it in MariaDB sooner than the merge from upstream happens).


            psergei Sergei Petrunia added a comment -

            Applied the patch for upstream issue #802. It fixes this issue as well.
            psergei Sergei Petrunia made changes -
            Fix Version/s 10.2.15 [ 23006 ]
            Fix Version/s 10.2 [ 14601 ]
            Resolution Fixed [ 1 ]
            Status Open [ 1 ] Closed [ 6 ]
            psergei Sergei Petrunia made changes -
            serg Sergei Golubchik made changes -
            Workflow MariaDB v3 [ 83924 ] MariaDB v4 [ 153223 ]

            People

              Assignee: psergei Sergei Petrunia
              Reporter: jmuchemb Julien Muchembled
              Votes: 0
              Watchers: 6

              Dates

                Created:
                Updated:
                Resolved:
