[MCOL-4043] Memory leaks Part 1. & errors -> need a valgrind pass Created: 2020-06-05  Updated: 2020-11-12  Resolved: 2020-07-07

Status: Closed
Project: MariaDB ColumnStore
Component/s: ExeMgr
Affects Version/s: 1.4.4, 1.5.3
Fix Version/s: 1.5.3

Type: Bug Priority: Major
Reporter: Patrick LeBlanc (Inactive) Assignee: Gagan Goel (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Attachments: File exemgr_valgrind.log     File vg.log    
Issue Links:
Issue split
split to MCOL-4150 Memory leaks & errors -> need a valgr... Closed

 Description   

I ran our regression test on our 1.5 RC, and at the end of it, ExeMgr was using 45GB of virtual memory and about 2.6GB of resident memory while idle. Gagan compared 1.4 & 1.5 and found the memory usage to be similar, which prompted this investigation.

I attached the valgrind log from running VG on ExeMgr on our latest 1.4.4 RC (enterprise server @ 6c7e73d7, engine @ 8ab3cc301). For this test I ran all of working_tpch1 and working_tpch1_compareLogOnly, except for MCOL_3503.sql. moda() has leaks that are not recorded in this log; I found those early and commented out the test that exercises it the most, to see how much of a difference that made in the final memory usage.

At the end of that test, ExeMgr was using around 1.2GB when idle. The leaks reported in the log do not add up to anywhere close to 1.2GB. My suspicion is that all of these small allocations over time are causing a good deal of memory fragmentation. It could also be that a good deal of memory is still referenced somewhere, so VG doesn't consider it leaked. We need to start fixing the small leaks in the log, then reevaluate.

Jun 22 update: Testing @ the columnstore-1.5.2-1 tag, I noticed test211 makes ExeMgr allocate around 1GB/min that it doesn't use. To be clear, the mem usage shows up as virtual mem usage, not resident mem usage. Resident mem usage also appears to be climbing but with a much lower slope.

When a particular patch merges cleanly into 1.2, be sure to include it there as well.



 Comments   
Comment by David Hall (Inactive) [ 2020-06-19 ]

The DynamicParseTreeVec added to track the parsetree objects that are causing leaks is a hard kludge. Perhaps the code should actually be doing a deep copy of the parsetree instead of storing a pointer.

Comment by Gagan Goel (Inactive) [ 2020-06-19 ]

Server 10.4 (community) commit: 205b0ce6ad21dbafe8def505307b4922398db5b2
Engine develop-1.4 commit: 8f92f24b02f01822214f0f140a7ad0cd774bd392

Attached are the ExeMgr logs under valgrind on these commits (plus the patch from https://github.com/mariadb-corporation/mariadb-columnstore-engine/pull/1297): exemgr_valgrind.log. The same tests under working_tpch1 and working_tpch1_compareLogOnly were run for this log.

memleaks due to MODA functions are now gone, after David's fix.
memleaks in ordering::OrderByData::OrderByData() are now gone.
Mentions of 'definitely lost' bytes for joblist::simpleScalarFilterToParseTree() are down from 14 to 1. Not sure why that single leak remains.

David.Hall has expressed a legitimate concern on the memleak fix in https://github.com/mariadb-corporation/mariadb-columnstore-engine/pull/1297 where we are storing pointers to parsetree objects for later deletion. This is because pointers to these parsetree objects are used in creating the job steps. We should revisit this design so that we don't have to use this 'hacky' leak fix.

Comment by Patrick LeBlanc (Inactive) [ 2020-06-24 ]

From the PR, queries to reproduce the problem and verify the solution:

drop table if exists cs1;
create table cs1 (a int) engine=columnstore;
insert into cs1 values (123);
select * from cs1 where a=(select a from cs1);
select * from cs1 having a=(select a from cs1);
select * from cs1 where a=(select a from cs1) and a!=(select a from cs1);
select * from cs1 where a=(select a from cs1) or a!=(select a from cs1);
select * from (select * from cs1 where a=(select a from cs1)) aa;
select * from cs1 aa where exists (select * from cs1 bb where bb.a=(select a from cs1));
select * from cs1 aa where exists (select * from cs1 bb where aa.a=bb.a and bb.a=(select a from cs1));

Comment by Patrick LeBlanc (Inactive) [ 2020-06-24 ]

I merged what's there so far, assigning back to Gagan for what's left.

Comment by Gregory Dorman (Inactive) [ 2020-07-07 ]

This merge corrects some but not all issues with memory. We will close this, and MCOL-4150 will carry the remaining work forward.

Comment by Gagan Goel (Inactive) [ 2020-07-08 ]

Additional queries to reproduce the leaks in ordering::OrderByData::OrderByData() and verify the solution:

create table cs1 (a int, b bigint)engine=columnstore;
insert into cs1 values (123, 1234000);
select rank() over (partition by a,b order by a,b desc) as ranking from cs1;

Generated at Thu Feb 08 02:47:21 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.