[MCOL-4922] Unusual increase in memory consumption by ExeMgr which causes an OOM error. Created: 2021-11-11  Updated: 2022-02-18

Status: Open
Project: MariaDB ColumnStore
Component/s: ExeMgr
Affects Version/s: 6.2.1
Fix Version/s: Icebox

Type: Bug Priority: Minor
Reporter: Denis Khalikov Assignee: Denis Khalikov
Resolution: Unresolved Votes: 0
Labels: None

Attachments: PNG File Screenshot from 2021-11-10 20-07-56.png     PNG File Screenshot from 2021-11-10 20-12-44.png    

 Description   

Unusual increase in memory consumption by ExeMgr during execution of a specific query, which causes an OOM error.
Running the same query repeatedly, I found that ExeMgr's memory consumption increases with each run and finally causes an OOM error.

The query:

select
	s_name,
	count(*) as numwait
from
	supplier,
	lineitem l1,
	orders,
	nation
where
	s_suppkey = l1.l_suppkey
	and o_orderkey = l1.l_orderkey
	and o_orderstatus = 'F'
	and l1.l_receiptdate > l1.l_commitdate
	and exists (
		select
			*
		from
			lineitem l2
		where
			l2.l_orderkey = l1.l_orderkey
			and l2.l_suppkey <> l1.l_suppkey
	)
	and not exists (
		select
			*
		from
			lineitem l3
		where
			l3.l_orderkey = l1.l_orderkey
			and l3.l_suppkey <> l1.l_suppkey
			and l3.l_receiptdate > l3.l_commitdate
	)
	and s_nationkey = n_nationkey
	and n_name = 'EGYPT'
group by
	s_name
order by
	numwait desc,
	s_name
LIMIT 100;

I use sysbench and 10 GB of data to run the query.
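
One way to quantify the growth between runs is to sample ExeMgr's resident memory after each iteration of the query. The following is a minimal sketch and not part of the original report: it assumes a Linux host where /proc/<pid>/status exposes a VmRSS line, and the helper names parse_vm_rss_kib, read_rss_kib, and grows_every_run are hypothetical.

```python
import re
from pathlib import Path


def parse_vm_rss_kib(status_text):
    """Return the VmRSS value in KiB from /proc/<pid>/status text, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB$", status_text, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_kib(pid):
    """Read the current resident set size of a process (assumes Linux /proc)."""
    return parse_vm_rss_kib(Path(f"/proc/{pid}/status").read_text())


def grows_every_run(samples):
    """True if each sample is strictly larger than the one before it."""
    return len(samples) > 1 and all(b > a for a, b in zip(samples, samples[1:]))
```

Calling read_rss_kib with the ExeMgr PID after each run and passing the collected values to grows_every_run would indicate whether memory is released between runs or keeps accumulating.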



 Comments   
Comment by Roman [ 2021-12-09 ]

dleeyh, could I ask you to retest this one?

Comment by Daniel Lee (Inactive) [ 2021-12-09 ]

drrtuy What do you want me to test? Thanks.

My understanding is that the bug reporter still sees the issue in the latest build.

Comment by Daniel Lee (Inactive) [ 2021-12-09 ]

Build tested: 6.2.2-1 (#3480)

From what I can tell, we have a stability issue. Here is what I found on a 32GB VM.

After building a 10 GB TPCH database, I executed the query and got the following error soon after ExeMgr's memory utilization reached 12.1%:

[centos8:root~]# mariadb tpch10 < /data/qa/shares/Testcase.txt 
ERROR 1815 (HY000) at line 1: Internal error: TupleBPS::run() caught DistributedEngineComm::write: Broken Pipe error

I repeated the same test and the queries returned results successfully.

I then rebuilt the 10 GB TPCH database and tried the test again. The first two tries also failed with the same error, but the next two tests were successful. The 5th test got stuck, with ExeMgr's memory utilization remaining at 1.1% and PrimProc's CPU usage remaining at about 98%.

When the test was successful, it took only a few seconds to execute. Now it has been minutes, and PrimProc remains at 98% CPU and 10.2% MEM.

Comment by Daniel Lee (Inactive) [ 2021-12-09 ]

Packages for these earlier builds are no longer available. Please rebuild them in Drone and let me know the build number. Thanks.

Comment by Roman [ 2021-12-18 ]

Any updates, denis0x0D, dleeyh?

Comment by Roman [ 2022-02-15 ]

Any progress on this, dleeyh, or should I close it?

Generated at Thu Feb 08 02:54:01 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.