[MCOL-5385] GROUP_CONCAT execution doesn't count RAM used to store data properly crashing PrimProc runtime Created: 2023-01-09  Updated: 2023-11-17  Resolved: 2023-05-24

Status: Closed
Project: MariaDB ColumnStore
Component/s: PrimProc
Affects Version/s: 22.08.7
Fix Version/s: 23.02.3

Type: Bug Priority: Blocker
Reporter: Roman Assignee: Gagan Goel (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Relates
relates to MCOL-5379 Columnstore Query causing signal 11 c... Closed
relates to MCOL-5429 GROUP_CONCAT execution allocates enou... Closed
relates to MCOL-5498 Potential memory leak in GROUP_CONCAT... Confirmed
Sprint: 10.0.20
Assigned for Review: Roman Roman
Assigned for Testing: Daniel Lee Daniel Lee (Inactive)

 Description   

There is a case reported by a customer. 3 queries with GROUP_CONCAT on top were going to produce a relatively long strings. This causes some memory allocation related log messages and eventually causes SEGV signal in PP. Here are the log messages:

Jan  5 16:02:04 vm-uks-edf-maria-slave threadpool[1724863]: 04.186263 |0|0|0| E 22 CAL0005: ThreadPool: Caught exception during execution:  boost::thread_resource_error: Resource temporarily unavailable [generic:11]

Jan  5 16:02:04 vm-uks-edf-maria-slave joblist[1724863]: 04.185910 |701574|0|0| C 05 CAL0000: TupleAggregateStep::doThreadedAggregate() caught std::bad_alloc         %%10%%

Here is the crash trace itself:

Date/time: 2023-01-05 16:02:04
Signal: 11
/usr/bin/PrimProc(+0xc012a)[0x55f6d483812a]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7f66c51bd420]
/usr/lib/x86_64-linux-gnu/libjoblist.so(_ZN7joblist8TupleBPS23processByteStreamVectorERSt6vectorIN5boost10shared_ptrIN11messageqcpp10ByteStreamEEESaIS6_EEjjRS1_INS_7_CPInfoESaISA_EEPNS_4FIFOIN8rowgroup6RGDataEEEj+0xe9)[0x7f66c6347769]
/usr/lib/x86_64-linux-gnu/libjoblist.so(_ZN5boost6detail8function26void_function_obj_invoker0IN7joblist19ByteStreamProcessorEvE6invokeERNS1_15function_bufferE+0x3e)[0x7f66c635543e]
/usr/lib/x86_64-linux-gnu/libthreadpool.so(_ZN10threadpool10ThreadPool11beginThreadEv+0x5df)[0x7f66c513c1cf]
/usr/bin/PrimProc(+0xc142b)[0x55f6d483942b]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7f66c51b1609]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f66c4c9c133]

Internally GROUP_CONCAT(GroupConcatNoOrder::getResult()) code allocates a buffer on heap and doesn't account the RAM allocated b/c code threats the buffer as a temporary one.



 Comments   
Comment by Roman [ 2023-01-09 ]

The suggested workaround is to reduce the current buffer cache RAM allowance down from 50% to 30% and check if this helps.
The suggested solution is twofold:

  • remove the intermediate buffering used in GroupConcat*::getResult() methods to reduce the RAM footprint using ostringstring buffer + move semantics.
  • start accounting RAM allocation in getResult()
Comment by Kirill Perov [ 2023-01-31 ]

drrtuy, Looks like some problem still remain.
Quick repeat:
1. Prepare table from regression test000 tpch1.lineitem

2. set group_concat_max_len=3221225472;

3. select l_linenumber,LENGTH(group_concat(l_comment)) from lineitem group by l_linenumber order by l_linenumber;

4. appeared: ERROR 1815 (HY000): Internal error: InetStreamSocket::readToMagic: Remote is closed, memory usage grow to 15GB.

tested on develop build from 31.01.23

Comment by Kirill Perov [ 2023-02-01 ]

drrtuy, Similar to previous comment crash happens at ubuntu 22 and columnstore 22.08.7.
The only difference is that memory usage is not growing during request running;

Comment by Kirill Perov [ 2023-02-20 ]

both develop from 19 Feb 2023 and 22.08.7 raise error:
ERROR 1815 (HY000): Internal error: InetStreamSocket::readToMagic: Remote is closed

no sign of OOMKiller in syslog:

Feb 20 22:31:38 kirillperov-testvm-ub22 ExeMgr[2882]: 38.362023 |2147483685|0|0| D 16 CAL0041: Start SQL statement: select objectid,columnname from syscolumn where schema='tpch1' and tablename='lineitem' --columnRIDs/FE; ||
Feb 20 22:31:38 kirillperov-testvm-ub22 ExeMgr[2882]: 38.422169 |2147483685|0|0| D 16 CAL0042: End SQL statement
Feb 20 22:31:38 kirillperov-testvm-ub22 ExeMgr[2882]: 38.445229 |37|0|0| D 16 CAL0041: Start SQL statement: select l_linenumber,LENGTH(group_concat(l_comment)) from lineitem group by l_linenumber order by l_linenumber; |tpch1|
Feb 20 22:31:39 kirillperov-testvm-ub22 env[2882]: PrimProc: /usr/include/boost/smart_ptr/scoped_array.hpp:81: T& boost::scoped_array<T>::operator[](std::ptrdiff_t) const [with T = unsigned char; std::ptrdiff_t = long int]: Assertion `i >= 0' failed.
Feb 20 22:31:39 kirillperov-testvm-ub22 messagequeue[1208]: 39.239273 |0|0|0| W 31 CAL0000: Client read close socket for InetStreamSocket::readToMagic: Remote is closed %%10%%
Feb 20 22:31:39 kirillperov-testvm-ub22 systemd[1]: mcs-primproc.service: Main process exited, code=killed, status=6/ABRT
Feb 20 22:31:39 kirillperov-testvm-ub22 systemd[1]: mcs-primproc.service: Failed with result 'signal'.
Feb 20 22:31:39 kirillperov-testvm-ub22 systemd[1]: mcs-primproc.service: Consumed 5.747s CPU time.
Feb 20 22:31:39 kirillperov-testvm-ub22 mariadbd[1208]: ClientRotator caught exception: InetStreamSocket::connect: connect() error: Connection refused to: InetStreamSocket: sd: 96 inet: 127.0.0.1 port: 8601
Feb 20 22:31:39 kirillperov-testvm-ub22 systemd[1]: mcs-primproc.service: Scheduled restart job, restart counter is at 1.
Feb 20 22:31:39 kirillperov-testvm-ub22 systemd[1]: Stopped mcs-primproc.
Feb 20 22:31:39 kirillperov-testvm-ub22 systemd[1]: mcs-primproc.service: Consumed 5.747s CPU time.
Feb 20 22:31:39 kirillperov-testvm-ub22 systemd[1]: Starting mcs-primproc...
Feb 20 22:31:39 kirillperov-testvm-ub22 env[3134]: Starting PrimitiveServer: st = 1, sq = 10, pw = 128, pq = 10240, nb = 1023706, nt = 32, nc = 1, ra = 512, db = 128, mb = 512, rd = 0, tr = 0, ss = 67108864, bp = 32
Feb 20 22:31:39 kirillperov-testvm-ub22 env[3125]: PrimProc main process has started
Feb 20 22:31:39 kirillperov-testvm-ub22 env[3134]: FairThreadPool started 32 thread/-s.
Feb 20 22:31:39 kirillperov-testvm-ub22 env[3134]: Starting ExeMgr: st = 50, qs = 20, mx = 95, cf = /etc/columnstore/Columnstore.xml
Feb 20 22:31:39 kirillperov-testvm-ub22 systemd[1]: Started mcs-primproc.

Comment by Gagan Goel (Inactive) [ 2023-04-19 ]

For QA: I have tested two scenarios mentioned in the comments in this ticket to confirm the PrimProc crash is fixed. One is from Sergey on 2023-03-17 and another one from Kirill on 2023-01-31. The lineitem query now executes without a PrimProc crash. On my system with 128gb RAM, the lineitem query with group_concat took about 7-8mins (when group_concat_max_len=3221225472 ).

Please reconfirm the above two test cases before and after the fix.

In addition, we need to confirm if the customer issue reported in CS0512018 is also fixed.

Comment by Daniel Lee (Inactive) [ 2023-04-26 ]

Build verified:

engine: 499859035c1af97ad1c8ee6d31392735ee348390
server: 11c83d9ae9eb249d00589cc6ab71e7f4e67ffa27
buildNo: 7532

1 GB DBT3 dataset
set group_concat_max_len=3221225472;

Reproduced the issues in release 23.02.2 and verified the fix in the mentioned build.

Verified in both Rocky 8 and Ubuntu 20.04

With fixed

VM memory 8gb
 
MariaDB [tpch1]> select l_linenumber,LENGTH(group_concat(l_comment)) from lineitem group by l_linenumber order by l_linenumber;
ERROR 1815 (HY000): Internal error: TupleAggregateStep::threadedAggregateRowGroups()[3] MCS-2003: Aggregation/Distinct memory limit is exceeded.
 
 
VM memory 16gb
 
MariaDB [tpch1]> select l_linenumber,LENGTH(group_concat(l_comment)) from lineitem group by l_linenumber order by l_linenumber;
+--------------+---------------------------------+
| l_linenumber | LENGTH(group_concat(l_comment)) |
+--------------+---------------------------------+
|            1 |                        41242078 |
|            2 |                        35359379 |
|            3 |                        29465964 |
|            4 |                        23560639 |
|            5 |                        17676387 |
|            6 |                        11794549 |
|            7 |                         5899421 |
+--------------+---------------------------------+
7 rows in set (1 min 21.366 sec)
 
 
MariaDB [mytest]> set group_concat_max_len=3221225472;
Query OK, 0 rows affected (0.000 sec)
 
Result matched with InnoDB.
 
 
MariaDB [mytest]> CREATE TABLE t(i integer, t text(65000)) ENGINE=columnstore;
Query OK, 0 rows affected (0.220 sec)
 
MariaDB [mytest]> INSERT INTO t SELECT seq / 10, CONCAT('long enough string to fill the text, sequence is ', seq) FROM seq_1_to_1000000;
Query OK, 1000000 rows affected (3.284 sec)
Records: 1000000  Duplicates: 0  Warnings: 0
 
MariaDB [mytest]> set group_concat_max_len=3221225472;
Query OK, 0 rows affected (0.000 sec)
 
MariaDB [mytest]> SELECT i, GROUP_CONCAT(t) FROM t GROUP BY i;
.
.
.
 
|  74546 | long enough string to fill the text, sequence is 745455,long enough string to fill the text, sequence is 745456,long enough string to fill the text, sequence is 745457,long enough string to fill the text, sequence is 745458,long enough string to fill the text, sequence is 745459,long enough string to fill the text, sequence is 745460,long enough string to fill the text, sequence is 745461,long enough string to fill the text, sequence is 745462,long enough string to fill the text, sequence is 745463,long enough string to fill the text, sequence is 745464 |
+--------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
100001 rows in set (3.156 sec)

23.02.2

VM memory 8gb
 
MariaDB [tpch1]> select l_linenumber,LENGTH(group_concat(l_comment)) from lineitem group by l_linenumber order by l_linenumber;
ERROR 1815 (HY000): Internal error: An unexpected condition within the query caused an internal processing error within Columnstore. Please check the log files for more details. Additional Information: error in TupleAggregateSte
 
 
VM memory 16gb
 
MariaDB [tpch1]> select l_linenumber,LENGTH(group_concat(l_comment)) from lineitem group by l_linenumber order by l_linenumber;
ERROR 1815 (HY000): Internal error: InetStreamSocket::readToMagic: Remote is closed
 
PrimProc crashed
 
MariaDB [mytest]> CREATE TABLE t(i integer, t text(65000)) ENGINE=columnstore;
Query OK, 0 rows affected (0.220 sec)
 
MariaDB [mytest]> INSERT INTO t SELECT seq / 10, CONCAT('long enough string to fill the text, sequence is ', seq) FROM seq_1_to_1000000;
Query OK, 1000000 rows affected (3.284 sec)
Records: 1000000  Duplicates: 0  Warnings: 0
 
MariaDB [mytest]> set group_concat_max_len=3221225472;
Query OK, 0 rows affected (0.000 sec)
 
MariaDB [mytest]> SELECT i, GROUP_CONCAT(t) FROM t GROUP BY i;
ERROR 1815 (HY000): Internal error: InetStreamSocket::readToMagic: Remote is closed

Generated at Thu Feb 08 02:57:29 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.