[MCOL-4841] ExeMgr Crashes after large join - IDB-2004: Cannot connect to ExeMgr - requires restart Created: 2021-08-19  Updated: 2022-08-24  Resolved: 2022-03-25

Status: Closed
Project: MariaDB ColumnStore
Component/s: None
Affects Version/s: 6.2.2
Fix Version/s: 6.3.1

Type: Bug Priority: Critical
Reporter: Allen Herrera Assignee: Daniel Lee (Inactive)
Resolution: Fixed Votes: 0
Labels: mcs_cs

Attachments: File DockerExemgrCrash.mov     HTML File crashDockerReproductionSteps    
Issue Links:
PartOf
Problem/Incident
causes MCOL-5009 test202 causes deadlock Closed
Relates
relates to MCOL-4626 Columnstore cluster becomes non opera... Stalled
Sprint: 2021-10, 2021-11, 2021-12, 2021-13, 2021-14, 2021-15, 2021-16, 2021-17

 Description   

ExeMgr (sometimes PrimProc) crashes on large queries with joins in SkySQL and Docker, requiring a restart of the cluster/node to bring ExeMgr back up.

Clients and users see the following when sending additional queries:
ERROR 1815 (HY000): Internal error: IDB-2004: Cannot connect to ExeMgr.

Expected Results:
The query should fail gracefully, ideally with a reason why (needs more RAM, storage, CPU, etc.), and without requiring a service restart before other queries can run.

Steps to reproduce are attached, along with a video of the crash.



 Comments   
Comment by David Hall (Inactive) [ 2021-10-19 ]

The suggested query has the following traits:
6,000,000 rows per table
The query:
select * from TEST
inner join TEST2 on TEST2.var = TEST.var
inner join TEST3 on TEST3.number2 = TEST.number2;

(These queries are with a 60,000 row table)
JOIN on var:

select count(*) from TEST group by var;
+----------+
| count(*) |
+----------+
|     3762 |
|     3763 |
|     3741 |
|     3762 |
|     3781 |
|     3701 |
|     3755 |
|     3631 |
|     3708 |
|     3803 |
|     3839 |
|     3819 |
|     3741 |
|     3694 |
|     3747 |
|     3753 |
+----------+

So for each row in TEST, the JOIN on var will result in around 3,700 result rows, give or take.

Then join on number2

select count(*) from TEST group by number2;
 .
 .
 .
|       77 |
|       62 |
|       59 |
+----------+
1001 rows in set (0.074 sec)

For each result row of the first join, around 60 rows will be generated.

Total generated rows for each row in TEST: 3700 * 60 = 222,000, usually a little more, sometimes a few less.
222,000 * 6,000,000 = 1,332,000,000,000 rows returned for this query.
So to test, I used a set of tables with 60,000 rows each. This exercises the code without taking months to complete. It's still large enough to cause the OOM problems.

The cause:
Whether a PM or UM join, the design is to work on a block at a time. Block = 8192 rows.
8192 * 222,000 = 1,818,624,000 generated rows per block of the leftmost table. There are multiple threads, each working on a block, so each wants 1,818,624,000 rows' worth of memory. Ouch.
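The fan-out arithmetic above can be checked in a few lines (the 3700 and 60 multipliers are the approximate per-join fan-outs quoted from the GROUP BY counts earlier; this is just a sanity check of the numbers, not ColumnStore code):

```python
# Approximate per-row fan-out of the two joins, taken from the counts above.
ROWS_PER_VAR_MATCH = 3700      # JOIN on var: ~3700 matches per TEST row
ROWS_PER_NUMBER2_MATCH = 60    # JOIN on number2: ~60 matches per first-join row
TEST_ROWS = 6_000_000          # rows in the full-size table
BLOCK_ROWS = 8192              # ColumnStore works one block (8192 rows) at a time

rows_per_test_row = ROWS_PER_VAR_MATCH * ROWS_PER_NUMBER2_MATCH
total_rows = rows_per_test_row * TEST_ROWS
rows_per_block = BLOCK_ROWS * rows_per_test_row

print(rows_per_test_row)   # 222000
print(total_rows)          # 1332000000000
print(rows_per_block)      # 1818624000
```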

There are three places that may contribute to the problem.

1) If it becomes a UM JOIN, then ExeMgr is accepting raw rows from PrimProc and doing the join. Even with disk join, it tries to do a block at a time, so the memory needs are huge and cause OOM in small to medium size machines.

2) If it does a PM join (the default without xml manipulation), since there are no functions involved, it attempts to do the join one block at a time (multiple threads) and generates a join list. If there are functions involved in the results, then it materializes the rows and performs the functions. Either way, it builds a single TCP message to send to ExeMgr per block.

ExeMgr wants buffer space for at least three messages, so it builds a huge (in bytes) queue to hold the data until it can be processed. Once it reaches three buffers full and still can't keep up, it turns on flow control to tell PrimProc to stop sending so fast.

But PrimProc doesn't slow processing, so it starts building the output queue waiting for ExeMgr to tell it to send more data. This queue can increase indefinitely, and eventually will cause OOM even with large memory machines.

The reason ExeMgr can't keep up is that it has a blocking queue sending data to mysqld, so it slows processing to match mysqld. For some reason mysqld is a bottleneck; more on that later.
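The three-buffers-then-flow-control behaviour described above can be sketched as a toy model (hypothetical Python, not the real C++; the point it illustrates is that the producer-side queue is unbounded, which is where the OOM comes from):

```python
from collections import deque

# Toy model of the ExeMgr/PrimProc interaction (all names hypothetical).
# ExeMgr buffers at most 3 messages before raising flow control, but
# PrimProc keeps generating, parking messages in its own UNBOUNDED output
# queue while it waits for the "send more" signal.
EXEMGR_BUFFERS = 3

exemgr_queue = deque()    # bounded in intent, drained slowly by mysqld
primproc_outq = deque()   # unbounded: the OOM risk
flow_control_on = False

def primproc_send(msg):
    global flow_control_on
    if flow_control_on:
        primproc_outq.append(msg)     # parked, waiting for flow control to lift
    elif len(exemgr_queue) >= EXEMGR_BUFFERS:
        flow_control_on = True        # ExeMgr: "stop sending so fast"
        primproc_outq.append(msg)
    else:
        exemgr_queue.append(msg)

# Producer generates 100 blocks; the consumer (blocked on mysqld) drains none.
for block in range(100):
    primproc_send(block)

print(len(exemgr_queue))   # 3  -- capped by flow control
print(len(primproc_outq))  # 97 -- grows without limit
```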

Solution:
When it's a UM join, I added code inside an existing loop to send data to the next jobstep (and hence to mysqld) after 8192 final (not small table) rows are materialized. This prevents massive memory requirements in UM joins.

I added blocking code to bppsendthread that will slow down PrimProc to prevent buffer runaway. I've attempted to prevent overuse of locks and signaling by writing it in such a way that the mechanism has no impact until the buffer reaches a preset size.
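A minimal sketch of that threshold-blocking idea, in Python for brevity (the real change is C++ in bppsendthread, and its lock-avoidance below the threshold can't be shown faithfully under Python's GIL; this version simply blocks the producer once the buffer hits a preset size):

```python
import threading

class BoundedSendBuffer:
    """Hypothetical sketch: the producer blocks only once the buffer
    reaches a preset size, so a well-drained buffer never waits."""
    def __init__(self, max_items):
        self.max_items = max_items
        self.items = []
        self.not_full = threading.Condition()

    def put(self, item):
        with self.not_full:
            while len(self.items) >= self.max_items:  # threshold reached:
                self.not_full.wait()                  # block the producer
            self.items.append(item)

    def drain(self, n):
        with self.not_full:
            taken, self.items = self.items[:n], self.items[n:]
            self.not_full.notify_all()                # wake a blocked producer
            return taken
```

In the real fix the consumer side (the network send loop) plays the role of `drain`, so PrimProc's output can no longer run away from ExeMgr.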

I re-engineered the join code in batchprimitiveprocessor to loop when the total join size gets large and to send results smaller than 8192 small table rows.

I also release the join memory as it's being streamed to bytestream for sending. Previously it streamed everything, then released, so if the join memory was GBs, streaming needed GBs more, making the total twice as much as really needed.
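The peak-memory difference can be shown with a toy accounting model (hypothetical; it just counts bytes, standing in for the join result and its serialized copy):

```python
# Old behaviour: the whole join result stays resident while a full
# serialized copy is built -> peak ~2x the join size.
# New behaviour: each chunk is freed as soon as it is serialized
# -> peak ~1x the join size (plus one in-flight chunk).

def peak_stream_then_release(chunks):
    held = sum(chunks)            # entire join result resident
    streamed = 0
    peak = held
    for c in chunks:
        streamed += c             # serialized copy grows alongside it
        peak = max(peak, held + streamed)
    return peak

def peak_release_while_streaming(chunks):
    held = sum(chunks)
    peak = held
    for c in chunks:
        held -= c                 # chunk freed once serialized
        peak = max(peak, held + c)
    return peak

chunks = [100] * 10               # 1000 "bytes" of join memory
print(peak_stream_then_release(chunks))      # 2000
print(peak_release_while_streaming(chunks))  # 1000
```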

None of these changes should affect performance for any query.

After all this, the 60,000-row triple-join test completes without using much memory at all. It completes in around 3 days. I also tested with functions in the result set, because that is a different code path.

mysqld
As noted earlier, after the patch, the bottleneck appears to be in mysqld. I haven't yet determined if it's in plugin code or server code. My next step is to profile mysql while this query is running. There may be room for improvement. Maybe.

Comment by David Hall (Inactive) [ 2021-10-21 ]

I ran two tests. I use mysql -q to prevent the client from attempting to buffer the output; otherwise the mysql client consumes memory until it OOMs.

First test with the query in the example against 60,000 row tables:

MariaDB [test]> select * from TEST inner join TEST2 on TEST2.var = TEST.var inner join TEST3 on TEST3.number2 = TEST.number2;
.
.
.
| 9          |        7746.89 |           6.99 | 7    | 8          |        6173.07 |           6.29 | 7    | 5          |        8591.84 |           6.99 | 5    |
| 9          |        7746.89 |           6.99 | 7    | 8          |        6173.07 |           6.29 | 7    | 6          |        5128.58 |           6.99 | 9    |
| 9          |        7746.89 |           6.99 | 7    | 8          |        6173.07 |           6.29 | 7    | 0          |        7631.18 |           6.99 | 0    |
| 9          |        7746.89 |           6.99 | 7    | 8          |        6173.07 |           6.29 | 7    | 9          |        6605.33 |           6.99 | 8    |
+------------+----------------+----------------+------+------------+----------------+----------------+------+------------+----------------+----------------+------+
13471208170 rows in set (5.745 sec)

I started this test Saturday morning. It was finished by Thursday morning == around 5 days. Note the elapsed time reported is incorrect. There may be rollover or something.

The second test is this query:

select TEST.number1, TEST2.number2, abs(TEST.number1) from TEST inner join TEST2 on TEST2.var = TEST.var inner join TEST3 on TEST3.number2 = TEST.number2;

I use this test to exercise the code path taken for when there are functions in the result set. It ran for 3 days.

I'm not sure why the discrepancy in elapsed time, but given that PrimProc and ExeMgr were not working hard in either test, I expect the difference may be explained by the larger returned row size in the first test causing X Windows to work harder.

During the test, I ran test001 in parallel. It ran just fine.

Comment by David Hall (Inactive) [ 2021-10-25 ]

For QA: Here's the DDL/DML for creating the tables using floats rather than decimal. You could reduce the number of rows by changing seq_1_to_60000 in the code to something smaller. Not too small, though, or it won't prove the point.

CREATE TABLE T1 (
group_name CHAR(2),
number1 float,
number2 float,
var varchar(1)
) ENGINE=COLUMNSTORE;
 
INSERT INTO T1
SELECT CAST(ROUND(RAND() * 10, 0) AS CHAR),
ROUND(RAND() * 10000, 2),
ROUND(RAND() * 10, 2),
substring(MD5(RAND()),1,1)
FROM seq_1_to_60000;
 
CREATE TABLE T2 (
group_name CHAR(2),
number1 float,
number2 float,
var varchar(1)
) ENGINE=COLUMNSTORE;
 
INSERT INTO T2
SELECT CAST(ROUND(RAND() * 10, 0) AS CHAR),
ROUND(RAND() * 10000, 2),
ROUND(RAND() * 10, 2),
substring(MD5(RAND()),1,1)
FROM seq_1_to_60000;

Comment by David Hall (Inactive) [ 2021-10-26 ]

The queries I tried are

select * from T1 inner join T2 on T2.var = T1.var inner join T3 on T3.number2 = T1.number2;
13512057376 rows in set (11.189 sec)

select TEST.number1, TEST2.number2, abs(TEST.number1) from TEST inner join TEST2 on TEST2.var = TEST.var inner join TEST3 on TEST3.number2 = TEST.number2;
13471208170 rows in set (0.692 sec)

select TEST.number1, sum(TEST.number1) from TEST inner join TEST2 on TEST2.var = TEST.var inner join TEST3 on TEST3.number2 = TEST.number2 group by TEST.number1;
57751 rows in set (1 hour 45 min 41.120 sec)

insert into testjoin select * from T1 inner join T2 on T2.var = T1.var inner join T3 on T3.number2 = T1.number2;

The elapsed times reported are meaningless when the query runs for more than 24 hours. There is rollover in the time field, I suppose.

QA: You must start mysql with the -q option, or mysql will attempt to buffer results until OOM. With -q, a SELECT will start spewing results shortly after the query is submitted. However, if an aggregation is included, such as the SUM() above, results accumulate before display, so nothing will show until it's done.

Comment by David Hall (Inactive) [ 2021-10-26 ]

Somewhat strange: When running against a table with numerical fields defined as decimal(12,2), the query returns 13471208170 rows, but when I recreate the table using floats, we get 13512057376 rows. I can't explain the discrepancy.

Comment by David Hall (Inactive) [ 2021-10-26 ]

Interesting:
I started a SELECT and an INSERT...SELECT at the same time. The SELECT finished much sooner. There's no X Windows involvement in INSERT...SELECT; mysqld is pegging a single core.

Update: INSERT...SELECT quit after running for 5 days. It ran out of disk space (~32 GiB). It rolled back without issue.

Comment by David Hall (Inactive) [ 2022-01-21 ]

Didn't mean to close; Jira misunderstanding.

Comment by Roman [ 2022-01-24 ]

David.Hall I personally think that out-of-band accounting wouldn't solve the issue; the only way is to customize std::allocator (or whatever type is used for memory-intensive structs) to account for memory consumption using an atomic counter. To reduce contention, we should probably introduce a hierarchy, so that every JobList accounts on its own, syncing periodically with a global counter in RM, and starts syncing with the global counter every time it crosses a certain available memory limit. What do you think?
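The hierarchical accounting proposed above can be sketched as follows (a toy Python model; all class names are hypothetical, and the real proposal concerns C++ allocators with atomic counters):

```python
import threading

class GlobalMemCounter:
    """Stands in for the global counter in RM; contended, so touched rarely."""
    def __init__(self, limit):
        self.limit = limit
        self.used = 0
        self.lock = threading.Lock()

    def add(self, n):
        with self.lock:
            if self.used + n > self.limit:
                raise MemoryError("query exceeds available memory")
            self.used += n

class JobListAccountant:
    """Per-JobList local accounting; syncs with the global counter only
    after accumulating SYNC_EVERY bytes, keeping the common path cheap."""
    SYNC_EVERY = 1024

    def __init__(self, global_counter):
        self.global_counter = global_counter
        self.unsynced = 0

    def allocate(self, n):
        self.unsynced += n
        if self.unsynced >= self.SYNC_EVERY:   # cheap local path otherwise
            self.global_counter.add(self.unsynced)
            self.unsynced = 0
```

The design trade-off is the usual one for sharded counters: larger `SYNC_EVERY` means less contention on the global counter but a looser bound on actual memory use (each JobList can be up to one sync interval over its reported figure).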

Comment by Roman [ 2022-02-09 ]

David.Hall Could you elaborate on this issue for QA?

Comment by Daniel Lee (Inactive) [ 2022-03-25 ]

Build verified: 6.3.1-1 (#4101)

Rocky8, 16GB memory

Using the test case in the ticket, with 600k rows instead of 60k, I was able to reproduce the reported issue in 6.2.2-1.
Verified the fix in 6.3.1-1, build 4101. Must use -q when starting mariadb.

During query execution, process memory utilization remained about the same.

 6899 mysql     20   0 5555188   4.0g  11316 S   0.3  25.7   0:06.03 ExeMgr
 6852 mysql     20   0   10.4g 495056   9244 S   0.0   3.1   0:02.56 PrimProc
 6741 mysql     20   0 1638168 112332  41232 S   8.3   0.7   0:04.51 mariadbd

Generated at Thu Feb 08 02:53:25 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.