[MCOL-4841] ExeMgr Crashes after large join - IDB-2004: Cannot connect to ExeMgr - requires restart Created: 2021-08-19 Updated: 2022-08-24 Resolved: 2022-03-25 |
|
| Status: | Closed |
| Project: | MariaDB ColumnStore |
| Component/s: | None |
| Affects Version/s: | 6.2.2 |
| Fix Version/s: | 6.3.1 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Allen Herrera | Assignee: | Daniel Lee (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | mcs_cs | ||
| Attachments: |
|
| Issue Links: |
|
| Sprint: | 2021-10, 2021-11, 2021-12, 2021-13, 2021-14, 2021-15, 2021-16, 2021-17 | ||||||||||||||||||||
| Description |
|
ExeMgr (sometimes PrimProc) crashes on large queries with joins in SkySQL and Docker, requiring a restart of the cluster/node to bring ExeMgr back up. Clients and users see the following when sending additional queries: Expected Results: Steps to reproduce are attached, along with a video of the crash |
| Comments |
| Comment by David Hall (Inactive) [ 2021-10-19 ] | ||||||||||||||||||||||||||||||
|
The suggested query has the following traits: (These queries are with a 60,000 row table)
So for each row in TEST, JOIN ON var will result in around 3700 result rows, give or take. Then, joining on number2, each result of the first join generates around 60 rows. Total generated rows for each row in TEST is therefore roughly 3700 * 60 = 222,000, usually a little more, sometimes a few less.

The cause: there are three places that may contribute to the problem.

1) If it becomes a UM join, then ExeMgr accepts raw rows from PrimProc and does the join itself. Even with disk join, it tries to do a block at a time, so the memory needs are huge and cause OOM on small to medium size machines.

2) If it does a PM join (the default, without XML manipulation) and there are no functions involved, it attempts to do the join one block at a time (multiple threads) and generate a join list. If there are functions involved in the results, it materializes the rows and performs the functions. Either way, it builds a single TCP message to send to ExeMgr per block. ExeMgr wants buffer space for at least three messages, so it builds a huge (in bytes) queue to hold the data until it can be processed. Once it reaches three buffers full and still can't keep up, it turns on flow control to tell PrimProc to stop sending so fast. But PrimProc doesn't slow its processing; it starts building an output queue while waiting for ExeMgr to tell it to send more data. This queue can grow indefinitely and will eventually cause OOM even on large-memory machines.

3) The reason ExeMgr can't keep up is that it has a blocking queue sending data to mysqld, so it slows processing to match mysqld. For some reason mysqld is a bottleneck. More on that later.

Solution: I added blocking code to bppsendthread that will slow down PrimProc to prevent buffer runaway. I've attempted to prevent overuse of locks and signaling by writing it in such a way that the mechanism has no impact until the buffer reaches a preset size. I re-engineered the join code in batchprimitiveprocessor to loop when the total join size gets large and to send results smaller than 8192 small-table rows.
I also release the join memory as it's being streamed to the ByteStream for sending. It had been streaming everything first, then releasing. If the join memory was GBs, then streaming needed GBs more, so the total was twice as much as really needed. These changes should not affect performance for any queries. After all this, running the 60,000-row triple-join test completes without using very much memory at all. It completes in around 3 days. I also tested with functions in the result set, because that is a different code path. | ||||||||||||||||||||||||||||||
| Comment by David Hall (Inactive) [ 2021-10-21 ] | ||||||||||||||||||||||||||||||
|
I ran two tests. I use mysql -q to prevent the client from attempting to buffer the output; otherwise the mysql client consumes memory until it OOMs. First test, with the query in the example against 60,000-row tables:
I started this test Saturday morning. It was finished by Thursday morning, around 5 days. Note the elapsed time reported is incorrect; there may be rollover or something. The second test is this query:
I use this test to exercise the code path taken when there are functions in the result set. It ran for 3 days. I'm not sure why the discrepancy in elapsed time, but given that PrimProc and ExeMgr were not working hard in either test, I expect the difference may be explained by the larger returned row size in the first test causing the X window system to work harder. During the test, I ran test001 in parallel. It ran just fine. | ||||||||||||||||||||||||||||||
| Comment by David Hall (Inactive) [ 2021-10-25 ] | ||||||||||||||||||||||||||||||
|
For QA: Here's the DDL/DML for creating the tables using floats rather than decimal. You could reduce the number of rows by modifying seq_1_to_60000 in the code to something smaller, but not too small, or it won't prove the point.
| ||||||||||||||||||||||||||||||
| Comment by David Hall (Inactive) [ 2021-10-26 ] | ||||||||||||||||||||||||||||||
|
The queries I tried are
The elapsed times reported are meaningless when the query runs for more than 24 hours; there is rollover in the time field, I suppose. QA: you must start mysql with the -q option or mysql will attempt to buffer until OOM. With -q, a SELECT will start spewing results shortly after the query is submitted. However, if an aggregation is included, such as the SUM() above, it accumulates before displaying, so nothing will show until it's done. | ||||||||||||||||||||||||||||||
| Comment by David Hall (Inactive) [ 2021-10-26 ] | ||||||||||||||||||||||||||||||
|
Somewhat strange: When running against a table with numerical fields defined as decimal(12,2), the query returns 13471208170 rows, but when I recreate the table using floats, we get 13512057376 rows. I can't explain the discrepancy. | ||||||||||||||||||||||||||||||
| Comment by David Hall (Inactive) [ 2021-10-26 ] | ||||||||||||||||||||||||||||||
|
Interesting update: the INSERT...SELECT quit after running for 5 days, out of disk space at ~32 GiB. It rolled back without issue. | ||||||||||||||||||||||||||||||
| Comment by David Hall (Inactive) [ 2022-01-21 ] | ||||||||||||||||||||||||||||||
|
Didn't mean to close. Jira misunderstanding | ||||||||||||||||||||||||||||||
| Comment by Roman [ 2022-01-24 ] | ||||||||||||||||||||||||||||||
|
David.Hall I personally think that out-of-band accounting wouldn't solve the issue, and the only way is to customize std::allocator (or whatever type is used for memory-intensive structs) to account for memory consumption using an atomic counter. To reduce contention, we should probably introduce a hierarchy, so that every JobList does its own accounting, syncing periodically with a global counter in RM, and starts syncing with the global counter every time it crosses a certain available memory limit. What do you think? | ||||||||||||||||||||||||||||||
| Comment by Roman [ 2022-02-09 ] | ||||||||||||||||||||||||||||||
|
David.Hall Could you elaborate on this issue for QA? | ||||||||||||||||||||||||||||||
| Comment by Daniel Lee (Inactive) [ 2022-03-25 ] | ||||||||||||||||||||||||||||||
|
Build verified: 6.3.1-1 (#4101), Rocky 8, 16GB memory. Using the test case in the ticket, with 600k rows instead of 60k rows, I was able to reproduce the reported issue in 6.2.2-1. With the fixed build, process memory utilization remained about the same during query execution.
|