[MCOL-3314] Exemgr crash on query happening when we increase 2 variables, MaxOutStandingRequests and RequestSize Created: 2019-05-15  Updated: 2020-08-25  Resolved: 2019-07-11

Status: Closed
Project: MariaDB ColumnStore
Component/s: ExeMgr
Affects Version/s: 1.2.4
Fix Version/s: 1.1.0, 1.2.5

Type: Bug Priority: Major
Reporter: David Hill (Inactive) Assignee: Daniel Lee (Inactive)
Resolution: Fixed Votes: 1
Labels: None
Environment:

1um 2pm 1.2 development branch


Sprint: 2019-05, 2019-06

 Description   

Customer reported:

We are working on the latest 1.2-develop compiled version of ColumnStore. We have tested this issue with both the gcc and Intel compiled versions. It appears there is a failure in ExeMgr that causes the queries to fail and, in turn, ExeMgr to restart as it should. This produces a core file (21 G), and I will attach that output as well as the support report:

#0 0x000055e1a9f7cd35 in construct (this=0x7f955fbf5db0, __val=@0x7f986f800000: <error reading variable>, __p=0x7f958a037ef8) at /usr/include/c++/4.8.2/ext/new_allocator.h:130
#1 construct<unsigned int> (__a=..., __arg=@0x7f986f800000: <error reading variable>, __p=0x7f958a037ef8) at /usr/include/c++/4.8.2/ext/alloc_traits.h:216
#2 std::vector<unsigned int, std::allocator<unsigned int> >::_M_insert_aux (this=0x7f955fbf5db0, __position=<error reading variable: Cannot access memory at address 0x7f986f800000>, __x=<optimized out>) at /usr/include/c++/4.8.2/bits/vector.tcc:353

This appears to be happening when we increase 2 variables, MaxOutStandingRequests and RequestSize. This works with no errors: MaxOutStandingRequests=60, RequestSize=2. However, this causes the query failures and the ExeMgr restart: MaxOutStandingRequests=120, RequestSize=2.



 Comments   
Comment by David Hill (Inactive) [ 2019-05-16 ]

from customer

It seems like this is a new bug. After a few hours, and after just now restarting from scratch, it is still holding. So either it is harder to hit that bug or it is a new one.

As a note, there seems to be a clear difference in resource usage between the 2 versions, and there are slowdowns of ExeMgr in 1.2.3, but it does not crash.

Comment by David Hall (Inactive) [ 2019-05-17 ]

I don't believe this has anything to do with MaxOutStandingRequests or RequestSize. RequestSize is deprecated and does nothing. MaxOutStandingRequests controls how fast the PM can pump data to the UM. This throttle exists so that the PM can't overwhelm the UM. The crash here is in the setup code for the variance() function and is nowhere near MaxOutStandingRequests and its uses.

It appears there's an access past the end of a vector. In many cases, this just causes garbage to be used, but it will sometimes show up as a memory access error, as in this example.

When garbage is used in this situation, there is no harm since this value is just a placeholder. That's why it doesn't show up in the result set.
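The failure mode described above, reading an element past the end of a `std::vector` and then pushing that value into another vector, can be guarded against with a bounds check. The sketch below is an assumption-laden illustration and not the actual fix: the helper names `valueOrPlaceholder` and `appendChecked` are hypothetical. Indexing a vector with `operator[]` past its end is undefined behavior in C++, which is why it may silently yield garbage on one run and fault on another.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: returns src[idx] when idx is in range,
// otherwise an explicit placeholder instead of whatever lies past the end.
uint32_t valueOrPlaceholder(const std::vector<uint32_t>& src,
                            std::size_t idx,
                            uint32_t placeholder)
{
    return idx < src.size() ? src[idx] : placeholder;
}

// Hypothetical helper: appends the checked value to dst, so push_back
// never receives a reference to memory outside src.
void appendChecked(std::vector<uint32_t>& dst,
                   const std::vector<uint32_t>& src,
                   std::size_t idx,
                   uint32_t placeholder)
{
    dst.push_back(valueOrPlaceholder(src, idx, placeholder));
}
```

With `src` holding three elements, `appendChecked(dst, src, 3, 0)` appends the placeholder `0` rather than reading one element past the end. `std::vector::at` is an alternative when an out-of-range index should raise `std::out_of_range` instead of being replaced.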

Comment by David Hall (Inactive) [ 2019-05-17 ]

For QA: Not sure how to reproduce this. It is a very rare crash that depends on how the OS sets up the stack and the heap, as well as on the STL and its pre-allocation of vectors. Otherwise, there are no behavioral differences with this PR.

Comment by David Hall (Inactive) [ 2019-05-23 ]

I think this is the only place needing changes. In any case, this is where it broke at the customer. I also looked at prep1PhaseDistinctAggregate, and because it goes through an extra step to get there, it happens to be correct.

Comment by Daniel Lee (Inactive) [ 2019-06-18 ]

Has the latest code been provided to customer?

Support needs to follow up with customer to see if the solution works.

Thanks.

Comment by Daniel Lee (Inactive) [ 2019-07-11 ]

Builds verified: 1.1.8-1 nightly, 1.2.5-1 RC (1st one)

Verified by regression

1.1.8-1
server commit:
09faec8
engine commit:
cbaba7f

1.2.5-1
server commit:
f44f7d9
engine commit:
4e477ab

Generated at Thu Feb 08 02:41:48 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.