[MCOL-3314] Exemgr crash on query happening when we increase 2 variables, MaxOutStandingRequests and RequestSize Created: 2019-05-15 Updated: 2020-08-25 Resolved: 2019-07-11 |
|
| Status: | Closed |
| Project: | MariaDB ColumnStore |
| Component/s: | ExeMgr |
| Affects Version/s: | 1.2.4 |
| Fix Version/s: | 1.1.0, 1.2.5 |
| Type: | Bug | Priority: | Major |
| Reporter: | David Hill (Inactive) | Assignee: | Daniel Lee (Inactive) |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | None | ||
| Environment: |
1um 2pm 1.2 development branch |
||
| Sprint: | 2019-05, 2019-06 |
| Description |
|
Customer reported: We are working on the latest 1.2-develop compiled version of ColumnStore. We have tested this issue with BOTH the gcc and Intel compiled versions. It appears there is a failure in ExeMgr that causes the queries to fail and, in turn, ExeMgr to restart, as it should. This produces a core file (21 GB), and I will attach that output as well as the support report:
#0 0x000055e1a9f7cd35 in construct (this=0x7f955fbf5db0, __val=@0x7f986f800000: <error reading variable>, __p=0x7f958a037ef8) at /usr/include/c++/4.8.2/ext/new_allocator.h:130
This appears to happen when we increase 2 variables, MaxOutStandingRequests and RequestSize.
This works with no errors: MaxOutStandingRequests=60 RequestSize=2.
However, this causes the query failures and the ExeMgr restart: MaxOutStandingRequests=120 RequestSize=2. |
| Comments |
| Comment by David Hill (Inactive) [ 2019-05-16 ] |
|
From the customer: it seems like this is a new bug. After a few hours, and after just now restarting from scratch, it is still holding. So either it is harder to hit that bug or it is a new one. As a note, there seems to be a clear difference in resource usage between the 2 versions, and ExeMgr slows down in 1.2.3, but it does not crash. |
| Comment by David Hall (Inactive) [ 2019-05-17 ] |
|
I don't believe this has anything to do with MaxOutStandingRequests or RequestSize. RequestSize is deprecated and does nothing. MaxOutStandingRequests controls how fast the PM can pump data to the UM. This throttle exists so the PM can't overwhelm the UM. The crash here is in the setup code for the variance() function and is nowhere near MaxOutStandingRequests and its uses. It appears there's an access to a vector past the end of the vector. In many cases, this just causes garbage to be used, but it will sometimes show up as a memory access error, as in this example. When garbage is used in this situation, there is no harm, since this value is just a placeholder. That's why it doesn't show up in the result set. |
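For illustration only, here is a minimal C++ sketch of the kind of defect described above: reading a std::vector one element past its end. It is not the ExeMgr code; the helper names (inBounds, valueOr, atThrowsPastEnd) are hypothetical and exist only for this example. An unchecked v[i] with i >= v.size() is undefined behavior, so it may return garbage or fault depending on memory layout, whereas a size check or at() makes the error deterministic.

```cpp
#include <cassert>    // used only by callers that want to assert on the helpers
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: true when `index` refers to an existing element of `v`.
template <typename T>
bool inBounds(const std::vector<T>& v, std::size_t index) {
    return index < v.size();
}

// Hypothetical helper: checked read that returns `fallback` instead of
// touching memory past the end. An unchecked v[index] with
// index >= v.size() is undefined behavior: it may yield garbage that
// happens to be harmless, or it may fault.
template <typename T>
T valueOr(const std::vector<T>& v, std::size_t index, const T& fallback) {
    return inBounds(v, index) ? v[index] : fallback;
}

// Hypothetical helper: shows that std::vector::at() reports the same
// off-by-one access as a std::out_of_range exception rather than
// reading past the end.
inline bool atThrowsPastEnd(const std::vector<int>& v) {
    try {
        (void)v.at(v.size());  // one past the last valid index
    } catch (const std::out_of_range&) {
        return true;
    }
    return false;
}
```

With a three-element vector cols, valueOr(cols, 3, -1) returns the fallback -1 and atThrowsPastEnd(cols) returns true, while an unchecked cols[3] could behave differently from run to run.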
| Comment by David Hall (Inactive) [ 2019-05-17 ] |
|
For QA: I'm not sure how to reproduce this; it is a very rare crash. It depends on how the OS sets up the stack and the heap, as well as on the STL and its pre-allocation of vectors. Otherwise, there are no behavioral differences with this PR. |
| Comment by David Hall (Inactive) [ 2019-05-23 ] |
|
I think this is the only place needing changes. In any case, this is where it broke at the customer site. I also looked at prep1PhaseDistinctAggregate, and because it goes through an extra step to get there, it happens to be correct. |
| Comment by Daniel Lee (Inactive) [ 2019-06-18 ] |
|
Has the latest code been provided to the customer? Support needs to follow up with the customer to see if the solution works. Thanks. |
| Comment by Daniel Lee (Inactive) [ 2019-07-11 ] |
|
Builds verified: 1.1.8-1 nightly, 1.2.5-1 RC (1st one). Verified by regression on 1.1.8-1 and 1.2.5-1. |