[MCOL-257] Under very heavy load, PrimProc spuriously crashes Created: 2016-07-25 Updated: 2016-09-09 Resolved: 2016-08-23 |
|
| Status: | Closed |
| Project: | MariaDB ColumnStore |
| Component/s: | ExeMgr |
| Affects Version/s: | 1.0.1 |
| Fix Version/s: | 1.0.3 |
| Type: | Bug | Priority: | Critical |
| Reporter: | David Hall (Inactive) | Assignee: | Daniel Lee (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
CentOS 6.5 on VS with 6 CPUs and 16GB mem assigned |
||
| Sprint: | 1.0.2-2 |
| Description |
|
I was running concurrency tests, which stress the system, when PrimProc suddenly crashed. I tried again with PrimProc running under GDB, which pointed to a bad pointer. It appears that jobs are getting backed up in the Priority ThreadPool (they are queued if no threads are available), and while they wait, objects on which these jobs rely are going out of scope and being destroyed. I believe this can be fixed with a judicious application of shared_ptr. |
| Comments |
| Comment by David Hall (Inactive) [ 2016-07-28 ] |
|
The fix changes fBPPHandler from a plain member to a boost::shared_ptr that is instantiated in the constructor. Every . access is replaced with ->, and the & previously used to get a pointer to pass around is removed. |
| Comment by David Hall (Inactive) [ 2016-07-28 ] |
|
The offending object is fBPPHandler, which is a simple member of ReadThread. This is passed around as a pointer via the address operator (&). It's possible for ReadThread to go out of scope before all the threads to which the pointer was sent are complete, thus deleting the BPPHandler and leaving those threads with an invalid pointer. |
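The lifetime problem and the shared_ptr fix described in the comments above can be sketched as follows. This is a minimal illustration, not the ColumnStore code: `Handler`, `Reader`, `JobQueue`, and `drainAfterOwnerDestroyed` are hypothetical stand-ins for BPPHandler, ReadThread, and the thread pool backlog, and it uses `std::shared_ptr` (C++11) instead of `boost::shared_ptr` only to stay self-contained.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <queue>

// Hypothetical stand-in for BPPHandler.
struct Handler {
    int processed = 0;
    void process() { ++processed; }
};

// Hypothetical stand-in for the thread pool backlog: jobs wait here
// until something runs them.
using JobQueue = std::queue<std::function<void()>>;

// Hypothetical stand-in for ReadThread.
class Reader {
public:
    Reader() : fHandler(std::make_shared<Handler>()) {}

    // Each queued job captures its own copy of the shared_ptr, so the
    // Handler stays alive until the last job that references it is gone.
    void enqueue(JobQueue& q) const {
        std::shared_ptr<Handler> h = fHandler;
        q.push([h] { h->process(); });
    }

    // Non-owning view, used here only to observe the Handler's lifetime.
    std::weak_ptr<Handler> observe() const { return fHandler; }

private:
    // With a plain member handed out as &fHandler, queued jobs would be
    // left with a dangling pointer once the Reader is destroyed.
    std::shared_ptr<Handler> fHandler;
};

struct Result {
    int processed;          // jobs that ran against a live Handler
    bool aliveWhileQueued;  // Handler outlived the Reader while jobs waited
    bool freedAfterDrain;   // Handler was released once the queue emptied
};

// Destroys the Reader while jobs are still queued, then drains the queue.
Result drainAfterOwnerDestroyed(int jobs) {
    JobQueue q;
    std::weak_ptr<Handler> watch;
    {
        Reader r;
        watch = r.observe();
        for (int i = 0; i < jobs; ++i) r.enqueue(q);
    }  // Reader is destroyed here; the queued jobs still own the Handler.

    Result res{0, !watch.expired(), false};
    while (!q.empty()) {
        q.front()();  // run the job
        if (auto h = watch.lock()) res.processed = h->processed;
        q.pop();      // dropping the job releases its reference
    }
    res.freedAfterDrain = watch.expired();
    return res;
}
```

Calling `drainAfterOwnerDestroyed(3)` runs all three jobs after the owner is gone, and the handler is freed only when the last queued job is popped.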
| Comment by Ben Thompson (Inactive) [ 2016-08-01 ] |
|
Review Completed |
| Comment by David Hall (Inactive) [ 2016-08-02 ] |
|
It's very difficult to test for a spurious crash – or lack thereof. There was never a sure way to reproduce this issue, and it happened rarely. I looked at the code for a while, saw a weakness, and fixed it. I haven't seen the crash since, but I can't think of a way to prove it's fixed. |
| Comment by Daniel Lee (Inactive) [ 2016-08-23 ] |
|
As there is no specific test scenario for this issue, the ticket was verified by means of regression and Autopilot tests. |