[MCOL-257] Under very heavy load, PrimProc spuriously crashes Created: 2016-07-25  Updated: 2016-09-09  Resolved: 2016-08-23

Status: Closed
Project: MariaDB ColumnStore
Component/s: ExeMgr
Affects Version/s: 1.0.1
Fix Version/s: 1.0.3

Type: Bug Priority: Critical
Reporter: David Hall (Inactive) Assignee: Daniel Lee (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

CentOS 6.5 on VS with 6 CPUs and 16 GB of memory assigned


Sprint: 1.0.2-2

 Description   

I was running concurrency tests, which stress the system, when PrimProc suddenly crashed. I tried again with PrimProc running under GDB, which pointed to a bad pointer. It appears that jobs are getting backed up in the Priority ThreadPool (they are queued if no threads are available), and while they wait, objects on which these jobs rely go out of scope and are destroyed. I believe this can be fixed with a judicious application of shared_ptr.



 Comments   
Comment by David Hall (Inactive) [ 2016-07-28 ]

The fix is a simple replacement: fBPPHandler changes from a plain member to a boost::shared_ptr that is instantiated in the constructor. Every . member access on it is replaced with ->, and the & previously used to get a pointer to pass around is removed.
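The shape of that change can be sketched roughly as follows. This is only an illustration, not the actual patch: the BPPHandler and ReadThread bodies here are hypothetical stand-ins, makeJob() and useCount() are invented for the example, and std::shared_ptr is used in place of boost::shared_ptr so the sketch compiles without Boost.

```cpp
#include <cassert>
#include <functional>
#include <memory>

// Hypothetical stand-in for the real BPPHandler; only ownership matters here.
struct BPPHandler {
    int handled = 0;
    void handle() { ++handled; }
};

// Hypothetical stand-in for ReadThread after the change.
class ReadThread {
public:
    // The handler is now instantiated in the constructor and held by shared_ptr.
    ReadThread() : fBPPHandler(std::make_shared<BPPHandler>()) {}

    // Before the change the member was a plain object ("BPPHandler fBPPHandler;")
    // and jobs were handed &fBPPHandler. Now each job receives its own copy of
    // the shared_ptr, so the handler stays alive as long as any job holds it.
    std::function<void()> makeJob() const {
        std::shared_ptr<BPPHandler> handler = fBPPHandler;  // no & needed
        return [handler]() {
            assert(handler);    // the captured copy keeps the object alive
            handler->handle();  // '.' became '->'
        };
    }

    // Number of shared_ptr owners, exposed only for illustration.
    long useCount() const { return fBPPHandler.use_count(); }

private:
    std::shared_ptr<BPPHandler> fBPPHandler;
};
```

Each job created by makeJob() adds one owner to the handler, so destroying the ReadThread no longer destroys a handler that a queued job still needs.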

Comment by David Hall (Inactive) [ 2016-07-28 ]

The offending object is fBPPHandler, which is a simple member of ReadThread. Its address is passed around as a raw pointer via the address-of operator (&). It's possible for ReadThread to go out of scope before all the threads that received the pointer have finished, which destroys the BPPHandler and leaves those threads with an invalid pointer.
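The lifetime problem can be illustrated with a small single-threaded sketch, in which a plain queue stands in for jobs backed up in the thread pool. Handler, Owner, and both functions are hypothetical names invented for this example and are not ColumnStore code. The first function shows that a queued job holding a shared_ptr copy keeps the handler alive after its owner is destroyed. The second uses a weak_ptr as a safe way to observe what a raw pointer would have hit: the object is already gone by the time the job runs.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <queue>

// Hypothetical handler that records its own liveness in an external flag.
struct Handler {
    explicit Handler(bool* alive) : alive_(alive) { *alive_ = true; }
    ~Handler() { *alive_ = false; }
    bool* alive_;
};

// Hypothetical owner, playing the role of ReadThread.
struct Owner {
    explicit Owner(bool* alive) : handler(std::make_shared<Handler>(alive)) {}
    std::shared_ptr<Handler> handler;
};

// Returns true if the handler was still alive when the queued job finally
// ran, even though the owner had already been destroyed.
bool handlerOutlivesOwner() {
    bool alive = false;
    bool aliveWhenJobRan = false;
    std::queue<std::function<void()>> pending;  // stands in for the backed-up queue
    {
        Owner owner(&alive);
        std::shared_ptr<Handler> h = owner.handler;  // the job shares ownership
        pending.push([h, &alive, &aliveWhenJobRan]() { aliveWhenJobRan = alive; });
    }  // owner is destroyed here, but the job still holds a reference
    pending.front()();
    pending.pop();  // the last reference is dropped here
    return aliveWhenJobRan;
}

// Returns true if, without shared ownership, the handler was already
// destroyed by the time the queued job ran.
bool handlerGoneWithoutOwnership() {
    bool alive = false;
    bool expired = false;
    std::queue<std::function<void()>> pending;
    {
        Owner owner(&alive);
        std::weak_ptr<Handler> w = owner.handler;  // observes, does not own
        pending.push([w, &expired]() { expired = w.expired(); });
    }  // owner and handler are both destroyed here
    pending.front()();
    pending.pop();
    return expired;
}
```

With a raw pointer in place of the weak_ptr, the second job would dereference freed memory, which matches the bad pointer reported by GDB.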

Comment by Ben Thompson (Inactive) [ 2016-08-01 ]

Review Completed

Comment by David Hall (Inactive) [ 2016-08-02 ]

It's very difficult to test for a spurious crash – or lack thereof. There was never a sure way to reproduce this issue, and it happened rarely. I looked at the code for a while, saw a weakness, and fixed it. I haven't seen the crash since, but I can't think of a way to prove it's fixed.

Comment by Daniel Lee (Inactive) [ 2016-08-23 ]

As there is no specific test scenario for this issue, the ticket was verified by means of regression and Autopilot tests.
