[MCOL-2009] Fix jobstep abort Created: 2018-12-10  Updated: 2020-08-25  Resolved: 2019-01-24

Status: Closed
Project: MariaDB ColumnStore
Component/s: ExeMgr
Affects Version/s: None
Fix Version/s: 1.1.7, 1.2.3

Type: Bug Priority: Major
Reporter: Patrick LeBlanc (Inactive) Assignee: Daniel Lee (Inactive)
Resolution: Fixed Votes: 3
Labels: None

Issue Links:
Relates
relates to MCOL-1702 Joblist thread pool leaks if mariadb ... Closed
relates to MCOL-2104 Killed query locks ExeMgr and PrimPro... Closed
Sprint: 2018-21, 2019-01

 Description   

Summary of the backtraces in the associated support ticket:

  • mysqld is waiting for results from a syscat query
  • exemgr has ~500 jobstep threads in the process of aborting, but blocked trying to send data to the next jobstep
  • primproc is idle

From the email I sent to the team:
"There are several joblists running, and all are in the process of aborting. TupleBPS threads are blocked trying to send data downstream. On abort, a joblist needs to be aborted ‘down-up’. On noticing the query was aborted, all jobsteps need to stop sending data downstream, then consume all of their remaining input to be sure that upstream jobsteps get unblocked, so that they can abort next. My suspicion is that there is a jobstep that isn’t implementing that completely right. From the backtraces I can’t tell which jobstep it is though, because it has already gone away (without draining its input)."

It should be easy to find now that we know what to look for. Start by looking for references to the cancelled() function in each jobstep to find the abort logic. Odds are that one of them is not draining its input before returning.
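The abort rule described above (stop sending downstream, then consume all remaining input so the upstream step gets unblocked) can be illustrated with a small sketch. This is not ExeMgr code: it is a Python toy in which a bounded queue.Queue stands in for the connection between two jobsteps, and the names END, upstream_step, downstream_step and upstream_finishes are all hypothetical. It also assumes the upstream step is already committed to sending and never checks the abort flag itself.

```python
import queue
import threading

END = object()  # hypothetical end-of-stream marker, not a ColumnStore name


def upstream_step(out_q, n_rows):
    """Send rows downstream through a bounded queue, then signal end of stream."""
    for row in range(n_rows):
        out_q.put(row)  # blocks while the queue is full
    out_q.put(END)


def downstream_step(in_q, cancelled, drain_on_abort):
    """Consume rows until END; on abort, optionally drain the remaining input."""
    while True:
        row = in_q.get()
        if row is END:
            return
        if cancelled.is_set():
            if drain_on_abort:
                # Correct abort: discard input until END so upstream can finish.
                while in_q.get() is not END:
                    pass
            # Buggy abort (drain_on_abort=False): return with input still pending.
            return
        # Normal processing of `row` would go here.


def upstream_finishes(drain_on_abort, n_rows=100, timeout=1.0):
    """Return True if the upstream thread exits within `timeout` seconds."""
    q = queue.Queue(maxsize=4)
    cancelled = threading.Event()
    cancelled.set()  # assumption: the query is already aborted when the steps start
    up = threading.Thread(target=upstream_step, args=(q, n_rows), daemon=True)
    down = threading.Thread(
        target=downstream_step, args=(q, cancelled, drain_on_abort), daemon=True
    )
    up.start()
    down.start()
    up.join(timeout)
    return not up.is_alive()


if __name__ == "__main__":
    print("drains input: ", upstream_finishes(True))
    print("returns early:", upstream_finishes(False))
```

Under these assumptions, the draining variant lets the upstream thread run to completion, while the early-return variant leaves it blocked in put() on a full queue. That matches the symptom in the backtraces: senders stuck behind a consumer that has already gone away.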



 Comments   
Comment by Ben Thompson (Inactive) [ 2019-01-16 ]

MCOL-1702 describes an easy way to reproduce the thread blocking.

Comment by Ben Thompson (Inactive) [ 2019-01-17 ]

QA:

  • Joblist thread pool debugging must be enabled in the <JobList> section of Columnstore.xml:
    <JobList>
      <ThreadPoolDebug>Y</ThreadPoolDebug>
    </JobList>
  • Run a query such as:
    select * from tpch1.lineitem;
  • Press Ctrl+C to kill the query before results are returned.
  • Repeat the previous two steps and monitor the number of Active threads in
    /var/log/mariadb/columnstore/trace/ThreadPool_ExeMgrJobList.log

    10:37:56.6967 Name ExeMgrJobList Active 1 ThdCnt 10 Max 100 Q 0
    

Before this fix, the number of Active threads increased over time.
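To make the trend easier to follow across repeated runs, the Active value can be pulled out of the trace log with a short script. The sketch below assumes every relevant line has the same layout as the single sample line shown above; the helper names parse_line and active_counts are hypothetical, and lines that do not match are skipped.

```python
import re
import sys

# Matches lines shaped like the sample in this ticket:
#   10:37:56.6967 Name ExeMgrJobList Active 1 ThdCnt 10 Max 100 Q 0
LINE_RE = re.compile(
    r"^\s*(?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+Name\s+(?P<name>\S+)\s+"
    r"Active\s+(?P<active>\d+)\s+ThdCnt\s+(?P<thdcnt>\d+)\s+"
    r"Max\s+(?P<max>\d+)\s+Q\s+(?P<q>\d+)\s*$"
)


def parse_line(line):
    """Return the fields of one trace line as a dict, or None if it does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    return {
        "time": m.group("time"),
        "name": m.group("name"),
        "active": int(m.group("active")),
        "thdcnt": int(m.group("thdcnt")),
        "max": int(m.group("max")),
        "q": int(m.group("q")),
    }


def active_counts(lines):
    """Return the Active value of every matching line, in file order."""
    records = (parse_line(line) for line in lines)
    return [rec["active"] for rec in records if rec is not None]


if __name__ == "__main__":
    # Usage: python active_counts.py <path to ThreadPool_ExeMgrJobList.log>
    with open(sys.argv[1]) as log:
        print(active_counts(log))
```

A list that keeps climbing after each killed query points to leaked jobstep threads; a list that returns to its baseline suggests the threads are being released.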

Comment by Daniel Lee (Inactive) [ 2019-01-21 ]

Build verified:

1.1.7-1
server commit:
b5a7a22
engine commit:
d87b9a6

1.2.3-1
server commit:
61f32f2
engine commit:
83b2d4c

The issue has been fixed in 1.1.7-1, but still exists in 1.2.3-1.

Comment by Daniel Lee (Inactive) [ 2019-01-21 ]

The issue still exists in 1.2.3-1.

Comment by Ben Thompson (Inactive) [ 2019-01-23 ]

Merged into 1.2

Comment by Daniel Lee (Inactive) [ 2019-01-24 ]

Build verified: 1.2.3-1

server commit:
61f32f2
engine commit:
ee2cb7b
