Summary of the backtraces in the associated support ticket:
- mysqld is waiting for results from a syscat query
- exemgr has ~500 jobstep threads in the process of aborting, but blocked trying to send data to the next jobstep
- primproc is idle
From the email I sent to the team:
"There are several joblists running, and all are in the process of aborting. TupleBPS threads are blocked trying to send data downstream. On abort, a joblist needs to be aborted ‘down-up’. On noticing the query was aborted, all jobsteps need to stop sending data downstream, then consume all of their remaining input to be sure that upstream jobsteps get unblocked, so that they can abort next. My suspicion is that there is a jobstep that isn’t implementing that completely right. From the backtraces I can’t tell which jobstep it is though, because it has already gone away (without draining its input)."
It should be easy to find now that we know what to look for. Start by searching each jobstep for references to the cancelled() function to locate its abort logic. Odds are one of them returns without draining its input first.
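For anyone auditing the jobsteps, here is a minimal sketch of the pattern the abort logic should follow. It is not the actual jobstep code: BoundedQueue, runStep, runDemo and the atomic flag are hypothetical stand-ins for the real datalists and the cancelled() check. The point is that once the step sees the abort it stops sending downstream but keeps reading its input until end of stream, so the upstream producer never stays blocked on a full queue.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical bounded FIFO standing in for the datalist between two jobsteps.
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Blocks while the queue is full. This is where an upstream step hangs
    // if its consumer returns without draining.
    void push(int v) {
        std::unique_lock<std::mutex> lk(m_);
        notFull_.wait(lk, [this] { return q_.size() < capacity_; });
        q_.push(v);
        notEmpty_.notify_one();
    }

    // Signals end of input; no pushes follow.
    void close() {
        std::lock_guard<std::mutex> lk(m_);
        closed_ = true;
        notEmpty_.notify_all();
    }

    // Returns false once the queue is closed and empty.
    bool pop(int& v) {
        std::unique_lock<std::mutex> lk(m_);
        notEmpty_.wait(lk, [this] { return !q_.empty() || closed_; });
        if (q_.empty()) return false;
        v = q_.front();
        q_.pop();
        notFull_.notify_one();
        return true;
    }

private:
    std::mutex m_;
    std::condition_variable notFull_, notEmpty_;
    std::queue<int> q_;
    std::size_t capacity_;
    bool closed_ = false;
};

// Hypothetical jobstep body. After an abort it stops forwarding but still
// consumes every remaining input item so the upstream step gets unblocked.
std::size_t runStep(BoundedQueue& in, BoundedQueue& out,
                    const std::atomic<bool>& cancelled) {
    std::size_t forwarded = 0;
    int v = 0;
    while (in.pop(v)) {
        if (cancelled.load()) continue;  // drain: read and discard
        out.push(v);
        ++forwarded;
    }
    out.close();
    return forwarded;
}

// Runs one producer against one step. Returning at all means the producer
// was able to finish, i.e. it was never left blocked on a full input queue.
std::size_t runDemo(bool abortFirst, int items) {
    BoundedQueue in(4);  // small, so the producer blocks unless drained
    BoundedQueue out(static_cast<std::size_t>(items));  // never fills here
    std::atomic<bool> cancelled(abortFirst);
    std::thread producer([&in, items] {
        for (int i = 0; i < items; ++i) in.push(i);
        in.close();
    });
    std::size_t forwarded = runStep(in, out, cancelled);
    producer.join();
    return forwarded;
}
```

The suspected bug corresponds to replacing the `continue` in runStep with a `return`: the step would exit on abort, the producer would fill the 4-slot input queue and block in push() forever, which matches the blocked TupleBPS threads in the backtraces.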