[MCOL-4019] controllernode hangs on SIGTERM Created: 2020-05-26 Updated: 2020-11-12 Resolved: 2020-06-12 |
|
| Status: | Closed |
| Project: | MariaDB ColumnStore |
| Component/s: | None |
| Affects Version/s: | 1.5.3 |
| Fix Version/s: | 1.5.1 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Roman | Assignee: | Daniel Lee (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||
| Sprint: | 2020-7 | ||||||||
| Description |
|
As of the pre-release 1.5 code, controllernode must gracefully finish all workernode connections and return. However, it hangs indefinitely and can only be killed with SIGKILL. Here is the state it hangs in.
JFYI threadpool::ThreadPool::stop() waits on fPruneThread->join(). |
| Comments |
| Comment by Patrick LeBlanc (Inactive) [ 2020-06-02 ] | |||||||||
|
My suspicion is that the threads it's trying to join are blocked on recv() when the term signal comes in. A possible sol'n is to reduce the timeout on the recv() call, so each thread can poll its status vars more often and know it should close & exit. Once every couple of secs wouldn't add any measurable overhead. | |||||||||
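For illustration only, the idea in the comment above could look roughly like the sketch below. It is not ColumnStore code: `receiveUntilStopped`, `demoReceive`, the 50 ms timeout, and the stop flag are all hypothetical names and values, and the sketch uses `poll()` with a timeout in front of `recv()` rather than a receive timeout set on the socket itself.

```cpp
#include <atomic>
#include <cassert>
#include <cerrno>
#include <chrono>
#include <cstddef>
#include <thread>

#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical receive loop: waits for data at most timeoutMs at a time,
// so the stop flag is re-checked regularly instead of blocking forever.
// Returns the total number of bytes received before stopping.
std::size_t receiveUntilStopped(int fd, const std::atomic<bool>& stop, int timeoutMs)
{
    std::size_t total = 0;
    char buf[4096];
    while (!stop.load())
    {
        pollfd pfd{};
        pfd.fd = fd;
        pfd.events = POLLIN;
        const int ready = poll(&pfd, 1, timeoutMs);
        if (ready == 0 || (ready < 0 && errno == EINTR))
            continue;  // timeout or interrupted: loop back and re-check the flag
        if (ready < 0)
            break;     // unrecoverable poll() error
        const ssize_t n = recv(fd, buf, sizeof buf, 0);
        if (n <= 0)
            break;     // peer closed the connection, or recv() failed
        total += static_cast<std::size_t>(n);
    }
    return total;
}

// Small self-check over a local socket pair (assumed setup, for the demo only).
std::size_t demoReceive()
{
    int sv[2];
    const int rc = socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
    assert(rc == 0);
    (void)rc;

    std::atomic<bool> stop{false};
    std::size_t got = 0;
    std::thread receiver([&] { got = receiveUntilStopped(sv[0], stop, 50); });

    send(sv[1], "hello", 5, 0);
    std::this_thread::sleep_for(std::chrono::milliseconds(300));
    stop.store(true);  // no more data arrives; the thread notices on its next timeout
    receiver.join();   // returns promptly instead of hanging

    close(sv[0]);
    close(sv[1]);
    return got;
}
```

With a short timeout the join completes shortly after the flag is set, which is the behaviour the comment is after. As the later comments show, this turned out not to be the cause of the hang in this ticket.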
| Comment by Roman [ 2020-06-04 ] | |||||||||
|
Not at all. I've looked into the problem in workernode and controllernode. There are no threads other than main, so nobody is blocked. It looks like a missed thread saved. | |||||||||
| Comment by Roman [ 2020-06-05 ] | |||||||||
|
The problem is caused by the fact that we link everything against almost everything, so here we go. The joblist library has a static ThreadPool member that is constructed on startup and destructed on shutdown.
| |||||||||
| Comment by Roman [ 2020-06-09 ] | |||||||||
|
The problem happens if we use joblist with a daemon that forks at the very beginning, as workernode/controllernode do by default. The joblist namespace contains a static ThreadPool variable, so the dynamic loader initializes it before the process forks. Then the main process exits, and the forked child knows nothing about the thread that was created previously. When the binary later receives SIGTERM and exits, the dynamic loader tries to join the thread allocated in a separate process and hangs until it is killed. | |||||||||
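The mechanism described above can be reproduced in isolation with a minimal sketch like the following. It is not the ColumnStore code and not the fix from this ticket: `workerRunsInChild` is a hypothetical helper, and a plain `std::thread` stands in for the prune thread owned by the static ThreadPool. It only demonstrates that a thread started before `fork()` does not exist in the child.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <stdexcept>
#include <thread>

#include <poll.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: starts a background thread, forks, and reports
// whether that thread is still making progress inside the child process.
bool workerRunsInChild()
{
    std::atomic<bool> stop{false};
    std::atomic<unsigned long> ticks{0};

    // Stand-in for a thread that was started before the daemon forks.
    std::thread worker([&stop, &ticks] {
        while (!stop.load())
        {
            ticks.fetch_add(1);
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
    });

    // Make sure the thread is really running before forking.
    while (ticks.load() == 0)
        std::this_thread::yield();

    const pid_t pid = fork();
    if (pid == -1)
    {
        stop.store(true);
        worker.join();
        throw std::runtime_error("fork failed");
    }

    if (pid == 0)
    {
        // Child: fork() duplicated only the calling thread, so nothing
        // increments the counter here any more.
        const unsigned long before = ticks.load();
        poll(nullptr, 0, 200);  // sleep for about 200 ms
        const unsigned long after = ticks.load();
        // The copied 'worker' object still looks joinable, but joining it
        // here would wait for a thread that does not exist in this process.
        _exit(after != before ? 1 : 0);
    }

    // Parent: collect the child's verdict, then shut the real thread down.
    int status = 0;
    const pid_t waited = waitpid(pid, &status, 0);
    assert(waited == pid);
    (void)waited;
    stop.store(true);
    worker.join();
    return WIFEXITED(status) && WEXITSTATUS(status) == 1;
}
```

On Linux this returns false: the counter stays frozen in the child. A destructor that joins such a thread in the forked process, as the static ThreadPool does at exit according to the description above, would be expected to block in the same way.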
| Comment by Roman [ 2020-06-09 ] | |||||||||
|
Plz review. | |||||||||
| Comment by Patrick LeBlanc (Inactive) [ 2020-06-10 ] | |||||||||
|
Good find! | |||||||||
| Comment by Roman [ 2020-06-11 ] | |||||||||
|
4QA: to test this one needs to:
At this point workernode must have been terminated. | |||||||||
| Comment by Daniel Lee (Inactive) [ 2020-06-11 ] | |||||||||
|
Build tested: 1.5.0-1 (drone 20200611 b66). Tested the scenario above. It worked as described. When running the last kill command again, the workernode process did get terminated. Is this expected? | |||||||||
| Comment by Roman [ 2020-06-12 ] | |||||||||
|
This info is much appreciated but it is outside the scope of this issue |