[MCOL-4019] controllernode hangs on SIGTERM Created: 2020-05-26  Updated: 2020-11-12  Resolved: 2020-06-12

Status: Closed
Project: MariaDB ColumnStore
Component/s: None
Affects Version/s: 1.5.3
Fix Version/s: 1.5.1

Type: Bug Priority: Critical
Reporter: Roman Assignee: Daniel Lee (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
PartOf
is part of MCOL-3836 Columnstore OAM replacement Closed
Sprint: 2020-7

 Description   

As of the pre-release 1.5 code, controllernode must gracefully finish all workernode connections and return on SIGTERM. Instead it hangs indefinitely and can only be killed with SIGKILL. Here is the state it hangs in:

#0  0x00007f56cefa8a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000055602b2e3411 in boost::condition_variable::wait (this=0x55602d206cc8, m=...) at /usr/include/boost/thread/pthread/condition_variable.hpp:73
#2  0x00007f56c9f83784 in boost::thread::join_noexcept() () from /lib64/libboost_thread-mt.so.1.53.0
#3  0x00007f56ce37b63f in boost::thread::join (this=0x55602d1a37c0) at /usr/include/boost/thread/detail/thread.hpp:751
#4  0x00007f56cb7a1612 in threadpool::ThreadPool::stop (this=0x7f56ce7c22c0 <joblist::JobStep::jobstepThreadPool>) at /data/mdb-server/storage/columnstore/utils/threadpool/threadpool.cpp:137
#5  0x00007f56cb7a0ead in threadpool::ThreadPool::~ThreadPool (this=0x7f56ce7c22c0 <joblist::JobStep::jobstepThreadPool>, __in_chrg=<optimized out>) at /data/mdb-server/storage/columnstore/utils/threadpool/threadpool.cpp:60
#6  0x00007f56c811d05a in __cxa_finalize () from /lib64/libc.so.6
#7  0x00007f56ce1bc7c3 in __do_global_dtors_aux () from /lib64/libjoblist.so
#8  0x00007ffe9b51ace0 in ?? ()
#9  0x00007f56cf1c907a in _dl_fini () from /lib64/ld-linux-x86-64.so.2

JFYI threadpool::ThreadPool::stop() waits on fPruneThread->join().



 Comments   
Comment by Patrick LeBlanc (Inactive) [ 2020-06-02 ]

My suspicion is that the threads it's trying to join are blocked on recv() when the TERM signal comes in. A possible solution is to reduce the timeout on the recv() call so the thread can poll its status vars more often and know when it should close and exit. Once every couple of seconds wouldn't add any measurable overhead.
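
A minimal sketch of the idea in this comment, not the ColumnStore code: a reader wakes up on a short timeout and re-checks a stop flag instead of blocking in recv() forever. It uses poll() with a timeout in front of recv() rather than a socket-level receive timeout. The names stopRequested, recvUntilStopped and readerStopsPromptly are hypothetical and exist only for this illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cerrno>
#include <chrono>
#include <thread>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical stop flag; stands in for the "status vars" mentioned above.
static std::atomic<bool> stopRequested{false};

// Waits for data on fd, waking up every timeoutMs to re-check the stop flag.
// Returns the recv() result when data arrives (0 means the peer closed),
// 0 if a stop was requested first, and -1 on a poll() error.
ssize_t recvUntilStopped(int fd, char* buf, size_t len, int timeoutMs)
{
  while (!stopRequested.load())
  {
    pollfd pfd{};
    pfd.fd = fd;
    pfd.events = POLLIN;
    int rc = poll(&pfd, 1, timeoutMs);
    if (rc < 0)
    {
      if (errno == EINTR)
        continue;  // interrupted by a signal: re-check the flag
      return -1;
    }
    if (rc == 0)
      continue;  // timed out: loop around and re-check the flag
    return recv(fd, buf, len, 0);
  }
  return 0;
}

// Demo: a reader waiting on an idle socket returns soon after the flag is set.
bool readerStopsPromptly()
{
  int sv[2];
  if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
    return false;
  stopRequested = false;
  ssize_t result = -2;
  std::thread reader([&] {
    char buf[16];
    result = recvUntilStopped(sv[0], buf, sizeof buf, 50);
  });
  std::this_thread::sleep_for(std::chrono::milliseconds(100));
  stopRequested = true;  // what a SIGTERM handler would request
  reader.join();         // returns within roughly one poll timeout
  close(sv[0]);
  close(sv[1]);
  return result == 0;
}
```

With a 50 ms timeout the join in readerStopsPromptly() completes within about one polling interval of the flag being set; a timeout of a couple of seconds, as suggested above, trades shutdown latency for fewer wakeups.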

Comment by Roman [ 2020-06-04 ]

Not at all. I've looked into the problem in workernode and controllernode. There are no threads other than main, so nobody is blocked. It looks like a saved handle to a thread that no longer exists.

Comment by Roman [ 2020-06-05 ]

The problem is caused by the fact that we link everything against almost everything. The joblist library has a static ThreadPool member that is constructed on startup and destructed on shutdown:

#0  threadpool::ThreadPool::ThreadPool (this=0x7ffff73e42c0 <joblist::JobStep::jobstepThreadPool>, maxThreads=100, queueSize=0) at /data/mdb-server/storage/columnstore/columnstore/utils/threadpool/threadpool.cpp:48
#1  0x00007ffff6f4cbdd in __static_initialization_and_destruction_0 (__initialize_p=1, __priority=65535) at /data/mdb-server/storage/columnstore/columnstore/dbcon/joblist/jobstep.cpp:60
#2  0x00007ffff6f4d8f7 in _GLOBAL__sub_I_jobstep.cpp(void) () at /data/mdb-server/storage/columnstore/columnstore/dbcon/joblist/jobstep.cpp:212
#3  0x00007ffff7dea9b3 in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
#4  0x00007ffff7ddc17a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#5  0x0000000000000002 in ?? ()
#6  0x00007fffffffe73b in ?? ()
#7  0x00007fffffffe74f in ?? ()
#8  0x0000000000000000 in ?? ()
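
The pattern seen in this trace can be sketched as follows. This is an illustration under assumptions, not the real threadpool::ThreadPool: DemoPool and DemoStep are hypothetical names, and only the shape matters. A static class member whose constructor starts a helper thread is built during static initialization, before main() runs, and its destructor joins that thread at process exit.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for a thread pool: starts a helper thread on
// construction and joins it on destruction.
class DemoPool
{
 public:
  DemoPool() : fPruneThread([this] { pruneLoop(); })
  {
  }

  ~DemoPool()
  {
    stop();
  }

  // Asks the helper thread to finish and waits for it.
  void stop()
  {
    {
      std::lock_guard<std::mutex> lk(fMutex);
      fStop = true;
    }
    fCond.notify_all();
    if (fPruneThread.joinable())
      fPruneThread.join();
  }

  bool running() const
  {
    return fPruneThread.joinable();
  }

 private:
  // Helper thread body: sleeps until stop() is called.
  void pruneLoop()
  {
    std::unique_lock<std::mutex> lk(fMutex);
    fCond.wait(lk, [this] { return fStop; });
  }

  // Declared before the thread so they exist when the thread starts.
  std::mutex fMutex;
  std::condition_variable fCond;
  bool fStop = false;
  std::thread fPruneThread;
};

// Hypothetical owner with a static member, mirroring the shape in the trace.
struct DemoStep
{
  static DemoPool pool;
};

// Constructed during static initialization, i.e. before main() is entered.
DemoPool DemoStep::pool;
```

In a process that never forks this is harmless: by the time main() starts, DemoStep::pool.running() is already true, and the join in the destructor returns because the helper thread still exists in the same process.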

Comment by Roman [ 2020-06-09 ]

The problem happens if we use joblist with a daemon that forks at the very beginning, as workernode and controllernode do by default. The joblist namespace contains a static ThreadPool variable, so the dynamic loader initializes it before the process forks. Then the parent process exits, and the forked child knows nothing about the thread that was created previously. When the binary later receives SIGTERM and exits, the dynamic loader runs the ThreadPool destructor, which tries to join a thread that was created in a different process and hangs until it is killed.
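
The mechanism described in this comment relies on the fact that fork() duplicates only the calling thread. A small sketch, assuming a POSIX system, makes that visible: a worker thread keeps bumping a counter in the parent, while the copy of that counter in the child stays frozen because no worker exists there. The function name childLosesWorkerThread is hypothetical. The child deliberately calls _exit() instead of joining, since joining the inherited thread handle is exactly what would block forever.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>
#include <poll.h>
#include <sys/wait.h>
#include <unistd.h>

// Returns true if the forked child observes that the worker thread started
// before fork() is not running in the child.
bool childLosesWorkerThread()
{
  std::atomic<unsigned long> ticks{0};
  std::atomic<bool> stop{false};

  // Worker thread created before the fork, like a pool's helper thread.
  std::thread worker([&] {
    while (!stop.load())
    {
      ++ticks;
      std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
  });

  // Make sure the worker is really running before forking.
  while (ticks.load() == 0)
    std::this_thread::yield();

  pid_t pid = fork();
  if (pid == 0)
  {
    // Child: only the thread that called fork() exists here.
    unsigned long before = ticks.load();
    poll(nullptr, 0, 100);  // sleep 100 ms
    unsigned long after = ticks.load();
    // Do not join the worker here: there is nothing to join in this process.
    _exit(before == after ? 0 : 1);
  }

  // Parent: collect the child's verdict, then stop the real worker.
  int status = 0;
  bool waited = pid > 0 && waitpid(pid, &status, 0) == pid;
  stop = true;
  worker.join();
  return waited && WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

One common way to avoid the situation, offered here only as a general option and not as the change made for this issue, is to create thread-owning objects after the daemon has forked rather than during static initialization.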

Comment by Roman [ 2020-06-09 ]

Plz review.

Comment by Patrick LeBlanc (Inactive) [ 2020-06-10 ]

Good find!

Comment by Roman [ 2020-06-11 ]

4QA: to test this one needs to:

  • Run loadbrm manually
  • Run 'workernode DBRM_Workernode1 fg'
  • Run kill -15 $(pidof workernode)

At this point workernode must have been terminated.
At the same time, if one runs 'workernode DBRM_Workernode1' (without fg), then kill -15 doesn't terminate the process.

Comment by Daniel Lee (Inactive) [ 2020-06-11 ]

Build tested: 1.5.0-1 (drone 20200611 b66)

Tested the scenario above. It worked as described. When running the last kill command again, the workernode process did get terminated. Is this expected?

Comment by Roman [ 2020-06-12 ]

This info is much appreciated, but it is outside the scope of this issue IMHO.
Regards,
Roman Nozdrin
ColumnStore Engineering
MariaDB Corporation
