[MXS-2584] Race condition between startup/shutdown and signal delivery Created: 2019-07-01  Updated: 2020-07-03  Resolved: 2020-07-03

Status: Closed
Project: MariaDB MaxScale
Component/s: Core
Affects Version/s: None
Fix Version/s: 2.3.21, 2.4.11

Type: Bug Priority: Minor
Reporter: markus makela Assignee: markus makela
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Problem/Incident
causes MXS-2818 malloc deadlock in sigfatal_handler Closed
Relates
relates to MXS-599 Give signal handling an overhaul. Closed
Epic Link: MaxScale Core
Sprint: MXS-SPRINT-87

 Description   

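Ran into a repeating crash when running the mxs621_unreadable_cnf test. Upon further inspection, there is a race condition in the shutdown code: the signal is delivered after the workers have already stopped. As the backtrace shows, sigterm_handler then calls maxscale_shutdown(), which crashes in RoutingWorker::get(), and the resulting fatal signal handler hits an assertion in mxb_log_message().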

#0  0x00007fed7bcd5207 in raise () from /lib64/libc.so.6
#1  0x00007fed7bcd68f8 in abort () from /lib64/libc.so.6
#2  0x00007fed7bcce026 in __assert_fail_base () from /lib64/libc.so.6
#3  0x00007fed7bcce0d2 in __assert_fail () from /lib64/libc.so.6
#4  0x00007fed7e6a58f4 in mxb_log_message (priority=1, modname=0x0, file=0x417b28 "/home/vagrant/MaxScale/server/core/gateway.cc", line=422, 
    function=0x41a3f0 <sigfatal_handler(int)::__func__> "sigfatal_handler", format=0x417cb0 "Fatal: MaxScale 2.3.9 received fatal signal %d. Attempting backtrace.")
    at /home/vagrant/MaxScale/maxutils/maxbase/src/log.cc:710
#5  0x000000000040c8cc in sigfatal_handler (i=6) at /home/vagrant/MaxScale/server/core/gateway.cc:422
#6  <signal handler called>
#7  0x00007fed7bcd5207 in raise () from /lib64/libc.so.6
#8  0x00007fed7bcd68f8 in abort () from /lib64/libc.so.6
#9  0x00007fed7bcce026 in __assert_fail_base () from /lib64/libc.so.6
#10 0x00007fed7bcce0d2 in __assert_fail () from /lib64/libc.so.6
#11 0x00007fed7e6a58f4 in mxb_log_message (priority=1, modname=0x0, file=0x417b28 "/home/vagrant/MaxScale/server/core/gateway.cc", line=422, 
    function=0x41a3f0 <sigfatal_handler(int)::__func__> "sigfatal_handler", format=0x417cb0 "Fatal: MaxScale 2.3.9 received fatal signal %d. Attempting backtrace.")
    at /home/vagrant/MaxScale/maxutils/maxbase/src/log.cc:710
#12 0x000000000040c8cc in sigfatal_handler (i=11) at /home/vagrant/MaxScale/server/core/gateway.cc:422
#13 <signal handler called>
#14 0x00007fed7e669a32 in maxscale::RoutingWorker::get (worker_id=-1) at /home/vagrant/MaxScale/server/core/routingworker.cc:456
#15 0x00007fed7e62f239 in maxscale_shutdown () at /home/vagrant/MaxScale/server/core/misc.cc:54
#16 0x000000000040c671 in sigterm_handler (i=15) at /home/vagrant/MaxScale/server/core/gateway.cc:350
#17 <signal handler called>
#18 0x00007fed78badc90 in cleanup_fscreatecon () from /lib64/libkrb5support.so.0
#19 0x00007fed7ea45fba in _dl_fini () from /lib64/ld-linux-x86-64.so.2
#20 0x00007fed7bcd8b69 in __run_exit_handlers () from /lib64/libc.so.6
#21 0x00007fed7bcd8bb7 in exit () from /lib64/libc.so.6
#22 0x00007fed7bcc13dc in __libc_start_main () from /lib64/libc.so.6
#23 0x000000000040c319 in _start ()



 Comments   
Comment by markus makela [ 2019-07-01 ]

Looks like disabling SIGTERM and SIGINT before the RoutingWorker::finish() call solves it, since this prevents the signals from being delivered after the workers have been deleted. There is still a small window during startup in which messages can be posted to workers that haven't been initialized, which is a use of uninitialized memory. Checking whether the routing workers have been initialized (i.e. this_unit.initialized in routingworker.cc) would cover both cases, but a theoretical race condition would remain. Another problem with this approach is that any signal received while the workers aren't yet initialized would be ignored.
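
The two ideas above (blocking the termination signals before teardown, and guarding the handler with an "initialized" flag) can be sketched roughly as follows. This is a standalone illustration, not the MaxScale code: workers_initialized is a hypothetical stand-in for this_unit.initialized, all function names are made up, and "disabling" is assumed here to mean blocking the signals with pthread_sigmask().

```cpp
#include <atomic>
#include <csignal>
#include <pthread.h>
#include <signal.h>

// Hypothetical stand-in for this_unit.initialized in routingworker.cc.
// std::atomic<bool> is lock-free on mainstream platforms, which is what
// makes it usable from a signal handler.
static std::atomic<bool> workers_initialized{false};

// Counts the termination requests that the handler accepted.
static volatile std::sig_atomic_t accepted_requests = 0;

// Handler that only acts while the workers exist. A signal that arrives
// before initialization is dropped, which is the drawback noted above.
extern "C" void guarded_sigterm_handler(int)
{
    if (workers_initialized.load())
    {
        accepted_requests = accepted_requests + 1;
    }
}

// Blocks SIGTERM and SIGINT for the calling thread. The intent is to call
// this just before the workers are torn down. Other threads keep their own
// signal masks, so a multithreaded program has to block there as well.
bool block_termination_signals()
{
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, SIGTERM);
    sigaddset(&set, SIGINT);
    return pthread_sigmask(SIG_BLOCK, &set, nullptr) == 0;
}

// Reports whether a blocked SIGTERM is waiting to be delivered.
bool sigterm_is_pending()
{
    sigset_t pending;
    sigemptyset(&pending);
    return sigpending(&pending) == 0 && sigismember(&pending, SIGTERM) == 1;
}
```

A blocked signal stays pending instead of running the handler, so the handler can no longer touch workers that have already been deleted.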

To do the shutdown signal handling properly, the workers themselves would have to perform the shutdown upon noticing that maxscale_is_shutting_down() returns true. This would be rather simple to do by adding a delayed_call for a function that does the shutdown if a termination signal has been received.
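
A minimal sketch of the polling idea, under stated assumptions: the signal handler only sets a flag, and a periodically invoked callback performs the shutdown once it sees the flag. The names shutdown_requested, shutdown_performed and shutdown_poll_tick are hypothetical. The real delayed_call API is deliberately not used here; the bool return value (true meaning "call me again") is only an assumed convention for this illustration.

```cpp
#include <atomic>
#include <csignal>

// Set by the signal handler, read by the worker. Plays the role of the
// state behind maxscale_is_shutting_down() in this sketch.
static std::atomic<bool> shutdown_requested{false};

// Set by the worker once it has done the shutdown work.
static std::atomic<bool> shutdown_performed{false};

// The handler does nothing except record the request.
extern "C" void record_shutdown_request(int)
{
    shutdown_requested.store(true);
}

// Callback meant to be run periodically on a worker thread.
// Returns true to be scheduled again, false once shutdown has been done.
bool shutdown_poll_tick()
{
    if (!shutdown_requested.load())
    {
        return true;    // nothing to do yet, keep polling
    }

    // The actual shutdown work would go here. It runs on the worker itself,
    // so it cannot race with the worker's own teardown.
    shutdown_performed.store(true);
    return false;
}
```

The cost of this design is latency: the shutdown starts only on the next tick after the signal, so the polling interval bounds how quickly the process reacts.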

Another option would be to have a separate thread that waits on a semaphore that is posted inside the shutdown function called by the signal handler. The waiting thread would then post a message to the main routing worker to do the actual shutdown work. This would be faster than polling the variable with delayed_call.
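
The semaphore variant could look roughly like the sketch below. It relies on sem_post() being async-signal-safe under POSIX. Everything else is an assumption made for illustration: the names are hypothetical, and the "message to the main routing worker" is reduced to setting a flag, since the real messaging API is not shown in this issue.

```cpp
#include <atomic>
#include <cerrno>
#include <csignal>
#include <semaphore.h>
#include <thread>

static sem_t shutdown_sem;

// Stand-in for "a shutdown message was posted to the main routing worker".
static std::atomic<bool> shutdown_message_posted{false};

// Must be called once, before the handler is installed.
bool init_shutdown_semaphore()
{
    return sem_init(&shutdown_sem, 0, 0) == 0;
}

// The handler only posts the semaphore and preserves errno.
extern "C" void post_shutdown_semaphore(int)
{
    int saved_errno = errno;
    sem_post(&shutdown_sem);
    errno = saved_errno;
}

// Body of the waiting thread: sleep until the handler posts, then hand the
// real work over to the worker (simulated here by setting a flag).
void wait_and_forward_shutdown()
{
    int rc;
    do
    {
        rc = sem_wait(&shutdown_sem);
    }
    while (rc == -1 && errno == EINTR);   // retry if interrupted by a signal

    if (rc == 0)
    {
        shutdown_message_posted.store(true);
    }
}

// Starts the dedicated waiter thread; the caller joins it during shutdown.
std::thread start_shutdown_waiter()
{
    return std::thread(wait_and_forward_shutdown);
}
```

Unlike the polling approach, the waiter wakes up as soon as the semaphore is posted, at the price of one extra thread that is idle for the lifetime of the process.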

Comment by markus makela [ 2019-07-01 ]

Lowered the priority, as this is quite hard to reproduce and only appears to happen when startup fails due to some system error.

Comment by markus makela [ 2020-07-03 ]

Fixed by commit c9badcb09c0901bd5075f563ada1a744d6c0745b.
