[MXS-2547] Stopping MaxScale during a REST API query causes the process to hang Created: 2019-06-06  Updated: 2020-01-08  Resolved: 2019-06-19

Status: Closed
Project: MariaDB MaxScale
Component/s: REST-API
Affects Version/s: 2.2.21, 2.3.7
Fix Version/s: 2.2.22, 2.3.9

Type: Bug Priority: Major
Reporter: lishubing Assignee: markus makela
Resolution: Fixed Votes: 0
Labels: None


 Description   

Stopping MaxScale stops all running workers (including worker#0), and the exit of worker#0 interrupts the processing of in-flight microhttpd API queries.

In the resource_handle_request function, the microhttpd thread posts a task to worker#0 and then waits on a semaphore. If worker#0 shuts down without finishing that task, the semaphore is never posted and the microhttpd thread blocks forever.

The MaxScale shutdown then continues and eventually calls MHD_stop_daemon, which calls thread_join on the microhttpd threads and hangs on the blocked one.
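The deadlock described above can be sketched with a minimal worker/queue model (hypothetical names, plain Python threading; this is not MaxScale code). The client posts a task to the worker and waits on a semaphore the task is supposed to release, mirroring Worker::call(); because the worker has already stopped, the task never runs and the wait would never return without a timeout:

```python
import queue
import threading

task_queue = queue.Queue()
STOP = object()  # sentinel telling the worker to exit

def worker():
    # Drain tasks until told to stop; anything queued after that is never run.
    while True:
        task = task_queue.get()
        if task is STOP:
            return
        task()

w = threading.Thread(target=worker)
w.start()

# The worker is stopped *before* the client's task is queued,
# as happens to worker#0 during MaxScale shutdown.
task_queue.put(STOP)
w.join()

done = threading.Semaphore(0)
task_queue.put(done.release)         # task posted to an already-stopped worker
signalled = done.acquire(timeout=1)  # the real code waits here with no timeout
print(signalled)  # False: the semaphore is never posted
```

In the real bug the wait in maxbase::Semaphore::wait has no timeout, so the microhttpd thread stays blocked permanently and MHD_stop_daemon later hangs joining it.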

You may reproduce the bug in this way:

  • First, call any REST API endpoint in an infinite loop (using Python):

import requests

while True:
    requests.get('http://127.0.0.1:30000/v1/services', auth=('admin', 'mariadb'))

  • Second, use the maxadmin interface to run 'shutdown maxscale'.

In most cases, the MaxScale process hangs and stops responding to any request. (It is very easy for me to reproduce.)

Here is a sample stack trace of a blocked microhttpd thread:

do_futex_wait.constprop 0x00007ffff7bccafb
__new_sem_wait_slow.constprop.0 0x00007ffff7bccb8f
sem_wait@@GLIBC_2.2.5 0x00007ffff7bccc2b
maxbase::Semaphore::wait semaphore.hh:115
maxbase::Worker::call(std::function<void ()>, maxbase::Worker::execute_mode_t) worker.cc:516
resource_handle_request resource.cc:1337
Client::process admin.cc:126
handle_client admin.cc:266
call_connection_handler connection.c:1834
MHD_connection_handle_idle connection.c:2909
call_handlers daemon.c:1154
MHD_epoll daemon.c:4386
MHD_select_thread daemon.c:4544
start_thread 0x00007ffff7bc6e25
clone 0x00007ffff1e09bad



 Comments   
Comment by lishubing [ 2019-06-06 ]

My workaround is: before stopping the MaxScale workers, call MHD_quiesce_daemon so the API server stops listening, then sleep(1) to let the workers finish their tasks; after that the shutdown continues and MaxScale terminates successfully.
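The ordering the workaround relies on can be sketched with the same worker/queue model (hypothetical names, not MaxScale code): first stop producing new API requests (MHD_quiesce_daemon in the real code), then let queued work drain (the sleep(1) stand-in), and only then stop the worker:

```python
import queue
import threading
import time

task_queue = queue.Queue()
STOP = object()  # sentinel telling the worker to exit

def worker():
    while True:
        task = task_queue.get()
        if task is STOP:
            return
        task()

w = threading.Thread(target=worker)
w.start()

results = []
done = threading.Semaphore(0)

def api_request():
    # Stand-in for a REST API task posted via Worker::call().
    results.append("handled")
    done.release()

# An in-flight API request is already queued when shutdown begins.
task_queue.put(api_request)

# Workaround ordering: (1) quiesce -- no new requests are enqueued from
# here on; (2) wait for queued work to drain; (3) stop the worker.
while not task_queue.empty():   # crude stand-in for the sleep(1) wait
    time.sleep(0.01)
task_queue.put(STOP)
w.join()

ok = done.acquire(timeout=1)
print(ok, results)  # the request completed before the worker stopped
```

Draining by polling (or a fixed sleep, as in the workaround) is best-effort; the later fix of stopping the REST API before the workers removes the race entirely.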

Comment by markus makela [ 2019-06-10 ]

I think the REST API must be stopped before the workers are stopped to prevent this from happening.

Comment by lishubing [ 2019-06-11 ]

In my case, an external monitor service continuously fetches MaxScale information through the REST API (at a 3-second interval). The external service has no way of knowing when to stop calling the REST API, so REST API calls arriving while MaxScale is shutting down are a common case.

Back to your point that "the REST API must be stopped before the workers are stopped": the shutdown procedure simply exits the workers, which means that by the time a shutdown is triggered the workers are already stopping, so the REST API cannot be stopped before them.

Comment by markus makela [ 2019-06-18 ]

Managed to partially reproduce this by adding a debug assertion that catches when a message is posted to a worker that has already stopped.

Comment by markus makela [ 2019-06-18 ]

Stacktrace:

Thread 3 (Thread 0x7f8fe67a0700 (LWP 19294)):
#0  0x00007f8fee85ce96 in do_futex_wait.constprop () from /lib64/libpthread.so.0
#1  0x00007f8fee85cf98 in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007f8feeaa9d7b in maxscale::Semaphore::wait (this=0x7f8fe679e0e0, signal_approach=maxscale::Semaphore::IGNORE_SIGNALS) at /home/markusjm/MaxScale/include/maxscale/semaphore.hh:114
#3  0x00007f8feeaa8d99 in resource_handle_request (request=...) at /home/markusjm/MaxScale/server/core/resource.cc:1181
#4  0x00007f8fee9debb7 in Client::process (this=0x604000024e90, url="/v1/services", method="GET", upload_data=0x0, upload_size=0x7f8fe679eae8) at /home/markusjm/MaxScale/server/core/admin.cc:130
#5  0x00007f8fee9e0013 in handle_client (cls=0x0, connection=0x614000076040, url=0x62d00005a404 "/v1/services", method=0x62d00005a400 "GET", version=0x62d00005a411 "HTTP/1.1", upload_data=0x0, upload_data_size=0x7f8fe679eae8, con_cls=0x614000076098) at /home/markusjm/MaxScale/server/core/admin.cc:263
#6  0x00007f8feeb9e23f in call_connection_handler (connection=connection@entry=0x614000076040) at connection.c:1833
#7  0x00007f8feeb9fce8 in MHD_connection_handle_idle (connection=0x614000076040) at connection.c:2909
#8  0x00007f8feeba1995 in call_handlers (con=0x614000076040, read_ready=<optimized out>, write_ready=<optimized out>, force_close=<optimized out>) at daemon.c:1154
#9  0x00007f8feeba6280 in MHD_epoll (daemon=daemon@entry=0x61600002ca80, may_block=may_block@entry=1) at daemon.c:4386
#10 0x00007f8feeba75bf in MHD_select_thread (cls=0x61600002ca80) at daemon.c:4544
#11 0x00007f8fee8545a2 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f8fedced303 in clone () from /lib64/libc.so.6
 
Thread 2 (Thread 0x7f8fea1ff700 (LWP 19287)):
#0  0x00007f8fee85a4d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f8feeae0042 in skygw_message_wait (mes=0x60b0000005c0) at /home/markusjm/MaxScale/server/core/skygw_utils.cc:640
#2  0x00007f8feea5f94a in thr_filewriter_fun (data=0x607000000410) at /home/markusjm/MaxScale/server/core/log_manager.cc:2349
#3  0x00007f8fee8545a2 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f8fedced303 in clone () from /lib64/libc.so.6
 
Thread 1 (Thread 0x7f8fed7e0e00 (LWP 19286)):
#0  0x00007f8fee855ad8 in __pthread_timedjoin_ex () from /lib64/libpthread.so.0
#1  0x00007f8feeba7cdf in MHD_stop_daemon (daemon=0x61600002ca80) at daemon.c:6366
#2  0x00007f8fee9e10c7 in mxs_admin_shutdown () at /home/markusjm/MaxScale/server/core/admin.cc:412
#3  0x000000000040e17a in main (argc=4, argv=0x7ffcf4f137f8) at /home/markusjm/MaxScale/server/core/gateway.cc:2297

Generated at Thu Feb 08 04:14:56 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.