[MXS-3674] Deadlock in binlogrouter Created: 2021-07-19  Updated: 2021-08-13  Resolved: 2021-08-06

Status: Closed
Project: MariaDB MaxScale
Component/s: binlogrouter
Affects Version/s: 2.5.13
Fix Version/s: 2.5.15

Type: Bug Priority: Major
Reporter: markus makela Assignee: markus makela
Resolution: Fixed Votes: 0
Labels: None

Sprint: MXS-SPRINT-137

 Description   

It is possible for the binlogrouter to end up in a deadlock during a STOP SLAVE command. The following was encountered during a test run in the pinloki_select_master test:

Writer thread:
 
Thread 14 (Thread 0x7feb73fff700 (LWP 1890)):
#0  0x00007feb8dda1b3b in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
#1  0x00007feb8dda1bcf in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007feb8dda1c6b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#3  0x00007feb8eef0ab1 in maxbase::Semaphore::wait (this=0x7feb73ffba30, signal_approach=maxbase::Semaphore::IGNORE_SIGNALS) at /home/timofey_turenko_mariadb_com/MaxScale/maxutils/maxbase/include/maxbase/semaphore.hh:144
#4  0x00007feb8efaf185 in maxbase::Worker::call(std::function<void ()>, maxbase::Worker::execute_mode_t) (this=0x7fffbfb46320, func=..., mode=maxbase::Worker::EXECUTE_AUTO) at /home/timofey_turenko_mariadb_com/MaxScale/maxutils/maxbase/src/worker.cc:535
#5  0x00007feb80b67fbe in pinloki::Writer::get_connection_details (this=0x176f080) at /home/timofey_turenko_mariadb_com/MaxScale/server/modules/routing/pinloki/writer.cc:77
#6  0x00007feb80b68235 in pinloki::Writer::has_master_changed (this=0x176f080, conn=...) at /home/timofey_turenko_mariadb_com/MaxScale/server/modules/routing/pinloki/writer.cc:109
#7  0x00007feb80b688ba in pinloki::Writer::run (this=0x176f080) at /home/timofey_turenko_mariadb_com/MaxScale/server/modules/routing/pinloki/writer.cc:151
...
 
Other thread:
 
Thread 9 (Thread 0x7feb82127700 (LWP 1888)):
#0  0x00007feb8dd9d017 in pthread_join () from /lib64/libpthread.so.0
#1  0x00007feb8d0890f7 in std::thread::join() () from /lib64/libstdc++.so.6
#2  0x00007feb80b67e73 in pinloki::Writer::~Writer (this=0x176f080, __in_chrg=<optimized out>) at /home/timofey_turenko_mariadb_com/MaxScale/server/modules/routing/pinloki/writer.cc:70
#3  0x00007feb80b416f6 in std::default_delete<pinloki::Writer>::operator() (this=0x176f408, __ptr=0x176f080) at /opt/rh/devtoolset-7/root/usr/include/c++/7/bits/unique_ptr.h:78
#4  0x00007feb80b3f79f in std::unique_ptr<pinloki::Writer, std::default_delete<pinloki::Writer> >::reset (this=0x176f408, __p=0x176f080) at /opt/rh/devtoolset-7/root/usr/include/c++/7/bits/unique_ptr.h:376
#5  0x00007feb80b37c22 in pinloki::Pinloki::stop_slave (this=0x176f280) at /home/timofey_turenko_mariadb_com/MaxScale/server/modules/routing/pinloki/pinloki.cc:558
#6  0x00007feb80b4acb4 in pinloki::PinlokiSession::stop_slave (this=0x7feb78014610) at /home/timofey_turenko_mariadb_com/MaxScale/server/modules/routing/pinloki/pinlokisession.cc:392
...



 Comments   
Comment by markus makela [ 2021-07-22 ]

Upon further inspection, the problem is not a direct deadlock between the two but a three-way deadlock between the following threads:

  1. RoutingWorker handling the STOP SLAVE command
  2. Writer waiting for the MainWorker
  3. MainWorker waiting for all RoutingWorkers

This should be reproducible by executing a STOP SLAVE at the same time a MaxCtrl command is executed that interacts with all RoutingWorker threads (e.g. maxctrl show threads).

Comment by markus makela [ 2021-08-03 ]

Captured another stacktrace, this time with all three threads.

MainWorker that waits on all RoutingWorker threads:

Thread 1 (Thread 0x7ff0d0ed3340 (LWP 31780)):
#0  0x00007ff0cf6b3b3b in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
#1  0x00007ff0cf6b3bcf in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007ff0cf6b3c6b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#3  0x00007ff0d0802f71 in maxbase::Semaphore::wait (this=0x7fffd9738660, signal_approach=maxbase::Semaphore::IGNORE_SIGNALS) at /home/vagrant/MaxScale/maxutils/maxbase/include/maxbase/semaphore.hh:144
#4  0x00007ff0d08498a9 in maxbase::Semaphore::wait_n (this=0x7fffd9738660, n_wait=0, signal_approach=maxbase::Semaphore::IGNORE_SIGNALS) at /home/vagrant/MaxScale/maxutils/maxbase/include/maxbase/semaphore.hh:176
#5  0x00007ff0d084552d in maxscale::RoutingWorker::execute_concurrently(std::function<void ()>) (func=...) at /home/vagrant/MaxScale/server/core/routingworker.cc:1042
#6  0x00007ff0d086336f in maxscale::WorkerGlobal<std::unordered_map<unsigned int, unsigned long, std::hash<unsigned int>, std::equal_to<unsigned int>, std::allocator<std::pair<unsigned int const, unsigned long> > > >::assign (this=0x9f1a28, t=std::unordered_map with 1 elements) at /home/vagrant/MaxScale/include/maxscale/workerlocal.hh:183
#7  0x00007ff0d085d001 in Server::<lambda()>::operator()(void) const (__closure=0x7ff0a00059a0) at /home/vagrant/MaxScale/server/core/server.cc:739
#8  0x00007ff0d085f70c in std::_Function_handler<void(), Server::set_gtid_list(const std::vector<std::pair<unsigned int, long unsigned int> >&)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /opt/rh/devtoolset-7/root/usr/include/c++/7/bits/std_function.h:316
#9  0x00007ff0d084ae70 in std::function<void ()>::operator()() const (this=0x7ff0a0005ef8) at /opt/rh/devtoolset-7/root/usr/include/c++/7/bits/std_function.h:706
#10 0x00007ff0d08c13f0 in maxbase::Worker::CustomTask::execute (this=0x7ff0a0005ef0, worker=...) at /home/vagrant/MaxScale/maxutils/maxbase/src/worker.cc:504
#11 0x00007ff0d08c1fa4 in maxbase::Worker::handle_message (this=0x7fffd973be20, queue=..., msg=...) at /home/vagrant/MaxScale/maxutils/maxbase/src/worker.cc:666
#12 0x00007ff0d08c8e69 in maxbase::MessageQueue::handle_poll_events (this=0x9c4af0, pWorker=0x7fffd973be20, events=1) at /home/vagrant/MaxScale/maxutils/maxbase/src/messagequeue.cc:307
#13 0x00007ff0d08c9037 in maxbase::MessageQueue::poll_handler (pData=0x9c4af0, pWorker=0x7fffd973be20, events=1) at /home/vagrant/MaxScale/maxutils/maxbase/src/messagequeue.cc:342
#14 0x00007ff0d08c27a4 in maxbase::Worker::poll_waitevents (this=0x7fffd973be20) at /home/vagrant/MaxScale/maxutils/maxbase/src/worker.cc:877
#15 0x00007ff0d08c1885 in maxbase::Worker::run (this=0x7fffd973be20, pSem=0x0) at /home/vagrant/MaxScale/maxutils/maxbase/src/worker.cc:570
#16 0x00000000004172db in maxbase::Worker::run (this=0x7fffd973be20) at /home/vagrant/MaxScale/maxutils/maxbase/include/maxbase/worker.hh:449
#17 0x0000000000413e90 in main (argc=1, argv=0x7fffd973d598) at /home/vagrant/MaxScale/server/core/gateway.cc:2140

RoutingWorker that's executing the STOP SLAVE which waits for the Writer thread:

Thread 12 (Thread 0x7ff0bf7fe700 (LWP 31789)):
#0  0x00007ff0cf6af017 in pthread_join () from /lib64/libpthread.so.0
#1  0x00007ff0ce99b0f7 in std::thread::join() () from /lib64/libstdc++.so.6
#2  0x00007ff0be23c903 in pinloki::Writer::~Writer (this=0x7ff0b0014970, __in_chrg=<optimized out>) at /home/vagrant/MaxScale/server/modules/routing/pinloki/writer.cc:70
#3  0x00007ff0be213626 in std::default_delete<pinloki::Writer>::operator() (this=0x9f9478, __ptr=0x7ff0b0014970) at /opt/rh/devtoolset-7/root/usr/include/c++/7/bits/unique_ptr.h:78
#4  0x00007ff0be2116cf in std::unique_ptr<pinloki::Writer, std::default_delete<pinloki::Writer> >::reset (this=0x9f9478, __p=0x7ff0b0014970) at /opt/rh/devtoolset-7/root/usr/include/c++/7/bits/unique_ptr.h:376
#5  0x00007ff0be209b52 in pinloki::Pinloki::stop_slave (this=0x9f92f0) at /home/vagrant/MaxScale/server/modules/routing/pinloki/pinloki.cc:558
#6  0x00007ff0be21cc5a in pinloki::PinlokiSession::stop_slave (this=0x7ff0b0014520) at /home/vagrant/MaxScale/server/modules/routing/pinloki/pinlokisession.cc:428
#7  0x00007ff0be241d58 in (anonymous namespace)::ResultVisitor::operator() (this=0x7ff0bf7f77a0, s=@0x7ff0bf7f7978: (anonymous namespace)::STOP) at /home/vagrant/MaxScale/server/modules/routing/pinloki/parser.cc:453

The Writer thread that is waiting for the MainWorker:

Thread 2 (Thread 0x7ff0a5ffb700 (LWP 32333)):
#0  0x00007ff0cf6b3b3b in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
#1  0x00007ff0cf6b3bcf in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007ff0cf6b3c6b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#3  0x00007ff0d0802f71 in maxbase::Semaphore::wait (this=0x7ff0a5ff7b10, signal_approach=maxbase::Semaphore::IGNORE_SIGNALS) at /home/vagrant/MaxScale/maxutils/maxbase/include/maxbase/semaphore.hh:144
#4  0x00007ff0d08c1667 in maxbase::Worker::call(std::function<void ()>, maxbase::Worker::execute_mode_t) (this=0x7fffd973be20, func=..., mode=maxbase::Worker::EXECUTE_AUTO) at /home/vagrant/MaxScale/maxutils/maxbase/src/worker.cc:535
#5  0x00007ff0be23ca4e in pinloki::Writer::get_connection_details (this=0x7ff0b0014970) at /home/vagrant/MaxScale/server/modules/routing/pinloki/writer.cc:77
#6  0x00007ff0be23cdf1 in pinloki::Writer::run (this=0x7ff0b0014970) at /home/vagrant/MaxScale/server/modules/routing/pinloki/writer.cc:121

Generated at Thu Feb 08 04:23:08 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.