[MDEV-5551] thread pool leaves lots of killed connections Created: 2014-01-22  Updated: 2014-05-30  Resolved: 2014-05-30

Status: Closed
Project: MariaDB Server
Component/s: None
Affects Version/s: 5.5.34
Fix Version/s: 5.5.38

Type: Bug Priority: Major
Reporter: Rich Prohaska Assignee: Sergey Vojtovich
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

linux



 Description   

jira would not let me enter 5.5.30, so i used 5.5.34 instead.

we have a customer that has observed problems related to the
interaction of the thread pools in mariadb 5.5.30 and tokudb 7.1.0.
the problem is that connections that stuck for 100's of seconds in the
Killed state when a big tokudb transaction commit is in progress. the
customer claims that this problem was resolved by turning the thread
pool OFF. have there been any fixes to the thread pool implementation
post the 5.5.30 release?

here is some information that may be useful.

the processlist showed 2838 client connections of which:
851 Killed
483 Sleep
1467 Connect
1 show processlist
1 processing a commit on a big delete from a tokudb table
35 blocked on a row lock held by the big delete

the Killed connections look like:
84847752 sfi_mysql 10.0.0.60:1814 sfi Killed 93 NULL 0.000
with the connection address and time slightly different

a gdb snapshot of the system at the time of the failure:
291 total threads
130 threads waiting on fil_aio_wait
52 threads waiting on tokudb's work_on_kibbutz
16 thread tokudb deserialization thread waiting
36 mysql update thread blocked on the tokudb lock tree
1thread executing a big tokudb delete transaction
40 threads in the thread pool get_event function
1 tokudb checkpoint thread blocked waiting for the LP MO lock, which
is held by the big txn commit thread
that leaves 15 misc worker threads that are not doing anything interesting

the gdb stacks for the thread pool threads are:
Thread 78 : epoll_wait , io_poll_wait , listener , get_event ,
worker_main , start_thread , clone
Thread 71 : pthread_cond_timedwait@@GLIBC_2.3.2 ,
inline_mysql_cond_timedwait , get_event , worker_main , start_thread ,
clone
Thread 70 : epoll_wait , io_poll_wait , listener , get_event ,
worker_main , start_thread , clone
Thread 69 : pthread_cond_timedwait@@GLIBC_2.3.2 ,
inline_mysql_cond_timedwait , get_event , worker_main , start_thread ,
clone
Thread 68 : epoll_wait , io_poll_wait , listener , get_event ,
worker_main , start_thread , clone
Thread 62 : epoll_wait , io_poll_wait , listener , get_event ,
worker_main , start_thread , clone
Thread 57 : epoll_wait , io_poll_wait , listener , get_event ,
worker_main , start_thread , clone
Thread 52 : epoll_wait , io_poll_wait , listener , get_event ,
worker_main , start_thread , clone
Thread 51 : pthread_cond_timedwait@@GLIBC_2.3.2 ,
inline_mysql_cond_timedwait , get_event , worker_main , start_thread ,
clone
Thread 49 : pthread_cond_timedwait@@GLIBC_2.3.2 ,
inline_mysql_cond_timedwait , get_event , worker_main , start_thread ,
clone
Thread 45 : pthread_cond_timedwait@@GLIBC_2.3.2 ,
inline_mysql_cond_timedwait , get_event , worker_main , start_thread ,
clone
Thread 44 : pthread_cond_timedwait@@GLIBC_2.3.2 ,
inline_mysql_cond_timedwait , get_event , worker_main , start_thread ,
clone
Thread 43 : epoll_wait , io_poll_wait , listener , get_event ,
worker_main , start_thread , clone
Thread 42 : pthread_cond_timedwait@@GLIBC_2.3.2 ,
inline_mysql_cond_timedwait , get_event , worker_main , start_thread ,
clone
Thread 41 : epoll_wait , io_poll_wait , listener , get_event ,
worker_main , start_thread , clone
Thread 39 : pthread_cond_timedwait@@GLIBC_2.3.2 ,
inline_mysql_cond_timedwait , get_event , worker_main , start_thread ,
clone
Thread 37 : pthread_cond_timedwait@@GLIBC_2.3.2 ,
inline_mysql_cond_timedwait , get_event , worker_main , start_thread ,
clone
Thread 36 : pthread_cond_timedwait@@GLIBC_2.3.2 ,
inline_mysql_cond_timedwait , get_event , worker_main , start_thread ,
clone
Thread 35 : epoll_wait , io_poll_wait , listener , get_event ,
worker_main , start_thread , clone
Thread 33 : epoll_wait , io_poll_wait , listener , get_event ,
worker_main , start_thread , clone
Thread 32 : epoll_wait , io_poll_wait , listener , get_event ,
worker_main , start_thread , clone
Thread 31 : pthread_cond_timedwait@@GLIBC_2.3.2 ,
inline_mysql_cond_timedwait , get_event , worker_main , start_thread ,
clone
Thread 26 : pthread_cond_timedwait@@GLIBC_2.3.2 ,
inline_mysql_cond_timedwait , get_event , worker_main , start_thread ,
clone
Thread 24 : pthread_cond_timedwait@@GLIBC_2.3.2 ,
inline_mysql_cond_timedwait , get_event , worker_main , start_thread ,
clone
Thread 22 : epoll_wait , io_poll_wait , listener , get_event ,
worker_main , start_thread , clone
Thread 21 : epoll_wait , io_poll_wait , listener , get_event ,
worker_main , start_thread , clone
Thread 20 : pthread_cond_timedwait@@GLIBC_2.3.2 ,
inline_mysql_cond_timedwait , get_event , worker_main , start_thread ,
clone
Thread 19 : pthread_cond_timedwait@@GLIBC_2.3.2 ,
inline_mysql_cond_timedwait , get_event , worker_main , start_thread ,
clone
Thread 18 : epoll_wait , io_poll_wait , listener , get_event ,
worker_main , start_thread , clone
Thread 17 : pthread_cond_timedwait@@GLIBC_2.3.2 ,
inline_mysql_cond_timedwait , get_event , worker_main , start_thread ,
clone
Thread 16 : pthread_cond_timedwait@@GLIBC_2.3.2 ,
inline_mysql_cond_timedwait , get_event , worker_main , start_thread ,
clone
Thread 13 : pthread_cond_timedwait@@GLIBC_2.3.2 ,
inline_mysql_cond_timedwait , get_event , worker_main , start_thread ,
clone
Thread 12 : epoll_wait , io_poll_wait , listener , get_event ,
worker_main , start_thread , clone
Thread 11 : epoll_wait , io_poll_wait , listener , get_event ,
worker_main , start_thread , clone
Thread 8 : pthread_cond_timedwait@@GLIBC_2.3.2 ,
inline_mysql_cond_timedwait , get_event , worker_main , start_thread ,
clone
Thread 7 : pthread_cond_timedwait@@GLIBC_2.3.2 ,
inline_mysql_cond_timedwait , get_event , worker_main , start_thread ,
clone
Thread 5 : pthread_cond_timedwait@@GLIBC_2.3.2 ,
inline_mysql_cond_timedwait , get_event , worker_main , start_thread ,
clone
Thread 4 : pthread_cond_timedwait@@GLIBC_2.3.2 ,
inline_mysql_cond_timedwait , get_event , worker_main , start_thread ,
clone
Thread 3 : pthread_cond_timedwait@@GLIBC_2.3.2 ,
inline_mysql_cond_timedwait , get_event , worker_main , start_thread ,
clone
Thread 2 : pthread_cond_timedwait@@GLIBC_2.3.2 ,
inline_mysql_cond_timedwait , get_event , worker_main , start_thread ,
clone

thanks

guangpu feng gpfeng.cs@gmail.com
Jan 16 (6 days ago)

to me
We have encountered this problem before and it does't happy again after we fixed it, when we start mysqld with threadpool(ported from mariadb5.5.28 to our percona server 5.5.18), many killed sessions remained in "show processlist" after a month, the backstrace from pt-pmt is just like yours. I don't konw whether we have exactly the same problem, but I can tell why we have the problem and how we solved it:

in THD::awker, close sock will result in epoll_wait unregistering sockfd, which will prevent killed connections from exiting when pool-of-threads scheduler is used, because epoll_wait will never return for that connection.

here is the solution: just shutdown the socket, and let close_connection which will be called later to close sockfd.

Index: /PS5518/branches/threadpool/sql/sql_class.cc
===================================================================
--- /PS5518/branches/threadpool/sql/sql_class.cc	(revision 3788)
+++ /PS5518/branches/threadpool/sql/sql_class.cc	(revision 3823)
@@ -1746,7 +1746,8 @@
         reading the next statement.
       */
 
-      close_active_vio();
+      if (active_vio)
+        vio_shutdown(active_vio, SHUT_RDWR);
     }
 #endif

Hope it will be helpful.



 Comments   
Comment by Sergey Vojtovich [ 2014-05-30 ]

Looks like MariaDB has this fix for THD::awake() since initial thread pool implementation:

revno: 3168.1.1
committer: Vladislav Vaintroub <wlad@montyprogram.com>
branch nick: 5.5
timestamp: Thu 2011-12-08 19:17:49 +0100
message:
  Initial threadpool implementation for MariaDB 5.5

Comment by Sergey Vojtovich [ 2014-05-30 ]

Hi Rich,

I was unable to reproduce this bug from given description. Could you give additional hints?

Comment by Rich Prohaska [ 2014-05-30 ]

Sorry about the limited details. The customer that saw this problem a long time ago and resolved it by turning the thread pool off. Later, guangpu feng chimed in with a patch that he claims fixed a similar problem.

Comment by Sergey Vojtovich [ 2014-05-30 ]

Rich, so should we close this issue?

Comment by Rich Prohaska [ 2014-05-30 ]

Please close this ticket since there is not enough information to lead to a solution.

Comment by Sergey Vojtovich [ 2014-05-30 ]

Closing as agreed with reporter: there is not enough information to lead to a solution.

Generated at Thu Feb 08 07:05:14 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.