
MDEV-5951: Deadlock and/or threadpool problem with concurrent mysqldump and mytop on slave

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not a Bug
    • Affects Version/s: 5.5 (EOL)
    • Fix Version/s: N/A
    • Component/s: Server
    • Labels: None

    Description

      The problem Jb-boin talked about on the #maria channel, something that repeatedly happened in his environment, needs to be investigated.

      IRC log of 2014-02-20 to 2014-02-21 (filtered):

      Jb-boin: ah, I've once again hit the bug where the threadpool fills up on a replication slave with only a mysqldump running on it
      Jb-boin: I had a mytop running on the server this time; it "stopped" refreshing after a SHOW TRIGGERS LIKE 'XXX' that had been running for 0.0 sec in the "checking permissions" state, while another dump was dumping only data from another table, if that could be related
      Jb-boin: the first mysqldump was dumping only table structures, including events, routines and triggers, and with --dump-slave=2
      Jb-boin: ah, it couldn't connect to the server anymore the same second the slave thread restarted:
      Jb-boin: 140220 21:30:06 [Note] Error reading relay log event: slave SQL thread was killed
      Jb-boin: 140220 21:32:29 [Note] Slave SQL thread initialized, starting replication in log 'mysqld-bin.008374' at position 509836511, relay log './mysqld-relay-bin.001083' position: 509836796
      Jb-boin: the last refresh on mytop was at 21:32:28
      Jb-boin: it's on a replication slave that has nothing connecting to it other than mysqldump, plus a mytop this time
      Jb-boin: while the dumps were running it ended up with the threadpool blocked
      elenst_: and you for some reason couldn't connect to the extra port
      Jb-boin: yep
      Jb-boin: and after some minutes it showed this in the error log:
      Jb-boin: 140220 21:40:49 [ERROR] Threadpool could not create additional thread to handle queries, because the number of allowed threads was reached. Increasing 'thread_pool_max_threads' parameter can help in this situation.
      Jb-boin:  If 'extra_port' parameter is set, you can still connect to the database with superuser account (it must be TCP connection using extra_port as TCP port) and troubleshoot the situation. A likely cause of pool blocks are clients that lock resources for long time. 'show processlist' or 'show engine innodb status' can give additional hints.
      Jb-boin: 140220 21:40:49 [Note] Threadpool has been blocked for 30 seconds
      Jb-boin: I had the problem maybe 3 weeks ago; I changed my backup crontabs so that the structure and data backups wouldn't run at the same time, and I set them to connect to the extra port instead of the standard one
      Jb-boin: and since then I hadn't had the issue again, but I put the old crons back and hit the issue again
      Jb-boin: I have also set extra-max-connections=20 for safety
      Jb-boin: but anyway, today it was running on the default port
      Jb-boin: the server was running fine until 21:32:29
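
      For reference, the threadpool limits and the extra-port fallback named in the error message above can be inspected as follows (a sketch; the host, port, and user shown are placeholders, not taken from the reporter's setup):

        # Show the threadpool limits named in the error message
        mysql -e "SHOW GLOBAL VARIABLES LIKE 'thread_pool%'"
        # Check whether an emergency port is configured
        mysql -e "SHOW GLOBAL VARIABLES LIKE 'extra_port'"
        # Superuser connections to extra_port must go over TCP, per the error message
        mysql --protocol=tcp --host=127.0.0.1 --port=3307 --user=root -p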

      Jb-boin: at 21:32:28 (from the mytop output): Threads:    3 (   3/   1) 10/00/00/00
      Jb-boin: a second later it couldn't connect anymore; the connection was just stalling, not being refused
      Jb-boin: knowing that the server has max_connections = 1024 and thread_pool_max_threads = 1500
      Jb-boin: an lsof outputs 1751146 lines, while on another slave from the same master (with the same MariaDB config and hardware) there are 108110 lines
      Jb-boin: mhh, many ibd files seem to be "opened" 1581 times, while only 112 times on the other slave
      Jb-boin: tables from the DB it was doing a structure dump of
      Jb-boin: last time I checked netstat and connections to the socket, and there were no more than expected
      Jb-boin: only 2 TCP connections atm
      Jb-boin: (the mysqldumps that are still running, frozen)
      Jb-boin: I checked 2 different ibd files on the same DB and they are present the same number of times in the lsof output
      Jb-boin: and the same number of occurrences in lsof for the table in the other DB that the other mysqldump was dumping
      Jb-boin: I use my own version of mytop btw https://mariadb.atlassian.net/browse/MDEV-4476
      Jb-boin: it connects at script start, does a SELECT * FROM INFORMATION_SCHEMA.PROCESSLIST and a SHOW GLOBAL STATUS every few seconds, and displays the info
      elenst_: Jb-boin: RBR or SBR?
      Jb-boin: mixed
      elenst_: Jb-boin: does mysqldump also stop slave?
      Jb-boin: yes, one of the two mysqldumps has the --dump-slave=2 option, so it stops and restarts the slave for every DB it dumps
      elenst_: Jb-boin: does either mysqldump or mytop do show status / show variables?
      Jb-boin: there are 9 DBs; both structure and data dumps are run in a loop: for DB in db1 db2 [...]; do mysqldump [...]; done
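
      Per the description above, each refresh of the reporter's mytop amounts to roughly the following statements (a sketch; connection options are omitted):

        # Queries the modified mytop issues on every refresh
        mysql -e "SELECT * FROM INFORMATION_SCHEMA.PROCESSLIST"
        mysql -e "SHOW GLOBAL STATUS"
        mysql -e "SHOW GLOBAL VARIABLES"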

      Jb-boin: mytop does a show status and show variables at every refresh
      elenst_: so, we have 23 commands in your stack trace
      Jb-boin: it's every 5 seconds by default (not 1 sec, sorry)
      elenst_: one is slave executing event
      elenst_: another is stop slave
      elenst_: one is show variables probably
      elenst_: and the rest are show status
      elenst_: and all of them are waiting for something
      Jb-boin: mytop also does some show global variables like [...] every time it refreshes
      Jb-boin: so according to the mytop output and the logs, at 21:32:28 everything seems normal, at 21:32:29 replication is restarted, and when mytop tries to refresh at 21:32:33 it hangs
      elenst_: Jb-boin: restarted?
      elenst_: like, it actually *started*?
      Jb-boin: one does "mysqldump --no-data --dump-slave=2 --triggers --routines --triggers --events" and the other "--no-create-db --no-create-info --skip-triggers"
      Jb-boin: it's what is stated in the log
      Jb-boin: 140220 21:32:29 [Note] Slave SQL thread initialized, starting replication in log 'mysqld-bin.008374' at position 509836511, relay log './mysqld-relay-bin.001083' position: 509836796
      Jb-boin: I just did another BT, and from a 'diff -u' both backtraces are exactly the same
      elenst_: Jb-boin: what, again, was in those crontabs?
      Jb-boin: both are run like for DB in db1 db2 db3 db4 [...]; do mysqldump [...]; done; one does "mysqldump --no-data --dump-slave=2 --triggers --routines --triggers --events" and the other "mysqldump --no-create-db --no-create-info --skip-triggers"
      Jb-boin: the structure dump cron was starting at 21:30 and the data one at 21:31
      elenst_: while today they ran concurrently, right?
      Jb-boin: well, as the structure dump obviously took more than a minute, yes, it was running concurrently
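
      A minimal reconstruction of the two backup crontabs as described (a sketch; DB names and output paths are placeholders, and the real server loops over 9 DBs):

        # 21:30 cron: structure dump; --dump-slave=2 stops and restarts the slave for every DB dumped
        for DB in db1 db2; do
            mysqldump --no-data --dump-slave=2 --triggers --routines --events "$DB" > "/backup/$DB-schema.sql"
        done
        # 21:31 cron: data dump; it overlaps with the one above as soon as that runs longer than a minute
        for DB in db1 db2; do
            mysqldump --no-create-db --no-create-info --skip-triggers "$DB" > "/backup/$DB-data.sql"
        done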

      Jb-boin: it's just that we lack some important details, but it seems to be related to the start/stop of the slave
      elenst_: yup
      Jb-boin: I think it never happened before I added --dump-slave
      Jb-boin: and doing a kill doesn't stop the server
      Jb-boin: it logs that the shutdown has started and that the slave IO thread has been killed, but it doesn't seem to change much
      Jb-boin: http://ownz.us/threadlock2.bt
      elenst_: Jb-boin: and the one with dump-slave runs first?
      elenst_: also, can there be big (long) enough events coming from master, that would take many seconds to replicate?
      Jb-boin: yes
      Jb-boin: yes its possible
      Jb-boin: for info, last time I had the issue it had been blocked for the whole night before I had to restart mysqld
      Jb-boin: and I don't have long-running transactions :)
      Jb-boin: at most less than 10 minutes, I would say
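
      The threadlock.bt / threadlock2.bt traces attached to this issue were taken while the server hung; a full-thread backtrace like that can typically be captured with gdb (assuming gdb and mysqld debug symbols are available; the output file name is illustrative):

        # Attach to the hung server, dump a backtrace of every thread, then detach
        gdb -p "$(pidof mysqld)" -batch -ex "thread apply all bt" > threadlock.bt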

      IRC log of 2014-02-23 (filtered):

      elenst_: Jb_boin: can you paste your config files and the command line for mytop from that slave deadlock issue?
      elenst_: ah, and give me a rough idea how long the data dump usually takes
      Jb_boin: http://paste.debian.net/hidden/469fc238/
      Jb_boin: the whole dump takes about 1:30
      elenst_: Jb_boin: also, can you take a look what is the value of thread_pool_size?
      Jb_boin: 16
      elenst_: 1:30 as 1 min 30 sec or 1 hour 30 min?
      Jb_boin: hour :)
      elenst_: ok, thanks
      elenst_: Jb_boin: you don't have anything around your mytop that would kill it periodically if it can't perform?
      elenst_: some kind of watchdog
      Jb_boin: no
      elenst_: the problem is, your stack trace contained ~20 hanging show status queries
      Jb_boin: it could be the zabbix agent, which queries some status every minute
      Jb_boin: here are some of the "functions" on the zabbix agent side I call:
      Jb_boin: UserParameter=mysql.variable[*],HOME=/var/lib/zabbix PATH=$PATH:/usr/local/mysql/bin mysql -Nse "SELECT VARIABLE_VALUE FROM information_schema.global_variables WHERE VARIABLE_NAME = '$1';"
      Jb_boin: UserParameter=mysql.slave_behind_master,HOME=/var/lib/zabbix PATH=$PATH:/usr/local/mysql/bin mysql -Ne "SHOW SLAVE STATUS\G" |mksh -c 'while read; do  if ((++i == 34)); then print $REPLY; break; fi; done'
      Jb_boin: mhh, in fact I don't query mysql.variable on this server
      Jb_boin: but this one: UserParameter=mysql.status[*],HOME=/var/lib/zabbix PATH=$PATH:/usr/local/mysql/bin mysql -Nse "SELECT VARIABLE_VALUE FROM information_schema.global_status WHERE VARIABLE_NAME = '$1';"
      Jb_boin: with ~15 different parameters

      Jb_boin: and that one : UserParameter=mysql.ping,HOME=/var/lib/zabbix PATH=$PATH:/usr/local/mysql/bin mysqladmin ping |grep -c alive
      elenst_: Jb_boin: what happens with this zabbix if a query hangs?
      Jb_boin: if I'm right, when it doesn't receive an answer after some time it retries, then it "marks" the service as down and only tries to ping it every minute until it's up again, and then re-checks everything
      Jb_boin: but I am not sure whether it queries in parallel or not
      Jb_boin: but it seems that it does
      elenst_: okay, I guess we can safely attribute the "show status"es to it
      elenst_: so, only "show slave status" and "select I_S", nothing else?
      Jb_boin: nothing else, if I am not forgetting anything
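
      The Zabbix items quoted above can be exercised by hand to confirm what the agent runs against the server during a hang (a sketch; Threads_connected stands in for the ~15 unlisted status parameters):

        # -t evaluates a single item key once and prints the result
        zabbix_agentd -t "mysql.status[Threads_connected]"
        zabbix_agentd -t "mysql.slave_behind_master"
        zabbix_agentd -t "mysql.ping"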

      Attached files:
      threadlock.bt.gz – stack trace while the slave was hanging
      threadlock2.bt.gz – stack trace while the slave was hanging on shutdown
      cnf – config file

    Attachments

        1. cnf (2 kB, Elena Stepanova)
        2. threadlock.bt.gz (35 kB, Elena Stepanova)
        3. threadlock2.bt.gz (35 kB, Elena Stepanova)

        Activity

          Vladislav Vaintroub made transition Open → Closed (time in source status: 1863d 40m, executed once)

          People

            Assignee: Vladislav Vaintroub (wlad)
            Reporter: Elena Stepanova (elenst)
            Votes: 1
            Watchers: 5

