Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Not a Bug
Affects Version/s: 5.5 (EOL)
Fix Version/s: None
Description
The problem Jb-boin talked about on the #maria channel, something that repeatedly happened in his environment, needs to be investigated.
IRC log of 2014-02-20 to 2014-02-21 (filtered):
Jb-boin: ah, ive got once again the bug with the full threadpool on a replication slave with only a mysqldump running on it
|
Jb-boin: i was having a mytop running on the server this time, it "stopped" refreshing after having a SHOW TRIGGERS LIKE 'XXX' running for 0.0sec that was on "checking permission" state while another dump was running only on datas on another table if it could be related
|
Jb-boin: the first mysqldump was dumping only table structures, including events, routines and triggers, and with --dump-slave=2
|
Jb-boin: ah, it couldnt connect anymore to the server the same second the slave thread restarted :
|
Jb-boin: 140220 21:30:06 [Note] Error reading relay log event: slave SQL thread was killed
|
Jb-boin: 140220 21:32:29 [Note] Slave SQL thread initialized, starting replication in log 'mysqld-bin.008374' at position 509836511, relay log './mysqld-relay-bin.001083' position: 509836796
|
Jb-boin: the last refresh on mytop was at 21:32:28
|
Jb-boin: its on a replication slave that doesnt have anything else than mysqldump connecting to it as well as a mytop this time
|
Jb-boin: while the dumps were running it ended up with the threadpool blocked
|
elenst_: and you for some reason couldn't connect to the extra port
|
Jb-boin: yep
|
Jb-boin: and after some minutes it shows that on the error log :
|
Jb-boin: 140220 21:40:49 [ERROR] Threadpool could not create additional thread to handle queries, because the number of allowed threads was reached. Increasing 'thread_pool_max_threads' parameter can help in this situation.
|
Jb-boin: If 'extra_port' parameter is set, you can still connect to the database with superuser account (it must be TCP connection using extra_port as TCP port) and troubleshoot the situation. A likely cause of pool blocks are clients that lock resources for long time. 'show processlist' or 'show engine innodb status' can give additional hints.
|
Jb-boin: 140220 21:40:49 [Note] Threadpool has been blocked for 30 seconds
|
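The error text above already points at extra_port as the escape hatch. A minimal sketch of such a troubleshooting connection, assuming extra_port=3307 and a superuser account (neither value appears in the log):

  # Connect over TCP to the extra port and look at what is holding the pool.
  # Port 3307 and the root account are assumptions; extra_port must be set in my.cnf.
  mysql --protocol=TCP --host=127.0.0.1 --port=3307 -u root -p \
        -e "SHOW PROCESSLIST; SHOW ENGINE INNODB STATUS\G"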
Jb-boin: i had the problem maybe 3 weeks ago, i did change my backup crontabs to avoid the structure and datas backups to run at the same moment and i did set these to connect to the extra port and not the standard one
|
Jb-boin: and since then i didnt have the issue again, but i did put back the old crons and i hit the issue again
|
Jb-boin: i have also set extra-max-connections=20 for safety
|
Jb-boin: but anyway today it was running on the default port
|
Jb-boin: the server was running fine until 21:32:29
|
Jb-boin: at 21:32:28 (from the mytop output) : Threads: 3 ( 3/ 1) 10/00/00/00
|
Jb-boin: a second later it couldnt connect anymore, the connection was just stalling, not refusing the connection
|
Jb-boin: knowing that the server has : max_connections = 1024 and thread_pool_max_threads = 1500
|
Jb-boin: a lsof outputs 1751146 lines while on another slave from the same master (with the same maria config and hardware) there are 108110 lines
|
Jb-boin: mhh, many ibd files seems to be "opened" 1581 times while only 112 times on the other slave
|
Jb-boin: tables from the DB it was doing a structure dump
|
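The lsof comparison above, roughly, as shell one-liners (the .ibd path is a placeholder):

  # Total number of open-file entries, and how often one tablespace file shows up.
  lsof | wc -l
  lsof | grep -c '/var/lib/mysql/db1/some_table.ibd'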
Jb-boin: last time i checked netstat and connections to the socket and there was not more than expected
|
Jb-boin: only 2 tcp connections atm
|
Jb-boin: (the mysqldumps that are still running, frozen)
|
Jb-boin: i checked 2 different ibd on the same DB and they are present the same number of times on the lsof output
|
Jb-boin: and the same number of occurrences on lsof for the table on the other DB that the other mysqldump instance was dumping
|
Jb-boin: i use my own version of mytop btw https://mariadb.atlassian.net/browse/MDEV-4476
|
Jb-boin: it does connect at the script start and does a SELECT * FROM INFORMATION_SCHEMA.PROCESSLIST and SHOW GLOBAL STATUS every seconds and displays infos
|
elenst_: Jb-boin: RBR or SBR?
|
Jb-boin: mixed
|
elenst_: Jb-boin: does mysqldump also stop slave?
|
Jb-boin: yes, one of the two mysqldump has the --dump-slave=2 option so it does stop and starts it at every DB it dumps
|
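Roughly what each pass of such a loop does to the slave when --dump-slave=2 is used (a simplification; mysqldump issues the stop/start statements itself):

  mysql -e "STOP SLAVE SQL_THREAD"    # done by mysqldump before it reads the binlog coordinates
  mysqldump --no-data --dump-slave=2 db1 > db1-schema.sql   # CHANGE MASTER line is written as a comment
  mysql -e "START SLAVE"              # done by mysqldump once the dump is finished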
elenst_: Jb-boin: does either mysqldump or mytop do show status / show variables?
|
Jb-boin: there are 9 dbs, both structure and datas dumps are run in a loop: for DB in db1 db2 [...]; do mysqldump [...]; done
|
Jb-boin: mytop does a show status and show variables at every refresh
|
elenst_: so, we have 23 commands in your stack trace
|
Jb-boin: its every 5 seconds by default (not 1sec, sorry)
|
elenst_: one is slave executing event
|
elenst_: another is stop slave
|
elenst_: one is show variables probably
|
elenst_: and the rest are show status
|
elenst_: and all of them are waiting for something
|
Jb-boin: mytop also does some show global variables like [...] every time it refreshes
|
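Per the description above, every mytop refresh (one refresh every 5 seconds) boils down to roughly these statements; the exact variable pattern is an assumption:

  mysql -e "SELECT * FROM INFORMATION_SCHEMA.PROCESSLIST"
  mysql -e "SHOW GLOBAL STATUS"
  mysql -e "SHOW GLOBAL VARIABLES LIKE 'max_connections'"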
Jb-boin: so according to the mytop output and the logs, at 21:32:28 everything seems normal, at 21:32:29 replication is restarted and when mytop tries to refresh at 21:32:33 it hangs
|
elenst_: Jb-boin: restarted?
|
elenst_: like, it actually *started*?
|
Jb-boin: one does "mysqldump --no-data --dump-slave=2 --triggers --routines --triggers --events" and the other "--no-create-db --no-create-info --skip-triggers"
|
Jb-boin: its what is stated in the log
|
Jb-boin: 140220 21:32:29 [Note] Slave SQL thread initialized, starting replication in log 'mysqld-bin.008374' at position 509836511, relay log './mysqld-relay-bin.001083' position: 509836796
|
Jb-boin: i just did another BT and from a 'diff -u', both backtraces are exactly the same
|
elenst_: Jb-boin: what, again, was in those crontabs?
|
Jb-boin: both are run like for DB in db1 db2 db3 db4 [...]; do mysqldump [...]; done one does "mysqldump --no-data --dump-slave=2 --triggers --routines --triggers --events" and the other "mysqldump --no-create-db --no-create-info --skip-triggers"
|
Jb-boin: the structure dump cron was starting at 21:30 and the datas one at 21:31
|
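A sketch of the two crontab entries as described (database names and paths are assumptions; the mysqldump options are quoted from the log):

  # 21:30 - structure-only dump, stops and restarts the slave SQL thread for every database
  30 21 * * * for DB in db1 db2 db3; do mysqldump --no-data --dump-slave=2 --triggers --routines --events "$DB" > /backup/"$DB"-schema.sql; done
  # 21:31 - data-only dump of the same databases
  31 21 * * * for DB in db1 db2 db3; do mysqldump --no-create-db --no-create-info --skip-triggers "$DB" > /backup/"$DB"-data.sql; done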
elenst_: while today they ran concurrently, right?
|
Jb-boin: well, as the structure dump took obviously more than a minute, yes it was running concurrently
|
Jb-boin: its just that we lack some important details but it seems to be related to the start/stop of the slave
|
elenst_: yup
|
Jb-boin: i think it never happened before i added --dump-slave
|
Jb-boin: and doing a kill doesnt stop the server
|
Jb-boin: it logs that the shutdown has started and that the slave io has been killed but it doesnt seem to change much
|
Jb-boin: http://ownz.us/threadlock2.bt
|
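For reference, a full-thread backtrace like the attached threadlock*.bt files can be captured along these lines (the output file name is arbitrary):

  gdb --batch -ex "thread apply all bt" -p "$(pidof mysqld)" > threadlock2.bt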
elenst_: Jb-boin: and the one with dump-slave runs first?
|
elenst_: also, can there be big (long) enough events coming from master, that would take many seconds to replicate?
|
Jb-boin: yes
|
Jb-boin: yes its possible
|
Jb-boin: for info, last time i had the issue it had been blocked for the whole night before i had to restart mysqld
|
Jb-boin: and i dont have long running transactions :)
|
Jb-boin: at most less than 10 minutes i would say
|
2014-02-23 (filtered):
elenst_: Jb_boin: can you paste your config files and the command line for mytop from that slave deadlock issue?
|
elenst_: ah, and give me a rough idea how long the datadump usually takes
|
Jb_boin: http://paste.debian.net/hidden/469fc238/
|
Jb_boin: the whole dump takes about 1:30
|
elenst_: Jb_boin: also, can you take a look what is the value of thread_pool_size?
|
Jb_boin: 16
|
elenst_: 1:30 as 1 min 30 sec or 1 hour 30 min?
|
Jb_boin: hour :)
|
elenst_: ok, thanks
|
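For the record, the settings mentioned so far can be read back from the running slave like this (values per the conversation: thread_pool_size=16, thread_pool_max_threads=1500, max_connections=1024, extra_max_connections=20):

  mysql -Nse "SHOW GLOBAL VARIABLES LIKE 'thread_pool%'"
  mysql -Nse "SHOW GLOBAL VARIABLES LIKE 'extra%'"
  mysql -Nse "SHOW GLOBAL VARIABLES LIKE 'max_connections'"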
elenst_: Jb_boin: you don't have anything around your mytop that would kill it periodically if it can't perform?
|
elenst_: some kind of watchdog
|
Jb_boin: no
|
elenst_: the problem is, your stack trace contained ~20 hanging show status queries
|
Jb_boin: it could be the zabbix agent that queries some status values every minute
|
Jb_boin: here are some of the "functions" on the zabbix agent side i call :
|
Jb_boin: UserParameter=mysql.variable[*],HOME=/var/lib/zabbix PATH=$PATH:/usr/local/mysql/bin mysql -Nse "SELECT VARIABLE_VALUE FROM information_schema.global_variables WHERE VARIABLE_NAME = '$1';"
|
Jb_boin: UserParameter=mysql.slave_behind_master,HOME=/var/lib/zabbix PATH=$PATH:/usr/local/mysql/bin mysql -Ne "SHOW SLAVE STATUS\G" |mksh -c 'while read; do if ((++i == 34)); then print $REPLY; break; fi; done'
|
Jb_boin: mhh, in fact i dont query mysql.variable on this server
|
Jb_boin: but this one : UserParameter=mysql.status[*],HOME=/var/lib/zabbix PATH=$PATH:/usr/local/mysql/bin mysql -Nse "SELECT VARIABLE_VALUE FROM information_schema.global_status WHERE VARIABLE_NAME = '$1';"
|
Jb_boin: with ~15 different parameters
|
Jb_boin: and that one : UserParameter=mysql.ping,HOME=/var/lib/zabbix PATH=$PATH:/usr/local/mysql/bin mysqladmin ping |grep -c alive
|
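In other words, each of those ~15 status items ends up running a query of this shape on the slave; the variable name here is only an example:

  HOME=/var/lib/zabbix mysql -Nse "SELECT VARIABLE_VALUE FROM information_schema.global_status WHERE VARIABLE_NAME = 'Threads_connected';"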
elenst_: Jb_boin: what happens with this zabbix if a query hangs?
|
Jb_boin: if i am right, if it doesnt receive an answer after some time it tries again, then it will "mark" the service as down and it will only try to ping it every minute until its up again and will re-check everything then
|
Jb_boin: but i am not sure if it queries in parallel or not
|
Jb_boin: but it seems that it does
|
elenst_: okay, i guess we can safely attribute "show status"es to it
|
elenst_: so, only "show slave status" and "select I_S", nothing else?
|
Jb_boin: nothing else if i am not forgetting anything
|
Attached files:
threadlock.bt.gz – stack trace while the slave was hanging
threadlock2.bt.gz – stack trace while the slave was hanging on shutdown
cnf – config file