
MDEV-5951: Deadlock and/or threadpool problem with concurrent mysqldump and mytop on slave

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not a Bug
    • Affects Version/s: 5.5 (EOL)
    • Fix Version/s: N/A
    • Component/s: Server
    • Labels: None

    Description

      The problem Jb-boin talked about on the #maria channel, something that repeatedly happened in his environment, needs to be investigated.

      IRC log of 2014-02-20 to 2014-02-21 (filtered):

      Jb-boin: ah, I've once again hit the bug where the threadpool fills up on a replication slave with only a mysqldump running on it
      Jb-boin: I had a mytop running on the server this time; it "stopped" refreshing after a SHOW TRIGGERS LIKE 'XXX' that had been running for 0.0 sec in the "checking permissions" state, while another dump was dumping only data from another table, if that could be related
      Jb-boin: the first mysqldump was dumping only table structures, including events, routines and triggers, and with --dump-slave=2
      Jb-boin: ah, it couldn't connect to the server anymore the same second the slave thread restarted:
      Jb-boin: 140220 21:30:06 [Note] Error reading relay log event: slave SQL thread was killed
      Jb-boin: 140220 21:32:29 [Note] Slave SQL thread initialized, starting replication in log 'mysqld-bin.008374' at position 509836511, relay log './mysqld-relay-bin.001083' position: 509836796
      Jb-boin: the last refresh on mytop was at 21:32:28
      Jb-boin: it's on a replication slave that has nothing connecting to it other than mysqldump, plus a mytop this time
      Jb-boin: while the dumps were running it ended up with the threadpool blocked
      elenst_: and you for some reason couldn't connect to the extra port
      Jb-boin: yep
      Jb-boin: and after some minutes it showed this in the error log:
      Jb-boin: 140220 21:40:49 [ERROR] Threadpool could not create additional thread to handle queries, because the number of allowed threads was reached. Increasing 'thread_pool_max_threads' parameter can help in this situation.
      Jb-boin:  If 'extra_port' parameter is set, you can still connect to the database with superuser account (it must be TCP connection using extra_port as TCP port) and troubleshoot the situation. A likely cause of pool blocks are clients that lock resources for long time. 'show processlist' or 'show engine innodb status' can give additional hints.
      Jb-boin: 140220 21:40:49 [Note] Threadpool has been blocked for 30 seconds
      Jb-boin: I had the problem maybe 3 weeks ago; I changed my backup crontabs so that the structure and data backups wouldn't run at the same time, and I set them to connect to the extra port instead of the standard one
      Jb-boin: and since then I hadn't had the issue again, but I put the old crons back and hit the issue again
      Jb-boin: I have also set extra-max-connections=20 for safety
      Jb-boin: but anyway, today it was running on the default port
      Jb-boin: the server was running fine until 21:32:29
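
      For reference, the threadpool limits and the extra-port fallback named in the error message above can be inspected as follows (a sketch; the host, port, and user shown are placeholders, not taken from the reporter's setup):

        # Show the threadpool limits named in the error message
        mysql -e "SHOW GLOBAL VARIABLES LIKE 'thread_pool%'"
        # Check whether an emergency port is configured
        mysql -e "SHOW GLOBAL VARIABLES LIKE 'extra_port'"
        # Superuser connections to extra_port must go over TCP, per the error message
        mysql --protocol=tcp --host=127.0.0.1 --port=3307 --user=root -p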

      Jb-boin: at 21:32:28 (from the mytop output): Threads:    3 (   3/   1) 10/00/00/00
      Jb-boin: a second later it couldn't connect anymore; the connection was just stalling, not being refused
      Jb-boin: knowing that the server has max_connections = 1024 and thread_pool_max_threads = 1500
      Jb-boin: an lsof outputs 1751146 lines, while on another slave from the same master (with the same MariaDB config and hardware) there are 108110 lines
      Jb-boin: mhh, many ibd files seem to be "opened" 1581 times, while only 112 times on the other slave
      Jb-boin: tables from the DB it was doing a structure dump of
      Jb-boin: last time I checked netstat and connections to the socket, and there were no more than expected
      Jb-boin: only 2 TCP connections atm
      Jb-boin: (the mysqldumps that are still running, frozen)
      Jb-boin: I checked 2 different ibd files on the same DB and they are present the same number of times in the lsof output
      Jb-boin: and the same number of occurrences in lsof for the table in the other DB that the other mysqldump was dumping
      Jb-boin: I use my own version of mytop btw https://mariadb.atlassian.net/browse/MDEV-4476
      Jb-boin: it connects at script start, does a SELECT * FROM INFORMATION_SCHEMA.PROCESSLIST and a SHOW GLOBAL STATUS every few seconds, and displays the info
      elenst_: Jb-boin: RBR or SBR?
      Jb-boin: mixed
      elenst_: Jb-boin: does mysqldump also stop slave?
      Jb-boin: yes, one of the two mysqldumps has the --dump-slave=2 option, so it stops and restarts the slave for every DB it dumps
      elenst_: Jb-boin: does either mysqldump or mytop do show status / show variables?
      Jb-boin: there are 9 DBs; both structure and data dumps are run in a loop: for DB in db1 db2 [...]; do mysqldump [...]; done
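
      Per the description above, each refresh of the reporter's mytop amounts to roughly the following statements (a sketch; connection options are omitted):

        # Queries the modified mytop issues on every refresh
        mysql -e "SELECT * FROM INFORMATION_SCHEMA.PROCESSLIST"
        mysql -e "SHOW GLOBAL STATUS"
        mysql -e "SHOW GLOBAL VARIABLES"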

      Jb-boin: mytop does a show status and show variables at every refresh
      elenst_: so, we have 23 commands in your stack trace
      Jb-boin: it's every 5 seconds by default (not 1 sec, sorry)
      elenst_: one is slave executing event
      elenst_: another is stop slave
      elenst_: one is show variables probably
      elenst_: and the rest are show status
      elenst_: and all of them are waiting for something
      Jb-boin: mytop also does some show global variables like [...] every time it refreshes
      Jb-boin: so according to the mytop output and the logs, at 21:32:28 everything seems normal, at 21:32:29 replication is restarted, and when mytop tries to refresh at 21:32:33 it hangs
      elenst_: Jb-boin: restarted?
      elenst_: like, it actually *started*?
      Jb-boin: one does "mysqldump --no-data --dump-slave=2 --triggers --routines --triggers --events" and the other "--no-create-db --no-create-info --skip-triggers"
      Jb-boin: it's what is stated in the log
      Jb-boin: 140220 21:32:29 [Note] Slave SQL thread initialized, starting replication in log 'mysqld-bin.008374' at position 509836511, relay log './mysqld-relay-bin.001083' position: 509836796
      Jb-boin: I just did another BT, and from a 'diff -u' both backtraces are exactly the same
      elenst_: Jb-boin: what, again, was in those crontabs?
      Jb-boin: both are run like for DB in db1 db2 db3 db4 [...]; do mysqldump [...]; done; one does "mysqldump --no-data --dump-slave=2 --triggers --routines --triggers --events" and the other "mysqldump --no-create-db --no-create-info --skip-triggers"
      Jb-boin: the structure dump cron was starting at 21:30 and the data one at 21:31
      elenst_: while today they ran concurrently, right?
      Jb-boin: well, as the structure dump obviously took more than a minute, yes, it was running concurrently
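
      A minimal reconstruction of the two backup crontabs as described (a sketch; DB names and output paths are placeholders, and the real server loops over 9 DBs):

        # 21:30 cron: structure dump; --dump-slave=2 stops and restarts the slave for every DB dumped
        for DB in db1 db2; do
            mysqldump --no-data --dump-slave=2 --triggers --routines --events "$DB" > "/backup/$DB-schema.sql"
        done
        # 21:31 cron: data dump; it overlaps with the one above as soon as that runs longer than a minute
        for DB in db1 db2; do
            mysqldump --no-create-db --no-create-info --skip-triggers "$DB" > "/backup/$DB-data.sql"
        done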

      Jb-boin: it's just that we lack some important details, but it seems to be related to the start/stop of the slave
      elenst_: yup
      Jb-boin: I think it never happened before I added --dump-slave
      Jb-boin: and doing a kill doesn't stop the server
      Jb-boin: it logs that the shutdown has started and that the slave IO thread has been killed, but it doesn't seem to change much
      Jb-boin: http://ownz.us/threadlock2.bt
      elenst_: Jb-boin: and the one with dump-slave runs first?
      elenst_: also, can there be big (long) enough events coming from master, that would take many seconds to replicate?
      Jb-boin: yes
      Jb-boin: yes its possible
      Jb-boin: for info, last time I had the issue it had been blocked for the whole night before I had to restart mysqld
      Jb-boin: and I don't have long-running transactions :)
      Jb-boin: at most less than 10 minutes, I would say
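
      The threadlock.bt / threadlock2.bt traces attached to this issue were taken while the server hung; a full-thread backtrace like that can typically be captured with gdb (assuming gdb and mysqld debug symbols are available; the output file name is illustrative):

        # Attach to the hung server, dump a backtrace of every thread, then detach
        gdb -p "$(pidof mysqld)" -batch -ex "thread apply all bt" > threadlock.bt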

      IRC log of 2014-02-23 (filtered):

      elenst_: Jb_boin: can you paste your config files and the command line for mytop from that slave deadlock issue?
      elenst_: ah, and give me a rough idea how long the data dump usually takes
      Jb_boin: http://paste.debian.net/hidden/469fc238/
      Jb_boin: the whole dump takes about 1:30
      elenst_: Jb_boin: also, can you take a look what is the value of thread_pool_size?
      Jb_boin: 16
      elenst_: 1:30 as 1 min 30 sec or 1 hour 30 min?
      Jb_boin: hour :)
      elenst_: ok, thanks
      elenst_: Jb_boin: you don't have anything around your mytop that would kill it periodically if it can't perform?
      elenst_: some kind of watchdog
      Jb_boin: no
      elenst_: the problem is, your stack trace contained ~20 hanging show status queries
      Jb_boin: it could be the zabbix agent, which queries some status every minute
      Jb_boin: here are some of the "functions" on the zabbix agent side I call:
      Jb_boin: UserParameter=mysql.variable[*],HOME=/var/lib/zabbix PATH=$PATH:/usr/local/mysql/bin mysql -Nse "SELECT VARIABLE_VALUE FROM information_schema.global_variables WHERE VARIABLE_NAME = '$1';"
      Jb_boin: UserParameter=mysql.slave_behind_master,HOME=/var/lib/zabbix PATH=$PATH:/usr/local/mysql/bin mysql -Ne "SHOW SLAVE STATUS\G" |mksh -c 'while read; do  if ((++i == 34)); then print $REPLY; break; fi; done'
      Jb_boin: mhh, in fact I don't query mysql.variable on this server
      Jb_boin: but this one: UserParameter=mysql.status[*],HOME=/var/lib/zabbix PATH=$PATH:/usr/local/mysql/bin mysql -Nse "SELECT VARIABLE_VALUE FROM information_schema.global_status WHERE VARIABLE_NAME = '$1';"
      Jb_boin: with ~15 different parameters

      Jb_boin: and that one : UserParameter=mysql.ping,HOME=/var/lib/zabbix PATH=$PATH:/usr/local/mysql/bin mysqladmin ping |grep -c alive
      elenst_: Jb_boin: what happens with this zabbix if a query hangs?
      Jb_boin: if I'm right, when it doesn't receive an answer after some time it retries, then it "marks" the service as down and only tries to ping it every minute until it's up again, and then re-checks everything
      Jb_boin: but I am not sure whether it queries in parallel or not
      Jb_boin: but it seems that it does
      elenst_: okay, I guess we can safely attribute the "show status"es to it
      elenst_: so, only "show slave status" and "select I_S", nothing else?
      Jb_boin: nothing else, if I am not forgetting anything
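
      The Zabbix items quoted above can be exercised by hand to confirm what the agent runs against the server during a hang (a sketch; Threads_connected stands in for the ~15 unlisted status parameters):

        # -t evaluates a single item key once and prints the result
        zabbix_agentd -t "mysql.status[Threads_connected]"
        zabbix_agentd -t "mysql.slave_behind_master"
        zabbix_agentd -t "mysql.ping"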

      Attached files:
      threadlock.bt.gz – stack trace while the slave was hanging
      threadlock2.bt.gz – stack trace while the slave was hanging on shutdown
      cnf – config file

    Attachments

        1. cnf (2 kB, Elena Stepanova)
        2. threadlock.bt.gz (35 kB, Elena Stepanova)
        3. threadlock2.bt.gz (35 kB, Elena Stepanova)

        Activity

          Vladislav Vaintroub made transition Open → Closed (time in source status: 1863d 40m, executed once)

          People

            Assignee: Vladislav Vaintroub (wlad)
            Reporter: Elena Stepanova (elenst)
            Votes: 1
            Watchers: 5

