MariaDB Server / MDEV-7588

[PATCH] Slow connections with thread pool and replication

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 5.5 (EOL), 10.0 (EOL)
    • Fix Version/s: 5.5.47, 10.0.23, 10.1.10
    • Component/s: OTHER, Replication
    • Environment: Debian Linux 7, Linux 3.10, 4 cores / 8 threads, official MariaDB packages
    • Sprint: 10.1.9-3

    Description

      If MariaDB is set up with pool-of-threads and a replication slave connects and issues the COM_BINLOG_DUMP command, every nth connection (in my setup, every 8th) can take as much as one second to establish. That is, the TCP connection completes quickly, but it takes a long time (relatively speaking) before the server greeting is sent.

      The reason is apparently that the binlog dump thread doesn't notify the thread pool that it is waiting, so active_thread_count remains > 0 and queue_put() will not do anything. Instead, the thread group is activated by the thread timer (thread_pool_stall_limit), which with the default value of 500ms gives us a worst case close to one second.

      I only have a superficial understanding of the MariaDB core, but it seems that mysql_binlog_send() ought to call thd_wait_begin(). Either that, or the active_thread_count check in queue_put() should be removed.
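
      To illustrate the mechanism, here is a toy model of the scheduling decision (hypothetical names; this is not the actual threadpool_unix.cc code): a new connection is only handed to a worker immediately when the group looks idle, so a worker that blocks without announcing it keeps active_thread_count above zero, and the connection sits in the queue until the periodic stall check fires.

      #include <queue>

      struct connection_t {};

      struct thread_group_t
      {
        int active_thread_count= 0;       /* workers believed to be running */
        std::queue<connection_t*> queue;  /* pending work */
      };

      static void wake_or_create_worker(thread_group_t*) { /* elided */ }

      /* Toy version of the decision described above (hypothetical names). */
      void queue_put(thread_group_t *group, connection_t *conn)
      {
        group->queue.push(conn);
        /* A dump thread blocked without thd_wait_begin() still counts as
           active, so this wakeup is skipped and the connection waits for
           the thread_pool_stall_limit timer (500ms by default). */
        if (group->active_thread_count == 0)
          wake_or_create_worker(group);
      }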

      Attachments

        • yield_wait_for_update_bin_log.diff

        Activity

          pchri03 Peter Nørlund added a comment - - edited

          I've tried adding thd_wait_begin/thd_wait_end in MYSQL_BIN_LOG::wait_for_update_bin_log, where the thread waits on a condition variable. This had an immediate effect on the reaction time. (Attached patch: yield_wait_for_update_bin_log.diff)
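
          For readers without the attached diff, the change amounts to something like the following sketch (based on the description above; the real function's signature and body are simplified). thd_wait_begin()/thd_wait_end() and the THD_WAIT_BINLOG wait type are the existing hooks from include/mysql/service_thd_wait.h:

          #include <mysql/service_thd_wait.h>

          /* Sketch: bracket the long condition wait so the thread pool knows
             this worker is blocked and can wake or spawn another one
             immediately instead of waiting for the stall timer. */
          void MYSQL_BIN_LOG::wait_for_update_bin_log(THD *thd)
          {
            mysql_mutex_assert_owner(&LOCK_log);
            thd_wait_begin(thd, THD_WAIT_BINLOG);     /* not counted active */
            mysql_cond_wait(&update_cond, &LOCK_log); /* may block for long */
            thd_wait_end(thd);                        /* active again       */
          }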

          I couldn't compile the Debian package without commenting out the tests (and I actually had to backport two patches to get it to compile in the first place), but all the failing tests seem to be related to SSL, and at least one of the failures was caused by an expired certificate.

          Anyway, the patch is currently running on the internal databases at my workplace, to see if it has any unwanted side effects.

          jb-boin Jean Weisbuch added a comment -

          The SSL certificate issues are due to the fact that the certificates used for the test suite have expired; this has since been solved in MDEV-7536, so you can disregard those tests.

          About your issue, I had a problem on a slave running pool-of-threads that could be related: MDEV-5951.
          That slave was running dumps of each database sequentially while, at the same time, another thread was dumping the database structures (also sequentially) and yet another the table contents. At some point I was hitting the thread_pool_max_threads limit (which was not set low) on one of the dumps, even though the only connections to the server were the dumps and a simple Zabbix monitoring script that just checked some server variables every minute.
          Disabling the thread pool made the issue go away.

          Do you think this bug could produce that kind of issue?


          pchri03 Peter Nørlund added a comment -

          I doubt that it is related to MDEV-5951, which looks like somebody forgetting to unlock LOCK_thread_count somewhere.


          elenst Elena Stepanova added a comment -

          Thanks for the report and the patch. Below is a simple MTR test case to confirm the problem (not intended for inclusion in the regression suite).

          Results with and without the patch on the current 5.5 tree, debug build:

          without patch, pool-of-threads

          # Min connection time:   1366
          # Max connection time:   997304 
          # Total connection time: 16919926
          # Avg connection time:   338398.5200

          without patch, one-thread-per-connection

          # Min connection time:   1370
          # Max connection time:   1965 
          # Total connection time: 78925
          # Avg connection time:   1578.5000

          with patch, pool-of-threads

          # Min connection time:   1274
          # Max connection time:   2034 
          # Total connection time: 74009
          # Avg connection time:   1480.1800

          with patch, one-thread-per-connection

          # Min connection time:   1419
          # Max connection time:   2322 
          # Total connection time: 85742
          # Avg connection time:   1714.8400

          So the patch does help, although I don't know if it's otherwise correct.

          Test case

          --source include/master-slave.inc
          --source include/have_binlog_format_statement.inc
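          # Time $num consecutive client connections. master-slave.inc has
          # already started a slave, so a binlog dump thread is blocked inside
          # the master while these clients connect.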
           
          --let $num = 50
          --let $sum = 0
          --let $max = 0
          --let $min = 10000000
          --let $n = $num
          while ($n) 
          {
          	--let $before = `select now(6)`
          	--connect (con$n,localhost,root,,)
          	--let $tm = `select timestampdiff(microsecond,'$before', now(6))`
          	--echo # Connection time: $tm
          	--let $sum = `select $sum + $tm`
          	--let $max = `select if($tm>$max,$tm,$max)`
          	--let $min = `select if($tm<$min,$tm,$min)`
          	--connection default
          	--dec $n
          }
           
          --let $avg = `select $sum / $num`
           
          --echo # Min connection time:   $min
          --echo # Max connection time:   $max 
          --echo # Total connection time: $sum
          --echo # Avg connection time:   $avg
           
          --source include/rpl_end.inc


          pomyk Patryk Pomykalski added a comment -

          We have the same problem on servers with thread pool.


          wlad Vladislav Vaintroub added a comment -

          Fixed, thanks for the patch, Peter. That long wait without any indication to the thread pool was indeed the problem.


          People

            Assignee: wlad Vladislav Vaintroub
            Reporter: pchri03 Peter Nørlund
            Votes: 3
            Watchers: 6

