[MDEV-7588] [PATCH] Slow connections with thread pool and replication Created: 2015-02-14  Updated: 2015-11-24  Resolved: 2015-11-24

Status: Closed
Project: MariaDB Server
Component/s: OTHER, Replication
Affects Version/s: 5.5, 10.0
Fix Version/s: 5.5.47, 10.0.23, 10.1.10

Type: Bug Priority: Minor
Reporter: Peter Nørlund Assignee: Vladislav Vaintroub
Resolution: Fixed Votes: 3
Labels: verified
Environment:

Debian Linux 7, Linux 3.10, 4 cores / 8 threads, official MariaDB packages


Attachments: File yield_wait_for_update_bin_log.diff    
Sprint: 10.1.9-3

 Description   

If MariaDB is set up with pool-of-threads and a replication slave connects and issues the COM_BINLOG_DUMP command, every nth connection (in my setup every 8th) can take as much as 1 second to establish. That is, the TCP connection completes quickly, but it then takes a relatively long time before the server greeting is sent.

The reason is apparently that the binlog dump thread doesn't notify the thread pool that it is waiting, so active_thread_count remains > 0 and queue_put() will not do anything. Instead, the thread group is only woken by the stall timer (thread_pool_stall_limit), which with the default value of 500ms gives a worst case of close to 1 second.

I only have a superficial understanding of the MariaDB core, but it seems that mysql_binlog_send() ought to call thd_wait_begin(). Either something like that, or the active_thread_count check in queue_put() should be removed.



 Comments   
Comment by Peter Nørlund [ 2015-02-17 ]

I've tried adding thd_wait_begin/thd_wait_end to MYSQL_BIN_LOG::wait_for_update_bin_log, where the thread waits on a condition variable. This had an immediate effect on the reaction time. (Attached patch yield_wait_for_update_bin_log.diff)

I couldn't compile the Debian package without commenting out the tests (and I actually had to backport two patches to even compile in the first place), but all the failed tests seem to be related to SSL, and at least one of the failures was caused by an expired certificate.

Anyway, the patch is currently running on the internal databases at my workplace, to see if it has any unwanted side effects.

Comment by Jean Weisbuch [ 2015-02-17 ]

The SSL certificate issues are due to the certificates used by the test suite having expired; this has since been fixed in MDEV-7536, so you can disregard those tests.

About your issue, I had problems on a slave running pool-of-threads that could be related: MDEV-5951.
That slave was running dumps of each database sequentially while, at the same time, another thread was dumping the database structures (also sequentially) and another the table contents. At some point one of the dumps hit the thread_pool_max_threads limit (which was not set low), even though the only connections to the server were the dumps and a simple Zabbix monitoring script that just checked some server variables once a minute.
Disabling the thread pool made the issue go away.

Do you think this bug could produce that kind of issue?

Comment by Peter Nørlund [ 2015-02-17 ]

I doubt that it is related to MDEV-5951, which looks like somebody forgetting to unlock LOCK_thread_count somewhere.

Comment by Elena Stepanova [ 2015-02-18 ]

Thanks for the report and the patch. Below is a simple MTR test case to confirm the problem (not to be included in the regression suite).

Results with and without the patch on the current 5.5 tree, debug build:

without patch, pool-of-threads

# Min connection time:   1366
# Max connection time:   997304 
# Total connection time: 16919926
# Avg connection time:   338398.5200

without patch, one-thread-per-connection

# Min connection time:   1370
# Max connection time:   1965 
# Total connection time: 78925
# Avg connection time:   1578.5000

with patch, pool-of-threads

# Min connection time:   1274
# Max connection time:   2034 
# Total connection time: 74009
# Avg connection time:   1480.1800

with patch, one-thread-per-connection

# Min connection time:   1419
# Max connection time:   2322 
# Total connection time: 85742
# Avg connection time:   1714.8400

So the patch does help, although I don't know whether it is otherwise correct.

Test case

--source include/master-slave.inc
--source include/have_binlog_format_statement.inc
 
--let $num = 50
--let $sum = 0
--let $max = 0
--let $min = 10000000
--let $n = $num
while ($n) 
{
	--let $before = `select now(6)`
	--connect (con$n,localhost,root,,)
	--let $tm = `select timestampdiff(microsecond,'$before', now(6))`
	--echo # Connection time: $tm
	--let $sum = `select $sum + $tm`
	--let $max = `select if($tm>$max,$tm,$max)`
	--let $min = `select if($tm<$min,$tm,$min)`
	--connection default
	--dec $n
}
 
--let $avg = `select $sum / $num`
 
--echo # Min connection time:   $min
--echo # Max connection time:   $max 
--echo # Total connection time: $sum
--echo # Avg connection time:   $avg
 
--source include/rpl_end.inc

Comment by Patryk Pomykalski [ 2015-02-26 ]

We have the same problem on servers with thread pool.

Comment by Vladislav Vaintroub [ 2015-11-24 ]

Fixed, thanks for the patch Peter. That long wait without indication to threadpool was indeed the problem.

Generated at Thu Feb 08 07:20:45 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.