[MDEV-22513] main.processlist_notembedded fails in buildbot with Timeout in wait_until_count_sessions Created: 2020-05-09  Updated: 2020-05-27  Resolved: 2020-05-26

Status: Closed
Project: MariaDB Server
Component/s: Tests
Affects Version/s: 10.2, 10.5
Fix Version/s: 10.5.4, 10.1.46, 10.2.33, 10.3.24, 10.4.14

Type: Bug Priority: Major
Reporter: Elena Stepanova Assignee: Marko Mäkelä
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Relates
relates to MDEV-20466 SHOW PROCESSLIST truncates query text... Closed
relates to MDEV-21462 main.processlist_notembedded fails to... Closed

 Description   

http://buildbot.askmonty.org/buildbot/builders/kvm-rpm-centos74-amd64-debug/builds/4005

bb-10.2-release 985f63cce1057d7b3fd6e86e65a2a2ee 2020-05-08 19:07:46

main.processlist_notembedded             w1 [ fail ]
        Test ended at 2020-05-08 17:33:56
 
CURRENT_TEST: main.processlist_notembedded
mysqltest: In included file "/usr/share/mysql-test/include/wait_until_count_sessions.inc": 
included from /usr/share/mysql-test/t/processlist_notembedded.test at line 37:
At line 125: Timeout in wait_until_count_sessions.inc
 
The result from queries just before the failure was:
#
# MDEV-20466: SHOW PROCESSLIST truncates query text on \0 bytes
#
connect  con1,localhost,root,,;
connection con1;
SET DEBUG_SYNC= 'before_join_optimize SIGNAL in_sync WAIT_FOR go';
connection default;
SET DEBUG_SYNC= 'now WAIT_FOR in_sync';
FOUND 1 /sleep/ in MDEV-20466.text
SET DEBUG_SYNC= 'now SIGNAL go';
disconnect con1;
SET DEBUG_SYNC = 'RESET';
# Timeout in wait_until_count_sessions.inc
# Number of sessions   expected: <= 1   found: 2
SHOW PROCESSLIST;
Id	User	Host	db	Command	Time	State	Info	Progress
11071	root	localhost	test	Query	0	init	SHOW PROCESSLIST	0.000
11072	root	localhost	test	Query	10	debug sync point: before_join_optimize	SELECT user FROM mysql.user WHERE user ='some



 Comments   
Comment by Marko Mäkelä [ 2020-05-11 ]

As reported in MDEV-21462, this test was ruining subsequent tests by failing to ensure that all connections terminated.

My change was merged by the original test author sanja upward. The bb-10.2-release branch where the test failed is a merge from bb-10.1-release. There, that test did not fail anywhere, but notably, kvm-rpm-centos74-amd64 did not run at all on bb-10.1-release.

I tried and failed to reproduce the failure locally, despite 1,200 repeats. The [ retry-fail ] in the buildbot log looks promising. I did not find anything useful in the server error log.

I am reassigning this back to the original test author, who also did the merge to bb-10.2-release. I hope that this is well repeatable on the buildbot VM.

I only fixed MDEV-21462 because the badly written test was causing a potentially large number of subsequent tests to fail. Luckily, this failure appears to be the failure of a single test for a single bug.

Comment by Marko Mäkelä [ 2020-05-12 ]

I repeated the hang once locally:

10.5 0186b0a07760b734af61a9e45ebdb5a077d1872d

CURRENT_TEST: main.processlist_notembedded
mysqltest: In included file "./include/wait_until_count_sessions.inc": 
included from /mariadb/10.5-merge/mysql-test/main/processlist_notembedded.test at line 37:
At line 125: Timeout in wait_until_count_sessions.inc
 
The result from queries just before the failure was:
#
# MDEV-20466: SHOW PROCESSLIST truncates query text on \0 bytes
#
connect  con1,localhost,root,,;
connection con1;
SET DEBUG_SYNC= 'before_join_optimize SIGNAL in_sync WAIT_FOR go';
connection default;
SET DEBUG_SYNC= 'now WAIT_FOR in_sync';
FOUND 1 /sleep/ in MDEV-20466.text
SET DEBUG_SYNC= 'now SIGNAL go';
disconnect con1;
SET DEBUG_SYNC = 'RESET';
# Timeout in wait_until_count_sessions.inc
# Number of sessions   expected: <= 1   found: 2
SHOW PROCESSLIST;
Id	User	Host	db	Command	Time	State	Info	Progress
122	root	localhost	test	Query	0	starting	SHOW PROCESSLIST	0.000
123	root	localhost	test	Query	10	debug sync point: before_join_optimize	SELECT user FROM mysql.user WHERE user ='some\0sleep'	0.000

I tried to repeat by running the same sequence of tests that the hung worker was running, since the latest server restart. But, the test would fail to time out:

while ./mtr --verbose-restart --no-reorder main.opt_trace_security main.opt_trace_ucs2 main.order_by-mdev-10122 main.order_by_optimizer main.order_by_sortkey main.order_by_zerolength-4285 main.order_fill_sortbuf main.outfile main.outfile_loaddata main.overflow main.packet main.parser main.parser_not_embedded main.parser_precedence main.parser_stack main.password_expiration main.perror main.perror-win main.plugin_vars main.preload main.prepare main.processlist main.processlist_notembedded
do :
done

Possibly related note: The tests main.partition_debug_sync and main.query_cache_debug occasionally spend 300 extra seconds (5 minutes) real time, apparently due to something DEBUG_SYNC related. The tests would not fail, but they would run for 300xxx milliseconds instead of some tens or hundreds of milliseconds. I have seen this starting with 10.2. It might also be that I have not run tests on 10.1 frequently enough to witness such ‘hang’.

Comment by Marko Mäkelä [ 2020-05-26 ]

I think that the signal can be lost because we blindly disconnect the paused connection, and then discard the signals by SET DEBUG_SYNC = 'RESET'. A reap before the disconnect should fix this.

Generated at Thu Feb 08 09:15:18 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.