[MDEV-10674] main.show_explain failed in buildbot Created: 2016-08-27  Updated: 2023-11-16  Resolved: 2021-05-03

Status: Closed
Project: MariaDB Server
Component/s: Tests
Affects Version/s: 10.0, 10.1, 10.2, 10.3, 10.4, 10.5
Fix Version/s: 10.2.38, 10.3.29, 10.4.19, 10.5.10, 10.6.1

Type: Bug Priority: Blocker
Reporter: Elena Stepanova Assignee: Marko Mäkelä
Resolution: Fixed Votes: 1
Labels: None

Issue Links:
Problem/Incident
is caused by MDEV-3798 Add EXPLAIN for UPDATE/DELETE (mwl #51) Closed
Relates
relates to MDEV-7069 Fix buildbot failures in main server ... Stalled
relates to MDEV-32819 main.show_explain failed in buildbot Closed

 Description   

http://buildbot.askmonty.org/buildbot/builders/bld-dan-debug/builds/3241/steps/test_1/logs/stdio
(test run with ps-protocol)

main.show_explain 'xtradb'               w1 [ fail ]
        Test ended at 2016-05-08 15:26:44
 
CURRENT_TEST: main.show_explain
mysqltest: At line 880: query 'reap' succeeded - should have failed with errno 1317...
 
The result from queries just before the failure was:
< snip >
11	data1
12	data1
13	data1
14	data1
15	data1
16	data1
17	data1
18	data1
19	data1
20	data1
set autocommit=0;
select * from t1 where pk between 10 and 20 for update;
# do: send_eval show explain for thr2;
Timeout in wait_condition.inc for select State='show explain' from information_schema.processlist where id=33
Id	User	Host	db	Command	Time	State	Info	Progress
33	root	localhost	test	Sleep	30		NULL	0.000
34	root	localhost	test	Query	30	Sending data	select * from t1 where pk between 10 and 20 for update	0.000
38	root	localhost	test	Sleep	30		NULL	0.000
39	root	localhost	test	Query	0	init	show full processlist	0.000
kill query $thr_default;



 Comments   
Comment by Elena Stepanova [ 2017-03-03 ]

10.0 is also affected:
http://buildbot.askmonty.org/buildbot/builders/win32-debug/builds/2297/steps/test/logs/stdio

main.show_explain 'xtradb'               w2 [ fail ]
        Test ended at 2017-01-17 21:58:18
 
CURRENT_TEST: main.show_explain
mysqltest: At line 879: query 'reap' succeeded - should have failed with errno 1317...
 
The result from queries just before the failure was:
< snip >
11	data1
12	data1
13	data1
14	data1
15	data1
16	data1
17	data1
18	data1
19	data1
20	data1
set autocommit=0;
select * from t1 where pk between 10 and 20 for update;
# do: send_eval show explain for thr2;
Timeout in wait_condition.inc for select State='show explain' from information_schema.processlist where id=33
Id	User	Host	db	Command	Time	State	Info	Progress
33	root	localhost:61791	test	Sleep	39		NULL	0.000
34	root	localhost:61795	test	Query	39	Sending data	select * from t1 where pk between 10 and 20 for update	0.000
38	root	localhost:61984	test	Sleep	39		NULL	0.000
39	root	localhost:61985	test	Query	0	init	show full processlist	0.000
kill query $thr_default;

Comment by Alice Sherepa [ 2017-11-20 ]

on 10.3 http://buildbot.askmonty.org/buildbot/builders/p8-rhel7-bintar-debug/builds/3129/steps/test/logs/stdio

main.show_explain 'innodb'               w3 [ fail ]
        Test ended at 2017-11-18 23:18:59
 
CURRENT_TEST: main.show_explain
mysqltest: At line 880: query 'reap' succeeded - should have failed with errno 1317...
 
The result from queries just before the failure was:
< snip >
20	data1
connection con1;
set autocommit=0;
select * from t1 where pk between 10 and 20 for update;
connection default;
# do: send_eval show explain for thr2;
connection con3;
Timeout in wait_condition.inc for select State='show explain' from information_schema.processlist where id=51
Id	User	Host	db	Command	Time	State	Info	Progress
1	system user		NULL	Daemon	NULL	InnoDB purge coordinator	NULL	0.000
2	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
3	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
4	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
5	system user		NULL	Daemon	NULL	InnoDB shutdown handler	NULL	0.000
51	root	localhost	test	Sleep	31		NULL	0.000
52	root	localhost	test	Query	31	Sending data	select * from t1 where pk between 10 and 20 for update	0.000
56	root	localhost	test	Sleep	31		NULL	0.000
57	root	localhost	test	Query	0	Init	show full processlist	0.000
kill query $thr_default;
connection default;

Comment by Marko Mäkelä [ 2019-09-05 ]

I got this locally on 10.3 (very close to mariadb-10.3.18).

Comment by Marko Mäkelä [ 2020-08-14 ]

This looks like a badly written test to me. See details below. Here is a recent failure from 10.4: http://buildbot.askmonty.org/buildbot/builders/kvm-fulltest2/builds/24152/steps/mtr_nm/logs/stdio

10.4 2f7b37b02154748b223e385a7d7787900ab37b5e

CURRENT_TEST: main.show_explain
mysqltest: At line 882: query 'reap' succeeded - should have failed with errno 1317...
 
The result from queries just before the failure was:
< snip >
20	data1
connection con1;
set autocommit=0;
select * from t1 where pk between 10 and 20 for update;
connection default;
# do: send_eval show explain for thr2;
connection con3;
Timeout in wait_condition.inc for select State='show explain' from information_schema.processlist where id=234
Id	User	Host	db	Command	Time	State	Info	Progress
1	system user		NULL	Daemon	NULL	InnoDB purge coordinator	NULL	0.000
3	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
4	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
2	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
5	system user		NULL	Daemon	NULL	InnoDB shutdown handler	NULL	0.000
234	root	localhost	test	Sleep	30		NULL	0.000
235	root	localhost	test	Query	30	Sending data	select * from t1 where pk between 10 and 20 for update	0.000
239	root	localhost	test	Sleep	30		NULL	0.000
240	root	localhost	test	Query	0	Init	show full processlist	0.000
kill query $thr_default;
connection default;
 
More results from queries before failure can be found in /mnt/buildbot/build/mariadb-10.4.15/mysql-test/var/3/log/show_explain.log

Let us look at the changes:

git show 105e3ae6c93f57498fa6c504dfbb92203b0d1056 -- mysql-test/t/show_explain.test

diff --git a/mysql-test/t/show_explain.test b/mysql-test/t/show_explain.test
index 91e3be34408..6735a0e851e 100644
--- a/mysql-test/t/show_explain.test
+++ b/mysql-test/t/show_explain.test
@@ -224,7 +224,9 @@ connection default;
 --source include/wait_condition.inc
 --error ER_TARGET_NOT_EXPLAINABLE
 evalp show explain for $thr2;
+evalp kill query $thr2;
 connection con1;
+--error ER_QUERY_INTERRUPTED
 reap;
 set debug_dbug=@old_debug;
 
@@ -399,7 +401,9 @@ connection default;
 --source include/wait_condition.inc
 --error ER_TARGET_NOT_EXPLAINABLE
 evalp show explain for $thr2;
+evalp kill query $thr2;
 connection con1;
+--error ER_QUERY_INTERRUPTED
 reap;
 set debug_dbug=@old_debug;
 DROP VIEW v1;

The test is using DBUG, but for some reason, it is missing proper DEBUG_SYNC instrumentation to ensure that the KILL QUERY will be processed before the statement that was submitted to execution earlier was completed.

Actually, the above mentioned commit is merely duplicating earlier use of such pattern. The error at line 882 was introduced even earlier:

commit 3e90dc1f77dc3fa51d542bf82a336753310f7776
Author: Sergey Petrunya <psergey@askmonty.org>
Date:   Thu Jul 5 22:04:13 2012 +0400
 
    MWL#182: Explain running statements
    - Make SHOW EXPLAIN command be KILLable with KILL QUERY.

Comment by Michael Widenius [ 2021-03-25 ]

I made this a blocker as this affects protected branches and thus can stop people from pushing

Comment by Sergei Petrunia [ 2021-04-24 ]

I was able to reproduce once

CURRENT_TEST: main.show_explain15
mysqltest: At line 880: query 'reap' succeeded - should have failed with errno 1317...

--- good-execution.log
+++ bad-execution.log
@@ -966,82 +966,16 @@ 
 connection default;
 # do: send_eval show explain for thr2;
 connection con3;
+Timeout in wait_condition.inc for select State='show explain' from information_schema.processlist where id=139
+Id     User    Host    db      Command Time    State   Info    Progress
+1      system user             NULL    Daemon  NULL    InnoDB purge worker     NULL    0.000
+2      system user             NULL    Daemon  NULL    InnoDB purge worker     NULL    0.000
+3      system user             NULL    Daemon  NULL    InnoDB purge coordinator        NULL    0.000
+4      system user             NULL    Daemon  NULL    InnoDB purge worker     NULL    0.000
+5      system user             NULL    Daemon  NULL    InnoDB shutdown handler NULL    0.000
+139    root    localhost       test    Sleep   30              NULL    0.000
+140    root    localhost       test    Query   30      Sending data    select * from t1 where pk between 10 and 20 for update  0.000
+144    root    localhost       test    Sleep   30              NULL    0.000
+145    root    localhost       test    Query   0       Init    show full processlist   0.000
 kill query $thr_default;
 connection default;
-ERROR 70100: Query execution was interrupted
-connection con2;

The test was expecting to find the SHOW EXPLAIN query to be stuck waiting to get explain on a thread that's stuck in innodb lock wait.

The query that's stuck in innodb lock wait is there (id=140). The SHOW EXPLAIN has managed to succeed somehow.

Comment by Sergei Petrunia [ 2021-04-24 ]

Could something like this happen if InnoDB lock wait experiences a spurious wakeup (caused by either pthreads or some other reason)? The wakeup will process APCs, including the APC to produce EXPLAIN output...

Comment by Sergei Petrunia [ 2021-05-01 ]

...no, the cause seems to be the following issue in the test. Look at these lines in the test:

# This will freeze
send
select * from t1 where pk between 10 and 20 for update;

# run SHOW EXPLAIN on a frozen thread
connection default;
let $save_wait_condition= $wait_condition;
let $wait_condition= select State='Sending data' from information_schema.processlist where id=$thr2;
--source include/wait_condition.inc

Here, the test expects that the above for update query is blocked inside
InnoDB, waiting for a lock.
However, let's suppose it hasn't reached the lock acquistion yet and is still
at the start of do_select() (which JOIN::exec_inner calls right after
calling THD_STAGE_INFO(thd, stage_sending_data) call)

Then the SHOW EXPLAIN call

--echo # do: send_eval show explain for thr2;
--disable_query_log
send_eval show explain for $thr2;
--enable_query_log

has a chance to finish: the SELECT...FOR UPDATE query will call
thd->apc_target.process_apcs() before it is stuck waiting on an innodb lock.

Comment by Sergei Petrunia [ 2021-05-02 ]

Candidate patch: http://lists.askmonty.org/pipermail/commits/2021-May/014592.html . I would like to get a review for this from somebody in InnoDB team. marko could you please take a look?

Comment by Marko Mäkelä [ 2021-05-03 ]

The patch looks otherwise OK, but it forgot to load the plugins:

At line 43: query 'let $success= `$wait_condition`' failed with wrong errno 1109: 'Unknown table 'INNODB_LOCK_WAITS' in information_schema', instead of 0...

I pushed an amended version of the patch to bb-10.2-release.

Generated at Thu Feb 08 07:44:03 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.