|
Re: semijoin much slower than materialization
The test data has been uploaded to the private FTP server, named bug_929732_fimbulvetr.sql.bz2
|
|
Re: semijoin much slower than in_to_exists
On my laptop, the query takes ~0.8 sec when semi-join is enabled. and 0.01 sec when it is not.
EXPLAINs are same as posted. For both plans, #rows column has values of
{275,1,1,1,1,1}
, which gives an estimate of about 1375 rows read.
|
|
Re: semijoin much slower than in_to_exists
(run the query without semi-join)
MariaDB [information_schema]> select * from table_statistics;
-------------------------------------------------------------------
| TABLE_SCHEMA |
TABLE_NAME |
ROWS_READ |
ROWS_CHANGED |
ROWS_CHANGED_X_INDEXES |
-------------------------------------------------------------------
| bug929732 |
c |
420 |
0 |
0 |
| bug929732 |
v |
1509 |
0 |
0 |
-------------------------------------------------------------------
2 rows in set (0.00 sec)
(run the query with semi-join)
MariaDB [information_schema]> select * from table_statistics;
-------------------------------------------------------------------
| TABLE_SCHEMA |
TABLE_NAME |
ROWS_READ |
ROWS_CHANGED |
ROWS_CHANGED_X_INDEXES |
-------------------------------------------------------------------
| bug929732 |
c |
918 |
0 |
0 |
| bug929732 |
v |
896565 |
0 |
0 |
-------------------------------------------------------------------
Apparently, the plan with semi-join reads many more records. We'll need to
figure out if that is justified, and if yes, why the optimizer could not
foresee that.
|
|
Re: semijoin much slower than in_to_exists
The problem seems to be related to nested subqueries. The top-level subquery is equivalent to a join, if I manually change it to be a join:
SELECT count
FROM (v
LEFT JOIN c
ON v.cid = c.cid)
join c as c2
WHERE v.t >= '2012-01-31 05:00:00'
AND v.t <= '2012-02-08 07:59:59'
AND v.did = '208'
AND c.pid = '3124'
AND v.cid = c2.cid
AND c2.pid = '3124'
AND c2.s = 0
AND
(
(
c2.cid IN
( /Inner query 1/ SELECT v2.cid
FROM v AS v2
WHERE v2.did = 208
AND v2.t >= '2012-01-31 05:00:00'
AND v2.t <= '2012-02-08 07:59:59'
)
)
AND
(
c2.cid IN
( /Inner query 2/ SELECT v3.cid
FROM v as v3
WHERE v3.did = 208
AND v3.t >= '2012-01-31 05:00:00'
AND v3.t <= '2012-02-08 07:59:59'
)
)
AND
(
c2.cid IN
( /Inner query 3/ SELECT v4.cid
FROM v as v4
WHERE v4.did = 208
AND v4.t >= '2012-01-31 05:00:00'
AND v4.t <= '2012-02-08 07:59:59'
)
)
);
.. then the query takes 0.02-0.03 sec
|
|
Re: semijoin much slower than in_to_exists
^^ I mean it takes 0.02 - 0.03 sec regardless of whether I have semijoin=on or semijoin=off
|
|
Re: semijoin much slower than in_to_exists
Took the original query and changed table names inside deepest-level subqueries to be v2, v3, v4 (so we can tell them apart in the counters), and I get this:
-----------------------------------------------------------------------------------------------------------------------+
| id |
select_type |
table |
type |
possible_keys |
key |
key_len |
ref |
rows |
Extra |
-----------------------------------------------------------------------------------------------------------------------+
| 1 |
PRIMARY |
v |
range |
cid,did |
did |
17 |
NULL |
275 |
Using where; Using index |
| 1 |
PRIMARY |
v2 |
ref |
cid,did |
cid |
16 |
bug929732.v.cid,const |
1 |
Using where; Using index; Start temporary |
| 1 |
PRIMARY |
v3 |
ref |
cid,did |
cid |
16 |
bug929732.v.cid,const |
1 |
Using where; Using index |
| 1 |
PRIMARY |
v4 |
ref |
cid,did |
cid |
16 |
bug929732.v.cid,const |
1 |
Using where; Using index |
| 1 |
PRIMARY |
c |
eq_ref |
PRIMARY |
PRIMARY |
8 |
bug929732.v.cid |
1 |
Using where |
| 1 |
PRIMARY |
c2 |
eq_ref |
PRIMARY |
PRIMARY |
8 |
bug929732.v.cid |
1 |
Using where; End temporary |
-----------------------------------------------------------------------------------------------------------------------+
6 rows in set (0.00 sec)
statistics shows this:
MariaDB [bug929732]> SHOW TABLE_STATISTICS;
--------------------------------------------------------------------
| Table_schema |
Table_name |
Rows_read |
Rows_changed |
Rows_changed_x_#indexes |
--------------------------------------------------------------------
| bug929732 |
v |
276 |
0 |
0 |
| bug929732 |
v2 |
3994 |
0 |
0 |
| bug929732 |
v3 |
53506 |
0 |
0 |
| bug929732 |
v4 |
837280 |
0 |
0 |
| bug929732 |
c |
498 |
0 |
0 |
--------------------------------------------------------------------
5 rows in set (0.00 sec)
MariaDB [bug929732]> SHOW INDEX_STATISTICS;
-------------------------------------------+
| Table_schema |
Table_name |
Index_name |
Rows_read |
-------------------------------------------+
| bug929732 |
v |
did |
276 |
| bug929732 |
v2 |
cid |
3994 |
| bug929732 |
v3 |
cid |
53506 |
| bug929732 |
v4 |
cid |
837280 |
| bug929732 |
c |
PRIMARY |
498 |
-------------------------------------------+
|
|
Re: semijoin much slower than in_to_exists
Hi,
I just confirmed this problem is in maria 5.5.24 (Compiled from source) build as well. We are eagerly awaiting this fix as semijoin is otherwise quite good. I am willing to take a number of steps to help fix this bug.
|
|
Re: semijoin much slower than in_to_exists
Thanks. I'll raise the priority.
|
|
Re: semijoin much slower than in_to_exists
Hi Sergey,
That is a very interesting last result you posted. I just compiled 5.5.25 and attempted to reproduce the SHOW TABLE_STATISTICS like this:
--------------------------------------------------------------------
| Table_schema |
Table_name |
Rows_read |
Rows_changed |
Rows_changed_x_#indexes |
--------------------------------------------------------------------
| bug929732 |
v |
276 |
0 |
0 |
| bug929732 |
v2 |
3994 |
0 |
0 |
| bug929732 |
v3 |
53506 |
0 |
0 |
| bug929732 |
v4 |
837280 |
0 |
0 |
| bug929732 |
c |
498 |
0 |
0 |
--------------------------------------------------------------------
However, I must confess that I may not know how to get table_statistics to recognize v, v2, v3 and v4 as different tables in the output. Mine always show only v on table_statistics, but the explain output does recognize them as different tables. I am using:
SELECT count
FROM v
LEFT JOIN c
ON v.cid = c.cid
WHERE v.t >= '2012-01-31 05:00:00'
AND v.t <= '2012-02-08 07:59:59'
AND v.did = '208'
AND c.pid = '3124'
AND v.cid IN
( SELECT c.cid
FROM c
WHERE c.pid = '3124'
AND c.s = 0
AND
(
(
c.cid IN
( /Inner query 1/ SELECT v2.cid
FROM v as v2
WHERE v2.did = 208
AND v2.t >= '2012-01-31 05:00:00'
AND v2.t <= '2012-02-08 07:59:59'
)
)
AND
(
c.cid IN
( /Inner query 2/ SELECT v3.cid
FROM v as v3
WHERE v3.did = 208
AND v3.t >= '2012-01-31 05:00:00'
AND v3.t <= '2012-02-08 07:59:59'
)
)
AND
(
c.cid IN
( /Inner query 3/ SELECT v4.cid
FROM v as v4
WHERE v4.did = 208
AND v4.t >= '2012-01-31 05:00:00'
AND v4.t <= '2012-02-08 07:59:59'
)
)
)
);
I had noticed that each v (v, v2, v3, v4) in your statistics reads roughly 14x as many rows as the previous. I was trying to see if a v5 would read ~14*837280.
That is an interesting pattern and may lead us to the culprit. I have a debug build compiled, if you can paste in the commands I need to start mysql with to get the relevant debugging for the optimizer, I can have a look at the source and better understand the problem.
|
|
Re: semijoin much slower than in_to_exists
> However, I must confess that I may not know how to get table_statistics to recognize v, v2, v3 and v4 as different tables in the output.
I do it like this:
create table v2 like v;
insert into v2 select * from v;
-
- same for v3, v4
make the different subqueries access the created tables v2, v3, v4.
|
|
Re: semijoin much slower than in_to_exists
To re-iterate:
When I try the flattened (where 'SELECT c2.cid FROM c2' subquery is changed
into a join), I get a decent plan.
It executes in 0.04 secs (my laptop, debug build) and reads 2766 rows.
Playing with semi-join opimization flags in optimizer switch allows one to get
several query plans which all do 1600-2700 reads and are done in 0.03-0.04
sec.
|
|
Re: semijoin much slower than in_to_exists
For the original query, the explain looks decent: http://mariadb.org/explain_analyzer/analyze/muybt
rows=
{275, 1, 1, 1, 1, 1}
, which gives total_rows =
275+275+275+275+275+275 = 1650, which is on par with the flattened query plans.
The execution is different, though:
---------------------------------
| Table_schema |
Table_name |
Rows_read |
---------------------------------
| bug929732 |
v |
276 |
- OK
|
| bug929732 |
v2 |
3994 |
- on estimate, should be 276 x 1=276. Instead, it is 14.5 x times more.
|
| bug929732 |
v3 |
53506 |
- again, 13.4 times more
|
| bug929732 |
v4 |
837280 |
- again, 15.6 times more
|
| bug929732 |
c |
249 |
- Now, we are back to small number of reads
|
| bug929732 |
c2 |
249 |
- and here, too
---------------------------------
|
|
|
Re: semijoin much slower than in_to_exists
-
- Lets check index statistics:
--------------------------------------------------------------------
| Table |
Non_unique |
Key_name |
Seq_in_index |
Column_name |
Collation |
Cardinality |
--------------------------------------------------------------------
| v2 |
1 |
cid |
1 |
cid |
A |
589 |
| v2 |
1 |
cid |
2 |
did |
A |
589 |
-
- For full scan on table v2, EXPLAIN gives rows=1083, which gives
rec_per_key=1.8. EXPLAIN shows '1'.
Checking how many records subquery with table v2 has for table v:
select count from v where
v.t >= '2012-01-31 05:00:00'
AND v.t <= '2012-02-08 07:59:59'
AND v.did = '208';
- This gives 275 rows, which confirms the previous findings
select count from v where
v.t >= '2012-01-31 05:00:00'
AND v.t <= '2012-02-08 07:59:59'
AND v.did = '208'
and v.cid IN
( /Inner query 1/ SELECT v2.cid
FROM v2
WHERE v2.did = 208
AND v2.t >= '2012-01-31 05:00:00'
AND v2.t <= '2012-02-08 07:59:59'
);
-
- ^^ 275 rows again: the subquery has a match for each row of table v.
select count from v,v2 where
v.t >= '2012-01-31 05:00:00'
AND v.t <= '2012-02-08 07:59:59'
AND v.did = '208'
and v.cid = v2.cid and
v2.did = 208
AND v2.t >= '2012-01-31 05:00:00'
AND v2.t <= '2012-02-08 07:59:59'
;
(output is count =1145)
---------------------------------
| Table_schema |
Table_name |
Rows_read |
---------------------------------
| bug929732 |
v |
276 |
| bug929732 |
v2 |
3994 |
---------------------------------
-
- This gives: total subquery's fanout is 1145/276= 4x,
- the number of records that one needs to read is 15x.
|
|
Re: semijoin much slower than in_to_exists
|
|
full log for the above comments: queries, EXPLAIN outputs, counters
LPexportBug929732_bug929732-xpl.txt
|
|
Re: semijoin much slower than in_to_exists
== Observation about the dataset ==
- The subqueries have high fanout, it is 4x overall, 15x if one counts the
number of rows that need to be examined.
- The real fanout is much higher than the estimate provided to the 1x.
== Analysis ==
Semi-join optimizer/run-time is incapable of processing nested subqueries. When
it encounters nested subuqeries it will flatten them. The original query of
this bug has this form:
v LEFT JOIN c SEMI JOIN (c2 SEMI JOIN v2
|
SEMI JOIN v3
|
SEMI JOIN v4)
|
Note the 2-level semi-join nesting. The nesting is eliminated: the query is
converted to
v LEFT JOIN c JOIN c2 SEMI JOIN (v2 JOIN v3 JOIN v4)
|
All subqueries are put together into one cross-product-join-subquery. Fanout
of this new subquery is a product of fanouts of each subqueries.
In some cases, it is not a problem. For instance, consider FirstMatch strategy
which is run on a dataset where the every table from the subquery immediately
produces a matching record (that is, the first row read from the table happens
to satisfy relevant part of the WHERE clause).
In this case, FirstMatch's execution short-cutting property will make
subquery's fanout irrelevant: only the first matching record will be read.
However, for the query in this bug, the plan was Duplicate Weedout:
+-----+------+-----------+----+-----------------------------------------+
|
|table|type |ref |rows|Extra |
|
+-----+------+-----------+----+-----------------------------------------+
|
|v |range |NULL | 275|Using where; Using index |
|
|v2 |ref |v.cid,const| 1|Using where; Using index; Start temporary|
|
|v3 |ref |v.cid,const| 1|Using where; Using index |
|
|v4 |ref |v.cid,const| 1|Using where; Using index |
|
|c |eq_ref|v.cid | 1|Using where |
|
|c2 |eq_ref|v.cid | 1|Using where; End temporary |
|
+-----+------+-----------+----+-----------------------------------------+
|
the semi-join nest consists of tables (v2, v3, v4), and it is correlated with
table c because of the IN-equalities:
c2.cid=v2.cid AND c2.cid=v3.cid AND c2.cid=v4.cid
|
Duplicate Elimination operates on the join order of "v2,v3,v4,c,c2". Note that
the outer table is at the end. When the outer table is at the end, Duplicate
Elimination is unable to do any short-cutting, and so ends up reading many more records.
|
|
Launchpad bug id: 929732
|
|
Re: semijoin much slower than in_to_exists
This is not easy to solve. Created https://mariadb.atlassian.net/browse/MDEV-401 to work on a solution.
|