Looking at the next patch. Good to see the input was addressed.
Handling of NULLs seems to produce correct output, although I see that the query plan has redundant elements. Look at the attached file: mdev22534-full-scan-on-null-key.sql
The trivial correlation is
the code puts t1.b IS NOT NULL at the top level (outside the subquery):
"query_block": {
|
"select_id": 1,
|
"attached_condition": "!(t1.b is not null and <in_optimizer>((t1.a,t1.b),<exists>(subquery#2)))"
|
But then inside the subquery we have:
{
|
"full-scan-on-null_key": {
|
"table": {
|
"table_name": "t2",
|
"access_type": "ref_or_null",
|
"possible_keys": ["y"],
|
"key": "y",
|
"key_length": "5",
|
"used_key_parts": ["y"],
|
the "NULL key" for lookup into t2.y is t1.b. Which we know will be NOT NULL if we get into the subquery.
I did some debugging, and it seems to be caused by the IN subselect creating Item_func_trig_cond when it is not top level during stage 2 optimization[1]. It is avoided by the exists2in transformation because it marks the newly created IN subselect as toplevel[2], even if there's an upper_not:
in_subs->top_level_item();
|
[1] (backtrace): Item_in_subselect::create_row_in_to_exists_cond > Item_in_subselect::create_in_to_exists_cond > JOIN::choose_subquery_plan > make_join_statistics > JOIN::optimize_inner > JOIN::optimize > st_select_lex::optimize_unflattened_subqueries > JOIN::optimize_unflattened_subqueries > JOIN::optimize_stage2 > JOIN::optimize_inner > JOIN::optimize
[2] Item_exists_subselect::exists2in_processor
I tried to do the same thing with my decorrelate-in transformation[3], but it fails the following test:
CREATE TABLE t1 (pk INT NOT NULL, i INT NOT NULL);
|
INSERT INTO t1 VALUES (0,10), (1,20), (2,30), (3,40);
|
CREATE TABLE t2a (pk INT NOT NULL, i INT NOT NULL, PRIMARY KEY(i,pk));
|
INSERT INTO t2a VALUES (0,0), (1,1), (2,2), (3,3);
|
SELECT * FROM t1 WHERE (NULL, 1) NOT IN (SELECT t2a.i, t2a.pk FROM t2a WHERE t2a.pk = t1.pk); # outputs nothing, but should be (0, 10), (2, 30), (3, 40)
|
[3] https://github.com/MariaDB/server/commit/373a97067bd
I could not come up with a similar test for exists2in to check whether the forced toplevel would cause similar problems. In any case, this needs to be further investigated.
Yes that should be fine. However since the primary goal of processing IN subquery is to decorrelate it, it would be nice to write the outcomes into the trace: whether the subquery became uncorrelated and also write a note if Materialization became a possible strategy.
Added a field in the trace whether the outcome is still correlated. AFAICT materialization, if applicable after the transformation, does show up in the trace. An example is the non-semijoin materialization example in the test[3], where if you remove the ".to" in json_extract path, you will see materialization after the decorrelation has the possible field equal true.
[3] https://github.com/MariaDB/server/blob/62251786f8bb2084fe9169befdc569c24f5078f2/mysql-test/main/subselect_decorrelate_in.test#L437
Nice to see a readable comment before the big if-check in Item_in_subselect::exists2in_processor().
Comparing it with its counterpart in Item_exists_subselect::exists2in_processor(), I can see the latter has more checks:
(!optimizer ||
|
...
|
!first_select->leaf_tables.elements||
|
with_recursive_reference)
|
Do we need to add these too? (I wonder what "!optimizer" check is for...)
Looking at my notes, I don't know what these checks are for and did not find them necessary for my decorrelate-in transformation. The example snippet in the tests I found that covers !first_select->leaf_tables.elements is
SELECT * FROM t2 AS alias1, t2 AS alias2
|
WHERE EXISTS ( SELECT 1 ) AND (alias2.pk = alias1.b )
|
ORDER BY alias1.b;
|
I wrote a similar test for IN, and it got optimized
# no need to check leaf tables
|
CREATE TABLE t2 (pk INT PRIMARY KEY, b INT) ENGINE=MyISAM;
|
INSERT INTO t2 VALUES (1,1),(2,7);
|
SELECT * FROM t2 AS alias1, t2 AS alias2
|
WHERE alias1.pk in ( SELECT 1 ) AND (alias2.pk = alias1.b )
|
ORDER BY alias1.b;
|
drop table t2;
|
For with_recursive_reference, the relevant snippet is from cte_recursive.test:
with recursive
|
reached_objects as
|
(
|
select v, 'init' as m from objects where v in ('v3','v7','v9')
|
union
|
select module_results.v, module_results.m from module_results, applied_modules
|
where module_results.m = applied_modules.m
|
),
|
applied_modules as
|
(
|
select * from modules where 1=0
|
union
|
select modules.m
|
from
|
modules
|
where
|
not exists (select * from module_arguments
|
where module_arguments.m = modules.m and
|
module_arguments.v not in
|
(select v from reached_objects))
|
)
|
select * from reached_objects
|
There's IN in this test, and without the check in the decorrelate-in transformation, the test passes. I also tried the following test snippet and there's no problem without the check in place:
with recursive
|
ancestor_ids (id, generation)
|
as
|
(
|
select father, 1 from folks where name = 'Me' and father is not null
|
union all
|
select mother, 1 from folks where name = 'Me' and mother is not null
|
union all
|
select father, fa.generation+1 from folks, ancestor_ids fa
|
where folks.id = fa.id and father is not null and
|
(father not in (select id from ancestor_ids))
|
union all
|
select mother, ma.generation+1 from folks, ancestor_ids ma
|
where folks.id = ma.id and mother is not null and
|
(mother not in (select id from ancestor_ids))
|
)
|
select generation, name from ancestor_ids a, folks
|
where a.id = folks.id
|
I can also see both exists2in_processor() functions have:
!(is_top_level_item() ||
|
(upper_not && upper_not->is_top_level_item())) || /* (2) */
|
although the motivation behind this limitation is not clear.
is_top_level_item()==true means one doesn't care between FALSE/NULL results, that is, one only needs to check if the subquery has a match. We don't need to care about non-matches vs non-matches in presence of NULLs.
But if is_top_level_item()==false and (upper_not && upper_not->is_top_level_item())==true then it's a NOT IN and for IN, we don't care between TRUE/NULL results.
This means we do need to differentiate between non-match case (FALSE result) and non-match-but-NULLs-are-present (NULL result).
But if we do differentiate, why can't we handle the subquery in an arbitrary context?
Right, we don't have to worry about null when it is top level in, and with the added is not null in both the inner and outer where condition, the value of the outer where condition is exactly the same before and after the transformation. I wrote two possibilities[4] why we don't consider arbitrary context, which I quote here:
With this modification, theoretically we don't have to care about
whether the IN subselect is top level, however, there are two issues:
1. When the subquery is indeed top level, do the extra terms lower
performance?
2. When the subquery is not top level, how do we substitute `this`
which is an Item_in_subselect with the Item_func_and?
I suspect this is why the exists2in transformation only does the
modification when there is an upper_not, and the transformation is
skipped when the subquery is not toplevel EXISTS nor toplevel NOT
EXISTS.
I think we can fix Item 1 by handling toplevel separately from non-toplevel, that is, by not adding the is not nulls when it is top level. But do you know how fix Item 2 psergei?
[4] https://lists.mariadb.org/hyperkitty/list/developers@lists.mariadb.org/message/CEPJCC7GCZ4H44L276AFRIUB7LYIJOPI/
The "!optimizer" check is only meaningful for EXISTS subqueries. Item_in_subselect is always wrapped in an "optimizer".
This is not always true. Item_in_subselect is not always wrapped in an optimizer and it is ok (we don't need a similar check). For example you will get segfaults when you remove the if (optimizer) from the following[5] and test against the most basic case select * from t1 where a in (select a from t2 where t1.b=t2.b);
// inside bool Item_in_subselect::exists2in_processor(void *opt_arg)
|
/* Update any Item_in_optimizer wrapper if exists */
|
if (optimizer)
|
{
|
optimizer->reset_cache();
|
if (optimizer->fix_left(thd))
|
{
|
res= TRUE;
|
goto out;
|
}
|
}
|
[5] https://github.com/MariaDB/server/blob/62251786f8bb2084fe9169befdc569c24f5078f2/sql/item_subselect.cc#L3441
A patch with trivial coding style cleanups: https://github.com/MariaDB/server/commit/62251786f8bb2084fe9169befdc569c24f5078f2
Thank you, I have incorporated the change, as well as the is_correlated field:
https://github.com/MariaDB/server/commit/6175a4541ff
So we want to turn some kind of trivially correlated subquery to an uncorrelated subquery, so that the optimizer can apply materialization.
Extrapolating from the example, and the description in EXISTS-TO-IN[1], what happens is that we transform
in (select inner_col' from inner_table where inner_col = outer_col)
to
, outer_col in (select inner_col', inner_col from inner_table)
How much do we want to generalize this? Judging from EXISTS-TO-IN, we can add an inner_where:
in (select inner_col' from inner_table where inner_col = outer_col and inner_where) => , outer_col in (select inner_col', inner_col from inner_table where inner_where)
But how about expressions?
EXISTS-TO-IN seems to generalise to outer_expr=inner_col (see find_inner_outer_equalities()), so maybe we can also have
in (select inner_col' from inner_table where inner_col = outer_expr and inner_where) => , outer_expr in (select inner_col', inner_col from inner_table where inner_where)
But why not go all the way to
in (select inner_expr' from inner_table where inner_expr = outer_expr and inner_where) => , outer_expr in (select inner_expr', inner_expr from inner_table where inner_where)
?
[1] https://mariadb.com/kb/en/exists-to-in-optimization/
In any case I'll try to hack out a PoC that covers the example first.