[MDEV-8359] WHERE condition referring to inner table of left join can be sargable Created: 2015-06-23  Updated: 2023-03-24

Status: Stalled
Project: MariaDB Server
Component/s: Optimizer
Fix Version/s: 11.2

Type: Task Priority: Major
Reporter: VAROQUI Stephane Assignee: Sergei Petrunia
Resolution: Unresolved Votes: 2
Labels: optimizer, outer-join

Issue Links:
Relates
relates to MDEV-10946 Partition pruning not working with LE... Stalled
Sprint: 10.2.4-4, 10.2.4-1, 10.2.4-2

 Description   

Some ERP generate that type of queries

explain 
select count(0) AS `COUNT(*)` 
from 
   E_relance
   left join E_action on ((E_action.id_demande = E_relance.id_demande)) 
where 
  ((E_relance.id_demande = 88224) or (E_action.id_demande = 88224))

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: E_relance
         type: index
possible_keys: fk_E_relance_E_demande1
          key: fk_E_relance_E_demande1
      key_len: 5
          ref: NULL
         rows: 205655
        Extra: Using index
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: E_action
         type: ref
possible_keys: FI__4381
          key: FI__4381
      key_len: 4
          ref: siam2.E_relance.id_demande
         rows: 2
        Extra: Using where; Using index

The full index scan on primary table is not necessary if const on joined table is not NULL.

Rewriting the query is hard to be done in the application because it may happen that the application is doing a lookup for NULL on left joined table

indeed query rewriting change the lookup to const or range in case the second condition is not null and propagated to the upper table

MAD_WEB_DEV (madweb@localhost) [siam2]> 
explain select count(0) AS `COUNT(*)` 
from
  E_relance left join E_action on ((E_action.id_demande = E_relance.id_demande)) 
where 
  ((E_relance.id_demande = 88224) or (E_relance.id_demande = 88224))\G

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: E_relance
         type: ref
possible_keys: fk_E_relance_E_demande1
          key: fk_E_relance_E_demande1
      key_len: 5
          ref: const
         rows: 1
        Extra: Using index
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: E_action
         type: ref
possible_keys: FI__4381
          key: FI__4381
      key_len: 4
          ref: siam2.E_relance.id_demande
         rows: 2
        Extra: Using where; Using index



 Comments   
Comment by Sergei Petrunia [ 2015-06-23 ]

So, the second query has

where 
  ((E_relance.id_demande = 88224) or (E_relance.id_demande = 88224))

which allows the optimizer to read only rows with E_relance.id_demande=88224, i.e use ref access.

The first query is:

   E_relance
   left join E_action on ((E_action.id_demande = E_relance.id_demande)) 
where 
  ((E_relance.id_demande = 88224) or (E_action.id_demande = 88224))

The question is, can we satisfy the first query by just looking at rows with E_relance.id_demande=88224 ?

Comment by Sergei Petrunia [ 2015-06-23 ]

The answer is "YES", one can infer that by following this logic:

Suppose we read a row from E_relance, such that (E_relance.id_demande=88224)= FALSE
Then, there are two options
1. E_action has no matches, E_action.id_demande IS NULL. WHERE condition will evaluate to false.

2. E_action has a match. ON expression then guarantees that (E_action.id_demande = 88224)= FALSE, which means that the WHERE condition will evaluate to false.

Now, the question is, how can one generalize this logic to a rule that could be
used by the optimizer..

Comment by VAROQUI Stephane [ 2015-06-23 ]

The condition can be push up if left join and const is not null and fully push up for inner join

Comment by Sergei Petrunia [ 2015-06-23 ]

The above logic can only be applied in a limited number of cases:

First, the OR must be a two-way OR. If the WHERE expression was

  ((E_relance.id_demande = 88224) or (E_action.id_demande = 88224) or unknown_cond)

then it would be possible that

* {{(E_relance.id_demande = 88224)}} = FALSE
* {{(E_relance.id_demande = 88224)}} is either FALSE (there was a match) or NULL (no match)
* but, unknown_cond= TRUE.

we would still need to scan the whole table E_relance to check unknown_cond for all rows.

What if the constants in the WHERE were not the same? Consider

where
  ((E_relance.id_demande = 1) or (E_action.id_demande = 2))

Suppose we read a row from E_relance, such that (E_relance.id_demande=1)= FALSE.
Then, there are two options
1. E_action has no matches, E_action.id_demande IS NULL. WHERE condition will evaluate to false.
2.1 E_action has a match, E_action.id_demande=2=FALSE. WHERE condition will evaluate to false.
2.2 E_action has a match, E_action.id_demande=2=TRUE. Here, the condition will evaluate to true.

So, we'll need to scan rows with E_relance.id_demande IN (1,2).

Comment by Sergei Petrunia [ 2015-06-23 ]

Attempt to generalize #1:

Consider a query

select * from t1 left join t2 on on_expr where where_expr.

Suppose, we are doing range analysis on table t1. We are analyzing where_expr,
at the moment we're looking at its sub-condition $COND.

If $COND is null-rejecting wrt table t2, we can use multiple-equalities from
on_expr to substitute in $COND columns of t2 for columns of t1.

Need to think of:

  • whether the above is correct
  • how this applies to nested outer joins.
Comment by Sergei Petrunia [ 2015-06-24 ]

Checked PostgreSQL:

test=# explain select * from t1 left join t2 on t2.col1=t1.col1 where (t1.pk between 10 and 20) or t2.col1 between 30 and 40 ;
                                       QUERY PLAN                                       
----------------------------------------------------------------------------------------
 Hash Left Join  (cost=2693.00..7136.00 rows=20 width=16)
   Hash Cond: (t1.col1 = t2.col1)
   Filter: (((t1.pk >= 10) AND (t1.pk <= 20)) OR ((t2.col1 >= 30) AND (t2.col1 <= 40)))
   ->  Seq Scan on t1  (cost=0.00..1443.00 rows=100000 width=8)
   ->  Hash  (cost=1443.00..1443.00 rows=100000 width=8)
         ->  Seq Scan on t2  (cost=0.00..1443.00 rows=100000 width=8)
 Planning time: 0.465 ms
(7 rows)

Note the Seq Scan elements. They dont have the optimization that we're talking about here.

Comment by Sergei Petrunia [ 2016-10-13 ]

Re the *Attempt to generalize #1* :

Here's a justification why it is correct:
Suppose we have a query:

t1 LEFT JOIN t2 ON t1.col=t2.col
WHERE 
  ... cond(t2.col) ...

The question is, can we replace cond(t2.col) with cond(t1.col)? (if not replace altogether, at least do it for the purpose of doing range analysis on table t1)

Let's consider how the WHERE clause is computed for various kinds of (t1.row, t2.row) record combinations.
LEFT JOIN operation produces two kinds of row combinations:

  1. {t1.rowA, t2.rowX}

    when table t2 had a match

  2. {t1.rowB, t2.NULLs}

    when table t2 didn't have a match.

Consider a row combination of type #1. It has t1.col=t2.col, which means the value of cond(t1.col) will be the same as the value cond(t2.col).
(Here, we assume equality substitution is "identity" substitution, i.e. the issue of 'A'='a' but f('A')!=f('a') is not a concern).

Consider a row combination of type #2. t2.col is a NULL, cond(NULL) is not the same as cond(t1.col).

when cond(t2.col) is at top-level of the WHERE clause, and is NULL-rejecting wrt t2.col, it will evaluate to FALSE, and cause the entire WHERE to evaluate to FALSE.
If we change cond(t2.col) into cond(t1.col), then

  • the value of the WHERE expression may remain FALSE. The result of query will remain the same, or
  • the value of the WHERE expression may become TRUE.

If the latter happens, we will get extra rows. However, this is not an issue if we just use this substitution for the purpose of constructing range/index_merge
access to table t1. We may read extra rows but they will be filtered out when the WHERE condition is checked.

As for nested outer joins: We don't care about the nesting. t2.col must be either an NULL-complemented row, or the row must have t2.col=t1.col.

Comment by Sergei Petrunia [ 2016-10-13 ]

A basic example to play with:

create table t1 (a int, b int, key(a));
insert into t1 select a,a from test.one_k;
 
create table t2 (a int, b int);
insert into t2 select a,a from test.one_k;
 
explain extended
select * 
from 
  t1 left join t2 on t2.a=t1.a
where 
  t1.a< 3 or t2.a <4;

Debugging this, one can see a difficulty in the implementation:
When the range optimizer is invoked for the WHERE condition, "t2.a" is an Item_field that has no reference to the Item_equal.
This is kind of logical, as one can't unconditionally substitute t2.a in the WHERE, based on the restrictions in the ON expression.

Comment by Sergei Petrunia [ 2016-10-13 ]

http://lists.askmonty.org/pipermail/commits/2016-October/009987.html

elenst, I would like a test pass for this. Need outer joins

  • ON expressions have equalities
  • WHERE expression has range conditions on inner tables
  • the outer table has indexes that these range conditions could use
Comment by Sergei Petrunia [ 2016-10-14 ]

This fix doesn't help with MDEV-10946.
This fix allows using parts of WHERE to construct access for the outer tables.
MDEV-10946 needs to use a part of WHERE when the ON expression is used to construct access to an inner table.

Comment by VAROQUI Stephane [ 2016-10-14 ]

Miam Miam! Thanks for this fix

Comment by Elena Stepanova [ 2016-10-31 ]

psergey,

Elena Stepanova, I would like a test pass for this. Need outer joins
ON expressions have equalities
WHERE expression has range conditions on inner tables
the outer table has indexes that these range conditions could use

I've run tests on the patch comparing to the vanilla tree, didn't get any problems which weren't there before the patch. Please go ahead and push, there will also be regular regression tests on the main tree.

Comment by Sergei Petrunia [ 2016-11-15 ]

Igor, please review

Comment by Sergei Petrunia [ 2017-01-27 ]

Re-assigning to me as the ball is on my side currently.

Generated at Thu Feb 08 07:26:34 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.