[MDEV-22360] Sufficient conditions for accurate calculation of join cardinality Created: 2020-04-24  Updated: 2023-12-12

Status: Stalled
Project: MariaDB Server
Component/s: Optimizer
Fix Version/s: 11.5

Type: New Feature Priority: Major
Reporter: Igor Babaev Assignee: Rex Johnston
Resolution: Unresolved Votes: 1
Labels: None

Issue Links:
PartOf
is part of MDEV-8306 Complete cost-based optimization for ... Stalled
Relates
relates to MDEV-31327 Histogram range selectivity estimates... Stalled

 Description   

Let Q be a join query over tables T1,...,Tn with the where condition C.

In the following cases the calculation of the cardinality of the result set of Q is considered as accurate:

a) C is a multiple equality such that for each column of it the number of distinct values is known

Example:

SELECT * from t1,t2 WHERE t1.a=t2.a;

We just need to know NDV, which can be found in 2 ways

  • there is index on t1.a and t2.a
  • EITS statistics is available for t1.a and t2.a

b) C is a conjunction of formulas from a)

c) C is either inequality predicate or [NOT] BETWEEN predicate or [NOT] IN predicate or [NOT] LIKE predicate or IS [NOT] NULL predicate who has

  • only one non-constant argument and this argument a reference to a column that is
    used either as the first component of an index with known statistical data or a column
    for which a histogram is provided
    (for [NOT] IN / [NOT] LIKE predicate this column must be the first argument)
  • all other arguments must be inexpensive constant items that are converted to simple
    constant (cached) (by the time we check the conditions)
    (in the case of [NOT] LIKE predicate the constant argument must be of a special form)

Examples here would be:

SELECT * from t1,t2 WHERE t1.a op const

Where op can be >/>=/</<=/=/<>

Also the other cases are with
[NOT] IN predicate, [NOT] NULL predicate, LIKE predicate

d) C is a conjunction formulas of a) and AND-OR formulas over predicates from b) such that
each of these formulas depends (directly or indirectly through equalities inferred from the
conjuncted multiple equalities) only on one column.

SELECT * from t1,t2 WHERE (t1.a =1 or t1.a=4);    // directly depends on one column

SELECT * from t1,t2 WHERE t1.a=t2.b and (t1.a > 5 or t2.b < 1);

indirectly depeneds on one column, we use muliple equality ME(t1.a,t2.b).
So after substitution the query would look like

SELECT * from t1,t2 WHERE t1.a=t2.b and (t1.a > 5 or t1.a < 1);

and the entire OR conjunct (t1.a > 5 or t1.a < 1) can be resolved by one column

The above sufficient conditions can easily be checked for any WHERE clause just by traversing the corresponding item.

It's clear that if C guarantees accurateness of calculation of the cardinality of the result set for Q then the cardinality of any partial join over tables Ti1,...,Tik with conditions pushed from C to these tables can be calculated accurately.



 Comments   
Comment by Varun Gupta (Inactive) [ 2020-05-20 ]

Has started with implementing the part c) mentioned in the description, looks simple to implement

The part that is going to be complex to implement is part (a), need more clarity as to how we are going to take selectivity for equality conditions(both parts are non-const) into consideration

Comment by Varun Gupta (Inactive) [ 2020-05-20 ]

Found out that for NOT IN predicate the selectivity estimates are not correct

Here is a simple dataset:

CREATE TABLE t1(a INT, b INT);
INSERT INTO t1 SELECT seq, seq from seq_1_to_100;
ANALYZE TABLE t1 PERSISTENT FOR ALL; // collect EITS

MariaDB [test]> EXPLAIN EXTENDED SELECT * from t1 WHERE a IN (1,2,3,4,5,6,7,8,9,10);
+------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------+
| id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra       |
+------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------+
|    1 | SIMPLE      | t1    | ALL  | NULL          | NULL | NULL    | NULL | 100  |    10.00 | Using where |
+------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------+
1 row in set, 1 warning (0.004 sec)

For the IN predicate we get accurate selectivity of 10%

MariaDB [test]> EXPLAIN EXTENDED SELECT * from t1 WHERE a NOT IN (1,2,3,4,5,6,7,8,9,10);
+------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------+
| id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra       |
+------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------+
|    1 | SIMPLE      | t1    | ALL  | NULL          | NULL | NULL    | NULL | 100  |   100.00 | Using where |
+------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------+
1 row in set, 1 warning (0.003 sec)

But here with NOT IN predicate the selectivity is 100 % which is incorrect, it should be 90%.

Also from the optimizer trace I see

"rows_estimation": [
              {
                "selectivity_for_indexes": [],
                "selectivity_for_columns": [
                  {
                    "column_name": "a",
                    "selectivity_from_histogram": 1.088125
                  }
                ],
                "cond_selectivity": 1
              },

Here selectivity_from_histogram > 1 which is again INCORRECT.
I think this needs a separate MDEV

Comment by Varun Gupta (Inactive) [ 2020-05-21 ]

Considering about part a) , the implementation is restricted to the case

The distinct values for a column is known via
1) EITS estimate: when EITS is collected we have the distinct values for the column
2) Column is the first component of an index.

Eg:

SELECT * from t1,t2 WHERE t1.a=t2.a;

so for the equality condition t1.a=t2.a if EITS is not available then t1.a and t2.a should be the first component of an index from table t1 and t2 respectively so that we could get the accurate estimate.

This approach would obviously disallow cases like

 SELECT * from t1,t2 WHERE t1.a=t2.a and t1.b=t2.b;

where lets say the plan is t1,t2 and t2 makes a ref access with index key1(a,b)
then we have an estimate for the distinct values for (a,b) combined but this will give us accurate estimates for join cardinality.

Comment by Varun Gupta (Inactive) [ 2020-05-27 ]

The branch where the code is 10.6-mdev22360

Comment by Varun Gupta (Inactive) [ 2020-06-16 ]

The approach that we are using to check if the estimates are accurate for join cardinality:

Lets take a generic example with OR-AND conjuncts

WHERE clause is defined as: (COND1 AND COND2) OR (COND3 AND (COND4 OR COND5))

and each CONDx can be an OR-AND conjunct itself.

So the algorithm to check if the selectivity estimates are accurate is

Lets denote the WHERE CLAUSE as conds

if conds is ITEM_COND::AND_ITEM :
    while ((item= li++))
    {
      if (item->walk(callback_function, &arg))
        return false;
    }
else
  conds->walk(callback_function, &arg);

So lets break this down into more details, the condition at the top level can be of 2 cases:

Case 1: WHERE clause is an AND conjunct

For an AND item at the top level, we need to walk over all the top level conjuncts and call walk individually on
them. This is done in such a way because for an AND conjunct at the top level we may have accurate
selectivity, even if the predicate belongs to a different column.
Eg: t1.a > 10 and t2.a < 5

For this AND item we will have accurate selectivities.
For AND conjuncts (not at the top level), the entire conjunct needs to be resolved to one column.

Eg: t1.a = t2.a AND ( (t1.a > 5 AND t2.a < 10) OR t1.a <= 0)

Case 2

2a) OR item

For an OR item at the top level, we need to make sure that all the columns inside the OR conjunct need to
belong to one column directly or indirectly. This needs to happen for an OR conjunct even if it is not at the
top level.

Eg: t1.a=t2.b and (t2.b > 5 or t1.a < 0);

2b) Single predicate at the top level

Eg:

  • t1.a= t2.a [ For this case we need to make sure we know number of distinct values for t1.a and t2.a ]
  • t1.a > 5 [ sargable predicate, get the estimate from the range optimizer ]
Comment by Varun Gupta (Inactive) [ 2020-06-16 ]

The latest patch is here
https://github.com/MariaDB/server/commit/1014b8f4c6ab88304d866edd8e93826744afca5c
the branch is 10.6-mdev22360.

Generated at Thu Feb 08 09:14:08 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.