Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-22360

Sufficient conditions for accurate calculation of join cardinality

Details

    • New Feature
    • Status: Stalled (View Workflow)
    • Critical
    • Resolution: Unresolved
    • 12.1
    • Optimizer
    • None

    Description

      Let Q be a join query over tables T1,...,Tn with the where condition C.

      In the following cases the calculation of the cardinality of the result set of Q is considered as accurate:

      a) C is a multiple equality such that for each column of it the number of distinct values is known

      Example:

      SELECT * from t1,t2 WHERE t1.a=t2.a;

      We just need to know NDV, which can be found in 2 ways

      • there is index on t1.a and t2.a
      • EITS statistics is available for t1.a and t2.a

      b) C is a conjunction of formulas from a)

      c) C is either inequality predicate or [NOT] BETWEEN predicate or [NOT] IN predicate or [NOT] LIKE predicate or IS [NOT] NULL predicate who has

      • only one non-constant argument and this argument a reference to a column that is
        used either as the first component of an index with known statistical data or a column
        for which a histogram is provided
        (for [NOT] IN / [NOT] LIKE predicate this column must be the first argument)
      • all other arguments must be inexpensive constant items that are converted to simple
        constant (cached) (by the time we check the conditions)
        (in the case of [NOT] LIKE predicate the constant argument must be of a special form)

      Examples here would be:

      SELECT * from t1,t2 WHERE t1.a op const

      Where op can be >/>=/</<=/=/<>

      Also the other cases are with
      [NOT] IN predicate, [NOT] NULL predicate, LIKE predicate

      d) C is a conjunction formulas of a) and AND-OR formulas over predicates from b) such that
      each of these formulas depends (directly or indirectly through equalities inferred from the
      conjuncted multiple equalities) only on one column.

      SELECT * from t1,t2 WHERE (t1.a =1 or t1.a=4);    // directly depends on one column

      SELECT * from t1,t2 WHERE t1.a=t2.b and (t1.a > 5 or t2.b < 1);

      indirectly depeneds on one column, we use muliple equality ME(t1.a,t2.b).
      So after substitution the query would look like

      SELECT * from t1,t2 WHERE t1.a=t2.b and (t1.a > 5 or t1.a < 1);

      and the entire OR conjunct (t1.a > 5 or t1.a < 1) can be resolved by one column

      The above sufficient conditions can easily be checked for any WHERE clause just by traversing the corresponding item.

      It's clear that if C guarantees accurateness of calculation of the cardinality of the result set for Q then the cardinality of any partial join over tables Ti1,...,Tik with conditions pushed from C to these tables can be calculated accurately.

      Attachments

        Issue Links

          Activity

            Has started with implementing the part c) mentioned in the description, looks simple to implement

            The part that is going to be complex to implement is part (a), need more clarity as to how we are going to take selectivity for equality conditions(both parts are non-const) into consideration

            varun Varun Gupta (Inactive) added a comment - Has started with implementing the part c) mentioned in the description, looks simple to implement The part that is going to be complex to implement is part (a), need more clarity as to how we are going to take selectivity for equality conditions(both parts are non-const) into consideration
            varun Varun Gupta (Inactive) added a comment - - edited

            Found out that for NOT IN predicate the selectivity estimates are not correct

            Here is a simple dataset:

            CREATE TABLE t1(a INT, b INT);
            INSERT INTO t1 SELECT seq, seq from seq_1_to_100;
            ANALYZE TABLE t1 PERSISTENT FOR ALL; // collect EITS
            

            MariaDB [test]> EXPLAIN EXTENDED SELECT * from t1 WHERE a IN (1,2,3,4,5,6,7,8,9,10);
            +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------+
            | id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra       |
            +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------+
            |    1 | SIMPLE      | t1    | ALL  | NULL          | NULL | NULL    | NULL | 100  |    10.00 | Using where |
            +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------+
            1 row in set, 1 warning (0.004 sec)
            

            For the IN predicate we get accurate selectivity of 10%

            MariaDB [test]> EXPLAIN EXTENDED SELECT * from t1 WHERE a NOT IN (1,2,3,4,5,6,7,8,9,10);
            +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------+
            | id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra       |
            +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------+
            |    1 | SIMPLE      | t1    | ALL  | NULL          | NULL | NULL    | NULL | 100  |   100.00 | Using where |
            +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------+
            1 row in set, 1 warning (0.003 sec)
            

            But here with NOT IN predicate the selectivity is 100 % which is incorrect, it should be 90%.

            Also from the optimizer trace I see

            "rows_estimation": [
                          {
                            "selectivity_for_indexes": [],
                            "selectivity_for_columns": [
                              {
                                "column_name": "a",
                                "selectivity_from_histogram": 1.088125
                              }
                            ],
                            "cond_selectivity": 1
                          },
            

            Here selectivity_from_histogram > 1 which is again INCORRECT.
            I think this needs a separate MDEV

            varun Varun Gupta (Inactive) added a comment - - edited Found out that for NOT IN predicate the selectivity estimates are not correct Here is a simple dataset: CREATE TABLE t1(a INT, b INT); INSERT INTO t1 SELECT seq, seq from seq_1_to_100; ANALYZE TABLE t1 PERSISTENT FOR ALL; // collect EITS MariaDB [test]> EXPLAIN EXTENDED SELECT * from t1 WHERE a IN (1,2,3,4,5,6,7,8,9,10); +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------+ | 1 | SIMPLE | t1 | ALL | NULL | NULL | NULL | NULL | 100 | 10.00 | Using where | +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------+ 1 row in set, 1 warning (0.004 sec) For the IN predicate we get accurate selectivity of 10% MariaDB [test]> EXPLAIN EXTENDED SELECT * from t1 WHERE a NOT IN (1,2,3,4,5,6,7,8,9,10); +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------+ | 1 | SIMPLE | t1 | ALL | NULL | NULL | NULL | NULL | 100 | 100.00 | Using where | +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------+ 1 row in set, 1 warning (0.003 sec) But here with NOT IN predicate the selectivity is 100 % which is incorrect, it should be 90%. Also from the optimizer trace I see "rows_estimation": [ { "selectivity_for_indexes": [], "selectivity_for_columns": [ { "column_name": "a", "selectivity_from_histogram": 1.088125 } ], "cond_selectivity": 1 }, Here selectivity_from_histogram > 1 which is again INCORRECT. I think this needs a separate MDEV

            Considering about part a) , the implementation is restricted to the case

            The distinct values for a column is known via
            1) EITS estimate: when EITS is collected we have the distinct values for the column
            2) Column is the first component of an index.

            Eg:

            SELECT * from t1,t2 WHERE t1.a=t2.a;
            

            so for the equality condition t1.a=t2.a if EITS is not available then t1.a and t2.a should be the first component of an index from table t1 and t2 respectively so that we could get the accurate estimate.

            This approach would obviously disallow cases like

             SELECT * from t1,t2 WHERE t1.a=t2.a and t1.b=t2.b;
            

            where lets say the plan is t1,t2 and t2 makes a ref access with index key1(a,b)
            then we have an estimate for the distinct values for (a,b) combined but this will give us accurate estimates for join cardinality.

            varun Varun Gupta (Inactive) added a comment - Considering about part a) , the implementation is restricted to the case The distinct values for a column is known via 1) EITS estimate: when EITS is collected we have the distinct values for the column 2) Column is the first component of an index. Eg: SELECT * from t1,t2 WHERE t1.a=t2.a; so for the equality condition t1.a=t2.a if EITS is not available then t1.a and t2.a should be the first component of an index from table t1 and t2 respectively so that we could get the accurate estimate. This approach would obviously disallow cases like SELECT * from t1,t2 WHERE t1.a=t2.a and t1.b=t2.b; where lets say the plan is t1,t2 and t2 makes a ref access with index key1(a,b) then we have an estimate for the distinct values for (a,b) combined but this will give us accurate estimates for join cardinality.

            The branch where the code is 10.6-mdev22360

            varun Varun Gupta (Inactive) added a comment - The branch where the code is 10.6-mdev22360
            varun Varun Gupta (Inactive) added a comment - - edited

            The approach that we are using to check if the estimates are accurate for join cardinality:

            Lets take a generic example with OR-AND conjuncts

            WHERE clause is defined as: (COND1 AND COND2) OR (COND3 AND (COND4 OR COND5))

            and each CONDx can be an OR-AND conjunct itself.

            So the algorithm to check if the selectivity estimates are accurate is

            Lets denote the WHERE CLAUSE as conds

            if conds is ITEM_COND::AND_ITEM :
                while ((item= li++))
                {
                  if (item->walk(callback_function, &arg))
                    return false;
                }
            else
              conds->walk(callback_function, &arg);
            

            So lets break this down into more details, the condition at the top level can be of 2 cases:

            Case 1: WHERE clause is an AND conjunct

            For an AND item at the top level, we need to walk over all the top level conjuncts and call walk individually on
            them. This is done in such a way because for an AND conjunct at the top level we may have accurate
            selectivity, even if the predicate belongs to a different column.
            Eg: t1.a > 10 and t2.a < 5

            For this AND item we will have accurate selectivities.
            For AND conjuncts (not at the top level), the entire conjunct needs to be resolved to one column.

            Eg: t1.a = t2.a AND ( (t1.a > 5 AND t2.a < 10) OR t1.a <= 0)

            Case 2

            2a) OR item

            For an OR item at the top level, we need to make sure that all the columns inside the OR conjunct need to
            belong to one column directly or indirectly. This needs to happen for an OR conjunct even if it is not at the
            top level.

            Eg: t1.a=t2.b and (t2.b > 5 or t1.a < 0);

            2b) Single predicate at the top level

            Eg:

            • t1.a= t2.a [ For this case we need to make sure we know number of distinct values for t1.a and t2.a ]
            • t1.a > 5 [ sargable predicate, get the estimate from the range optimizer ]
            varun Varun Gupta (Inactive) added a comment - - edited The approach that we are using to check if the estimates are accurate for join cardinality: Lets take a generic example with OR-AND conjuncts WHERE clause is defined as: (COND1 AND COND2) OR (COND3 AND (COND4 OR COND5)) and each CONDx can be an OR-AND conjunct itself. So the algorithm to check if the selectivity estimates are accurate is Lets denote the WHERE CLAUSE as conds if conds is ITEM_COND::AND_ITEM : while ((item= li++)) { if (item->walk(callback_function, &arg)) return false ; } else conds->walk(callback_function, &arg); So lets break this down into more details, the condition at the top level can be of 2 cases: Case 1: WHERE clause is an AND conjunct For an AND item at the top level, we need to walk over all the top level conjuncts and call walk individually on them. This is done in such a way because for an AND conjunct at the top level we may have accurate selectivity, even if the predicate belongs to a different column. Eg: t1.a > 10 and t2.a < 5 For this AND item we will have accurate selectivities. For AND conjuncts (not at the top level), the entire conjunct needs to be resolved to one column. Eg: t1.a = t2.a AND ( (t1.a > 5 AND t2.a < 10) OR t1.a <= 0) Case 2 2a) OR item For an OR item at the top level, we need to make sure that all the columns inside the OR conjunct need to belong to one column directly or indirectly. This needs to happen for an OR conjunct even if it is not at the top level. Eg: t1.a=t2.b and (t2.b > 5 or t1.a < 0); 2b) Single predicate at the top level Eg: t1.a= t2.a [ For this case we need to make sure we know number of distinct values for t1.a and t2.a ] t1.a > 5 [ sargable predicate, get the estimate from the range optimizer ]
            varun Varun Gupta (Inactive) added a comment - - edited

            The latest patch is here
            https://github.com/MariaDB/server/commit/1014b8f4c6ab88304d866edd8e93826744afca5c
            the branch is 10.6-mdev22360.

            varun Varun Gupta (Inactive) added a comment - - edited The latest patch is here https://github.com/MariaDB/server/commit/1014b8f4c6ab88304d866edd8e93826744afca5c the branch is 10.6-mdev22360.

            People

              Johnston Rex Johnston
              igor Igor Babaev (Inactive)
              Votes:
              1 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

                Created:
                Updated:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.