Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-5985

EITS: selectivity estimates look illogical for join and non-key equalities

Details

    • Bug
    • Status: Closed (View Workflow)
    • Critical
    • Resolution: Fixed
    • 10.0.9
    • 10.0.12
    • None

    Description

      Create a dataset:

      create table ten(a int);
      insert into ten values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
      create table one_k(a int);
      insert into one_k select A.a + B.a* 10 + C.a * 100 from ten A, ten B, ten C;
      create table one_k2 as select * from one_k;
      set histogram_size=100;
      set use_stat_tables='preferably';
      set optimizer_use_condition_selectivity=4;
      analyze table one_k persistent for all;
      analyze table one_k2 persistent for all;

      Let's see what histograms give us

      explain extended select * from one_k A where A.a < 40;
      +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------+
      | id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra       |
      +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------+
      |    1 | SIMPLE      | A     | ALL  | NULL          | NULL | NULL    | NULL | 1000 |     4.95 | Using where |
      +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------+

      explain extended select * from one_k2 B where B.a < 100;
      +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------+
      | id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra       |
      +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------+
      |    1 | SIMPLE      | B     | ALL  | NULL          | NULL | NULL    | NULL | 1000 |     9.90 | Using where |
      +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------+

      The numbers looks ok.

      Now, let's try a join:

      explain extended select * from one_k A, one_k2 B where A.a < 40 and B.a < 100;
      +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------------------------------------------+
      | id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra                                           |
      +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------------------------------------------+
      |    1 | SIMPLE      | A     | ALL  | NULL          | NULL | NULL    | NULL | 1000 |     4.95 | Using where                                     |
      |    1 | SIMPLE      | B     | ALL  | NULL          | NULL | NULL    | NULL | 1000 |     9.90 | Using where; Using join buffer (flat, BNL join) |
      +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------------------------------------------+

      Looks ok. Now, let's add a condition.

      explain extended select * from one_k A, one_k2 B where A.a < 40 and B.a < 100 and B.a=A.a;
      +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------------------------------------------+
      | id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra                                           |
      +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------------------------------------------+
      |    1 | SIMPLE      | A     | ALL  | NULL          | NULL | NULL    | NULL | 1000 |     4.95 | Using where                                     |
      |    1 | SIMPLE      | B     | ALL  | NULL          | NULL | NULL    | NULL | 1000 |   100.00 | Using where; Using join buffer (flat, BNL join) |
      +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------------------------------------------+

      And filtered% becomes 100%. It used to be 9.90%, we have added another condition into WHERE and now the optimizer expects the condition to be less selective! This looks wrong.

      Attachments

        Issue Links

          Activity

            psergei Sergei Petrunia created issue -
            psergei Sergei Petrunia made changes -
            Field Original Value New Value
            Description Create a dataset:
            {noformat}
            create table ten(a int);
            insert into ten values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
            create table one_k(a int);
            insert into one_k select A.a + B.a* 10 + C.a * 100 from ten A, ten B, ten C;
            create table one_k2 as select * from one_k;
            set histogram_size=100;
            set use_stat_tables='preferably';
            set optimizer_use_condition_selectivity=4;
            analyze table one_k persistent for all;
            analyze table one_k2 persistent for all;
            {noformat}

            Let's see what histograms give us
            }
            explain extended select * from one_k A where A.a < 40;
            +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
            +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------+
            | 1 | SIMPLE | A | ALL | NULL | NULL | NULL | NULL | 1000 | 4.95 | Using where |
            +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------+
            {noformat}

            {noformat}
            explain extended select * from one_k2 B where B.a < 100;
            +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
            +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------+
            | 1 | SIMPLE | B | ALL | NULL | NULL | NULL | NULL | 1000 | 9.90 | Using where |
            +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------+
            {noformat}

            {noformat}
            explain extended select * from one_k A, one_k2 B where A.a < 40 and B.a < 100;
            +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------------------------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
            +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------------------------------------------+
            | 1 | SIMPLE | A | ALL | NULL | NULL | NULL | NULL | 1000 | 4.95 | Using where |
            | 1 | SIMPLE | B | ALL | NULL | NULL | NULL | NULL | 1000 | 9.90 | Using where; Using join buffer (flat, BNL join) |
            +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------------------------------------------+
            {noformat}

            {noformat}
            explain extended select * from one_k A, one_k2 B where A.a < 40 and B.a < 100 and B.a=A.a;
            +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------------------------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
            +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------------------------------------------+
            | 1 | SIMPLE | A | ALL | NULL | NULL | NULL | NULL | 1000 | 4.95 | Using where |
            | 1 | SIMPLE | B | ALL | NULL | NULL | NULL | NULL | 1000 | 100.00 | Using where; Using join buffer (flat, BNL join) |
            +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------------------------------------------+
            {noformat}
            Create a dataset:
            {noformat}
            create table ten(a int);
            insert into ten values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
            create table one_k(a int);
            insert into one_k select A.a + B.a* 10 + C.a * 100 from ten A, ten B, ten C;
            create table one_k2 as select * from one_k;
            set histogram_size=100;
            set use_stat_tables='preferably';
            set optimizer_use_condition_selectivity=4;
            analyze table one_k persistent for all;
            analyze table one_k2 persistent for all;
            {noformat}

            Let's see what histograms give us
            {noformat}
            explain extended select * from one_k A where A.a < 40;
            +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
            +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------+
            | 1 | SIMPLE | A | ALL | NULL | NULL | NULL | NULL | 1000 | 4.95 | Using where |
            +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------+
            {noformat}

            {noformat}
            explain extended select * from one_k2 B where B.a < 100;
            +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
            +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------+
            | 1 | SIMPLE | B | ALL | NULL | NULL | NULL | NULL | 1000 | 9.90 | Using where |
            +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------+
            {noformat}
            The numbers looks ok.

            Now, let's try a join:
            {noformat}
            explain extended select * from one_k A, one_k2 B where A.a < 40 and B.a < 100;
            +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------------------------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
            +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------------------------------------------+
            | 1 | SIMPLE | A | ALL | NULL | NULL | NULL | NULL | 1000 | 4.95 | Using where |
            | 1 | SIMPLE | B | ALL | NULL | NULL | NULL | NULL | 1000 | 9.90 | Using where; Using join buffer (flat, BNL join) |
            +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------------------------------------------+
            {noformat}
            Looks ok. Now, let's add a condition.

            {noformat}
            explain extended select * from one_k A, one_k2 B where A.a < 40 and B.a < 100 and B.a=A.a;
            +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------------------------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
            +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------------------------------------------+
            | 1 | SIMPLE | A | ALL | NULL | NULL | NULL | NULL | 1000 | 4.95 | Using where |
            | 1 | SIMPLE | B | ALL | NULL | NULL | NULL | NULL | 1000 | 100.00 | Using where; Using join buffer (flat, BNL join) |
            +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------------------------------------------+
            {noformat}

            And filtered% becomes 100%. It used to be 9.90%, we have added another condition into WHERE and now the optimizer expects the condition to be less selective! This looks wrong.
            psergei Sergei Petrunia made changes -
            Assignee Igor Babaev [ igor ]
            psergei Sergei Petrunia made changes -
            Assignee Igor Babaev [ igor ] Sergei Petrunia [ psergey ]
            psergei Sergei Petrunia made changes -
            Status Open [ 1 ] In Progress [ 3 ]

            Look at the code near this comment:

                If the field f from the table is equal to a field from one the
                earlier joined tables then the selectivity of the range conditions
                over the field f must be discounted.

            Maybe, this is the cause of this bug?

            psergei Sergei Petrunia added a comment - Look at the code near this comment: If the field f from the table is equal to a field from one the earlier joined tables then the selectivity of the range conditions over the field f must be discounted. Maybe, this is the cause of this bug?
            psergei Sergei Petrunia made changes -
            Fix Version/s 10.0.11 [ 15200 ]

            Patch submitted for review

            psergei Sergei Petrunia added a comment - Patch submitted for review

            Summary of discussion with igor earlier this week:

            The committed patch doesn't handle all cases. When the multi-equality is a part
            of ref access, its selectivity is counted twice. Example:

            create table one_k3 (a int, b int, key(a));
            insert into one_k3 select A.a + 10*B.a, 12345 from ten A, ten B, ten C;
            analyze table one_k3 persistent for all;

            The query plan will be:

            explain extended select * from one_k2 straight_join one_k3 where one_k2.a=one_k3.a;
            +------+-------------+--------+------+---------------+------+---------+--------------+------+----------+-------------+
            | id   | select_type | table  | type | possible_keys | key  | key_len | ref          | rows | filtered | Extra       |
            +------+-------------+--------+------+---------------+------+---------+--------------+------+----------+-------------+
            |    1 | SIMPLE      | one_k2 | ALL  | NULL          | NULL | NULL    | NULL         | 1000 |   100.00 | Using where |
            |    1 | SIMPLE      | one_k3 | ref  | a             | a    | 5       | j22.one_k2.a |   10 |   100.00 |             |
            +------+-------------+--------+------+---------------+------+---------+--------------+------+----------+-------------+

            Now, add a condition:

            explain extended select * from one_k2 straight_join one_k3 where one_k2.a=one_k3.a and one_k2.a<10;
            +------+-------------+--------+------+---------------+------+---------+--------------+------+----------+-------------+
            | id   | select_type | table  | type | possible_keys | key  | key_len | ref          | rows | filtered | Extra       |
            +------+-------------+--------+------+---------------+------+---------+--------------+------+----------+-------------+
            |    1 | SIMPLE      | one_k2 | ALL  | NULL          | NULL | NULL    | NULL         | 1000 |     0.99 | Using where |
            |    1 | SIMPLE      | one_k3 | ref  | a             | a    | 5       | j22.one_k2.a |   10 |     9.90 |             |
            +------+-------------+--------+------+---------------+------+---------+--------------+------+----------+-------------+

            one_k2 has values 1...1000, selectivity is 10/1000 = 1%.
            one_k3 has values 1...100. The shown selectivity is 10%.

            However, we fail to take into account that ref access only gives us rows that
            have one_k2.a=one_k3.a < 10. Because of that, the inferred condition
            "one_k3.a < 10" will always be true, and we should have filtered=100%.

            psergei Sergei Petrunia added a comment - Summary of discussion with igor earlier this week: The committed patch doesn't handle all cases. When the multi-equality is a part of ref access, its selectivity is counted twice. Example: create table one_k3 (a int, b int, key(a)); insert into one_k3 select A.a + 10*B.a, 12345 from ten A, ten B, ten C; analyze table one_k3 persistent for all; The query plan will be: explain extended select * from one_k2 straight_join one_k3 where one_k2.a=one_k3.a; +------+-------------+--------+------+---------------+------+---------+--------------+------+----------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +------+-------------+--------+------+---------------+------+---------+--------------+------+----------+-------------+ | 1 | SIMPLE | one_k2 | ALL | NULL | NULL | NULL | NULL | 1000 | 100.00 | Using where | | 1 | SIMPLE | one_k3 | ref | a | a | 5 | j22.one_k2.a | 10 | 100.00 | | +------+-------------+--------+------+---------------+------+---------+--------------+------+----------+-------------+ Now, add a condition: explain extended select * from one_k2 straight_join one_k3 where one_k2.a=one_k3.a and one_k2.a<10; +------+-------------+--------+------+---------------+------+---------+--------------+------+----------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +------+-------------+--------+------+---------------+------+---------+--------------+------+----------+-------------+ | 1 | SIMPLE | one_k2 | ALL | NULL | NULL | NULL | NULL | 1000 | 0.99 | Using where | | 1 | SIMPLE | one_k3 | ref | a | a | 5 | j22.one_k2.a | 10 | 9.90 | | +------+-------------+--------+------+---------------+------+---------+--------------+------+----------+-------------+ one_k2 has values 1...1000, selectivity is 10/1000 = 1%. one_k3 has values 1...100. The shown selectivity is 10%. However, we fail to take into account that ref access only gives us rows that have one_k2.a=one_k3.a < 10. Because of that, the inferred condition "one_k3.a < 10" will always be true, and we should have filtered=100%.

            About the initial example (the one not using ref access):

            Debugging table_cond_selectivity(B)
            Table B uses full table scan (i.e. not ref acces).
            This means, sel=1.

            Then, we go into the part of code that starts with
            " If the field f from the table is equal to a field from one the..."

            We find that B.a is a member in multiple_equality(B.a, A.a).
            (B.a)->cond_selectivity=0.049, we execute

            if (field->cond_selectivity > 0)
            sel/= field->cond_selectivity;

            and get
            sel=20.19

            psergei Sergei Petrunia added a comment - About the initial example (the one not using ref access): Debugging table_cond_selectivity(B) Table B uses full table scan (i.e. not ref acces). This means, sel=1. Then, we go into the part of code that starts with " If the field f from the table is equal to a field from one the..." We find that B.a is a member in multiple_equality(B.a, A.a). (B.a)->cond_selectivity=0.049, we execute if (field->cond_selectivity > 0) sel/= field->cond_selectivity; and get sel=20.19

            So.

            1. When using full table scan, we must not try to discount field->cond_selectivity
            from "sel". With full table scan, we start with sel=1, that is,
            field->cond_selectivity was not in "sel" (it was dealt with in
            matching_cond_for_table)

            2. When using ref access, there are two possibilities:

            2.1 A member of multiple-equality is a part of ref_key. In this case, the
            condition for which field->cond_selectivity was produced (further $COND) will
            be pulled to some preceding table and checked there. Use of ref access on
            this table will guarantee that we will only get rows that already satisfy
            $COND.

            2.2 No member of multiple equality is a part of ref_key.
            In this case, field->cond_selectivity must not be discounted.

            psergei Sergei Petrunia added a comment - So. 1. When using full table scan, we must not try to discount field->cond_selectivity from "sel". With full table scan, we start with sel=1, that is, field->cond_selectivity was not in "sel" (it was dealt with in matching_cond_for_table) 2. When using ref access, there are two possibilities: 2.1 A member of multiple-equality is a part of ref_key. In this case, the condition for which field->cond_selectivity was produced (further $COND) will be pulled to some preceding table and checked there. Use of ref access on this table will guarantee that we will only get rows that already satisfy $COND. 2.2 No member of multiple equality is a part of ref_key. In this case, field->cond_selectivity must not be discounted.

            Debugging the example in the comment (with one_k3 table), I see another problem:

            The code tries to discount for selectivity of "one_k3.a < 3". It reaches the lines

                      if (field->cond_selectivity > 0)
                        sel/= field->cond_selectivity;

            However, field->cond_selectivity==1, because selectivity was obtained using range optimization.

            psergei Sergei Petrunia added a comment - Debugging the example in the comment (with one_k3 table), I see another problem: The code tries to discount for selectivity of "one_k3.a < 3". It reaches the lines if (field->cond_selectivity > 0) sel/= field->cond_selectivity; However, field->cond_selectivity==1, because selectivity was obtained using range optimization.
            serg Sergei Golubchik made changes -
            Fix Version/s 10.0.12 [ 15201 ]
            Fix Version/s 10.0.11 [ 15200 ]
            psergei Sergei Petrunia made changes -
            Priority Major [ 3 ] Critical [ 2 ]
            psergei Sergei Petrunia made changes -

            Current EITS optimization code has multiple problems. I will not try to fix them all at once.

            Pushed the fix for the problem demonstrated by the example in this issue. The issue demonstrated in the comment dated 2014-04-25 20:29 is branched off into MDEV-6325.

            psergei Sergei Petrunia added a comment - Current EITS optimization code has multiple problems. I will not try to fix them all at once. Pushed the fix for the problem demonstrated by the example in this issue. The issue demonstrated in the comment dated 2014-04-25 20:29 is branched off into MDEV-6325 .
            psergei Sergei Petrunia made changes -
            Status In Progress [ 3 ] Stalled [ 10000 ]
            psergei Sergei Petrunia made changes -
            Resolution Fixed [ 1 ]
            Status Stalled [ 10000 ] Closed [ 6 ]
            serg Sergei Golubchik made changes -
            Workflow defaullt [ 37900 ] MariaDB v2 [ 43827 ]
            ratzpo Rasmus Johansson (Inactive) made changes -
            Workflow MariaDB v2 [ 43827 ] MariaDB v3 [ 63026 ]
            serg Sergei Golubchik made changes -
            Workflow MariaDB v3 [ 63026 ] MariaDB v4 [ 147723 ]

            People

              psergei Sergei Petrunia
              psergei Sergei Petrunia
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.