Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-5950

EITS: bad estimate for very skewed distributions

Details

    • Bug
    • Status: Closed (View Workflow)
    • Major
    • Resolution: Fixed
    • 10.0.9
    • 10.0.10
    • None

    Description

      After fix for MDEV-5926, we see a regression for the filtered% column for testcase from MDEV-4350:

      create table t1 (a int);
      insert into t1 values (1), (1);
      insert into t1 select * from t1; 
      insert into t1 select * from t1; 
      insert into t1 select * from t1; 
      insert into t1 select * from t1; 
      insert into t1 select * from t1; 
      insert into t1 select * from t1; 
      insert into t1 select * from t1; 
      insert into t1 select * from t1; 
      insert into t1 select * from t1; 
      insert into t1 values (0);
      set use_stat_tables='preferably';
      set histogram_size=127;
      set histogram_type='SINGLE_PREC_HB';
      analyze table t1;
      flush table t1;
      set optimizer_use_condition_selectivity=4;

      mysql> explain extended select * from t1 where a=0;
      +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------+
      | id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra       |
      +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------+
      |    1 | SIMPLE      | t1    | ALL  | NULL          | NULL | NULL    | NULL | 1025 |    99.22 | Using where |
      +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------+

      filtered% used to be 50%, now it's 99.22%. Both estimates are very wrong:

      mysql> select a,count(*) from  t1 group by a;
      +------+----------+
      | a    | count(*) |
      +------+----------+
      |    0 |        1 |
      |    1 |     1024 |
      +------+----------+

      But the new one is even worse than before.

      Attachments

        Issue Links

          Activity

            psergei Sergei Petrunia created issue -
            psergei Sergei Petrunia made changes -
            Field Original Value New Value
            Labels eits
            psergei Sergei Petrunia made changes -
            Description After fix for MDEV-5926, we see a regression for the filtered% column for testcase from MDEV-4350:

            {noformat}
            create table t1 (a int);
            insert into t1 values (1), (1);
            insert into t1 select * from t1;
            insert into t1 select * from t1;
            insert into t1 select * from t1;
            insert into t1 select * from t1;
            insert into t1 select * from t1;
            insert into t1 select * from t1;
            insert into t1 select * from t1;
            insert into t1 select * from t1;
            insert into t1 select * from t1;
            insert into t1 values (0);
            set use_stat_tables='preferably';
            set histogram_size=127;
            set histogram_type='SINGLE_PREC_HB';
            analyze table t1;
            flush table t1;
            set optimizer_use_condition_selectivity=4;
            {noformat}

            {noformat}
            mysql> explain extended select * from t1 where a=0;
            +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
            +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------+
            | 1 | SIMPLE | t1 | ALL | NULL | NULL | NULL | NULL | 1025 | 99.22 | Using where |
            +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------+
            {noformat}
            After fix for MDEV-5926, we see a regression for the filtered% column for testcase from MDEV-4350:

            {noformat}
            create table t1 (a int);
            insert into t1 values (1), (1);
            insert into t1 select * from t1;
            insert into t1 select * from t1;
            insert into t1 select * from t1;
            insert into t1 select * from t1;
            insert into t1 select * from t1;
            insert into t1 select * from t1;
            insert into t1 select * from t1;
            insert into t1 select * from t1;
            insert into t1 select * from t1;
            insert into t1 values (0);
            set use_stat_tables='preferably';
            set histogram_size=127;
            set histogram_type='SINGLE_PREC_HB';
            analyze table t1;
            flush table t1;
            set optimizer_use_condition_selectivity=4;
            {noformat}

            {noformat}
            mysql> explain extended select * from t1 where a=0;
            +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
            +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------+
            | 1 | SIMPLE | t1 | ALL | NULL | NULL | NULL | NULL | 1025 | 99.22 | Using where |
            +------+-------------+-------+------+---------------+------+---------+------+------+----------+-------------+
            {noformat}

            filtered% used to be 50%, now it's 99.22%. Both estimates are very wrong:

            {noformat}
            mysql> select a,count(*) from t1 group by a;
            +------+----------+
            | a | count(*) |
            +------+----------+
            | 0 | 1 |
            | 1 | 1024 |
            +------+----------+
            {noformat}

            But the new one is even worse than before.

            mysql> select *, hex(histogram) from mysql.column_stats where db_name='j32'\G
            *************************** 1. row ***************************
                   db_name: j32
                table_name: t1
               column_name: a
                 min_value: 0
                 max_value: 1
               nulls_ratio: 0.0000
                avg_length: 4.0000
             avg_frequency: 512.5000
                 hist_size: 127
                 hist_type: SINGLE_PREC_HB
                 histogram: �������������������������������������������������������������������������������������������������������������������������������
            hex(histogram): FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

            So, the histogram shows that all values are "1" (or very close to 1). We should be able to determine that value of 0 (which is equal to minimum) occupies zero buckets, not 99.22% of buckets.

            psergei Sergei Petrunia added a comment - mysql> select *, hex(histogram) from mysql.column_stats where db_name='j32'\G *************************** 1. row *************************** db_name: j32 table_name: t1 column_name: a min_value: 0 max_value: 1 nulls_ratio: 0.0000 avg_length: 4.0000 avg_frequency: 512.5000 hist_size: 127 hist_type: SINGLE_PREC_HB histogram: ������������������������������������������������������������������������������������������������������������������������������� hex(histogram): FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF So, the histogram shows that all values are "1" (or very close to 1). We should be able to determine that value of 0 (which is equal to minimum) occupies zero buckets, not 99.22% of buckets.
            psergei Sergei Petrunia made changes -
            Assignee Igor Babaev [ igor ]
            psergei Sergei Petrunia made changes -
            Assignee Igor Babaev [ igor ] Sergei Petrunia [ psergey ]
            psergei Sergei Petrunia made changes -
            Status Open [ 1 ] In Progress [ 3 ]

            This looks like an edge case, but this is an edge case that I think we could (and should) handle.

            psergei Sergei Petrunia added a comment - This looks like an edge case, but this is an edge case that I think we could (and should) handle.

            Debugging:

            Histogram::point_selectivity (this=0x7fffca8829f8, pos=0, avg_sel=0.5)
            (gdb) print min
            $92 = 0
            (gdb) print max
            $93 = 126
            (gdb) print used_buckets
            $97 = 127
            (gdb) print avg_buckets_per_value
            $95 = 64

            sel= avg_sel * used_buckets / avg_buckets_per_value;
            (gdb) print sel
            $99 = 0.9921875

            Histogram::point_selectivity() is invoked with pos=0. This is correct.
            Things go wrong where we look at the first bucket, see 0xFF as its upper bound, and then assume that 0xFF is close to the value we're making estimates for.
            (We then see how many cells have the same value, and see that all of them have.)

            Proposed solution: when we're looking at the cell, compare its left endpoint with the right endpoint. if our constant is closer to the left endpoint, assume that the cell is unoccupied.

            psergei Sergei Petrunia added a comment - Debugging: Histogram::point_selectivity (this=0x7fffca8829f8, pos=0, avg_sel=0.5) (gdb) print min $92 = 0 (gdb) print max $93 = 126 (gdb) print used_buckets $97 = 127 (gdb) print avg_buckets_per_value $95 = 64 sel= avg_sel * used_buckets / avg_buckets_per_value; (gdb) print sel $99 = 0.9921875 Histogram::point_selectivity() is invoked with pos=0. This is correct. Things go wrong where we look at the first bucket, see 0xFF as its upper bound, and then assume that 0xFF is close to the value we're making estimates for. (We then see how many cells have the same value, and see that all of them have.) Proposed solution: when we're looking at the cell, compare its left endpoint with the right endpoint. if our constant is closer to the left endpoint, assume that the cell is unoccupied.
            psergei Sergei Petrunia made changes -
            Fix Version/s 10.0.10 [ 14500 ]

            The issue is fixed by the new patch for MDEV-5926. There, we have:

            -1 SIMPLE t1 ALL NULL NULL NULL NULL 1025 49.61 Using where
            +1 SIMPLE t1 ALL NULL NULL NULL NULL 1025 0.39 Using where

            i.e. the patch improves the estimation.

            psergei Sergei Petrunia added a comment - The issue is fixed by the new patch for MDEV-5926 . There, we have: -1 SIMPLE t1 ALL NULL NULL NULL NULL 1025 49.61 Using where +1 SIMPLE t1 ALL NULL NULL NULL NULL 1025 0.39 Using where i.e. the patch improves the estimation.
            psergei Sergei Petrunia made changes -
            psergei Sergei Petrunia made changes -
            Resolution Fixed [ 1 ]
            Status In Progress [ 3 ] Closed [ 6 ]
            serg Sergei Golubchik made changes -
            Workflow defaullt [ 37401 ] MariaDB v2 [ 42805 ]
            ratzpo Rasmus Johansson (Inactive) made changes -
            Workflow MariaDB v2 [ 42805 ] MariaDB v3 [ 62515 ]
            serg Sergei Golubchik made changes -
            Workflow MariaDB v3 [ 62515 ] MariaDB v4 [ 147696 ]

            People

              psergei Sergei Petrunia
              psergei Sergei Petrunia
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.