Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-4364

Histograms show the same selectivity for col=rare_value and col=frequent_value

Details

    • Bug
    • Status: Closed (View Workflow)
    • Major
    • Resolution: Not a Bug
    • None
    • 10.0.10
    • None

    Description

      Create the dataset as specified by MDEV-4363.

      MariaDB [j10]> set use_stat_tables='preferably';
      Query OK, 0 rows affected (0.00 sec)
       
      MariaDB [j10]> set optimizer_use_condition_selectivity=4;
      Query OK, 0 rows affected, 1 warning (0.00 sec)

      The point of this test is to check skewed data distributions.
      The value of 178 is very frequent:

      MariaDB [j10]> select (select  count(*) from t5 where col2 = 178 ) /(select  count(*) from t5);
      +--------------------------------------------------------------------------+
      | (select  count(*) from t5 where col2 = 178 ) /(select  count(*) from t5) |
      +--------------------------------------------------------------------------+
      |                                                                   0.3301 |
      +--------------------------------------------------------------------------+
      1 row in set (0.18 sec)

      What does EXPLAIN think:

      MariaDB [j10]> explain extended select count(*) from t5 where col2 = 178 ;
      +------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+
      | id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows  | filtered | Extra       |
      +------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+
      |    1 | SIMPLE      | t5    | ALL  | NULL          | NULL | NULL    | NULL | 10000 |     3.60 | Using where |
      +------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+
      1 row in set, 1 warning (0.00 sec)

      it thinks that selectivity is 3.6%.

      If i try it with a regular, rare value:

       
      MariaDB [j10]> select (select  count(*) from t5 where col2 = 179 ) /(select  count(*) from t5);
      +--------------------------------------------------------------------------+
      | (select  count(*) from t5 where col2 = 179 ) /(select  count(*) from t5) |
      +--------------------------------------------------------------------------+
      |                                                                   0.0001 |
      +--------------------------------------------------------------------------+
      1 row in set (0.20 sec)
       
      MariaDB [j10]> explain extended select count(*) from t5 where col2 = 179 ;
      +------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+
      | id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows  | filtered | Extra       |
      +------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+
      |    1 | SIMPLE      | t5    | ALL  | NULL          | NULL | NULL    | NULL | 10000 |     3.60 | Using where |
      +------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+
      1 row in set, 1 warning (0.00 sec)

      It looks like histograms do not allow to distinguish between a frequent and a rare value?

      Attachments

        Issue Links

          Activity

            psergei Sergei Petrunia added a comment - - edited

            One can have an example with non "RQG data", also:

            MariaDB [dbt3sf10]> select  c_mktsegment,  count(*)/@total from customer group by c_mktsegment;
            +--------------+-----------------+
            | c_mktsegment | count(*)/@total |
            +--------------+-----------------+
            | AUTOMOBILE   |          0.2000 |
            | BUILDING     |          0.2002 |
            | FURNITURE    |          0.1997 |
            | HOUSEHOLD    |          0.1998 |
            | MACHINERY    |          0.2003 |
            +--------------+-----------------+
            5 rows in set (0.64 sec)

            MariaDB [dbt3sf10]> explain extended select count(*) from customer where c_mktsegment ='AUTOMOBILE';
            +------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
            | id   | select_type | table    | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra       |
            +------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
            |    1 | SIMPLE      | customer | ALL  | NULL          | NULL | NULL    | NULL | 1500000 |   100.00 | Using where |
            +------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+

            MariaDB [dbt3sf10]> explain extended select count(*) from customer where c_mktsegment ='BUILDING';
            +------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
            | id   | select_type | table    | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra       |
            +------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
            |    1 | SIMPLE      | customer | ALL  | NULL          | NULL | NULL    | NULL | 1500000 |    48.33 | Using where |
            +------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+

            MariaDB [dbt3sf10]> explain extended select count(*) from customer where c_mktsegment ='FURNITURE';
            +------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
            | id   | select_type | table    | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra       |
            +------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
            |    1 | SIMPLE      | customer | ALL  | NULL          | NULL | NULL    | NULL | 1500000 |    11.94 | Using where |
            +------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+

            MariaDB [dbt3sf10]> explain extended select count(*) from customer where c_mktsegment ='HOUSEHOLD';
            +------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
            | id   | select_type | table    | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra       |
            +------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
            |    1 | SIMPLE      | customer | ALL  | NULL          | NULL | NULL    | NULL | 1500000 |    23.60 | Using where |
            +------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+

            MariaDB [dbt3sf10]> explain extended select count(*) from customer where c_mktsegment ='MACHINERY';
            +------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
            | id   | select_type | table    | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra       |
            +------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
            |    1 | SIMPLE      | customer | ALL  | NULL          | NULL | NULL    | NULL | 1500000 |     9.57 | Using where |
            +------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+

            MariaDB [dbt3sf10]> select *, hex(histogram) from mysql.column_stats where column_name='c_mktsegment' and table_name='customer'\G
            *************************** 1. row ***************************
                   db_name: dbt3sf10
                table_name: customer
               column_name: c_mktsegment
                 min_value: AUTOMOBILE
                 max_value: MACHINERY
               nulls_ratio: 0.0000
                avg_length: 8.9998
             avg_frequency: 300000.0000
                 hist_size: 200
                 hist_type: 
                 histogram:                                         ����������������������������������������jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj��������������������������������������������������������������������������������
            hex(histogram): 00000000000000000000000000000000000000000000000000000000000000000000000000000000151515151515151515151515151515151515151515151515151515151515151515151515151515156A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A95959595959595959595959595959595959595959595959595959595959595959595959595959595FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

            psergei Sergei Petrunia added a comment - - edited One can have an example with non "RQG data", also: MariaDB [dbt3sf10]> select c_mktsegment, count(*)/@total from customer group by c_mktsegment; +--------------+-----------------+ | c_mktsegment | count(*)/@total | +--------------+-----------------+ | AUTOMOBILE | 0.2000 | | BUILDING | 0.2002 | | FURNITURE | 0.1997 | | HOUSEHOLD | 0.1998 | | MACHINERY | 0.2003 | +--------------+-----------------+ 5 rows in set (0.64 sec) MariaDB [dbt3sf10]> explain extended select count(*) from customer where c_mktsegment ='AUTOMOBILE'; +------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+ | 1 | SIMPLE | customer | ALL | NULL | NULL | NULL | NULL | 1500000 | 100.00 | Using where | +------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+ MariaDB [dbt3sf10]> explain extended select count(*) from customer where c_mktsegment ='BUILDING'; +------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+ | 1 | SIMPLE | customer | ALL | NULL | NULL | NULL | NULL | 1500000 | 48.33 | Using where | +------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+ MariaDB [dbt3sf10]> explain extended select count(*) from customer where c_mktsegment ='FURNITURE'; +------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+ | 1 | SIMPLE | customer | ALL | NULL | NULL | NULL | NULL | 1500000 | 11.94 | Using where | +------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+ MariaDB [dbt3sf10]> explain extended select count(*) from customer where c_mktsegment ='HOUSEHOLD'; +------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+ | 1 | SIMPLE | customer | ALL | NULL | NULL | NULL | NULL | 1500000 | 23.60 | Using where | +------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+ MariaDB [dbt3sf10]> explain extended select count(*) from customer where c_mktsegment ='MACHINERY'; +------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+ | 1 | SIMPLE | customer | ALL | NULL | NULL | NULL | NULL | 1500000 | 9.57 | Using where | +------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+ MariaDB [dbt3sf10]> select *, hex(histogram) from mysql.column_stats where column_name='c_mktsegment' and table_name='customer'\G *************************** 1. row *************************** db_name: dbt3sf10 table_name: customer column_name: c_mktsegment min_value: AUTOMOBILE max_value: MACHINERY nulls_ratio: 0.0000 avg_length: 8.9998 avg_frequency: 300000.0000 hist_size: 200 hist_type: histogram: ����������������������������������������jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj�������������������������������������������������������������������������������� hex(histogram): 00000000000000000000000000000000000000000000000000000000000000000000000000000000151515151515151515151515151515151515151515151515151515151515151515151515151515156A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A95959595959595959595959595959595959595959595959595959595959595959595959595959595FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

            Note that: for
            c_mktsegment ='AUTOMOBILE'

            selectivity is 100%.

            histogram shows 'AUTOMOBILE' is the minimum value.

            Histogram has 200 buckets. first 40 of which are at 0%.. The remaining 160 buckets start at 0x15=20%.

            psergei Sergei Petrunia added a comment - Note that: for c_mktsegment ='AUTOMOBILE' selectivity is 100%. histogram shows 'AUTOMOBILE' is the minimum value. Histogram has 200 buckets. first 40 of which are at 0%.. The remaining 160 buckets start at 0x15=20%.

            Trying with the latest code, revid:psergey@askmonty.org-20140327090800-l3je8eqjrfx21iki. Selectivity for all above queries like

            MariaDB [dbt3sf10]> explain extended select count(*) from customer where c_mktsegment ='HOUSEHOLD';

            is 19.90%. THis is good.

            One can also see that

            mysql> explain extended select count(*) from customer where c_mktsegment ='NO-SUCH-MARKET';
            +------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
            | id   | select_type | table    | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra       |
            +------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
            |    1 | SIMPLE      | customer | ALL  | NULL          | NULL | NULL    | NULL | 1500000 |    19.90 | Using where |
            +------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+

            is still 19.90, but this seems to be a limitation of this type of histogram.

            psergei Sergei Petrunia added a comment - Trying with the latest code, revid:psergey@askmonty.org-20140327090800-l3je8eqjrfx21iki. Selectivity for all above queries like MariaDB [dbt3sf10]> explain extended select count(*) from customer where c_mktsegment ='HOUSEHOLD'; is 19.90%. THis is good. One can also see that mysql> explain extended select count(*) from customer where c_mktsegment ='NO-SUCH-MARKET'; +------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+ | 1 | SIMPLE | customer | ALL | NULL | NULL | NULL | NULL | 1500000 | 19.90 | Using where | +------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+ is still 19.90, but this seems to be a limitation of this type of histogram.
            psergei Sergei Petrunia added a comment - - edited

            After MDEV-5962 was fixed:

            mysql> explain extended select count(*) from customer where c_mktsegment ='ZZZZZZZZZZ';
            +------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
            | id   | select_type | table    | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra       |
            +------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
            |    1 | SIMPLE      | customer | ALL  | NULL          | NULL | NULL    | NULL | 1500000 |    19.90 | Using where |
            +------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+

            mysql> explain extended select count(*) from customer where c_mktsegment ='DDDDDDD';
            +------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
            | id   | select_type | table    | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra       |
            +------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
            |    1 | SIMPLE      | customer | ALL  | NULL          | NULL | NULL    | NULL | 1500000 |     0.30 | Using where |
            +------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+

            That is, we will get incorrect estimates for values that

            • are not within the min...max range
            • are sufficiently close to the popular values (so much that they have the same position. value position has precision of 1/256th or 1/65536th for SINGLE/DOUBLE_PREC_HB).

            for other non-existent values, we're able to return low selecitivity.

            psergei Sergei Petrunia added a comment - - edited After MDEV-5962 was fixed: mysql> explain extended select count(*) from customer where c_mktsegment ='ZZZZZZZZZZ'; +------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+ | 1 | SIMPLE | customer | ALL | NULL | NULL | NULL | NULL | 1500000 | 19.90 | Using where | +------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+ mysql> explain extended select count(*) from customer where c_mktsegment ='DDDDDDD'; +------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+ | 1 | SIMPLE | customer | ALL | NULL | NULL | NULL | NULL | 1500000 | 0.30 | Using where | +------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+ That is, we will get incorrect estimates for values that are not within the min...max range are sufficiently close to the popular values (so much that they have the same position. value position has precision of 1/256th or 1/65536th for SINGLE/DOUBLE_PREC_HB). for other non-existent values, we're able to return low selecitivity.

            As for the original testcase:
            if I run

            [explain extended] select count(*), from t5 where col2=$i;

            I get this:

            'col1=...	#rows	filtered%
            160	1	0.01
            161	1	0.01
            162	1	0.01
            163	1	0.01
            164	1	0.01
            165	1	0.01
            166	1	0.01
            167	1	0.01
            168	1	0.01
            169	1	0.01
            170	1	0.01
            171	1	0.01
            172	1	0.01
            173	1	0.01
            174	1	0.01
            175	1	32.5
            176	1	32.5
            177	1	32.5
            178	3301	32.5
            179	1	32.5
            180	1	32.5
            181	1	32.5
            182	1	32.5
            183	1	32.5
            184	1	0.07
            185	1	0.07
            186	1	0.07
            187	1	0.07
            188	1	0.07
            189	1	0.07
            190	1	0.07

            That is, big number of values for value of 178 spreads over to its neighbors. One could expect that.

            psergei Sergei Petrunia added a comment - As for the original testcase: if I run [explain extended] select count(*), from t5 where col2=$i; I get this: 'col1=... #rows filtered% 160 1 0.01 161 1 0.01 162 1 0.01 163 1 0.01 164 1 0.01 165 1 0.01 166 1 0.01 167 1 0.01 168 1 0.01 169 1 0.01 170 1 0.01 171 1 0.01 172 1 0.01 173 1 0.01 174 1 0.01 175 1 32.5 176 1 32.5 177 1 32.5 178 3301 32.5 179 1 32.5 180 1 32.5 181 1 32.5 182 1 32.5 183 1 32.5 184 1 0.07 185 1 0.07 186 1 0.07 187 1 0.07 188 1 0.07 189 1 0.07 190 1 0.07 That is, big number of values for value of 178 spreads over to its neighbors. One could expect that.

            If one changes the histogram to use DOUBLE_PREC_HB, then the situation improves. 30% is only returned for the col2=178:

            const   #rows    filtered%       filtered%,
                           single_prec_hb   double_prec_hb
            160     1         0.01             0.01
            161     1         0.01             0.01
            162     1         0.01             0.01
            163     1         0.01             0.01
            164     1         0.01             0.01
            165     1         0.01             0.01
            166     1         0.01             0.01
            167     1         0.01             0.01
            168     1         0.01             0.01
            169     1         0.01             0.01
            170     1         0.01             0.01
            171     1         0.01             0.01
            172     1         0.01             0.01
            173     1         0.01             0.11
            174     1         0.01             0.11
            175     1         32.5             0.11
            176     1         32.5             0.11
            177     1         32.5             0.11
            178     3301      32.5            32.50
            179     1         32.5             0.06
            180     1         32.5             0.06
            181     1         32.5             0.06
            182     1         32.5             0.06
            183     1         32.5             0.06
            184     1         0.07             0.06
            185     1         0.07             0.06
            186     1         0.07             0.06
            187     1         0.07             0.06
            188     1         0.07             0.06
            189     1         0.07             0.06
            190     1         0.07             0.01

            psergei Sergei Petrunia added a comment - If one changes the histogram to use DOUBLE_PREC_HB, then the situation improves. 30% is only returned for the col2=178: const #rows filtered% filtered%, single_prec_hb double_prec_hb 160 1 0.01 0.01 161 1 0.01 0.01 162 1 0.01 0.01 163 1 0.01 0.01 164 1 0.01 0.01 165 1 0.01 0.01 166 1 0.01 0.01 167 1 0.01 0.01 168 1 0.01 0.01 169 1 0.01 0.01 170 1 0.01 0.01 171 1 0.01 0.01 172 1 0.01 0.01 173 1 0.01 0.11 174 1 0.01 0.11 175 1 32.5 0.11 176 1 32.5 0.11 177 1 32.5 0.11 178 3301 32.5 32.50 179 1 32.5 0.06 180 1 32.5 0.06 181 1 32.5 0.06 182 1 32.5 0.06 183 1 32.5 0.06 184 1 0.07 0.06 185 1 0.07 0.06 186 1 0.07 0.06 187 1 0.07 0.06 188 1 0.07 0.06 189 1 0.07 0.06 190 1 0.07 0.01

            Closing as this is inherent problem of this type of histogram.

            psergei Sergei Petrunia added a comment - Closing as this is inherent problem of this type of histogram.

            People

              psergei Sergei Petrunia
              psergei Sergei Petrunia
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.