[MDEV-4364] Histograms show the same selectivity for col=rare_value and col=frequent_value Created: 2013-04-03  Updated: 2014-03-27  Resolved: 2014-03-27

Status: Closed
Project: MariaDB Server
Component/s: None
Affects Version/s: None
Fix Version/s: 10.0.10

Type: Bug Priority: Major
Reporter: Sergei Petrunia Assignee: Sergei Petrunia
Resolution: Not a Bug Votes: 0
Labels: eits, mwl#253

Issue Links:
Relates
relates to MDEV-4145 Take into account the selectivity of ... Closed

 Description   

Create the dataset as specified by MDEV-4363.

MariaDB [j10]> set use_stat_tables='preferably';
Query OK, 0 rows affected (0.00 sec)
 
MariaDB [j10]> set optimizer_use_condition_selectivity=4;
Query OK, 0 rows affected, 1 warning (0.00 sec)

The point of this test is to check skewed data distributions.
The value of 178 is very frequent:

MariaDB [j10]> select (select  count(*) from t5 where col2 = 178 ) /(select  count(*) from t5);
+--------------------------------------------------------------------------+
| (select  count(*) from t5 where col2 = 178 ) /(select  count(*) from t5) |
+--------------------------------------------------------------------------+
|                                                                   0.3301 |
+--------------------------------------------------------------------------+
1 row in set (0.18 sec)

What does EXPLAIN think:

MariaDB [j10]> explain extended select count(*) from t5 where col2 = 178 ;
+------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+
| id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows  | filtered | Extra       |
+------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+
|    1 | SIMPLE      | t5    | ALL  | NULL          | NULL | NULL    | NULL | 10000 |     3.60 | Using where |
+------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+
1 row in set, 1 warning (0.00 sec)

it thinks that selectivity is 3.6%.

If i try it with a regular, rare value:

 
MariaDB [j10]> select (select  count(*) from t5 where col2 = 179 ) /(select  count(*) from t5);
+--------------------------------------------------------------------------+
| (select  count(*) from t5 where col2 = 179 ) /(select  count(*) from t5) |
+--------------------------------------------------------------------------+
|                                                                   0.0001 |
+--------------------------------------------------------------------------+
1 row in set (0.20 sec)
 
MariaDB [j10]> explain extended select count(*) from t5 where col2 = 179 ;
+------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+
| id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows  | filtered | Extra       |
+------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+
|    1 | SIMPLE      | t5    | ALL  | NULL          | NULL | NULL    | NULL | 10000 |     3.60 | Using where |
+------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+
1 row in set, 1 warning (0.00 sec)

It looks like histograms do not allow to distinguish between a frequent and a rare value?



 Comments   
Comment by Sergei Petrunia [ 2013-04-04 ]

One can have an example with non "RQG data", also:

MariaDB [dbt3sf10]> select  c_mktsegment,  count(*)/@total from customer group by c_mktsegment;
+--------------+-----------------+
| c_mktsegment | count(*)/@total |
+--------------+-----------------+
| AUTOMOBILE   |          0.2000 |
| BUILDING     |          0.2002 |
| FURNITURE    |          0.1997 |
| HOUSEHOLD    |          0.1998 |
| MACHINERY    |          0.2003 |
+--------------+-----------------+
5 rows in set (0.64 sec)

MariaDB [dbt3sf10]> explain extended select count(*) from customer where c_mktsegment ='AUTOMOBILE';
+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
| id   | select_type | table    | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra       |
+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
|    1 | SIMPLE      | customer | ALL  | NULL          | NULL | NULL    | NULL | 1500000 |   100.00 | Using where |
+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+

MariaDB [dbt3sf10]> explain extended select count(*) from customer where c_mktsegment ='BUILDING';
+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
| id   | select_type | table    | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra       |
+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
|    1 | SIMPLE      | customer | ALL  | NULL          | NULL | NULL    | NULL | 1500000 |    48.33 | Using where |
+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+

MariaDB [dbt3sf10]> explain extended select count(*) from customer where c_mktsegment ='FURNITURE';
+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
| id   | select_type | table    | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra       |
+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
|    1 | SIMPLE      | customer | ALL  | NULL          | NULL | NULL    | NULL | 1500000 |    11.94 | Using where |
+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+

MariaDB [dbt3sf10]> explain extended select count(*) from customer where c_mktsegment ='HOUSEHOLD';
+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
| id   | select_type | table    | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra       |
+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
|    1 | SIMPLE      | customer | ALL  | NULL          | NULL | NULL    | NULL | 1500000 |    23.60 | Using where |
+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+

MariaDB [dbt3sf10]> explain extended select count(*) from customer where c_mktsegment ='MACHINERY';
+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
| id   | select_type | table    | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra       |
+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
|    1 | SIMPLE      | customer | ALL  | NULL          | NULL | NULL    | NULL | 1500000 |     9.57 | Using where |
+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+

MariaDB [dbt3sf10]> select *, hex(histogram) from mysql.column_stats where column_name='c_mktsegment' and table_name='customer'\G
*************************** 1. row ***************************
       db_name: dbt3sf10
    table_name: customer
   column_name: c_mktsegment
     min_value: AUTOMOBILE
     max_value: MACHINERY
   nulls_ratio: 0.0000
    avg_length: 8.9998
 avg_frequency: 300000.0000
     hist_size: 200
     hist_type: 
     histogram:                                         ����������������������������������������jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj��������������������������������������������������������������������������������
hex(histogram): 00000000000000000000000000000000000000000000000000000000000000000000000000000000151515151515151515151515151515151515151515151515151515151515151515151515151515156A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A95959595959595959595959595959595959595959595959595959595959595959595959595959595FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

Comment by Sergei Petrunia [ 2013-04-04 ]

Note that: for
c_mktsegment ='AUTOMOBILE'

selectivity is 100%.

histogram shows 'AUTOMOBILE' is the minimum value.

Histogram has 200 buckets. first 40 of which are at 0%.. The remaining 160 buckets start at 0x15=20%.

Comment by Sergei Petrunia [ 2014-03-27 ]

Trying with the latest code, revid:psergey@askmonty.org-20140327090800-l3je8eqjrfx21iki. Selectivity for all above queries like

MariaDB [dbt3sf10]> explain extended select count(*) from customer where c_mktsegment ='HOUSEHOLD';

is 19.90%. THis is good.

One can also see that

mysql> explain extended select count(*) from customer where c_mktsegment ='NO-SUCH-MARKET';
+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
| id   | select_type | table    | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra       |
+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
|    1 | SIMPLE      | customer | ALL  | NULL          | NULL | NULL    | NULL | 1500000 |    19.90 | Using where |
+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+

is still 19.90, but this seems to be a limitation of this type of histogram.

Comment by Sergei Petrunia [ 2014-03-27 ]

After MDEV-5962 was fixed:

mysql> explain extended select count(*) from customer where c_mktsegment ='ZZZZZZZZZZ';
+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
| id   | select_type | table    | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra       |
+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
|    1 | SIMPLE      | customer | ALL  | NULL          | NULL | NULL    | NULL | 1500000 |    19.90 | Using where |
+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+

mysql> explain extended select count(*) from customer where c_mktsegment ='DDDDDDD';
+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
| id   | select_type | table    | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra       |
+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
|    1 | SIMPLE      | customer | ALL  | NULL          | NULL | NULL    | NULL | 1500000 |     0.30 | Using where |
+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+

That is, we will get incorrect estimates for values that

  • are not within the min...max range
  • are sufficiently close to the popular values (so much that they have the same position. value position has precision of 1/256th or 1/65536th for SINGLE/DOUBLE_PREC_HB).

for other non-existent values, we're able to return low selecitivity.

Comment by Sergei Petrunia [ 2014-03-27 ]

As for the original testcase:
if I run

[explain extended] select count(*), from t5 where col2=$i;

I get this:

'col1=...	#rows	filtered%
160	1	0.01
161	1	0.01
162	1	0.01
163	1	0.01
164	1	0.01
165	1	0.01
166	1	0.01
167	1	0.01
168	1	0.01
169	1	0.01
170	1	0.01
171	1	0.01
172	1	0.01
173	1	0.01
174	1	0.01
175	1	32.5
176	1	32.5
177	1	32.5
178	3301	32.5
179	1	32.5
180	1	32.5
181	1	32.5
182	1	32.5
183	1	32.5
184	1	0.07
185	1	0.07
186	1	0.07
187	1	0.07
188	1	0.07
189	1	0.07
190	1	0.07

That is, big number of values for value of 178 spreads over to its neighbors. One could expect that.

Comment by Sergei Petrunia [ 2014-03-27 ]

If one changes the histogram to use DOUBLE_PREC_HB, then the situation improves. 30% is only returned for the col2=178:

const   #rows    filtered%       filtered%,
               single_prec_hb   double_prec_hb
160     1         0.01             0.01
161     1         0.01             0.01
162     1         0.01             0.01
163     1         0.01             0.01
164     1         0.01             0.01
165     1         0.01             0.01
166     1         0.01             0.01
167     1         0.01             0.01
168     1         0.01             0.01
169     1         0.01             0.01
170     1         0.01             0.01
171     1         0.01             0.01
172     1         0.01             0.01
173     1         0.01             0.11
174     1         0.01             0.11
175     1         32.5             0.11
176     1         32.5             0.11
177     1         32.5             0.11
178     3301      32.5            32.50
179     1         32.5             0.06
180     1         32.5             0.06
181     1         32.5             0.06
182     1         32.5             0.06
183     1         32.5             0.06
184     1         0.07             0.06
185     1         0.07             0.06
186     1         0.07             0.06
187     1         0.07             0.06
188     1         0.07             0.06
189     1         0.07             0.06
190     1         0.07             0.01

Comment by Sergei Petrunia [ 2014-03-27 ]

Closing as this is inherent problem of this type of histogram.

Generated at Thu Feb 08 06:55:52 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.