[MDEV-4364] Histograms show the same selectivity for col=rare_value and col=frequent_value - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Major
Resolution: Not a Bug
Affects Version/s: None
Fix Version/s: 10.0.10
Component/s: None
Labels:
- eits
- mwl#253

Description

Create the dataset as specified by ~~MDEV-4363~~.

MariaDB [j10]> set use_stat_tables='preferably';

Query OK, 0 rows affected (0.00 sec)

MariaDB [j10]> set optimizer_use_condition_selectivity=4;

Query OK, 0 rows affected, 1 warning (0.00 sec)

The point of this test is to check skewed data distributions.
The value of 178 is very frequent:

MariaDB [j10]> select (select  count(*) from t5 where col2 = 178 ) /(select  count(*) from t5);

+--------------------------------------------------------------------------+

| (select  count(*) from t5 where col2 = 178 ) /(select  count(*) from t5) |

+--------------------------------------------------------------------------+

|                                                                   0.3301 |

+--------------------------------------------------------------------------+

1 row in set (0.18 sec)

What does EXPLAIN think:

MariaDB [j10]> explain extended select count(*) from t5 where col2 = 178 ;

+------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+

| id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows  | filtered | Extra       |

+------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+

|    1 | SIMPLE      | t5    | ALL  | NULL          | NULL | NULL    | NULL | 10000 |     3.60 | Using where |

+------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+

1 row in set, 1 warning (0.00 sec)

it thinks that selectivity is 3.6%.

If i try it with a regular, rare value:

MariaDB [j10]> select (select  count(*) from t5 where col2 = 179 ) /(select  count(*) from t5);

+--------------------------------------------------------------------------+

| (select  count(*) from t5 where col2 = 179 ) /(select  count(*) from t5) |

+--------------------------------------------------------------------------+

|                                                                   0.0001 |

+--------------------------------------------------------------------------+

1 row in set (0.20 sec)

MariaDB [j10]> explain extended select count(*) from t5 where col2 = 179 ;

+------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+

| id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows  | filtered | Extra       |

+------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+

|    1 | SIMPLE      | t5    | ALL  | NULL          | NULL | NULL    | NULL | 10000 |     3.60 | Using where |

+------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+

1 row in set, 1 warning (0.00 sec)

It looks like histograms do not allow to distinguish between a frequent and a rare value?

Attachments

Issue Links

relates to

MDEV-4145 Take into account the selectivity of single-table range predicates on non-indexed columns when searching for the best execution plan

Closed

Activity

Ascending order - Click to sort in descending order

Sergei Petrunia added a comment - 2013-04-04 13:15 - edited

One can have an example with non "RQG data", also:

MariaDB [dbt3sf10]> select  c_mktsegment,  count(*)/@total from customer group by c_mktsegment;

+--------------+-----------------+

| c_mktsegment | count(*)/@total |

+--------------+-----------------+

| AUTOMOBILE   |          0.2000 |

| BUILDING     |          0.2002 |

| FURNITURE    |          0.1997 |

| HOUSEHOLD    |          0.1998 |

| MACHINERY    |          0.2003 |

+--------------+-----------------+

5 rows in set (0.64 sec)

MariaDB [dbt3sf10]> explain extended select count(*) from customer where c_mktsegment ='AUTOMOBILE';

+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+

| id   | select_type | table    | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra       |

+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+

|    1 | SIMPLE      | customer | ALL  | NULL          | NULL | NULL    | NULL | 1500000 |   100.00 | Using where |

+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+

MariaDB [dbt3sf10]> explain extended select count(*) from customer where c_mktsegment ='BUILDING';

+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+

| id   | select_type | table    | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra       |

+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+

|    1 | SIMPLE      | customer | ALL  | NULL          | NULL | NULL    | NULL | 1500000 |    48.33 | Using where |

+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+

MariaDB [dbt3sf10]> explain extended select count(*) from customer where c_mktsegment ='FURNITURE';

+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+

| id   | select_type | table    | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra       |

+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+

|    1 | SIMPLE      | customer | ALL  | NULL          | NULL | NULL    | NULL | 1500000 |    11.94 | Using where |

+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+

MariaDB [dbt3sf10]> explain extended select count(*) from customer where c_mktsegment ='HOUSEHOLD';

+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+

| id   | select_type | table    | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra       |

+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+

|    1 | SIMPLE      | customer | ALL  | NULL          | NULL | NULL    | NULL | 1500000 |    23.60 | Using where |

+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+

MariaDB [dbt3sf10]> explain extended select count(*) from customer where c_mktsegment ='MACHINERY';

+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+

| id   | select_type | table    | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra       |

+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+

|    1 | SIMPLE      | customer | ALL  | NULL          | NULL | NULL    | NULL | 1500000 |     9.57 | Using where |

+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+

MariaDB [dbt3sf10]> select *, hex(histogram) from mysql.column_stats where column_name='c_mktsegment' and table_name='customer'\G

*************************** 1. row ***************************

       db_name: dbt3sf10

    table_name: customer

   column_name: c_mktsegment

     min_value: AUTOMOBILE

     max_value: MACHINERY

   nulls_ratio: 0.0000

    avg_length: 8.9998

 avg_frequency: 300000.0000

     hist_size: 200

     hist_type:

     histogram:                                         ����������������������������������������jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj��������������������������������������������������������������������������������

hex(histogram): 00000000000000000000000000000000000000000000000000000000000000000000000000000000151515151515151515151515151515151515151515151515151515151515151515151515151515156A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A95959595959595959595959595959595959595959595959595959595959595959595959595959595FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

Sergei Petrunia added a comment - 2013-04-04 13:19

Note that: for
c_mktsegment ='AUTOMOBILE'

selectivity is 100%.

histogram shows 'AUTOMOBILE' is the minimum value.

Histogram has 200 buckets. first 40 of which are at 0%.. The remaining 160 buckets start at 0x15=20%.

Sergei Petrunia added a comment - 2013-04-04 13:19 Note that: for c_mktsegment ='AUTOMOBILE' selectivity is 100%. histogram shows 'AUTOMOBILE' is the minimum value. Histogram has 200 buckets. first 40 of which are at 0%.. The remaining 160 buckets start at 0x15=20%.

Sergei Petrunia added a comment - 2014-03-27 11:38

Trying with the latest code, revid:psergey@askmonty.org-20140327090800-l3je8eqjrfx21iki. Selectivity for all above queries like

MariaDB [dbt3sf10]> explain extended select count(*) from customer where c_mktsegment ='HOUSEHOLD';

is 19.90%. THis is good.

One can also see that

mysql> explain extended select count(*) from customer where c_mktsegment ='NO-SUCH-MARKET';

+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+

| id   | select_type | table    | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra       |

+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+

|    1 | SIMPLE      | customer | ALL  | NULL          | NULL | NULL    | NULL | 1500000 |    19.90 | Using where |

+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+

is still 19.90, but this seems to be a limitation of this type of histogram.

Sergei Petrunia added a comment - 2014-03-27 13:14 - edited

After ~~MDEV-5962~~ was fixed:

mysql> explain extended select count(*) from customer where c_mktsegment ='ZZZZZZZZZZ';

+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+

| id   | select_type | table    | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra       |

+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+

|    1 | SIMPLE      | customer | ALL  | NULL          | NULL | NULL    | NULL | 1500000 |    19.90 | Using where |

+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+

mysql> explain extended select count(*) from customer where c_mktsegment ='DDDDDDD';

+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+

| id   | select_type | table    | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra       |

+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+

|    1 | SIMPLE      | customer | ALL  | NULL          | NULL | NULL    | NULL | 1500000 |     0.30 | Using where |

+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+

That is, we will get incorrect estimates for values that

are not within the min...max range
are sufficiently close to the popular values (so much that they have the same position. value position has precision of 1/256th or 1/65536th for SINGLE/DOUBLE_PREC_HB).

for other non-existent values, we're able to return low selecitivity.

Sergei Petrunia added a comment - 2014-03-27 13:40

As for the original testcase:
if I run

[explain extended] select count(*), from t5 where col2=$i;

I get this:

'col1=... #rows filtered%
160 1 0.01
161 1 0.01
162 1 0.01
163 1 0.01
164 1 0.01
165 1 0.01
166 1 0.01
167 1 0.01
168 1 0.01
169 1 0.01
170 1 0.01
171 1 0.01
172 1 0.01
173 1 0.01
174 1 0.01
175 1 32.5
176 1 32.5
177 1 32.5
178 3301 32.5
179 1 32.5
180 1 32.5
181 1 32.5
182 1 32.5
183 1 32.5
184 1 0.07
185 1 0.07
186 1 0.07
187 1 0.07
188 1 0.07
189 1 0.07
190 1 0.07

That is, big number of values for value of 178 spreads over to its neighbors. One could expect that.

Sergei Petrunia added a comment - 2014-03-27 13:40 As for the original testcase: if I run [explain extended] select count(*), from t5 where col2=$i; I get this: 'col1=... #rows filtered% 160 1 0.01 161 1 0.01 162 1 0.01 163 1 0.01 164 1 0.01 165 1 0.01 166 1 0.01 167 1 0.01 168 1 0.01 169 1 0.01 170 1 0.01 171 1 0.01 172 1 0.01 173 1 0.01 174 1 0.01 175 1 32.5 176 1 32.5 177 1 32.5 178 3301 32.5 179 1 32.5 180 1 32.5 181 1 32.5 182 1 32.5 183 1 32.5 184 1 0.07 185 1 0.07 186 1 0.07 187 1 0.07 188 1 0.07 189 1 0.07 190 1 0.07 That is, big number of values for value of 178 spreads over to its neighbors. One could expect that.

Sergei Petrunia added a comment - 2014-03-27 14:20

If one changes the histogram to use DOUBLE_PREC_HB, then the situation improves. 30% is only returned for the col2=178:

const #rows filtered% filtered%,
single_prec_hb double_prec_hb
160 1 0.01 0.01
161 1 0.01 0.01
162 1 0.01 0.01
163 1 0.01 0.01
164 1 0.01 0.01
165 1 0.01 0.01
166 1 0.01 0.01
167 1 0.01 0.01
168 1 0.01 0.01
169 1 0.01 0.01
170 1 0.01 0.01
171 1 0.01 0.01
172 1 0.01 0.01
173 1 0.01 0.11
174 1 0.01 0.11
175 1 32.5 0.11
176 1 32.5 0.11
177 1 32.5 0.11
178 3301 32.5 32.50
179 1 32.5 0.06
180 1 32.5 0.06
181 1 32.5 0.06
182 1 32.5 0.06
183 1 32.5 0.06
184 1 0.07 0.06
185 1 0.07 0.06
186 1 0.07 0.06
187 1 0.07 0.06
188 1 0.07 0.06
189 1 0.07 0.06
190 1 0.07 0.01

Sergei Petrunia added a comment - 2014-03-27 14:20 If one changes the histogram to use DOUBLE_PREC_HB, then the situation improves. 30% is only returned for the col2=178: const #rows filtered% filtered%, single_prec_hb double_prec_hb 160 1 0.01 0.01 161 1 0.01 0.01 162 1 0.01 0.01 163 1 0.01 0.01 164 1 0.01 0.01 165 1 0.01 0.01 166 1 0.01 0.01 167 1 0.01 0.01 168 1 0.01 0.01 169 1 0.01 0.01 170 1 0.01 0.01 171 1 0.01 0.01 172 1 0.01 0.01 173 1 0.01 0.11 174 1 0.01 0.11 175 1 32.5 0.11 176 1 32.5 0.11 177 1 32.5 0.11 178 3301 32.5 32.50 179 1 32.5 0.06 180 1 32.5 0.06 181 1 32.5 0.06 182 1 32.5 0.06 183 1 32.5 0.06 184 1 0.07 0.06 185 1 0.07 0.06 186 1 0.07 0.06 187 1 0.07 0.06 188 1 0.07 0.06 189 1 0.07 0.06 190 1 0.07 0.01

Sergei Petrunia added a comment - 2014-03-27 14:21

Closing as this is inherent problem of this type of histogram.

Sergei Petrunia added a comment - 2014-03-27 14:21 Closing as this is inherent problem of this type of histogram.

People

Assignee:: Sergei Petrunia

Reporter:: Sergei Petrunia

Votes:: 0 Vote for this issue

Watchers:: 1 Start watching this issue

Dates

Created:: 2013-04-03 16:14

Updated:: 2014-03-27 14:21

Resolved:: 2014-03-27 14:21

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server