[MDEV-4362] Selectivity estimates for IN (...) do not depend on whether the values are in range Created: 2013-04-03  Updated: 2014-03-27  Resolved: 2014-03-27

Status: Closed
Project: MariaDB Server
Component/s: None
Affects Version/s: None
Fix Version/s: 10.0.10

Type: Bug Priority: Major
Reporter: Sergei Petrunia Assignee: Sergei Petrunia
Resolution: Fixed Votes: 0
Labels: eits, mwl#253

Issue Links:
Relates
relates to MDEV-4145 Take into account the selectivity of ... Closed

 Description   

Create the dataset:

create table t5 (col1 int);
set @a=-1;
create table one_k (a int)  select (@a:=@a+1) as a from information_schema.session_variables A, information_schema.session_variables B limit 1000;
insert into t5 select A.a from one_k A, one_k B where A.a < 100 and B.a < 100;
set histogram_size=100;
analyze table t5 persistent for all;
select *, hex(histogram) from mysql.column_stats where table_name='t5'\G
*************************** 1. row ***************************
       db_name: j10
    table_name: t5
   column_name: col1
     min_value: 0
     max_value: 99
   nulls_ratio: 0.0000
    avg_length: 4.0000
 avg_frequency: 100.0000
     hist_size: 100
     hist_type: 
     histogram:  (100 bytes here)

Ok, so we've got a table with 100 rows of 0, 100 rows of 1, and so forth up to 99.

Let's see how estimating works:

MariaDB [j10]> explain extended select * from t5 where col1 in (1,2,3);
+------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+
| id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows  | filtered | Extra       |
+------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+
|    1 | SIMPLE      | t5    | ALL  | NULL          | NULL | NULL    | NULL | 10000 |     3.79 | Using where |
+------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+
1 row in set, 1 warning (10.64 sec)

The real selectivity is 3%, we've got 3.79. Good.

Now, let's try values that are certainly not in the table:

MariaDB [j10]> explain extended select * from t5 where col1 in (-1,-2,-3);
+------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+
| id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows  | filtered | Extra       |
+------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+
|    1 | SIMPLE      | t5    | ALL  | NULL          | NULL | NULL    | NULL | 10000 |     3.79 | Using where |
+------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+

OOps, again 3.79%.

Let's see what non-equality range shows:

MariaDB [j10]> explain extended select * from t5 where col1<=-1;
+------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+
| id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows  | filtered | Extra       |
+------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+
|    1 | SIMPLE      | t5    | ALL  | NULL          | NULL | NULL    | NULL | 10000 |     0.99 | Using where |
+------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+

1%. It's better.

I consider selecitivity obtained for "where col1 in (-1,-2,-3)" to be a bug.



 Comments   
Comment by Patryk Pomykalski [ 2013-04-20 ]

idea:

— sql/sql_statistics.h 2013-04-16 05:43:07 +0000
+++ sql/sql_statistics.h 2013-04-20 14:42:57 +0000
@@ -248,6 +248,10 @@
1.0 : get_value(max) * inv_prec_factor) -
(min == 0 ?
0.0 : get_value(min-1) * inv_prec_factor);
+
+ if (width <= DBL_EPSILON)
+ return 0.0;
+
sel= avg_sel * (bucket_sel * (max + 1 - min)) / width;
return sel;
}

Comment by Sergei Petrunia [ 2014-03-17 ]

Situation with current 10.0

mysql> explain extended select * from t5 where col1 in (1,2,3);
+------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+
| id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows  | filtered | Extra       |
+------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+
|    1 | SIMPLE      | t5    | ALL  | NULL          | NULL | NULL    | NULL | 10000 |     3.79 | Using where |
+------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+

mysql>  explain extended select * from t5 where col1 in (-1,-2,-3);
+------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+
| id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows  | filtered | Extra       |
+------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+
|    1 | SIMPLE      | t5    | ALL  | NULL          | NULL | NULL    | NULL | 10000 |   100.00 | Using where |
+------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+

mysql>  explain extended select * from t5 where col1<=-1;
+------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+
| id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows  | filtered | Extra       |
+------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+
|    1 | SIMPLE      | t5    | ALL  | NULL          | NULL | NULL    | NULL | 10000 |     1.00 | Using where |
+------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+

Comment by Sergei Petrunia [ 2014-03-17 ]

I am debugging this query:

explain extended select * from t5 where col1 in (-1,-2,-3);

There seems to be a problem with this line in Histogram::point_selectivity():

sel= avg_sel * (bucket_sel * (max + 1 - min)) / width;

Here, width=0.0, we get sel=inf and return it.

Comment by Sergei Petrunia [ 2014-03-17 ]

So, Histogram::point_selectivity() returns inf,
get_column_range_cardinality() returns inf,
records_in_column_ranges() returns inf,
field("col1")->cond_selectivity = inf,
and then table->cond_selectivity=1,

although the condition was highly selective.

Comment by Sergei Petrunia [ 2014-03-25 ]

Re-running the queries after fix for MDEV-5926, I get:

mysql> explain extended select * from t5 where col1 in (1,2,3);
+------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+
| id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows  | filtered | Extra       |
+------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+
|    1 | SIMPLE      | t5    | ALL  | NULL          | NULL | NULL    | NULL | 10000 |     2.97 | Using where |
+------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+

mysql> explain extended select * from t5 where col1 in (-1,-2,-3);
+------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+
| id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows  | filtered | Extra       |
+------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+
|    1 | SIMPLE      | t5    | ALL  | NULL          | NULL | NULL    | NULL | 10000 |     2.97 | Using where |
+------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+

mysql> explain extended select * from t5 where col1<=-1;
+------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+
| id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows  | filtered | Extra       |
+------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+
|    1 | SIMPLE      | t5    | ALL  | NULL          | NULL | NULL    | NULL | 10000 |     1.00 | Using where |
+------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+

Comment by Sergei Petrunia [ 2014-03-25 ]

That is, the issue from comment #2 where filtered=100% suddenly, is gone.

Comment by Sergei Petrunia [ 2014-03-25 ]

pomyk, I was leaning towards that idea also.. but then I questioned the selectivity formula, tried to come up with my own formula, and, surprise, it didn't have anything resembling the "width" that the old formula used.

Comment by Sergei Petrunia [ 2014-03-25 ]

So, we still have col1 =

{value_less_than_tables_minimum}

return the same estimate as col1=

{ value_within_range_of_table_values}

. (The data distribution is uniform).

This might be not as bad as it sounds, because minimum and maximum are statistics. The actual table can have values which are greater than the maximum value we've had when collecting statistics (or, correspondingly, less than the minimum... ).

I have also discovered a problem with very skewed distributions, MDEV-5950. My opinion is that estimates that are produced for this MDEV are satisfactory, while estimates that are produced for MDEV-5950 are not. However, when we fix MDEV-5950, we might fix this one also.

Comment by Sergei Petrunia [ 2014-03-26 ]

With Fix for MDEV-5926, "Attempt#2", the division-by-zero problem started to appear again (it fails an assert that I've added specifically to catch this).

Developed a fix for it. getting value_pos=0 (or 1) and the first (or the last) bucket of histogram having zero width is a special case.

Comment by Sergei Petrunia [ 2014-03-27 ]

Pushed the second variant of the fix for "division by zero" problem. Pushed a testcase.

Generated at Thu Feb 08 06:55:50 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.