[MDEV-4362] Selectivity estimates for IN (...) do not depend on whether the values are in range Created: 2013-04-03 Updated: 2014-03-27 Resolved: 2014-03-27 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 10.0.10 |
| Type: | Bug | Priority: | Major |
| Reporter: | Sergei Petrunia | Assignee: | Sergei Petrunia |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | eits, mwl#253 | ||
| Issue Links: |
|
||||||||
| Description |
|
Create the dataset:
Ok, so we've got a table with 100 rows of 0, 100 rows of 1, and so forth up to 99. Let's see how estimating works:
The real selectivity is 3%, we've got 3.79. Good. Now, let's try values that are certainly not in the table:
OOps, again 3.79%. Let's see what non-equality range shows:
1%. It's better. I consider selecitivity obtained for "where col1 in (-1,-2,-3)" to be a bug. |
| Comments |
| Comment by Patryk Pomykalski [ 2013-04-20 ] | ||||||||||||||||||
|
idea: — sql/sql_statistics.h 2013-04-16 05:43:07 +0000 | ||||||||||||||||||
| Comment by Sergei Petrunia [ 2014-03-17 ] | ||||||||||||||||||
|
Situation with current 10.0
| ||||||||||||||||||
| Comment by Sergei Petrunia [ 2014-03-17 ] | ||||||||||||||||||
|
I am debugging this query: explain extended select * from t5 where col1 in (-1,-2,-3); There seems to be a problem with this line in Histogram::point_selectivity(): sel= avg_sel * (bucket_sel * (max + 1 - min)) / width; Here, width=0.0, we get sel=inf and return it. | ||||||||||||||||||
| Comment by Sergei Petrunia [ 2014-03-17 ] | ||||||||||||||||||
|
So, Histogram::point_selectivity() returns inf, although the condition was highly selective. | ||||||||||||||||||
| Comment by Sergei Petrunia [ 2014-03-25 ] | ||||||||||||||||||
|
Re-running the queries after fix for
| ||||||||||||||||||
| Comment by Sergei Petrunia [ 2014-03-25 ] | ||||||||||||||||||
|
That is, the issue from comment #2 where filtered=100% suddenly, is gone. | ||||||||||||||||||
| Comment by Sergei Petrunia [ 2014-03-25 ] | ||||||||||||||||||
|
pomyk, I was leaning towards that idea also.. but then I questioned the selectivity formula, tried to come up with my own formula, and, surprise, it didn't have anything resembling the "width" that the old formula used. | ||||||||||||||||||
| Comment by Sergei Petrunia [ 2014-03-25 ] | ||||||||||||||||||
|
So, we still have col1 = {value_less_than_tables_minimum}return the same estimate as col1= { value_within_range_of_table_values}. (The data distribution is uniform). This might be not as bad as it sounds, because minimum and maximum are statistics. The actual table can have values which are greater than the maximum value we've had when collecting statistics (or, correspondingly, less than the minimum... ). I have also discovered a problem with very skewed distributions, | ||||||||||||||||||
| Comment by Sergei Petrunia [ 2014-03-26 ] | ||||||||||||||||||
|
With Fix for Developed a fix for it. getting value_pos=0 (or 1) and the first (or the last) bucket of histogram having zero width is a special case. | ||||||||||||||||||
| Comment by Sergei Petrunia [ 2014-03-27 ] | ||||||||||||||||||
|
Pushed the second variant of the fix for "division by zero" problem. Pushed a testcase. |