[MDEV-31067] selectivity_from_histogram >1.0 for col=const on DOUBLE_PREC_HB histogram Created: 2023-04-17 Updated: 2023-05-22 Resolved: 2023-05-22 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Optimizer |
| Affects Version/s: | 10.4, 10.5, 10.6, 10.7, 10.8 |
| Fix Version/s: | 10.11.3, 10.4.29, 10.5.20, 10.6.13, 10.8.8, 10.9.6, 10.10.4 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Sergei Petrunia | Assignee: | Sergei Petrunia |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | None | ||
| Issue Links: |
|
||||||||||||
| Description |
|
This is a public part of TODO-3823. For certain data distributions, histogram code can produce selectivity>1.0 for a column=const equality on a DOUBLE_PREC_HB histogram (I think one can achieve the same on SINGLE_PREC_HB, the point is that it's not JSON_HB histogram).
INSERTs (synthetic data):
|
| Comments |
| Comment by Sergei Petrunia [ 2023-04-17 ] | ||||||||||||||||
|
The problem is in Histogram::point_selectivity(), this part of code:
In particular, we arrive at the value that's greater than 1.0 here:
| ||||||||||||||||
| Comment by Sergei Petrunia [ 2023-04-17 ] | ||||||||||||||||
|
Introduced in:
| ||||||||||||||||
| Comment by Sergei Petrunia [ 2023-04-17 ] | ||||||||||||||||
|
Trying to understand the logic behind the
So, we are looking to get #rows estimate for a value X that fits within a single bucket. That is,
How many rows are in the bucket that have value X? point_selectivity() has avg_sel parameter, but that's a global per-table average. Suppose the histogram for this particular table has two buckets:
both buckets hold the same number of rows. If we assume that the values are distributed uniformly across the 0.0 (min value) to 1.0 (max value) range, we might guess that bucketY1 has more different values in it than bucketY2. This is what this code is trying to do. It assumes that buckets that span bigger value range have lower avg_frequency (rows have different values, plenty of values to choose from). Contrary, a bucket that span small value range has fewer different values, and so rows in the bucket will tend to have the same value. | ||||||||||||||||
| Comment by Sergei Petrunia [ 2023-04-17 ] | ||||||||||||||||
|
In the testcase for this MDEV, the bucket has a very narrow value range: its "right_endpoint - left_endpoint" is 500x smaller than what the average bucket's right_endpoint - left_endpoint has. The code multiplies the estimate by this factor and ends up with a non-sensical estimate. | ||||||||||||||||
| Comment by Sergei Petrunia [ 2023-04-18 ] | ||||||||||||||||
|
bb-10.6-mdev31067 | ||||||||||||||||
| Comment by Sergei Petrunia [ 2023-04-18 ] | ||||||||||||||||
|
Takeaways from the optimizer call: Igor would prefer to see the "brave heuristic" removed completely. I'm hesitant as that will cause more changes in the estimates (and so changes in the query plans). But let me implement that in a patch and see if that one would have an agreement. | ||||||||||||||||
| Comment by Sergei Petrunia [ 2023-04-19 ] | ||||||||||||||||
|
The alternative patch implementing suggestions from the above call: https://github.com/MariaDB/server/commits/bb-10.6-mdev31067-variant2. igor please review. | ||||||||||||||||
| Comment by Sergei Petrunia [ 2023-04-19 ] | ||||||||||||||||
|
igor please review. | ||||||||||||||||
| Comment by Igor Babaev [ 2023-04-24 ] | ||||||||||||||||
|
psergei Sergei, | ||||||||||||||||
| Comment by Sergei Petrunia [ 2023-04-27 ] | ||||||||||||||||
|
https://github.com/MariaDB/server/commits/bb-10.6-mdev31067-variant3 . igor please review. | ||||||||||||||||
| Comment by Igor Babaev [ 2023-04-28 ] | ||||||||||||||||
|
In my opinion the function Histogram::point_selectivity() the has patch touched still requires some improvements. | ||||||||||||||||
| Comment by Sergei Petrunia [ 2023-04-28 ] | ||||||||||||||||
|
igor,
How is this part of this MDEV? Nor the testcase for this MDEV, neither any of the fixes touch that part of the code. | ||||||||||||||||
| Comment by Sergei Golubchik [ 2023-04-28 ] | ||||||||||||||||
|
ok to push variant 2. commit https://github.com/MariaDB/server/commit/a87c17ab566def817da2f79fb7badb7318a1e311 It is a conservative upper-bound estimate, and is delivers lower estimates than a complex magic of popular values in the variant 3. Thus I see no reason to use the complex approach if the simple and safe approach produces better estimates. |