[MDEV-27061] A proposal on how to compute rows estimate for quick select's prefix Created: 2021-11-16  Updated: 2021-11-23

Status: Open
Project: MariaDB Server
Component/s: None
Fix Version/s: None

Type: Task Priority: Major
Reporter: Sergei Petrunia Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None


 Description   

This is my recollection of Igor's proposal on how to compute #rows for quick select's prefix.

Task setting

One of the issues we encounter in the condition selectivity computations is as follows:

The table has indexes overlapping like so:

IDX1 (a, b)
IDX2 (b)

The WHERE condition is such that we get two quick selects (with their estimates):

$quick1 (IDX1, used_key_parts=2)
$quick2 (IDX2, used_key_parts=1)

How does one come up with a combined estimate of a WHERE condition?

The idea is to develop a procedure that would allow take quick select $quick1 and infer the #rows (or selectivity) that we would get if we only used one key part of IDX1 for it. That is, the selectivity of restrictions on column a.

Then, we will be able to use the independence assumption and multiply it by the selectivity of column b , and obtain the full selectivity.

Computing #rows for a quick select's prefix

Ok, we've got ranges for IDX1(a,b). We classify the ranges into 3 2 "buckets":

Bucket #1: ranges with non-equality on the first key part

The range is in the form:

 (left_endp_1, left_endp_2) <=  (a,b) <= (right_endp_1, right_endp_2)              (1-FULL)

Here,

  • xxx_endp_N are constants
  • the comparisons are <= but they could be < as well.

The important part is that left_endp_1!= right_endp_1.

The idea is, when the values for the first keypart are different, the second keyparts do not change the size of the range much.

This means that records_in_range estimate we've got for (1-FULL) is sufficiently close to the estimate we would get for the truncated range:

 (left_endp_1) <=  (a) <= (right_endp_1)                          (1-TRUNCATED)

So, we assume that estimate for (1-FULL) is close enough to the estimate of (1-TRUNCATED).

Bucket #2: ranges with equality on 1st keypart

The range is in the form:

 (endp_1, left_endp_2) <=  (a,b) <= (endp_1, right_endp_2)                    (2-FULL)

And we need an estimate for

a =  endp_1                            (2-TRUNCATED)

Here, we use rec_per_key[0] as the estimate for #rows in (2-TRUNCATED).

IDX1.rec_per_key[0] is defined as "average number of rows one would get for a=some_constant_in_the_table". So, we use the global average as an estimate.

(TODO: can this be too imprecise?)
(TODO: should one make adjustments when the estimate for 2-FULL is already a bigger value than rec_per_key[0] ?

Issues not covered so far

  • What to do if there is a choice from multiple indexes, for example

    IDX1(a,b)
    IDX2(b)
    IDX3(a,b,not_used_column)
    

    Do we use IDX1 or IDX3?

  • What else?

Generated at Thu Feb 08 09:50:02 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.