[MDEV-21713] LIMIT optimization and selectivity: pessimistic estimates cause optimistic plans Created: 2020-02-12 Updated: 2023-11-30 |
|
| Status: | Open |
| Project: | MariaDB Server |
| Component/s: | Optimizer |
| Fix Version/s: | None |
| Type: | New Feature | Priority: | Major |
| Reporter: | Sergei Petrunia | Assignee: | Sergei Petrunia |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||
| Description |
Problem statement
Consider a query plan that enumerates rows in the order that matches the ORDER BY clause. It allows for "short-cutting", that is, the execution stops as soon as $N matches were produced. We can expect to produce $N matches when we have enumerated $N/selectivity(cond) rows. If the condition selectivity is not known, the optimizer will assume that selectivity=1, which will cause it to underestimate the number of rows that it will need to examine, and so underestimate the cost of this query plan. The same happens when the optimizer has a a "conservative" selectivity estimate (which it typically does, as it assumes the selectivity of condition to be 1.0 when there's no data that provides the real value). Proposed solution, v1The idea is to have the cost computations take into account the "short-cutting" only when the full condition selectivity is known. This is rather hard to do as EITS and the range analyzer extract condition selectivity from the whole WHERE condition, and give no indication of whether this selectivity is "full" or "partial". Proposed solution: detailsIf the WHERE condition is a conjunction of conditions whose selectivity is known:
then we can assume the selectivity is known. (the issue with correlation between the conditions can still cause errors in the estimation but since these can be on either side, we can ignore this issue) Conditions whose selectivity is known are:
Proposed Solution, Version 2 (This is what decided to do)This is based on the input on the first patch. The target is the current code in MDEV-8306 branch, with its current selectivity estimation code. We also want to take into account MDEV-21633 and similar issues where the current code is known to produce imprecise estimates: do not assume Limited scopeNote that is fine if we are too conservative and conclude that selectivity estimate is not precise while it actually is. The effect of this will be that applicability of optimizations of MDEV-8306 is reduced but this can be tolerated. When to check for precise selectivityMDEV-8306 needs to compute a precise estimate join output cardinality $JOIN_CARD. This number is used to determine which fraction of some join execution plan $PLAN_WITH_LIMIT will be actually executed before #LIMIT output rows are produced and execution is cut short (it is $FRACTION = $LIMIT / $JOIN_CARD). Note the question of whether the optimizer was using accurate selectivity estimates when constructing the query plan $PLAN_WITH_LIMIT is not relevant - we will assume we'll need to compute $FRACTION fraction of the join) Check for precise selectivity estimates.We will use this (intentionally narrow) definition: Precise selectivity is known for
t.col1=t2.col2=... can be satisfied if for all of its members except one, the query plan has a ref access that has t.col1 as a key part.
|
| Comments |
| Comment by Sergei Petrunia [ 2020-02-12 ] | ||
|
It's fairly easy to make the range optimizer mark the comparison Items (eg. Item_func) for which it has produced non-empty SEL_ARG objects. If these conditions are also at the top level of the WHERE clause, this means we know their selecitivty from EITS or range estimates. What about multiple equalities? If the WHERE clause has a multiple-equality.. when do we consider its selectivity to be known? | ||
| Comment by Sergei Petrunia [ 2020-02-13 ] | ||
Taking into account ref accesses.Selects with multiple tables will most likely have equi-join conditions:
Here,
The above logic extends to several pair-wise equalities and ref accesses using several key parts. What if there are several equalities but there's no single [potential] ref access that covers them all:
possible options:
| ||
| Comment by Sergei Petrunia [ 2020-02-13 ] | ||
Multiple equalities.If the WHERE clause has a multiple equality, when can we consider its selectivity "taken into account"? An equality may or may not have a constant
If the equality has a constant, it should be treated as a set of pair-wise equalities col_i=const would have been treated. (The range optimizer will build SEL_ARG object[s] for an Item_equal. We can take a note which fields were handled this way). If the equality doesn't have a constant, taking its selectivity into account means "covering" the set of fields with pair-wise equalities... | ||
| Comment by Sergei Petrunia [ 2020-02-15 ] | ||
Handling Multiple equalitiesMulti-equality is a set of fields that should be equal:
If we find a part of the query plan which guarantees that {col1=col2}, then we remove either col1 or col2 from the set:
Continue this process for all parts of the query plan. If we are left with a set of one element, all others are guaranteed to be equal to it, and selectivity is taken into account. Otherwise, it is not. (If we examine the [eq]ref accesses and remove the fields that are parts of keys used for lookup, then there's only one point where tblX.colY can be removed.) Details
|