Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-21713

LIMIT optimization and selectivity: pessimistic estimates cause optimistic plans

    XMLWordPrintable

    Details

    • Type: Task
    • Status: Open (View Workflow)
    • Priority: Major
    • Resolution: Unresolved
    • Fix Version/s: 10.6
    • Component/s: Optimizer
    • Labels:
      None

      Description

      Problem statement

      SELECT ... FROM ...
      WHERE cond
      ORDER by tbl.key LIMIT $N
      

      Consider a query plan that enumerates rows in the order that matches the ORDER BY clause. It allows for "short-cutting", that is, the execution stops as soon as $N matches were produced.

      We can expect to produce $N matches when we have enumerated $N/selectivity(cond) rows.

      If the condition selectivity is not known, the optimizer will assume that selectivity=1, which will cause it to underestimate the number of rows that it will need to examine, and so underestimate the cost of this query plan.

      The same happens when the optimizer has a a "conservative" selectivity estimate (which it typically does, as it assumes the selectivity of condition to be 1.0 when there's no data that provides the real value).

      Proposed solution, v1

      The idea is to have the cost computations take into account the "short-cutting" only when the full condition selectivity is known.

      This is rather hard to do as EITS and the range analyzer extract condition selectivity from the whole WHERE condition, and give no indication of whether this selectivity is "full" or "partial".

      Proposed solution: details

      If the WHERE condition is a conjunction of conditions whose selectivity is known:

        WHERE cond1 AND cond2 ... AND condN
      

      then we can assume the selectivity is known. (the issue with correlation between the conditions can still cause errors in the estimation but since these can be on either side, we can ignore this issue)

      Conditions whose selectivity is known are:

      • Comparisons (=, <, >, etc) of a column with constant for which the EITS analyzer has produced a SEL_ARG.
      • Equalities in form tbl.key=... that are used for ref access.
        • ref access must use index statistics or EITS estimate.

      Proposed Solution, Version 2 (This is what decided to do)

      This is based on the input on the first patch.

      The target is the current code in MDEV-8306 branch, with its current selectivity estimation code.
      If/when selectivity estimation code is improved the check for accurate selectivity estimation can be adjusted as well.

      We also want to take into account MDEV-21633 and similar issues where the current code is known to produce imprecise estimates: do not assume
      the selectivity estimate is precise if it might not be. Making the error on the other side is more tolerable.

      Limited scope

      Note that is fine if we are too conservative and conclude that selectivity estimate is not precise while it actually is. The effect of this will be that applicability of optimizations of MDEV-8306 is reduced but this can be tolerated.

      When to check for precise selectivity

      MDEV-8306 needs to compute a precise estimate join output cardinality $JOIN_CARD.

      This number is used to determine which fraction of some join execution plan $PLAN_WITH_LIMIT will be actually executed before #LIMIT output rows are produced and execution is cut short (it is $FRACTION = $LIMIT / $JOIN_CARD).

      Note the question of whether the optimizer was using accurate selectivity estimates when constructing the query plan $PLAN_WITH_LIMIT is not relevant - we will assume we'll need to compute $FRACTION fraction of the join)

      Check for precise selectivity estimates.

      We will use this (intentionally narrow) definition:

      Precise selectivity is known for

      • column $CMP const, where $CMP is (less|more) [or equal], and the column has a histogram data selectivity but it is not a part of any key. [and is not a part of a multiple equality].
      • The same as above, where the column is the first component of one index, and is not a part of any other index. [and is not a part of a multiple equality].
      • A multiple equality that has one field and one constant: Item_equal(tblx.colY, const) and satisfies one of the above two rules except the [] part.
      • A multiple equality that has no constants

      t.col1=t2.col2=...

      can be satisfied if for all of its members except one, the query plan has a ref access that has t.col1 as a key part.

      • an AND of any the above.

        Attachments

          Issue Links

            Activity

              People

              Assignee:
              psergey Sergei Petrunia
              Reporter:
              psergey Sergei Petrunia
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

                Dates

                Created:
                Updated: