Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-21713

LIMIT optimization and selectivity: pessimistic estimates cause optimistic plans

Details

    • New Feature
    • Status: Open (View Workflow)
    • Major
    • Resolution: Unresolved
    • None
    • Optimizer
    • None

    Description

      Problem statement

      SELECT ... FROM ...
      WHERE cond
      ORDER by tbl.key LIMIT $N
      

      Consider a query plan that enumerates rows in the order that matches the ORDER BY clause. It allows for "short-cutting", that is, the execution stops as soon as $N matches were produced.

      We can expect to produce $N matches when we have enumerated $N/selectivity(cond) rows.

      If the condition selectivity is not known, the optimizer will assume that selectivity=1, which will cause it to underestimate the number of rows that it will need to examine, and so underestimate the cost of this query plan.

      The same happens when the optimizer has a a "conservative" selectivity estimate (which it typically does, as it assumes the selectivity of condition to be 1.0 when there's no data that provides the real value).

      Proposed solution, v1

      The idea is to have the cost computations take into account the "short-cutting" only when the full condition selectivity is known.

      This is rather hard to do as EITS and the range analyzer extract condition selectivity from the whole WHERE condition, and give no indication of whether this selectivity is "full" or "partial".

      Proposed solution: details

      If the WHERE condition is a conjunction of conditions whose selectivity is known:

        WHERE cond1 AND cond2 ... AND condN
      

      then we can assume the selectivity is known. (the issue with correlation between the conditions can still cause errors in the estimation but since these can be on either side, we can ignore this issue)

      Conditions whose selectivity is known are:

      • Comparisons (=, <, >, etc) of a column with constant for which the EITS analyzer has produced a SEL_ARG.
      • Equalities in form tbl.key=... that are used for ref access.
        • ref access must use index statistics or EITS estimate.

      Proposed Solution, Version 2 (This is what decided to do)

      This is based on the input on the first patch.

      The target is the current code in MDEV-8306 branch, with its current selectivity estimation code.
      If/when selectivity estimation code is improved the check for accurate selectivity estimation can be adjusted as well.

      We also want to take into account MDEV-21633 and similar issues where the current code is known to produce imprecise estimates: do not assume
      the selectivity estimate is precise if it might not be. Making the error on the other side is more tolerable.

      Limited scope

      Note that is fine if we are too conservative and conclude that selectivity estimate is not precise while it actually is. The effect of this will be that applicability of optimizations of MDEV-8306 is reduced but this can be tolerated.

      When to check for precise selectivity

      MDEV-8306 needs to compute a precise estimate join output cardinality $JOIN_CARD.

      This number is used to determine which fraction of some join execution plan $PLAN_WITH_LIMIT will be actually executed before #LIMIT output rows are produced and execution is cut short (it is $FRACTION = $LIMIT / $JOIN_CARD).

      Note the question of whether the optimizer was using accurate selectivity estimates when constructing the query plan $PLAN_WITH_LIMIT is not relevant - we will assume we'll need to compute $FRACTION fraction of the join)

      Check for precise selectivity estimates.

      We will use this (intentionally narrow) definition:

      Precise selectivity is known for

      • column $CMP const, where $CMP is (less|more) [or equal], and the column has a histogram data selectivity but it is not a part of any key. [and is not a part of a multiple equality].
      • The same as above, where the column is the first component of one index, and is not a part of any other index. [and is not a part of a multiple equality].
      • A multiple equality that has one field and one constant: Item_equal(tblx.colY, const) and satisfies one of the above two rules except the [] part.
      • A multiple equality that has no constants

      t.col1=t2.col2=...

      can be satisfied if for all of its members except one, the query plan has a ref access that has t.col1 as a key part.

      • an AND of any the above.

      Attachments

        Issue Links

          Activity

            It's fairly easy to make the range optimizer mark the comparison Items (eg. Item_func) for which it has produced non-empty SEL_ARG objects. If these conditions are also at the top level of the WHERE clause, this means we know their selecitivty from EITS or range estimates.

            What about multiple equalities? If the WHERE clause has a multiple-equality.. when do we consider its selectivity to be known?

            psergei Sergei Petrunia added a comment - It's fairly easy to make the range optimizer mark the comparison Items (eg. Item_func) for which it has produced non-empty SEL_ARG objects. If these conditions are also at the top level of the WHERE clause, this means we know their selecitivty from EITS or range estimates. What about multiple equalities? If the WHERE clause has a multiple-equality.. when do we consider its selectivity to be known?

            Taking into account ref accesses.

            Selects with multiple tables will most likely have equi-join conditions:

            select * from t1,t2 where t2.key=t1.col
            

            Here,

            • if we are using ref access on t2.key=t1.col, then selectivity of the condition is taken care of.
            • If we are not using the ref access, could we examine the index statistics and apply its selectivity?

            The above logic extends to several pair-wise equalities and ref accesses using several key parts.

            What if there are several equalities but there's no single [potential] ref access that covers them all:

            select * from t1,t2 where t2.key1=t1.col1 and t2.key2=t1.col2
            

            possible options:

            • Don't handle this. Assume the cost of current plan is too high.
            • Apply the selectivity of columns that are not "covered" by the access method.
            psergei Sergei Petrunia added a comment - Taking into account ref accesses. Selects with multiple tables will most likely have equi-join conditions: select * from t1,t2 where t2.key=t1.col Here, if we are using ref access on t2.key=t1.col , then selectivity of the condition is taken care of. If we are not using the ref access, could we examine the index statistics and apply its selectivity? The above logic extends to several pair-wise equalities and ref accesses using several key parts. What if there are several equalities but there's no single [potential] ref access that covers them all: select * from t1,t2 where t2.key1=t1.col1 and t2.key2=t1.col2 possible options: Don't handle this. Assume the cost of current plan is too high. Apply the selectivity of columns that are not "covered" by the access method.

            Multiple equalities.

            If the WHERE clause has a multiple equality, when can we consider its selectivity "taken into account"?

            An equality may or may not have a constant

            col1=col2 AND col2=col3 AND ... col3=const
            col1=col2 AND col2=col3 AND ...
            

            If the equality has a constant, it should be treated as a set of pair-wise equalities col_i=const would have been treated. (The range optimizer will build SEL_ARG object[s] for an Item_equal. We can take a note which fields were handled this way).

            If the equality doesn't have a constant, taking its selectivity into account means "covering" the set of fields with pair-wise equalities...

            psergei Sergei Petrunia added a comment - Multiple equalities. If the WHERE clause has a multiple equality, when can we consider its selectivity "taken into account"? An equality may or may not have a constant col1=col2 AND col2=col3 AND ... col3=const col1=col2 AND col2=col3 AND ... If the equality has a constant, it should be treated as a set of pair-wise equalities col_i=const would have been treated. (The range optimizer will build SEL_ARG object[s] for an Item_equal. We can take a note which fields were handled this way). If the equality doesn't have a constant, taking its selectivity into account means "covering" the set of fields with pair-wise equalities...
            psergei Sergei Petrunia added a comment - - edited

            Handling Multiple equalities

            Multi-equality is a set of fields that should be equal:

              { col1, col2, col3 }
            

            If we find a part of the query plan which guarantees that {col1=col2}, then we remove either col1 or col2 from the set:

              {col1, col3}
            

            Continue this process for all parts of the query plan. If we are left with a set of one element, all others are guaranteed to be equal to it, and selectivity is taken into account. Otherwise, it is not.

            (If we examine the [eq]ref accesses and remove the fields that are parts of keys used for lookup, then there's only one point where tblX.colY can be removed.)

            Details

            • POSITION::key shows which key we want to use. But it doesn't have info about how many key parts are to be used.
            • POSITION has Field object. It doesn't have a pointer to Item_equal it is a part of. (Item_field has a pointer to the Item_equal it is part of). We will have to walk through the list of Item_equal members and compare.
            • Since there's only one point where we can remove a field from the set, the current set can be represented by a counter.
            psergei Sergei Petrunia added a comment - - edited Handling Multiple equalities Multi-equality is a set of fields that should be equal: { col1, col2, col3 } If we find a part of the query plan which guarantees that {col1=col2}, then we remove either col1 or col2 from the set: {col1, col3} Continue this process for all parts of the query plan. If we are left with a set of one element, all others are guaranteed to be equal to it, and selectivity is taken into account. Otherwise, it is not. (If we examine the [eq] ref accesses and remove the fields that are parts of keys used for lookup, then there's only one point where tblX.colY can be removed.) Details POSITION::key shows which key we want to use. But it doesn't have info about how many key parts are to be used. POSITION has Field object. It doesn't have a pointer to Item_equal it is a part of. (Item_field has a pointer to the Item_equal it is part of). We will have to walk through the list of Item_equal members and compare. Since there's only one point where we can remove a field from the set, the current set can be represented by a counter.

            People

              psergei Sergei Petrunia
              psergei Sergei Petrunia
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.