  MariaDB Server / MDEV-8306

Complete cost-based optimization for ORDER BY with LIMIT

Details

    Description

      A long-standing (and informally known) issue:

      The join optimizer makes its choices [almost] without regard for the ORDER BY ... LIMIT clause. The ORDER BY ... LIMIT optimizer is invoked only after the join order is already fixed. If the picked join order does not allow the ORDER BY ... LIMIT to be resolved efficiently, we end up with a very poor query plan.

      Example:

      select * from
        t_fact
          join dim1 on t_fact.dim1_id= dim1.dim1_id
          join dim2 on t_fact.dim2_id= dim2.dim2_id
      order by
         t_fact.col1
      limit 1000;
      

      +------+-------------+--------+--------+-----------------+---------+---------+-------------------+------+---------------------------------+
      | id   | select_type | table  | type   | possible_keys   | key     | key_len | ref               | rows | Extra                           |
      +------+-------------+--------+--------+-----------------+---------+---------+-------------------+------+---------------------------------+
      |    1 | SIMPLE      | dim1   | ALL    | PRIMARY         | NULL    | NULL    | NULL              |  500 | Using temporary; Using filesort |
      |    1 | SIMPLE      | t_fact | ref    | dim1_id,dim2_id | dim1_id | 4       | j3.dim1.dim1_id   |    1 |                                 |
      |    1 | SIMPLE      | dim2   | eq_ref | PRIMARY         | PRIMARY | 4       | j3.t_fact.dim2_id |    1 |                                 |
      +------+-------------+--------+--------+-----------------+---------+---------+-------------------+------+---------------------------------+
      

      This uses filesort and takes ~8 sec.
      Now, let's force the right join order:

      select * from
        t_fact
          straight_join dim1 on t_fact.dim1_id= dim1.dim1_id
          straight_join dim2 on t_fact.dim2_id= dim2.dim2_id
      order by
         t_fact.col1
      limit 1000;
      

      +------+-------------+--------+--------+-----------------+---------+---------+-------------------+------+-------+
      | id   | select_type | table  | type   | possible_keys   | key     | key_len | ref               | rows | Extra |
      +------+-------------+--------+--------+-----------------+---------+---------+-------------------+------+-------+
      |    1 | SIMPLE      | t_fact | index  | dim1_id,dim2_id | col1    | 4       | NULL              | 1000 |       |
      |    1 | SIMPLE      | dim1   | eq_ref | PRIMARY         | PRIMARY | 4       | j3.t_fact.dim1_id |    1 |       |
      |    1 | SIMPLE      | dim2   | eq_ref | PRIMARY         | PRIMARY | 4       | j3.t_fact.dim2_id |    1 |       |
      +------+-------------+--------+--------+-----------------+---------+---------+-------------------+------+-------+
      

      This uses an index to resolve the ORDER BY ... LIMIT, and the select takes 0.01 sec to execute.

      Dataset:

      create table ten(a int);
      insert into ten values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
       
      create table one_k(a int);
      insert into one_k select A.a + B.a* 10 + C.a * 100 from ten A, ten B, ten C;
       
      create table t_fact
      (
        fact_id int not null,
        dim1_id int not null,
        dim2_id int not null,
        col1    int not null,
        primary key(fact_id),
        key(dim1_id),
        key(dim2_id),
        key(col1)
      );
       
      insert into t_fact
      select
        A.a+1000*B.a+1000*1000*C.a,
        A.a,
        B.a,
        A.a+1000*B.a+1000*1000*C.a
      from
        one_k A ,
        one_k B,
        ten C
      where
       A.a<500 and B.a<500
      ;
       
       
      create table dim1
      (
        dim1_id int not null primary key,
        col1 int
      );
       
      insert into dim1
      select a,a from one_k where a<500;
       
      create table dim2
      (
        dim2_id int not null primary key,
        col1 int
      );
      insert into dim2
      select a,a from one_k where a<500;
      

      Attachments

        Issue Links

          Activity

            psergei Sergei Petrunia created issue -
            psergei Sergei Petrunia made changes -
            psergei Sergei Petrunia made changes -
            Labels optimizer order-by-optimization

            psergei Sergei Petrunia added a comment -

            This can be solved by making the optimizer do this:

            1. Run the join optimizer (as it is currently done)
            2. Check if #join_output_rows < select_limit
            3. If yes, re-run the join optimizer in a special way:
               3.1 The first table must produce the required ordering
               3.2 Join buffering must not be used
               3.3 Costs/#rows must be adjusted to take into account that we are going to produce at most select_limit rows

            Unclear parts:

            3.1 may also need to include the case where one runs filesort(.., limit=none, ...) for the first table and then joins the sorted sequence with the rest (applying the LIMIT).

            3.3 is the most complex and needs to be elaborated on.

            psergei Sergei Petrunia made changes -
            Fix Version/s 10.1 [ 16100 ]
            psergei Sergei Petrunia made changes -

            psergei Sergei Petrunia added a comment -

            Another problematic issue is the WHERE clause. It is easy to decide to switch to index-based resolution of ORDER BY ... LIMIT N when you assume that the first N rows you enumerate will match the WHERE clause. If the WHERE condition is very selective, one may end up enumerating many more rows before finding the first N rows that match the WHERE.

            EITS gives the optimizer information about condition selectivity, but does it give enough? The way EITS works currently, it is difficult to tell whether EITS plus the join optimizer "took into account" the selectivities of all parts of the WHERE clause, or whether some conditions were left unhandled.
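
            For illustration, here is a toy model of that risk (standalone code, not anything from the server; the table size, LIMIT and selectivity values are invented, and matches are assumed to be spread uniformly over the index order): an ordered index scan is expected to read about limit/selectivity rows before it finds the first N matches.

            #include <cstdio>
            #include <initializer_list>

            // Toy model: an ordered index scan that stops after finding `limit` matching rows.
            // Under a uniform spread of matches, the expected number of rows read is roughly
            // limit / selectivity, capped by the table size.
            int main() {
              const double table_rows = 5000000;   // hypothetical table size
              const long   limit      = 1000;

              for (double selectivity : {0.5, 0.05, 0.001}) {
                double expected_scanned = limit / selectivity;
                if (expected_scanned > table_rows)
                  expected_scanned = table_rows;
                std::printf("selectivity=%.3f -> expected rows scanned for LIMIT %ld: %.0f\n",
                            selectivity, limit, expected_scanned);
              }
              // At selectivity 0.001 the "walk the ordering index" plan reads ~1,000,000 rows,
              // far more than a decision made without regard for the WHERE would suggest.
              return 0;
            }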

            psergei Sergei Petrunia made changes -
            Fix Version/s 10.2 [ 14601 ]
            Fix Version/s 10.1 [ 16100 ]
            psergei Sergei Petrunia added a comment - See also: http://bugs.mysql.com/bug.php?id=83323
            psergei Sergei Petrunia made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            psergei Sergei Petrunia made changes -
            Status In Progress [ 3 ] Stalled [ 10000 ]
            serg Sergei Golubchik made changes -
            Fix Version/s 10.4 [ 22408 ]
            Fix Version/s 10.2 [ 14601 ]
            serg Sergei Golubchik made changes -
            Affects Version/s 10.0.19 [ 19200 ]
            Issue Type Bug [ 1 ] Task [ 3 ]

            serg Sergei Golubchik added a comment -

            Two ideas about properly taking ORDER BY (but not LIMIT) into account in the join optimizer.

            1. The join optimizer does a depth-first search. First it builds some complete join order, then it tries other [partial] orders and compares costs. As soon as the very first join order has been built, the optimizer has an estimate of the number of rows in the result. At that point it can estimate the cost of a filesort. Later, when comparing plans (or partial plans), it should add the cost of a filesort to those plans that will end up using filesort. This way it can do a fair comparison between a plan-with-filesort and a plan that uses an index and needs no filesort.

            One of the weak points of this idea is that it assumes the optimizer will always get the same (or a similar) estimate for the number of rows in the result, independently of the join order.
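
            A minimal sketch of the comparison this idea implies (a toy calculation with invented numbers and a simplistic n*log2(n) filesort model, not the server's cost formulas): a join order that is cheaper to execute but needs a filesort should win only if it is still cheaper after the sort cost is added.

            #include <cmath>
            #include <cstdio>

            // Toy filesort cost: proportional to n * log2(n) comparisons.
            static double filesort_cost(double rows) {
              return rows > 1 ? rows * std::log2(rows) : 0.0;
            }

            int main() {
              const double est_result_rows = 250000;  // row estimate from the first complete join order

              // Hypothetical candidate plans for the same join.
              const double cost_plan_needs_sort = 90000;   // cheaper join, but ORDER BY needs a filesort
              const double cost_plan_ordered    = 140000;  // pricier join, rows already come out ordered

              const double total_needs_sort = cost_plan_needs_sort + filesort_cost(est_result_rows);

              std::printf("join+filesort: %.0f  ordered join: %.0f  -> keep the %s plan\n",
                          total_needs_sort, cost_plan_ordered,
                          total_needs_sort < cost_plan_ordered ? "join+filesort" : "ordered");
              return 0;
            }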


            serg Sergei Golubchik added a comment -

            2. Another idea: calculate three plans in parallel. That is, currently, best_access_path() does

                  if (tmp + 0.0001 < best_time - records/(double) TIME_FOR_COMPARE)
                  {
                    best_time= tmp + records/(double) TIME_FOR_COMPARE;
                    best= tmp;
                    best_records= records;
                    best_key= start_key;
                    best_max_key_part= max_key_part;
                    best_ref_depends_map= found_ref;
                  }
            

            Instead it can do the following:

                  enum { NO_FILESORT, TMPTABLE_FILESORT, FIRST_TABLE_FILESORT } plan_type;
                  if (tmp + 0.0001 < best_time[plan_type] - records/(double) TIME_FOR_COMPARE)
                  {
                    best_time[plan_type]= tmp + records/(double) TIME_FOR_COMPARE;
                    best[plan_type]= tmp;
                    best_records[plan_type]= records;
                    best_key[plan_type]= start_key;
                    best_max_key_part[plan_type]= max_key_part;
                    best_ref_depends_map[plan_type]= found_ref;
                  }
            

            Basically, it finds the three best plans: one that uses no filesort, one that does a filesort after the first table, and one that uses a temporary table and filesort at the end. When the join optimizer has finished, we select the best of the three plans, taking the filesort cost into account.

            psergei Sergei Petrunia made changes -
            alice Alice Sherepa made changes -
            julien.fritsch Julien Fritsch made changes -
            Support case ID 9377 not-9377
            ralf.gebhardt Ralf Gebhardt made changes -
            Fix Version/s 10.4 [ 22408 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            NRE Projects RM_105_CANDIDATE
            ralf.gebhardt Ralf Gebhardt made changes -
            NRE Projects RM_105_CANDIDATE RM_105_CANDIDATE RM_105_OPTIMIZER
            varun Varun Gupta (Inactive) made changes -
            Assignee Sergei Petrunia [ psergey ] Varun Gupta [ varun ]
            varun Varun Gupta (Inactive) made changes -
            Status Stalled [ 10000 ] In Progress [ 3 ]
            varun Varun Gupta (Inactive) made changes -

            psergei Sergei Petrunia added a comment -

            Takeaways from browsing through some more-or-less related papers:

            "Interesting orders"

            A traditional query optimizer relies on the concept of "interesting orders"; ORDER BY and GROUP BY specify interesting orders, for example.

            The optimizer tries to keep (i.e. not prune away) the query plan [fragments] that produce rows in interesting orders.

            For example, if there is a query plan [fragment] QP1 that produces an interesting order, and a query plan QP2 (or a fragment for the same part of the query) which is cheaper but does not produce that interesting order, then QP1 is not pruned in favor of QP2.

            Peculiar property of the LIMIT-n construct

            Consider the process of constructing a regular join plan for "t1 JOIN t2 JOIN t3".

            We start with "t1 JOIN t2" and compute its cost.
            Then we add t3 as the third table. The plan for "t1 JOIN t2 JOIN t3" still includes the plan for "t1 JOIN t2" inside it.

            Now, consider "t1 JOIN t2 JOIN t3 ORDER BY ... LIMIT N".

            Let's assume that short-cutting is possible (execution stops after producing N matches) and try to pick the first table. Suppose it's t1.

            What value of LIMIT should we use for it? It depends on the fanout of t2, t3 (which we do not know yet).

            We could get an estimate of the fanout by running the join optimizer and checking how many rows are expected to be in the join output. @serg's comment above talks about this as well.
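
            To make the dependency concrete, a toy calculation (invented fanout values, not the optimizer's code): if joining the remaining tables turns each t1 row into fanout_rest output rows on average, then producing N output rows needs only about N / fanout_rest rows from t1.

            #include <cmath>
            #include <cstdio>

            int main() {
              const double limit_n = 1000;    // the N in ORDER BY ... LIMIT N

              // Hypothetical average fanouts of the tables joined after t1
              // (unknown at the time the first table is being chosen).
              const double fanout_t2 = 0.8;   // e.g. a selective eq_ref lookup
              const double fanout_t3 = 5.0;   // e.g. a 1:many ref lookup

              const double fanout_rest = fanout_t2 * fanout_t3;
              const double rows_needed_from_t1 = std::ceil(limit_n / fanout_rest);

              std::printf("effective limit for t1: read ~%.0f rows to get %.0f join matches\n",
                          rows_needed_from_t1, limit_n);
              return 0;
            }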

            Simplified approach - satisfy ORDER BY, ignore the LIMIT

            Can we just assume that any query plan that produces the matching order is good enough?


            psergei Sergei Petrunia added a comment -

            Notes from the optimizer call:
            Igor suggests also considering query plans where the first N tables are written into a temporary table, then passed to filesort, and the sorted output is then joined with the rest of the tables. That is, bushy plans with a "nest" at the front. TODO: elaborate on this proposal.

            varun Varun Gupta (Inactive) made changes -
            Fix Version/s 10.5 [ 23123 ]

            varun Varun Gupta (Inactive) added a comment -

            Details about the approach to be taken, termed "ORDER BY LIMIT (bushy plan)"

            General description

            • No need for this optimization when we have grouping (DISTINCT, GROUP BY, window functions or aggregate functions)
            • We can also disallow this bushy plan if we have a HAVING clause that is not pushed into the WHERE clause
            • For the bushy plan we will not need anything different until we get a partial join order that supports the ORDER BY clause

            So the approach for the bushy plan (if chosen as the best plan) would be:

            1. Compute the partial join (that covers the ordering)
            2. Store the result of the partial join in a temp table
            3. Sort the temp table
            4. Join the temp table with the remaining tables

            Note: the bush strategy still allows the LIMIT to shortcut execution early (see the sketch after this list):

            • The part of the join that is inside the bush will need to be computed fully (and passed to filesort)
            • After that, we can short-cut the join execution as soon as we’ve found #LIMIT matches.
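
            A rough sketch of the execution shape this describes (standalone code, not the server's executor; the row counts and the join predicate are invented): the nest is materialized and sorted in full, and only the join loop that follows it can stop early.

            #include <algorithm>
            #include <cstdio>
            #include <vector>

            int main() {
              const std::size_t limit = 1000;

              // Steps 1+2: the part of the join inside the bush is computed fully and sorted.
              std::vector<int> nest;                    // stands in for the temporary table
              for (int i = 100000; i > 0; --i)
                nest.push_back(i);
              std::sort(nest.begin(), nest.end());      // stands in for the filesort

              // Step 3: join the sorted nest with the remaining tables, short-cutting on LIMIT.
              std::size_t produced = 0, nest_rows_read = 0;
              for (int sort_key : nest) {
                ++nest_rows_read;
                const bool matches_remaining_tables = (sort_key % 2 == 0);  // invented predicate
                if (matches_remaining_tables && ++produced == limit)
                  break;                                // the LIMIT shortcut applies only here
              }
              std::printf("nest rows materialized: %zu, read after sorting: %zu, rows produced: %zu\n",
                          nest.size(), nest_rows_read, produced);
              return 0;
            }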
            varun Varun Gupta (Inactive) added a comment - - edited

            Optimizer part

            Take the LIMIT part of ORDER BY LIMIT into account during join optimization

            The biggest difference in query execution for ORDER BY … LIMIT plans comes from the ability (or inability) to terminate query execution early, as soon as #LIMIT matches have been found.

            Two issues for the optimizer here:

            • Data distribution: it’s difficult to predict whether an index scan will find the matching rows early or late (TODO rephrase). Perhaps we should add a penalty for plans that take such risks.
            • Partial plans and LIMIT: the whole plan has to produce N rows. How many rows should a partial plan produce for that? This depends on the fanout of the rest of the plan.

            Suggested answer for the second: run the join optimizer and compute the total join output cardinality. Then we can see which fraction of the join output we will need, and assume that we will need to do the same fraction of the work for each table (TODO: our optimizer typically heavily over-estimates cardinality due to the unknown selectivity of predicates. Will this cause overly-optimistic estimates? See the sketch below.)

            (TODO2: another suggestion is to
            1. Force the construction of a query plan which produces rows in a suitable order (that is: don’t prune it away even if there are cheaper query plans which produce rows in a non-suitable order)
            2. Take the LIMIT shortcut into account only after the whole query plan has been constructed)
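
            A tiny numeric illustration of the fraction-of-output suggestion above and of the over-estimation concern (invented numbers, not the server's cost model): the fraction of work charged to a partial plan is limit / estimated_join_output, so a 10x cardinality over-estimate makes the short-cutting plan look 10x cheaper than it really is.

            #include <cstdio>
            #include <initializer_list>

            int main() {
              const double limit_n        = 1000;
              const double full_plan_cost = 2.0e6;   // hypothetical cost of computing the whole join

              // Compare a correct output estimate with a 10x over-estimate.
              for (double est_join_output : {50000.0, 500000.0}) {
                double fraction = limit_n / est_join_output;
                if (fraction > 1.0)
                  fraction = 1.0;
                std::printf("estimated output %.0f -> fraction %.4f -> scaled plan cost %.0f\n",
                            est_join_output, fraction, full_plan_cost * fraction);
              }
              // The 10x over-estimate shrinks the fraction 10x, so the plan is costed at 4000
              // instead of 40000: an overly-optimistic estimate, as the TODO above suspects.
              return 0;
            }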

            Add a sort operation when the join prefix allows it

            We can add a sorting operation if all of the columns that are required to compute the sort key are in the join prefix. Note that the check should take into account equality propagation (See MDEV-8989).

            Example:

            WHERE
            t2.a= t3.a and t3.a=t4.a
            ORDER BY t1.a, t2.a

            For each reference to tblX.columnY in the ORDER BY list, we get either:
            • a single table_map referring to tblX, or
            • if tblX is part of a multiple equality, a bitmap of the tables whose fields participate in that multiple equality.
            For this example, we would get:

              table_map_1= {t1}, 
              table_map_2= {t2, t3, t4}
            

            Then, when we have a join prefix, we can check if it covers all the table_map values that we have.
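
            A minimal sketch of that coverage check (table_map here is just a plain bitmask; this is illustrative code, not the server's implementation): the prefix can support the ordering if, for every ORDER BY element, at least one table that can provide its value is inside the prefix.

            #include <cstdint>
            #include <cstdio>
            #include <vector>

            using table_map = std::uint64_t;   // one bit per table, as in the example above

            // True if every ORDER BY element has at least one of its candidate tables
            // (the direct table or a member of its multiple equality) inside the join prefix.
            static bool prefix_covers_ordering(table_map prefix,
                                               const std::vector<table_map>& order_by_maps) {
              for (table_map m : order_by_maps)
                if ((m & prefix) == 0)
                  return false;
              return true;
            }

            int main() {
              const table_map T1 = 1, T2 = 2, T3 = 4, T4 = 8;

              // ORDER BY t1.a, t2.a with WHERE t2.a=t3.a AND t3.a=t4.a:
              const std::vector<table_map> order_by_maps = { T1, T2 | T3 | T4 };

              std::printf("{t1}     covers ordering: %d\n", prefix_covers_ordering(T1, order_by_maps));
              std::printf("{t1,t3}  covers ordering: %d\n", prefix_covers_ordering(T1 | T3, order_by_maps));
              std::printf("{t2,t4}  covers ordering: %d\n", prefix_covers_ordering(T2 | T4, order_by_maps));
              return 0;
            }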

            1.3 Cost of sorting
            It is cheaper to sort fewer rows, so the sort operation should be done at the point in the query plan where the output cardinality is minimal. Add the cost of sorting as soon as the join prefix covers the ordering columns.

            No join buffering in ordered-join mode

            After the sorting nest has been added, the rest of the join is performed in ordered-join mode (the first “table” is sorted, so the join output is sorted too). Join buffering and BKA must be disabled here.

            Representation of the sort operation in the JOIN prefix
            TODO: Should the sort operation be represented as a member of a POSITION object, or as a separate POSITION object?

            1.6 Enumeration of different sorting options
            If we construct a join prefix of t1, t2 and then add a sort operation, we get:

              sort {t1, t2}
            

            If we then consider adding table t3, we add it after the sort:

              sort {t1, t2}, t3
            

            No overlaps with semi/outer joins

            It’s difficult to handle the situation where the sorting operation comes in the middle of outer join or semi-join processing.
            Example:

            SELECT * FROM t1 LEFT JOIN (t2, t3) ON …

            It is hard to make a sort bush of

            {t1, t2}

            here. The suggestion is to not allow this.
            It’s similar for semi-joins: Putting a sort operation after a join prefix that has [partial] duplicate row combinations will add a lot of complexity. Possible variants of condition under which sort operation can be added:


            varun Varun Gupta (Inactive) added a comment -

            Plan refinement part

            No linked join buffers across the bush boundary

            The join buffering code should be adjusted to not do this.

            Create temp. table

            Figure out the set of fields that need to be saved from the tables inside the sort nest, and create a temporary table with those fields.

            Access to tables inside the sort nest from post-sorting context

            Query execution with sort nest:
            Step 1: Compute a join of tables inside the sort nest; store record combinations in the temporary table;
            Step 2: Use filesort on the temporary table
            Step 3: Read filesort result and join it with the tables outside the sort nest
            Step 4: Return the query result (to the client or elsewhere)

            When we are at Steps 3 and 4, we may need to read columns of the tables inside the sort nest.
            This requires some extra arrangement, as the “current” values are now in the temporary table.

            Possible solutions:

            The “Unpack” solution (aka copy-back)

            After we read a row from the temporary table, put the values back into the record buffers of the original tables.
            This is used by:
            • SJ-Materialization-Scan (see rr_sequential_and_unpack())
            • the join buffering code (see JOIN_CACHE::read_all_record_fields())
            • reading the result of filesort’ing a single table

            Advantages: no need to modify any Item objects.
            Disadvantages: Extra overhead to copy the values back.

            The “refer-to-temptable like GROUP BY does” solution

            This is used, for example, by the GROUP BY code. When tableX.fieldY is accessed from a post-grouping context (e.g. when computing the HAVING clause), it is done as follows:

            • The Item_field referring to tableX.fieldY is wrapped inside an Item_ref object.
            • Item_ref::val_int() translates the call to val_int_result() (the same happens for the other datatypes, too).
            • Item_field has two pointers to table fields: “field” refers to the Field in the original table, while “result_field” refers to its corresponding field in the temporary table.
            • Item_field::val_int_result() returns the value from “result_field”, that is, from the temporary table.

            Advantages: no need to copy
            Disadvantages: need to have Item_ref objects (can we use them too, or can only the GROUP BY code?)

            The “refer to temptable, but not like GROUP BY does it” solution

            Is there anything that prevents us from changing the Item_field objects to point to the temptable directly? (This change would need to be undone for SP re-execution.)
            (Note: there is a class Item_temptable_field, but it is the same as Item_field for most purposes.)

            Advantages: no need to copy
            Disadvantages: need to walk Item trees and make substitutions (which then will need to be undone if running as PS or in SP).



            varun Varun Gupta (Inactive) added a comment - The branch is 10.5-mdev8306


            varun Varun Gupta (Inactive) added a comment - While testing for the optimizer part where we enumerate the plans, I found two issues, opened sub-tasks MDEV-19853 and MDEV-19854 .
            varun Varun Gupta (Inactive) added a comment - - edited

            Optimizer part

            Part 1 of the project is done: the planner now also enumerates plans that add a nest to a partial join that satisfies the ORDER BY clause.
            For such plans I have also added the cost of filling and sorting the temporary table that stores the result of the partial join.

            I have also made some progress on the second part:
            1) Representing the order-nest in the top-level select, just as we do for the semi-join materialization strategies.
            Consider a case of 4 tables where the join order is t1, t2, t3, t4 and the order nest is over tables (t1, t2).
            The top-level select would then have 3 tables: <order-nest>, t3, t4.

            This would look like this:

                             <order-nest> ---- t3 ---- t4 
                                   |
                                    +----t1---t2
            

            2) Created the temp table that stores the result of the partial join.

            varun Varun Gupta (Inactive) added a comment - - edited

            More updates

            • Instead of the order-nest, for now I kept a separate JOIN_TAB for the temp table containing the result of the partial join order. This is done because we need to adjust how the JOIN_TABs are counted and allocated (will make a separate patch for this).
            • The iteration over the JOIN_TABs looked a bit complex; we have WITH_BUSH_ROOT and WITHOUT_BUSH_ROOT, which are used to include and exclude the root of the SJ-nest, and we can use them for the order nest too.
            • Also introduced a struct NEST_INFO to store some of the contents of the nest.

            The commits are:
            https://github.com/MariaDB/server/commit/ec395d289533317e30b156d9e3b17afd4b50982e
            https://github.com/MariaDB/server/commit/d3d66daccbc51cf87a96a8c4cea130c0fa404056

            varun Varun Gupta (Inactive) added a comment - - edited

            Updates

            • Making EXPLAIN work for queries that use ORDER BY ... LIMIT; had to add special behaviour for the order nest:
              https://github.com/MariaDB/server/commit/c4139bdb5a87949371f98ec0321d22788dae7814
              This would look like:

            MariaDB [test]> explain select t1.b from t1,t2 where t1.a=t2.a  order by t2.b limit 2;
            +------+-------------+--------------+------+---------------+------+---------+------+------+----------------+
            | id   | select_type | table        | type | possible_keys | key  | key_len | ref  | rows | Extra          |
            +------+-------------+--------------+------+---------------+------+---------+------+------+----------------+
            |    1 | SIMPLE      | t2           | ALL  | NULL          | NULL | NULL    | NULL | 100  |                |
            |    1 | SIMPLE      | <order-nest> | ALL  | NULL          | NULL | NULL    | NULL | 100  | Using filesort |
            |    1 | SIMPLE      | t1           | ALL  | NULL          | NULL | NULL    | NULL | 100  | Using where    |
            +------+-------------+--------------+------+---------------+------+---------+------+------+----------------+

            • Substitution of ORDER BY items with the nest table items; this is needed because the filesort is attached to the temporary table (sort-nest):
              https://github.com/MariaDB/server/commit/47fb57cfef7eb3bae82c6b55af66a144ef3fef5c
            • Introduced a function end_nest_materialization that stores the result of the partial join of the tables inside the sort nest:
              https://github.com/MariaDB/server/commit/03debf755f1238efe48114744f04c73440b73056
              https://github.com/MariaDB/server/commit/a1f06e0346a003f48308145e6db4fe7a55f32a94
            • Substitution of the base items with the nest items; this is currently done for the WHERE clause and the items in the select list, so that we can evaluate the expressions in post-ORDER BY mode:
              https://github.com/MariaDB/server/commit/1e61de2299216800d016188df45ba7af09b111c8
              https://github.com/MariaDB/server/commit/ca15e96f188d5420e32eeda50af9f5f9540d8ce5
            • A simple test case:
              https://github.com/MariaDB/server/commit/82b4674bfdc18ef98f41c88f072259ec9f16f29f
            varun Varun Gupta (Inactive) made changes -

            varun Varun Gupta (Inactive) added a comment -

            Updates

            • Made substitution for ref items, so that they point to the sort-nest fields instead of the base table items:
              https://github.com/MariaDB/server/commit/aa34be302818b974d0e41336baf89d3552559d4d
            • Extracted the conditions from the WHERE clause that are internal to the nest and attached them to the inner tables of the nest:
              https://github.com/MariaDB/server/commit/3e1c17fedd5f4a603e61e3d1c9cc091f7bd2d1a2
            • For the remaining condition, substituted the base items of the inner tables with the sort-nest items:
              https://github.com/MariaDB/server/commit/5e3110531ce6ce04f023787de3a0f566c2d20120
            • Tests added for the above extraction and substitution:
              https://github.com/MariaDB/server/commit/e279814250e1a99bb9ff486b331c9cd9f22e0f46
            • No need to create a sort-nest when there is only one table to sort; this case is already handled in the server:
              https://github.com/MariaDB/server/commit/cb70533b44eaa8a2fa6a136cb4bc894f6c238364
            • Removing constants from the ORDER BY clause before picking the join order:
              https://github.com/MariaDB/server/commit/c3331cd8598c9e3a903ee70674148a8aec51c2d4
            • Equality propagation for the ORDER BY items (including expressions):
              https://github.com/MariaDB/server/commit/c7d0dc62c24d8a6f49ff1d328bfc7be2f98292cf
            • Substitution of ORDER BY items with the first field in the multiple equality, if found:
              https://github.com/MariaDB/server/commit/b9e9679e53c6426f6652527d3c205fb221b15e37
            varun Varun Gupta (Inactive) made changes -
            varun Varun Gupta (Inactive) made changes -
            varun Varun Gupta (Inactive) added a comment - - edited

            Tasks left:

            1. Need to get a bush for the order-nest (this was the suggestion that I discarded earlier as it was taking me more time to do it that way)
            2. ON expressions and add test cases
            3. MDEV-19854: Using indexes that can achieve ordering
            4. Transform function for subqueries
            5. Test if Item_ref needs some special handling
            6. Adding more tests for subqueries and outer joins
            7. Make sure that the entire regression test suite passes
            8. MDEV-20129 should be pushed along with MDEV-8306 (this is done, needs review)
            9. There is some work to handle the case when the first non-const table satisfies the ordering, e.g. enabling join buffering if the index used achieves the ordering for the first non-const table
            10. Use cache_record_length to get a more precise rec_length for the temp table instead of using AVG_REC_LEN
            11. Improve ANALYZE and EXPLAIN format=json for the sort-nest (MDEV-20146)
            12. Handling for blobs (need to disable the sort-nest if sorting cannot be done with them)
            13. Addressing review comments and adding proper code comments
            14. Also enable the sort-nest for multi-table UPDATE and DELETE queries
            varun Varun Gupta (Inactive) made changes -
            serg Sergei Golubchik made changes -
            Priority Major [ 3 ] Critical [ 2 ]
            varun Varun Gupta (Inactive) added a comment - - edited

            Issue with DEPENDENT SUBQUERY

            Case 1: when the subquery is attached to the tables inside the nest

            Query:

            SELECT abs(t1.a) as x, t2.a,t3.a FROM t1,t2,t3
            WHERE t1.b = t2.b and t3.b=t1.b and
            EXISTS (select 1 from t4 where t4.b=t1.c and t4.b < 4 group by t4.c having max(t4.a) = x)
            ORDER BY t2.a desc, t1.a desc limit 5;
            

            x refers to the item abs(t1.a) as x in the parent select, so the subquery has a reference to a select-list element.
            So when I substitute the select-list items with sort-nest items, I also (unintentionally) change the outer reference to x.

            The idea to fix this would be:

            • From the select list, filter out the items that are internal to the sort-nest
            • Create temporary table fields for all the items found above; this means that abs(t1.a) would also be in the sort-nest table
            • Then, in the select list, substitute abs(t1.a) as x with sort-nest(abs(t1.a)) as x
            • This also requires a change to the ref_pointer_array, as its entries should point to the newly substituted items in the select list

            varun Varun Gupta (Inactive) added a comment -

            There was some more progress on taking into consideration an index that achieves the ordering for the first non-const table (MDEV-19854).
            The case in the description now takes the limit into consideration and we get the desired plan.

            Here is the output of the explain:
            https://jira.mariadb.org/browse/MDEV-19854?focusedCommentId=132928&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-132928

            varun Varun Gupta (Inactive) added a comment - - edited

            While testing with dbt3, I found an example where the sort-nest optimization does not pick the best access methods for the tables:

            MariaDB [dbt3]> EXPLAIN SELECT   c_nationkey, c_name, o_orderDate, o_totalprice, c_acctbal, n_name FROM   orders, customer, nation  WHERE   c_custkey= o_custkey AND c_nationkey=n_nationkey AND   o_orderDATE >='1997-07-30' and o_orderDATE <= '1998-07-30'  ORDER BY o_orderDATE, o_totalprice DESC LIMIT 10;
            +------+-------------+----------+--------+---------------------------+---------------+---------+-----------------------+------+----------------------------------------------------+
            | id   | select_type | table    | type   | possible_keys             | key           | key_len | ref                   | rows | Extra                                              |
            +------+-------------+----------+--------+---------------------------+---------------+---------+-----------------------+------+----------------------------------------------------+
            |    1 | SIMPLE      | orders   | range  | i_o_orderdate,i_o_custkey | i_o_orderdate | 4       | NULL                  | 217  | Using index condition; Using where; Using filesort |
            |    1 | SIMPLE      | customer | eq_ref | PRIMARY,i_c_nationkey     | PRIMARY       | 4       | dbt3.orders.o_custkey | 1    |                                                    |
            |    1 | SIMPLE      | nation   | ALL    | PRIMARY                   | NULL          | NULL    | NULL                  | 25   | Using where                                        |
            +------+-------------+----------+--------+---------------------------+---------------+---------+-----------------------+------+----------------------------------------------------+
            3 rows in set (0.005 sec)
             
            MariaDB [dbt3]> set use_sort_nest=0;
            Query OK, 0 rows affected (0.000 sec)
             
            MariaDB [dbt3]> EXPLAIN SELECT   c_nationkey, c_name, o_orderDate, o_totalprice, c_acctbal, n_name FROM   orders, customer, nation  WHERE   c_custkey= o_custkey AND c_nationkey=n_nationkey AND   o_orderDATE >='1997-07-30' and o_orderDATE <= '1998-07-30'  ORDER BY o_orderDATE, o_totalprice DESC LIMIT 10;
            +------+-------------+----------+--------+---------------------------+---------------+---------+---------------------------+------+----------------------------------------------------+
            | id   | select_type | table    | type   | possible_keys             | key           | key_len | ref                       | rows | Extra                                              |
            +------+-------------+----------+--------+---------------------------+---------------+---------+---------------------------+------+----------------------------------------------------+
            |    1 | SIMPLE      | orders   | range  | i_o_orderdate,i_o_custkey | i_o_orderdate | 4       | NULL                      | 217  | Using index condition; Using where; Using filesort |
            |    1 | SIMPLE      | customer | eq_ref | PRIMARY,i_c_nationkey     | PRIMARY       | 4       | dbt3.orders.o_custkey     | 1    | Using where                                        |
            |    1 | SIMPLE      | nation   | eq_ref | PRIMARY                   | PRIMARY       | 4       | dbt3.customer.c_nationkey | 1    |                                                    |
            +------+-------------+----------+--------+---------------------------+---------------+---------+---------------------------+------+----------------------------------------------------+
            3 rows in set (0.005 sec)
            
            

            So with use_sort_nest=1 we use a table scan on table nation while eq_ref access could have been used; somehow the optimizer is not picking it.


            varun Varun Gupta (Inactive) added a comment -

            The optimizer trace shows that the table scan was preferred:

                                "rest_of_plan": [
                                  {
                                    "plan_prefix": ["orders", "customer"],
                                    "table": "nation",
                                    "sort-nest": ["orders"],
                                    "best_access_path": {
                                      "considered_access_paths": [
                                        {
                                          "access_type": "eq_ref",
                                          "index": "PRIMARY",
                                          "rows": 1,
                                          "cost": 217,
                                          "chosen": true
                                        },
                                        {
                                          "access_type": "scan",
                                          "resulting_rows": 25,
                                          "cost": 29,
                                          "chosen": true
                                        }
                                      ]
            

            varun Varun Gupta (Inactive) made changes -
            varun Varun Gupta (Inactive) made changes -
            Assignee Varun Gupta [ varun ] Igor Babaev [ igor ]
            Status In Progress [ 3 ] In Review [ 10002 ]
            varun Varun Gupta (Inactive) made changes -
            varun Varun Gupta (Inactive) made changes -
            Summary Poor optimization of JOIN and ORDER BY ... LIMIT Complete cost-based optimization for ORDER BY with LIMIT

            varun Varun Gupta (Inactive) added a comment -

            Implementation details

            Finding whether the query can use the cost-based approach for ORDER BY with LIMIT

            • sort_nest_allowed()
            • is_order_by_expensive()

            Equality propagation for ORDER BY items

            Propagate the multiple equalities to all the items in the ORDER BY list. This
            lets us consider other join orders whose prefix can resolve the ORDER BY
            clause (a hypothetical example follows the list below).

            • propagate_equal_field_for_orderby()
            • remove_const_from_order_by()
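
            A hypothetical illustration (table and column names are made up, not taken
            from this task):

                select t1.a, t2.b
                from t1, t2
                where t1.a = t2.a
                order by t1.a
                limit 10;

                -- After propagating the equality t1.a = t2.a into the ORDER BY list,
                -- ORDER BY t1.a can also be treated as ORDER BY t2.a, so a join order
                -- whose prefix starts with t2 read through an index on t2.a can
                -- resolve the ordering without a filesort.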

            Find indexes that can resolve the ORDER BY clause

            Walk through all the usable indexes of a table and build a map of the keys
            that can resolve the ORDER BY clause.

            • find_keys_that_can_achieve_ordering()

            Get the cardinality estimate for the join

            Run the join planner to get the estimate of the cardinality of the join.
            This is needed as we want to apply the limit selectivity later while calculating
            the cost of the best plan.

            • estimate_cardinality_for_join()

            Run the join planner to get the best plan that will consider using a sort nest

            optimizer_prune_level=0
            This means no heuristic pruning of partial plans; partial plans are pruned
            only against the cost of the best plan found so far.

            For each prefix, if the ORDER BY clause can be resolved by the tables
            in the prefix, then:

            a) We consider adding a nest around the prefix. A separate call to
            best_extension_by_limited_search is made.

            • consider_adding_sort_nest()

            b) Add the cost of sorting.

            • sort_nest_oper_cost()

            c) Apply the limit selectivity. Once the nest is in place we can short-cut
            the execution: we only charge for the fraction of rows that we would read
            from the nest to produce the rows in the output. This is how the cost of
            the shortcut is taken into account (a rough worked example follows the
            bullet below).

            • set_fraction_output_for_nest
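
            A rough worked example (the numbers are invented; this is only a sketch of
            the idea): if the estimated cardinality of the whole join is 100,000 rows
            and the query has LIMIT 10, the fraction of the sort-nest output we expect
            to read is roughly

                fraction = LIMIT / estimated_join_cardinality = 10 / 100000 = 0.0001

            so the cost of the tables joined after the nest is scaled by this fraction
            instead of being charged in full.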

            d) The special case where an index can be used to resolve the ORDER BY
            clause is also considered here. We calculate the cost of accessing an index
            on the first non-const table that can resolve the ORDER BY clause; this can
            help in short-cutting the join execution (a rough cost sketch follows the
            bullets below).

            • get_best_index_for_order_by_limit()
            • find_cost_of_index_with_ordering()
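
            A hedged sketch of the cost idea (the exact formula used by
            find_cost_of_index_with_ordering() may differ): if an index on the first
            non-const table returns rows in the required order, and only a fraction of
            those rows survive the rest of the join and the WHERE clause, then we
            expect to read roughly

                rows_to_read ≈ LIMIT / fraction_of_rows_surviving_the_join
                cost        ≈ rows_to_read * per_row_index_read_cost

            rows from that index before LIMIT output rows are produced, which can be
            far cheaper than reading the whole table and sorting it.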

            After the best plan is picked, we create the sort-nest info structure if the best plan uses a sort-nest.

            • check_if_sort_nest_present()
            • create_sort_nest_info()

            For the first non-const table that uses an index to resolve the ordering, set up a range/index scan.

            • setup_index_use_for_ordering()
            • setup_range_scan()

            Creating the temp table for the sort-nest

            We create a list of the field items of the tables inside the nest that need
            to be stored in the temporary table.
            Then we also create a list of items for the fields of the temporary table.

            • make_sort_nest()

            Join buffering is disabled as it can break ordering

            • is_join_buffering_allowed()

            Substitution of items belonging to tables inside the nest with the items of the temp table of the nest

            Here we substitute the items of the tables inside the sort-nest with items
            of the temporary table of the sort-nest.
            The clauses where this substitution happens are:

            1. SELECT LIST
            2. ON-EXPR
            3. REF ACCESS items [for SJM lookup also]
            4. WHERE clause
            5. ORDER BY clause
            • substitute_base_with_nest_field_items()
              transform() is called with Item::replace_with_nest_items passed as the callback function

            Materializing the result inside the temp table of the sort-nest

            At the execution phase we materialize the result of the inner tables inside
            the sort-nest and then continue the execution (a conceptual SQL sketch
            follows the bullet below).

            • end_nest_materialization()
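
            To make the execution model concrete, here is a hypothetical SQL sketch
            based on the dbt3 example above, assuming the sort nest contains only the
            orders table (the sort_nest table name and its column list are illustrative,
            chosen to cover the select list, the ORDER BY and the join key to customer):

                create temporary table sort_nest as
                select o_orderDATE, o_totalprice, o_custkey
                from orders
                where o_orderDATE >= '1997-07-30' and o_orderDATE <= '1998-07-30'
                order by o_orderDATE, o_totalprice desc;

                -- Execution then joins sort_nest to customer and nation in this
                -- already-sorted order and stops as soon as LIMIT 10 rows are produced.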
            bradjorgensen Brad Jorgensen made changes -
            igor Igor Babaev (Inactive) made changes -
            Status In Review [ 10002 ] Stalled [ 10000 ]
            igor Igor Babaev (Inactive) made changes -
            Assignee Igor Babaev [ igor ] Varun Gupta [ varun ]
            igor Igor Babaev (Inactive) made changes -
            Comment [ Review is not needed as it is a header mdev now. ]
            igor Igor Babaev (Inactive) made changes -
            Assignee Varun Gupta [ varun ] Igor Babaev [ igor ]
            igor Igor Babaev (Inactive) made changes -
            Status Stalled [ 10000 ] Open [ 1 ]
            varun Varun Gupta (Inactive) made changes -
            Status Open [ 1 ] Confirmed [ 10101 ]
            igor Igor Babaev (Inactive) made changes -
            Status Confirmed [ 10101 ] In Review [ 10002 ]

            psergei Sergei Petrunia added a comment -

            Before MDEV-8306, best_extension_by_limited_search had this code:

                    if (join->sort_by_table &&
                        join->sort_by_table !=
                        join->positions[join->const_tables].table->table)
                      /*
                         We may have to make a temp table, note that this is only a
                         heuristic since we cannot know for sure at this point.
                         Hence it may be wrong.
                      */
                      current_read_time= COST_ADD(current_read_time, current_record_count);
            

            Now, I see only the comment is left:

                        /*
                          We may have to make a temp table, note that this is only a
                          heuristic since we cannot know for sure at this point.
                          Hence it may be wrong.
                        */
                        cost= current_record_count;
            

            Does this mean we can't provide the old behavior? (Yes, I know it is ugly and incorrect.) Still, being able to provide the old behavior has value, and we should discuss here whether we could do it.


            varun Varun Gupta (Inactive) added a comment -

            The entire snippet is:

                  else
                  {
                    /*
                      'join' is either the best partial QEP with 'search_depth' relations,
                      or the best complete QEP so far, whichever is smaller.
                    */
                    if (!nest_created &&
                        ((join->sort_by_table &&
                          join->sort_by_table !=
                          join->positions[join->const_tables].table->table) ||
                          join->sort_nest_possible))
                    {
                      double cost;
                      if (join->sort_nest_possible)
                      {
                        ulong rec_len= cache_record_length_for_nest(join, idx);
                        cost= join->sort_nest_oper_cost(partial_join_cardinality, idx,
                                                        rec_len);
                        trace_one_table.add("cost_of_sorting", cost);
                      }
                      else
                      {
                        /*
                          We may have to make a temp table, note that this is only a
                          heuristic since we cannot know for sure at this point.
                          Hence it may be wrong.
                        */
                        cost= current_record_count;
                      }
             
                      current_read_time= COST_ADD(current_read_time, cost);
                    }
            

            So when the ORDER BY LIMIT optimization is not enabled, we enter this branch:

            else
                      {
                        /*
                          We may have to make a temp table, note that this is only a
                          heuristic since we cannot know for sure at this point.
                          Hence it may be wrong.
                        */
                        cost= current_record_count;
                      }
            

            And then this cost of sorting is added to the cost of the current plan

             
            current_read_time= COST_ADD(current_read_time, cost);
            

            So we still provide the same old behaviour when the ORDER BY LIMIT optimization is not enabled.
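
            For reference, a small worked example of that old heuristic (the numbers are
            invented): if a partial plan is expected to produce current_record_count =
            50,000 rows and its first table is not the table that could provide the
            ORDER BY order, the optimizer in effect does

                current_read_time= COST_ADD(current_read_time, 50000);

            i.e. it adds one cost unit per row as a stand-in for the cost of writing
            those rows to a temporary table and sorting them.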



            varun Varun Gupta (Inactive) added a comment - The rebased code on top of 10.5 is in the branch 10.5-mdev8306-2
            serg Sergei Golubchik made changes -
            psergei Sergei Petrunia made changes -
            varun Varun Gupta (Inactive) made changes -
            Assignee Igor Babaev [ igor ] Varun Gupta [ varun ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Fix Version/s 10.6 [ 24028 ]
            Fix Version/s 10.5 [ 23123 ]
            varun Varun Gupta (Inactive) made changes -
            Status In Review [ 10002 ] Stalled [ 10000 ]
            varun Varun Gupta (Inactive) made changes -
            serg Sergei Golubchik made changes -
            Rank Ranked higher


            psergei Sergei Petrunia added a comment - A related bug in MySQL to look at: https://bugs.mysql.com/bug.php?id=97001
            varun Varun Gupta (Inactive) made changes -
            Assignee Varun Gupta [ varun ] Sergei Petrunia [ psergey ]
            Status Stalled [ 10000 ] In Review [ 10002 ]

            varun Varun Gupta (Inactive) added a comment -

            The updated branch is 10.6-order_by_limt.
            A push to buildbot has also been made:
            http://buildbot.askmonty.org/buildbot/grid?category=main&branch=bb-10.6-varun

            psergei Sergei Petrunia made changes -
            Fix Version/s 10.7 [ 24805 ]
            Fix Version/s 10.6 [ 24028 ]
            rob.schwyzer@mariadb.com Rob Schwyzer (Inactive) made changes -
            Labels optimizer order-by-optimization ServiceNow optimizer order-by-optimization
            rob.schwyzer@mariadb.com Rob Schwyzer (Inactive) made changes -
            rob.schwyzer@mariadb.com Rob Schwyzer (Inactive) made changes -
            Labels ServiceNow optimizer order-by-optimization 76qDvLB8Gju6Hs7nk3VY3EX42G795W5z optimizer order-by-optimization
            serg Sergei Golubchik made changes -
            Labels 76qDvLB8Gju6Hs7nk3VY3EX42G795W5z optimizer order-by-optimization optimizer order-by-optimization
            serg Sergei Golubchik made changes -
            Priority Critical [ 2 ] Major [ 3 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Fix Version/s 10.8 [ 26121 ]
            Fix Version/s 10.7 [ 24805 ]
            serg Sergei Golubchik made changes -
            Workflow MariaDB v3 [ 69896 ] MariaDB v4 [ 131761 ]
            serg Sergei Golubchik made changes -
            Fix Version/s 10.9 [ 26905 ]
            Fix Version/s 10.8 [ 26121 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Fix Version/s 10.10 [ 27530 ]
            Fix Version/s 10.9 [ 26905 ]
            serg Sergei Golubchik made changes -
            Fix Version/s 10.11 [ 27614 ]
            Fix Version/s 10.10 [ 27530 ]
            AirFocus AirFocus made changes -
            Description
            julien.fritsch Julien Fritsch made changes -
            Description
            julien.fritsch Julien Fritsch made changes -
            Description
            julien.fritsch Julien Fritsch made changes -
            Priority Major [ 3 ] Critical [ 2 ]
            psergei Sergei Petrunia made changes -
            Fix Version/s 10.12 [ 28320 ]
            Fix Version/s 10.11 [ 27614 ]
            serg Sergei Golubchik made changes -
            psergei Sergei Petrunia made changes -
            Status In Review [ 10002 ] Stalled [ 10000 ]
            psergei Sergei Petrunia made changes -
            Priority Critical [ 2 ] Major [ 3 ]
            psergei Sergei Petrunia made changes -
            Fix Version/s 11.1 [ 28549 ]
            Fix Version/s 11.0 [ 28320 ]
            psergei Sergei Petrunia made changes -
            Fix Version/s 11.2 [ 28603 ]
            Fix Version/s 11.1 [ 28549 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Fix Version/s 11.3 [ 28565 ]
            Fix Version/s 11.2 [ 28603 ]
            psergei Sergei Petrunia made changes -
            Fix Version/s 11.4 [ 29301 ]
            Fix Version/s 11.3 [ 28565 ]
            psergei Sergei Petrunia made changes -
            Labels optimizer order-by-optimization optimizer optimizer-feature order-by-optimization
            julien.fritsch Julien Fritsch made changes -
            Issue Type Task [ 3 ] New Feature [ 2 ]
            psergei Sergei Petrunia made changes -
            Fix Version/s 11.5 [ 29506 ]
            Fix Version/s 11.4 [ 29301 ]
            serg Sergei Golubchik made changes -
            Fix Version/s 11.6 [ 29515 ]
            Fix Version/s 11.5 [ 29506 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Fix Version/s 11.7 [ 29815 ]
            Fix Version/s 11.6 [ 29515 ]
            mariadb-jira-automation Jira Automation (IT) made changes -
            Zendesk Related Tickets 201658 202060 159278
            Zendesk active tickets 201658
            serg Sergei Golubchik made changes -
            Fix Version/s 11.8 [ 29921 ]
            Fix Version/s 11.7 [ 29815 ]
            serg Sergei Golubchik made changes -
            alice Alice Sherepa made changes -
            alice Alice Sherepa made changes -
            serg Sergei Golubchik made changes -
            psergei Sergei Petrunia made changes -
            Fix Version/s 11.9 [ 29945 ]
            Fix Version/s 11.8 [ 29921 ]
            psergei Sergei Petrunia made changes -
            Fix Version/s 12.1 [ 29992 ]
            Fix Version/s 12.0 [ 29945 ]
            julien.fritsch Julien Fritsch made changes -
            Sprint Server 12.1 dev sprint [ 793 ]
            julien.fritsch Julien Fritsch made changes -
            Sprint Server 12.1 dev sprint [ 793 ]

            People

              psergei Sergei Petrunia
              psergei Sergei Petrunia
              Votes: 23
              Watchers: 29