[MDEV-8306] Complete cost-based optimization for ORDER BY with LIMIT Created: 2015-06-11  Updated: 2023-12-16

Status: Stalled
Project: MariaDB Server
Component/s: Optimizer
Fix Version/s: 11.5

Type: New Feature Priority: Major
Reporter: Sergei Petrunia Assignee: Sergei Petrunia
Resolution: Unresolved Votes: 22
Labels: optimizer, optimizer-feature, order-by-optimization

Issue Links:
Blocks
is blocked by MDEV-21643 test correctness of cost_based_order_... Stalled
Duplicate
is duplicated by MDEV-6813 ORDER BY limit optimizer doesn't take... Closed
is duplicated by MDEV-14569 Query runs slower using a UNIQUE KEY ... Closed
PartOf
includes MDEV-8002 Ability to use index for order-by and... Open
includes MDEV-21713 LIMIT optimization and selectivity: p... Open
includes MDEV-22360 Sufficient conditions for accurate ca... Stalled
Relates
relates to MDEV-8880 Very suboptimal join order is generat... Stalled
relates to MDEV-13275 Query optimizer not picking optimal O... Stalled
relates to MDEV-13694 Wrong result upon GROUP BY with order... Closed
relates to MDEV-14621 Query very slow compared to Mysql Closed
relates to MDEV-16214 Incorrect plan taken by the optimizer... Closed
relates to MDEV-18094 Query with order by limit picking ind... Closed
relates to MDEV-19808 Add Optimizer Switch for Filesort wit... Open
relates to MDEV-20129 Equality propagation for ORDER BY ite... Stalled
relates to MDEV-20209 Using different index with same range... Closed
relates to MDEV-20459 Selectivity of equality condition in ... Stalled
relates to MDEV-21408 Performances testing for ORDER BY wit... Open
Sub-Tasks:
Key
Summary
Type
Status
Assignee
MDEV-19853 Heuristic pruning prunes partial plan... Technical task Open Sergei Petrunia  
MDEV-19854 Using indexes that can achieve ordering Technical task Stalled Sergei Petrunia  
MDEV-20503 Optimizer trace support for order by ... Technical task Open Sergei Petrunia  
MDEV-20146 Analyze and Explain for sort-nest Technical task Stalled Sergei Petrunia  

 Description   

A long standing (and informally known) issue:

Join optimizer makes its choices [almost] without regard for ORDER BY ... LIMIT clause. ORDER BY ... LIMIT optimizer is invoked when the join order is already fixed. If the picked join order doesn't allow to resolve ORDER BY ... LIMIT efficiently... then we end up with a very poor query plan.

Example:

select * from
  t_fact
    join dim1 on t_fact.dim1_id= dim1.dim1_id
    join dim2 on t_fact.dim2_id= dim2.dim2_id
order by
   t_fact.col1
limit 1000;

+------+-------------+--------+--------+-----------------+---------+---------+-------------------+------+---------------------------------+
| id   | select_type | table  | type   | possible_keys   | key     | key_len | ref               | rows | Extra                           |
+------+-------------+--------+--------+-----------------+---------+---------+-------------------+------+---------------------------------+
|    1 | SIMPLE      | dim1   | ALL    | PRIMARY         | NULL    | NULL    | NULL              |  500 | Using temporary; Using filesort |
|    1 | SIMPLE      | t_fact | ref    | dim1_id,dim2_id | dim1_id | 4       | j3.dim1.dim1_id   |    1 |                                 |
|    1 | SIMPLE      | dim2   | eq_ref | PRIMARY         | PRIMARY | 4       | j3.t_fact.dim2_id |    1 |                                 |
+------+-------------+--------+--------+-----------------+---------+---------+-------------------+------+---------------------------------+

This uses filesort and takes ~8 sec.
Now, let's force the right join order:

select * from
  t_fact
    straight_join dim1 on t_fact.dim1_id= dim1.dim1_id
    straight_join dim2 on t_fact.dim2_id= dim2.dim2_id
order by
   t_fact.col1
limit 1000;

+------+-------------+--------+--------+-----------------+---------+---------+-------------------+------+-------+
| id   | select_type | table  | type   | possible_keys   | key     | key_len | ref               | rows | Extra |
+------+-------------+--------+--------+-----------------+---------+---------+-------------------+------+-------+
|    1 | SIMPLE      | t_fact | index  | dim1_id,dim2_id | col1    | 4       | NULL              | 1000 |       |
|    1 | SIMPLE      | dim1   | eq_ref | PRIMARY         | PRIMARY | 4       | j3.t_fact.dim1_id |    1 |       |
|    1 | SIMPLE      | dim2   | eq_ref | PRIMARY         | PRIMARY | 4       | j3.t_fact.dim2_id |    1 |       |
+------+-------------+--------+--------+-----------------+---------+---------+-------------------+------+-------+

This uses index to resolve the ORDER BY ... LIMIT and the select takes 0.01 sec to execute.

Dataset:

create table ten(a int);
insert into ten values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
 
create table one_k(a int);
insert into one_k select A.a + B.a* 10 + C.a * 100 from ten A, ten B, ten C;
 
create table t_fact
(
  fact_id int not null,
  dim1_id int not null,
  dim2_id int not null,
  col1    int not null,
  primary key(fact_id),
  key(dim1_id),
  key(dim2_id),
  key(col1)
);
 
insert into t_fact
select
  A.a+1000*B.a+1000*1000*C.a,
  A.a,
  B.a,
  A.a+1000*B.a+1000*1000*C.a
from
  one_k A ,
  one_k B,
  ten C
where
 A.a<500 and B.a<500
;
 
 
create table dim1
(
  dim1_id int not null primary key,
  col1 int
);
 
insert into dim1
select a,a from one_k where a<500;
 
create table dim2
(
  dim2_id int not null primary key,
  col1 int
);
insert into dim2
select a,a from one_k where a<500;



 Comments   
Comment by Sergei Petrunia [ 2015-06-11 ]

This can be solved by making the optimizer do this:

1. Run the join optimizer (as it is currently done)
2. check if #join_output_rows < select_limit
3. If yes, re-run the join optimizer in a special way:
3.1 The first table must produce the required ordering
3.2 Join buffering must not be used
3.3 Costs/#rows must be adjusted to take into account
that we are going to produce at most select_limit
rows.

Unclear parts:

3.1 may also need to include the case where one runs
filesort(.., limit=none, ...) for the first table
and then joins the sorted sequence with the rest (applying the LIMIT)

3.3 is the most complex and needs to be elaborated on.

Comment by Sergei Petrunia [ 2015-11-09 ]

Another problematic issue is WHERE clause. It is easy to make a decision to switch to using index-based resolution to ORDER BY ... LIMIT N when you assume that the first N rows you enumerate will match the WHERE clause. If the WHERE condition is very selective, one may end up enumerating many more rows before they find the first N rows that match the WHERE.

EITS gives the optimizer information about condition selectivity, but does it give enough? The way EITS works currently, it is difficult to tell whether EITS plus join operations "took into account" selectivities of all parts of WHERE clause or there were some un-handled conditions.

Comment by Sergei Petrunia [ 2016-10-11 ]

See also: http://bugs.mysql.com/bug.php?id=83323

Comment by Sergei Golubchik [ 2017-08-06 ]

Two ideas about properly taking ORDER BY (but not LIMIT) into account in join optimizer.

1. Join optimizer is doing depth-first search. First it builds some join order, then tries other partial orders and compares costs. When a very first some join order is built, optimizer has the estimate of number of rows in the result. At that point it can estimate the cost of a filesort. Later when comparing plans (or partial plans), it should add the cost of a filesort to those plans that will end up using filesort. This way it can to a fair comparison of a plan-with-filesort and a plan that uses index and no filesort.

One of the weak points of this idea, it assumes that independently from the join order, optimizer will always get the same (or similar) estimate for the number of rows in the result.

Comment by Sergei Golubchik [ 2017-08-06 ]

2. Another idea, calculate three plans in parallel. That is, currently, best_access_path() does

      if (tmp + 0.0001 < best_time - records/(double) TIME_FOR_COMPARE)
      {
        best_time= tmp + records/(double) TIME_FOR_COMPARE;
        best= tmp;
        best_records= records;
        best_key= start_key;
        best_max_key_part= max_key_part;
        best_ref_depends_map= found_ref;
      }

Instead it can do the following:

      enum { NO_FILESORT, TMPTABLE_FILESORT, FIRST_TABLE_FILESORT } plan_type;
      if (tmp + 0.0001 < best_time[plan_type] - records/(double) TIME_FOR_COMPARE)
      {
        best_time[plan_type]= tmp + records/(double) TIME_FOR_COMPARE;
        best[plan_type]= tmp;
        best_records[plan_type]= records;
        best_key[plan_type]= start_key;
        best_max_key_part[plan_type]= max_key_part;
        best_ref_depends_map[plan_type]= found_ref;
      }

basically, it finds three best plans — one that uses no filesort, one that uses filesort after the first table, and one that does temptable and filesort at the end. When join optimizer is finished, we can select the best plan out of three, taking filesort cost into account.

Comment by Sergei Petrunia [ 2019-05-03 ]

Takeaways from browsing through some more-or-less related papers :

"Interesting orders"

Traditional query optimizer relies on the concept of "interesting orders". ORDER BY/GROUP BY specify "interesting orders", for example.

The optimizer tries to keep (ie to not prune away) the query plan [fragments] that produce rows in interesting orders.

For example, if there is a query plan [fragment] QP1 that produces an interesting order, and a query plan QP2 (or a fragment for the same part of the query) which is cheaper but doesn't produce the same interesting order, then QP1 is not pruned in favor of QP2.

Peculiar property of the LIMIT-n construct

Consider the process of constructing a regular join plan for "t1 JOIN t2 JOIN t3".

We start with "t1 JOIN t2", and compute its cost.
Then we add the t3 as the 3rd table. The plan for "t1 JOIN t2 JOIN t3" will still include the plan for "t1 JOIN t2" inside it.

Now, consider "t1 JOIN t2 JOIN t3 ORDER BY ... LIMIT N".

Let's assume that short-cutting is possible (execution stops after producing N
matches) and try to pick the first table. Suppose, it's t1.

What value of LIMIT should we use? It depends on the fanout of t2, t3 (which we do not know).

We could get an estimate of fanout by running the join optimizer and checking how many rows are expected to be in the join output. @serg's comment above was talking about this also.

Simplified approach - satisfy ORDER BY, ignore the LIMIT

Can we just assume that any query plan that produces the matching order is good enough?

Comment by Sergei Petrunia [ 2019-05-07 ]

Notes from the optimizer call:
Igor suggests to also consider query plans where the first N tables are written into temp.table, then passed to filesort, and then the sorted output is joined with the rest of the tables. That is, bushy plans with a "nest" at the front. TODO: elaborate on this proposal.

Comment by Varun Gupta (Inactive) [ 2019-06-05 ]

Details about the approach to be taken, termed as Order by limit (bushy plan)

General description

  • No need for this optimization when we have grouping(DISTINCT, GROUP BY, window function and aggregate functions)
  • We can also disallow this bushy plan if we have HAVING clause that is not pushed into WHERE clause
  • For bushy plan we will not need anything different until we get the partial order that supports the ORDER BY CLAUSE

So an approach for the bushy plan(if chosen as the best plan) would be like

  1. Compute Partial join (that covers ordering)
  2. Store the result of the partial join in a temp table
  3. Sort the temp table
  4. Join the temp table with the remaining tables

Note: the bush strategy allows to shortcut the execution for LIMIT early:

  • The part of the join that is inside the bush will need to be computed fully (and passed to Filesort)
  • After that, we can short-cut the join execution as soon as we’ve found #LIMIT matches.
Comment by Varun Gupta (Inactive) [ 2019-06-07 ]

Optimizer part

Take the LIMIT part of ORDER BY LIMIT into account during join optimization

The biggest difference in query execution for ORDER BY … LIMIT plans comes from ability (or inability) to terminate the query execution early, as soon as #LIMIT matches were found.

Two issues for the optimizer here:

  • Data distribution: it’s difficult to predict whether an index scan will find the matching rows early or late (TODO rephrase). Perhaps we should add a penalty for plans that take risks.
  • Partial plans and LIMIT: The whole plan has to produce N rows. How many rows should the partial plan produce for that? This depends on the fanout of the rest of the plan.

Suggested answer for the second: run the join optimizer, compute the total join output cardinality. Then, we can see which fraction of join output we will need and assume that we will need to do the same fraction of work for each table (TODO: our optimizer typically heavily over-estimates cardinality due to unknown selectivity of predicates. Will this cause an overly-optimistic estimates?)

(TODO2: another suggestion is to
1. Force a construction of query plan which produces rows in a suitable order (that is: don’t prune it away even if there are cheaper query plans which produce rows in a non-suitable order)
2. Take into account LIMIT shortcut only after the whole query plan has been constructed)

Add a sort operation when join prefix allows it

We can add a sorting operation if all of the columns that are required to compute the sort key are in the join prefix. Note that the check should take into account equality propagation (See MDEV-8989).

Example:

WHERE
t2.a= t3.a and t3.a=t4.a
ORDER BY t1.a, t2.a

For each reference to tblX.columnY in ORDER BY list, we get:
A single table_map referring to tblX
If tblX is a part of multiple equality, a bitmap of tables whose fields participate in the multiple equality.
For this example, we would get:

  table_map_1= {t1}, 
  table_map_2= {t2, t3, t4}

Then, when we have a join prefix, we can check if it covers all the table_map values that we have.

1.3 Cost of sorting
It is cheaper to sort fewer rows, so sort operation must be done at the point in the query plan where the output cardinality is minimal. Add the cost of sorting as soon as we have achieved the ordering of the columns

No join buffering in ordered-join mode

After the sorting nest has been added, the rest of the join is performed in ordered-join mode (first “table” is sorted, the join output is sorted too). Join buffering and BKA must be disabled here.

Representation of the sort operation in the JOIN prefix
TODO: Should sort operation be represented as a member of a POSITION object, or as a separate POSITION object?

1.6 Enumeration of different sorting options
If we construct a join prefix of
t1, t2
And then add a sort operation:

sort {t1, t2}

And then consider adding table t3,

We will add it afterwards:

 Sort {t1, t2} , t3 

No overlaps with semi/outer joins

It’s difficult to handle the situation where the sorting operation comes in the middle of outer join or semi-join processing.
Example:

SELECT * FROM t1 LEFT JOIN (t2, t3) ON …

It is hard to make a sort bush of

{t1, t2}

here. The suggestion is to not allow this.
It’s similar for semi-joins: Putting a sort operation after a join prefix that has [partial] duplicate row combinations will add a lot of complexity. Possible variants of condition under which sort operation can be added:

Comment by Varun Gupta (Inactive) [ 2019-06-10 ]

Plan refinement part

No linked join buffers across the bush bound

Join buffering code should be adjusted to not do this.

Create temp. table

Figure out the set of fields we need to be saved from the tables inside the sort nest, and create a temporary table with those fields.

Access to tables inside the sort nest from post-sorting context

Query execution with sort nest:
Step 1: Compute a join of tables inside the sort nest; store record combinations in the temporary table;
Step 2: Use filesort on the temporary table
Step 3: Read filesort result and join it with the tables outside the sort nest
Step 4: Return the query result (to the client or elsewhere)

When we are at Step 3 and 4, we may need to read columns of tables inside the sort nest.
This requires some extra arrangement, as the “current” values are in the temporary table.

Possible solutions:

The “Unpack” solution (aka copy-back)

After we read a row from the temporary table, put the values back into the record buffers of the original tables.
This is used by:
SJ-Materialization-Scan (see rr_sequential_and_unpack())
Join buffering code (see JOIN_CACHE::read_all_record_fields())
When reading the result of filesort’ing a single table.

Advantages: no need to modify any Item objects.
Disadvantages: Extra overhead to copy the values back.

The “refer-to-temptable like GROUP BY does” solution

This is used for example by GROUP BY code. When a tableX.fieldY is accessed from post-group operation context (e.g. when computing the HAVING clause), it is done as follows:

The Item_field referring to tableX.fieldY is wrapped inside Item_ref object.
Item_ref::val_int() translates the call to val_int_result() (the same happens for other datatypes, too).
Item_field has two pointers to table fields: “field” refers to the Field in the original table, while “result_field” refers to its corresponding field in the temporary table.
Item_field::val_int_result() returns value from the “result_field”, that is, from the temporary table.

Advantages: no need to copy
Disadvantages: need to have Item_ref objects (can we use them also or only GROUP BY code can?)

The “refer to temptable but not like GROUP BY does it” solution.

Is there anything that prevents us from changing Item_field objects to point to temptable directly? (This change will need to be un-done for SP re-execution).
(note: there is class Item_temptable_field, but it is the same as Item_field for most purposes)

Advantages: no need to copy
Disadvantages: need to walk Item trees and make substitutions (which then will need to be undone if running as PS or in SP).

Comment by Varun Gupta (Inactive) [ 2019-06-11 ]

The branch is 10.5-mdev8306

Comment by Varun Gupta (Inactive) [ 2019-06-25 ]

While testing for the optimizer part where we enumerate the plans, I found two issues, opened sub-tasks MDEV-19853 and MDEV-19854.

Comment by Varun Gupta (Inactive) [ 2019-07-03 ]

Optimizer part

Part 1 of the project is done, the planner now enumerates plans that would also consider adding a nest to a partial join that would satisfy the order by clause.
Also for the plans I have added the cost of filling and sorting the temporary table that would store the result of the partial join.

Also I have made some progress in the second part
1) Representing the order-nest in the top level select just as we do for Semi Join Materialization Strategies
So lets consider a case of 4 tables
the join order is t1,t2,t3,t4
order nest is on table (t1,t2)
So the top level select would have 3 tables <order-nest>, t3,t4

This would like this:

                 <order-nest> ---- t3 ---- t4 
                       |
                        +----t1---t2

2) Created the temp table with that would store the partial join

Comment by Varun Gupta (Inactive) [ 2019-07-09 ]

More updates

  • Instead of the order-nest for now, kept a seperate JOIN_TAB for the temp table containing the result of the partial join order. This is done currently because we need to adjust the number of JOIN_TAB's how they are allocated(will make a seperate patch for this).
  • Also the iterations over the JOIN_TAB looked a bit complex, we have this WITH_BUSH_ROOT and WITHOUT_BUSH_ROOT that is used to include and exclude the root of the SJ-nest, we can use this for the order nest too.
  • Also introduced a struct NEST_INFO to store some of the contents of the nest.

The commits are:
https://github.com/MariaDB/server/commit/ec395d289533317e30b156d9e3b17afd4b50982e
https://github.com/MariaDB/server/commit/d3d66daccbc51cf87a96a8c4cea130c0fa404056

Comment by Varun Gupta (Inactive) [ 2019-07-16 ]

Updates

MariaDB [test]> explain select t1.b from t1,t2 where t1.a=t2.a  order by t2.b limit 2;
+------+-------------+--------------+------+---------------+------+---------+------+------+----------------+
| id   | select_type | table        | type | possible_keys | key  | key_len | ref  | rows | Extra          |
+------+-------------+--------------+------+---------------+------+---------+------+------+----------------+
|    1 | SIMPLE      | t2           | ALL  | NULL          | NULL | NULL    | NULL | 100  |                |
|    1 | SIMPLE      | <order-nest> | ALL  | NULL          | NULL | NULL    | NULL | 100  | Using filesort |
|    1 | SIMPLE      | t1           | ALL  | NULL          | NULL | NULL    | NULL | 100  | Using where    |
+------+-------------+--------------+------+---------------+------+---------+------+------+----------------+

Comment by Varun Gupta (Inactive) [ 2019-07-23 ]

Updates

Comment by Varun Gupta (Inactive) [ 2019-07-30 ]

Tasks left:

  1. Need to get a bush for the order-nest (this was the suggestion that I discarded earlier as it was taking me more time to do it that way)
  2. ON expressions and add test cases
  3. MDEV-19854: Using indexes that can achieve ordering
  4. Transform function for subqueries
  5. Test if Item_ref need some special handling
  6. Adding more tests for subqueries and outer joins
  7. Make sure that the entire regression test suite passes
  8. MDEV-20129 should be pushed along with MDEV-8306 (this is done needs review)
  9. There is some work to handle the case when the first non-const table satisfies ordering like enable join buffering if the index used achieves the ordering for first non-const table
  10. Use cache_record_length, to get more precise length of rec_length for the temp table instead of using AVG_REC_LEN
  11. Improving Analyze and Explain format=json for sort-nest(MDEV-20146)
  12. Handling for Blobs (need to disable if sorting cannot be done with them)
  13. Addressing some review and adding proper comments
  14. Enable the sort-nest for multi table UPDATE and DELETE queries also
Comment by Varun Gupta (Inactive) [ 2019-08-20 ]

Issue with DEPENDENT SUBQUERY

Case 1: when the subquery is attached to the tables inside the nest

Query:

SELECT abs(t1.a) as x, t2.a,t3.a FROM t1,t2,t3
WHERE t1.b = t2.b and t3.b=t1.b and
EXISTS (select 1 from t4 where t4.b=t1.c and t4.b < 4 group by t4.c having max(t4.a) = x)
ORDER BY t2.a desc, t1.a desc limit 5;

x refers to the item abs(t1.a) as x in the parent select
the subquery has a reference to a select_list element
So when I do the substitution of the select list with sort-nest items then I
change the outer-ref to x also(unintentionally)

The idea to fix this would be:

  • From the select list filter out the items that are internal to the sort-nest
  • Create temporary table fields for all the items found above, this would mean that
    abs(t1.a) would also be in the sort-nest table
  • Then substitute in the select list abs(t1.a) as x with sort-nest(abs(t1.a)) as x
  • Also this would require change to the ref_pointer_array as those should also point
    to the new substituted items in the select list
Comment by Varun Gupta (Inactive) [ 2019-08-23 ]

So there was some more progress on taking into consideration of an index to achieve ordering for the first non-const table (MDEV-19854).
So the case in the description now takes into consideration of the limit and we can the desired plan

Here is the output for the explain:
https://jira.mariadb.org/browse/MDEV-19854?focusedCommentId=132928&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-132928

Comment by Varun Gupta (Inactive) [ 2019-08-28 ]

While testing with dbt3 I found an example where using sort-nest optimization is not picking the best accesses for the table

MariaDB [dbt3]> EXPLAIN SELECT   c_nationkey, c_name, o_orderDate, o_totalprice, c_acctbal, n_name FROM   orders, customer, nation  WHERE   c_custkey= o_custkey AND c_nationkey=n_nationkey AND   o_orderDATE >='1997-07-30' and o_orderDATE <= '1998-07-30'  ORDER BY o_orderDATE, o_totalprice DESC LIMIT 10;
+------+-------------+----------+--------+---------------------------+---------------+---------+-----------------------+------+----------------------------------------------------+
| id   | select_type | table    | type   | possible_keys             | key           | key_len | ref                   | rows | Extra                                              |
+------+-------------+----------+--------+---------------------------+---------------+---------+-----------------------+------+----------------------------------------------------+
|    1 | SIMPLE      | orders   | range  | i_o_orderdate,i_o_custkey | i_o_orderdate | 4       | NULL                  | 217  | Using index condition; Using where; Using filesort |
|    1 | SIMPLE      | customer | eq_ref | PRIMARY,i_c_nationkey     | PRIMARY       | 4       | dbt3.orders.o_custkey | 1    |                                                    |
|    1 | SIMPLE      | nation   | ALL    | PRIMARY                   | NULL          | NULL    | NULL                  | 25   | Using where                                        |
+------+-------------+----------+--------+---------------------------+---------------+---------+-----------------------+------+----------------------------------------------------+
3 rows in set (0.005 sec)
 
MariaDB [dbt3]> set use_sort_nest=0;
Query OK, 0 rows affected (0.000 sec)
 
MariaDB [dbt3]> EXPLAIN SELECT   c_nationkey, c_name, o_orderDate, o_totalprice, c_acctbal, n_name FROM   orders, customer, nation  WHERE   c_custkey= o_custkey AND c_nationkey=n_nationkey AND   o_orderDATE >='1997-07-30' and o_orderDATE <= '1998-07-30'  ORDER BY o_orderDATE, o_totalprice DESC LIMIT 10;
+------+-------------+----------+--------+---------------------------+---------------+---------+---------------------------+------+----------------------------------------------------+
| id   | select_type | table    | type   | possible_keys             | key           | key_len | ref                       | rows | Extra                                              |
+------+-------------+----------+--------+---------------------------+---------------+---------+---------------------------+------+----------------------------------------------------+
|    1 | SIMPLE      | orders   | range  | i_o_orderdate,i_o_custkey | i_o_orderdate | 4       | NULL                      | 217  | Using index condition; Using where; Using filesort |
|    1 | SIMPLE      | customer | eq_ref | PRIMARY,i_c_nationkey     | PRIMARY       | 4       | dbt3.orders.o_custkey     | 1    | Using where                                        |
|    1 | SIMPLE      | nation   | eq_ref | PRIMARY                   | PRIMARY       | 4       | dbt3.customer.c_nationkey | 1    |                                                    |
+------+-------------+----------+--------+---------------------------+---------------+---------+---------------------------+------+----------------------------------------------------+
3 rows in set (0.005 sec)

So with use_sort_nest=1 we use table scan on table nation while we could have use eq_ref access, somehow the optimizer is not picking it

Comment by Varun Gupta (Inactive) [ 2019-08-28 ]

The optimizer trace shows that table scan was preferred

                    "rest_of_plan": [
                      {
                        "plan_prefix": ["orders", "customer"],
                        "table": "nation",
                        "sort-nest": ["orders"],
                        "best_access_path": {
                          "considered_access_paths": [
                            {
                              "access_type": "eq_ref",
                              "index": "PRIMARY",
                              "rows": 1,
                              "cost": 217,
                              "chosen": true
                            },
                            {
                              "access_type": "scan",
                              "resulting_rows": 25,
                              "cost": 29,
                              "chosen": true
                            }
                          ]

Comment by Varun Gupta (Inactive) [ 2019-09-24 ]

Implementation details

Finding if the Query can consider use the cost based approach for ORDER BY with LIMIT

  • sort_nest_allowed()
  • is_order_by_expensive()

Equality propagation for Order BY Items

Propagate the multiple equalities to all the items in the ORDER BY list, this
helps us to consider various other join order that can have prefix that can
resolve the ORDER BY clause

  • propagate_equal_field_for_orderby()
  • remove_const_from_order_by()

Find indexes that can resolve the ORDER BY clause

Walk through all the usable indexes for a table and find a map of keys for
table that can resolve the ORDER BY clause

  • find_keys_that_can_achieve_ordering()

Get the cardinality estimate for the join

Run the join planner to get the estimate of the cardinality of the join.
This is needed as we want to apply the limit selectivity later while calculating
the cost of the best plan.

  • estimate_cardinality_for_join()

Run the join planner to get the best plan that will consider using a sort nest

optimizer_prune_level=0
This means no heuristic pruning of partial plans,
only partial plans are pruned by the cost of the
best plan at that instance

For each prefix if the ORDER BY clause can be resolved by the tables
in the prefix then
a) We consider adding a nest around the prefix. A separate call is added
for best_extension_by_limited_search

  • consider_adding_sort_nest()
    b) Add the cost of sorting.
  • sort_nest_oper_cost()

c) Apply the limit selectivity. This is done because after the nest
is considered we can shortcut the execution, we only consider
the fraction of rows that we would read from the nest to get
the resulting rows in the output. So this is how we take the cost
of the shortcut into consideration.

  • set_fraction_output_for_nest

d) Also special case when one can use indexes to resolve ORDER BY
clause are considered here as well. We calculate the cost to access
the index which can resolve the ORDER BY clause on the first non-const
table. This can help in shortcutting join execution.

  • get_best_index_for_order_by_limit()
  • find_cost_of_index_with_ordering()

After the best plan is picked, we create the sort-nest info structure if the best plan uses a sort-nest.

  • check_if_sort_nest_present()
  • create_sort_nest_info()

For the first non-const table that uses an index to resolve ordering setup range/index scan

  • setup_index_use_for_ordering()
  • setup_range_scan()

Creating the temp table for the sort-nest

We create a list of field items of the tables inside the nest which are needed
to be stored in the temporary table.
Then we also create a list of items of the fields of the temporary table

  • make_sort_nest()

Join buffering is disabled as it can break ordering

  • is_join_buffering_allowed()

Substitution of items belonging to tables inside the nest with the items of the temp table of the nest

Here we substitute the items of the tables inside the sort-nest with items
of the temporary table of the sort-nest.
The clauses where this substitution happens are:

  1. SELECT LIST
  2. ON-EXPR
  3. REF ACCESS items [for SJM lookup also]
  4. WHERE clause
  5. ORDER BY clause
  • substitute_base_with_nest_field_items()
    transform() is called with the function Item::replace_with_nest_items sent as a callback function

Materializing the result inside the temp table of the sort-nest

At the execution phase we materialize the result of the inner tables inside
the sort-nest and then continue the execution.

  • end_nest_materialization()
Comment by Sergei Petrunia [ 2020-01-31 ]

Before MDEV-8306, best_extension_by_limited_search had this code:

        if (join->sort_by_table &&
            join->sort_by_table !=
            join->positions[join->const_tables].table->table)
          /*
             We may have to make a temp table, note that this is only a
             heuristic since we cannot know for sure at this point.
             Hence it may be wrong.
          */
          current_read_time= COST_ADD(current_read_time, current_record_count);

Now, I see only the comment is left:

            /*
              We may have to make a temp table, note that this is only a
              heuristic since we cannot know for sure at this point.
              Hence it may be wrong.
            */
            cost= current_record_count;

Does this mean we can't provide the old behavior? (Yes I know it is ugly and incorrect) Still, being able to provide old behavior has value and we should discuss if we could do it, here.

Comment by Varun Gupta (Inactive) [ 2020-02-02 ]

The entire snippet is

      else
      {
        /*
          'join' is either the best partial QEP with 'search_depth' relations,
          or the best complete QEP so far, whichever is smaller.
        */
        if (!nest_created &&
            ((join->sort_by_table &&
              join->sort_by_table !=
              join->positions[join->const_tables].table->table) ||
              join->sort_nest_possible))
        {
          double cost;
          if (join->sort_nest_possible)
          {
            ulong rec_len= cache_record_length_for_nest(join, idx);
            cost= join->sort_nest_oper_cost(partial_join_cardinality, idx,
                                            rec_len);
            trace_one_table.add("cost_of_sorting", cost);
          }
          else
          {
            /*
              We may have to make a temp table, note that this is only a
              heuristic since we cannot know for sure at this point.
              Hence it may be wrong.
            */
            cost= current_record_count;
          }
 
          current_read_time= COST_ADD(current_read_time, cost);
        }

So when the ORDER BY LIMIT optimization is not enabled we enter the branch

else
          {
            /*
              We may have to make a temp table, note that this is only a
              heuristic since we cannot know for sure at this point.
              Hence it may be wrong.
            */
            cost= current_record_count;
          }

And then this cost of sorting is added to the cost of the current plan

 
current_read_time= COST_ADD(current_read_time, cost);

So we still provide the same old behaviour when the ORDER BY LIMIT optimization is not enabled.

Comment by Varun Gupta (Inactive) [ 2020-02-02 ]

The rebased code on top of 10.5 is in the branch 10.5-mdev8306-2

Comment by Sergei Petrunia [ 2020-09-19 ]

A related bug in MySQL to look at : https://bugs.mysql.com/bug.php?id=97001

Comment by Varun Gupta (Inactive) [ 2021-02-09 ]

The updated branch is on 10.6-order_by_limt
Also a push to buildbot has been made
http://buildbot.askmonty.org/buildbot/grid?category=main&branch=bb-10.6-varun

Generated at Thu Feb 08 07:26:09 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.