Details
-
Task
-
Status: Open (View Workflow)
-
Minor
-
Resolution: Unresolved
-
None
-
None
Description
Problem description
MariaDB (and MySQL) optimizer works as follows:
1. Pick the best join order, ignoring ORDER/GROUP BY clause.
2. Check if we can change the access method on the first table to handle ORDER BY .. LIMIT by reading [fewer] records in the required order.
This approach makes the optimizer to be incapable of choosing a good query plan for join queries. Consider a query:
select * from t1, t2 where ... order by t1.key limit N
|
If the join optimizer choses a join order of (t2, t1) then the optimizer is not able to make a choice to use index t1.key to satisfy the ORDER BY...LIMIT clause.
Solution overview
The solution is to make the join optimizer to take ORDER BY..LIMIT clause into account.
When considering a join order starting from table t1, it should note that using t1.key will allow us to scan much fewer records, because we will only need to produce N output rows. (N comes from "limit N")
Possible challenges
How many records will we have to enumerate before we produce N output records? The answer depends on the conditions are there in the WHERE clause. Possible problems are
- Highly-selective conditions whose selectivity is not known to the optimizer. (e.g. conditions over non-indexed columns) (see note below about MariaDB 10.0)
- Correlation between the ORDER BY clause and WHERE condition. Optimizers typically make assumptions like "records that satisfy WHERE condition are spread uniformly across the table". For example, if WHERE clause is satisfied for 50% of the records, the optimizer assumes it needs to read two records (from any index) before it finds a record that satisfies the WHERE condition. If the data is correlated, one may need to scan through 50% of records before he finds a record that matches the WHERE condition. This correlation is difficult to account for in the optimizer.
Implementation
Making Join Optimizer account for ORDER BY LIMIT
How to combine choice of join order with choice of reading + sorting vs. reading the index that produces appropriate ordering?
The first thing that comes to mind is to modify best_access_path() function to check whether it is optimizing access to the first table in the join, and if yes, consider using an index that matches the ORDER BY ...LIMIT N clause. This approach has a problem: when we've put only the first table into join order, how do we know how many records we will have to scan for "ORDER BY .. LIMIT N"? (example: if the next table has 1 match for each record, we need N records, if the next table has 100 matches, we need N/100 records). We don't know what to put as #records when considering table access for the first table.
A possible solution to this:
1. Run join optimization as we do it currently (i.e. ignore the ORDER BY...LIMIT part)
2. Run join optimization, limiting it to producing only join orders that emit rows in the way that matches the ORDER BY clause
3. Compare the above two and make the choice.
Note: if we pick a plan that uses ORDER BY...LIMIT, we'll need to make sure that join buffering is not used with it (because it breaks the ordering)
Costs to account for
As far as my knowledge goes: the advantage of ORDER BY...LIMIT plans comes primarily from reading fewer records from base tables (in other words, from not evaluating the complete join). Cost of sorting is negligible and can be ignored by the cost calculations for the nearest future. (Not accounting for cost of sorting works against ORDER BY ... LIMIT plans. That is, the change in plans will be a bit conservative).
Predicate selectivity
Primary reliable source about condition selectivity is possible range and ref(const) accesses. We can assume them to be uncorrelated with the index that's used for ORDER BY (as long as that index doesn't cover the same columns). Is this sufficient?
Version considerations
- MySQL 5.6 has some changes in test_if_skip_sort_order() and make_join_select() code. They seem to be orthogonal to this task.
- MariaDB 10.0 will have engine-independent column statistics. These could help if we encounter a problem with non-indexed predicate selectivity.
Other things
- It should be fairly easy to put new functionality under a @@optimizer_switch flag (name could be "optimize_join_for_order_by" or something like that), so it can be turned on/off.