[MDEV-18079] Opportunistic optimization for ORDER BY LIMIT N queries - Jira

Details

Type: Task
Status: Open (View Workflow)
Priority: Major
Resolution: Unresolved
Fix Version/s: None
Component/s: Optimizer
Labels:
- optimizer-feature
- order-by-optimization

Description

Motivation

When we are optimizing ORDER BY ... LIMIT N, there is a choice between:

A. "Filesort plan": Read the rows of interest with the best access method, then
sort them with filesort, then take the first N rows.

B. "Index plan": find an index that matches the required ordering, start
reading matching rows through that index (either by scanning the entire index,
or by doing range access on it). Stop as soon as we have found N matches.

Plan A has a predictable cost, but it can be very "inefficient" for small
numbers in #LIMIT - we will enumerate many rows to get #LIMIT rows.

For Plan B, execution cost may be difficult to predict. If there is a WHERE
clause and a small LIMIT N, how soon we will find N matches?

Selectivity of the condition is not always known
Even if the selectivity of the condition is known (suppose it's 50%), how are
matching rows distributed in the index? If we encounter the matches first,
we will only enumerate N rows. If the matches will be encountered last,
we will have to enumerate 50% + N rows before we find N matches.

The idea to address this is to do both:

Basic idea

Start with the Index Plan, and scan a certain number of records
If we get a sufficient number of matches, continue.
Otherwise, switch to the FileSort plan.

Details

Duplication of a part of the result

If we start running the Index plan, it may produce a part of query output.

Then, Filesort plan will produce the entire query output. We'll need to make
sure the duplicate part of the query output is removed.

Possible solutions:
A. Don't pass the output of the Index Plan to the client. Accumulate it in a
temporary file (or table), and discard it if we decide to switch to the
Filesort plan.

B. Pass the output rows of Index Plan to the client. If we emitted K rows, and
then switched to Filesort plan, we will need to discard the first K rows when
reading the filesort output.

Solution #B is not workable

The problem is "peer rows" - rows that compare as equal for ORDER BY, but are
not identical for the user. Consider a query

SELECT last_name,first_name FROM people WHERE ... ORDER BY last_name LIMIT 10;

Suppose the table has rows:
('Ivanov', 'Ivan')
('Ivanov', 'Peter')

and the Index Plan produces this as first output row:

('Ivanov', 'Ivan')

Then we decide to stop and switch to Filesort plan. The filesort plan produces

('Ivanov', 'Peter')
('Ivanov', 'Ivan')
...

This shows the technique of "skip the first #K output rows from the filesort plan
output" is not workable.

(Unless we extend the sort criteria with something like underlying table rowid,
which would make all rows non-peers).

Conditions for using this optimization

The proposed execution strategy may cause double work to be done: the same rows
will be first enumerated by Index Plan and then by the Filesort Plan.
That is, the query will run with filesort plan but will have extra overhead of
running with an Index Plan, too.

Also, it doesn't make sense to take risks with an Index Plan if it will have to
enumerate nearly as many rows as the filesort plan.

First attempt at the optimization criteria:

optimistic_index_cost = cost of Index Plan, assuming we will be "reasonably lucky".

// "reasonably lucky" here may mean:

//  - matching records will come first

//  - matching records will be uniformly distributed across the enumerated set.

filesort_cost = Cost of filesort plan;

if (optimistic_index_cost < filesort_cost * 0.1)

  use opportinistic query plan;

Attachments

Issue Links

relates to

MDEV-16450 Optimizer seems to make incorrect choice when AND clause on 2 varchar columns and those 2 columns are indexed (includes ORDER BY different_column and LIMIT 1)

Stalled

MDEV-18073 get_range_limit_read_cost() doesnt adjust LIMIT for the range access

Open

MDEV-7447 Optimizer choose different index depends on scan rows in ORDER BY DESC

Open

MDEV-8306 Complete cost-based optimization for ORDER BY with LIMIT

Stalled

Activity

People

Assignee:: Sergei Petrunia

Reporter:: Sergei Petrunia

Votes:: 8 Vote for this issue

Watchers:: 12 Start watching this issue

Dates

Created:: 2018-12-25 09:37

Updated:: 2024-11-05 13:45

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server

Details

Description

Motivation

Basic idea

Details

Duplication of a part of the result

Solution #B is not workable

Conditions for using this optimization

Attachments

Issue Links

Activity

People

Dates

Git Integration