[MDEV-37330] Ordered scans over PARTITION BY RANGE should not use priority queue - Jira

XML

Word

Printable

Details

Type: Task
Status: In Review (View Workflow)
Priority: Critical
Resolution: Unresolved
Fix Version/s: 12.3
Component/s: Partitioning
Labels:
- optimizer-easy
- optimizer-feature

Sprint:
Q4/2025 Server Development

Description

Consider a table partitioned by date:

CREATE TABLE t1 (

  date_created DATETIME

   user_id INT,

...

  INDEX idx1(user_id, date_created)

) PARTITION BY RANGE COLUMNS(date_created) (

 PARTITION `p_first`  VALUES LESS THAN ('2022-02-24 00:00:00'),

...

 PARTITION `p_2025_02` VALUES LESS THAN ('2025-03-01 00:00:00'),

...

  PARTITION `p_last` VALUES LESS THAN ...

);

And a query looking to get the last row describing a given user.

select * from t1

where user_id=1234

ORDER BY date_created DESC LIMIT 1

The index idx1(user_id, date_created) will allow to use ref access and SQL layer will make one index lookup.

However, ha_partition has only two algorithms:

unordered index scan
ordered index scan

Here, ordered index scan will be used. ha_partition will read a row from each partition and put that into the priority queue, then it will get the last one.
When there are many partitions, this will have a lot of overhead. Why can't we just read the row from partition p_last ? And after we've exhausted that, read the row from partition before the p_last and so forth?
(This is actually similar to what "unordered index scan" code does except that code starts from partitions with smaller number and continues to partitions with larger number)

Looking at the current patch

So, the current patch does this:

In ha_partition::partition_scan_set_up(), if we are using PARTITION BY RANGE($part_range_expr),

then we check if the table's SELECT_LEX has an ORDER BY expression and one of the following holds:

1. the ORDER BY expression is a field reference and it matches the $part_range_expr (which must also be a field reference)

2. the ORDER BY expression (not necessarily a field reference) matches $part_range_expr.

One thing that looks bad is is that ha_partition examines the SELECT and checks the ORDER BY clause. This breaks any conceivable API contract in the SE interface.

Suggestions

Case 1: We do an ordered scan and can bypass ha_partition using priority queue.

This happens when the optimizer decides to read rows through an index.
This can be done to get rows ordered to handle ORDER BY [LIMIT] or GROUP BY. Or, we could have inferred a narrow range on that index.

In this case, we can check PARTITION BY, index being used, and the scan type as follows:

Case 1.1: The used index matches the partitioning definition

  CREATE TABLE t(

...

    INDEX idx1(col1, ...)

  ) PARTITION BY RANGE(col1) ...

If active_index=idx1 and we're in one of those:

  ha_partition->index_first();

  ha_partition->index_read_map(...);

  ha_partition->read_range_first(...);

Then we know we don't need to use the priority queue.
We just use partitions in their order and we will produce the output in the index order.

The criteria

Partitioning must by PARTITION BY RANGE. (In PARTITION BY LIST, some special cases could be handled but do not seem to be worth doing)

There are two kinds of PARTITION BY RANGE syntax:

PARTITION BY RANGE(col1)

PARTITION BY RANGE COLUMNS (col1, col2, ... )

We can support both as long as every used colX is a column reference.

In both cases, the range partitioning tuple must be a prefix of the used index.

Case 1.2: index doesn't match but given the scanned range it is compatible

  CREATE TABLE t(

...

    INDEX idx1(prefix_cols, col1, ...)

  ) PARTITION BY RANGE(col1) ...

if active_index=idx1 and we're in one of the following:

  ha_partition->index_read_map(prefix_cols=prefix_value);

  ha_partition->read_range_first(start_key= {prefix_value, ...}

                                 end_key={prefix_value, ...} );

We will scan a subset of table rows that has prefix_cols=prefix_value. For those, we can look into partitions in their order and produce the output in the required order.

Note that these decisions can be made by ha_partition just checking its DDL and the scan parameters.

Case 2: Partition by expression

I see this example in the comments:

CREATE TABLE t1 (

...

  INDEX idx1(c, d)

) PARTITION BY RANGE (d % 5) ( ... )

select * from t1 where c = 120 ORDER BY d % 5 LIMIT 2;

The query will use filesort as the optimizer is not aware of any way to read records in the order that matches ORDER BY d % 5.
One could say that for a table

CREATE TABLE t1 (

  c int,

  d INT,

  INDEX idx1(c, d)

) PARTITION BY RANGE (d % 5) (

 PARTITION `p1` VALUES LESS THAN (1),

 PARTITION `p2` VALUES LESS THAN (2),

 PARTITION `p3` VALUES LESS THAN (3),

 PARTITION `p4` VALUES LESS THAN (4),

 PARTITION `p5` VALUES LESS THAN MAXVALUE

);

all rows in p1 have (d % 5) = 0
all rows in p2 have (d % 5) = 1
...
so if one reads one partition after another than we get rows ordered by (d % 5).
But note that this requires that each partition has only one ORDER BY value.
If we had

 PARTITION `p1` VALUES LESS THAN (1),

 PARTITION `p2` VALUES LESS THAN (4),

then we would read p2 and get rows with (d % 5) = 1, or 2, or 3 and they would come without any order.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

mdev37330-testcase.sql
2025-11-17 10:42
2 kB
Sergei Petrunia

Issue Links

split to

MDEV-38040 unitialized handler::key_compare_result_on_equal in loose index scan of partition tables

Closed

Activity

People

Assignee:: Sergei Petrunia

Reporter:: Sergei Petrunia

Votes:: 0 Vote for this issue

Watchers:: 6 Start watching this issue

Dates

Created:: 2025-07-28 15:11

Updated:: 4 days ago 01:58

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.