Details
-
Task
-
Status: In Review (View Workflow)
-
Critical
-
Resolution: Unresolved
-
Q4/2025 Server Development
Description
Consider a table partitioned by date:
CREATE TABLE t1 ( |
date_created DATETIME
|
user_id INT, |
...
|
INDEX idx1(user_id, date_created) |
) PARTITION BY RANGE COLUMNS(date_created) ( |
PARTITION `p_first` VALUES LESS THAN ('2022-02-24 00:00:00'), |
...
|
PARTITION `p_2025_02` VALUES LESS THAN ('2025-03-01 00:00:00'), |
...
|
PARTITION `p_last` VALUES LESS THAN ... |
);
|
And a query looking to get the last row describing a given user.
select * from t1 |
where user_id=1234 |
ORDER BY date_created DESC LIMIT 1 |
The index idx1(user_id, date_created) will allow to use ref access and SQL layer will make one index lookup.
However, ha_partition has only two algorithms:
- unordered index scan
- ordered index scan
Here, ordered index scan will be used. ha_partition will read a row from each partition and put that into the priority queue, then it will get the last one.
When there are many partitions, this will have a lot of overhead. Why can't we just read the row from partition p_last ? And after we've exhausted that, read the row from partition before the p_last and so forth?
(This is actually similar to what "unordered index scan" code does except that code starts from partitions with smaller number and continues to partitions with larger number)
Looking at the current patch
So, the current patch does this:
In ha_partition::partition_scan_set_up(), if we are using PARTITION BY RANGE($part_range_expr),
then we check if the table's SELECT_LEX has an ORDER BY expression and one of the following holds:
1. the ORDER BY expression is a field reference and it matches the $part_range_expr (which must also be a field reference)
2. the ORDER BY expression (not necessarily a field reference) matches $part_range_expr.
One thing that looks bad is is that ha_partition examines the SELECT and checks the ORDER BY clause. This breaks any conceivable API contract in the SE interface.
Suggestions
Case 1: We do an ordered scan and can bypass ha_partition using priority queue.
This happens when the optimizer decides to read rows through an index.
This can be done to get rows ordered to handle ORDER BY [LIMIT] or GROUP BY. Or, we could have inferred a narrow range on that index.
In this case, we can check PARTITION BY, index being used, and the scan type as follows:
Case 1.1: The used index matches the partitioning definition
CREATE TABLE t( |
...
|
INDEX idx1(col1, ...) |
 |
) PARTITION BY RANGE(col1) ... |
If active_index=idx1 and we're in one of those:
ha_partition->index_first();
|
ha_partition->index_read_map(...);
|
ha_partition->read_range_first(...);
|
Then we know we don't need to use the priority queue.
We just use partitions in their order and we will produce the output in the index order.
The criteria
Partitioning must by PARTITION BY RANGE. (In PARTITION BY LIST, some special cases could be handled but do not seem to be worth doing)
There are two kinds of PARTITION BY RANGE syntax:
PARTITION BY RANGE(col1)
|
PARTITION BY RANGE COLUMNS (col1, col2, ... )
|
We can support both as long as every used colX is a column reference.
In both cases, the range partitioning tuple must be a prefix of the used index.
Case 1.2: index doesn't match but given the scanned range it is compatible
CREATE TABLE t( |
...
|
INDEX idx1(prefix_cols, col1, ...) |
 |
) PARTITION BY RANGE(col1) ... |
if active_index=idx1 and we're in one of the following:
ha_partition->index_read_map(prefix_cols=prefix_value);
|
ha_partition->read_range_first(start_key= {prefix_value, ...}
|
end_key={prefix_value, ...} );
|
We will scan a subset of table rows that has prefix_cols=prefix_value. For those, we can look into partitions in their order and produce the output in the required order.
Note that these decisions can be made by ha_partition just checking its DDL and the scan parameters.
Case 2: Partition by expression
I see this example in the comments:
CREATE TABLE t1 ( |
...
|
INDEX idx1(c, d) |
) PARTITION BY RANGE (d % 5) ( ... ) |
select * from t1 where c = 120 ORDER BY d % 5 LIMIT 2;
|
The query will use filesort as the optimizer is not aware of any way to read records in the order that matches ORDER BY d % 5.
One could say that for a table
CREATE TABLE t1 ( |
c int, |
d INT, |
INDEX idx1(c, d) |
) PARTITION BY RANGE (d % 5) ( |
PARTITION `p1` VALUES LESS THAN (1), |
PARTITION `p2` VALUES LESS THAN (2), |
PARTITION `p3` VALUES LESS THAN (3), |
PARTITION `p4` VALUES LESS THAN (4), |
PARTITION `p5` VALUES LESS THAN MAXVALUE |
);
|
all rows in p1 have (d % 5) = 0
all rows in p2 have (d % 5) = 1
...
so if one reads one partition after another than we get rows ordered by (d % 5).
But note that this requires that each partition has only one ORDER BY value.
If we had
PARTITION `p1` VALUES LESS THAN (1),
|
PARTITION `p2` VALUES LESS THAN (4),
|
then we would read p2 and get rows with (d % 5) = 1, or 2, or 3 and they would come without any order.
Attachments
Issue Links
- split to
-
MDEV-38040 unitialized handler::key_compare_result_on_equal in loose index scan of partition tables
-
- Closed
-