Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 11.0.3, 11.4.4, 11.8.5, 12.1.2
- Fix Version/s: None
Description
Overview
We have encountered a severe performance regression in MariaDB versions 11.0.3 through 12.1.2 (Cost Model refactor MDEV-26974). The optimizer aggressively prefers an ordered index scan (type: index) to satisfy an ORDER BY ... LIMIT clause, assuming a uniform distribution of matching rows.
On datasets with temporal locality (where matching rows are clustered at the end of the time range), this assumption fails catastrophically. The execution time regressed from sub-second (in 10.11 LTS) to >10 minutes (in 11.x).
The Failure Pattern ("The Limit Trap")
The query seeks N rows for a specific ID, ordered by time.
- Optimizer Logic: The optimizer sees ORDER BY create_time LIMIT 100. It calculates that the cost of reading the index in order is effectively zero (Random I/O penalty is low/non-existent in new cost model). It assumes a uniform distribution: "If I need 100 rows and the selectivity is 5%, I will likely find them within the first 2,000 reads."
- Reality (Skewed Distribution): The target rows (filtered by WHERE) are historically correlated and exist only at the very end of the B-Tree.
- Result: The engine scans 95%+ of the table (tens of millions of "noise" rows) before finding the first match, effectively performing a full table scan disguised as an index scan.
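The gap between the optimizer's expectation and the skewed reality can be sketched with simple arithmetic. This is an illustrative model only, not MariaDB's actual cost code; the row counts and 5% selectivity come from the example above.

```python
# Uniform-assumption estimate vs. clustered reality for an ordered
# index scan that stops after LIMIT matches. Illustrative only.

def expected_scan_uniform(total_rows, selectivity, limit):
    """Rows the optimizer expects to read: matches are assumed evenly
    spread, so LIMIT matches appear after about limit / selectivity reads."""
    return min(total_rows, limit / selectivity)

def actual_scan_clustered(total_rows, matching_rows, limit):
    """Rows actually read when every match sits at the END of the index:
    all non-matching 'noise' rows must be scanned first."""
    noise = total_rows - matching_rows
    return noise + limit

total = 122_000_000   # table size from the report
sel = 0.05            # 5% selectivity, as in the example above
limit = 100

print(expected_scan_uniform(total, sel, limit))                 # ~2,000 reads
print(actual_scan_clustered(total, int(total * sel), limit))    # ~116M reads
```

The same plan estimate ("2,000 reads") thus hides a five-orders-of-magnitude error once the matches are clustered at the tail of the sort order.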
Environment
- Working Version: 10.11 LTS (Uses ref or index_merge + filesort)
- Broken Versions: 11.0.3, 11.4.4, 11.8.5, 12.1.2 (Uses index scan on sorting key)
- OS: Debian Bookworm / Docker
- Dataset: 122M rows, InnoDB (ticket_history log table)
Technical Analysis & Explain Plans
Query
SELECT th.name, th.history_type_id, th.create_time
FROM ticket_history th
WHERE th.ticket_id = 1663121 AND th.history_type_id = 17
ORDER BY th.create_time, th.id ASC
LIMIT 100;
1. Expected Plan (10.11 LTS - Fast)
The optimizer correctly identifies that filtering by ticket_id is the selectivity winner, even if it requires a subsequent sort.
- Type: ref
- Key: ticket_id
- Rows: ~161,564 (Estimate)
- Extra: Using where; Using filesort
- Execution Time: 0.001 sec
2. Regression Plan (11.4+ - Slow)
The optimizer avoids the sort by scanning the create_time index, underestimating the cost of filtering out non-matching rows.
- Type: index
- Key: ticket_history_create_time
- Rows: 76,220 (estimate; drastically lower than the ref path's estimate, which makes the index scan look like the cheaper plan)
- Filtered: 4.36%
- Actual Execution: Scans ~100M+ rows due to data skew.
- Execution Time: ~600 sec
The "Crossing Point"
We observed that increasing the LIMIT (e.g., from 100 to 220) tips the cost calculation back to sanity, forcing the optimizer to switch back to ref + filesort. This indicates the cost formula for the "Index Scan" is winning by a very narrow margin in the broken versions.
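The narrow margin can be illustrated with a toy cost comparison: the ordered-scan cost grows roughly linearly with LIMIT (about limit / selectivity rows read), while the ref + filesort cost is roughly flat (read all matching rows once, then sort). The constants and the selectivity value below are invented for illustration and are not MariaDB's actual cost constants; only the 161,564-row ref estimate comes from the EXPLAIN above.

```python
# Toy model of the LIMIT "crossing point". All cost constants are
# hypothetical; this is not the MariaDB cost model.

ROW_READ_COST = 1.0        # hypothetical per-row read cost units
SORT_COST_PER_ROW = 0.1    # hypothetical per-row filesort overhead

def ordered_scan_cost(limit, selectivity):
    # Expected rows read before LIMIT matches, assuming uniform spread.
    return (limit / selectivity) * ROW_READ_COST

def ref_filesort_cost(matching_rows):
    # Read every matching row via the ref index, then sort them all.
    return matching_rows * (ROW_READ_COST + SORT_COST_PER_ROW)

sel = 0.001                # hypothetical match probability per index entry
matching = 161_564         # ref-path row estimate from the EXPLAIN above

for limit in (100, 220):
    scan = ordered_scan_cost(limit, sel)
    ref = ref_filesort_cost(matching)
    print(limit, "index scan" if scan < ref else "ref + filesort")
```

With these made-up constants the plan flips between LIMIT 100 and LIMIT 220, matching the observed behavior: a small change in LIMIT crosses the break-even line because the two costs are nearly equal.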
Mitigation Attempts (Failed)
None of the standard switches disable this behavior in 11.x:
- SET optimizer_use_condition_selectivity = 1;
- SET optimizer_switch='rowid_filter=off';
- SET optimizer_switch='index_merge_sort_intersection=off';
Reproduction (MTR)
I've attached MDEV-38552-limit-regression.test. Please run it with standard MTR. It creates a synthetic skew where "noise" (old data) precedes the "target" (new data) in the sort order, forcing the optimizer to scan 10,000 rows to find a single match.
Test Logic Preview (Content of attached file):
--source include/have_innodb.inc

# 1. Structure
CREATE TABLE mre_final (
  id bigint(20) NOT NULL AUTO_INCREMENT,
  ticket_id bigint(20) NOT NULL,
  create_time datetime NOT NULL,
  name varchar(1) NOT NULL,
  PRIMARY KEY (id),
  KEY idx_ticket_id (ticket_id),
  KEY idx_create_time (create_time)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb3;

# 2. DATA GENERATION
--disable_query_log

# IMPORTANT: This test demonstrates the optimizer's logic error,
# not the production-scale performance impact. On the actual 122M-row
# table, the index scan reads ~100M rows before finding the first
# match (600+ seconds). This test uses 10K noise rows to keep MTR
# fast, but the optimizer's decision-making failure is identical.

# A. Insert 10,000 NOISE rows (older times).
# These rows do NOT match the WHERE clause but sit at the start
# of the ORDER BY index.
let $i=1;
while ($i <= 10000)
{
  eval INSERT INTO mre_final (ticket_id, create_time, name) VALUES (999, FROM_UNIXTIME($i), 'x');
  inc $i;
}

# B. Insert 500 TARGET rows (newer times).
# These rows match the WHERE clause but sit at the END of the ORDER BY index.
let $i=10001;
while ($i <= 10500)
{
  eval INSERT INTO mre_final (ticket_id, create_time, name) VALUES (1663121, FROM_UNIXTIME($i), 'a');
  inc $i;
}
--enable_query_log

# 3. ANALYZE to ensure the optimizer has correct statistics
ANALYZE TABLE mre_final PERSISTENT FOR ALL;

--echo #
--echo # [TEST A] LIMIT 1 (The Regression)
--echo # Expected: 'idx_ticket_id' (ref)
--echo # Actual (11.x): 'idx_create_time' (index scan) -> Scans 10,000 rows
--echo #
EXPLAIN SELECT name FROM mre_final th WHERE th.ticket_id = 1663121 ORDER BY th.create_time ASC LIMIT 1;

--echo #
--echo # [TEST B] LIMIT 500 (The Control)
--echo # Forces the optimizer to see the high scan cost.
--echo #
EXPLAIN SELECT name FROM mre_final th WHERE th.ticket_id = 1663121 ORDER BY th.create_time ASC LIMIT 500;
Developer Notes / Suggested Fix
This issue appears closely related to the cost model changes in MDEV-26974, specifically how the cost of "seeking" for the next match in an ordered scan is weighed against ref access.
The optimizer seems to treat the probability of finding a match (p) as uniform across the index. Under that assumption the expected scan length is roughly LIMIT / p, which for a small LIMIT is artificially low compared to the real scan length on skewed data. The cost model needs to either:
- Account for non-uniform distribution risk in test_if_skip_sort_order.
- Increase the penalty for "row lookup failure" during an ordered scan when a high-selectivity ref index exists.
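One possible shape for such a penalty can be sketched numerically: blend the optimistic uniform estimate with the worst case (scanning the whole index). The function names and the risk factor below are hypothetical, proposed for illustration only; the 4.36% figure is the `filtered` estimate from the EXPLAIN above.

```python
# Sketch of a clustering-risk adjustment to the uniform scan estimate.
# 'risk_factor' and both function names are hypothetical, not MariaDB
# internals.

def uniform_scan_estimate(limit, p):
    # Rows the optimizer expects to scan before LIMIT matches,
    # assuming matches are uniformly distributed in the index.
    return limit / p

def risk_adjusted_estimate(limit, p, total_rows, risk_factor=0.1):
    # Blend the optimistic estimate with the worst case (scan all rows)
    # to hedge against clustered distributions.
    optimistic = uniform_scan_estimate(limit, p)
    return optimistic + risk_factor * (total_rows - optimistic)

p = 0.0436    # the 4.36% 'filtered' estimate from the EXPLAIN above

print(uniform_scan_estimate(100, p))                  # ~2,294 rows
print(risk_adjusted_estimate(100, p, 122_000_000))    # ~12.2M rows
```

Even a modest risk factor would push the ordered-scan estimate well past the ~161K-row ref + filesort cost, restoring the 10.11 plan choice in this case.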
Disclaimer: This bug report has been created with the assistance of an LLM.