Details
- Type: Bug
- Status: In Progress
- Priority: Critical
- Resolution: Unresolved
- 11.4
- Can result in unexpected behaviour
- Q3/2025 Server Development
Description
Problem description
Consider a query using Split Materialized (EDIT: or just derived-with-keys): (database fill steps provided below)
explain select *
from
  t2, (select max(value), grp_id from t1 group by grp_id) DT
where
  t2.a= DT.grp_id;

+------+-----------------+------------+------+---------------+--------+---------+------------+------+-------------+
| id   | select_type     | table      | type | possible_keys | key    | key_len | ref        | rows | Extra       |
+------+-----------------+------------+------+---------------+--------+---------+------------+------+-------------+
| 1    | PRIMARY         | t2         | ALL  | NULL          | NULL   | NULL    | NULL       | 5    | Using where |
| 1    | PRIMARY         | <derived2> | ref  | key0          | key0   | 5       | test2.t2.a | 10   |             |
| 2    | LATERAL DERIVED | t1         | ref  | grp_id        | grp_id | 5       | test2.t2.a | 100  |             |
+------+-----------------+------------+------+---------------+--------+---------+------------+------+-------------+
Note that table <derived2> has rows=10, even though we know it will have rows=1 due to GROUP BY and splitting.
This isn't just a wrong estimate for the number of rows to scan in <derived2>: the incorrect fanout also affects the partial join cardinality.
Relevant sections of the optimizer trace:
"plan_prefix": "t2",
"get_costs_for_tables": [
  {
    "best_access_path": {
      "table": "<derived2>",
      ...
      "choose_best_splitting": {
        "split_materialized": {
          "chosen": true
        }
        ...
      },
      "chosen_access_method": {
        "type": "ref",
        "rows_read": 10,
        "rows_out": 10,
        "cost": 0.00639105,
        "uses_join_buffering": false
      }
and then (t2 has rows=5, so rows_for_plan=50 is a 10x over-estimate):
{
  "plan_prefix": "t2",
  "table": "<derived2>",
  "rows_for_plan": 50,
  "cost_for_plan": 0.01806085
}
Can this be fixed by just setting rows=1?
Database fill steps:
create table t1 (
  grp_id int,
  value int,
  index (grp_id)
);

insert into t1 select
  A.seq, B.seq
from
  seq_1_to_10000 A,
  seq_1_to_100 B;

create table t2 (a int);
insert into t2 select seq from seq_1_to_5;

analyze table t1,t2;
A related bugfix done recently: MDEV-30877.
Solution for derived_with_keys
Note that SELECT DISTINCT is already covered: the code in TABLE::add_tmp_key already has the needed logic for DISTINCT:
if (derived)
{
  st_select_lex* first= derived->first_select();
  uint select_list_items= first->get_item_list()->elements;
  if (key_parts == select_list_items)
  {
    if ((!first->is_part_of_union() && (first->options & SELECT_DISTINCT)) ||
        derived->check_distinct_in_union())
      keyinfo->rec_per_key[key_parts - 1]= 1;
  }
}
That is: "if the index covers all elements of the select list, then specifying all key parts gives rec_per_key=1".
Handling GROUP BY
An idea of what could be done here: add a function that handles the case where
- the derived table has a single SELECT (!first->is_part_of_union()), and
- the derived table has a GROUP BY clause (first->group).
The logic to produce estimates is as follows.
We have a SELECT with GROUP BY that is materialized into a temporary table:
create temporary table (tmp_col1, ... tmp_colK) as
select
  col1, ..., colN,
  sum(...) as group_col1, ...
group by
  col1, ... colN;
and we're considering an index on some of the temp.table columns:
INDEX(tmp_table_col1, ..., tmp_table_colN, ...)
Any index prefix that includes all GROUP BY columns will have rec_per_key=1.
Pseudo code to implement this:
my_group_list= {collect a list of items in the subquery's GROUP BY clause};

for each key part KP {
  F= table->field(KP->field_number);
  if (find F in my_group_list == NOT_FOUND)
    continue; // Can do nothing for this key part, but might for further key parts
  remove F from my_group_list;
  if (my_group_list is empty)
  {
    for each keypart KP2 starting from KP
      set index->rec_per_key[KP2]= 1;
    break;
  }
}
|
Handling UNION queries
TODO. Johnston's patch tries to do some handling but currently has bugs.
UNION [DISTINCT] removes duplicates. If the index covers all of the select list, we can set

index->rec_per_key[n_key_parts - 1]= 1;

Both UNION DISTINCT and UNION ALL let one combine the rec_per_key statistics obtained from each of the SELECTs.
The initial state is that index statistics are unknown, that is, rec_per_key[i]=0 for all key parts:
key_part    | kp1 | kp2 | kp3 | kp4
rec_per_key |  0  |  0  |  0  |  0
Suppose, the first SELECT in the UNION has GROUP BY kp1,kp2,kp3. We will get this statistics:
key_part    | kp1 | kp2 | kp3 | kp4
rec_per_key |  0  |  0  |  1  |  1
Suppose, the second SELECT in the UNION has GROUP BY kp1,kp2. We will get:
key_part    | kp1 | kp2 | kp3 | kp4
rec_per_key |  0  |  1  |  1  |  1
How to add the statistics?
If rec_per_key[i]=0 for any of the SELECTs, that means UNKNOWN, and the result should also be 0, UNKNOWN.
If rec_per_key[i]=1 for each of the SELECTs, then we can add 1+1=2 and have that as an estimate for the resulting table.
For example, this gives:
key_part    | kp1 | kp2 | kp3 | kp4
rec_per_key |  0  |  0  |  2  |  2
Touching the "guesstimate" of 10 rows
In the current code, best_access_path() will use a guesstimate of 10 rows when index statistics are not available.
Touching this is outside the scope of this MDEV entry.
Solution for Split-Materialized
Split-Materialized seems to reuse the code from derived_with_keys?
I've used the testcase in this MDEV, set rec_per_key[key_parts-1]=1 in the debugger, and the EXPLAIN that uses Split-Materialized has shown correct rows_out.
One probably needs to also go through the Optimizer Trace and see if all cardinalities are correct.
Attachments
Issue Links
- is part of:
  - MDEV-26300 LATERAL DERIVED: Optimizer trace insufficient, odd computations (Confirmed)
- relates to:
  - MDEV-30877 Wrong cardinality estimation for the derived table leads to slow plan with LATERAL DERIVED (Closed)
  - MDEV-37249 Assertion fail in apply_selectivity_for_table (Confirmed)
  - MDEV-26300 LATERAL DERIVED: Optimizer trace insufficient, odd computations (Confirmed)
  - MDEV-37201 Pick the most efficient subset of keys there are too many key parts (Open)