[MDEV-36321] Indexes on derived tables with GROUP BY produce wrong out_rows estimates - Jira

XML

Word

Printable

Details

Type: Task
Status: Closed (View Workflow)
Priority: Critical
Resolution: Fixed
Fix Version/s: 12.2.1
Component/s: Optimizer
Labels:

Sprint:
Q4/2025 Server Maintenance

Description

Problem description

Consider a query using Split Materialized (EDIT: or just derived-with-keys): (database fill steps provided below)

explain select *

from

  t2, (select max(value), grp_id from t1 group by grp_id) DT

where

  t2.a= DT.grp_id;

+------+-----------------+------------+------+---------------+--------+---------+------------+------+-------------+

| id   | select_type     | table      | type | possible_keys | key    | key_len | ref        | rows | Extra       |

+------+-----------------+------------+------+---------------+--------+---------+------------+------+-------------+

|    1 | PRIMARY         | t2         | ALL  | NULL          | NULL   | NULL    | NULL       | 5    | Using where |

|    1 | PRIMARY         | <derived2> | ref  | key0          | key0   | 5       | test2.t2.a | 10   |             |

|    2 | LATERAL DERIVED | t1         | ref  | grp_id        | grp_id | 5       | test2.t2.a | 100  |             |

+------+-----------------+------------+------+---------------+--------+---------+------------+------+-------------+

Note that table <derived2> has rows=10, even if we know that it will have rows=1 due to GROUP BY and splitting.

This isn't just the wrong estimate for number of rows to scan in <derived2>. The incorrect fanout affects the partial join cardinality.

Relevant sections of the optimizer trace:

                    "plan_prefix": "t2",

                    "get_costs_for_tables": [

                        "best_access_path": {

                          "table": "<derived2>",

...

                          "choose_best_splitting": {

                            "split_materialized": {

                              "chosen": true

...

},

                          "chosen_access_method": {

                            "type": "ref",

                            "rows_read": 10,

                            "rows_out": 10,

                            "cost": 0.00639105,

                            "uses_join_buffering": false

and then: (t2 has rows=5, so rows_for_plan=50 is a 10x over-estimate:

                    "plan_prefix": "t2",

                    "table": "<derived2>",

                    "rows_for_plan": 50,

                    "cost_for_plan": 0.01806085

Can this be fixed by just setting rows=1?

Database fill steps:

create table t1 (

  grp_id int,

  value int,

  index (grp_id)

);

insert into t1 select

  A.seq, B.seq

from

  seq_1_to_10000 A,

  seq_1_to_100 B

create table t2 (a int);

insert into t2 select seq from seq_1_to_5;

analyze table t1,t2;

A related bugfix done recently: ~~MDEV-30877~~.

Solution for derived_with_keys

Note that SELECT DISTINCT is already covered

Note that the code in TABLE::add_tmp_key already has similar logic for DISTINCT:

    if (derived)

      st_select_lex* first= derived->first_select();

      uint select_list_items= first->get_item_list()->elements;

      if (key_parts == select_list_items)

        if ((!first->is_part_of_union() && (first->options & SELECT_DISTINCT)) ||

            derived->check_distinct_in_union())

          keyinfo->rec_per_key[key_parts - 1]= 1;

"if the index covers all elements of the select list, then specifying all keyparts gives rec_per_key=1".

Handling GROUP BY

An idea of what could be done here: add here a function that handles cases when

The derived table has one SELECT (!first->is_part_of_union())
The derived table has a GROUP BY clause (first->group)

The logic to produce estimates is as follows.
We have a SELECT with GROUP BY that is materialized into a temporary table:

create temporary table (tmp_col1, ... tmp_colK)  as

select

   col1, ..., colN

   sum(...) as group_col1, ...

group by

   col1, ... colN;

and we're considering an index on some of the temp.table columns:

INDEX( tmp_table_col1,  ..., tmp_table_colN ...)

Any index prefix that includes all GROUP BY columns will have rec_per_key=1.

Pseudo code to implement this:

  my_group_list= {collect a list of items in the subquery's GROUP BY clause};

  for each key part  KP {

     F= table->field(KP->field_number);

     if (find F in my_group_list == NOT_FOUND)

        continue; // Can do nothing for this key part. But might do for further key parts

     remove F from my_group_list;

     if (my_group_list is empty)

        for each keypart KP2 starting from KP

           set index->rec_per_key[KP2] = 1;

        break;

Handling UNION queries

TODO. Johnston's patch tries to do some handling but currently has bugs.

UNION [DISTINCT] removes the duplicates. If the index covers all of select list, we can set

index->rec_per_key[n_key_parts - 1]=1;

Both UNION DISTINCT and UNION ALL let one rec_per_key from each of the select.
The initial state is that index statistics is unknown, that is, rec_per_key[i]=0 for all key parts:

key_part	kp1	kp2	kp3	kp 4
rec_per_key	0	0	0	0

Suppose, the first SELECT in the UNION has GROUP BY kp1,kp2,kp3. We will get this statistics:

key_part	kp1	kp2	kp3	kp 4
rec_per_key	0	0	1	1

Suppose, the second SELECT in the UNION has GROUP BY kp1,kp2. We will get:

key_part	kp1	kp2	kp3	kp 4
rec_per_key	0	1	1	1

How to add the statistics?
If rec_per_key[i]=0 for any of the SELECTs, that means UNKNOWN and the result should also be 9, UNKNOWN.
If rec_per_key[i]=1 for each of the SELECTs, then we can add 1+1=2 and have that as an estimate for the resulting table.

For or example, this gives:

key_part	kp1	kp2	kp3	kp 4
rec_per_key	0	0	2	2

Touching the "guesstimate" of 10 rows

In current code, best_access_path() will use a guesstimate of 10 rows when index statistics is not available.
Touching this goes outside of the scope of this MDEV entry.

Solution for Split-Materialized

Split-Materialized seems to reuse the code from derived_with_keys?

I've used the testcase in this MDEV, set rec_per_key[key_parts-1]=1 in the debugger, and the EXPLAIN that uses Split-Materialized has shown correct rows_out.
One probably needs to also go through the Optimizer Trace and see if all cardinalities are correct.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

mdev-36321-analysis.md
8 kB
2025-07-17 19:03

Issue Links

is part of

MDEV-26300 LATERAL DERIVED: Optimizer trace insufficient, odd computations

Confirmed

relates to

MDEV-30877 Wrong cardinality estimation for the derived table leads to slow plan with LATERAL DERIVED

Closed

MDEV-37249 Assertion fail in apply_selectivity_for_table

Confirmed

MDEV-37936 Followup to MDEV-36321: out_rows for GROUP BY: use of item names?

In Review

MDEV-26300 LATERAL DERIVED: Optimizer trace insufficient, odd computations

Confirmed

MDEV-37201 Pick the most efficient subset of keys there are too many key parts

Open

(1 relates to)

Activity

People

Assignee:: Rex Johnston

Reporter:: Sergei Petrunia

Votes:: 1 Vote for this issue

Watchers:: 7 Start watching this issue

Dates

Created:: 2025-03-18 13:56

Updated:: 2025-10-31 02:44

Resolved:: 2025-10-28 17:48

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.