[MDEV-30877] Wrong cardinality estimation for the derived table leads to slow plan with LATERAL DERIVED - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Critical
Resolution: Fixed
Affects Version/s: 10.4.28
Fix Version/s: 11.4.6, 11.8.2
Component/s: Optimizer
Labels:
- cardinality
- split_materialized

Description

Consider the following slow query:

MariaDB [(none)]> analyze

    -> SELECT

    ->     esd.esd_c1,

    ->     esd.esd_c2,

    ->     esd.esd_c3,

    ->     es.es_c1,

    ->     es.es_c2,

    ->     es.es_c3,

    ->     es.es_c4,

    ->     es.es_c5,

    ->     es.es_c6,

    ->     es.es_c7,

    ->     es.es_c8

    -> FROM

    ->     DB1.t1 esd

    ->         INNER JOIN

    ->     DB1.t2 es ON esd.t1Id = es.t1Id

    ->         INNER JOIN

    ->     (SELECT

    ->         esd.esd_c1, MAX(es.es_c1) AS last_set

    ->     FROM

    ->         DB1.t1 esd

    ->     INNER JOIN DB1.t2 es ON esd.t1Id = es.t1Id

    ->     GROUP BY esd.esd_c1) q ON esd.esd_c1 = q.esd_c1

    -> WHERE

    ->     es.es_c1 = q.last_set;

+------+-----------------+------------+-------+---------------------------+-----------+---------+----------------------------------------------+-------+----------+----------+------------+--------------------------+

| id   | select_type     | table      | type  | possible_keys             | key       | key_len | ref                                          | rows  | r_rows   | filtered | r_filtered | Extra                    |

+------+-----------------+------------+-------+---------------------------+-----------+---------+----------------------------------------------+-------+----------+----------+------------+--------------------------+

|    1 | PRIMARY         | esd        | index | PRIMARY,UNIQUEMSG,Index_3 | UNIQUEMSG | 137     | NULL                                         | 15197 | 15197.00 |   100.00 |     100.00 | Using where; Using index |

|    1 | PRIMARY         | <derived2> | ref   | key0                      | key0      | 33      | DB1.esd.esd_c1                          | 2     | 0.98     |   100.00 |     100.00 | Using where              |

|    1 | PRIMARY         | es         | ref   | PRIMARY                   | PRIMARY   | 9       | DB1.esd.t1Id,q.last_set | 157   | 21.26    |   100.00 |     100.00 |                          |

|    2 | LATERAL DERIVED | esd        | ref   | PRIMARY,UNIQUEMSG,Index_3 | UNIQUEMSG | 33      | DB1.esd.esd_c1                          | 7     | 17.40    |   100.00 |     100.00 | Using index              |

|    2 | LATERAL DERIVED | es         | ref   | PRIMARY                   | PRIMARY   | 4       | DB1.esd.t1Id            | 875   | 783.76   |   100.00 |     100.00 | Using index              |

+------+-----------------+------------+-------+---------------------------+-----------+---------+----------------------------------------------+-------+----------+----------+------------+--------------------------+

5 rows in set (1 min 12.157 sec)

It uses LATERAL DERIVED optimization and in the optimizer trace (see attached) we can find the following cardinality estimations (with statistics up to date including histograms etc):

            "rows_estimation": [

                "table": "esd",

                "table_scan": {

                  "rows": 15197,

                  "cost": 97

},

                "table": "es",

                "table_scan": {

                  "rows": 12840562,

                  "cost": 103969

},

                "table": "<derived2>",

                "table_scan": {

                  "rows": 13304563,

                  "cost": 1.33e7

while real row counts are as follows:

MariaDB [(none)]> select count(*) from DB1.t1 esd;

+----------+

| count(*) |

+----------+

| 15197 |

+----------+

1 row in set (0.032 sec)

MariaDB [(none)]> select count(*) from DB1.t2 es;

+----------+

| count(*) |

+----------+

| 12846717 |

+----------+

1 row in set (13.979 sec)

MariaDB [(none)]> select count(*) from (SELECT

-> esd.symbol, MAX(es.es_c1) AS last_set

-> FROM

-> DB1.t1 esd

-> INNER JOIN DB1.t2 es ON esd.t1Id = es.t1Id

-> GROUP BY esd.esd_c1) q;

+----------+

| count(*) |

+----------+

| 1877 |

+----------+

1 row in set (5.507 sec)

You can see above that estimated number of rows for each of the tables is precise or very close to reality, while for the derived table it's orders of magnitude wrong. What can be done to fix this?

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

example-dataset-1.sql
0.3 kB
2025-01-31 11:58
optimizer_trace_mangled.txt
1.87 MB
2023-03-19 15:14

Issue Links

relates to

MDEV-33831 Group by inside subquery / CTE: Very bad row estimate

Closed

MDEV-36321 Split Materialized produces wrong out_rows estimates

Confirmed

Activity

Ascending order - Click to sort in descending order

View 7 older comments

Sergei Petrunia added a comment - 2025-01-31 12:17

Followup to review discussion:

The optimizer removes constant expressions from GROUP BY clause. This is done by JOIN::optimize_stage2() calling remove_const(), which happens after JOIN::optimize_inner()|make_join_statistics()|estimate_post_group_cardinality() calls.

So, in estimate_post_group_cardinality() the GROUP BY is still the original one.

Examples for debugging: example-dataset-1.sql

explain select *

from

  select

    count(*),

    bank,

    department

  from t1

  where category=1

  group by category, bank,department

) T;

with a constant table:

explain select *

from

  select

    count(*),

    bank,

    department

  from t1, t2

  where category=1 and t2.pk=1

  group by t2.pk, category, bank,department

) T;

Sergei Petrunia added a comment - 2025-01-31 12:17 Followup to review discussion: The optimizer removes constant expressions from GROUP BY clause. This is done by JOIN::optimize_stage2() calling remove_const(), which happens after JOIN::optimize_inner()|make_join_statistics()|estimate_post_group_cardinality() calls. So, in estimate_post_group_cardinality() the GROUP BY is still the original one. Examples for debugging: example-dataset-1.sql explain select * from ( select count (*), bank, department from t1 where category=1 group by category, bank,department ) T; with a constant table: explain select * from ( select count (*), bank, department from t1, t2 where category=1 and t2.pk=1 group by t2.pk, category, bank,department ) T;

Sergei Petrunia added a comment - 2025-01-31 12:40 - edited

Should we use non-prefix index statistics?

Suppose we have

   (SELECT ...

    FROM t1

    GROUP BY t1.col2)

and

  INDEX idx(col1, col2)

We're interested in nd2= n_distinct(col2).
But we only have idx->rec_per_key(1) = nd12 = n_distinct({col1, col2}).

It is clear that

   n_distinct(col2) <= n_distinct({col1, col2}) <= table_rows

and when we can't have nd2, it's better to use nd12 than table_rows.

In general, if we have

  GROUP BY  col1, col2, ... colN

and

  INDEX( ... col1 ... col2 ... colN ...)

we can use any n_distinct($INDEX_PREFIX) for any $INDEX_PREFIX that contains all of GROUP BY columns {col1, ... colN}.

If $INDEX_PREFIX contains columns other than GROUP BY columns, let's call this an SupersetEstimate (as opposed to ExactSetEstimate when it doesn't)

Should we use SupersetEstimates?

Using SupersetEstimates can handle some cases that cannot be handled with ExactSetEstimates.

On the other hand, when we only use ExactSetEstimates, the optimizer has fewer options. If we have found an index covering {col1, ... colN}, we can assume the estimate is good and not examine if other indexes provide a tighter bound estimate for the same set of columns.

If we use SupersetEstimates, we can potentially get some very inexact estimates (when extra columns in the index have high n_distinct). We will need some clever code to prefer tighter bound estimates, and probably some logic to prefer ExactSetEstimates over SupersetEstimates.

Looks like not worth doing within this MDEV?

Sergei Petrunia added a comment - 2025-01-31 12:40 - edited Should we use non-prefix index statistics? Suppose we have ( SELECT ... FROM t1 GROUP BY t1.col2) and INDEX idx(col1, col2) We're interested in nd2= n_distinct(col2) . But we only have idx->rec_per_key(1) = nd12 = n_distinct({col1, col2}) . It is clear that n_distinct(col2) <= n_distinct({col1, col2}) <= table_rows and when we can't have nd2, it's better to use nd12 than table_rows. In general, if we have GROUP BY col1, col2, ... colN and INDEX( ... col1 ... col2 ... colN ...) we can use any n_distinct($INDEX_PREFIX) for any $INDEX_PREFIX that contains all of GROUP BY columns {col1, ... colN}. If $INDEX_PREFIX contains columns other than GROUP BY columns, let's call this an SupersetEstimate (as opposed to ExactSetEstimate when it doesn't) Should we use SupersetEstimates? Using SupersetEstimates can handle some cases that cannot be handled with ExactSetEstimates. On the other hand, when we only use ExactSetEstimates, the optimizer has fewer options. If we have found an index covering {col1, ... colN}, we can assume the estimate is good and not examine if other indexes provide a tighter bound estimate for the same set of columns. If we use SupersetEstimates, we can potentially get some very inexact estimates (when extra columns in the index have high n_distinct). We will need some clever code to prefer tighter bound estimates, and probably some logic to prefer ExactSetEstimates over SupersetEstimates. Looks like not worth doing within this MDEV?

Sergei Petrunia added a comment - 2025-01-31 12:55

Addressed all other review input and pushed the result here:

commit 1919eec16143a569f884429e882975979a66f469 (HEAD -> bb-11.8-MDEV-30877-v3, origin/bb-11.8-MDEV-30877-v3)

Author: Sergei Petrunia <sergey@mariadb.com>

Date:   Thu Jan 23 20:20:00 2025 +0200

    MDEV-30877: Output cardinality for derived table ignores GROUP BY

    (Variant 3)

    When a derived table has a GROUP BY clause:

Will need to discuss which version should this go into.

Sergei Petrunia added a comment - 2025-01-31 12:55 Addressed all other review input and pushed the result here: commit 1919eec16143a569f884429e882975979a66f469 (HEAD -> bb-11.8-MDEV-30877-v3, origin/bb-11.8-MDEV-30877-v3) Author: Sergei Petrunia <sergey@mariadb.com> Date: Thu Jan 23 20:20:00 2025 +0200 MDEV-30877: Output cardinality for derived table ignores GROUP BY (Variant 3) When a derived table has a GROUP BY clause: Will need to discuss which version should this go into.

Sergei Petrunia added a comment - 2025-02-05 09:56

Based on review discussion, pushed an un-related commit:

commit 91de54e098b179f033deb1ee61fc3d7d5997f588 (HEAD -> 10.11, origin/bb-10.11-spetrunia-tmp, origin/10.11, bb-10.11-spetrunia-tmp)

Author: Sergei Petrunia <sergey@mariadb.com>

Date:   Tue Feb 4 21:46:42 2025 +0200

    Remove redundant if-statement in Index_prefix_calc::get_avg_frequency

    the code went like this:

       for ( ... ; i < prefixes ; ...)

          if (i < prefixes)

...

Sergei Petrunia added a comment - 2025-02-05 09:56 Based on review discussion, pushed an un-related commit: commit 91de54e098b179f033deb1ee61fc3d7d5997f588 (HEAD -> 10.11, origin/bb-10.11-spetrunia-tmp, origin/10.11, bb-10.11-spetrunia-tmp) Author: Sergei Petrunia <sergey@mariadb.com> Date: Tue Feb 4 21:46:42 2025 +0200 Remove redundant if-statement in Index_prefix_calc::get_avg_frequency the code went like this: for ( ... ; i < prefixes ; ...) { if (i < prefixes) { ...

Sergei Petrunia added a comment - 2025-02-05 10:46

bb-11.4-~~MDEV-30877~~-v3

Sergei Petrunia added a comment - 2025-02-05 10:46 bb-11.4- MDEV-30877 -v3

MariaDB Server

Wrong cardinality estimation for the derived table leads to slow plan with LATERAL DERIVED

Details

Description

Attachments

Attachments

Issue Links

Activity

Should we use non-prefix index statistics?

Should we use SupersetEstimates?

People

Dates

Git Integration