[MDEV-27201] Non-merged derived table: do not compute unused fields - Jira

Details

Type: Task
Status: In Review (View Workflow)
Priority: Critical
Resolution: Unresolved
Fix Version/s: 12.1
Component/s: Optimizer
Labels:
- optimizer
- optimizer-easy

Description

(Filing this based on my observations from analyzing queries at a prospect considering migration to MariaDB)

The problem: If a non-merged derived table has unused columns, they are still created and computed.

Creating a column might not be a big deal
Computing a column can cause some unused overhead
The worst part is that Table Elimination cannot eliminate the parts of query plan that compute useless information.

Testcase:

create table t1 (

  group_id int,

  a int,

  b int,

  c int

);

insert into t1 select  seq/100, seq, seq, seq from seq_1_to_10000;

create table t2 (a int);

insert into t2 values (1),(2),(3);

explain

select

  T.group_id,

  T.MAXA

from

t2,

  (select

     group_id,

     max(a) as MAXA,

     sum(b) as SUMB

   from

t1

   group by group_id

  ) as T

where

 T.group_id = t2.a;

+------+-------------+------------+------+---------------+------+---------+-----------+-------+---------------------------------+

| id   | select_type | table      | type | possible_keys | key  | key_len | ref       | rows  | Extra                           |

+------+-------------+------------+------+---------------+------+---------+-----------+-------+---------------------------------+

|    1 | PRIMARY     | t2         | ALL  | NULL          | NULL | NULL    | NULL      | 3     | Using where                     |

|    1 | PRIMARY     | <derived2> | ref  | key0          | key0 | 5       | test.t2.a | 100   |                                 |

|    2 | DERIVED     | t1         | ALL  | NULL          | NULL | NULL    | NULL      | 10000 | Using temporary; Using filesort |

+------+-------------+------------+------+---------------+------+---------+-----------+-------+---------------------------------+

Note that sum(b) as SUMB is not used.

Put breakpoints in

create_tmp_table
Item_sum_sum::update_field

one can still see that the column for SUM() is created and the value is computed.

How it affects table elimination

Add another table into the subquery:

create table t3 (

  pk int primary key,

  col int

);

insert into t3 select  seq, seq from seq_1_to_10000;

explain

select

  T.group_id,

  T.MAXA

from

t2,

  (select

     group_id,

     max(a) as MAXA,

     sum(b) as SUMB

   from

     t1 left join t3 on t3.pk=t1.c

   group by group_id

  ) as T

where

 T.group_id = t2.a;

+------+-------------+------------+------+---------------+------+---------+-----------+------+---------------------------------+

| id   | select_type | table      | type | possible_keys | key  | key_len | ref       | rows | Extra                           |

+------+-------------+------------+------+---------------+------+---------+-----------+------+---------------------------------+

|    1 | PRIMARY     | t2         | ALL  | NULL          | NULL | NULL    | NULL      | 3    | Using where                     |

|    1 | PRIMARY     | <derived2> | ref  | key0          | key0 | 5       | test.t2.a | 99   |                                 |

|    2 | DERIVED     | t1         | ALL  | NULL          | NULL | NULL    | NULL      | 9963 | Using temporary; Using filesort |

+------+-------------+------------+------+---------------+------+---------+-----------+------+---------------------------------+

Ok, t3 is eliminated.

Now, use t3.b in the SUM:

explain

select

  T.group_id,

  T.MAXA

from

t2,

  (select

     group_id,

     max(a) as MAXA,

     sum(t3.col) as SUMB

   from

     t1 left join t3 on t3.pk=t1.c

   group by group_id

  ) as T

where

 T.group_id = t2.a;

+------+-------------+------------+--------+---------------+---------+---------+-----------+------+---------------------------------+

| id   | select_type | table      | type   | possible_keys | key     | key_len | ref       | rows | Extra                           |

+------+-------------+------------+--------+---------------+---------+---------+-----------+------+---------------------------------+

|    1 | PRIMARY     | t2         | ALL    | NULL          | NULL    | NULL    | NULL      | 3    | Using where                     |

|    1 | PRIMARY     | <derived2> | ref    | key0          | key0    | 5       | test.t2.a | 99   |                                 |

|    2 | DERIVED     | t1         | ALL    | NULL          | NULL    | NULL    | NULL      | 9963 | Using temporary; Using filesort |

|    2 | DERIVED     | t3         | eq_ref | PRIMARY       | PRIMARY | 4       | test.t1.c | 1    | Using where                     |

+------+-------------+------------+--------+---------------+---------+---------+-----------+------+---------------------------------+

And now table elimination is unable to remove t3.

Attachments

Issue Links

relates to

MDEV-26278 Table elimination does not work across derived tables

Closed

Activity

Ascending order - Click to sort in descending order

Sergei Petrunia added a comment - 2023-06-08 12:55

A Description of the current solution:

Preparation: table->read_set is not set to all 1 for temporary (derived) tables.

Then, at mysql_derived_optimize() phase, we know which table columns are used and which are not.

Then, if we see a sum function that produces a column that is not used,
we remove it from sum_funcs list and then its value is not computed.

Sergei Petrunia added a comment - 2023-06-08 12:55 A Description of the current solution: Preparation: table->read_set is not set to all 1 for temporary (derived) tables. Then, at mysql_derived_optimize() phase, we know which table columns are used and which are not. Then, if we see a sum function that produces a column that is not used, we remove it from sum_funcs list and then its value is not computed.

Sergei Petrunia added a comment - 2023-06-09 07:59

Take aways from the discussions:

It's hard to handle WITH ROLLUP as it copies aggregate functions. It doesn't seem to be worth it - WITH ROLLUP in a derived table is not that useful.

Sergei Petrunia added a comment - 2023-06-09 07:59 Take aways from the discussions: It's hard to handle WITH ROLLUP as it copies aggregate functions. It doesn't seem to be worth it - WITH ROLLUP in a derived table is not that useful.

Oleg Smirnov added a comment - 2023-06-20 05:52 - edited

bb-11.1-mdev-27201 presents the first version of the functionality, although a couple of tests still fail and need addressing.

commit 4c19fb9ed7dee5e2bd988737f8c334b37252c989 (HEAD -> bb-11.1-mdev-27201, origin/bb-11.1-mdev-27201)
Author: Oleg Smirnov <olernov@gmail.com>
Date: Mon Jun 19 14:42:24 2023 +0700

    MDEV-27201 Do not compute unused aggregate functions for derived tables and views

    Disable the calculation of aggregate functions whose results are not used,

    neither inside the derived table SQL nor outside, in the outer select.

    For example, in the following case:

      SELECT sum_b

      FROM (SELECT a, sum(b) AS sum_b, max(c) AS max_c, min(d) AS min_d

            FROM t1

            GROUP BY a

            HAVING max_c > 10

) T

    T.min_d is not used anywhere, thus there is no need to calculate the

    aggregate function min(d).

    This applies to both VIEWs and derived tables.

    Firstly, this commit discards setting all bits of read_set for

    temporary tables to 1, which is done by default. That allows to determine

    which fields of a view/derived table are not used in the outer query.

    Secondly, the commit traverses HAVING and ORDER BY clauses of a view

    to ensure an aggregate function which is a target for disabling is

    not referenced from there

This optimization shows a significant improvement in performance for a scenario where a view has multiple aggregate field but only few of them are selected:

create table t1 (

  group_id int,

  a int,

  b int,

  c int

);

insert into t1 select  seq/100000, seq+10, seq+20, seq+30 from seq_1_to_1000000;

create view v1 as

select

  group_id,

  sum(a) AS suma, sum(b) AS sumb, sum(c) AS sumc,

  max(a) AS maxa, max(b) AS maxb, max(c) AS maxc,

  avg(a) AS avga, avg(b) AS avgb, avg(c) AS avgc

from t1

group by group_id;

select group_id, suma from v1;

Average time of the SELECT execution for bb-11.1-mdev-27201 is 9.156 seconds for 1 million rows against 12.525 seconds for the mainstream 11.1.

Current limitations and possible TODOs.
1. The function does not disable aggregate functions splitting like this:
SELECT sum(a) + sum(b) AS sumab FROM ...
In this case sum(a) + sum(b) is calculated as two separate aggregate
functions sum(a) and sum(b). They are not disabled even if the field
"sumab" is not used.
2. If any arithmetic and/or an additional calculation are applied to an
aggregate functionm, it will not be disabled. For example:
SELECT sum(a)+1 AS suma_1 FROM ...
3. Window aggregate functions are not processed.
4. References to previous fields like this:
SELECT sum(a) AS suma, suma+1 AS suma_1 FROM ...
are not processed. Currently they are not supported at all, but if
they are supported in the future, appropriate adjustments will need to
be made for this function. There is currently no check that sum(a) is used
for suma_1, so sum(a) may be erroneously disabled.
5. Derived table elimination is not affected, however that might be beneficial.

Oleg Smirnov added a comment - 2023-06-20 05:52 - edited bb-11.1-mdev-27201 presents the first version of the functionality, although a couple of tests still fail and need addressing. commit 4c19fb9ed7dee5e2bd988737f8c334b37252c989 (HEAD -> bb-11.1-mdev-27201, origin/bb-11.1-mdev-27201) Author: Oleg Smirnov <olernov@gmail.com> Date: Mon Jun 19 14:42:24 2023 +0700 MDEV-27201 Do not compute unused aggregate functions for derived tables and views Disable the calculation of aggregate functions whose results are not used, neither inside the derived table SQL nor outside, in the outer select. For example, in the following case: SELECT sum_b FROM (SELECT a, sum(b) AS sum_b, max(c) AS max_c, min(d) AS min_d FROM t1 GROUP BY a HAVING max_c > 10 ) T T.min_d is not used anywhere, thus there is no need to calculate the aggregate function min(d). This applies to both VIEWs and derived tables. Firstly, this commit discards setting all bits of read_set for temporary tables to 1, which is done by default. That allows to determine which fields of a view/derived table are not used in the outer query. Secondly, the commit traverses HAVING and ORDER BY clauses of a view to ensure an aggregate function which is a target for disabling is not referenced from there This optimization shows a significant improvement in performance for a scenario where a view has multiple aggregate field but only few of them are selected: create table t1 ( group_id int , a int , b int , c int ); insert into t1 select seq/100000, seq+10, seq+20, seq+30 from seq_1_to_1000000; create view v1 as select group_id, sum (a) AS suma, sum (b) AS sumb, sum (c) AS sumc, max (a) AS maxa, max (b) AS maxb, max (c) AS maxc, avg (a) AS avga, avg (b) AS avgb, avg (c) AS avgc from t1 group by group_id; select group_id, suma from v1; Average time of the SELECT execution for bb-11.1-mdev-27201 is 9.156 seconds for 1 million rows against 12.525 seconds for the mainstream 11.1 . Current limitations and possible TODOs. 1. The function does not disable aggregate functions splitting like this: SELECT sum(a) + sum(b) AS sumab FROM ... In this case sum(a) + sum(b) is calculated as two separate aggregate functions sum(a) and sum(b). They are not disabled even if the field "sumab" is not used. 2. If any arithmetic and/or an additional calculation are applied to an aggregate functionm, it will not be disabled. For example: SELECT sum(a)+1 AS suma_1 FROM ... 3. Window aggregate functions are not processed. 4. References to previous fields like this: SELECT sum(a) AS suma, suma+1 AS suma_1 FROM ... are not processed. Currently they are not supported at all, but if they are supported in the future, appropriate adjustments will need to be made for this function. There is currently no check that sum(a) is used for suma_1, so sum(a) may be erroneously disabled. 5. Derived table elimination is not affected, however that might be beneficial.

Oleg Smirnov added a comment - 2023-08-23 10:26

The recent version of bb-11.1-mdev-27201 implements the logic of derived tables/views elimination (which was listed as point 5 before).

Oleg Smirnov added a comment - 2023-08-23 10:26 The recent version of bb-11.1-mdev-27201 implements the logic of derived tables/views elimination (which was listed as point 5 before).

Oleg Smirnov added a comment - 2024-08-04 06:37

Created a pull request for easier reviewing.

Oleg Smirnov added a comment - 2024-08-04 06:37 Created a pull request for easier reviewing.

People

Assignee:: Sergei Petrunia

Reporter:: Sergei Petrunia

Votes:: 1 Vote for this issue

Watchers:: 5 Start watching this issue

Dates

Created:: 2021-12-08 15:40

Updated:: 6 days ago 13:31

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server