[MDEV-27201] Non-merged derived table: do not compute unused fields Created: 2021-12-08  Updated: 2023-12-22

Status: In Review
Project: MariaDB Server
Component/s: Optimizer
Fix Version/s: 11.5

Type: Task Priority: Critical
Reporter: Sergei Petrunia Assignee: Sergei Petrunia
Resolution: Unresolved Votes: 0
Labels: optimizer, optimizer-easy

Issue Links:
Relates
relates to MDEV-26278 Table elimination does not work acros... Closed

 Description   

(Filing this based on my observations from analyzing queries at a prospect considering migration to MariaDB)

The problem: If a non-merged derived table has unused columns, they are still created and computed.

  • Creating a column might not be a big deal
  • Computing a column can cause some unused overhead
  • The worst part is that Table Elimination cannot eliminate the parts of query plan that compute useless information.

Testcase:

create table t1 (
  group_id int,
  a int, 
  b int,
  c int
);
insert into t1 select  seq/100, seq, seq, seq from seq_1_to_10000;
 
create table t2 (a int);
insert into t2 values (1),(2),(3);

explain 
select
  T.group_id,
  T.MAXA
from 
  t2,
  (select 
     group_id,
     max(a) as MAXA,
     sum(b) as SUMB
   from
     t1
   group by group_id
  ) as T 
where
 T.group_id = t2.a;

+------+-------------+------------+------+---------------+------+---------+-----------+-------+---------------------------------+
| id   | select_type | table      | type | possible_keys | key  | key_len | ref       | rows  | Extra                           |
+------+-------------+------------+------+---------------+------+---------+-----------+-------+---------------------------------+
|    1 | PRIMARY     | t2         | ALL  | NULL          | NULL | NULL    | NULL      | 3     | Using where                     |
|    1 | PRIMARY     | <derived2> | ref  | key0          | key0 | 5       | test.t2.a | 100   |                                 |
|    2 | DERIVED     | t1         | ALL  | NULL          | NULL | NULL    | NULL      | 10000 | Using temporary; Using filesort |
+------+-------------+------------+------+---------------+------+---------+-----------+-------+---------------------------------+

Note that sum(b) as SUMB is not used.

Put breakpoints in

  • create_tmp_table
  • Item_sum_sum::update_field

one can still see that the column for SUM() is created and the value is computed.

How it affects table elimination

Add another table into the subquery:

create table t3 (
  pk int primary key,
  col int
);
insert into t3 select  seq, seq from seq_1_to_10000;

explain 
select
  T.group_id,
  T.MAXA
from 
  t2,
  (select 
     group_id,
     max(a) as MAXA,
     sum(b) as SUMB
   from
     t1 left join t3 on t3.pk=t1.c
   group by group_id
  ) as T 
where
 T.group_id = t2.a;

+------+-------------+------------+------+---------------+------+---------+-----------+------+---------------------------------+
| id   | select_type | table      | type | possible_keys | key  | key_len | ref       | rows | Extra                           |
+------+-------------+------------+------+---------------+------+---------+-----------+------+---------------------------------+
|    1 | PRIMARY     | t2         | ALL  | NULL          | NULL | NULL    | NULL      | 3    | Using where                     |
|    1 | PRIMARY     | <derived2> | ref  | key0          | key0 | 5       | test.t2.a | 99   |                                 |
|    2 | DERIVED     | t1         | ALL  | NULL          | NULL | NULL    | NULL      | 9963 | Using temporary; Using filesort |
+------+-------------+------------+------+---------------+------+---------+-----------+------+---------------------------------+

Ok, t3 is eliminated.

Now, use t3.b in the SUM:

explain 
select
  T.group_id,
  T.MAXA
from 
  t2,
  (select 
     group_id,
     max(a) as MAXA,
     sum(t3.col) as SUMB
   from
     t1 left join t3 on t3.pk=t1.c
   group by group_id
  ) as T 
where
 T.group_id = t2.a;

+------+-------------+------------+--------+---------------+---------+---------+-----------+------+---------------------------------+
| id   | select_type | table      | type   | possible_keys | key     | key_len | ref       | rows | Extra                           |
+------+-------------+------------+--------+---------------+---------+---------+-----------+------+---------------------------------+
|    1 | PRIMARY     | t2         | ALL    | NULL          | NULL    | NULL    | NULL      | 3    | Using where                     |
|    1 | PRIMARY     | <derived2> | ref    | key0          | key0    | 5       | test.t2.a | 99   |                                 |
|    2 | DERIVED     | t1         | ALL    | NULL          | NULL    | NULL    | NULL      | 9963 | Using temporary; Using filesort |
|    2 | DERIVED     | t3         | eq_ref | PRIMARY       | PRIMARY | 4       | test.t1.c | 1    | Using where                     |
+------+-------------+------------+--------+---------------+---------+---------+-----------+------+---------------------------------+

And now table elimination is unable to remove t3.



 Comments   
Comment by Sergei Petrunia [ 2023-06-08 ]

A Description of the current solution:

Preparation: table->read_set is not set to all 1 for temporary (derived) tables.

Then, at mysql_derived_optimize() phase, we know which table columns are used and which are not.

Then, if we see a sum function that produces a column that is not used,
we remove it from sum_funcs list and then its value is not computed.

Comment by Sergei Petrunia [ 2023-06-09 ]

Take aways from the discussions:

It's hard to handle WITH ROLLUP as it copies aggregate functions. It doesn't seem to be worth it - WITH ROLLUP in a derived table is not that useful.

Comment by Oleg Smirnov [ 2023-06-20 ]

bb-11.1-mdev-27201 presents the first version of the functionality, although a couple of tests still fail and need addressing.

commit 4c19fb9ed7dee5e2bd988737f8c334b37252c989 (HEAD -> bb-11.1-mdev-27201, origin/bb-11.1-mdev-27201)
Author: Oleg Smirnov <olernov@gmail.com>
Date: Mon Jun 19 14:42:24 2023 +0700

    MDEV-27201 Do not compute unused aggregate functions for derived tables and views
    
    Disable the calculation of aggregate functions whose results are not used,
    neither inside the derived table SQL nor outside, in the outer select.
    For example, in the following case:
      SELECT sum_b
      FROM (SELECT a, sum(b) AS sum_b, max(c) AS max_c, min(d) AS min_d
            FROM t1
            GROUP BY a
            HAVING max_c > 10
           ) T
    T.min_d is not used anywhere, thus there is no need to calculate the
    aggregate function min(d).
    This applies to both VIEWs and derived tables.
    
    Firstly, this commit discards setting all bits of read_set for
    temporary tables to 1, which is done by default. That allows to determine
    which fields of a view/derived table are not used in the outer query.
    Secondly, the commit traverses HAVING and ORDER BY clauses of a view
    to ensure an aggregate function which is a target for disabling is
    not referenced from there
 

This optimization shows a significant improvement in performance for a scenario where a view has multiple aggregate field but only few of them are selected:

create table t1 (
  group_id int,
  a int,
  b int,
  c int
);
 
insert into t1 select  seq/100000, seq+10, seq+20, seq+30 from seq_1_to_1000000;
 
create view v1 as
select 
  group_id, 
  sum(a) AS suma, sum(b) AS sumb, sum(c) AS sumc,
  max(a) AS maxa, max(b) AS maxb, max(c) AS maxc,
  avg(a) AS avga, avg(b) AS avgb, avg(c) AS avgc 
from t1 
group by group_id;
 
select group_id, suma from v1;

Average time of the SELECT execution for bb-11.1-mdev-27201 is 9.156 seconds for 1 million rows against 12.525 seconds for the mainstream 11.1.

Current limitations and possible TODOs.
1. The function does not disable aggregate functions splitting like this:
SELECT sum(a) + sum(b) AS sumab FROM ...
In this case sum(a) + sum(b) is calculated as two separate aggregate
functions sum(a) and sum(b). They are not disabled even if the field
"sumab" is not used.
2. If any arithmetic and/or an additional calculation are applied to an
aggregate functionm, it will not be disabled. For example:
SELECT sum(a)+1 AS suma_1 FROM ...
3. Window aggregate functions are not processed.
4. References to previous fields like this:
SELECT sum(a) AS suma, suma+1 AS suma_1 FROM ...
are not processed. Currently they are not supported at all, but if
they are supported in the future, appropriate adjustments will need to
be made for this function. There is currently no check that sum(a) is used
for suma_1, so sum(a) may be erroneously disabled.
5. Derived table elimination is not affected, however that might be beneficial.

Comment by Oleg Smirnov [ 2023-08-23 ]

The recent version of bb-11.1-mdev-27201 implements the logic of derived tables/views elimination (which was listed as point 5 before).

Generated at Thu Feb 08 09:51:06 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.