Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-27201

Non-merged derived table: do not compute unused fields

Details

    Description

      (Filing this based on my observations from analyzing queries at a prospect considering migration to MariaDB)

      The problem: If a non-merged derived table has unused columns, they are still created and computed.

      • Creating a column might not be a big deal
      • Computing a column can cause some unused overhead
      • The worst part is that Table Elimination cannot eliminate the parts of query plan that compute useless information.

      Testcase:

      create table t1 (
        group_id int,
        a int, 
        b int,
        c int
      );
      insert into t1 select  seq/100, seq, seq, seq from seq_1_to_10000;
       
      create table t2 (a int);
      insert into t2 values (1),(2),(3);
      

      explain 
      select
        T.group_id,
        T.MAXA
      from 
        t2,
        (select 
           group_id,
           max(a) as MAXA,
           sum(b) as SUMB
         from
           t1
         group by group_id
        ) as T 
      where
       T.group_id = t2.a;
      

      +------+-------------+------------+------+---------------+------+---------+-----------+-------+---------------------------------+
      | id   | select_type | table      | type | possible_keys | key  | key_len | ref       | rows  | Extra                           |
      +------+-------------+------------+------+---------------+------+---------+-----------+-------+---------------------------------+
      |    1 | PRIMARY     | t2         | ALL  | NULL          | NULL | NULL    | NULL      | 3     | Using where                     |
      |    1 | PRIMARY     | <derived2> | ref  | key0          | key0 | 5       | test.t2.a | 100   |                                 |
      |    2 | DERIVED     | t1         | ALL  | NULL          | NULL | NULL    | NULL      | 10000 | Using temporary; Using filesort |
      +------+-------------+------------+------+---------------+------+---------+-----------+-------+---------------------------------+
      

      Note that sum(b) as SUMB is not used.

      Put breakpoints in

      • create_tmp_table
      • Item_sum_sum::update_field

      one can still see that the column for SUM() is created and the value is computed.

      How it affects table elimination

      Add another table into the subquery:

      create table t3 (
        pk int primary key,
        col int
      );
      insert into t3 select  seq, seq from seq_1_to_10000;
      

      explain 
      select
        T.group_id,
        T.MAXA
      from 
        t2,
        (select 
           group_id,
           max(a) as MAXA,
           sum(b) as SUMB
         from
           t1 left join t3 on t3.pk=t1.c
         group by group_id
        ) as T 
      where
       T.group_id = t2.a;
      

      +------+-------------+------------+------+---------------+------+---------+-----------+------+---------------------------------+
      | id   | select_type | table      | type | possible_keys | key  | key_len | ref       | rows | Extra                           |
      +------+-------------+------------+------+---------------+------+---------+-----------+------+---------------------------------+
      |    1 | PRIMARY     | t2         | ALL  | NULL          | NULL | NULL    | NULL      | 3    | Using where                     |
      |    1 | PRIMARY     | <derived2> | ref  | key0          | key0 | 5       | test.t2.a | 99   |                                 |
      |    2 | DERIVED     | t1         | ALL  | NULL          | NULL | NULL    | NULL      | 9963 | Using temporary; Using filesort |
      +------+-------------+------------+------+---------------+------+---------+-----------+------+---------------------------------+
      

      Ok, t3 is eliminated.

      Now, use t3.b in the SUM:

      explain 
      select
        T.group_id,
        T.MAXA
      from 
        t2,
        (select 
           group_id,
           max(a) as MAXA,
           sum(t3.col) as SUMB
         from
           t1 left join t3 on t3.pk=t1.c
         group by group_id
        ) as T 
      where
       T.group_id = t2.a;
      

      +------+-------------+------------+--------+---------------+---------+---------+-----------+------+---------------------------------+
      | id   | select_type | table      | type   | possible_keys | key     | key_len | ref       | rows | Extra                           |
      +------+-------------+------------+--------+---------------+---------+---------+-----------+------+---------------------------------+
      |    1 | PRIMARY     | t2         | ALL    | NULL          | NULL    | NULL    | NULL      | 3    | Using where                     |
      |    1 | PRIMARY     | <derived2> | ref    | key0          | key0    | 5       | test.t2.a | 99   |                                 |
      |    2 | DERIVED     | t1         | ALL    | NULL          | NULL    | NULL    | NULL      | 9963 | Using temporary; Using filesort |
      |    2 | DERIVED     | t3         | eq_ref | PRIMARY       | PRIMARY | 4       | test.t1.c | 1    | Using where                     |
      +------+-------------+------------+--------+---------------+---------+---------+-----------+------+---------------------------------+
      

      And now table elimination is unable to remove t3.

      Attachments

        Issue Links

          Activity

            A Description of the current solution:

            Preparation: table->read_set is not set to all 1 for temporary (derived) tables.

            Then, at mysql_derived_optimize() phase, we know which table columns are used and which are not.

            Then, if we see a sum function that produces a column that is not used,
            we remove it from sum_funcs list and then its value is not computed.

            psergei Sergei Petrunia added a comment - A Description of the current solution: Preparation: table->read_set is not set to all 1 for temporary (derived) tables. Then, at mysql_derived_optimize() phase, we know which table columns are used and which are not. Then, if we see a sum function that produces a column that is not used, we remove it from sum_funcs list and then its value is not computed.

            Take aways from the discussions:

            It's hard to handle WITH ROLLUP as it copies aggregate functions. It doesn't seem to be worth it - WITH ROLLUP in a derived table is not that useful.

            psergei Sergei Petrunia added a comment - Take aways from the discussions: It's hard to handle WITH ROLLUP as it copies aggregate functions. It doesn't seem to be worth it - WITH ROLLUP in a derived table is not that useful.
            oleg.smirnov Oleg Smirnov added a comment - - edited

            bb-11.1-mdev-27201 presents the first version of the functionality, although a couple of tests still fail and need addressing.

            commit 4c19fb9ed7dee5e2bd988737f8c334b37252c989 (HEAD -> bb-11.1-mdev-27201, origin/bb-11.1-mdev-27201)
            Author: Oleg Smirnov <olernov@gmail.com>
            Date: Mon Jun 19 14:42:24 2023 +0700

                MDEV-27201 Do not compute unused aggregate functions for derived tables and views
                
                Disable the calculation of aggregate functions whose results are not used,
                neither inside the derived table SQL nor outside, in the outer select.
                For example, in the following case:
                  SELECT sum_b
                  FROM (SELECT a, sum(b) AS sum_b, max(c) AS max_c, min(d) AS min_d
                        FROM t1
                        GROUP BY a
                        HAVING max_c > 10
                       ) T
                T.min_d is not used anywhere, thus there is no need to calculate the
                aggregate function min(d).
                This applies to both VIEWs and derived tables.
                
                Firstly, this commit discards setting all bits of read_set for
                temporary tables to 1, which is done by default. That allows to determine
                which fields of a view/derived table are not used in the outer query.
                Secondly, the commit traverses HAVING and ORDER BY clauses of a view
                to ensure an aggregate function which is a target for disabling is
                not referenced from there
             

            This optimization shows a significant improvement in performance for a scenario where a view has multiple aggregate field but only few of them are selected:

            create table t1 (
              group_id int,
              a int,
              b int,
              c int
            );
             
            insert into t1 select  seq/100000, seq+10, seq+20, seq+30 from seq_1_to_1000000;
             
            create view v1 as
            select 
              group_id, 
              sum(a) AS suma, sum(b) AS sumb, sum(c) AS sumc,
              max(a) AS maxa, max(b) AS maxb, max(c) AS maxc,
              avg(a) AS avga, avg(b) AS avgb, avg(c) AS avgc 
            from t1 
            group by group_id;
             
            select group_id, suma from v1;
            

            Average time of the SELECT execution for bb-11.1-mdev-27201 is 9.156 seconds for 1 million rows against 12.525 seconds for the mainstream 11.1.

            Current limitations and possible TODOs.
            1. The function does not disable aggregate functions splitting like this:
            SELECT sum(a) + sum(b) AS sumab FROM ...
            In this case sum(a) + sum(b) is calculated as two separate aggregate
            functions sum(a) and sum(b). They are not disabled even if the field
            "sumab" is not used.
            2. If any arithmetic and/or an additional calculation are applied to an
            aggregate functionm, it will not be disabled. For example:
            SELECT sum(a)+1 AS suma_1 FROM ...
            3. Window aggregate functions are not processed.
            4. References to previous fields like this:
            SELECT sum(a) AS suma, suma+1 AS suma_1 FROM ...
            are not processed. Currently they are not supported at all, but if
            they are supported in the future, appropriate adjustments will need to
            be made for this function. There is currently no check that sum(a) is used
            for suma_1, so sum(a) may be erroneously disabled.
            5. Derived table elimination is not affected, however that might be beneficial.

            oleg.smirnov Oleg Smirnov added a comment - - edited bb-11.1-mdev-27201 presents the first version of the functionality, although a couple of tests still fail and need addressing. commit 4c19fb9ed7dee5e2bd988737f8c334b37252c989 (HEAD -> bb-11.1-mdev-27201, origin/bb-11.1-mdev-27201) Author: Oleg Smirnov <olernov@gmail.com> Date: Mon Jun 19 14:42:24 2023 +0700 MDEV-27201 Do not compute unused aggregate functions for derived tables and views Disable the calculation of aggregate functions whose results are not used, neither inside the derived table SQL nor outside, in the outer select. For example, in the following case: SELECT sum_b FROM (SELECT a, sum(b) AS sum_b, max(c) AS max_c, min(d) AS min_d FROM t1 GROUP BY a HAVING max_c > 10 ) T T.min_d is not used anywhere, thus there is no need to calculate the aggregate function min(d). This applies to both VIEWs and derived tables. Firstly, this commit discards setting all bits of read_set for temporary tables to 1, which is done by default. That allows to determine which fields of a view/derived table are not used in the outer query. Secondly, the commit traverses HAVING and ORDER BY clauses of a view to ensure an aggregate function which is a target for disabling is not referenced from there This optimization shows a significant improvement in performance for a scenario where a view has multiple aggregate field but only few of them are selected: create table t1 ( group_id int , a int , b int , c int );   insert into t1 select seq/100000, seq+10, seq+20, seq+30 from seq_1_to_1000000;   create view v1 as select group_id, sum (a) AS suma, sum (b) AS sumb, sum (c) AS sumc, max (a) AS maxa, max (b) AS maxb, max (c) AS maxc, avg (a) AS avga, avg (b) AS avgb, avg (c) AS avgc from t1 group by group_id;   select group_id, suma from v1; Average time of the SELECT execution for bb-11.1-mdev-27201 is 9.156 seconds for 1 million rows against 12.525 seconds for the mainstream 11.1 . Current limitations and possible TODOs. 1. The function does not disable aggregate functions splitting like this: SELECT sum(a) + sum(b) AS sumab FROM ... In this case sum(a) + sum(b) is calculated as two separate aggregate functions sum(a) and sum(b). They are not disabled even if the field "sumab" is not used. 2. If any arithmetic and/or an additional calculation are applied to an aggregate functionm, it will not be disabled. For example: SELECT sum(a)+1 AS suma_1 FROM ... 3. Window aggregate functions are not processed. 4. References to previous fields like this: SELECT sum(a) AS suma, suma+1 AS suma_1 FROM ... are not processed. Currently they are not supported at all, but if they are supported in the future, appropriate adjustments will need to be made for this function. There is currently no check that sum(a) is used for suma_1, so sum(a) may be erroneously disabled. 5. Derived table elimination is not affected, however that might be beneficial.
            oleg.smirnov Oleg Smirnov added a comment -

            The recent version of bb-11.1-mdev-27201 implements the logic of derived tables/views elimination (which was listed as point 5 before).

            oleg.smirnov Oleg Smirnov added a comment - The recent version of bb-11.1-mdev-27201 implements the logic of derived tables/views elimination (which was listed as point 5 before).
            oleg.smirnov Oleg Smirnov added a comment -

            Created a pull request for easier reviewing.

            oleg.smirnov Oleg Smirnov added a comment - Created a pull request for easier reviewing.

            People

              psergei Sergei Petrunia
              psergei Sergei Petrunia
              Votes:
              1 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.