Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-13369

Optimization for equi-joins of derived tables with GROUP BY

Details

    • 10.2.12

    Description

      Consider the following tables

      create table t1 (a int);
      insert into t1 values (5), (1), (2), (9), (7), (2), (7);
      create table t2 (a int, b int, index idx(a));
      insert into t2 values (7,10), (1,20), (2,23), (7,18), (1,30), (4,71), (3,15), (7,82);
      

      and the query

      select t1.a,t.max,t.min from t1 left join (select a, max(t2.b) max, min(t2.b) min from t2 group by t2.a) t on t1.a=t.a;
      

      The following re-writing could be applied to this query:

      =>
      select t1.a,tl.max,tl.min from t1 left join (select a, max(t2.b) max, min(t2.b) min from t2 where  t1.a=t2.a) tl on 1=1;
      

      The result of this re-writing is a query with so-called lateral derived table. This query requires refilling of the temporary table created the derived table for every new value t1.a. As the size of the derived table tl usually is much smaller than the size of the derived table t this transformation will be always beneficial. Especially this transformation will be beneficial when the join operation that joins derived table is used in a multi-way join query where only a few records of t1 are selected.
      Unfortunately now we do not support lateral derived tables on SQL level. This task will allow to use them internally for this transformation.

      The difference of this task from the task MDEV-13225 that in this task the execution plan chosen for the original derived table will be just transformed into the one that uses a lateral derived table. As a result it might happen that the optimizer chooses not the best join order for the main query. The cost of using lateral derived table will be taken into account by the optimizer later.

      Attachments

        Issue Links

          Activity

            igor Igor Babaev added a comment -

            A patch for this task that can be considered as a 'proof of concept' was sent for evaluation to Andrii.

            This patch introduced two-phase optimization for some queries.
            It also checked that the derived tables with outer references can be easily executed within the current architecture.

            igor Igor Babaev added a comment - A patch for this task that can be considered as a 'proof of concept' was sent for evaluation to Andrii. This patch introduced two-phase optimization for some queries. It also checked that the derived tables with outer references can be easily executed within the current architecture.
            igor Igor Babaev added a comment - - edited

            Alvin,
            Bar planned to add the patch to bb-10.2-compatibility on August 4.
            The comments of the patch mention mdev-13369.
            I personally work with bb-10.2-ext and I did not push it there.

            igor Igor Babaev added a comment - - edited Alvin, Bar planned to add the patch to bb-10.2-compatibility on August 4. The comments of the patch mention mdev-13369. I personally work with bb-10.2-ext and I did not push it there.
            igor Igor Babaev added a comment - - edited

            Cost aware implementation of MDEV-13369.

            Design.

            1. Every materialized derived is checked for being potentially splittable:
            a) the specification must contain only 1 select
            b) if this select is with GROUP BY then check that there is an index whose major components cover major components of the group by list (in any order)
            c) if this select is with window functions then check that all window functions use only 1 partition;
            there is an index whose major components cover some components of the partition.

            If all the conditions above are satisfied then the materialized derived is considered as potentially splittable and a structure of the type SPLITTABLE_DERIVED_INFO is attached
            to the TABLE_LIST created for this derived.
            All info about the indexes that potentially can be used for splitting is stored in this structure.

            2. The join specifying a potentially splittable derived is subject to two-phase optimization. The first phase ends when the cheapest join order is found.

            When the optimization of the main query asks for the optimization of the materialized derived tables only the first phase of the optimization is performed for each derived that is optimized in two phases.
            The check whether a join is subject to two phase optimization is performed when the first phase finishes.

            3. Let v be a potentially derived used in the main query.
            When the function choose plan() called for the main query tries to expand the current partial join order with v1 using an index access that employs the index idx (v.a,v.b,…) the
            following is checked. It is checked that some prefix of this index consists only of the fields that are selected into v directly from the fields of an underlying table (vt.a, vt.b,…) and vt.a,vt.b...are the major components of the indexes remembered in the SPLITTABLE_DERIVED_INFO structure created for v. If so, v can be effectively split.

            4. Let assume that v can be accessed via a split technique after tables t1, t2 in a partial order for the main query. Let assume that the split is performed using the equality conditions v.a=f(t1.a), v(b)=g(t2.b). When in the partial join order t1,t2,v the table v is accessed the values of f(t1.a) and g(t2.b) are fixed (considered as constants). The equalities vt.a=f(t1.a), vt.b=g(t2.b) must be added to the WHERE condition of v to get the partition of v that is determined by the conditions v.a=f(t1.a), v(b)=g(t2.b). Now the best plan is built for v with extended WHERE that actually can produce any partition p from v. At this point the cost of this plan C(p) must be evaluated . Let C(v) be the cost of producing rows of v. Let in(v) be the expected row count for the partial join t1,t2. If in(v)*C(p) < C(v) then accessing v with a splitting technique is more beneficial with this partial order.

            5. Adding conditions to existing one is not so cheap. Besides it's clear that when looking for the best plan for the main query the optimizer has not only to add conditions that can be used, but also has to remove the ones that are not valid for the current partial join order. For example when the partial order t1,v is evaluated the condition vt(b)=g(t2.b) must be removed from consideration.
            With a certain technique we can add all usable equalities at once and then invalidate usage of those of them that are not valid in the current partial join order. The technique works as follows. When conditions the vt.a=f(t1.a), vt.b=g(t2.b) are built from the conditions v.a=f(t1.a), v(b)=g(t2.b) each of items f(t1.a), g(t2.b) is cloned and each clone is provided with pointer to the parent. A clone is resolved against v, while the parent is resolved against the main query. Now if the partial join order is t1,v then vt.a=f(t1.a) can be used, while vt.b=g(t2.b) cannot because t2 is not in t1,v.

            6. The array of extended KEYUSEs is built only once when the first usage of the split technique is evaluated.
            The new variant of best_access_path() must take into account that some KEYUSEs can be invalidated for some partial join orders.

            igor Igor Babaev added a comment - - edited Cost aware implementation of MDEV-13369 . Design. 1. Every materialized derived is checked for being potentially splittable: a) the specification must contain only 1 select b) if this select is with GROUP BY then check that there is an index whose major components cover major components of the group by list (in any order) c) if this select is with window functions then check that all window functions use only 1 partition; there is an index whose major components cover some components of the partition. If all the conditions above are satisfied then the materialized derived is considered as potentially splittable and a structure of the type SPLITTABLE_DERIVED_INFO is attached to the TABLE_LIST created for this derived. All info about the indexes that potentially can be used for splitting is stored in this structure. 2. The join specifying a potentially splittable derived is subject to two-phase optimization. The first phase ends when the cheapest join order is found. When the optimization of the main query asks for the optimization of the materialized derived tables only the first phase of the optimization is performed for each derived that is optimized in two phases. The check whether a join is subject to two phase optimization is performed when the first phase finishes. 3. Let v be a potentially derived used in the main query. When the function choose plan() called for the main query tries to expand the current partial join order with v1 using an index access that employs the index idx (v.a,v.b,…) the following is checked. It is checked that some prefix of this index consists only of the fields that are selected into v directly from the fields of an underlying table (vt.a, vt.b,…) and vt.a,vt.b...are the major components of the indexes remembered in the SPLITTABLE_DERIVED_INFO structure created for v. If so, v can be effectively split. 4. Let assume that v can be accessed via a split technique after tables t1, t2 in a partial order for the main query. Let assume that the split is performed using the equality conditions v.a=f(t1.a), v(b)=g(t2.b). When in the partial join order t1,t2,v the table v is accessed the values of f(t1.a) and g(t2.b) are fixed (considered as constants). The equalities vt.a=f(t1.a), vt.b=g(t2.b) must be added to the WHERE condition of v to get the partition of v that is determined by the conditions v.a=f(t1.a), v(b)=g(t2.b). Now the best plan is built for v with extended WHERE that actually can produce any partition p from v. At this point the cost of this plan C(p) must be evaluated . Let C(v) be the cost of producing rows of v. Let in(v) be the expected row count for the partial join t1,t2. If in(v)*C(p) < C(v) then accessing v with a splitting technique is more beneficial with this partial order. 5. Adding conditions to existing one is not so cheap. Besides it's clear that when looking for the best plan for the main query the optimizer has not only to add conditions that can be used, but also has to remove the ones that are not valid for the current partial join order. For example when the partial order t1,v is evaluated the condition vt(b)=g(t2.b) must be removed from consideration. With a certain technique we can add all usable equalities at once and then invalidate usage of those of them that are not valid in the current partial join order. The technique works as follows. When conditions the vt.a=f(t1.a), vt.b=g(t2.b) are built from the conditions v.a=f(t1.a), v(b)=g(t2.b) each of items f(t1.a), g(t2.b) is cloned and each clone is provided with pointer to the parent. A clone is resolved against v, while the parent is resolved against the main query. Now if the partial join order is t1,v then vt.a=f(t1.a) can be used, while vt.b=g(t2.b) cannot because t2 is not in t1,v. 6. The array of extended KEYUSEs is built only once when the first usage of the split technique is evaluated. The new variant of best_access_path() must take into account that some KEYUSEs can be invalidated for some partial join orders.
            igor Igor Babaev added a comment -

            A full cost-base solution was pushed into the 10.3 tree

            igor Igor Babaev added a comment - A full cost-base solution was pushed into the 10.3 tree

            People

              igor Igor Babaev
              igor Igor Babaev
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.