Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-26301

Split optimization refills temporary table too many times

Details

    Description

      Short

      LATERAL DERIVED optimization refills the temp. table every time the derived table is accessed, regardless of whether the inputs have changed or not.

      LATERAL DERIVED's cost calculations match the execution and also assume the temp.table is refilled every time.
      The join optimizer typically over-estimates join output cardinality, so the optimizer considers LATERAL DERIVED optimization to be even more expensive.

      Taken together, these two properties cause the optimizer to miss using LATERAL DERIVED optimization where it is very advantageous.

      Long

      (This is based on a real-world case but the dataset I am using is totally artificial)

      Tables that will make a prefix before the lateral:

      # 5 values
      create table t1(a int, b int);
      insert into t1 select seq,seq from seq_1_to_5;
       
      # 5 value groups of size 2 each
      create table t2(a int, b int, key(a));
      insert into t2
      select A.seq,B.seq from seq_1_to_25 A, seq_1_to_2 B;
       
      # 5 value groups of size 3 each
      create table t3(a int, b int, key(a));
      insert into t3
      select A.seq,B.seq from seq_1_to_5 A, seq_1_to_3 B;
       
      analyze table t1,t2,t3 persistent for all;
      

      explain
      select * from
        (t1 left join t2 on t2.a=t1.b) left join t3 on t3.a=t1.b;
      

      +------+-------------+-------+------+---------------+------+---------+-----------+------+-------------+
      | id   | select_type | table | type | possible_keys | key  | key_len | ref       | rows | Extra       |
      +------+-------------+-------+------+---------------+------+---------+-----------+------+-------------+
      |    1 | SIMPLE      | t1    | ALL  | NULL          | NULL | NULL    | NULL      | 5    |             |
      |    1 | SIMPLE      | t2    | ref  | a             | a    | 5       | test.t1.b | 2    | Using where |
      |    1 | SIMPLE      | t3    | ref  | a             | a    | 5       | test.t1.b | 3    | Using where |
      +------+-------------+-------+------+---------------+------+---------+-----------+------+-------------+
      

      Now, tables for the LATERAL DERIVED:

      create table t10 (
        grp_id int,
        col1 int,
        key(grp_id)
      );
       
      # 100 groups of 100 values each
      insert into t10
      select
        A.seq,
        B.seq
      from
        seq_1_to_100 A,
        seq_1_to_100 B;
       
      # and X10 multiplier
      create table t11 (
        col1 int,
        col2 int
      );
      insert into t11
      select A.seq, A.seq from seq_1_to_10 A;
       
      analyze table t10,t11 persistent for all;
      

      Now, the query:

      explain
      select * from
        (
          (t1 left join t2 on t2.a=t1.b)
          left join t3 on t3.a=t1.b
        ) left join (select grp_id, count(*)
                     from t10 left join t11 on t11.col1=t10.col1
                     group by grp_id) T on T.grp_id=t1.b;
      

      Note that the join is between T and t1, the first table. t1 has 5 rows. However, the optimizer assumes that the table is refilled t1xt2xt3: 30 times.
      Because of that, it doesn't pick LATERAL DERIVED:

      +------+-------------+------------+------+---------------+------+---------+-----------+-------+-------------------------------------------------+
      | id   | select_type | table      | type | possible_keys | key  | key_len | ref       | rows  | Extra                                           |
      +------+-------------+------------+------+---------------+------+---------+-----------+-------+-------------------------------------------------+
      |    1 | PRIMARY     | t1         | ALL  | NULL          | NULL | NULL    | NULL      | 5     |                                                 |
      |    1 | PRIMARY     | t2         | ref  | a             | a    | 5       | test.t1.b | 2     | Using where                                     |
      |    1 | PRIMARY     | t3         | ref  | a             | a    | 5       | test.t1.b | 3     | Using where                                     |
      |    1 | PRIMARY     | <derived2> | ref  | key0          | key0 | 5       | test.t1.b | 1000  | Using where                                     |
      |    2 | DERIVED     | t10        | ALL  | grp_id        | NULL | NULL    | NULL      | 10000 | Using temporary; Using filesort                 |
      |    2 | DERIVED     | t11        | ALL  | NULL          | NULL | NULL    | NULL      | 10    | Using where; Using join buffer (flat, BNL join) |
      +------+-------------+------------+------+---------------+------+---------+-----------+-------+-------------------------------------------------+
      

      Attachments

        Issue Links

          Activity

            psergei Sergei Petrunia created issue -
            psergei Sergei Petrunia made changes -
            Field Original Value New Value

            Execution part should collect lateral's dependencies and keep them in a Cached_item

            psergei Sergei Petrunia added a comment - Execution part should collect lateral's dependencies and keep them in a Cached_item
            psergei Sergei Petrunia made changes -
            Assignee Sergei Petrunia [ psergey ]
            psergei Sergei Petrunia made changes -
            Affects Version/s 10.5 [ 23123 ]
            Affects Version/s 10.6 [ 24028 ]
            psergei Sergei Petrunia made changes -
            Component/s Optimizer [ 10200 ]
            psergei Sergei Petrunia made changes -
            Fix Version/s 10.5 [ 23123 ]
            psergei Sergei Petrunia made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            psergei Sergei Petrunia made changes -
            Labels split_materialized
            psergei Sergei Petrunia made changes -
            Description LATERAL DERIVED optimization will refill the temp. table every time the derived table is accessed, regardless of whether the inputs were changed or not.

            The cost calculations also make this assumption. Provided that the join optimizer typically over-estimates the partial join cardinalities, this makes the optimizer think that using LATERAL DERIVED optimization is more expensive than it actually is.
            h2. Short
            LATERAL DERIVED optimization refills the temp. table every time the derived table is accessed, regardless of whether the inputs have changed or not.

            LATERAL DERIVED's cost calculations match the execution and also assume the temp.table is refilled every time.
            The join optimizer typically over-estimates join output cardinality, so the optimizer considers LATERAL DERIVED optimization to be even more expensive.

            Taken together, these two properties cause the optimizer to miss using LATERAL DERIVED optimization where it is very advantageous.

            h2. Long
            (This is based on a real-world case but the dataset I am using is totally artificial)

            Tables that will make a prefix before the lateral:

            {code}
            # 5 values
            create table t1(a int, b int);
            insert into t1 select seq,seq from seq_1_to_5;

            # 5 value groups of size 2 each
            create table t2(a int, b int, key(a));
            insert into t2
            select A.seq,B.seq from seq_1_to_25 A, seq_1_to_2 B;

            # 5 value groups of size 3 each
            create table t3(a int, b int, key(a));
            insert into t3
            select A.seq,B.seq from seq_1_to_5 A, seq_1_to_3 B;

            analyze table t1,t2,t3 persistent for all;
            {code}

            {code}
            explain
            select * from
              (t1 left join t2 on t2.a=t1.b) left join t3 on t3.a=t1.b;
            {code}
            {code}
            +------+-------------+-------+------+---------------+------+---------+-----------+------+-------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+------+---------------+------+---------+-----------+------+-------------+
            | 1 | SIMPLE | t1 | ALL | NULL | NULL | NULL | NULL | 5 | |
            | 1 | SIMPLE | t2 | ref | a | a | 5 | test.t1.b | 2 | Using where |
            | 1 | SIMPLE | t3 | ref | a | a | 5 | test.t1.b | 3 | Using where |
            +------+-------------+-------+------+---------------+------+---------+-----------+------+-------------+
            {code}


            psergei Sergei Petrunia made changes -
            Description h2. Short
            LATERAL DERIVED optimization refills the temp. table every time the derived table is accessed, regardless of whether the inputs have changed or not.

            LATERAL DERIVED's cost calculations match the execution and also assume the temp.table is refilled every time.
            The join optimizer typically over-estimates join output cardinality, so the optimizer considers LATERAL DERIVED optimization to be even more expensive.

            Taken together, these two properties cause the optimizer to miss using LATERAL DERIVED optimization where it is very advantageous.

            h2. Long
            (This is based on a real-world case but the dataset I am using is totally artificial)

            Tables that will make a prefix before the lateral:

            {code}
            # 5 values
            create table t1(a int, b int);
            insert into t1 select seq,seq from seq_1_to_5;

            # 5 value groups of size 2 each
            create table t2(a int, b int, key(a));
            insert into t2
            select A.seq,B.seq from seq_1_to_25 A, seq_1_to_2 B;

            # 5 value groups of size 3 each
            create table t3(a int, b int, key(a));
            insert into t3
            select A.seq,B.seq from seq_1_to_5 A, seq_1_to_3 B;

            analyze table t1,t2,t3 persistent for all;
            {code}

            {code}
            explain
            select * from
              (t1 left join t2 on t2.a=t1.b) left join t3 on t3.a=t1.b;
            {code}
            {code}
            +------+-------------+-------+------+---------------+------+---------+-----------+------+-------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+------+---------------+------+---------+-----------+------+-------------+
            | 1 | SIMPLE | t1 | ALL | NULL | NULL | NULL | NULL | 5 | |
            | 1 | SIMPLE | t2 | ref | a | a | 5 | test.t1.b | 2 | Using where |
            | 1 | SIMPLE | t3 | ref | a | a | 5 | test.t1.b | 3 | Using where |
            +------+-------------+-------+------+---------------+------+---------+-----------+------+-------------+
            {code}


            h2. Short
            LATERAL DERIVED optimization refills the temp. table every time the derived table is accessed, regardless of whether the inputs have changed or not.

            LATERAL DERIVED's cost calculations match the execution and also assume the temp.table is refilled every time.
            The join optimizer typically over-estimates join output cardinality, so the optimizer considers LATERAL DERIVED optimization to be even more expensive.

            Taken together, these two properties cause the optimizer to miss using LATERAL DERIVED optimization where it is very advantageous.

            h2. Long
            (This is based on a real-world case but the dataset I am using is totally artificial)

            Tables that will make a prefix before the lateral:

            {code}
            # 5 values
            create table t1(a int, b int);
            insert into t1 select seq,seq from seq_1_to_5;

            # 5 value groups of size 2 each
            create table t2(a int, b int, key(a));
            insert into t2
            select A.seq,B.seq from seq_1_to_25 A, seq_1_to_2 B;

            # 5 value groups of size 3 each
            create table t3(a int, b int, key(a));
            insert into t3
            select A.seq,B.seq from seq_1_to_5 A, seq_1_to_3 B;

            analyze table t1,t2,t3 persistent for all;
            {code}

            {code}
            explain
            select * from
              (t1 left join t2 on t2.a=t1.b) left join t3 on t3.a=t1.b;
            {code}
            {code}
            +------+-------------+-------+------+---------------+------+---------+-----------+------+-------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+------+---------------+------+---------+-----------+------+-------------+
            | 1 | SIMPLE | t1 | ALL | NULL | NULL | NULL | NULL | 5 | |
            | 1 | SIMPLE | t2 | ref | a | a | 5 | test.t1.b | 2 | Using where |
            | 1 | SIMPLE | t3 | ref | a | a | 5 | test.t1.b | 3 | Using where |
            +------+-------------+-------+------+---------------+------+---------+-----------+------+-------------+
            {code}

            Now, tables for the LATERAL DERIVED:
            {code}
            create table t10 (
              grp_id int,
              col1 int,
              key(grp_id)
            );

            # 100 groups of 100 values each
            insert into t10
            select
              A.seq,
              B.seq
            from
              seq_1_to_100 A,
              seq_1_to_100 B;

            # and X10 multiplier
            create table t11 (
              col1 int,
              col2 int
            );
            insert into t11
            select A.seq, A.seq from seq_1_to_10 A;

            analyze table t10,t11 persistent for all;
            {code}

            psergei Sergei Petrunia made changes -
            Description h2. Short
            LATERAL DERIVED optimization refills the temp. table every time the derived table is accessed, regardless of whether the inputs have changed or not.

            LATERAL DERIVED's cost calculations match the execution and also assume the temp.table is refilled every time.
            The join optimizer typically over-estimates join output cardinality, so the optimizer considers LATERAL DERIVED optimization to be even more expensive.

            Taken together, these two properties cause the optimizer to miss using LATERAL DERIVED optimization where it is very advantageous.

            h2. Long
            (This is based on a real-world case but the dataset I am using is totally artificial)

            Tables that will make a prefix before the lateral:

            {code}
            # 5 values
            create table t1(a int, b int);
            insert into t1 select seq,seq from seq_1_to_5;

            # 5 value groups of size 2 each
            create table t2(a int, b int, key(a));
            insert into t2
            select A.seq,B.seq from seq_1_to_25 A, seq_1_to_2 B;

            # 5 value groups of size 3 each
            create table t3(a int, b int, key(a));
            insert into t3
            select A.seq,B.seq from seq_1_to_5 A, seq_1_to_3 B;

            analyze table t1,t2,t3 persistent for all;
            {code}

            {code}
            explain
            select * from
              (t1 left join t2 on t2.a=t1.b) left join t3 on t3.a=t1.b;
            {code}
            {code}
            +------+-------------+-------+------+---------------+------+---------+-----------+------+-------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+------+---------------+------+---------+-----------+------+-------------+
            | 1 | SIMPLE | t1 | ALL | NULL | NULL | NULL | NULL | 5 | |
            | 1 | SIMPLE | t2 | ref | a | a | 5 | test.t1.b | 2 | Using where |
            | 1 | SIMPLE | t3 | ref | a | a | 5 | test.t1.b | 3 | Using where |
            +------+-------------+-------+------+---------------+------+---------+-----------+------+-------------+
            {code}

            Now, tables for the LATERAL DERIVED:
            {code}
            create table t10 (
              grp_id int,
              col1 int,
              key(grp_id)
            );

            # 100 groups of 100 values each
            insert into t10
            select
              A.seq,
              B.seq
            from
              seq_1_to_100 A,
              seq_1_to_100 B;

            # and X10 multiplier
            create table t11 (
              col1 int,
              col2 int
            );
            insert into t11
            select A.seq, A.seq from seq_1_to_10 A;

            analyze table t10,t11 persistent for all;
            {code}

            h2. Short
            LATERAL DERIVED optimization refills the temp. table every time the derived table is accessed, regardless of whether the inputs have changed or not.

            LATERAL DERIVED's cost calculations match the execution and also assume the temp.table is refilled every time.
            The join optimizer typically over-estimates join output cardinality, so the optimizer considers LATERAL DERIVED optimization to be even more expensive.

            Taken together, these two properties cause the optimizer to miss using LATERAL DERIVED optimization where it is very advantageous.

            h2. Long
            (This is based on a real-world case but the dataset I am using is totally artificial)

            Tables that will make a prefix before the lateral:

            {code}
            # 5 values
            create table t1(a int, b int);
            insert into t1 select seq,seq from seq_1_to_5;

            # 5 value groups of size 2 each
            create table t2(a int, b int, key(a));
            insert into t2
            select A.seq,B.seq from seq_1_to_25 A, seq_1_to_2 B;

            # 5 value groups of size 3 each
            create table t3(a int, b int, key(a));
            insert into t3
            select A.seq,B.seq from seq_1_to_5 A, seq_1_to_3 B;

            analyze table t1,t2,t3 persistent for all;
            {code}

            {code}
            explain
            select * from
              (t1 left join t2 on t2.a=t1.b) left join t3 on t3.a=t1.b;
            {code}
            {code}
            +------+-------------+-------+------+---------------+------+---------+-----------+------+-------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+------+---------------+------+---------+-----------+------+-------------+
            | 1 | SIMPLE | t1 | ALL | NULL | NULL | NULL | NULL | 5 | |
            | 1 | SIMPLE | t2 | ref | a | a | 5 | test.t1.b | 2 | Using where |
            | 1 | SIMPLE | t3 | ref | a | a | 5 | test.t1.b | 3 | Using where |
            +------+-------------+-------+------+---------------+------+---------+-----------+------+-------------+
            {code}

            Now, tables for the LATERAL DERIVED:
            {code}
            create table t10 (
              grp_id int,
              col1 int,
              key(grp_id)
            );

            # 100 groups of 100 values each
            insert into t10
            select
              A.seq,
              B.seq
            from
              seq_1_to_100 A,
              seq_1_to_100 B;

            # and X10 multiplier
            create table t11 (
              col1 int,
              col2 int
            );
            insert into t11
            select A.seq, A.seq from seq_1_to_10 A;

            analyze table t10,t11 persistent for all;
            {code}

            Now, the query:
            {code:sql}
            explain
            select * from
              (
                (t1 left join t2 on t2.a=t1.b)
                left join t3 on t3.a=t1.b
              ) left join (select grp_id, count(*)
                           from t10 left join t11 on t11.col1=t10.col1
                           group by grp_id) T on T.grp_id=t1.b;
            {code}

            Note that the join is between T and t1, the first table. t1 has 5 rows. However, the optimizer assumes that the table is refilled t1xt2xt3: 30 times.
            Because of that, it doesn't pick LATERAL DERIVED:

            {code}
            +------+-------------+------------+------+---------------+------+---------+-----------+-------+-------------------------------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+------------+------+---------------+------+---------+-----------+-------+-------------------------------------------------+
            | 1 | PRIMARY | t1 | ALL | NULL | NULL | NULL | NULL | 5 | |
            | 1 | PRIMARY | t2 | ref | a | a | 5 | test.t1.b | 2 | Using where |
            | 1 | PRIMARY | t3 | ref | a | a | 5 | test.t1.b | 3 | Using where |
            | 1 | PRIMARY | <derived2> | ref | key0 | key0 | 5 | test.t1.b | 1000 | Using where |
            | 2 | DERIVED | t10 | ALL | grp_id | NULL | NULL | NULL | 10000 | Using temporary; Using filesort |
            | 2 | DERIVED | t11 | ALL | NULL | NULL | NULL | NULL | 10 | Using where; Using join buffer (flat, BNL join) |
            +------+-------------+------------+------+---------------+------+---------+-----------+-------+-------------------------------------------------+
            {code}

            http://lists.askmonty.org/pipermail/commits/2021-August/014711.html

            Also in the ES repo in 10.5.10-7-ment1259-v2

            psergei Sergei Petrunia added a comment - http://lists.askmonty.org/pipermail/commits/2021-August/014711.html Also in the ES repo in 10.5.10-7-ment1259-v2
            serg Sergei Golubchik made changes -
            Workflow MariaDB v3 [ 124078 ] MariaDB v4 [ 144608 ]
            psergei Sergei Petrunia made changes -
            igor Igor Babaev (Inactive) made changes -
            Summary LATERAL DERIVED refills the temp. table too many times Split optimization refills temp. table too many times
            igor Igor Babaev (Inactive) made changes -
            Summary Split optimization refills temp. table too many times Split optimization refills temporary table too many times
            serg Sergei Golubchik made changes -
            Priority Major [ 3 ] Critical [ 2 ]
            igor Igor Babaev (Inactive) made changes -
            Fix Version/s 10.4 [ 22408 ]
            Fix Version/s 10.5 [ 23123 ]
            igor Igor Babaev (Inactive) made changes -
            Affects Version/s 10.4 [ 22408 ]
            serg Sergei Golubchik made changes -
            Priority Critical [ 2 ] Blocker [ 1 ]
            psergei Sergei Petrunia added a comment - Input for the latest patch: https://lists.launchpad.net/maria-developers/msg13314.html

            A patch to be applied over Igor's fix. It just adds optimizer trace coverage: spetrunia-mdev26301-tracing.diff

            psergei Sergei Petrunia added a comment - A patch to be applied over Igor's fix. It just adds optimizer trace coverage: spetrunia-mdev26301-tracing.diff
            psergei Sergei Petrunia made changes -
            Attachment spetrunia-mdev26301-tracing.diff [ 69858 ]
            psergei Sergei Petrunia added a comment - spetrunia-mdev26301-tracing.diff
            ralf.gebhardt Ralf Gebhardt made changes -
            psergei Sergei Petrunia made changes -
            Fix Version/s 10.4.29 [ 28510 ]
            Fix Version/s 10.5.20 [ 28512 ]
            Fix Version/s 10.6.13 [ 28514 ]
            Fix Version/s 10.9.6 [ 28520 ]
            Fix Version/s 10.10.4 [ 28522 ]
            Fix Version/s 10.11.3 [ 28524 ]
            Fix Version/s 11.1.1 [ 28704 ]
            Fix Version/s 10.8.8 [ 28518 ]
            Fix Version/s 10.4 [ 22408 ]
            Resolution Fixed [ 1 ]
            Status In Progress [ 3 ] Closed [ 6 ]
            danblack Daniel Black made changes -
            danblack Daniel Black made changes -

            For this MDEV, there were two commits pushed:

            • The fix itself

              commit ce7ffe61d836fe9f0cfc1087f058bc40d66e5cfb
              Author: Igor Babaev <igor@askmonty.org>
              Date:   Tue May 2 23:17:07 2023 -0700
               
                  MDEV-26301 Split optimization refills temporary table too many times
                  
                  This patch optimizes the number of refills for the lateral derived table
                  to which a materialized derived table subject to split optimization is
                  is converted. This optimized number of refills is now considered as the
                  expected number of refills of the materialized derived table when searching
                  for the best possible splitting of the table.
              

            • A patch adding optimizer trace coverage:

              commit ed3e6f66a265952afded33fb134f6f8bcc31aa89
              Author: Sergei Petrunia <sergey@mariadb.com>
              Date:   Wed May 3 13:49:32 2023 +0300
               
                  MDEV-26301: Split optimization refills: Optimizer Trace coverage
                  
                  Add Optimizer Trace printouts.
              

            psergei Sergei Petrunia added a comment - For this MDEV, there were two commits pushed: The fix itself commit ce7ffe61d836fe9f0cfc1087f058bc40d66e5cfb Author: Igor Babaev <igor@askmonty.org> Date: Tue May 2 23:17:07 2023 -0700   MDEV-26301 Split optimization refills temporary table too many times This patch optimizes the number of refills for the lateral derived table to which a materialized derived table subject to split optimization is is converted. This optimized number of refills is now considered as the expected number of refills of the materialized derived table when searching for the best possible splitting of the table. A patch adding optimizer trace coverage: commit ed3e6f66a265952afded33fb134f6f8bcc31aa89 Author: Sergei Petrunia <sergey@mariadb.com> Date: Wed May 3 13:49:32 2023 +0300   MDEV-26301: Split optimization refills: Optimizer Trace coverage Add Optimizer Trace printouts.
            danblack Daniel Black made changes -
            lstartseva Lena Startseva made changes -
            alice Alice Sherepa made changes -
            alice Alice Sherepa made changes -
            alice Alice Sherepa made changes -
            alice Alice Sherepa made changes -
            alice Alice Sherepa made changes -
            mariadb-jira-automation Jira Automation (IT) made changes -
            Zendesk Related Tickets 109159

            People

              psergei Sergei Petrunia
              psergei Sergei Petrunia
              Votes:
              1 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.