MariaDB Server
MDEV-36321

Split Materialized produces wrong out_rows estimates

    Description

      Consider a query using Split Materialized: (database fill steps provided below)

      explain select * 
      from 
        t2, (select max(value), grp_id from t1 group by grp_id) DT
      where
        t2.a= DT.grp_id;
      

      +------+-----------------+------------+------+---------------+--------+---------+------------+------+-------------+
      | id   | select_type     | table      | type | possible_keys | key    | key_len | ref        | rows | Extra       |
      +------+-----------------+------------+------+---------------+--------+---------+------------+------+-------------+
      |    1 | PRIMARY         | t2         | ALL  | NULL          | NULL   | NULL    | NULL       | 5    | Using where |
      |    1 | PRIMARY         | <derived2> | ref  | key0          | key0   | 5       | test2.t2.a | 10   |             |
      |    2 | LATERAL DERIVED | t1         | ref  | grp_id        | grp_id | 5       | test2.t2.a | 100  |             |
      +------+-----------------+------------+------+---------------+--------+---------+------------+------+-------------+
      

      Note that table <derived2> has rows=10, even though we know it will have rows=1 due to GROUP BY and splitting.

      This is not just a wrong estimate of the number of rows to scan in <derived2>: the incorrect fanout also inflates the partial join cardinality.
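The effect on the join cardinality can be shown with a small arithmetic sketch (row counts taken from the EXPLAIN above; the formula is the usual rows_for_plan = prefix_rows * fanout that the optimizer trace reports):

```python
# Fanout arithmetic for the plan prefix "t2" -> <derived2>,
# using the row counts from the EXPLAIN / optimizer trace.

t2_rows = 5            # rows in t2
derived_rows_out = 10  # rows_out the optimizer currently uses for <derived2>
correct_rows_out = 1   # GROUP BY grp_id + splitting: one row per t2.a value

rows_for_plan = t2_rows * derived_rows_out      # what the trace reports
correct_plan_rows = t2_rows * correct_rows_out  # what it should report

print(rows_for_plan)      # 50, the 10x over-estimate seen in the trace
print(correct_plan_rows)  # 5
```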

      Relevant sections of the optimizer trace:

                          "plan_prefix": "t2",
                          "get_costs_for_tables": [
                            {
                              "best_access_path": {
                                "table": "<derived2>",
                                ...
                                "choose_best_splitting": {
                                  "split_materialized": {
                                    "chosen": true
                                  }
                                  ...
                                },
                                "chosen_access_method": {
                                  "type": "ref",
                                  "rows_read": 10,
                                  "rows_out": 10,
                                  "cost": 0.00639105,
                                  "uses_join_buffering": false
                                }
      

      and then (t2 has rows=5, so rows_for_plan=50 is a 10x over-estimate):

                        {
                          "plan_prefix": "t2",
                          "table": "<derived2>",
                          "rows_for_plan": 50,
                          "cost_for_plan": 0.01806085
                        }
      

      Can this be fixed by just setting rows=1?
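A hedged sketch of why 1 is the right value (the function below is illustrative, not MariaDB's actual estimation code): a ref access fanout is typically table_rows / distinct_key_values, and in a derived table grouped on grp_id every key0 value identifies exactly one materialized row, so the fanout collapses to 1.

```python
# Illustrative model of the ref fanout estimate for <derived2>.
# key0 covers the full GROUP BY list (grp_id), so grp_id is unique
# in the materialized result by construction.

def ref_fanout(table_rows: int, distinct_key_values: int) -> float:
    """Average number of rows matching one key value."""
    return table_rows / distinct_key_values

derived_table_rows = 10000  # one row per group after GROUP BY
distinct_grp_ids = 10000    # grp_id is unique in the grouped result

print(ref_fanout(derived_table_rows, distinct_grp_ids))  # 1.0
```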

      Database fill steps:

       
      create table t1 (
        grp_id int, 
        value int,
        index (grp_id)
      );
       
      insert into t1 select 
        A.seq, B.seq
      from 
        seq_1_to_10000 A, 
        seq_1_to_100 B
      ;
       
      create table t2 (a int);
      insert into t2 select seq from seq_1_to_5;
       
      analyze table t1,t2;
      
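A quick sanity check of the expected estimates against this dataset (a minimal Python model of the fill steps, scaled down from 10000 x 100 to keep it fast; seq_1_to_N is MariaDB's Sequence engine): each grp_id appears 100 times in t1, matching rows=100 on the LATERAL DERIVED line, and GROUP BY collapses each group to a single row.

```python
from collections import defaultdict

# Minimal model of the fill steps above (scaled down from 10000 x 100).
N_GROUPS, N_VALUES = 100, 100
t1 = [(g, v) for g in range(1, N_GROUPS + 1) for v in range(1, N_VALUES + 1)]

# index(grp_id): each t2.a value matches N_VALUES rows of t1,
# matching rows=100 on the LATERAL DERIVED line of the EXPLAIN.
by_group = defaultdict(list)
for g, v in t1:
    by_group[g].append(v)
assert all(len(rows) == N_VALUES for rows in by_group.values())

# GROUP BY grp_id collapses each group to one row, so a ref lookup
# into the grouped derived table returns at most 1 row per key.
dt = {g: max(rows) for g, rows in by_group.items()}
for a in range(1, 6):  # t2 has a in 1..5
    assert a in dt     # exactly one matching row per t2.a value

print(len(dt))  # 100: one row per group
```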


      People

        Reporter: Sergei Petrunia (psergei)
        Assignee: Sergei Petrunia (psergei)
