Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-36321

Indexes on derived tables with GROUP BY produce wrong out_rows estimates

    XMLWordPrintable

Details

    • Can result in unexpected behaviour
    • Q3/2025 Server Development

    Description

      Problem description

      Consider a query using Split Materialized (EDIT: or just derived-with-keys): (database fill steps provided below)

      explain select * 
      from 
        t2, (select max(value), grp_id from t1 group by grp_id) DT
      where
        t2.a= DT.grp_id;
      

      +------+-----------------+------------+------+---------------+--------+---------+------------+------+-------------+
      | id   | select_type     | table      | type | possible_keys | key    | key_len | ref        | rows | Extra       |
      +------+-----------------+------------+------+---------------+--------+---------+------------+------+-------------+
      |    1 | PRIMARY         | t2         | ALL  | NULL          | NULL   | NULL    | NULL       | 5    | Using where |
      |    1 | PRIMARY         | <derived2> | ref  | key0          | key0   | 5       | test2.t2.a | 10   |             |
      |    2 | LATERAL DERIVED | t1         | ref  | grp_id        | grp_id | 5       | test2.t2.a | 100  |             |
      +------+-----------------+------------+------+---------------+--------+---------+------------+------+-------------+
      

      Note that table <derived2> has rows=10, even if we know that it will have rows=1 due to GROUP BY and splitting.

      This isn't just the wrong estimate for number of rows to scan in <derived2>. The incorrect fanout affects the partial join cardinality.

      Relevant sections of the optimizer trace:

                          "plan_prefix": "t2",
                          "get_costs_for_tables": [
                            {
                              "best_access_path": {
                                "table": "<derived2>",
                                ...
                                "choose_best_splitting": {
                                  "split_materialized": {
                                    "chosen": true
                                  }
                                  ...
                                },
                                "chosen_access_method": {
                                  "type": "ref",
                                  "rows_read": 10,
                                  "rows_out": 10,
                                  "cost": 0.00639105,
                                  "uses_join_buffering": false
                                }
      

      and then: (t2 has rows=5, so rows_for_plan=50 is a 10x over-estimate:

                        {
                          "plan_prefix": "t2",
                          "table": "<derived2>",
                          "rows_for_plan": 50,
                          "cost_for_plan": 0.01806085
                        }
      

      Can this be fixed by just setting rows=1?

      Database fill steps:

       
      create table t1 (
        grp_id int, 
        value int,
        index (grp_id)
      );
       
      insert into t1 select 
        A.seq, B.seq
      from 
        seq_1_to_10000 A, 
        seq_1_to_100 B
      ;
       
      create table t2 (a int);
      insert into t2 select seq from seq_1_to_5;
       
      analyze table t1,t2;
      

      A related bugfix done recently: MDEV-30877.

      Solution for derived_with_keys

      Note that SELECT DISTINCT is already covered

      Note that the code in TABLE::add_tmp_key already has similar logic for DISTINCT:

          if (derived)
          {
            st_select_lex* first= derived->first_select();
            uint select_list_items= first->get_item_list()->elements;
            if (key_parts == select_list_items)
            {
              if ((!first->is_part_of_union() && (first->options & SELECT_DISTINCT)) ||
                  derived->check_distinct_in_union())
                keyinfo->rec_per_key[key_parts - 1]= 1;
            }
          }
      

      "if the index covers all elements of the select list, then specifying all keyparts gives rec_per_key=1".

      Handling GROUP BY

      An idea of what could be done here: add here a function that handles cases when

      • The derived table has one SELECT (!first->is_part_of_union())
      • The derived table has a GROUP BY clause (first->group)

      The logic to produce estimates is as follows.
      We have a SELECT with GROUP BY that is materialized into a temporary table:

      create temporary table (tmp_col1, ... tmp_colK)  as
      select 
         col1, ..., colN
         sum(...) as group_col1, ...
      group by 
         col1, ... colN;
      

      and we're considering an index on some of the temp.table columns:

      INDEX( tmp_table_col1,  ..., tmp_table_colN ...)
      

      Any index prefix that includes all GROUP BY columns will have rec_per_key=1.

      Pseudo code to implement this:

        my_group_list= {collect a list of items in the subquery's GROUP BY clause};
        for each key part  KP { 
           F= table->field(KP->field_number); 
           if (find F in my_group_list == NOT_FOUND)
              continue; // Can do nothing for this key part. But might do for further key parts
           remove F from my_group_list;
           if (my_group_list is empty)
           {
              for each keypart KP2 starting from KP 
                 set index->rec_per_key[KP2] = 1;
              break;
           }
         }
      

      Handling UNION queries

      TODO. Johnston's patch tries to do some handling but currently has bugs.

      UNION [DISTINCT] removes the duplicates. If the index covers all of select list, we can set

      index->rec_per_key[n_key_parts - 1]=1;
      

      Both UNION DISTINCT and UNION ALL let one rec_per_key from each of the select.
      The initial state is that index statistics is unknown, that is, rec_per_key[i]=0 for all key parts:

      key_part kp1 kp2 kp3 kp 4
      rec_per_key 0 0 0 0

      Suppose, the first SELECT in the UNION has GROUP BY kp1,kp2,kp3. We will get this statistics:

      key_part kp1 kp2 kp3 kp 4
      rec_per_key 0 0 1 1

      Suppose, the second SELECT in the UNION has GROUP BY kp1,kp2. We will get:

      key_part kp1 kp2 kp3 kp 4
      rec_per_key 0 1 1 1

      How to add the statistics?
      If rec_per_key[i]=0 for any of the SELECTs, that means UNKNOWN and the result should also be 9, UNKNOWN.
      If rec_per_key[i]=1 for each of the SELECTs, then we can add 1+1=2 and have that as an estimate for the resulting table.

      For or example, this gives:

      key_part kp1 kp2 kp3 kp 4
      rec_per_key 0 0 2 2

      Touching the "guesstimate" of 10 rows

      In current code, best_access_path() will use a guesstimate of 10 rows when index statistics is not available.
      Touching this goes outside of the scope of this MDEV entry.

      Solution for Split-Materialized

      Split-Materialized seems to reuse the code from derived_with_keys?

      I've used the testcase in this MDEV, set rec_per_key[key_parts-1]=1 in the debugger, and the EXPLAIN that uses Split-Materialized has shown correct rows_out.
      One probably needs to also go through the Optimizer Trace and see if all cardinalities are correct.

      Attachments

        Issue Links

          Activity

            People

              Johnston Rex Johnston
              psergei Sergei Petrunia
              Votes:
              1 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.