Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-29395

Review input for 10.7-selectivity tree

Details

    • Bug
    • Status: Closed (View Workflow)
    • Major
    • Resolution: Fixed
    • 10.10(EOL)
    • N/A
    • Optimizer
    • None

    Description

      Review comments for the 10.7-selectivity tree.

      Request to fix the trace

      commit 7b93762bb8a7ab9fbc89158db4025060531cd66e
      Author: Monty <monty@mariadb.org>
      Date:   Sat Oct 9 16:16:10 2021 +0300
       
          Update cost for hash and cached joins
      

      This piece:

      --- a/sql/opt_trace.cc
      +++ b/sql/opt_trace.cc
      @@ -677,6 +677,11 @@ void print_final_join_order(JOIN *join)
         for (j= join->join_tab,i=0 ; i < join->top_join_tab_count;
              i++, j++)
           best_order.add_table_name(j);
      +  best_order.end();
      +
      +  /* Write information about the resulting join */
      +  Json_writer_object obj(join->thd, "best_access_method");
      +  obj.add("cost", join->best_read);
       }
      

      makes the trace have this:

                {
                  "best_join_order": ["<derived2>"],
                  "best_access_method": {
                    "cost": 2.799
                  }
                },
      

      Why do we need "best_access_method" ? I think just "cost" would be ok.

      Also, rename "found_matching_rows_cost" to cond_check_cost?

      Wrong code in distinct keys patch:

      commit 5997f57a9d2d3a3bc30a38ccb815aceff1ce94dc
      Author: Monty <monty@mariadb.org>
      Date:   Wed May 4 17:26:43 2022 +0300
       
          Derived tables and union can now create distinct keys
          
          The idea is that instead of marking all select_lex's with DISTINCT, we
          only mark those that really need distinct result.
      

      has

      @@ -1677,6 +1678,13 @@ Item_in_subselect::Item_in_subselect(THD *thd, Item * left_exp,
             Item_row(thd, static_cast<Item_row*>(left_exp));
         func= &eq_creator;
         init(select_lex, new (thd->mem_root) select_exists_subselect(thd, this));
      +  select_lex->distinct= 1;
      +  /*
      +    If this is is 'xxx IN (SELECT ...) mark that the we are only interested in
      +    unique values for the select
      +  */
      +  if (select_lex->first_inner_unit())
      +    select_lex->first_inner_unit()->distinct= 1;
         max_columns= UINT_MAX;
         set_maybe_null();
         reset();
      

      touching select_lex->first_inner_unit() is incorrect!

      Consider an example:

       
      create table t10 (a int, b int, c int);
      insert into t10 select seq, seq, seq from seq_1_to_10;
       
      create table t11 as select * from t10;
      create table t12 as select * from t11;
       
      explain 
      select 
        a in (select count(*)
              from 
                t10,
                (select a,b from t11) T
              where
                T.b=t10.b
              group by
                t10.c
              )
      from
       t10;
      

      When we get to the select_lex->first_inner_unit()->distinct= 1; line, we have:

      (gdb) p select_lex->master_unit()
        $50 = (st_select_lex_unit *) 0x7fffa0019df0
      (gdb) p select_lex->first_inner_unit()
        $51 = (st_select_lex_unit *) 0x7fffa00188c0
       
      (gdb) p select_lex->first_inner_unit()->first_select()->select_number
        $53 = 3
       
      (gdb) p select_lex->master_unit()->first_select()->select_number
        $54 = 2
       
      (gdb) p select_lex->first_inner_unit()->first_select()->join_list->elem(0)->alias
        $62 = {str = 0x7fffa00181b0 "t11", length = 3}
      

      That is, we're setting distinct=1 for the "(select a,b from t11)" !

      maybe instead of

      +    select_lex->first_inner_unit()->distinct= 1;
      

      we should do

      +    select_lex->master_unit()->distinct= 1;
      

      If I remove the above line (in Item_in_subselect and Item_allany_subselect), I get test result changes in:
      main.derived_cond_pushdown main.opt_tvc main.derived_opt main.myisam_explain_non_select_all main.table_value_constr main.cte_nonrecursive.

      ... Actually Item_exists_subselect::Item_exists subselect has the variant of the code I'm suggesting:

        select_lex->distinct= 1;
        select_lex->master_unit()->distinct= 1;
      

      It was added by Monty'e earlier patch:

      commit 5997f57a9d2d3a3bc30a38ccb815aceff1ce94dc
      Author:	Monty <monty@mariadb.org>  Wed May  4 17:26:43 2022
      Committer:	Monty <monty@mariadb.org>  Sun Jul 10 15:09:22 2022
       
      Derived tables and union can now create distinct keys
      

      A testcase showing new code causes wrong results: See attached _derived_unique.test.

      A fix in convert_subq_to_sj won't work

      Looking at the changed testcases, one could imagine that it would be reasonable to try setting select_lex->distinct=1 for the derived tables inside semi-join subqueries... alas, convert_subq_to_sj is called after the derived table was created.

      (Why convert_subq_to_sj? because for mergeable semi-joins, we do know that one duplicate row of each table will be sufficient...)

      Comment#2: what about UNIONs?

      following the above, should we set distinct=1 for other UNION members?

      Odd code in ha_partition::rndpos_time

      commit 07ffb3abc1004a102c2c605c7c280913741d5d87
      Author: Monty <monty@mariadb.org>
      Date:   Thu Aug 11 13:05:23 2022 +0300
       
          TEMPORARY PUSH: Changing all cost calculation to be given in ms
      

      +IO_AND_CPU_COST ha_partition::rndpos_time(ha_rows rows)
      +    IO_AND_CPU_COST cost= m_file[i]->keyread_time(inx, ranges, rows_per_part,
      +                                                  blocks);
      +    read_time.io+= cost.io;
      +    read_time.cpu= cost.cpu;
      +  }
      

      why is io added while cpu is not?

      Attachments

        Issue Links

          Activity

            psergei Sergei Petrunia added a comment - - edited

            Note to self:

                          {
                            records= (double) table->opt_range[key].rows;
                            trace_access_idx.add("used_range_estimates", "clipped down");
                          }
                          else if (unlikely(trace_access_idx.trace_started()))
                          {
                            if (table->opt_range_keys.is_set(key))
                            {
                              trace_access_idx.add("used_range_estimates", false);
                              trace_access_idx.add("reason", "not better than ref estimates");
                            }
                            else
                              trace_access_idx.add("reason", "not available");
                          }
            

            The output in the trace is misleading, it needs to be removed.

            Excessive:

                  if (unlikely(trace_access_idx.trace_started()))
                    trace_access_idx.
                      add("rows_after_filter", records_after_filter).
                      add("cost", tmp);
            

            (and also startup_cost).

            psergei Sergei Petrunia added a comment - - edited Note to self: { records= (double) table->opt_range[key].rows; trace_access_idx.add("used_range_estimates", "clipped down"); } else if (unlikely(trace_access_idx.trace_started())) { if (table->opt_range_keys.is_set(key)) { trace_access_idx.add("used_range_estimates", false); trace_access_idx.add("reason", "not better than ref estimates"); } else trace_access_idx.add("reason", "not available"); } The output in the trace is misleading, it needs to be removed. Excessive: if (unlikely(trace_access_idx.trace_started())) trace_access_idx. add("rows_after_filter", records_after_filter). add("cost", tmp); (and also startup_cost).
            psergei Sergei Petrunia added a comment - - edited

            The latest version of the commit:

            Date:   Mon Nov 1 12:34:24 2021 +0200
             
                Update row and key fetch cost models to take into account data copy costs
            

            has this part:

                - Don't use matchings_records_in_range() to try to estimate the number
                  of filtered rows for ranges. The reason is that we want to ensure
                  that 'range' is calculated similar to 'ref'. There is also more work
                  needed to calculate the selectivity when using ranges and ranges and
                  filtering.  This causes filtering column in EXPLAIN EXTENDED to be
                  100.00 for some cases where range cannot use filtering.
                  (main.rowid_filter)
            

            which includes changes like this one:

            --- main/rowid_filter.result    2022-09-11 17:43:09.346891437 +0300
            +++ main/rowid_filter.reject    2022-09-11 17:43:30.078821819 +0300
            ...
            set statement optimizer_switch='rowid_filter=on' for SELECT l_orderkey, l_linenumber, l_shipdate, l_quantity FROM lineitem
            WHERE  l_shipdate BETWEEN '1997-01-01' AND '1997-06-30' AND
            l_quantity > 45;
            ...
            @@ -238,7 +238,7 @@
                       "key_length": "4",
                       "used_key_parts": ["l_shipDATE"],
                       "rows": 509,
            -          "filtered": 11.69025803,
            +          "filtered": 100,
                       "index_condition": "lineitem.l_shipDATE between '1997-01-01' and '1997-06-30'",
                       "attached_condition": "lineitem.l_quantity > 45"
                     }
            

            This clearly looks like a regression.

            ANOTHER ISSUE: the new code doesn't seem to compute cost of quick select + Using join buffer!

            psergei Sergei Petrunia added a comment - - edited The latest version of the commit: Date: Mon Nov 1 12:34:24 2021 +0200   Update row and key fetch cost models to take into account data copy costs has this part: - Don't use matchings_records_in_range() to try to estimate the number of filtered rows for ranges. The reason is that we want to ensure that 'range' is calculated similar to 'ref'. There is also more work needed to calculate the selectivity when using ranges and ranges and filtering. This causes filtering column in EXPLAIN EXTENDED to be 100.00 for some cases where range cannot use filtering. (main.rowid_filter) which includes changes like this one: --- main/rowid_filter.result 2022-09-11 17:43:09.346891437 +0300 +++ main/rowid_filter.reject 2022-09-11 17:43:30.078821819 +0300 ... set statement optimizer_switch='rowid_filter=on' for SELECT l_orderkey, l_linenumber, l_shipdate, l_quantity FROM lineitem WHERE l_shipdate BETWEEN '1997-01-01' AND '1997-06-30' AND l_quantity > 45; ... @@ -238,7 +238,7 @@ "key_length": "4", "used_key_parts": ["l_shipDATE"], "rows": 509, - "filtered": 11.69025803, + "filtered": 100, "index_condition": "lineitem.l_shipDATE between '1997-01-01' and '1997-06-30'", "attached_condition": "lineitem.l_quantity > 45" } This clearly looks like a regression. ANOTHER ISSUE : the new code doesn't seem to compute cost of quick select + Using join buffer!
            monty Michael Widenius added a comment - - edited

            Update cost for hash and cached joins request

            • All requested changes done

            Derived tables and union can now create distinct keys

            • Fixed by adding a test if the sub query contains a sum function and if
              yes, disable the new optimization.

            Odd code in ha_partition::rndpos_time

            • Bug. This is already fixed in later commit that will be merged to this one.

            Comment about last commit and that 'filtered' is not correctly updated

            • This code was 'code in progress' (not finished) and pushed from my
              laptop just before leaving China. I was just working on some last
              issued on fixing that filtered would be correctly calculated
              everything considering. This is to the last part of the changeset
              after which the code should be 'complete', except fixing test cases
              and merge with 10.11
            • Have not checked the issue with select and using join buffer. It should do that, don't know why it would not do that.
              Still, the last code push is not tested at all and there has been a lot of new things added to POSITION, so anything is possible (but should be trivial to fix)
            monty Michael Widenius added a comment - - edited Update cost for hash and cached joins request All requested changes done Derived tables and union can now create distinct keys Fixed by adding a test if the sub query contains a sum function and if yes, disable the new optimization. Odd code in ha_partition::rndpos_time Bug. This is already fixed in later commit that will be merged to this one. Comment about last commit and that 'filtered' is not correctly updated This code was 'code in progress' (not finished) and pushed from my laptop just before leaving China. I was just working on some last issued on fixing that filtered would be correctly calculated everything considering. This is to the last part of the changeset after which the code should be 'complete', except fixing test cases and merge with 10.11 Have not checked the issue with select and using join buffer. It should do that, don't know why it would not do that. Still, the last code push is not tested at all and there has been a lot of new things added to POSITION, so anything is possible (but should be trivial to fix)

            All issued taken care of (except of last temporary push which is still work in progress).
            10.7-selectivity updated

            monty Michael Widenius added a comment - All issued taken care of (except of last temporary push which is still work in progress). 10.7-selectivity updated

            People

              monty Michael Widenius
              psergei Sergei Petrunia
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.