Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-6017

Add support for Indexes on Expressions

    XMLWordPrintable

Details

    Description

      An index on expression means something like

      CREATE TABLE t1 (a int, b int, INDEX (a/2+b));
      ...
      SELECT * FROM t1 WHERE a/2+b=100;
      

      in this case the optimizer should be able to use an index.

      This task naturally splits in two steps:

      1. add expression matching into the optimizer, use it for generated columns. Like in
        CREATE TABLE t1 (a int, b int, c INT GENERATED ALWAYS AS (a/2+b), INDEX (c));
      2. support the syntax to create an index on expression directly, this will automatically create a hidden generated column under the hood

      original task description is visible in the history

      Details about the patch in the Pull Request

      contents

      SQL Syntax changes
        Optimizer support
          ORDER BY is not handled
          Optimizer support - details
      Problems in the patch
        Minor issues
      

      Details about the patch in the https://github.com/MariaDB/server/pull/1381 :

      SQL Syntax changes

      It adds syntax to support creating indexes directly on expressions: One can do this:

      CREATE TABLE t ( 
        ...
        INDEX(col1+col2)
      );
      

      or this:

      alter table t1 add index ((a*2-b));
      

      In current MariaDB, one has to define the column first:

      CREATE TABLE t (
        col1 int, col2 int, 
        ...
        col3 int as (col1+col2) virtual, 
        INDEX(col3)
      );
      

      (Interesting, the datatype for the indexed column is auto-inferred?)
      (Note that the patch is incomplete - e.g. SHOW CREATE TABLE doesn't work correctly)

      Optimizer support

      The basic approach is as follows:

      • Create a clone of a WHERE/ON clause.
      • In the copy, search for expressions identical to virtual column definition and replace them with Item_field referring to the virtual column.
      • If we've made any such substitution(s), replace the origin WHERE/ON expression with orig_expression AND clone_with_modifications.

      This rewrite is performed for all WHERE/ON clauses and for all indexed virtual columns
      that are in the scope. No speedups seem to be employed, we just use nested loops:

      for each indexed virtual column  $VCOL in any table {
         for each WHERE/ON clause $COND {
            // note that we didn't even check used_tables() to see if $COND can have references to $VCOL
        }
      }
      

      The result will typically contain clauses with duplication.

      After this, the optimizer can process any sargable uses of virtual_column_name in a regular way.

      ORDER BY is not handled

      Note that this doesn't handle ORDER BY LIMIT.
      The following will not be able to skip sorting:

      ALTER TABLE t1 ADD INDEX(FUNC(t1.col));
      SELECT * FROM t1 ORDER BY FUNC(t1.col) LIMIT 1;
      

      Optimizer support - details

      • there is something fancy done about cloning items. Something like "We can't clone Item_subselect so let the clone have the same Item_subselect* object".
      • TODO: would this work with VIEWs?

      Problems in the patch

      Duplication of WHERE/ON clauses looks like an overkill.

      Modifying the clone() call to clone some items but not others seems like a recipe for bugs.

      (The following "law" was suggested by Igor: an Item tree may not have multiple references to the same Item* object. There are some exceptions to this: one can refer to the same item through Item_direct_view_ref objects, one can refer to Items representing constant literals, etc.
      These exceptions do not cover directly referring to the same Item_subselect object )

      Minor issues

      The tree with the patch doesn't pass test suite, it crashes in the bootstrap.
      SHOW CREATE TABLE doesn't work correctly.

      Alternative suggestion 1 - make optimizer recognize vcol_expr

      That, is the condition in form

      virtual_column_x_expr CMP const
      

      should be processed in the same way

      virtual_column_x  CMP const
      

      It can be implemented as follows:

      Step 1 - find indexed virtual columns

      Enable possible_keys

      add_key_fields() does two things:
      1. It builds potential ref access
      2. It collects possible_keys for the range optimzier.

      We will need to add hooks to:

      • Item_bool_func2::add_key_fields_optimize_op() // for equalities
      • Item_func_null_predicate::add_key_fields() // for IS NULL
      • Item_func_in::add_key_fields() // for degenerate "x IN (CONST)"
      • and several other items.

      Look for is_local_field() call everywhere and change it accordingly.

      Enable the range optimizer

      for range access, hook into

      • Item_bool_func::get_full_func_mm_tree_for_args()
      • Item_func_in::get_mm_tree

      Enable skipping sorting

      for skipping sorting, hook somewhere in test_if_skip_sort_order().

      *Serg's objection*: what if the optimizer rewrites the expression into something that's not identical to the vcol definition? Why don't we do a "local" rewrite of "(expr ... vcol_expr)" with "(expr ... vcol_expr) AND (expr ... vcol_column)" instead? We can mark the injected expression as redundant so it is removed at final plan fix-up stage.

      Alternative suggestion 2 - "local" rewrite

      The idea is to do a "local" rewrite of any eligible condition

      virtual_column_expr  CMP const
      

      into

      (virtual_column_expr  CMP const)  AND (virtual_column  CMP const)
      

      • Q1: This will require item cloning which is known not be reliable
        A1: disable when cloning doesn't work

      Do we need to clone always or we can just replace virtual_column_expr with virtual_column in most cases?
      If there is an optimization that can handle virtual_column_expr CMP const shouldn't it also be able to handle virtual_column CMP const ?

      Alternative 3 - MySQL's approach: substitution

      MySQL walks the WHERE and other conditions, locates the items
      where substituting vcol_expr would make sense: in sargable
      predicates like Item_func_gt, Item_func_lt, etc (see Item::gc_subst_analyzer)

      Then, it checks if the item argument is identical to vcol_expr. The check
      is done with Item::eq() call.

      Note that this call in MySQL has provisions to handle VIEWs:

      Item_view_ref(Item_field(t1.col)) is considered identical to t1.col.

      See UnwrapArgForEq for details.

      On the other hand, this will not handle equality substitution: consider

      alter table t1 add INDEX((a+b))
      select * from t1 where a=1 and (a+b)=10

      In this example, when we reach the substitution phase, the WHERE will be (1+b)=10, the "1" being an Item_int.

      Multiple ways to substitute

      What if there are multiple virtual columns that share the same expression?
      This can easily happen if one uses e.g. DATE(order_time) in many indexes.

      MySQL would do the first substitution and ignore the rest.

      (It seems one can fix this by "collapsing" identical virtual columns into one...)

      "Assistance" for JSON functions - typecast removal

      Functions that extract values from JSON documents return strings.
      One may need to use COLLATE, CAST, or JSON_UNQUOTE to properly compare them.

      MySQL code has provisions to "swallow" these functions when comparing vcol_expr with the expression used in the query.

      Attachments

        Issue Links

          Activity

            People

              lstartseva Lena Startseva
              fale Fabio Alessandro Locati
              Votes:
              21 Vote for this issue
              Watchers:
              30 Start watching this issue

              Dates

                Created:
                Updated:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.