Details
-
New Feature
-
Status: In Testing (View Workflow)
-
Critical
-
Resolution: Unresolved
Description
An index on expression means something like
CREATE TABLE t1 (a int, b int, INDEX (a/2+b)); |
...
|
SELECT * FROM t1 WHERE a/2+b=100; |
in this case the optimizer should be able to use an index.
This task naturally splits in two steps:
- add expression matching into the optimizer, use it for generated columns. Like in
CREATE TABLE t1 (a int, b int, c INT GENERATED ALWAYS AS (a/2+b), INDEX (c)); - support the syntax to create an index on expression directly, this will automatically create a hidden generated column under the hood
original task description is visible in the history
Details about the patch in the Pull Request
contents
SQL Syntax changes
|
Optimizer support
|
ORDER BY is not handled
|
Optimizer support - details
|
Problems in the patch
|
Minor issues
|
Details about the patch in the https://github.com/MariaDB/server/pull/1381 :
SQL Syntax changes
It adds syntax to support creating indexes directly on expressions: One can do this:
CREATE TABLE t ( |
...
|
INDEX(col1+col2) |
);
|
or this:
alter table t1 add index ((a*2-b)); |
In current MariaDB, one has to define the column first:
CREATE TABLE t ( |
col1 int, col2 int, |
...
|
col3 int as (col1+col2) virtual, |
INDEX(col3) |
);
|
(Interesting, the datatype for the indexed column is auto-inferred?)
(Note that the patch is incomplete - e.g. SHOW CREATE TABLE doesn't work correctly)
Optimizer support
The basic approach is as follows:
- Create a clone of a WHERE/ON clause.
- In the copy, search for expressions identical to virtual column definition and replace them with Item_field referring to the virtual column.
- If we've made any such substitution(s), replace the origin WHERE/ON expression with orig_expression AND clone_with_modifications.
This rewrite is performed for all WHERE/ON clauses and for all indexed virtual columns
that are in the scope. No speedups seem to be employed, we just use nested loops:
for each indexed virtual column $VCOL in any table {
|
for each WHERE/ON clause $COND {
|
// note that we didn't even check used_tables() to see if $COND can have references to $VCOL
|
}
|
}
|
The result will typically contain clauses with duplication.
After this, the optimizer can process any sargable uses of virtual_column_name in a regular way.
ORDER BY is not handled
Note that this doesn't handle ORDER BY LIMIT.
The following will not be able to skip sorting:
ALTER TABLE t1 ADD INDEX(FUNC(t1.col)); |
SELECT * FROM t1 ORDER BY FUNC(t1.col) LIMIT 1; |
Optimizer support - details
- there is something fancy done about cloning items. Something like "We can't clone Item_subselect so let the clone have the same Item_subselect* object".
- TODO: would this work with VIEWs?
Problems in the patch
Duplication of WHERE/ON clauses looks like an overkill.
Modifying the clone() call to clone some items but not others seems like a recipe for bugs.
(The following "law" was suggested by Igor: an Item tree may not have multiple references to the same Item* object. There are some exceptions to this: one can refer to the same item through Item_direct_view_ref objects, one can refer to Items representing constant literals, etc.
These exceptions do not cover directly referring to the same Item_subselect object )
Minor issues
The tree with the patch doesn't pass test suite, it crashes in the bootstrap.
SHOW CREATE TABLE doesn't work correctly.
Alternative suggestion 1 - make optimizer recognize vcol_expr
That, is the condition in form
virtual_column_x_expr CMP const
|
should be processed in the same way
virtual_column_x CMP const
|
It can be implemented as follows:
Step 1 - find indexed virtual columns
Enable possible_keys
add_key_fields() does two things:
1. It builds potential ref access
2. It collects possible_keys for the range optimzier.
We will need to add hooks to:
- Item_bool_func2::add_key_fields_optimize_op() // for equalities
- Item_func_null_predicate::add_key_fields() // for IS NULL
- Item_func_in::add_key_fields() // for degenerate "x IN (CONST)"
- and several other items.
Look for is_local_field() call everywhere and change it accordingly.
Enable the range optimizer
for range access, hook into
- Item_bool_func::get_full_func_mm_tree_for_args()
- Item_func_in::get_mm_tree
Enable skipping sorting
for skipping sorting, hook somewhere in test_if_skip_sort_order().
*Serg's objection*: what if the optimizer rewrites the expression into something that's not identical to the vcol definition? Why don't we do a "local" rewrite of "(expr ... vcol_expr)" with "(expr ... vcol_expr) AND (expr ... vcol_column)" instead? We can mark the injected expression as redundant so it is removed at final plan fix-up stage.
Alternative suggestion 2 - "local" rewrite
The idea is to do a "local" rewrite of any eligible condition
virtual_column_expr CMP const
|
into
(virtual_column_expr CMP const) AND (virtual_column CMP const) |
- Q1: This will require item cloning which is known not be reliable
A1: disable when cloning doesn't work
Do we need to clone always or we can just replace virtual_column_expr with virtual_column in most cases?
If there is an optimization that can handle virtual_column_expr CMP const shouldn't it also be able to handle virtual_column CMP const ?
Alternative 3 - MySQL's approach: substitution
MySQL walks the WHERE and other conditions, locates the items
where substituting vcol_expr would make sense: in sargable
predicates like Item_func_gt, Item_func_lt, etc (see Item::gc_subst_analyzer)
Then, it checks if the item argument is identical to vcol_expr. The check
is done with Item::eq() call.
Note that this call in MySQL has provisions to handle VIEWs:
Item_view_ref(Item_field(t1.col)) is considered identical to t1.col.
See UnwrapArgForEq for details.
On the other hand, this will not handle equality substitution: consider
alter table t1 add INDEX((a+b))
select * from t1 where a=1 and (a+b)=10
In this example, when we reach the substitution phase, the WHERE will be (1+b)=10, the "1" being an Item_int.
Multiple ways to substitute
What if there are multiple virtual columns that share the same expression?
This can easily happen if one uses e.g. DATE(order_time) in many indexes.
MySQL would do the first substitution and ignore the rest.
(It seems one can fix this by "collapsing" identical virtual columns into one...)
"Assistance" for JSON functions - typecast removal
Functions that extract values from JSON documents return strings.
One may need to use COLLATE, CAST, or JSON_UNQUOTE to properly compare them.
MySQL code has provisions to "swallow" these functions when comparing vcol_expr with the expression used in the query.
Attachments
Issue Links
- blocks
-
MDEV-4414 INDEX - SUBSTRING, LEFT and others string functions that could be optimized with index
- Open
-
MDEV-4430 INDEX Optimization - Virtual Column, Dynamic Column and Index optimization
- Closed
-
MXS-3624 Use indexed virtual columns if such exist
- Closed
-
MXS-5346 Take indexes on expressions into use
- In Progress
- duplicates
-
MDEV-17993 WHERE F(X) = constY should used F(X) generated index and query WHERE G=constY
- Closed
- includes
-
MDEV-5631 Implement Reverse Key Index
- Closed
- is blocked by
-
MDEV-5800 indexes on virtual (not materialized) columns
- Closed