Details
-
New Feature
-
Status: Closed (View Workflow)
-
Critical
-
Resolution: Fixed
Description
Conditions looking like
SUBSTR(str_col, 1, n) = const_str
|
where the prefix of str_col is compared against a constant value can be rewritten as range conditions as
const_str_low <= str_col <= const_str_high
|
which will improve the optimizer capabilities of building better execution plan (for example, using indexes on column str_col and applying range optimizations).
The similar transformation is already implemented for conditions like
str_col LIKE 'foo%' |
aiming for the same goal.
Similar rewrite is implemented for DATE() and YEAR() function at MDEV-8320, and that approach may be used as a reference.
Attachments
Issue Links
- includes
-
MDEV-4414 INDEX - SUBSTRING, LEFT and others string functions that could be optimized with index
-
- Closed
-
- relates to
-
MDEV-35564 cannot convert backslash to swe7
-
- Closed
-
Activity
ycp, yes, you were going in the right direction but several more steps are needed.
Did you see the call to my_like_range() in Item_func_like::get_mm_leaf ?
For the "%" in LIKE pattern, it does this:
if (*ptr == w_many) /* '%' in SQL */ |
{
|
/* Calculate length of keys */ |
*min_length= (cs->state & (MY_CS_BINSORT | MY_CS_NOPAD)) ?
|
(size_t) (min_str - min_org) : |
res_length;
|
*max_length= res_length;
|
do |
{
|
*min_str++= 0;
|
*max_str++= (char) cs->max_sort_char; |
} while (min_str != min_end); |
return 0; |
note the 0 and cs->max_sort_char added to min and max endpoints...
Hi psergei, ptal thanks
081dd346da7 upstream/bb-11.7-mdev-34911 MDEV-34911 Sargable substr(col, 1, n) = str
|
Reviewer notes to self:
LIKE handles contractions correctly: mdev34911-play-with-contractions.sql
this is done with opt_range asking the collation handler to produce the endpoints:
if (field->charset()->like_range(res->ptr(), res->length(), |
escape, wild_one, wild_many,
|
field_length,
|
(char*) min_str + offset, |
(char*) max_str + offset, |
&min_length, &max_length))
|
LEFT is not handled (we probably should handle it)
The used approach doesn't extend easily to non-equalities: LEFT(x) <'foo' or SUBSTR(col, 1,10) IN (value-list) (which was requested way back in MDEV-4414 and also used in some support cases). . It's a pity we don't handle IN.
There are two implementation options:
A. Make sargable (like current patch does)
B. Rewrite into col LIKE 'string-literal%' (? is it really equivalent?)
Do we need to take into account the length of string we're comparing with?
substring(col, 1, 2) = 'more-than-two-characters' |
This is always FALSE.
But one can't just count the characters: First, mind the contractions, it could be that substring(col1, 1, 3)-> 'foo' but it is equal to 'Fooooo' in some collation.
Second, mind the endspace. substring(col, 1,2) -> 'ab' and it's equal to 'ab ' in many collations.
Anyhow, note that this is not equivalent to col LIKE 'more-than-two-chars%'.
substring(col, 1, 20)= '17-characters-str' |
- can only be true when col='17-characters-str' , right? (again, changing to LIKE won't be equivalent).
and finally
substring(col, 1, 17)= '17-characters-str' |
means the col can be anything starting with '17-characters-str'.
(Q: does it mean the condition is EQUIVALENT to col LIKE '17-characters-str%' ? )
Discussed with Bar.
Some take-aways
LIKE doesn't do collation's equality comparisons.
It does per-character comparison. Example:
MariaDB [test]> set names utf8 collate utf8_danish_ci;
|
Query OK, 0 rows affected (0.001 sec)
|
|
MariaDB [test]> SELECT 'Ã…x'='AAX', 'Ã…x' LIKE 'AAX';
|
+-------------+------------------+
|
| 'Ã…x'='AAX' | 'Ã…x' LIKE 'AAX' |
|
+-------------+------------------+
|
| 1 | 0 |
|
+-------------+------------------+
|
1 row in set (0.001 sec)
|
This does not necessarily invalidate the approach used in this MDEV
but we need to look carefully.
Contractions
The task might be simpler if we only allow it for collations that do not have contractions.
- utf8_general_ci doesn't have contractions.
- utf8_unicode_ci doesn't have contractions.
- utf8mb4_uca1400_ai_ci (the new default) does have contractions.
It seems it's still worth trying to avoid adding the "no contractions" limitation.
Escaping
What if the string in question already has the pattern symbols: _,% ?
They will need to be escaped.
To the question of LIKE being different from equality:
CREATE TABLE `t12` (
|
`a` varchar(32) CHARACTER SET utf8mb3 COLLATE utf8mb3_danish_ci DEFAULT NULL,
|
KEY `a` (`a`)
|
);
|
...
select a, a LIKE 'Ã…X', a LIKE 'AAX' from t12 order by a;
|
+------+--------------+--------------+
|
| a | a LIKE 'Ã…X' | a LIKE 'AAX' |
|
+------+--------------+--------------+
|
| A | 0 | 0 |
|
| Z | 0 | 0 |
|
| AAX | 0 | 1 |
|
| Ã…x | 1 | 0 |
|
EDIT: fixed the ranges to be for the right queries:
LIKE generates ranges for different constants:
analyze format=json select * from t12 where a LIKE 'AAX' order by a;
|
optimizer_trace shows:
|
"ranges": ["(AAX) <= (a) <= (AAX)"]
|
and
analyze format=json select * from t12 where a LIKE 'Ã…X' order by a;
|
optimizer_trace shows:
|
"ranges": ["(Ã…X) <= (a) <= (Ã…X)"]
|
|
But the lookup code uses equality operator semantics so both queries will scan rows with a=AAX and rows with a=Ã…X.
How LIKE works
key1 LIKE 'bar%'
|
needs the first 3 characters to compare as equal to chars {b, a, r}.
When {b,a,r} doesn't have an unfinished contraction, the lookup key is:
min_key= {b, a, r, repeat(min_sort_char, N) }
|
|
max_key= {b, a, r, repeat(max_sort_char, N)}
|
(LIKE-HAS-WIDER-RANGE):
this range is wider than necessary as it also covers letter combinations which are equal to "bar" but not equal to it on a character-by-character basis. (e.g. in Danish aa=Ã… but a!=Ã…).
Suppose {b,a,r} has a potential unfinished contraction starting with 'r' (*- see below). Some made-up collation could have {r,u} -> b or {r,c}->v.
then we need the range to include all possible contraction continuations. In our made-up collation the lookup range would need to include {b,a,v} as {b,a,r,c} would compare as {b,a,v}.
This is done by including all characters at the potential contraction start:
min_key= { b, a, repeat(min_sort_char) }
|
|
max_key= { b, a, repeat(max_sort_char) }
|
- for some reason assumes the unfinished contraction can only start with the last letter before the pattern. In our example, with "r", not with "a,r".
How LEFT/SUBSTRING would work
LEFT(key1, N) = 'bar'
|
This means the string of first N characters of key1 compares as equal to 'bar'.
The "first N" part is complex to check, so we'll use a weaker condition:
"key1 starts with a string that compares as equal string to 'bar'"
|
This is what LIKE is doing due to LIKE-HAS-WIDER-RANGE.
LIKE's handling of possible unfinished contractions is applicable here.
ycp ok, takeaways of the above:
- We can use like_range() call to construct endpoints.
- We need to take care of cases like SUBSTRING(col, 1, 4)= 'a%_b' and escape the pattern chars. my_like_range() should get (EDIT: jira swallowed backslashes )
my_like_range( ... 'a\%\_b%' ...)
Other review input:
Please also handle LEFT(col, N). It's the same as SUBSTRING(col1, 1, N), right?
Would it be possible to move with_sargable_substr() to be a member of
Item_func_eq instead of Item_bool_func2?
This check is not good:
!strcmp(((Item_func *) args[0])->func_name(), "substr") |
Please add Item_func_substr::functype() and check that.
In Item_bool_func::get_mm_parts(), I see this:
if (with_sargable_substr()) |
sel_arg= get_mm_leaf_for_sargable(this, param, key_part->field, key_part, type, value); |
But then inside get_mm_leaf_for_sargable() there's yet another call to with_sargable_substr():
if (item->with_sargable_substr()) |
res->append("%", 1); |
(and there's also the third call in with_sargable_substr()).
At least one call can be removed by adding an argument to get_mm_leaf_for_sargable().
The name get_mm_leaf_for_sargable() doesn't look very meaningful to me.
Do names like get_mm_leaf_for_LIKE() or get_mm_leaf_for_like_pattern() describe what it does?
Hi psergei, thanks for the comments. ptal at the revised patch:
57a70ffd077 bb-11.8-mdev-34911 MDEV-34911 Sargable substr(col, 1, n) = str
|
Addressing the review comments:
- escape wildcards: done. Also escaping backslashes, and added testcases accordingly.
- also handle LEFT: done. Also added testcases for coverage.
- move with_sargable_substr to be a member of Item_func_eq instead of Item_bool_func2: I don't think that's necessary. The divergence between our conditions and the LIKE condition happens at get_mm_leaf which is called from get_mm_parts. And there's no Item_func_eq::get_mm_parts - it goes to Item_bool_func::get_mm_parts. So if we make the check a Item_func_eq method, we'll need to check the functype in Item_bool_func::get_mm_parts as well as casting to Item_func_eq on a match. The same goes to the other invocation in Item_bool_func2::add_key_fields_optimize_op().
- add Item_func_substr::functype() and check that: done. Also did the same for Item_func_left
- remove call to with_sargable_substr() from get_mm_leaf_for_sargable(): done. Now checking that the type is not LIKE which implies taking the path of with_sargable_substr() returning true.
- rename get_mm_leaf_for_sargable: done.
Hi psergei, ptal at the following revised patch, thanks:
7c75761ebbd upstream/bb-11.8-mdev-34911 MDEV-34911 Sargable substr(col, 1, n) = str
|
Addressed review comments:
- [X] move with_sargable_substr() to be called later in Item_bool_func2_with_rev::get_mm_tree()
- [X] don't call with_sargable_substr repeatedly in the Item_bool_func::get_mm_parts() loop
- [X] (note) substr(a, 1, unknown) = 'ab' is ok, because it is a necessary condition. the results will be further filtered correctly. but add a test
- [X] views not handled, use real_item and add a test
- [X] fix stack overrun in escaping string in get_mm_leaf_for_LIKE(), and use more descriptive var names for the pointers, and add a testcase
- [X] item_bool_func::with_sargable_substr() doesn't mention LEFT in the comment
- [X] less vcalls in item_bool_func::with_sargable_substr(). use caching.
- [X] (note) Item_func::BITMAP_LIKE is ok - added a comment
ycp, thanks. I think we're getting closer to finish but there are some steps to make still.
Review input part 1
Why did the patch require changes in Spider? Do I understand correctly that it's just some extra test coverage? Please add a note to the commit comment about this
This
+# strangely, the results here differ from the results if we extract
|
+# this testing block to a standalone mtr test, in the way that ce and
|
+# cé are swapped (ce before cé here, and ce after cé if standalone),
|
+# but given the results are the same between like and substr, it is
|
+# not a concern here |
needs to be gone before the push.
I've investigated the cause.
In func_str.test, %cé is 3 characters:
+create table t2 like t1;
|
+insert into t2 values('%cé');
|
+insert into t2 values('%ce');
|
+select hex(s1) from t2;
|
+hex(s1)
|
+0000002500000063000000E9
|
+000000250000006300000065
|
while in a separate .test file, it is 4 characters:
+create table t2 like t1;
|
+insert into t2 values('%cé');
|
+insert into t2 values('%ce');
|
+select hex(s1) from t2;
|
+hex(s1)
|
+0000002500000063000000C3000000A9
|
+000000250000006300000065
|
THD::make_string_literal() converts the literals from character_set_client.
The test case sets @@collation_connection=utf32_czech_ci, but the server doesn't support the client sending incoming data in utf32, so character_set_client remains:
- latin1 in the separate test file.
- utf8 in func_str.test because a test above does SET NAMES utf8.
So, these are the charsets that the server gets the data from client.
Then, it honors the @@collation_connection=utf32_czech_ci setting and converts literals to that charset/collation.
Conversion of "cé" from latin1 produces two characters, while conversion from utf8 produces one character.
I assume this is correct. We just need to remove the comment.
Review input part 2
It looks like moving escaping into a separate function was a good idea:
commit bc257b1b9edf330348464e469cca8a763c9806c6 (HEAD -> bb-11.8-mdev-34911-plus-cleanup, origin/bb-11.8-mdev-34911-plus-cleanup, bb-11.8-mdev-34911)
|
Author: Sergei Petrunia <sergey@mariadb.com>
|
Date: Thu Nov 21 21:45:12 2024 +0200
|
|
Move escaping into escape_like_characters(), add comments
|
I'll do a bit more commenting /cleanups.
lstartseva, testing can start on this branch: https://github.com/MariaDB/server/commits/bb-11.8-mdev-34911-plus-cleanup
Hi psergei,
> Why did the patch require changes in Spider? Do I understand correctly that it's just some extra test coverage? Please add a note to the commit comment about this
It's because the addition of LEFT_FUNC and SUBSTR_FUNC results in a different code path in spider group by handler query construction. Added a note to the commit message
> I assume this is correct. We just need to remove the comment.
Thanks for the explanation. That makes sense, see also https://mariadb.com/kb/en/setting-character-sets-and-collations/#examples. Removed the comment
> It looks like moving escaping into a separate function was a good idea:
> ...
> I'll do a bit more commenting /cleanups.
Thanks. This sounds like a WIP on your part, so I'll wait until you are done before incorporating your suggestions.
Updated patch:
158b9263e52 upstream/bb-11.8-mdev-34911 MDEV-34911 Sargable substr(col, 1, n) = str
|
ycp, yes. Now I've finished my cleanup suggestions here.
commit e7d43688844904dfd9027b2328693823f9fa67e9 (HEAD -> bb-11.8-mdev-34911-plus-cleanup, origin/bb-11.8-mdev-34911-plus-cleanup, bb-11.8-mdev-34911)
|
Author: Sergei Petrunia <sergey@mariadb.com>
|
Date: Mon Nov 25 10:30:01 2024 +0200
|
|
Move SargableLeft code to opt_sargable_left.cc
|
|
commit ab93436b99aef0c1ddcd73d6001610b3f0bbd90a
|
please take a look.
psergei, thanks. I've made some small adjustment on top:
babbea6cc71 upstream/bb-11.8-mdev-34911-plus-cleanup Some adjustment on top of the cleanup of MDEV-34991 commits
|
Squashed all commits into one and added you as a co-author:
2bf8d31baeb upstream/bb-11.8-mdev-34911 MDEV-34911 Sargable substr(col, 1, n) = str
|
lstartseva, please use the squashed commit above for testing, thanks.
ycp, there is an interesting behavior for CHARSET=swe7 : for type varchar() and TEXT access_type is not range. But for type BLOB everything is good and access_type is range
Testcase (varchar):
create table t (c varchar(5), key (c)) DEFAULT CHARSET=swe7; |
insert into t values ('ddd'), ('bbcd'), ('bba'), ('b%_cd'), ('aaa'), ('b'), (null); |
explain format=json select * from t where substr(c, 1, 4) = 'b%_c'; |
drop table t; |
Result: access_type is index
EXPLAIN
|
{
|
"query_block": { |
"select_id": 1, |
"cost": 0.006758432, |
"nested_loop": [ |
{
|
"table": { |
"table_name": "t", |
"access_type": "index", |
"possible_keys": ["c"], |
"key": "c", |
"key_length": "8", |
"used_key_parts": ["c"], |
"loops": 1, |
"rows": 7, |
"cost": 0.006758432, |
"filtered": 100, |
"attached_condition": "substr(t.c,1,4) = 'b%_c'", |
"using_index": true |
}
|
}
|
]
|
}
|
}
|
Testcase (TEXT):
create table t (c TEXT, key (c)) DEFAULT CHARSET=swe7; |
insert into t values ('ddd'), ('bbcd'), ('bba'), ('b%_cd'), ('aaa'), ('b'), (null); |
explain format=json select * from t where substr(c, 1, 4) = 'b%_c'; |
drop table t; |
Result: access_type is ALL
EXPLAIN
|
{
|
"query_block": { |
"select_id": 1, |
"cost": 0.011130435, |
"nested_loop": [ |
{
|
"table": { |
"table_name": "t", |
"access_type": "ALL", |
"possible_keys": ["c"], |
"loops": 1, |
"rows": 7, |
"cost": 0.011130435, |
"filtered": 100, |
"attached_condition": "substr(t.c,1,4) = 'b%_c'" |
}
|
}
|
]
|
}
|
}
|
If we do at the base commit eff9c198e32a828f610b93fad3a0f0eb63b3ded2
create table t (c varchar(5), key (c)) DEFAULT CHARSET=swe7; |
insert into t values ('ddd'), ('bbcd'), ('bba'), ('b%_cd'), ('aaa'), ('b'), (null); |
explain format=json select * from t where substr(c, 1, 4) = 'b%_c'; |
explain format=json select * from t where c like 'b\\%_c%'; |
drop table t; |
we get
CURRENT_TEST: main.mdev_34911_like
|
mysqltest: At line 4: query 'explain format=json select * from t where c like 'b\\%_c%'' failed: ER_CANT_AGGREGATE_2COLLATIONS (1267): Illegal mix of collations (swe7_swedish_ci,IMPLICIT) and (latin1_swedish_ci,COERCIBLE) for operation 'like'
|
It could be MDEV-35041, so I will need to test on the freshly merged main branch.
Testing at the current main f0961301c81c7f5b009c012c076abc326b203b4a, I'm getting the same "Illegal mix of collations" error, despite it containing the MDEV-35041 patch.
So I cannot easily compare LIKE with SUBSTR in this case.
However, if we do explain format=json select * from t where c like 'b%_c%';, then the error disappears and the output has access_type range. So something fishy is happening with the escape here
I did some investigation. It failed at escaping, more specifically when appending a backslash:
// escape_like_characters:
|
if ((ret= my_ci_wc_mb(cs, (my_wc_t) '\\', dst, dst_end)) <= 0) |
return true; /* No space - no LIKE optimize */ |
the wc to mb function resolves to my_wc_mb_8bit
int my_wc_mb_8bit(CHARSET_INFO *cs,my_wc_t wc, |
uchar *str,
|
uchar *end)
|
{
|
MY_UNI_IDX *idx;
|
|
if (str >= end) |
return MY_CS_TOOSMALL; |
|
for (idx=cs->tab_from_uni; idx->tab ; idx++) |
{
|
if (idx->from <= wc && idx->to >= wc) |
{
|
str[0]= idx->tab[wc - idx->from];
|
return (!str[0] && wc) ? MY_CS_ILUNI : 1; // returns MY_CS_ILUNI here |
}
|
}
|
return MY_CS_ILUNI; |
}
|
The cs->tab_from_uni->tab has the following arrangement:
(rr) p *cs->tab_from_uni->tab@128
|
$51 = "\000\001\002\003\004\005\006\a\b\t\n\v\f\r\016\017\020\021\022\023\024\025\026\027\030\031\032\033\034\035\036\037 !\"#$%&'()*+,-./0123456789:;<=>?\000ABCDEFGHIJKLMNOPQRSTUVWXYZ\000\000\000\000_\000abcdefghijklmnopqrstuvwxyz\000\000\000\000"
|
and unlike ascii, the 92nd element is not backslash, but a \000.
The "good" news is that the corresponding
c like 'b\\%_c%'
|
failed for the same reason, i.e. my_wc_mb_8bit failing to convert a backslash to the swe7 charset. So I think it is ok and can be regarded as intended behaviour that this case does not result in a sargable substr. psergei
In below the returned conv is NULL due to the conversion failure of the backslash char, which leads to the error reporting by my_coll_agg_error(args, nargs, fname.str, item_sep);
// Type_std_attributes::agg_item_set_converter:
|
for (i= 0, arg= args; i < nargs; i++, arg+= item_sep) |
{
|
Item* conv= (*arg)->safe_charset_converter(thd, coll.collation); // the returned conv == NULL |
if (conv == *arg) |
continue; |
|
if (!conv) |
{
|
if (nargs >=2 && nargs <= 3) |
{
|
/* restore the original arguments for better error message */ |
args[0]= safe_args[0];
|
args[item_sep]= safe_args[1];
|
}
|
if (nargs == 1 && single_err) |
{
|
/* |
Use *single_err to produce an error message mentioning two
|
collations.
|
*/
|
if (single_err->first) |
my_coll_agg_error(args[0]->collation, single_err->coll, fname.str);
|
else |
my_coll_agg_error(single_err->coll, args[0]->collation, fname.str);
|
}
|
else |
my_coll_agg_error(args, nargs, fname.str, item_sep);
|
return TRUE; |
}
|
Trace:
my_wc_mb_8bit > my_convert_using_func > my_convert > copy_and_convert > String::copy > String::copy > Item_string::Item_string > Item::const_charset_converter > Item::const_charset_converter > Item_string::safe_charset_converter > Type_std_attributes::agg_item_set_converter > Type_std_attributes::agg_arg_charsets > Type_std_attributes::agg_arg_charsets_for_comparison > Item_func_or_sum::agg_arg_charsets_for_comparison > Item_func_like::fix_length_and_dec > Item_func::fix_fields > Item_func_like::fix_fields > Item::fix_fields_if_needed > Item::fix_fields_if_needed_for_scalar > Item::fix_fields_if_needed_for_bool > setup_conds > setup_without_group > JOIN::prepare > mysql_select
Testing done.
MDEV-35564 is marked as not a bug, so this task is ok to push
Thanks for the testing - pushed e021770667851233c8cda34e9360606adfd3ea0c to main.
btw julien.fritsch, there are two unreleased 11.8.* versions in the dropdown menu of fix version/s, is that right? I picked the first one 11.8.0
A minimal testcase:
The explains of like and substr output the following, respectively:
explain select * from t where c like 'bb%';
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE t range c c 19 NULL 2 Using where; Using index
explain select * from t where substr(c, 1, 2) = 'bb';
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE t index NULL c 19 NULL 4 Using where; Using index
I understand that we want the second explain to output similarly to the first, if not the same. At least the type field should be range.
Comparison of the traces of like and substr (with rr) tells me the main discriminator is the behaviour of Item::get_mm_tree(), called from SQL_SELECT::test_quick_select(). Item::get_mm_tree() is the entry point of the RangeAnalysisModule as documented in opt_range.cc.
For like, it creates min_str and max_str as the bounds of the interval:
// in Item_func_like::get_mm_leaf
# 19 is the value of range->start_key.length later, but any sufficiently large number here can be used to make the point
(rr) p *min_str@19
$106 = "\000\020\000bb", '\t' <repeats 14 times>
(rr) p *max_str@19
$107 = "\000\020\000bb\364\217\277\277\364\217\277\277\364\217\277\277 "
And the two bounds can be seen here ("bb" and "bb\364", not sure why the upper bound for a char is 244 though).
I hacked a couple changes to get "range" in the type column in the explain output, so the next step is to get the same kind of min and max for the interval, as currently I only have
# at 1ab674089ce upstream/bb-11.7.mdev-34911-wip [wip] MDEV-34911
(rr) p range->start_key.length
$28 = 19
(rr) p *range->start_key.key@19
$26 = "\000\002\000bb", '\000' <repeats 13 times>
(rr) p *range->end_key.key@19
$27 = "\000\002\000bb", '\000' <repeats 13 times>
and the SELECT statement yields no result.
To be continued...