[MDEV-34911] Make SUBSTR(col, 1, n) = const_str sargable - Jira

Yuchen Pei added a comment - 2024-09-17 05:03 - edited

A minimal testcase:

create table t (c varchar(4), key (c));

insert into t values ('ddd'), ('bbcd'), ('bba'), ('aaa');

explain select * from t where c like 'bb%';

explain select * from t where substr(c, 1, 2) = 'bb';

select * from t where substr(c, 1, 2) = 'bb';

drop table t;

The explains of like and substr output the following, respectively:

explain select * from t where c like 'bb%';

id	select_type	table	type	possible_keys	key	key_len	ref	rows	Extra

1	SIMPLE	t	range	c	c	19	NULL	2	Using where; Using index

explain select * from t where substr(c, 1, 2) = 'bb';

id	select_type	table	type	possible_keys	key	key_len	ref	rows	Extra

1	SIMPLE	t	index	NULL	c	19	NULL	4	Using where; Using index

I understand that we want the second explain to output similarly to the first, if not the same. At least the type field should be range.

Comparison of the traces of like and substr (with rr) tells me the main discriminator is the behaviour of Item::get_mm_tree(), called from SQL_SELECT::test_quick_select(). Item::get_mm_tree() is the entry point of the RangeAnalysisModule as documented in opt_range.cc.

For like, it creates min_str and max_str as the bounds of the interval:

// in Item_func_like::get_mm_leaf

  SEL_ARG *tree= new (param->mem_root) SEL_ARG(field, min_str, max_str);

# 19 is the value of range->start_key.length later, but any sufficiently large number here can be used to make the point

(rr) p *min_str@19

$106 = "\000\020\000bb", '\t' <repeats 14 times>

(rr) p *max_str@19

$107 = "\000\020\000bb\364\217\277\277\364\217\277\277\364\217\277\277  "

And the two bounds can be seen here ("bb" and "bb\364", not sure why the upper bound for a char is 244 though).

I hacked a couple changes to get "range" in the type column in the explain output, so the next step is to get the same kind of min and max for the interval, as currently I only have

# at 1ab674089ce upstream/bb-11.7.mdev-34911-wip [wip] MDEV-34911

(rr) p range->start_key.length

$28 = 19

(rr) p *range->start_key.key@19

$26 = "\000\002\000bb", '\000' <repeats 13 times>

(rr) p *range->end_key.key@19

$27 = "\000\002\000bb", '\000' <repeats 13 times>

and the SELECT statement yields no result.

To be continued...

Yuchen Pei added a comment - 2024-09-17 05:03 - edited A minimal testcase: create table t (c varchar (4), key (c)); insert into t values ( 'ddd' ), ( 'bbcd' ), ( 'bba' ), ( 'aaa' ); explain select * from t where c like 'bb%' ; explain select * from t where substr(c, 1, 2) = 'bb' ; select * from t where substr(c, 1, 2) = 'bb' ; drop table t; The explains of like and substr output the following, respectively: explain select * from t where c like 'bb%'; id select_type table type possible_keys key key_len ref rows Extra 1 SIMPLE t range c c 19 NULL 2 Using where; Using index explain select * from t where substr(c, 1, 2) = 'bb'; id select_type table type possible_keys key key_len ref rows Extra 1 SIMPLE t index NULL c 19 NULL 4 Using where; Using index I understand that we want the second explain to output similarly to the first, if not the same. At least the type field should be range . Comparison of the traces of like and substr (with rr) tells me the main discriminator is the behaviour of Item::get_mm_tree() , called from SQL_SELECT::test_quick_select() . Item::get_mm_tree() is the entry point of the RangeAnalysisModule as documented in opt_range.cc . For like , it creates min_str and max_str as the bounds of the interval: // in Item_func_like::get_mm_leaf SEL_ARG *tree= new (param->mem_root) SEL_ARG(field, min_str, max_str); # 19 is the value of range->start_key.length later, but any sufficiently large number here can be used to make the point (rr) p *min_str@19 $106 = "\000\020\000bb", '\t' <repeats 14 times> (rr) p *max_str@19 $107 = "\000\020\000bb\364\217\277\277\364\217\277\277\364\217\277\277 " And the two bounds can be seen here ("bb" and "bb\364", not sure why the upper bound for a char is 244 though). I hacked a couple changes to get "range" in the type column in the explain output, so the next step is to get the same kind of min and max for the interval, as currently I only have # at 1ab674089ce upstream/bb-11.7.mdev-34911-wip [wip] MDEV-34911 (rr) p range->start_key.length $28 = 19 (rr) p *range->start_key.key@19 $26 = "\000\002\000bb", '\000' <repeats 13 times> (rr) p *range->end_key.key@19 $27 = "\000\002\000bb", '\000' <repeats 13 times> and the SELECT statement yields no result. To be continued...

Sergei Petrunia added a comment - 2024-10-24 13:23

ycp, yes, you were going in the right direction but several more steps are needed.

Did you see the call to my_like_range() in Item_func_like::get_mm_leaf ?

For the "%" in LIKE pattern, it does this:

    if (*ptr == w_many)                         /* '%' in SQL */

      /* Calculate length of keys */

      *min_length= (cs->state & (MY_CS_BINSORT | MY_CS_NOPAD)) ?

                    (size_t) (min_str - min_org) :

                    res_length;

      *max_length= res_length;

do

        *min_str++= 0;

        *max_str++= (char) cs->max_sort_char;

      } while (min_str != min_end);

      return 0;

note the 0 and cs->max_sort_char added to min and max endpoints...

Sergei Petrunia added a comment - 2024-10-24 13:23 ycp , yes, you were going in the right direction but several more steps are needed. Did you see the call to my_like_range() in Item_func_like::get_mm_leaf ? For the "%" in LIKE pattern, it does this: if (*ptr == w_many) /* '%' in SQL */ { /* Calculate length of keys */ *min_length= (cs->state & (MY_CS_BINSORT | MY_CS_NOPAD)) ? ( size_t ) (min_str - min_org) : res_length; *max_length= res_length; do { *min_str++= 0; *max_str++= ( char ) cs->max_sort_char; } while (min_str != min_end); return 0; note the 0 and cs->max_sort_char added to min and max endpoints...

Yuchen Pei added a comment - 2024-10-29 06:29 - edited

Hi psergei, ptal thanks

081dd346da7 upstream/bb-11.7-mdev-34911 MDEV-34911 Sargable substr(col, 1, n) = str

Yuchen Pei added a comment - 2024-10-29 06:29 - edited Hi psergei , ptal thanks 081dd346da7 upstream/bb-11.7-mdev-34911 MDEV-34911 Sargable substr(col, 1, n) = str

Sergei Petrunia added a comment - 2024-10-29 12:33 - edited

Reviewer notes to self:
LIKE handles contractions correctly: mdev34911-play-with-contractions.sql
this is done with opt_range asking the collation handler to produce the endpoints:

   if (field->charset()->like_range(res->ptr(), res->length(),

                                    escape, wild_one, wild_many,

                                    field_length,

                                    (char*) min_str + offset,

                                    (char*) max_str + offset,

                                    &min_length, &max_length))

LEFT is not handled (we probably should handle it)

The used approach doesn't extend easily to non-equalities: LEFT(x) <'foo' or SUBSTR(col, 1,10) IN (value-list) (which was requested way back in ~~MDEV-4414~~ and also used in some support cases). . It's a pity we don't handle IN.

There are two implementation options:

A. Make sargable (like current patch does)
B. Rewrite into col LIKE 'string-literal%' (? is it really equivalent?)

Do we need to take into account the length of string we're comparing with?

substring(col, 1, 2) = 'more-than-two-characters'

This is always FALSE.
But one can't just count the characters: First, mind the contractions, it could be that substring(col1, 1, 3)-> 'foo' but it is equal to 'Fooooo' in some collation.

Second, mind the endspace. substring(col, 1,2) -> 'ab' and it's equal to 'ab ' in many collations.

Anyhow, note that this is not equivalent to col LIKE 'more-than-two-chars%'.

substring(col, 1, 20)= '17-characters-str'

can only be true when col='17-characters-str' , right? (again, changing to LIKE won't be equivalent).

and finally

substring(col, 1, 17)= '17-characters-str'

means the col can be anything starting with '17-characters-str'.
(Q: does it mean the condition is EQUIVALENT to col LIKE '17-characters-str%' ? )

Sergei Petrunia added a comment - 2024-10-29 12:33 - edited Reviewer notes to self: LIKE handles contractions correctly: mdev34911-play-with-contractions.sql this is done with opt_range asking the collation handler to produce the endpoints: if (field->charset()->like_range(res->ptr(), res->length(), escape, wild_one, wild_many, field_length, ( char *) min_str + offset, ( char *) max_str + offset, &min_length, &max_length)) LEFT is not handled (we probably should handle it) The used approach doesn't extend easily to non-equalities: LEFT(x) <'foo' or SUBSTR(col, 1,10) IN (value-list) (which was requested way back in MDEV-4414 and also used in some support cases). . It's a pity we don't handle IN. There are two implementation options: A. Make sargable (like current patch does) B. Rewrite into col LIKE 'string-literal%' (? is it really equivalent?) Do we need to take into account the length of string we're comparing with? substring (col, 1, 2) = 'more-than-two-characters' This is always FALSE. But one can't just count the characters: First, mind the contractions, it could be that substring(col1, 1, 3)-> 'foo' but it is equal to 'Fooooo' in some collation. Second, mind the endspace. substring(col, 1,2) -> 'ab' and it's equal to 'ab ' in many collations. Anyhow, note that this is not equivalent to col LIKE 'more-than-two-chars%' . substring (col, 1, 20)= '17-characters-str' can only be true when col='17-characters-str' , right? (again, changing to LIKE won't be equivalent). and finally substring (col, 1, 17)= '17-characters-str' means the col can be anything starting with '17-characters-str' . (Q: does it mean the condition is EQUIVALENT to col LIKE '17-characters-str%' ? )

Sergei Petrunia added a comment - 2024-11-04 08:41

Discussed with Bar.

Some take-aways

LIKE doesn't do collation's equality comparisons.

It does per-character comparison. Example:

MariaDB [test]> set names utf8 collate utf8_danish_ci;

Query OK, 0 rows affected (0.001 sec)

MariaDB [test]> SELECT 'Åx'='AAX', 'Åx' LIKE 'AAX';

+-------------+------------------+

| 'Åx'='AAX'  | 'Åx' LIKE 'AAX'  |

+-------------+------------------+

|           1 |                0 |

+-------------+------------------+

1 row in set (0.001 sec)

This does not necessarily invalidate the approach used in this MDEV
but we need to look carefully.

Contractions

The task might be simpler if we only allow it for collations that do not have contractions.

utf8_general_ci doesn't have contractions.
utf8_unicode_ci doesn't have contractions.
utf8mb4_uca1400_ai_ci (the new default) does have contractions.

It seems it's still worth trying to avoid adding the "no contractions" limitation.

Escaping

What if the string in question already has the pattern symbols: _,% ?
They will need to be escaped.

Sergei Petrunia added a comment - 2024-11-04 08:41 Discussed with Bar. Some take-aways LIKE doesn't do collation's equality comparisons. It does per-character comparison. Example: MariaDB [test]> set names utf8 collate utf8_danish_ci; Query OK, 0 rows affected (0.001 sec) MariaDB [test]> SELECT 'Åx'='AAX', 'Åx' LIKE 'AAX'; +-------------+------------------+ | 'Åx'='AAX' | 'Åx' LIKE 'AAX' | +-------------+------------------+ | 1 | 0 | +-------------+------------------+ 1 row in set (0.001 sec) This does not necessarily invalidate the approach used in this MDEV but we need to look carefully. Contractions The task might be simpler if we only allow it for collations that do not have contractions. utf8_general_ci doesn't have contractions. utf8_unicode_ci doesn't have contractions. utf8mb4_uca1400_ai_ci (the new default) does have contractions. It seems it's still worth trying to avoid adding the "no contractions" limitation. Escaping What if the string in question already has the pattern symbols: _,% ? They will need to be escaped.

Sergei Petrunia added a comment - 2024-11-04 11:19 - edited

To the question of LIKE being different from equality:

CREATE TABLE `t12` (

  `a` varchar(32) CHARACTER SET utf8mb3 COLLATE utf8mb3_danish_ci DEFAULT NULL,

  KEY `a` (`a`)

);

...

select a, a LIKE 'ÅX', a LIKE 'AAX' from t12 order by a;

+------+--------------+--------------+

| a    | a LIKE 'ÅX'  | a LIKE 'AAX' |

+------+--------------+--------------+

| A    |            0 |            0 |

| Z    |            0 |            0 |

| AAX  |            0 |            1 |

| Åx   |            1 |            0 |

EDIT: fixed the ranges to be for the right queries:

LIKE generates ranges for different constants:

analyze format=json select * from t12 where  a LIKE 'AAX' order by a;

optimizer_trace shows:

                      "ranges": ["(AAX) <= (a) <= (AAX)"]

and

analyze format=json select * from t12 where  a LIKE 'ÅX' order by a;

optimizer_trace shows:

                      "ranges": ["(ÅX) <= (a) <= (ÅX)"]

But the lookup code uses equality operator semantics so both queries will scan rows with a=AAX and rows with a=ÅX.

Sergei Petrunia added a comment - 2024-11-04 11:19 - edited To the question of LIKE being different from equality: CREATE TABLE `t12` ( `a` varchar(32) CHARACTER SET utf8mb3 COLLATE utf8mb3_danish_ci DEFAULT NULL, KEY `a` (`a`) ); ... select a, a LIKE 'ÅX', a LIKE 'AAX' from t12 order by a; +------+--------------+--------------+ | a | a LIKE 'ÅX' | a LIKE 'AAX' | +------+--------------+--------------+ | A | 0 | 0 | | Z | 0 | 0 | | AAX | 0 | 1 | | Åx | 1 | 0 | EDIT: fixed the ranges to be for the right queries : LIKE generates ranges for different constants: analyze format=json select * from t12 where a LIKE 'AAX' order by a; optimizer_trace shows: "ranges": ["(AAX) <= (a) <= (AAX)"] and analyze format=json select * from t12 where a LIKE 'ÅX' order by a; optimizer_trace shows: "ranges": ["(ÅX) <= (a) <= (ÅX)"] But the lookup code uses equality operator semantics so both queries will scan rows with a=AAX and rows with a=ÅX.

Sergei Petrunia added a comment - 2024-11-04 12:43 - edited

How LIKE works

 key1 LIKE 'bar%'

needs the first 3 characters to compare as equal to chars {b, a, r}.
When {b,a,r} doesn't have an unfinished contraction, the lookup key is:

min_key= {b, a, r, repeat(min_sort_char, N) }

max_key= {b, a, r, repeat(max_sort_char, N)}

(LIKE-HAS-WIDER-RANGE):
this range is wider than necessary as it also covers letter combinations which are equal to "bar" but not equal to it on a character-by-character basis. (e.g. in Danish aa=Å but a!=Å).

Suppose {b,a,r} has a potential unfinished contraction starting with 'r' (*- see below). Some made-up collation could have {r,u} -> b or {r,c}->v.

then we need the range to include all possible contraction continuations. In our made-up collation the lookup range would need to include {b,a,v} as {b,a,r,c} would compare as {b,a,v}.

This is done by including all characters at the potential contraction start:

  min_key= { b, a, repeat(min_sort_char) }

  max_key= { b, a, repeat(max_sort_char) }

- for some reason assumes the unfinished contraction can only start with the last letter before the pattern. In our example, with "r", not with "a,r".

How LEFT/SUBSTRING would work

  LEFT(key1, N) = 'bar'

This means the string of first N characters of key1 compares as equal to 'bar'.

The "first N" part is complex to check, so we'll use a weaker condition:

"key1 starts with a string that compares as equal string to 'bar'"

This is what LIKE is doing due to LIKE-HAS-WIDER-RANGE.
LIKE's handling of possible unfinished contractions is applicable here.

Sergei Petrunia added a comment - 2024-11-04 12:43 - edited How LIKE works key1 LIKE 'bar%' needs the first 3 characters to compare as equal to chars {b, a, r}. When {b,a,r} doesn't have an unfinished contraction, the lookup key is: min_key= {b, a, r, repeat(min_sort_char, N) } max_key= {b, a, r, repeat(max_sort_char, N)} (LIKE-HAS-WIDER-RANGE): this range is wider than necessary as it also covers letter combinations which are equal to "bar" but not equal to it on a character-by-character basis. (e.g. in Danish aa=Å but a!=Å). Suppose {b,a,r} has a potential unfinished contraction starting with 'r' (*- see below). Some made-up collation could have {r,u} -> b or {r,c}->v. then we need the range to include all possible contraction continuations. In our made-up collation the lookup range would need to include {b,a,v} as {b,a,r,c} would compare as {b,a,v}. This is done by including all characters at the potential contraction start: min_key= { b, a, repeat(min_sort_char) } max_key= { b, a, repeat(max_sort_char) } - for some reason assumes the unfinished contraction can only start with the last letter before the pattern. In our example, with "r", not with "a,r". How LEFT/SUBSTRING would work LEFT(key1, N) = 'bar' This means the string of first N characters of key1 compares as equal to 'bar'. The "first N" part is complex to check, so we'll use a weaker condition: "key1 starts with a string that compares as equal string to 'bar'" This is what LIKE is doing due to LIKE-HAS-WIDER-RANGE. LIKE's handling of possible unfinished contractions is applicable here.

Sergei Petrunia added a comment - 2024-11-05 08:03 - edited

ycp ok, takeaways of the above:

We can use like_range() call to construct endpoints.
We need to take care of cases like SUBSTRING(col, 1, 4)= 'a%_b' and escape the pattern chars. my_like_range() should get (EDIT: jira swallowed backslashes )
my_like_range( ... 'a\%\_b%' ...)

Other review input:
Please also handle LEFT(col, N). It's the same as SUBSTRING(col1, 1, N), right?

Would it be possible to move with_sargable_substr() to be a member of
Item_func_eq instead of Item_bool_func2?

This check is not good:

 !strcmp(((Item_func *) args[0])->func_name(), "substr")

Please add Item_func_substr::functype() and check that.

In Item_bool_func::get_mm_parts(), I see this:

        if (with_sargable_substr())

          sel_arg= get_mm_leaf_for_sargable(this, param, key_part->field, key_part, type, value);

But then inside get_mm_leaf_for_sargable() there's yet another call to with_sargable_substr():

  if (item->with_sargable_substr())

    res->append("%", 1);

(and there's also the third call in with_sargable_substr()).

At least one call can be removed by adding an argument to get_mm_leaf_for_sargable().

The name get_mm_leaf_for_sargable() doesn't look very meaningful to me.
Do names like get_mm_leaf_for_LIKE() or get_mm_leaf_for_like_pattern() describe what it does?

Sergei Petrunia added a comment - 2024-11-05 08:03 - edited ycp ok, takeaways of the above: We can use like_range() call to construct endpoints. We need to take care of cases like SUBSTRING(col, 1, 4)= 'a%_b' and escape the pattern chars. my_like_range() should get ( EDIT : jira swallowed backslashes ) my_like_range( ... 'a\%\_b%' ...) Other review input: Please also handle LEFT(col, N). It's the same as SUBSTRING(col1, 1, N), right? Would it be possible to move with_sargable_substr() to be a member of Item_func_eq instead of Item_bool_func2? This check is not good: ! strcmp (((Item_func *) args[0])->func_name(), "substr" ) Please add Item_func_substr::functype() and check that. In Item_bool_func::get_mm_parts(), I see this: if (with_sargable_substr()) sel_arg= get_mm_leaf_for_sargable( this , param, key_part->field, key_part, type, value); But then inside get_mm_leaf_for_sargable() there's yet another call to with_sargable_substr(): if (item->with_sargable_substr()) res->append( "%" , 1); (and there's also the third call in with_sargable_substr()). At least one call can be removed by adding an argument to get_mm_leaf_for_sargable(). The name get_mm_leaf_for_sargable() doesn't look very meaningful to me. Do names like get_mm_leaf_for_LIKE() or get_mm_leaf_for_like_pattern() describe what it does?

Yuchen Pei added a comment - 2024-11-18 06:18 - edited

Hi psergei, thanks for the comments. ptal at the revised patch:

57a70ffd077 bb-11.8-mdev-34911 MDEV-34911 Sargable substr(col, 1, n) = str

Addressing the review comments:

escape wildcards: done. Also escaping backslashes, and added testcases accordingly.
also handle LEFT: done. Also added testcases for coverage.
move with_sargable_substr to be a member of Item_func_eq instead of Item_bool_func2: I don't think that's necessary. The divergence between our conditions and the LIKE condition happens at get_mm_leaf which is called from get_mm_parts. And there's no Item_func_eq::get_mm_parts - it goes to Item_bool_func::get_mm_parts. So if we make the check a Item_func_eq method, we'll need to check the functype in Item_bool_func::get_mm_parts as well as casting to Item_func_eq on a match. The same goes to the other invocation in Item_bool_func2::add_key_fields_optimize_op().
add Item_func_substr::functype() and check that: done. Also did the same for Item_func_left
remove call to with_sargable_substr() from get_mm_leaf_for_sargable(): done. Now checking that the type is not LIKE which implies taking the path of with_sargable_substr() returning true.
rename get_mm_leaf_for_sargable: done.

Yuchen Pei added a comment - 2024-11-18 06:18 - edited Hi psergei , thanks for the comments. ptal at the revised patch: 57a70ffd077 bb-11.8-mdev-34911 MDEV-34911 Sargable substr(col, 1, n) = str Addressing the review comments: escape wildcards: done. Also escaping backslashes, and added testcases accordingly. also handle LEFT: done. Also added testcases for coverage. move with_sargable_substr to be a member of Item_func_eq instead of Item_bool_func2: I don't think that's necessary. The divergence between our conditions and the LIKE condition happens at get_mm_leaf which is called from get_mm_parts. And there's no Item_func_eq::get_mm_parts - it goes to Item_bool_func::get_mm_parts. So if we make the check a Item_func_eq method, we'll need to check the functype in Item_bool_func::get_mm_parts as well as casting to Item_func_eq on a match. The same goes to the other invocation in Item_bool_func2::add_key_fields_optimize_op(). add Item_func_substr::functype() and check that: done. Also did the same for Item_func_left remove call to with_sargable_substr() from get_mm_leaf_for_sargable(): done. Now checking that the type is not LIKE which implies taking the path of with_sargable_substr() returning true. rename get_mm_leaf_for_sargable: done.

Sergei Petrunia added a comment - 2024-11-19 17:06

Review input provided on Slack today

Sergei Petrunia added a comment - 2024-11-19 17:06 Review input provided on Slack today

Yuchen Pei added a comment - 2024-11-20 04:44

Hi psergei, ptal at the following revised patch, thanks:

7c75761ebbd upstream/bb-11.8-mdev-34911 MDEV-34911 Sargable substr(col, 1, n) = str

Addressed review comments:

[X] move with_sargable_substr() to be called later in Item_bool_func2_with_rev::get_mm_tree()
[X] don't call with_sargable_substr repeatedly in the Item_bool_func::get_mm_parts() loop
[X] (note) substr(a, 1, unknown) = 'ab' is ok, because it is a necessary condition. the results will be further filtered correctly. but add a test
[X] views not handled, use real_item and add a test
[X] fix stack overrun in escaping string in get_mm_leaf_for_LIKE(), and use more descriptive var names for the pointers, and add a testcase
[X] item_bool_func::with_sargable_substr() doesn't mention LEFT in the comment
[X] less vcalls in item_bool_func::with_sargable_substr(). use caching.
[X] (note) Item_func::BITMAP_LIKE is ok - added a comment

Yuchen Pei added a comment - 2024-11-20 04:44 Hi psergei , ptal at the following revised patch, thanks: 7c75761ebbd upstream/bb-11.8-mdev-34911 MDEV-34911 Sargable substr(col, 1, n) = str Addressed review comments: [X] move with_sargable_substr() to be called later in Item_bool_func2_with_rev::get_mm_tree() [X] don't call with_sargable_substr repeatedly in the Item_bool_func::get_mm_parts() loop [X] (note) substr(a, 1, unknown) = 'ab' is ok, because it is a necessary condition. the results will be further filtered correctly. but add a test [X] views not handled, use real_item and add a test [X] fix stack overrun in escaping string in get_mm_leaf_for_LIKE(), and use more descriptive var names for the pointers, and add a testcase [X] item_bool_func::with_sargable_substr() doesn't mention LEFT in the comment [X] less vcalls in item_bool_func::with_sargable_substr(). use caching. [X] (note) Item_func::BITMAP_LIKE is ok - added a comment

Sergei Petrunia added a comment - 2024-11-22 09:34 - edited

ycp, thanks. I think we're getting closer to finish but there are some steps to make still.

Review input part 1

Why did the patch require changes in Spider? Do I understand correctly that it's just some extra test coverage? Please add a note to the commit comment about this

This

+# strangely, the results here differ from the results if we extract

+# this testing block to a standalone mtr test, in the way that ce and

+# cé are swapped (ce before cé here, and ce after cé if standalone),

+# but given the results are the same between like and substr, it is

+# not a concern here

needs to be gone before the push.
I've investigated the cause.
In func_str.test, %cé is 3 characters:

+create table t2 like t1;

+insert into t2 values('%cé');

+insert into t2 values('%ce');

+select hex(s1) from t2;

+hex(s1)

+0000002500000063000000E9

+000000250000006300000065

while in a separate .test file, it is 4 characters:

+create table t2 like t1;

+insert into t2 values('%cé');

+insert into t2 values('%ce');

+select hex(s1) from t2;

+hex(s1)

+0000002500000063000000C3000000A9

+000000250000006300000065

THD::make_string_literal() converts the literals from character_set_client.
The test case sets @@collation_connection=utf32_czech_ci, but the server doesn't support the client sending incoming data in utf32, so character_set_client remains:

latin1 in the separate test file.
utf8 in func_str.test because a test above does SET NAMES utf8.

So, these are the charsets that the server gets the data from client.
Then, it honors the @@collation_connection=utf32_czech_ci setting and converts literals to that charset/collation.

Conversion of "cé" from latin1 produces two characters, while conversion from utf8 produces one character.
I assume this is correct. We just need to remove the comment.

Sergei Petrunia added a comment - 2024-11-22 09:34 - edited ycp , thanks. I think we're getting closer to finish but there are some steps to make still. Review input part 1 Why did the patch require changes in Spider? Do I understand correctly that it's just some extra test coverage? Please add a note to the commit comment about this This +# strangely, the results here differ from the results if we extract +# this testing block to a standalone mtr test, in the way that ce and +# cé are swapped (ce before cé here, and ce after cé if standalone), +# but given the results are the same between like and substr, it is +# not a concern here needs to be gone before the push. I've investigated the cause. In func_str.test, %cé is 3 characters: +create table t2 like t1; +insert into t2 values('%cé'); +insert into t2 values('%ce'); +select hex(s1) from t2; +hex(s1) +0000002500000063000000E9 +000000250000006300000065 while in a separate .test file, it is 4 characters: +create table t2 like t1; +insert into t2 values('%cé'); +insert into t2 values('%ce'); +select hex(s1) from t2; +hex(s1) +0000002500000063000000C3000000A9 +000000250000006300000065 THD::make_string_literal() converts the literals from character_set_client . The test case sets @@collation_connection=utf32_czech_ci, but the server doesn't support the client sending incoming data in utf32, so character_set_client remains: latin1 in the separate test file. utf8 in func_str.test because a test above does SET NAMES utf8. So, these are the charsets that the server gets the data from client. Then, it honors the @@collation_connection=utf32_czech_ci setting and converts literals to that charset/collation. Conversion of "cé" from latin1 produces two characters, while conversion from utf8 produces one character. I assume this is correct. We just need to remove the comment.

Sergei Petrunia added a comment - 2024-11-22 10:31

Review input part 2

It looks like moving escaping into a separate function was a good idea:

commit bc257b1b9edf330348464e469cca8a763c9806c6 (HEAD -> bb-11.8-mdev-34911-plus-cleanup, origin/bb-11.8-mdev-34911-plus-cleanup, bb-11.8-mdev-34911)

Author: Sergei Petrunia <sergey@mariadb.com>

Date:   Thu Nov 21 21:45:12 2024 +0200

    Move escaping into escape_like_characters(), add comments

I'll do a bit more commenting /cleanups.

Sergei Petrunia added a comment - 2024-11-22 10:31 Review input part 2 It looks like moving escaping into a separate function was a good idea: commit bc257b1b9edf330348464e469cca8a763c9806c6 (HEAD -> bb-11.8-mdev-34911-plus-cleanup, origin/bb-11.8-mdev-34911-plus-cleanup, bb-11.8-mdev-34911) Author: Sergei Petrunia <sergey@mariadb.com> Date: Thu Nov 21 21:45:12 2024 +0200 Move escaping into escape_like_characters(), add comments I'll do a bit more commenting /cleanups.

Sergei Petrunia added a comment - 2024-11-22 10:38

lstartseva, testing can start on this branch: https://github.com/MariaDB/server/commits/bb-11.8-mdev-34911-plus-cleanup

Sergei Petrunia added a comment - 2024-11-22 10:38 lstartseva , testing can start on this branch: https://github.com/MariaDB/server/commits/bb-11.8-mdev-34911-plus-cleanup

Yuchen Pei added a comment - 2024-11-25 00:27

Hi psergei,

> Why did the patch require changes in Spider? Do I understand correctly that it's just some extra test coverage? Please add a note to the commit comment about this

It's because the addition of LEFT_FUNC and SUBSTR_FUNC results in a different code path in spider group by handler query construction. Added a note to the commit message

> I assume this is correct. We just need to remove the comment.

Thanks for the explanation. That makes sense, see also https://mariadb.com/kb/en/setting-character-sets-and-collations/#examples. Removed the comment

> It looks like moving escaping into a separate function was a good idea:
> ...
> I'll do a bit more commenting /cleanups.

Thanks. This sounds like a WIP on your part, so I'll wait until you are done before incorporating your suggestions.

Updated patch:

158b9263e52 upstream/bb-11.8-mdev-34911 MDEV-34911 Sargable substr(col, 1, n) = str

Yuchen Pei added a comment - 2024-11-25 00:27 Hi psergei , > Why did the patch require changes in Spider? Do I understand correctly that it's just some extra test coverage? Please add a note to the commit comment about this It's because the addition of LEFT_FUNC and SUBSTR_FUNC results in a different code path in spider group by handler query construction. Added a note to the commit message > I assume this is correct. We just need to remove the comment. Thanks for the explanation. That makes sense, see also https://mariadb.com/kb/en/setting-character-sets-and-collations/#examples . Removed the comment > It looks like moving escaping into a separate function was a good idea: > ... > I'll do a bit more commenting /cleanups. Thanks. This sounds like a WIP on your part, so I'll wait until you are done before incorporating your suggestions. Updated patch: 158b9263e52 upstream/bb-11.8-mdev-34911 MDEV-34911 Sargable substr(col, 1, n) = str

Sergei Petrunia added a comment - 2024-11-25 11:48

ycp, yes. Now I've finished my cleanup suggestions here.

commit e7d43688844904dfd9027b2328693823f9fa67e9 (HEAD -> bb-11.8-mdev-34911-plus-cleanup, origin/bb-11.8-mdev-34911-plus-cleanup, bb-11.8-mdev-34911)

Author: Sergei Petrunia <sergey@mariadb.com>

Date:   Mon Nov 25 10:30:01 2024 +0200

    Move SargableLeft code to opt_sargable_left.cc

commit ab93436b99aef0c1ddcd73d6001610b3f0bbd90a

please take a look.

Sergei Petrunia added a comment - 2024-11-25 11:48 ycp , yes. Now I've finished my cleanup suggestions here. commit e7d43688844904dfd9027b2328693823f9fa67e9 (HEAD -> bb-11.8-mdev-34911-plus-cleanup, origin/bb-11.8-mdev-34911-plus-cleanup, bb-11.8-mdev-34911) Author: Sergei Petrunia <sergey@mariadb.com> Date: Mon Nov 25 10:30:01 2024 +0200 Move SargableLeft code to opt_sargable_left.cc commit ab93436b99aef0c1ddcd73d6001610b3f0bbd90a please take a look.

Yuchen Pei added a comment - 2024-11-26 00:32

psergei, thanks. I've made some small adjustment on top:

babbea6cc71 upstream/bb-11.8-mdev-34911-plus-cleanup Some adjustment on top of the cleanup of MDEV-34991 commits

Squashed all commits into one and added you as a co-author:

2bf8d31baeb upstream/bb-11.8-mdev-34911 MDEV-34911 Sargable substr(col, 1, n) = str

lstartseva, please use the squashed commit above for testing, thanks.

Yuchen Pei added a comment - 2024-11-26 00:32 psergei , thanks. I've made some small adjustment on top: babbea6cc71 upstream/bb-11.8-mdev-34911-plus-cleanup Some adjustment on top of the cleanup of MDEV-34991 commits Squashed all commits into one and added you as a co-author: 2bf8d31baeb upstream/bb-11.8-mdev-34911 MDEV-34911 Sargable substr(col, 1, n) = str lstartseva , please use the squashed commit above for testing, thanks.

Lena Startseva added a comment - 2024-12-02 09:13 - edited

ycp, there is an interesting behavior for CHARSET=swe7 : for type varchar() and TEXT access_type is not range. But for type BLOB everything is good and access_type is range

Testcase (varchar):

create table t (c varchar(5), key (c)) DEFAULT CHARSET=swe7;

insert into t values ('ddd'), ('bbcd'), ('bba'), ('b%_cd'), ('aaa'), ('b'), (null);

explain format=json select * from t where  substr(c, 1, 4) = 'b%_c';

drop table t;

Result: access_type is index

EXPLAIN

  "query_block": {

    "select_id": 1,

    "cost": 0.006758432,

    "nested_loop": [

        "table": {

          "table_name": "t",

          "access_type": "index",

          "possible_keys": ["c"],

          "key": "c",

          "key_length": "8",

          "used_key_parts": ["c"],

          "loops": 1,

          "rows": 7,

          "cost": 0.006758432,

          "filtered": 100,

          "attached_condition": "substr(t.c,1,4) = 'b%_c'",

          "using_index": true

Testcase (TEXT):

create table t (c TEXT, key (c)) DEFAULT CHARSET=swe7;

insert into t values ('ddd'), ('bbcd'), ('bba'), ('b%_cd'), ('aaa'), ('b'), (null);

explain format=json select * from t where  substr(c, 1, 4) = 'b%_c';

drop table t;

Result: access_type is ALL

EXPLAIN

  "query_block": {

    "select_id": 1,

    "cost": 0.011130435,

    "nested_loop": [

        "table": {

          "table_name": "t",

          "access_type": "ALL",

          "possible_keys": ["c"],

          "loops": 1,

          "rows": 7,

          "cost": 0.011130435,

          "filtered": 100,

          "attached_condition": "substr(t.c,1,4) = 'b%_c'"

Lena Startseva added a comment - 2024-12-02 09:13 - edited ycp , there is an interesting behavior for CHARSET=swe7 : for type varchar() and TEXT access_type is not range. But for type BLOB everything is good and access_type is range Testcase (varchar): create table t (c varchar (5), key (c)) DEFAULT CHARSET=swe7; insert into t values ( 'ddd' ), ( 'bbcd' ), ( 'bba' ), ( 'b%_cd' ), ( 'aaa' ), ( 'b' ), ( null ); explain format=json select * from t where substr(c, 1, 4) = 'b%_c' ; drop table t; Result: access_type is index EXPLAIN { "query_block" : { "select_id" : 1, "cost" : 0.006758432, "nested_loop" : [ { "table" : { "table_name" : "t" , "access_type" : "index" , "possible_keys" : [ "c" ], "key" : "c" , "key_length" : "8" , "used_key_parts" : [ "c" ], "loops" : 1, "rows" : 7, "cost" : 0.006758432, "filtered" : 100, "attached_condition" : "substr(t.c,1,4) = 'b%_c'" , "using_index" : true } } ] } } Testcase (TEXT): create table t (c TEXT, key (c)) DEFAULT CHARSET=swe7; insert into t values ( 'ddd' ), ( 'bbcd' ), ( 'bba' ), ( 'b%_cd' ), ( 'aaa' ), ( 'b' ), ( null ); explain format=json select * from t where substr(c, 1, 4) = 'b%_c' ; drop table t; Result: access_type is ALL EXPLAIN { "query_block" : { "select_id" : 1, "cost" : 0.011130435, "nested_loop" : [ { "table" : { "table_name" : "t" , "access_type" : "ALL" , "possible_keys" : [ "c" ], "loops" : 1, "rows" : 7, "cost" : 0.011130435, "filtered" : 100, "attached_condition" : "substr(t.c,1,4) = 'b%_c'" } } ] } }

Yuchen Pei added a comment - 2024-12-03 00:45

If we do at the base commit eff9c198e32a828f610b93fad3a0f0eb63b3ded2

create table t (c varchar(5), key (c)) DEFAULT CHARSET=swe7;

insert into t values ('ddd'), ('bbcd'), ('bba'), ('b%_cd'), ('aaa'), ('b'), (null);

explain format=json select * from t where  substr(c, 1, 4) = 'b%_c';

explain format=json select * from t where  c like 'b\\%_c%';

drop table t;

we get

CURRENT_TEST: main.mdev_34911_like

mysqltest: At line 4: query 'explain format=json select * from t where  c like 'b\\%_c%'' failed: ER_CANT_AGGREGATE_2COLLATIONS (1267): Illegal mix of collations (swe7_swedish_ci,IMPLICIT) and (latin1_swedish_ci,COERCIBLE) for operation 'like'

It could be ~~MDEV-35041~~, so I will need to test on the freshly merged main branch.

Yuchen Pei added a comment - 2024-12-03 00:45 If we do at the base commit eff9c198e32a828f610b93fad3a0f0eb63b3ded2 create table t (c varchar (5), key (c)) DEFAULT CHARSET=swe7; insert into t values ( 'ddd' ), ( 'bbcd' ), ( 'bba' ), ( 'b%_cd' ), ( 'aaa' ), ( 'b' ), ( null ); explain format=json select * from t where substr(c, 1, 4) = 'b%_c' ; explain format=json select * from t where c like 'b\\%_c%' ; drop table t; we get CURRENT_TEST: main.mdev_34911_like mysqltest: At line 4: query 'explain format=json select * from t where c like 'b\\%_c%'' failed: ER_CANT_AGGREGATE_2COLLATIONS (1267): Illegal mix of collations (swe7_swedish_ci,IMPLICIT) and (latin1_swedish_ci,COERCIBLE) for operation 'like' It could be MDEV-35041 , so I will need to test on the freshly merged main branch.

Yuchen Pei added a comment - 2024-12-03 02:28 - edited

Testing at the current main f0961301c81c7f5b009c012c076abc326b203b4a, I'm getting the same "Illegal mix of collations" error, despite it containing the ~~MDEV-35041~~ patch.

So I cannot easily compare LIKE with SUBSTR in this case.

However, if we do explain format=json select * from t where c like 'b%_c%';, then the error disappears and the output has access_type range. So something fishy is happening with the escape here

Yuchen Pei added a comment - 2024-12-03 02:28 - edited Testing at the current main f0961301c81c7f5b009c012c076abc326b203b4a, I'm getting the same "Illegal mix of collations" error, despite it containing the MDEV-35041 patch. So I cannot easily compare LIKE with SUBSTR in this case. However, if we do explain format=json select * from t where c like 'b%_c%'; , then the error disappears and the output has access_type range. So something fishy is happening with the escape here

Yuchen Pei added a comment - 2024-12-03 05:33 - edited

I did some investigation. It failed at escaping, more specifically when appending a backslash:

// escape_like_characters:

      if ((ret= my_ci_wc_mb(cs, (my_wc_t) '\\', dst, dst_end)) <= 0)

        return true; /* No space - no LIKE optimize */

the wc to mb function resolves to my_wc_mb_8bit

int my_wc_mb_8bit(CHARSET_INFO *cs,my_wc_t wc,

		  uchar *str,

		  uchar *end)

  MY_UNI_IDX *idx;

  if (str >= end)

    return MY_CS_TOOSMALL;

  for (idx=cs->tab_from_uni; idx->tab ; idx++)

    if (idx->from <= wc && idx->to >= wc)

      str[0]= idx->tab[wc - idx->from];

      return (!str[0] && wc) ? MY_CS_ILUNI : 1; // returns MY_CS_ILUNI here

  return MY_CS_ILUNI;

The cs->tab_from_uni->tab has the following arrangement:

(rr) p *cs->tab_from_uni->tab@128

$51 = "\000\001\002\003\004\005\006\a\b\t\n\v\f\r\016\017\020\021\022\023\024\025\026\027\030\031\032\033\034\035\036\037 !\"#$%&'()*+,-./0123456789:;<=>?\000ABCDEFGHIJKLMNOPQRSTUVWXYZ\000\000\000\000_\000abcdefghijklmnopqrstuvwxyz\000\000\000\000"

and unlike ascii, the 92nd element is not backslash, but a \000.

Yuchen Pei added a comment - 2024-12-03 05:33 - edited I did some investigation. It failed at escaping, more specifically when appending a backslash: // escape_like_characters: if ((ret= my_ci_wc_mb(cs, (my_wc_t) '\\' , dst, dst_end)) <= 0) return true ; /* No space - no LIKE optimize */ the wc to mb function resolves to my_wc_mb_8bit int my_wc_mb_8bit(CHARSET_INFO *cs,my_wc_t wc, uchar *str, uchar *end) { MY_UNI_IDX *idx; if (str >= end) return MY_CS_TOOSMALL; for (idx=cs->tab_from_uni; idx->tab ; idx++) { if (idx->from <= wc && idx->to >= wc) { str[0]= idx->tab[wc - idx->from]; return (!str[0] && wc) ? MY_CS_ILUNI : 1; // returns MY_CS_ILUNI here } } return MY_CS_ILUNI; } The cs->tab_from_uni->tab has the following arrangement: (rr) p *cs->tab_from_uni->tab@128 $51 = "\000\001\002\003\004\005\006\a\b\t\n\v\f\r\016\017\020\021\022\023\024\025\026\027\030\031\032\033\034\035\036\037 !\"#$%&'()*+,-./0123456789:;<=>?\000ABCDEFGHIJKLMNOPQRSTUVWXYZ\000\000\000\000_\000abcdefghijklmnopqrstuvwxyz\000\000\000\000" and unlike ascii, the 92nd element is not backslash, but a \000 .

Yuchen Pei added a comment - 2024-12-03 06:34 - edited

The "good" news is that the corresponding

c like 'b\\%_c%'

failed for the same reason, i.e. my_wc_mb_8bit failing to convert a backslash to the swe7 charset. So I think it is ok and can be regarded as intended behaviour that this case does not result in a sargable substr. psergei

In below the returned conv is NULL due to the conversion failure of the backslash char, which leads to the error reporting by my_coll_agg_error(args, nargs, fname.str, item_sep);

// Type_std_attributes::agg_item_set_converter:

  for (i= 0, arg= args; i < nargs; i++, arg+= item_sep)

    Item* conv= (*arg)->safe_charset_converter(thd, coll.collation); // the returned conv == NULL

    if (conv == *arg)

      continue;

    if (!conv)

      if (nargs >=2 && nargs <= 3)

        /* restore the original arguments for better error message */

        args[0]= safe_args[0];

        args[item_sep]= safe_args[1];

      if (nargs == 1 && single_err)

/*

          Use *single_err to produce an error message mentioning two

          collations.

*/

        if (single_err->first)

          my_coll_agg_error(args[0]->collation, single_err->coll, fname.str);

        else

          my_coll_agg_error(single_err->coll, args[0]->collation, fname.str);

      else

        my_coll_agg_error(args, nargs, fname.str, item_sep);

      return TRUE;

Trace:

my_wc_mb_8bit > my_convert_using_func > my_convert > copy_and_convert > String::copy > String::copy > Item_string::Item_string > Item::const_charset_converter > Item::const_charset_converter > Item_string::safe_charset_converter > Type_std_attributes::agg_item_set_converter > Type_std_attributes::agg_arg_charsets > Type_std_attributes::agg_arg_charsets_for_comparison > Item_func_or_sum::agg_arg_charsets_for_comparison > Item_func_like::fix_length_and_dec > Item_func::fix_fields > Item_func_like::fix_fields > Item::fix_fields_if_needed > Item::fix_fields_if_needed_for_scalar > Item::fix_fields_if_needed_for_bool > setup_conds > setup_without_group > JOIN::prepare > mysql_select

Yuchen Pei added a comment - 2024-12-03 06:34 - edited The "good" news is that the corresponding c like 'b\\%_c%' failed for the same reason, i.e. my_wc_mb_8bit failing to convert a backslash to the swe7 charset. So I think it is ok and can be regarded as intended behaviour that this case does not result in a sargable substr. psergei In below the returned conv is NULL due to the conversion failure of the backslash char, which leads to the error reporting by my_coll_agg_error(args, nargs, fname.str, item_sep); // Type_std_attributes::agg_item_set_converter: for (i= 0, arg= args; i < nargs; i++, arg+= item_sep) { Item* conv= (*arg)->safe_charset_converter(thd, coll.collation); // the returned conv == NULL if (conv == *arg) continue ; if (!conv) { if (nargs >=2 && nargs <= 3) { /* restore the original arguments for better error message */ args[0]= safe_args[0]; args[item_sep]= safe_args[1]; } if (nargs == 1 && single_err) { /* Use *single_err to produce an error message mentioning two collations. */ if (single_err->first) my_coll_agg_error(args[0]->collation, single_err->coll, fname.str); else my_coll_agg_error(single_err->coll, args[0]->collation, fname.str); } else my_coll_agg_error(args, nargs, fname.str, item_sep); return TRUE; } Trace: my_wc_mb_8bit > my_convert_using_func > my_convert > copy_and_convert > String::copy > String::copy > Item_string::Item_string > Item::const_charset_converter > Item::const_charset_converter > Item_string::safe_charset_converter > Type_std_attributes::agg_item_set_converter > Type_std_attributes::agg_arg_charsets > Type_std_attributes::agg_arg_charsets_for_comparison > Item_func_or_sum::agg_arg_charsets_for_comparison > Item_func_like::fix_length_and_dec > Item_func::fix_fields > Item_func_like::fix_fields > Item::fix_fields_if_needed > Item::fix_fields_if_needed_for_scalar > Item::fix_fields_if_needed_for_bool > setup_conds > setup_without_group > JOIN::prepare > mysql_select

Lena Startseva added a comment - 2024-12-19 16:05

Testing done.
~~MDEV-35564~~ is marked as not a bug, so this task is ok to push

Lena Startseva added a comment - 2024-12-19 16:05 Testing done. MDEV-35564 is marked as not a bug, so this task is ok to push

Yuchen Pei added a comment - 2025-01-02 00:29

Thanks for the testing - pushed e021770667851233c8cda34e9360606adfd3ea0c to main.

btw julien.fritsch, there are two unreleased 11.8.* versions in the dropdown menu of fix version/s, is that right? I picked the first one 11.8.0

Yuchen Pei added a comment - 2025-01-02 00:29 Thanks for the testing - pushed e021770667851233c8cda34e9360606adfd3ea0c to main. btw julien.fritsch , there are two unreleased 11.8.* versions in the dropdown menu of fix version/s, is that right? I picked the first one 11.8.0

MariaDB Server

Make SUBSTR(col, 1, n) = const_str sargable

Details

Description

Attachments

Attachments

Issue Links

Activity