Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-34911

Make SUBSTR(col, 1, n) = const_str sargable

Details

    Description

      Conditions looking like

      SUBSTR(str_col, 1, n) = const_str 
      

      where the prefix of str_col is compared against a constant value can be rewritten as range conditions as

      const_str_low <= str_col <= const_str_high
      

      which will improve the optimizer capabilities of building better execution plan (for example, using indexes on column str_col and applying range optimizations).
      The similar transformation is already implemented for conditions like

      str_col LIKE 'foo%'
      

      aiming for the same goal.

      Similar rewrite is implemented for DATE() and YEAR() function at MDEV-8320, and that approach may be used as a reference.

      Attachments

        Issue Links

          Activity

            ycp Yuchen Pei added a comment - - edited

            A minimal testcase:

            create table t (c varchar(4), key (c));
            insert into t values ('ddd'), ('bbcd'), ('bba'), ('aaa');
            explain select * from t where c like 'bb%';
            explain select * from t where substr(c, 1, 2) = 'bb';
            select * from t where substr(c, 1, 2) = 'bb';
            drop table t;
            

            The explains of like and substr output the following, respectively:

            explain select * from t where c like 'bb%';
            id	select_type	table	type	possible_keys	key	key_len	ref	rows	Extra
            1	SIMPLE	t	range	c	c	19	NULL	2	Using where; Using index
            explain select * from t where substr(c, 1, 2) = 'bb';
            id	select_type	table	type	possible_keys	key	key_len	ref	rows	Extra
            1	SIMPLE	t	index	NULL	c	19	NULL	4	Using where; Using index
            

            I understand that we want the second explain to output similarly to the first, if not the same. At least the type field should be range.

            Comparison of the traces of like and substr (with rr) tells me the main discriminator is the behaviour of Item::get_mm_tree(), called from SQL_SELECT::test_quick_select(). Item::get_mm_tree() is the entry point of the RangeAnalysisModule as documented in opt_range.cc.

            For like, it creates min_str and max_str as the bounds of the interval:

            // in Item_func_like::get_mm_leaf
              SEL_ARG *tree= new (param->mem_root) SEL_ARG(field, min_str, max_str);
            

            # 19 is the value of range->start_key.length later, but any sufficiently large number here can be used to make the point
            (rr) p *min_str@19
            $106 = "\000\020\000bb", '\t' <repeats 14 times>
            (rr) p *max_str@19
            $107 = "\000\020\000bb\364\217\277\277\364\217\277\277\364\217\277\277  "
            

            And the two bounds can be seen here ("bb" and "bb\364", not sure why the upper bound for a char is 244 though).

            I hacked a couple changes to get "range" in the type column in the explain output, so the next step is to get the same kind of min and max for the interval, as currently I only have

            # at 1ab674089ce upstream/bb-11.7.mdev-34911-wip [wip] MDEV-34911
            (rr) p range->start_key.length
            $28 = 19
            (rr) p *range->start_key.key@19
            $26 = "\000\002\000bb", '\000' <repeats 13 times>
            (rr) p *range->end_key.key@19
            $27 = "\000\002\000bb", '\000' <repeats 13 times>
            

            and the SELECT statement yields no result.

            To be continued...

            ycp Yuchen Pei added a comment - - edited A minimal testcase: create table t (c varchar (4), key (c)); insert into t values ( 'ddd' ), ( 'bbcd' ), ( 'bba' ), ( 'aaa' ); explain select * from t where c like 'bb%' ; explain select * from t where substr(c, 1, 2) = 'bb' ; select * from t where substr(c, 1, 2) = 'bb' ; drop table t; The explains of like and substr output the following, respectively: explain select * from t where c like 'bb%'; id select_type table type possible_keys key key_len ref rows Extra 1 SIMPLE t range c c 19 NULL 2 Using where; Using index explain select * from t where substr(c, 1, 2) = 'bb'; id select_type table type possible_keys key key_len ref rows Extra 1 SIMPLE t index NULL c 19 NULL 4 Using where; Using index I understand that we want the second explain to output similarly to the first, if not the same. At least the type field should be range . Comparison of the traces of like and substr (with rr) tells me the main discriminator is the behaviour of Item::get_mm_tree() , called from SQL_SELECT::test_quick_select() . Item::get_mm_tree() is the entry point of the RangeAnalysisModule as documented in opt_range.cc . For like , it creates min_str and max_str as the bounds of the interval: // in Item_func_like::get_mm_leaf SEL_ARG *tree= new (param->mem_root) SEL_ARG(field, min_str, max_str); # 19 is the value of range->start_key.length later, but any sufficiently large number here can be used to make the point (rr) p *min_str@19 $106 = "\000\020\000bb", '\t' <repeats 14 times> (rr) p *max_str@19 $107 = "\000\020\000bb\364\217\277\277\364\217\277\277\364\217\277\277 " And the two bounds can be seen here ("bb" and "bb\364", not sure why the upper bound for a char is 244 though). I hacked a couple changes to get "range" in the type column in the explain output, so the next step is to get the same kind of min and max for the interval, as currently I only have # at 1ab674089ce upstream/bb-11.7.mdev-34911-wip [wip] MDEV-34911 (rr) p range->start_key.length $28 = 19 (rr) p *range->start_key.key@19 $26 = "\000\002\000bb", '\000' <repeats 13 times> (rr) p *range->end_key.key@19 $27 = "\000\002\000bb", '\000' <repeats 13 times> and the SELECT statement yields no result. To be continued...

            ycp, yes, you were going in the right direction but several more steps are needed.

            Did you see the call to my_like_range() in Item_func_like::get_mm_leaf ?

            For the "%" in LIKE pattern, it does this:

                if (*ptr == w_many)                         /* '%' in SQL */
                {
                  /* Calculate length of keys */
                  *min_length= (cs->state & (MY_CS_BINSORT | MY_CS_NOPAD)) ?
                                (size_t) (min_str - min_org) :
                                res_length;
                  *max_length= res_length;
                  do
                  {
                    *min_str++= 0;
                    *max_str++= (char) cs->max_sort_char;
                  } while (min_str != min_end);
                  return 0;
            

            note the 0 and cs->max_sort_char added to min and max endpoints...

            psergei Sergei Petrunia added a comment - ycp , yes, you were going in the right direction but several more steps are needed. Did you see the call to my_like_range() in Item_func_like::get_mm_leaf ? For the "%" in LIKE pattern, it does this: if (*ptr == w_many) /* '%' in SQL */ { /* Calculate length of keys */ *min_length= (cs->state & (MY_CS_BINSORT | MY_CS_NOPAD)) ? ( size_t ) (min_str - min_org) : res_length; *max_length= res_length; do { *min_str++= 0; *max_str++= ( char ) cs->max_sort_char; } while (min_str != min_end); return 0; note the 0 and cs->max_sort_char added to min and max endpoints...
            ycp Yuchen Pei added a comment - - edited

            Hi psergei, ptal thanks

            081dd346da7 upstream/bb-11.7-mdev-34911 MDEV-34911 Sargable substr(col, 1, n) = str
            

            ycp Yuchen Pei added a comment - - edited Hi psergei , ptal thanks 081dd346da7 upstream/bb-11.7-mdev-34911 MDEV-34911 Sargable substr(col, 1, n) = str
            psergei Sergei Petrunia added a comment - - edited

            Reviewer notes to self:
            LIKE handles contractions correctly: mdev34911-play-with-contractions.sql
            this is done with opt_range asking the collation handler to produce the endpoints:

               if (field->charset()->like_range(res->ptr(), res->length(),                                                                                       
                                                escape, wild_one, wild_many,
                                                field_length,
                                                (char*) min_str + offset,
                                                (char*) max_str + offset,
                                                &min_length, &max_length))
            

            LEFT is not handled (we probably should handle it)

            The used approach doesn't extend easily to non-equalities: LEFT(x) <'foo' or SUBSTR(col, 1,10) IN (value-list) (which was requested way back in MDEV-4414 and also used in some support cases). . It's a pity we don't handle IN.

            There are two implementation options:

            A. Make sargable (like current patch does)
            B. Rewrite into col LIKE 'string-literal%' (? is it really equivalent?)

            Do we need to take into account the length of string we're comparing with?

            substring(col, 1, 2) = 'more-than-two-characters'
            

            This is always FALSE.
            But one can't just count the characters: First, mind the contractions, it could be that substring(col1, 1, 3)-> 'foo' but it is equal to 'Fooooo' in some collation.

            Second, mind the endspace. substring(col, 1,2) -> 'ab' and it's equal to 'ab ' in many collations.

            Anyhow, note that this is not equivalent to col LIKE 'more-than-two-chars%'.

            substring(col, 1, 20)= '17-characters-str'
            

            • can only be true when col='17-characters-str' , right? (again, changing to LIKE won't be equivalent).

            and finally

            substring(col, 1, 17)= '17-characters-str'
            

            means the col can be anything starting with '17-characters-str'.
            (Q: does it mean the condition is EQUIVALENT to col LIKE '17-characters-str%' ? )

            psergei Sergei Petrunia added a comment - - edited Reviewer notes to self: LIKE handles contractions correctly: mdev34911-play-with-contractions.sql this is done with opt_range asking the collation handler to produce the endpoints: if (field->charset()->like_range(res->ptr(), res->length(), escape, wild_one, wild_many, field_length, ( char *) min_str + offset, ( char *) max_str + offset, &min_length, &max_length)) LEFT is not handled (we probably should handle it) The used approach doesn't extend easily to non-equalities: LEFT(x) <'foo' or SUBSTR(col, 1,10) IN (value-list) (which was requested way back in MDEV-4414 and also used in some support cases). . It's a pity we don't handle IN. There are two implementation options: A. Make sargable (like current patch does) B. Rewrite into col LIKE 'string-literal%' (? is it really equivalent?) Do we need to take into account the length of string we're comparing with? substring (col, 1, 2) = 'more-than-two-characters' This is always FALSE. But one can't just count the characters: First, mind the contractions, it could be that substring(col1, 1, 3)-> 'foo' but it is equal to 'Fooooo' in some collation. Second, mind the endspace. substring(col, 1,2) -> 'ab' and it's equal to 'ab ' in many collations. Anyhow, note that this is not equivalent to col LIKE 'more-than-two-chars%' . substring (col, 1, 20)= '17-characters-str' can only be true when col='17-characters-str' , right? (again, changing to LIKE won't be equivalent). and finally substring (col, 1, 17)= '17-characters-str' means the col can be anything starting with '17-characters-str' . (Q: does it mean the condition is EQUIVALENT to col LIKE '17-characters-str%' ? )

            Discussed with Bar.

            Some take-aways

            LIKE doesn't do collation's equality comparisons.

            It does per-character comparison. Example:

            MariaDB [test]> set names utf8 collate utf8_danish_ci;
            Query OK, 0 rows affected (0.001 sec)
             
            MariaDB [test]> SELECT 'Ă…x'='AAX', 'Ă…x' LIKE 'AAX';
            +-------------+------------------+
            | 'Ă…x'='AAX'  | 'Ă…x' LIKE 'AAX'  |
            +-------------+------------------+
            |           1 |                0 |
            +-------------+------------------+
            1 row in set (0.001 sec)
            

            This does not necessarily invalidate the approach used in this MDEV
            but we need to look carefully.

            Contractions

            The task might be simpler if we only allow it for collations that do not have contractions.

            • utf8_general_ci doesn't have contractions.
            • utf8_unicode_ci doesn't have contractions.
            • utf8mb4_uca1400_ai_ci (the new default) does have contractions.

            It seems it's still worth trying to avoid adding the "no contractions" limitation.

            Escaping

            What if the string in question already has the pattern symbols: _,% ?
            They will need to be escaped.

            psergei Sergei Petrunia added a comment - Discussed with Bar. Some take-aways LIKE doesn't do collation's equality comparisons. It does per-character comparison. Example: MariaDB [test]> set names utf8 collate utf8_danish_ci; Query OK, 0 rows affected (0.001 sec)   MariaDB [test]> SELECT 'Ă…x'='AAX', 'Ă…x' LIKE 'AAX'; +-------------+------------------+ | 'Ă…x'='AAX' | 'Ă…x' LIKE 'AAX' | +-------------+------------------+ | 1 | 0 | +-------------+------------------+ 1 row in set (0.001 sec) This does not necessarily invalidate the approach used in this MDEV but we need to look carefully. Contractions The task might be simpler if we only allow it for collations that do not have contractions. utf8_general_ci doesn't have contractions. utf8_unicode_ci doesn't have contractions. utf8mb4_uca1400_ai_ci (the new default) does have contractions. It seems it's still worth trying to avoid adding the "no contractions" limitation. Escaping What if the string in question already has the pattern symbols: _,% ? They will need to be escaped.
            psergei Sergei Petrunia added a comment - - edited

            To the question of LIKE being different from equality:

            CREATE TABLE `t12` (
              `a` varchar(32) CHARACTER SET utf8mb3 COLLATE utf8mb3_danish_ci DEFAULT NULL,
              KEY `a` (`a`)
            );
            

            ...

            select a, a LIKE 'Ă…X', a LIKE 'AAX' from t12 order by a;
            +------+--------------+--------------+
            | a    | a LIKE 'Ă…X'  | a LIKE 'AAX' |
            +------+--------------+--------------+
            | A    |            0 |            0 |
            | Z    |            0 |            0 |
            | AAX  |            0 |            1 |
            | Ă…x   |            1 |            0 |
            

            EDIT: fixed the ranges to be for the right queries:

            LIKE generates ranges for different constants:

            analyze format=json select * from t12 where  a LIKE 'AAX' order by a;
            optimizer_trace shows:
                                  "ranges": ["(AAX) <= (a) <= (AAX)"]
            

            and

            analyze format=json select * from t12 where  a LIKE 'Ă…X' order by a;
            optimizer_trace shows:
                                  "ranges": ["(Ă…X) <= (a) <= (Ă…X)"]

             

            But the lookup code uses equality operator semantics so both queries will scan rows with a=AAX and rows with a=Ă…X.

            psergei Sergei Petrunia added a comment - - edited To the question of LIKE being different from equality: CREATE TABLE `t12` ( `a` varchar(32) CHARACTER SET utf8mb3 COLLATE utf8mb3_danish_ci DEFAULT NULL, KEY `a` (`a`) ); ... select a, a LIKE 'Ă…X', a LIKE 'AAX' from t12 order by a; +------+--------------+--------------+ | a | a LIKE 'Ă…X' | a LIKE 'AAX' | +------+--------------+--------------+ | A | 0 | 0 | | Z | 0 | 0 | | AAX | 0 | 1 | | Ă…x | 1 | 0 | EDIT: fixed the ranges to be for the right queries : LIKE generates ranges for different constants: analyze format=json select * from t12 where a LIKE 'AAX' order by a; optimizer_trace shows: "ranges": ["(AAX) <= (a) <= (AAX)"] and analyze format=json select * from t12 where a LIKE 'Ă…X' order by a; optimizer_trace shows: "ranges": ["(Ă…X) <= (a) <= (Ă…X)"]   But the lookup code uses equality operator semantics so both queries will scan rows with a=AAX and rows with a=Ă…X.
            psergei Sergei Petrunia added a comment - - edited

            How LIKE works

             key1 LIKE 'bar%'
            

            needs the first 3 characters to compare as equal to chars {b, a, r}.
            When {b,a,r} doesn't have an unfinished contraction, the lookup key is:

            min_key= {b, a, r, repeat(min_sort_char, N) }
             
            max_key= {b, a, r, repeat(max_sort_char, N)}
            

            (LIKE-HAS-WIDER-RANGE):
            this range is wider than necessary as it also covers letter combinations which are equal to "bar" but not equal to it on a character-by-character basis. (e.g. in Danish aa=Ă… but a!=Ă…).

            Suppose {b,a,r} has a potential unfinished contraction starting with 'r' (*- see below). Some made-up collation could have {r,u} -> b or {r,c}->v.

            then we need the range to include all possible contraction continuations. In our made-up collation the lookup range would need to include {b,a,v} as {b,a,r,c} would compare as {b,a,v}.

            This is done by including all characters at the potential contraction start:

              min_key= { b, a, repeat(min_sort_char) }
             
              max_key= { b, a, repeat(max_sort_char) }
            

            - for some reason assumes the unfinished contraction can only start with the last letter before the pattern. In our example, with "r", not with "a,r".

            How LEFT/SUBSTRING would work

              LEFT(key1, N) = 'bar' 
            

            This means the string of first N characters of key1 compares as equal to 'bar'.

            The "first N" part is complex to check, so we'll use a weaker condition:

            "key1 starts with a string that compares as equal string to 'bar'"
            

            This is what LIKE is doing due to LIKE-HAS-WIDER-RANGE.
            LIKE's handling of possible unfinished contractions is applicable here.

            psergei Sergei Petrunia added a comment - - edited How LIKE works key1 LIKE 'bar%' needs the first 3 characters to compare as equal to chars {b, a, r}. When {b,a,r} doesn't have an unfinished contraction, the lookup key is: min_key= {b, a, r, repeat(min_sort_char, N) }   max_key= {b, a, r, repeat(max_sort_char, N)} (LIKE-HAS-WIDER-RANGE): this range is wider than necessary as it also covers letter combinations which are equal to "bar" but not equal to it on a character-by-character basis. (e.g. in Danish aa=Ă… but a!=Ă…). Suppose {b,a,r} has a potential unfinished contraction starting with 'r' (*- see below). Some made-up collation could have {r,u} -> b or {r,c}->v. then we need the range to include all possible contraction continuations. In our made-up collation the lookup range would need to include {b,a,v} as {b,a,r,c} would compare as {b,a,v}. This is done by including all characters at the potential contraction start: min_key= { b, a, repeat(min_sort_char) }   max_key= { b, a, repeat(max_sort_char) } - for some reason assumes the unfinished contraction can only start with the last letter before the pattern. In our example, with "r", not with "a,r". How LEFT/SUBSTRING would work LEFT(key1, N) = 'bar' This means the string of first N characters of key1 compares as equal to 'bar'. The "first N" part is complex to check, so we'll use a weaker condition: "key1 starts with a string that compares as equal string to 'bar'" This is what LIKE is doing due to LIKE-HAS-WIDER-RANGE. LIKE's handling of possible unfinished contractions is applicable here.
            psergei Sergei Petrunia added a comment - - edited

            ycp ok, takeaways of the above:

            • We can use like_range() call to construct endpoints.
            • We need to take care of cases like SUBSTRING(col, 1, 4)= 'a%_b' and escape the pattern chars. my_like_range() should get (EDIT: jira swallowed backslashes )

              my_like_range( ... 'a\%\_b%' ...)
              

            Other review input:
            Please also handle LEFT(col, N). It's the same as SUBSTRING(col1, 1, N), right?

            Would it be possible to move with_sargable_substr() to be a member of
            Item_func_eq instead of Item_bool_func2?

            This check is not good:

             !strcmp(((Item_func *) args[0])->func_name(), "substr") 
            

            Please add Item_func_substr::functype() and check that.

            In Item_bool_func::get_mm_parts(), I see this:

                    if (with_sargable_substr())
                      sel_arg= get_mm_leaf_for_sargable(this, param, key_part->field, key_part, type, value);
            

            But then inside get_mm_leaf_for_sargable() there's yet another call to with_sargable_substr():

              if (item->with_sargable_substr())
                res->append("%", 1);
            

            (and there's also the third call in with_sargable_substr()).

            At least one call can be removed by adding an argument to get_mm_leaf_for_sargable().

            The name get_mm_leaf_for_sargable() doesn't look very meaningful to me.
            Do names like get_mm_leaf_for_LIKE() or get_mm_leaf_for_like_pattern() describe what it does?

            psergei Sergei Petrunia added a comment - - edited ycp ok, takeaways of the above: We can use like_range() call to construct endpoints. We need to take care of cases like SUBSTRING(col, 1, 4)= 'a%_b' and escape the pattern chars. my_like_range() should get ( EDIT : jira swallowed backslashes ) my_like_range( ... 'a\%\_b%' ...) Other review input: Please also handle LEFT(col, N). It's the same as SUBSTRING(col1, 1, N), right? Would it be possible to move with_sargable_substr() to be a member of Item_func_eq instead of Item_bool_func2? This check is not good: ! strcmp (((Item_func *) args[0])->func_name(), "substr" ) Please add Item_func_substr::functype() and check that. In Item_bool_func::get_mm_parts(), I see this: if (with_sargable_substr()) sel_arg= get_mm_leaf_for_sargable( this , param, key_part->field, key_part, type, value); But then inside get_mm_leaf_for_sargable() there's yet another call to with_sargable_substr(): if (item->with_sargable_substr()) res->append( "%" , 1); (and there's also the third call in with_sargable_substr()). At least one call can be removed by adding an argument to get_mm_leaf_for_sargable(). The name get_mm_leaf_for_sargable() doesn't look very meaningful to me. Do names like get_mm_leaf_for_LIKE() or get_mm_leaf_for_like_pattern() describe what it does?
            ycp Yuchen Pei added a comment - - edited

            Hi psergei, thanks for the comments. ptal at the revised patch:

            57a70ffd077 bb-11.8-mdev-34911 MDEV-34911 Sargable substr(col, 1, n) = str
            

            Addressing the review comments:

            • escape wildcards: done. Also escaping backslashes, and added testcases accordingly.
            • also handle LEFT: done. Also added testcases for coverage.
            • move with_sargable_substr to be a member of Item_func_eq instead of Item_bool_func2: I don't think that's necessary. The divergence between our conditions and the LIKE condition happens at get_mm_leaf which is called from get_mm_parts. And there's no Item_func_eq::get_mm_parts - it goes to Item_bool_func::get_mm_parts. So if we make the check a Item_func_eq method, we'll need to check the functype in Item_bool_func::get_mm_parts as well as casting to Item_func_eq on a match. The same goes to the other invocation in Item_bool_func2::add_key_fields_optimize_op().
            • add Item_func_substr::functype() and check that: done. Also did the same for Item_func_left
            • remove call to with_sargable_substr() from get_mm_leaf_for_sargable(): done. Now checking that the type is not LIKE which implies taking the path of with_sargable_substr() returning true.
            • rename get_mm_leaf_for_sargable: done.
            ycp Yuchen Pei added a comment - - edited Hi psergei , thanks for the comments. ptal at the revised patch: 57a70ffd077 bb-11.8-mdev-34911 MDEV-34911 Sargable substr(col, 1, n) = str Addressing the review comments: escape wildcards: done. Also escaping backslashes, and added testcases accordingly. also handle LEFT: done. Also added testcases for coverage. move with_sargable_substr to be a member of Item_func_eq instead of Item_bool_func2: I don't think that's necessary. The divergence between our conditions and the LIKE condition happens at get_mm_leaf which is called from get_mm_parts. And there's no Item_func_eq::get_mm_parts - it goes to Item_bool_func::get_mm_parts. So if we make the check a Item_func_eq method, we'll need to check the functype in Item_bool_func::get_mm_parts as well as casting to Item_func_eq on a match. The same goes to the other invocation in Item_bool_func2::add_key_fields_optimize_op(). add Item_func_substr::functype() and check that: done. Also did the same for Item_func_left remove call to with_sargable_substr() from get_mm_leaf_for_sargable(): done. Now checking that the type is not LIKE which implies taking the path of with_sargable_substr() returning true. rename get_mm_leaf_for_sargable: done.

            Review input provided on Slack today

            psergei Sergei Petrunia added a comment - Review input provided on Slack today
            ycp Yuchen Pei added a comment -

            Hi psergei, ptal at the following revised patch, thanks:

            7c75761ebbd upstream/bb-11.8-mdev-34911 MDEV-34911 Sargable substr(col, 1, n) = str
            

            Addressed review comments:

            • [X] move with_sargable_substr() to be called later in Item_bool_func2_with_rev::get_mm_tree()
            • [X] don't call with_sargable_substr repeatedly in the Item_bool_func::get_mm_parts() loop
            • [X] (note) substr(a, 1, unknown) = 'ab' is ok, because it is a necessary condition. the results will be further filtered correctly. but add a test
            • [X] views not handled, use real_item and add a test
            • [X] fix stack overrun in escaping string in get_mm_leaf_for_LIKE(), and use more descriptive var names for the pointers, and add a testcase
            • [X] item_bool_func::with_sargable_substr() doesn't mention LEFT in the comment
            • [X] less vcalls in item_bool_func::with_sargable_substr(). use caching.
            • [X] (note) Item_func::BITMAP_LIKE is ok - added a comment
            ycp Yuchen Pei added a comment - Hi psergei , ptal at the following revised patch, thanks: 7c75761ebbd upstream/bb-11.8-mdev-34911 MDEV-34911 Sargable substr(col, 1, n) = str Addressed review comments: [X] move with_sargable_substr() to be called later in Item_bool_func2_with_rev::get_mm_tree() [X] don't call with_sargable_substr repeatedly in the Item_bool_func::get_mm_parts() loop [X] (note) substr(a, 1, unknown) = 'ab' is ok, because it is a necessary condition. the results will be further filtered correctly. but add a test [X] views not handled, use real_item and add a test [X] fix stack overrun in escaping string in get_mm_leaf_for_LIKE(), and use more descriptive var names for the pointers, and add a testcase [X] item_bool_func::with_sargable_substr() doesn't mention LEFT in the comment [X] less vcalls in item_bool_func::with_sargable_substr(). use caching. [X] (note) Item_func::BITMAP_LIKE is ok - added a comment
            psergei Sergei Petrunia added a comment - - edited

            ycp, thanks. I think we're getting closer to finish but there are some steps to make still.

            Review input part 1

            Why did the patch require changes in Spider? Do I understand correctly that it's just some extra test coverage? Please add a note to the commit comment about this

            This

            +# strangely, the results here differ from the results if we extract
            +# this testing block to a standalone mtr test, in the way that ce and
            +# cé are swapped (ce before cé here, and ce after cé if standalone),
            +# but given the results are the same between like and substr, it is
            +# not a concern here
            

            needs to be gone before the push.
            I've investigated the cause.
            In func_str.test, %cé is 3 characters:

            +create table t2 like t1;
            +insert into t2 values('%cé');
            +insert into t2 values('%ce');
            +select hex(s1) from t2;
            +hex(s1)
            +0000002500000063000000E9
            +000000250000006300000065
            

            while in a separate .test file, it is 4 characters:

            +create table t2 like t1;
            +insert into t2 values('%cé');
            +insert into t2 values('%ce');
            +select hex(s1) from t2;
            +hex(s1)
            +0000002500000063000000C3000000A9
            +000000250000006300000065
            

            THD::make_string_literal() converts the literals from character_set_client.
            The test case sets @@collation_connection=utf32_czech_ci, but the server doesn't support the client sending incoming data in utf32, so character_set_client remains:

            • latin1 in the separate test file.
            • utf8 in func_str.test because a test above does SET NAMES utf8.

            So, these are the charsets that the server gets the data from client.
            Then, it honors the @@collation_connection=utf32_czech_ci setting and converts literals to that charset/collation.

            Conversion of "cé" from latin1 produces two characters, while conversion from utf8 produces one character.
            I assume this is correct. We just need to remove the comment.

            psergei Sergei Petrunia added a comment - - edited ycp , thanks. I think we're getting closer to finish but there are some steps to make still. Review input part 1 Why did the patch require changes in Spider? Do I understand correctly that it's just some extra test coverage? Please add a note to the commit comment about this This +# strangely, the results here differ from the results if we extract +# this testing block to a standalone mtr test, in the way that ce and +# cé are swapped (ce before cé here, and ce after cé if standalone), +# but given the results are the same between like and substr, it is +# not a concern here needs to be gone before the push. I've investigated the cause. In func_str.test, %cé is 3 characters: +create table t2 like t1; +insert into t2 values('%cé'); +insert into t2 values('%ce'); +select hex(s1) from t2; +hex(s1) +0000002500000063000000E9 +000000250000006300000065 while in a separate .test file, it is 4 characters: +create table t2 like t1; +insert into t2 values('%cé'); +insert into t2 values('%ce'); +select hex(s1) from t2; +hex(s1) +0000002500000063000000C3000000A9 +000000250000006300000065 THD::make_string_literal() converts the literals from character_set_client . The test case sets @@collation_connection=utf32_czech_ci, but the server doesn't support the client sending incoming data in utf32, so character_set_client remains: latin1 in the separate test file. utf8 in func_str.test because a test above does SET NAMES utf8. So, these are the charsets that the server gets the data from client. Then, it honors the @@collation_connection=utf32_czech_ci setting and converts literals to that charset/collation. Conversion of "cé" from latin1 produces two characters, while conversion from utf8 produces one character. I assume this is correct. We just need to remove the comment.

            Review input part 2

            It looks like moving escaping into a separate function was a good idea:

            commit bc257b1b9edf330348464e469cca8a763c9806c6 (HEAD -> bb-11.8-mdev-34911-plus-cleanup, origin/bb-11.8-mdev-34911-plus-cleanup, bb-11.8-mdev-34911)
            Author: Sergei Petrunia <sergey@mariadb.com>
            Date:   Thu Nov 21 21:45:12 2024 +0200
             
                Move escaping into escape_like_characters(), add comments
            

            I'll do a bit more commenting /cleanups.

            psergei Sergei Petrunia added a comment - Review input part 2 It looks like moving escaping into a separate function was a good idea: commit bc257b1b9edf330348464e469cca8a763c9806c6 (HEAD -> bb-11.8-mdev-34911-plus-cleanup, origin/bb-11.8-mdev-34911-plus-cleanup, bb-11.8-mdev-34911) Author: Sergei Petrunia <sergey@mariadb.com> Date: Thu Nov 21 21:45:12 2024 +0200   Move escaping into escape_like_characters(), add comments I'll do a bit more commenting /cleanups.
            psergei Sergei Petrunia added a comment - lstartseva , testing can start on this branch: https://github.com/MariaDB/server/commits/bb-11.8-mdev-34911-plus-cleanup
            ycp Yuchen Pei added a comment -

            Hi psergei,

            > Why did the patch require changes in Spider? Do I understand correctly that it's just some extra test coverage? Please add a note to the commit comment about this

            It's because the addition of LEFT_FUNC and SUBSTR_FUNC results in a different code path in spider group by handler query construction. Added a note to the commit message

            > I assume this is correct. We just need to remove the comment.

            Thanks for the explanation. That makes sense, see also https://mariadb.com/kb/en/setting-character-sets-and-collations/#examples. Removed the comment

            > It looks like moving escaping into a separate function was a good idea:
            > ...
            > I'll do a bit more commenting /cleanups.

            Thanks. This sounds like a WIP on your part, so I'll wait until you are done before incorporating your suggestions.

            Updated patch:

            158b9263e52 upstream/bb-11.8-mdev-34911 MDEV-34911 Sargable substr(col, 1, n) = str

            ycp Yuchen Pei added a comment - Hi psergei , > Why did the patch require changes in Spider? Do I understand correctly that it's just some extra test coverage? Please add a note to the commit comment about this It's because the addition of LEFT_FUNC and SUBSTR_FUNC results in a different code path in spider group by handler query construction. Added a note to the commit message > I assume this is correct. We just need to remove the comment. Thanks for the explanation. That makes sense, see also https://mariadb.com/kb/en/setting-character-sets-and-collations/#examples . Removed the comment > It looks like moving escaping into a separate function was a good idea: > ... > I'll do a bit more commenting /cleanups. Thanks. This sounds like a WIP on your part, so I'll wait until you are done before incorporating your suggestions. Updated patch: 158b9263e52 upstream/bb-11.8-mdev-34911 MDEV-34911 Sargable substr(col, 1, n) = str

            ycp, yes. Now I've finished my cleanup suggestions here.

            commit e7d43688844904dfd9027b2328693823f9fa67e9 (HEAD -> bb-11.8-mdev-34911-plus-cleanup, origin/bb-11.8-mdev-34911-plus-cleanup, bb-11.8-mdev-34911)
            Author: Sergei Petrunia <sergey@mariadb.com>
            Date:   Mon Nov 25 10:30:01 2024 +0200
             
                Move SargableLeft code to opt_sargable_left.cc
             
            commit ab93436b99aef0c1ddcd73d6001610b3f0bbd90a
            

            please take a look.

            psergei Sergei Petrunia added a comment - ycp , yes. Now I've finished my cleanup suggestions here. commit e7d43688844904dfd9027b2328693823f9fa67e9 (HEAD -> bb-11.8-mdev-34911-plus-cleanup, origin/bb-11.8-mdev-34911-plus-cleanup, bb-11.8-mdev-34911) Author: Sergei Petrunia <sergey@mariadb.com> Date: Mon Nov 25 10:30:01 2024 +0200   Move SargableLeft code to opt_sargable_left.cc   commit ab93436b99aef0c1ddcd73d6001610b3f0bbd90a please take a look.
            ycp Yuchen Pei added a comment -

            psergei, thanks. I've made some small adjustment on top:

            babbea6cc71 upstream/bb-11.8-mdev-34911-plus-cleanup Some adjustment on top of the cleanup of MDEV-34991 commits
            

            Squashed all commits into one and added you as a co-author:

            2bf8d31baeb upstream/bb-11.8-mdev-34911 MDEV-34911 Sargable substr(col, 1, n) = str
            

            lstartseva, please use the squashed commit above for testing, thanks.

            ycp Yuchen Pei added a comment - psergei , thanks. I've made some small adjustment on top: babbea6cc71 upstream/bb-11.8-mdev-34911-plus-cleanup Some adjustment on top of the cleanup of MDEV-34991 commits Squashed all commits into one and added you as a co-author: 2bf8d31baeb upstream/bb-11.8-mdev-34911 MDEV-34911 Sargable substr(col, 1, n) = str lstartseva , please use the squashed commit above for testing, thanks.
            lstartseva Lena Startseva added a comment - - edited

            ycp, there is an interesting behavior for CHARSET=swe7 : for type varchar() and TEXT access_type is not range. But for type BLOB everything is good and access_type is range

            Testcase (varchar):

            create table t (c varchar(5), key (c)) DEFAULT CHARSET=swe7;
            insert into t values ('ddd'), ('bbcd'), ('bba'), ('b%_cd'), ('aaa'), ('b'), (null);
            explain format=json select * from t where  substr(c, 1, 4) = 'b%_c';
            drop table t;
            

            Result: access_type is index

            EXPLAIN
            {
              "query_block": {
                "select_id": 1,
                "cost": 0.006758432,
                "nested_loop": [
                  {
                    "table": {
                      "table_name": "t",
                      "access_type": "index",
                      "possible_keys": ["c"],
                      "key": "c",
                      "key_length": "8",
                      "used_key_parts": ["c"],
                      "loops": 1,
                      "rows": 7,
                      "cost": 0.006758432,
                      "filtered": 100,
                      "attached_condition": "substr(t.c,1,4) = 'b%_c'",
                      "using_index": true
                    }
                  }
                ]
              }
            }
            

            Testcase (TEXT):

            create table t (c TEXT, key (c)) DEFAULT CHARSET=swe7;
            insert into t values ('ddd'), ('bbcd'), ('bba'), ('b%_cd'), ('aaa'), ('b'), (null);
            explain format=json select * from t where  substr(c, 1, 4) = 'b%_c';
            drop table t;
            

            Result: access_type is ALL

            EXPLAIN
            {
              "query_block": {
                "select_id": 1,
                "cost": 0.011130435,
                "nested_loop": [
                  {
                    "table": {
                      "table_name": "t",
                      "access_type": "ALL",
                      "possible_keys": ["c"],
                      "loops": 1,
                      "rows": 7,
                      "cost": 0.011130435,
                      "filtered": 100,
                      "attached_condition": "substr(t.c,1,4) = 'b%_c'"
                    }
                  }
                ]
              }
            }
            

            lstartseva Lena Startseva added a comment - - edited ycp , there is an interesting behavior for CHARSET=swe7 : for type varchar() and TEXT access_type is not range. But for type BLOB everything is good and access_type is range Testcase (varchar): create table t (c varchar (5), key (c)) DEFAULT CHARSET=swe7; insert into t values ( 'ddd' ), ( 'bbcd' ), ( 'bba' ), ( 'b%_cd' ), ( 'aaa' ), ( 'b' ), ( null ); explain format=json select * from t where substr(c, 1, 4) = 'b%_c' ; drop table t; Result: access_type is index EXPLAIN { "query_block" : { "select_id" : 1, "cost" : 0.006758432, "nested_loop" : [ { "table" : { "table_name" : "t" , "access_type" : "index" , "possible_keys" : [ "c" ], "key" : "c" , "key_length" : "8" , "used_key_parts" : [ "c" ], "loops" : 1, "rows" : 7, "cost" : 0.006758432, "filtered" : 100, "attached_condition" : "substr(t.c,1,4) = 'b%_c'" , "using_index" : true } } ] } } Testcase (TEXT): create table t (c TEXT, key (c)) DEFAULT CHARSET=swe7; insert into t values ( 'ddd' ), ( 'bbcd' ), ( 'bba' ), ( 'b%_cd' ), ( 'aaa' ), ( 'b' ), ( null ); explain format=json select * from t where substr(c, 1, 4) = 'b%_c' ; drop table t; Result: access_type is ALL EXPLAIN { "query_block" : { "select_id" : 1, "cost" : 0.011130435, "nested_loop" : [ { "table" : { "table_name" : "t" , "access_type" : "ALL" , "possible_keys" : [ "c" ], "loops" : 1, "rows" : 7, "cost" : 0.011130435, "filtered" : 100, "attached_condition" : "substr(t.c,1,4) = 'b%_c'" } } ] } }
            ycp Yuchen Pei added a comment -

            If we do at the base commit eff9c198e32a828f610b93fad3a0f0eb63b3ded2

            create table t (c varchar(5), key (c)) DEFAULT CHARSET=swe7;
            insert into t values ('ddd'), ('bbcd'), ('bba'), ('b%_cd'), ('aaa'), ('b'), (null);
            explain format=json select * from t where  substr(c, 1, 4) = 'b%_c';
            explain format=json select * from t where  c like 'b\\%_c%';
            drop table t;

            we get

            CURRENT_TEST: main.mdev_34911_like
            mysqltest: At line 4: query 'explain format=json select * from t where  c like 'b\\%_c%'' failed: ER_CANT_AGGREGATE_2COLLATIONS (1267): Illegal mix of collations (swe7_swedish_ci,IMPLICIT) and (latin1_swedish_ci,COERCIBLE) for operation 'like'

            It could be MDEV-35041, so I will need to test on the freshly merged main branch.

            ycp Yuchen Pei added a comment - If we do at the base commit eff9c198e32a828f610b93fad3a0f0eb63b3ded2 create table t (c varchar (5), key (c)) DEFAULT CHARSET=swe7; insert into t values ( 'ddd' ), ( 'bbcd' ), ( 'bba' ), ( 'b%_cd' ), ( 'aaa' ), ( 'b' ), ( null ); explain format=json select * from t where substr(c, 1, 4) = 'b%_c' ; explain format=json select * from t where c like 'b\\%_c%' ; drop table t; we get CURRENT_TEST: main.mdev_34911_like mysqltest: At line 4: query 'explain format=json select * from t where c like 'b\\%_c%'' failed: ER_CANT_AGGREGATE_2COLLATIONS (1267): Illegal mix of collations (swe7_swedish_ci,IMPLICIT) and (latin1_swedish_ci,COERCIBLE) for operation 'like' It could be MDEV-35041 , so I will need to test on the freshly merged main branch.
            ycp Yuchen Pei added a comment - - edited

            Testing at the current main f0961301c81c7f5b009c012c076abc326b203b4a, I'm getting the same "Illegal mix of collations" error, despite it containing the MDEV-35041 patch.

            So I cannot easily compare LIKE with SUBSTR in this case.

            However, if we do explain format=json select * from t where c like 'b%_c%';, then the error disappears and the output has access_type range. So something fishy is happening with the escape here

            ycp Yuchen Pei added a comment - - edited Testing at the current main f0961301c81c7f5b009c012c076abc326b203b4a, I'm getting the same "Illegal mix of collations" error, despite it containing the MDEV-35041 patch. So I cannot easily compare LIKE with SUBSTR in this case. However, if we do explain format=json select * from t where c like 'b%_c%'; , then the error disappears and the output has access_type range. So something fishy is happening with the escape here
            ycp Yuchen Pei added a comment - - edited

            I did some investigation. It failed at escaping, more specifically when appending a backslash:

            // escape_like_characters:
                  if ((ret= my_ci_wc_mb(cs, (my_wc_t) '\\', dst, dst_end)) <= 0)
                    return true; /* No space - no LIKE optimize */
            

            the wc to mb function resolves to my_wc_mb_8bit

            int my_wc_mb_8bit(CHARSET_INFO *cs,my_wc_t wc,
            		  uchar *str,
            		  uchar *end)
            {
              MY_UNI_IDX *idx;
             
              if (str >= end)
                return MY_CS_TOOSMALL;
              
              for (idx=cs->tab_from_uni; idx->tab ; idx++)
              {
                if (idx->from <= wc && idx->to >= wc)
                {
                  str[0]= idx->tab[wc - idx->from];
                  return (!str[0] && wc) ? MY_CS_ILUNI : 1; // returns MY_CS_ILUNI here
                }
              }
              return MY_CS_ILUNI;
            }
            

            The cs->tab_from_uni->tab has the following arrangement:

            (rr) p *cs->tab_from_uni->tab@128
            $51 = "\000\001\002\003\004\005\006\a\b\t\n\v\f\r\016\017\020\021\022\023\024\025\026\027\030\031\032\033\034\035\036\037 !\"#$%&'()*+,-./0123456789:;<=>?\000ABCDEFGHIJKLMNOPQRSTUVWXYZ\000\000\000\000_\000abcdefghijklmnopqrstuvwxyz\000\000\000\000"
            

            and unlike ascii, the 92nd element is not backslash, but a \000.

            ycp Yuchen Pei added a comment - - edited I did some investigation. It failed at escaping, more specifically when appending a backslash: // escape_like_characters: if ((ret= my_ci_wc_mb(cs, (my_wc_t) '\\' , dst, dst_end)) <= 0) return true ; /* No space - no LIKE optimize */ the wc to mb function resolves to my_wc_mb_8bit int my_wc_mb_8bit(CHARSET_INFO *cs,my_wc_t wc, uchar *str, uchar *end) { MY_UNI_IDX *idx;   if (str >= end) return MY_CS_TOOSMALL; for (idx=cs->tab_from_uni; idx->tab ; idx++) { if (idx->from <= wc && idx->to >= wc) { str[0]= idx->tab[wc - idx->from]; return (!str[0] && wc) ? MY_CS_ILUNI : 1; // returns MY_CS_ILUNI here } } return MY_CS_ILUNI; } The cs->tab_from_uni->tab has the following arrangement: (rr) p *cs->tab_from_uni->tab@128 $51 = "\000\001\002\003\004\005\006\a\b\t\n\v\f\r\016\017\020\021\022\023\024\025\026\027\030\031\032\033\034\035\036\037 !\"#$%&'()*+,-./0123456789:;<=>?\000ABCDEFGHIJKLMNOPQRSTUVWXYZ\000\000\000\000_\000abcdefghijklmnopqrstuvwxyz\000\000\000\000" and unlike ascii, the 92nd element is not backslash, but a \000 .
            ycp Yuchen Pei added a comment - - edited

            The "good" news is that the corresponding

            c like 'b\\%_c%'
            

            failed for the same reason, i.e. my_wc_mb_8bit failing to convert a backslash to the swe7 charset. So I think it is ok and can be regarded as intended behaviour that this case does not result in a sargable substr. psergei

            In below the returned conv is NULL due to the conversion failure of the backslash char, which leads to the error reporting by my_coll_agg_error(args, nargs, fname.str, item_sep);

            // Type_std_attributes::agg_item_set_converter:
              for (i= 0, arg= args; i < nargs; i++, arg+= item_sep)
              {
                Item* conv= (*arg)->safe_charset_converter(thd, coll.collation); // the returned conv == NULL
                if (conv == *arg)
                  continue;
             
                if (!conv)
                {
                  if (nargs >=2 && nargs <= 3)
                  {
                    /* restore the original arguments for better error message */
                    args[0]= safe_args[0];
                    args[item_sep]= safe_args[1];
                  }
                  if (nargs == 1 && single_err)
                  {
                    /*
                      Use *single_err to produce an error message mentioning two
                      collations.
                    */
                    if (single_err->first)
                      my_coll_agg_error(args[0]->collation, single_err->coll, fname.str);
                    else
                      my_coll_agg_error(single_err->coll, args[0]->collation, fname.str);
                  }
                  else
                    my_coll_agg_error(args, nargs, fname.str, item_sep);
                  return TRUE;
                }
            

            Trace:

            my_wc_mb_8bit > my_convert_using_func > my_convert > copy_and_convert > String::copy > String::copy > Item_string::Item_string > Item::const_charset_converter > Item::const_charset_converter > Item_string::safe_charset_converter > Type_std_attributes::agg_item_set_converter > Type_std_attributes::agg_arg_charsets > Type_std_attributes::agg_arg_charsets_for_comparison > Item_func_or_sum::agg_arg_charsets_for_comparison > Item_func_like::fix_length_and_dec > Item_func::fix_fields > Item_func_like::fix_fields > Item::fix_fields_if_needed > Item::fix_fields_if_needed_for_scalar > Item::fix_fields_if_needed_for_bool > setup_conds > setup_without_group > JOIN::prepare > mysql_select

            ycp Yuchen Pei added a comment - - edited The "good" news is that the corresponding c like 'b\\%_c%' failed for the same reason, i.e. my_wc_mb_8bit failing to convert a backslash to the swe7 charset. So I think it is ok and can be regarded as intended behaviour that this case does not result in a sargable substr. psergei In below the returned conv is NULL due to the conversion failure of the backslash char, which leads to the error reporting by my_coll_agg_error(args, nargs, fname.str, item_sep); // Type_std_attributes::agg_item_set_converter: for (i= 0, arg= args; i < nargs; i++, arg+= item_sep) { Item* conv= (*arg)->safe_charset_converter(thd, coll.collation); // the returned conv == NULL if (conv == *arg) continue ;   if (!conv) { if (nargs >=2 && nargs <= 3) { /* restore the original arguments for better error message */ args[0]= safe_args[0]; args[item_sep]= safe_args[1]; } if (nargs == 1 && single_err) { /* Use *single_err to produce an error message mentioning two collations. */ if (single_err->first) my_coll_agg_error(args[0]->collation, single_err->coll, fname.str); else my_coll_agg_error(single_err->coll, args[0]->collation, fname.str); } else my_coll_agg_error(args, nargs, fname.str, item_sep); return TRUE; } Trace: my_wc_mb_8bit > my_convert_using_func > my_convert > copy_and_convert > String::copy > String::copy > Item_string::Item_string > Item::const_charset_converter > Item::const_charset_converter > Item_string::safe_charset_converter > Type_std_attributes::agg_item_set_converter > Type_std_attributes::agg_arg_charsets > Type_std_attributes::agg_arg_charsets_for_comparison > Item_func_or_sum::agg_arg_charsets_for_comparison > Item_func_like::fix_length_and_dec > Item_func::fix_fields > Item_func_like::fix_fields > Item::fix_fields_if_needed > Item::fix_fields_if_needed_for_scalar > Item::fix_fields_if_needed_for_bool > setup_conds > setup_without_group > JOIN::prepare > mysql_select

            Testing done.
            MDEV-35564 is marked as not a bug, so this task is ok to push

            lstartseva Lena Startseva added a comment - Testing done. MDEV-35564 is marked as not a bug, so this task is ok to push
            ycp Yuchen Pei added a comment -

            Thanks for the testing - pushed e021770667851233c8cda34e9360606adfd3ea0c to main.

            btw julien.fritsch, there are two unreleased 11.8.* versions in the dropdown menu of fix version/s, is that right? I picked the first one 11.8.0

            ycp Yuchen Pei added a comment - Thanks for the testing - pushed e021770667851233c8cda34e9360606adfd3ea0c to main. btw julien.fritsch , there are two unreleased 11.8.* versions in the dropdown menu of fix version/s, is that right? I picked the first one 11.8.0

            People

              ycp Yuchen Pei
              oleg.smirnov Oleg Smirnov
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.