Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-32113

utf8mb3_key_col=utf8mb4_value cannot be used for ref access

Details

    Description

      Contents

      1.Task setting
      2. From utf8mb3_val=utf8mb4_val to index lookups
      3. NEW: Can we handle non-equality comparisons too?
      4. How to implement this
      4.1 Suggested solution #1: rewrite
      4.2 Suggested solution #2: change the way string comparisons are done
      4.3 Suggested solution #3: pseudo-charset
      5. Other stuff
      5.1 The optimization must be switchable
      

      1. Task setting

      Consider this example:

      create table t1 (
        a varchar(32) collate utf8mb3_general_ci,
        key(a)
      );
      insert into t1 select seq from seq_1_to_10000;
       
      create table t2 (
        a varchar(32) collate utf8mb4_general_ci,
        key(a)
      );
      insert into t2 select seq from seq_1_to_10000;
      

      The optimizer is able to use the index on t2.a for ref access:

      explain select * from t1 straight_join t2 on t1.a=t2.a;
      +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
      | id   | select_type | table | type  | possible_keys | key  | key_len | ref  | rows  | Extra                    |
      +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
      |    1 | SIMPLE      | t1    | index | NULL          | a    | 99      | NULL | 10000 | Using index              |
      |    1 | SIMPLE      | t2    | ref   | a             | a    | 131     | func | 1     | Using where; Using index |
      +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
      

      But not the index on t1.a:

      explain select * from t2 straight_join t1 on t1.a=t2.a;
      +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
      | id   | select_type | table | type  | possible_keys | key  | key_len | ref  | rows  | Extra                                                        |
      +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
      |    1 | SIMPLE      | t2    | index | a             | a    | 131     | NULL | 10000 | Using index                                                  |
      |    1 | SIMPLE      | t1    | index | NULL          | a    | 99      | NULL | 10000 | Using where; Using index; Using join buffer (flat, BNL join) |
      +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
      

      This is because internally t1.a=t2.a is converted into:

      convert(t1.a using utf8mb4) = t2.a
      

      This MDEV is about making it possible to use ref access on t2.a.

      2. From utf8mb3_val=utf8mb4_val to index lookups

      When we have

      utf8mb3_key_col = utf8mb4_value
      

      the normal way to compare is to convert the mb3 value into mb4 and then do the comparison.

      But what if we want to use

      INDEX(utf8mb3_key_col) 
      

      to find the matching value?

      UTF8-MB4 consists of

      • BMP (Basic Multi-lingual Plane) characters. These can be represented in UTF8MB3.
      • Characters from other Planes, named Supplementary or Astral planes.

      Consider a lookup value that is in UTF8MB4.

      • If the characters are in BMP, we can just make the lookup.
      • What if there are characters outside BMP? They may or may not compare as equal to characters in BMP. It depends on the collation.

      In utf8mb4_general_ci, all characters outside the BMP compare as equal to each other
      and equal to the U+FFFD Replacement Character, which is in BMP (In the code, search for MY_CS_REPLACEMENT_CHARACTER).

      A demo

      create table t12 (
        mb3 varchar(10) collate utf8mb3_general_ci,
        mb4 varchar(10) collate utf8mb4_general_ci
      );
       
      -- mb3 has the Replacement Character. 
      -- mb4 has some non-BMP character.
      insert into t12 values (
       _utf8mb3 0xEFBFBD,
       _utf8mb4 0xF09D8C86);
      

      set names utf8mb4;
      MariaDB [test]> select mb3, hex(mb3), mb4, hex(mb4) from t12;
      +------+----------+------+----------+
      | mb3  | hex(mb3) | mb4  | hex(mb4) |
      +------+----------+------+----------+
      | �    | EFBFBD   | 𝌆     | F09D8C86 |
      +------+----------+------+----------+
      1 row in set (0.001 sec)
      

      Non-bmp character is equal to the Replacement Character:

      MariaDB [test]> select mb3=mb4 from t12;
      +---------+
      | mb3=mb4 |
      +---------+
      |       1 |
      +---------+
      1 row in set (0.004 sec)
      

      If one converts UTF8MB4 to MB3, MariaDB will replace non-BMP characters with '?' (Question Mark), NOT with replacement character:

      MariaDB [test]> select convert(mb4 using utf8mb3) from t12;
      +----------------------------+
      | convert(mb4 using utf8mb3) |
      +----------------------------+
      | ?                          |
      +----------------------------+
      1 row in set, 1 warning (0.004 sec)
       
      MariaDB [test]> show warnings;
      +---------+------+------------------------------------------------------------+
      | Level   | Code | Message                                                    |
      +---------+------+------------------------------------------------------------+
      | Warning | 1977 | Cannot convert 'utf8mb4' character 0xF09D8C86 to 'utf8mb3' |
      +---------+------+------------------------------------------------------------+
      

      3. (NEW) NEW: Can we handle non-equality comparisons too?

      It seems like we could. If all non-BMP characters are equal between each other and are equal to one BMP character, why not?

      4. How to implement this

      4.1 Suggested solution #1: rewrite

      So, currently, when one calls fix_fields() for

        mb3_key_col= mb4_col 
      

      it will be converted into

        convert(mb3_key_col using utf8mb4) = mb4_col
      

      The optimizer only considers ref access for column references, that is, for

         mb3_key_col= ...
      

      The idea is to add a rewritten equality, that is, convert "mb3_key_col=mb4_col" into

        convert(mb3_key_col using utf8mb4) = mb4_col -- AS BEFORE
        AND 
        mb3_key_col= convert_with_replacement(mb4_col, replacement_char) -- ADDED PART
      

      The convert_with_replacement() is like CONVERT(), except that it

      • replaces non-converted symbols with the Replacement Character.
      • doesn't emit any warnings.

      Advantage of this approach:

      • the fix is not intrusive. We don't need to change the way string comparisons are done.
      • the fix covers all optimizer. Range optimizer will be able to handle utf8mb3_col= utf8mb4_const.

      Disadvantages:

      1. One condition becomes two. We can work around this by making make_join_select() ignore the "ADDED PART".
      2. What if "mb4_col" is a complex expression. How do we make a copy of it to use in the ADDED PART? (Can we use Item_direct_ref here?)

      *See below for link to an almost-finished patch. This solution receives pushback because "it's a hack"*

      4.2 Suggested solution #2: change the way string comparisons are done

      Currently, fix_fields() code will inject CONVERT() call if character sets do not match.
      The idea is not not inject it, and instead allow string comparisons to use different character sets.
      They would dynamically check if the charset is the right one and convert if it is not..

      *See below for link to a patch with attempt to implement it*
      The problem is that we cannot set the conversion code to do "Charset Narrowing" whenever we're converting from MB4 to MB3. There are cases where the server is converting from MB4 to MB3 and replaces non-BMP characters with '?' (question mark). We need to keep this behavior.

      I've made an attempt to add hooks to only perform Charset Narrowing when we're producing an index lookup value.
      I had to add hooks in 4 places already:

      There are multiple places where conversion is done:
      1. CharsetNarrowing#1: class store_key_field. Uses Field::do_field_string.
      2. CharsetNarrowing#2: class store_key_item, copy_inner()
      3. CharsetNarrowing#3: class store_key_item_const
      4. CharsetNarrowing#4: in Field_str::get_mm_leaf()
      ...
      

      The patch seems to be very intrusive and risky.

      4.3 Suggested solution #3: pseudo-charset

      Basic idea: So we're comparing

      mb3col = mb4col

      We need to handle two operations:

      1. Compare the values. We can just assume that mb3col has utf8mb4 content in it.

      const char *mb3col_data= ...; // This is in UTF8MB3
      const char *mb4col_data= ...; // This is in UTF8mb4.
      const char *left_value_in_mb4= utf8mb3_data; // Ok
      res= compare(left_value_in_mb4, mb4col_data, &my_charset_utf8mb4_general_ci);

      2. Take MB4 data and create MB3 index lookup key from it.

      Narrowing is done by using a special "Narrowing charset":

      We take the source utf8mb4_general_ci but change mb_wc function to
      replace non-Bmp characters with MY_CS_REPLACEMENT_CHAR.

      Todo: is there really a single place we could replace the charset?

      5. Other stuff

      5.1 The optimization must be switchable

      One must be able to turn the new functionality on and off. Possible ways to do it:

      • optimizer_switch flag.
      • old_mode variable.

      Possible issues with optimizer_switch flag:

      • we only want the optimization to be switchable in the stable versions. New versions should have it always ON, there is no need for a switch.
      • we might need to check the flag on prepare stage (where the current code adds CONVERT(...} call). Optimizer stuff at the prepare stage is not good.

      Attachments

        Issue Links

          Activity

            psergei Sergei Petrunia created issue -
            psergei Sergei Petrunia made changes -
            Field Original Value New Value
            Description Consider this example:

            {code:sql}
            create table t1 (
              a varchar(32) collate utf8mb3_general_ci,
              key(a)
            );
            insert into t1 select seq from seq_1_to_10000;

            create table t2 (
              a varchar(32) collate utf8mb4_general_ci,
              key(a)
            );
            insert into t2 select seq from seq_1_to_10000;
            {code}

            The optimizer is able to use the index on t2.a for ref access:

            {code}
            explain select * from t1 straight_join t2 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t2 | ref | a | a | 131 | func | 1 | Using where; Using index |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            {code}

            But not the index on t1.a:

            explain select * from t2 straight_join t1 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | 1 | SIMPLE | t2 | index | a | a | 131 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using where; Using index; Using join buffer (flat, BNL join) |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+

            This is because internally t1.a=t2.a is converted into:

            {code}
            convert(t1.a using utf8mb4) = t2.a
            {code}
            Consider this example:

            {code:sql}
            create table t1 (
              a varchar(32) collate utf8mb3_general_ci,
              key(a)
            );
            insert into t1 select seq from seq_1_to_10000;

            create table t2 (
              a varchar(32) collate utf8mb4_general_ci,
              key(a)
            );
            insert into t2 select seq from seq_1_to_10000;
            {code}

            The optimizer is able to use the index on t2.a for ref access:

            {code}
            explain select * from t1 straight_join t2 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t2 | ref | a | a | 131 | func | 1 | Using where; Using index |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            {code}

            But not the index on t1.a:
            {code}
            explain select * from t2 straight_join t1 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | 1 | SIMPLE | t2 | index | a | a | 131 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using where; Using index; Using join buffer (flat, BNL join) |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            {code}

            This is because internally t1.a=t2.a is converted into:

            {code}
            convert(t1.a using utf8mb4) = t2.a
            {code}

            This MDEV is about making it possible to use ref access on t2.a.
            psergei Sergei Petrunia made changes -
            Description Consider this example:

            {code:sql}
            create table t1 (
              a varchar(32) collate utf8mb3_general_ci,
              key(a)
            );
            insert into t1 select seq from seq_1_to_10000;

            create table t2 (
              a varchar(32) collate utf8mb4_general_ci,
              key(a)
            );
            insert into t2 select seq from seq_1_to_10000;
            {code}

            The optimizer is able to use the index on t2.a for ref access:

            {code}
            explain select * from t1 straight_join t2 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t2 | ref | a | a | 131 | func | 1 | Using where; Using index |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            {code}

            But not the index on t1.a:
            {code}
            explain select * from t2 straight_join t1 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | 1 | SIMPLE | t2 | index | a | a | 131 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using where; Using index; Using join buffer (flat, BNL join) |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            {code}

            This is because internally t1.a=t2.a is converted into:

            {code}
            convert(t1.a using utf8mb4) = t2.a
            {code}

            This MDEV is about making it possible to use ref access on t2.a.
            h1. Task setting
            Consider this example:

            {code:sql}
            create table t1 (
              a varchar(32) collate utf8mb3_general_ci,
              key(a)
            );
            insert into t1 select seq from seq_1_to_10000;

            create table t2 (
              a varchar(32) collate utf8mb4_general_ci,
              key(a)
            );
            insert into t2 select seq from seq_1_to_10000;
            {code}

            The optimizer is able to use the index on t2.a for ref access:

            {code}
            explain select * from t1 straight_join t2 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t2 | ref | a | a | 131 | func | 1 | Using where; Using index |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            {code}

            But not the index on t1.a:
            {code}
            explain select * from t2 straight_join t1 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | 1 | SIMPLE | t2 | index | a | a | 131 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using where; Using index; Using join buffer (flat, BNL join) |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            {code}

            This is because internally t1.a=t2.a is converted into:

            {code}
            convert(t1.a using utf8mb4) = t2.a
            {code}

            This MDEV is about making it possible to use ref access on t2.a.

            h2. From utf8mb3_val=utf8mb4_val to index lookups

            When we have
            {code}
            utf8mb3_key_col = utf8mb4_value
            {code}
            the normal way to compare is to convert the mb3 value into mb4 and then doing the comparison.

            But what if we want to use
            {code}
            INDEX(utf8mb3_key_col)
            {code}
            to find the matching value?

            UTF8-MB4 consists of
            * BMP (Basic Multi-lingual Plane) characters. These can be represented in UTF8MB3.
            * Characters from other Planes, named Supplementary or Astral planes.

            Consider a lookup value that is in UTF8MB4.
            * If the characters are in BMP, we can just make the lookup.
            * What if there are characters outside BMP? They *may* or may not compare as equal to characters in BMP. It depends on the collation.

            In {{utf8mb4_general_ci}}, all characters outside the BMP compare as equal to each other
            and equal to the {{U+FFFD}} Replacement Character, which is in BMP (In the code, search for {{MY_CS_REPLACEMENT_CHARACTER}}).

            A demo
            {code:sql}
            create table t12 (
              mb3 varchar(10) collate utf8mb3_general_ci,
              mb4 varchar(10) collate utf8mb4_general_ci
            );

            -- mb3 has the Replacement Character.
            -- mb4 has some non-BMP character.
            insert into t12 values (
             _utf8mb3 0xEFBFBD,
             _utf8mb4 0xF09D8C86);
            {code}

            {code:sql}
            set names utf8mb4;
            MariaDB [test]> select mb3, hex(mb3), mb4, hex(mb4) from t12;
            +------+----------+------+----------+
            | mb3 | hex(mb3) | mb4 | hex(mb4) |
            +------+----------+------+----------+
            | � | EFBFBD | 𝌆 | F09D8C86 |
            +------+----------+------+----------+
            1 row in set (0.001 sec)
            {code}

            And these two characters are equal:
            {code}
            MariaDB [test]> select mb3=mb4 from t12;
            +---------+
            | mb3=mb4 |
            +---------+
            | 1 |
            +---------+
            1 row in set (0.004 sec)
            {code}
            psergei Sergei Petrunia made changes -
            Description h1. Task setting
            Consider this example:

            {code:sql}
            create table t1 (
              a varchar(32) collate utf8mb3_general_ci,
              key(a)
            );
            insert into t1 select seq from seq_1_to_10000;

            create table t2 (
              a varchar(32) collate utf8mb4_general_ci,
              key(a)
            );
            insert into t2 select seq from seq_1_to_10000;
            {code}

            The optimizer is able to use the index on t2.a for ref access:

            {code}
            explain select * from t1 straight_join t2 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t2 | ref | a | a | 131 | func | 1 | Using where; Using index |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            {code}

            But not the index on t1.a:
            {code}
            explain select * from t2 straight_join t1 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | 1 | SIMPLE | t2 | index | a | a | 131 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using where; Using index; Using join buffer (flat, BNL join) |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            {code}

            This is because internally t1.a=t2.a is converted into:

            {code}
            convert(t1.a using utf8mb4) = t2.a
            {code}

            This MDEV is about making it possible to use ref access on t2.a.

            h2. From utf8mb3_val=utf8mb4_val to index lookups

            When we have
            {code}
            utf8mb3_key_col = utf8mb4_value
            {code}
            the normal way to compare is to convert the mb3 value into mb4 and then doing the comparison.

            But what if we want to use
            {code}
            INDEX(utf8mb3_key_col)
            {code}
            to find the matching value?

            UTF8-MB4 consists of
            * BMP (Basic Multi-lingual Plane) characters. These can be represented in UTF8MB3.
            * Characters from other Planes, named Supplementary or Astral planes.

            Consider a lookup value that is in UTF8MB4.
            * If the characters are in BMP, we can just make the lookup.
            * What if there are characters outside BMP? They *may* or may not compare as equal to characters in BMP. It depends on the collation.

            In {{utf8mb4_general_ci}}, all characters outside the BMP compare as equal to each other
            and equal to the {{U+FFFD}} Replacement Character, which is in BMP (In the code, search for {{MY_CS_REPLACEMENT_CHARACTER}}).

            A demo
            {code:sql}
            create table t12 (
              mb3 varchar(10) collate utf8mb3_general_ci,
              mb4 varchar(10) collate utf8mb4_general_ci
            );

            -- mb3 has the Replacement Character.
            -- mb4 has some non-BMP character.
            insert into t12 values (
             _utf8mb3 0xEFBFBD,
             _utf8mb4 0xF09D8C86);
            {code}

            {code:sql}
            set names utf8mb4;
            MariaDB [test]> select mb3, hex(mb3), mb4, hex(mb4) from t12;
            +------+----------+------+----------+
            | mb3 | hex(mb3) | mb4 | hex(mb4) |
            +------+----------+------+----------+
            | � | EFBFBD | 𝌆 | F09D8C86 |
            +------+----------+------+----------+
            1 row in set (0.001 sec)
            {code}

            And these two characters are equal:
            {code}
            MariaDB [test]> select mb3=mb4 from t12;
            +---------+
            | mb3=mb4 |
            +---------+
            | 1 |
            +---------+
            1 row in set (0.004 sec)
            {code}
            h1. Task setting
            Consider this example:

            {code:sql}
            create table t1 (
              a varchar(32) collate utf8mb3_general_ci,
              key(a)
            );
            insert into t1 select seq from seq_1_to_10000;

            create table t2 (
              a varchar(32) collate utf8mb4_general_ci,
              key(a)
            );
            insert into t2 select seq from seq_1_to_10000;
            {code}

            The optimizer is able to use the index on t2.a for ref access:

            {code}
            explain select * from t1 straight_join t2 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t2 | ref | a | a | 131 | func | 1 | Using where; Using index |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            {code}

            But not the index on t1.a:
            {code}
            explain select * from t2 straight_join t1 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | 1 | SIMPLE | t2 | index | a | a | 131 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using where; Using index; Using join buffer (flat, BNL join) |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            {code}

            This is because internally t1.a=t2.a is converted into:

            {code}
            convert(t1.a using utf8mb4) = t2.a
            {code}

            This MDEV is about making it possible to use ref access on t2.a.

            h2. From utf8mb3_val=utf8mb4_val to index lookups

            When we have
            {code}
            utf8mb3_key_col = utf8mb4_value
            {code}
            the normal way to compare is to convert the mb3 value into mb4 and then doing the comparison.

            But what if we want to use
            {code}
            INDEX(utf8mb3_key_col)
            {code}
            to find the matching value?

            UTF8-MB4 consists of
            * BMP (Basic Multi-lingual Plane) characters. These can be represented in UTF8MB3.
            * Characters from other Planes, named Supplementary or Astral planes.

            Consider a lookup value that is in UTF8MB4.
            * If the characters are in BMP, we can just make the lookup.
            * What if there are characters outside BMP? They *may* or may not compare as equal to characters in BMP. It depends on the collation.

            In {{utf8mb4_general_ci}}, all characters outside the BMP compare as equal to each other
            and equal to the {{U+FFFD}} Replacement Character, which is in BMP (In the code, search for {{MY_CS_REPLACEMENT_CHARACTER}}).

            A demo
            {code:sql}
            create table t12 (
              mb3 varchar(10) collate utf8mb3_general_ci,
              mb4 varchar(10) collate utf8mb4_general_ci
            );

            -- mb3 has the Replacement Character.
            -- mb4 has some non-BMP character.
            insert into t12 values (
             _utf8mb3 0xEFBFBD,
             _utf8mb4 0xF09D8C86);
            {code}

            {code:sql}
            set names utf8mb4;
            MariaDB [test]> select mb3, hex(mb3), mb4, hex(mb4) from t12;
            +------+----------+------+----------+
            | mb3 | hex(mb3) | mb4 | hex(mb4) |
            +------+----------+------+----------+
            | � | EFBFBD | 𝌆 | F09D8C86 |
            +------+----------+------+----------+
            1 row in set (0.001 sec)
            {code}

            Non-bmp character is equal to the Replacement Character:
            {code}
            MariaDB [test]> select mb3=mb4 from t12;
            +---------+
            | mb3=mb4 |
            +---------+
            | 1 |
            +---------+
            1 row in set (0.004 sec)
            {code}

            If one converts UTF8MB4 to MB3, MariaDB will replace non-BMP characters with '?' (Question Mark), NOT with replacement character:
            {code}
            MariaDB [test]> select convert(mb4 using utf8mb3) from t12;
            +----------------------------+
            | convert(mb4 using utf8mb3) |
            +----------------------------+
            | ? |
            +----------------------------+
            1 row in set, 1 warning (0.004 sec)

            MariaDB [test]> show warnings;
            +---------+------+------------------------------------------------------------+
            | Level | Code | Message |
            +---------+------+------------------------------------------------------------+
            | Warning | 1977 | Cannot convert 'utf8mb4' character 0xF09D8C86 to 'utf8mb3' |
            +---------+------+------------------------------------------------------------+
            {code}
            psergei Sergei Petrunia made changes -
            Description h1. Task setting
            Consider this example:

            {code:sql}
            create table t1 (
              a varchar(32) collate utf8mb3_general_ci,
              key(a)
            );
            insert into t1 select seq from seq_1_to_10000;

            create table t2 (
              a varchar(32) collate utf8mb4_general_ci,
              key(a)
            );
            insert into t2 select seq from seq_1_to_10000;
            {code}

            The optimizer is able to use the index on t2.a for ref access:

            {code}
            explain select * from t1 straight_join t2 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t2 | ref | a | a | 131 | func | 1 | Using where; Using index |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            {code}

            But not the index on t1.a:
            {code}
            explain select * from t2 straight_join t1 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | 1 | SIMPLE | t2 | index | a | a | 131 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using where; Using index; Using join buffer (flat, BNL join) |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            {code}

            This is because internally t1.a=t2.a is converted into:

            {code}
            convert(t1.a using utf8mb4) = t2.a
            {code}

            This MDEV is about making it possible to use ref access on t2.a.

            h2. From utf8mb3_val=utf8mb4_val to index lookups

            When we have
            {code}
            utf8mb3_key_col = utf8mb4_value
            {code}
            the normal way to compare is to convert the mb3 value into mb4 and then doing the comparison.

            But what if we want to use
            {code}
            INDEX(utf8mb3_key_col)
            {code}
            to find the matching value?

            UTF8-MB4 consists of
            * BMP (Basic Multi-lingual Plane) characters. These can be represented in UTF8MB3.
            * Characters from other Planes, named Supplementary or Astral planes.

            Consider a lookup value that is in UTF8MB4.
            * If the characters are in BMP, we can just make the lookup.
            * What if there are characters outside BMP? They *may* or may not compare as equal to characters in BMP. It depends on the collation.

            In {{utf8mb4_general_ci}}, all characters outside the BMP compare as equal to each other
            and equal to the {{U+FFFD}} Replacement Character, which is in BMP (In the code, search for {{MY_CS_REPLACEMENT_CHARACTER}}).

            A demo
            {code:sql}
            create table t12 (
              mb3 varchar(10) collate utf8mb3_general_ci,
              mb4 varchar(10) collate utf8mb4_general_ci
            );

            -- mb3 has the Replacement Character.
            -- mb4 has some non-BMP character.
            insert into t12 values (
             _utf8mb3 0xEFBFBD,
             _utf8mb4 0xF09D8C86);
            {code}

            {code:sql}
            set names utf8mb4;
            MariaDB [test]> select mb3, hex(mb3), mb4, hex(mb4) from t12;
            +------+----------+------+----------+
            | mb3 | hex(mb3) | mb4 | hex(mb4) |
            +------+----------+------+----------+
            | � | EFBFBD | 𝌆 | F09D8C86 |
            +------+----------+------+----------+
            1 row in set (0.001 sec)
            {code}

            Non-bmp character is equal to the Replacement Character:
            {code}
            MariaDB [test]> select mb3=mb4 from t12;
            +---------+
            | mb3=mb4 |
            +---------+
            | 1 |
            +---------+
            1 row in set (0.004 sec)
            {code}

            If one converts UTF8MB4 to MB3, MariaDB will replace non-BMP characters with '?' (Question Mark), NOT with replacement character:
            {code}
            MariaDB [test]> select convert(mb4 using utf8mb3) from t12;
            +----------------------------+
            | convert(mb4 using utf8mb3) |
            +----------------------------+
            | ? |
            +----------------------------+
            1 row in set, 1 warning (0.004 sec)

            MariaDB [test]> show warnings;
            +---------+------+------------------------------------------------------------+
            | Level | Code | Message |
            +---------+------+------------------------------------------------------------+
            | Warning | 1977 | Cannot convert 'utf8mb4' character 0xF09D8C86 to 'utf8mb3' |
            +---------+------+------------------------------------------------------------+
            {code}
            h1. Task setting
            Consider this example:

            {code:sql}
            create table t1 (
              a varchar(32) collate utf8mb3_general_ci,
              key(a)
            );
            insert into t1 select seq from seq_1_to_10000;

            create table t2 (
              a varchar(32) collate utf8mb4_general_ci,
              key(a)
            );
            insert into t2 select seq from seq_1_to_10000;
            {code}

            The optimizer is able to use the index on t2.a for ref access:

            {code}
            explain select * from t1 straight_join t2 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t2 | ref | a | a | 131 | func | 1 | Using where; Using index |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            {code}

            But not the index on t1.a:
            {code}
            explain select * from t2 straight_join t1 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | 1 | SIMPLE | t2 | index | a | a | 131 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using where; Using index; Using join buffer (flat, BNL join) |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            {code}

            This is because internally t1.a=t2.a is converted into:

            {code}
            convert(t1.a using utf8mb4) = t2.a
            {code}

            This MDEV is about making it possible to use ref access on t2.a.

            h2. From utf8mb3_val=utf8mb4_val to index lookups

            When we have
            {code}
            utf8mb3_key_col = utf8mb4_value
            {code}
            the normal way to compare is to convert the mb3 value into mb4 and then doing the comparison.

            But what if we want to use
            {code}
            INDEX(utf8mb3_key_col)
            {code}
            to find the matching value?

            UTF8-MB4 consists of
            * BMP (Basic Multi-lingual Plane) characters. These can be represented in UTF8MB3.
            * Characters from other Planes, named Supplementary or Astral planes.

            Consider a lookup value that is in UTF8MB4.
            * If the characters are in BMP, we can just make the lookup.
            * What if there are characters outside BMP? They *may* or may not compare as equal to characters in BMP. It depends on the collation.

            In {{utf8mb4_general_ci}}, all characters outside the BMP compare as equal to each other
            and equal to the {{U+FFFD}} Replacement Character, which is in BMP (In the code, search for {{MY_CS_REPLACEMENT_CHARACTER}}).

            A demo
            {code:sql}
            create table t12 (
              mb3 varchar(10) collate utf8mb3_general_ci,
              mb4 varchar(10) collate utf8mb4_general_ci
            );

            -- mb3 has the Replacement Character.
            -- mb4 has some non-BMP character.
            insert into t12 values (
             _utf8mb3 0xEFBFBD,
             _utf8mb4 0xF09D8C86);
            {code}

            {code:sql}
            set names utf8mb4;
            MariaDB [test]> select mb3, hex(mb3), mb4, hex(mb4) from t12;
            +------+----------+------+----------+
            | mb3 | hex(mb3) | mb4 | hex(mb4) |
            +------+----------+------+----------+
            | � | EFBFBD | 𝌆 | F09D8C86 |
            +------+----------+------+----------+
            1 row in set (0.001 sec)
            {code}

            Non-bmp character is equal to the Replacement Character:
            {code}
            MariaDB [test]> select mb3=mb4 from t12;
            +---------+
            | mb3=mb4 |
            +---------+
            | 1 |
            +---------+
            1 row in set (0.004 sec)
            {code}

            If one converts UTF8MB4 to MB3, MariaDB will replace non-BMP characters with '?' (Question Mark), NOT with replacement character:
            {code}
            MariaDB [test]> select convert(mb4 using utf8mb3) from t12;
            +----------------------------+
            | convert(mb4 using utf8mb3) |
            +----------------------------+
            | ? |
            +----------------------------+
            1 row in set, 1 warning (0.004 sec)

            MariaDB [test]> show warnings;
            +---------+------+------------------------------------------------------------+
            | Level | Code | Message |
            +---------+------+------------------------------------------------------------+
            | Warning | 1977 | Cannot convert 'utf8mb4' character 0xF09D8C86 to 'utf8mb3' |
            +---------+------+------------------------------------------------------------+
            {code}

            h2. Suggested solution
            psergei Sergei Petrunia made changes -
            Summary tbl.utf8mb3_key_col=utf8mb4_val cannot be used for ref access utf8mb3_key_col=utf8mb4_value cannot be used for ref access
            psergei Sergei Petrunia made changes -
            Description h1. Task setting
            Consider this example:

            {code:sql}
            create table t1 (
              a varchar(32) collate utf8mb3_general_ci,
              key(a)
            );
            insert into t1 select seq from seq_1_to_10000;

            create table t2 (
              a varchar(32) collate utf8mb4_general_ci,
              key(a)
            );
            insert into t2 select seq from seq_1_to_10000;
            {code}

            The optimizer is able to use the index on t2.a for ref access:

            {code}
            explain select * from t1 straight_join t2 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t2 | ref | a | a | 131 | func | 1 | Using where; Using index |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            {code}

            But not the index on t1.a:
            {code}
            explain select * from t2 straight_join t1 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | 1 | SIMPLE | t2 | index | a | a | 131 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using where; Using index; Using join buffer (flat, BNL join) |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            {code}

            This is because internally t1.a=t2.a is converted into:

            {code}
            convert(t1.a using utf8mb4) = t2.a
            {code}

            This MDEV is about making it possible to use ref access on t2.a.

            h2. From utf8mb3_val=utf8mb4_val to index lookups

            When we have
            {code}
            utf8mb3_key_col = utf8mb4_value
            {code}
            the normal way to compare is to convert the mb3 value into mb4 and then doing the comparison.

            But what if we want to use
            {code}
            INDEX(utf8mb3_key_col)
            {code}
            to find the matching value?

            UTF8-MB4 consists of
            * BMP (Basic Multi-lingual Plane) characters. These can be represented in UTF8MB3.
            * Characters from other Planes, named Supplementary or Astral planes.

            Consider a lookup value that is in UTF8MB4.
            * If the characters are in BMP, we can just make the lookup.
            * What if there are characters outside BMP? They *may* or may not compare as equal to characters in BMP. It depends on the collation.

            In {{utf8mb4_general_ci}}, all characters outside the BMP compare as equal to each other
            and equal to the {{U+FFFD}} Replacement Character, which is in BMP (In the code, search for {{MY_CS_REPLACEMENT_CHARACTER}}).

            A demo
            {code:sql}
            create table t12 (
              mb3 varchar(10) collate utf8mb3_general_ci,
              mb4 varchar(10) collate utf8mb4_general_ci
            );

            -- mb3 has the Replacement Character.
            -- mb4 has some non-BMP character.
            insert into t12 values (
             _utf8mb3 0xEFBFBD,
             _utf8mb4 0xF09D8C86);
            {code}

            {code:sql}
            set names utf8mb4;
            MariaDB [test]> select mb3, hex(mb3), mb4, hex(mb4) from t12;
            +------+----------+------+----------+
            | mb3 | hex(mb3) | mb4 | hex(mb4) |
            +------+----------+------+----------+
            | � | EFBFBD | 𝌆 | F09D8C86 |
            +------+----------+------+----------+
            1 row in set (0.001 sec)
            {code}

            Non-bmp character is equal to the Replacement Character:
            {code}
            MariaDB [test]> select mb3=mb4 from t12;
            +---------+
            | mb3=mb4 |
            +---------+
            | 1 |
            +---------+
            1 row in set (0.004 sec)
            {code}

            If one converts UTF8MB4 to MB3, MariaDB will replace non-BMP characters with '?' (Question Mark), NOT with replacement character:
            {code}
            MariaDB [test]> select convert(mb4 using utf8mb3) from t12;
            +----------------------------+
            | convert(mb4 using utf8mb3) |
            +----------------------------+
            | ? |
            +----------------------------+
            1 row in set, 1 warning (0.004 sec)

            MariaDB [test]> show warnings;
            +---------+------+------------------------------------------------------------+
            | Level | Code | Message |
            +---------+------+------------------------------------------------------------+
            | Warning | 1977 | Cannot convert 'utf8mb4' character 0xF09D8C86 to 'utf8mb3' |
            +---------+------+------------------------------------------------------------+
            {code}

            h2. Suggested solution
            h1. Task setting
            Consider this example:

            {code:sql}
            create table t1 (
              a varchar(32) collate utf8mb3_general_ci,
              key(a)
            );
            insert into t1 select seq from seq_1_to_10000;

            create table t2 (
              a varchar(32) collate utf8mb4_general_ci,
              key(a)
            );
            insert into t2 select seq from seq_1_to_10000;
            {code}

            The optimizer is able to use the index on t2.a for ref access:

            {code}
            explain select * from t1 straight_join t2 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t2 | ref | a | a | 131 | func | 1 | Using where; Using index |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            {code}

            But not the index on t1.a:
            {code}
            explain select * from t2 straight_join t1 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | 1 | SIMPLE | t2 | index | a | a | 131 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using where; Using index; Using join buffer (flat, BNL join) |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            {code}

            This is because internally t1.a=t2.a is converted into:

            {code}
            convert(t1.a using utf8mb4) = t2.a
            {code}

            This MDEV is about making it possible to use ref access on t2.a.

            h2. From utf8mb3_val=utf8mb4_val to index lookups

            When we have
            {code}
            utf8mb3_key_col = utf8mb4_value
            {code}
            the normal way to compare is to convert the mb3 value into mb4 and then doing the comparison.

            But what if we want to use
            {code}
            INDEX(utf8mb3_key_col)
            {code}
            to find the matching value?

            UTF8-MB4 consists of
            * BMP (Basic Multi-lingual Plane) characters. These can be represented in UTF8MB3.
            * Characters from other Planes, named Supplementary or Astral planes.

            Consider a lookup value that is in UTF8MB4.
            * If the characters are in BMP, we can just make the lookup.
            * What if there are characters outside BMP? They *may* or may not compare as equal to characters in BMP. It depends on the collation.

            In {{utf8mb4_general_ci}}, all characters outside the BMP compare as equal to each other
            and equal to the {{U+FFFD}} Replacement Character, which is in BMP (In the code, search for {{MY_CS_REPLACEMENT_CHARACTER}}).

            A demo
            {code:sql}
            create table t12 (
              mb3 varchar(10) collate utf8mb3_general_ci,
              mb4 varchar(10) collate utf8mb4_general_ci
            );

            -- mb3 has the Replacement Character.
            -- mb4 has some non-BMP character.
            insert into t12 values (
             _utf8mb3 0xEFBFBD,
             _utf8mb4 0xF09D8C86);
            {code}

            {code:sql}
            set names utf8mb4;
            MariaDB [test]> select mb3, hex(mb3), mb4, hex(mb4) from t12;
            +------+----------+------+----------+
            | mb3 | hex(mb3) | mb4 | hex(mb4) |
            +------+----------+------+----------+
            | � | EFBFBD | 𝌆 | F09D8C86 |
            +------+----------+------+----------+
            1 row in set (0.001 sec)
            {code}

            Non-bmp character is equal to the Replacement Character:
            {code}
            MariaDB [test]> select mb3=mb4 from t12;
            +---------+
            | mb3=mb4 |
            +---------+
            | 1 |
            +---------+
            1 row in set (0.004 sec)
            {code}

            If one converts UTF8MB4 to MB3, MariaDB will replace non-BMP characters with '?' (Question Mark), NOT with replacement character:
            {code}
            MariaDB [test]> select convert(mb4 using utf8mb3) from t12;
            +----------------------------+
            | convert(mb4 using utf8mb3) |
            +----------------------------+
            | ? |
            +----------------------------+
            1 row in set, 1 warning (0.004 sec)

            MariaDB [test]> show warnings;
            +---------+------+------------------------------------------------------------+
            | Level | Code | Message |
            +---------+------+------------------------------------------------------------+
            | Warning | 1977 | Cannot convert 'utf8mb4' character 0xF09D8C86 to 'utf8mb3' |
            +---------+------+------------------------------------------------------------+
            {code}

            h2. Suggested solution #1: rewrite

            So, currently, when one calls fix_fields() for
            {code:sql}
              mb3_key_col= mb4_col
            {code}
            it will be converted into
            {code:sql}
              convert(mb3_key_col using utf8mb4) = mb4_col
            {code}

            The optimizer only considers ref access for column references, that is, for
            {code:sql}
               mb3_key_col= ...
            {code}
            The idea is to add a rewritten equality, that is, convert "mb3_key_col=mb4_col" into
            {code:sql}
              convert(mb3_key_col using utf8mb4) = mb4_col -- AS BEFORE
              AND
              mb3_key_col= convert_with_replacement(mb4_col, replacement_char) -- ADDED PART
            {code}

            The {{convert_with_replacement()}} is like {{CONVERT()}}, except that it
            * replaces non-converted symbols with the Replacement Character.
            * doesn't emit any warnings.

            Advantage of this approach:
            - the fix is not intrusive. We don't need to change the way varchar comparisons are done.

            Disadvantages:
            # One condition becomes two. We can work around this by making {{make_join_select()}} ignore the "ADDED PART".
            # What if "mb4_col" is a complex expression. How do we make a copy of it to use in the ADDED PART? (Can we use Item_direct_ref here?)

            psergei Sergei Petrunia made changes -
            Description h1. Task setting
            Consider this example:

            {code:sql}
            create table t1 (
              a varchar(32) collate utf8mb3_general_ci,
              key(a)
            );
            insert into t1 select seq from seq_1_to_10000;

            create table t2 (
              a varchar(32) collate utf8mb4_general_ci,
              key(a)
            );
            insert into t2 select seq from seq_1_to_10000;
            {code}

            The optimizer is able to use the index on t2.a for ref access:

            {code}
            explain select * from t1 straight_join t2 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t2 | ref | a | a | 131 | func | 1 | Using where; Using index |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            {code}

            But not the index on t1.a:
            {code}
            explain select * from t2 straight_join t1 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | 1 | SIMPLE | t2 | index | a | a | 131 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using where; Using index; Using join buffer (flat, BNL join) |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            {code}

            This is because internally t1.a=t2.a is converted into:

            {code}
            convert(t1.a using utf8mb4) = t2.a
            {code}

            This MDEV is about making it possible to use ref access on t2.a.

            h2. From utf8mb3_val=utf8mb4_val to index lookups

            When we have
            {code}
            utf8mb3_key_col = utf8mb4_value
            {code}
            the normal way to compare is to convert the mb3 value into mb4 and then doing the comparison.

            But what if we want to use
            {code}
            INDEX(utf8mb3_key_col)
            {code}
            to find the matching value?

            UTF8-MB4 consists of
            * BMP (Basic Multi-lingual Plane) characters. These can be represented in UTF8MB3.
            * Characters from other Planes, named Supplementary or Astral planes.

            Consider a lookup value that is in UTF8MB4.
            * If the characters are in BMP, we can just make the lookup.
            * What if there are characters outside BMP? They *may* or may not compare as equal to characters in BMP. It depends on the collation.

            In {{utf8mb4_general_ci}}, all characters outside the BMP compare as equal to each other
            and equal to the {{U+FFFD}} Replacement Character, which is in BMP (In the code, search for {{MY_CS_REPLACEMENT_CHARACTER}}).

            A demo
            {code:sql}
            create table t12 (
              mb3 varchar(10) collate utf8mb3_general_ci,
              mb4 varchar(10) collate utf8mb4_general_ci
            );

            -- mb3 has the Replacement Character.
            -- mb4 has some non-BMP character.
            insert into t12 values (
             _utf8mb3 0xEFBFBD,
             _utf8mb4 0xF09D8C86);
            {code}

            {code:sql}
            set names utf8mb4;
            MariaDB [test]> select mb3, hex(mb3), mb4, hex(mb4) from t12;
            +------+----------+------+----------+
            | mb3 | hex(mb3) | mb4 | hex(mb4) |
            +------+----------+------+----------+
            | � | EFBFBD | 𝌆 | F09D8C86 |
            +------+----------+------+----------+
            1 row in set (0.001 sec)
            {code}

            Non-bmp character is equal to the Replacement Character:
            {code}
            MariaDB [test]> select mb3=mb4 from t12;
            +---------+
            | mb3=mb4 |
            +---------+
            | 1 |
            +---------+
            1 row in set (0.004 sec)
            {code}

            If one converts UTF8MB4 to MB3, MariaDB will replace non-BMP characters with '?' (Question Mark), NOT with replacement character:
            {code}
            MariaDB [test]> select convert(mb4 using utf8mb3) from t12;
            +----------------------------+
            | convert(mb4 using utf8mb3) |
            +----------------------------+
            | ? |
            +----------------------------+
            1 row in set, 1 warning (0.004 sec)

            MariaDB [test]> show warnings;
            +---------+------+------------------------------------------------------------+
            | Level | Code | Message |
            +---------+------+------------------------------------------------------------+
            | Warning | 1977 | Cannot convert 'utf8mb4' character 0xF09D8C86 to 'utf8mb3' |
            +---------+------+------------------------------------------------------------+
            {code}

            h2. Suggested solution #1: rewrite

            So, currently, when one calls fix_fields() for
            {code:sql}
              mb3_key_col= mb4_col
            {code}
            it will be converted into
            {code:sql}
              convert(mb3_key_col using utf8mb4) = mb4_col
            {code}

            The optimizer only considers ref access for column references, that is, for
            {code:sql}
               mb3_key_col= ...
            {code}
            The idea is to add a rewritten equality, that is, convert "mb3_key_col=mb4_col" into
            {code:sql}
              convert(mb3_key_col using utf8mb4) = mb4_col -- AS BEFORE
              AND
              mb3_key_col= convert_with_replacement(mb4_col, replacement_char) -- ADDED PART
            {code}

            The {{convert_with_replacement()}} is like {{CONVERT()}}, except that it
            * replaces non-converted symbols with the Replacement Character.
            * doesn't emit any warnings.

            Advantage of this approach:
            - the fix is not intrusive. We don't need to change the way varchar comparisons are done.

            Disadvantages:
            # One condition becomes two. We can work around this by making {{make_join_select()}} ignore the "ADDED PART".
            # What if "mb4_col" is a complex expression. How do we make a copy of it to use in the ADDED PART? (Can we use Item_direct_ref here?)

            h1. Task setting
            Consider this example:

            {code:sql}
            create table t1 (
              a varchar(32) collate utf8mb3_general_ci,
              key(a)
            );
            insert into t1 select seq from seq_1_to_10000;

            create table t2 (
              a varchar(32) collate utf8mb4_general_ci,
              key(a)
            );
            insert into t2 select seq from seq_1_to_10000;
            {code}

            The optimizer is able to use the index on t2.a for ref access:

            {code}
            explain select * from t1 straight_join t2 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t2 | ref | a | a | 131 | func | 1 | Using where; Using index |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            {code}

            But not the index on t1.a:
            {code}
            explain select * from t2 straight_join t1 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | 1 | SIMPLE | t2 | index | a | a | 131 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using where; Using index; Using join buffer (flat, BNL join) |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            {code}

            This is because internally t1.a=t2.a is converted into:

            {code}
            convert(t1.a using utf8mb4) = t2.a
            {code}

            This MDEV is about making it possible to use ref access on t2.a.

            h2. From utf8mb3_val=utf8mb4_val to index lookups

            When we have
            {code}
            utf8mb3_key_col = utf8mb4_value
            {code}
            the normal way to compare is to convert the mb3 value into mb4 and then doing the comparison.

            But what if we want to use
            {code}
            INDEX(utf8mb3_key_col)
            {code}
            to find the matching value?

            UTF8-MB4 consists of
            * BMP (Basic Multi-lingual Plane) characters. These can be represented in UTF8MB3.
            * Characters from other Planes, named Supplementary or Astral planes.

            Consider a lookup value that is in UTF8MB4.
            * If the characters are in BMP, we can just make the lookup.
            * What if there are characters outside BMP? They *may* or may not compare as equal to characters in BMP. It depends on the collation.

            In {{utf8mb4_general_ci}}, all characters outside the BMP compare as equal to each other
            and equal to the {{U+FFFD}} Replacement Character, which is in BMP (In the code, search for {{MY_CS_REPLACEMENT_CHARACTER}}).

            A demo
            {code:sql}
            create table t12 (
              mb3 varchar(10) collate utf8mb3_general_ci,
              mb4 varchar(10) collate utf8mb4_general_ci
            );

            -- mb3 has the Replacement Character.
            -- mb4 has some non-BMP character.
            insert into t12 values (
             _utf8mb3 0xEFBFBD,
             _utf8mb4 0xF09D8C86);
            {code}

            {code:sql}
            set names utf8mb4;
            MariaDB [test]> select mb3, hex(mb3), mb4, hex(mb4) from t12;
            +------+----------+------+----------+
            | mb3 | hex(mb3) | mb4 | hex(mb4) |
            +------+----------+------+----------+
            | � | EFBFBD | 𝌆 | F09D8C86 |
            +------+----------+------+----------+
            1 row in set (0.001 sec)
            {code}

            Non-bmp character is equal to the Replacement Character:
            {code}
            MariaDB [test]> select mb3=mb4 from t12;
            +---------+
            | mb3=mb4 |
            +---------+
            | 1 |
            +---------+
            1 row in set (0.004 sec)
            {code}

            If one converts UTF8MB4 to MB3, MariaDB will replace non-BMP characters with '?' (Question Mark), NOT with replacement character:
            {code}
            MariaDB [test]> select convert(mb4 using utf8mb3) from t12;
            +----------------------------+
            | convert(mb4 using utf8mb3) |
            +----------------------------+
            | ? |
            +----------------------------+
            1 row in set, 1 warning (0.004 sec)

            MariaDB [test]> show warnings;
            +---------+------+------------------------------------------------------------+
            | Level | Code | Message |
            +---------+------+------------------------------------------------------------+
            | Warning | 1977 | Cannot convert 'utf8mb4' character 0xF09D8C86 to 'utf8mb3' |
            +---------+------+------------------------------------------------------------+
            {code}

            h2. Suggested solution #1: rewrite

            So, currently, when one calls fix_fields() for
            {code:sql}
              mb3_key_col= mb4_col
            {code}
            it will be converted into
            {code:sql}
              convert(mb3_key_col using utf8mb4) = mb4_col
            {code}

            The optimizer only considers ref access for column references, that is, for
            {code:sql}
               mb3_key_col= ...
            {code}
            The idea is to add a rewritten equality, that is, convert "mb3_key_col=mb4_col" into
            {code:sql}
              convert(mb3_key_col using utf8mb4) = mb4_col -- AS BEFORE
              AND
              mb3_key_col= convert_with_replacement(mb4_col, replacement_char) -- ADDED PART
            {code}

            The {{convert_with_replacement()}} is like {{CONVERT()}}, except that it
            * replaces non-converted symbols with the Replacement Character.
            * doesn't emit any warnings.

            Advantage of this approach:
            - the fix is not intrusive. We don't need to change the way string comparisons are done.
            - the fix covers all optimizer. Range optimizer will be able to handle {{utf8mb3_col= utf8mb4_const}}.

            Disadvantages:
            # One condition becomes two. We can work around this by making {{make_join_select()}} ignore the "ADDED PART".
            # What if "mb4_col" is a complex expression. How do we make a copy of it to use in the ADDED PART? (Can we use Item_direct_ref here?)

            h2. Suggested solution #2: change the way string comparisons are done

            Currently, fix_fields() code will inject {{CONVERT()}} call if character sets do not match.
            The idea is not not inject it, and instead allow string comparisons to use different character sets. They would dynamically check if the charset is the right one and convert if it is not..
            (Problem: it's not clear how intrusive this will be).

            psergei Sergei Petrunia made changes -
            psergei Sergei Petrunia made changes -
            psergei Sergei Petrunia made changes -
            Description h1. Task setting
            Consider this example:

            {code:sql}
            create table t1 (
              a varchar(32) collate utf8mb3_general_ci,
              key(a)
            );
            insert into t1 select seq from seq_1_to_10000;

            create table t2 (
              a varchar(32) collate utf8mb4_general_ci,
              key(a)
            );
            insert into t2 select seq from seq_1_to_10000;
            {code}

            The optimizer is able to use the index on t2.a for ref access:

            {code}
            explain select * from t1 straight_join t2 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t2 | ref | a | a | 131 | func | 1 | Using where; Using index |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            {code}

            But not the index on t1.a:
            {code}
            explain select * from t2 straight_join t1 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | 1 | SIMPLE | t2 | index | a | a | 131 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using where; Using index; Using join buffer (flat, BNL join) |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            {code}

            This is because internally t1.a=t2.a is converted into:

            {code}
            convert(t1.a using utf8mb4) = t2.a
            {code}

            This MDEV is about making it possible to use ref access on t2.a.

            h2. From utf8mb3_val=utf8mb4_val to index lookups

            When we have
            {code}
            utf8mb3_key_col = utf8mb4_value
            {code}
            the normal way to compare is to convert the mb3 value into mb4 and then doing the comparison.

            But what if we want to use
            {code}
            INDEX(utf8mb3_key_col)
            {code}
            to find the matching value?

            UTF8-MB4 consists of
            * BMP (Basic Multi-lingual Plane) characters. These can be represented in UTF8MB3.
            * Characters from other Planes, named Supplementary or Astral planes.

            Consider a lookup value that is in UTF8MB4.
            * If the characters are in BMP, we can just make the lookup.
            * What if there are characters outside BMP? They *may* or may not compare as equal to characters in BMP. It depends on the collation.

            In {{utf8mb4_general_ci}}, all characters outside the BMP compare as equal to each other
            and equal to the {{U+FFFD}} Replacement Character, which is in BMP (In the code, search for {{MY_CS_REPLACEMENT_CHARACTER}}).

            A demo
            {code:sql}
            create table t12 (
              mb3 varchar(10) collate utf8mb3_general_ci,
              mb4 varchar(10) collate utf8mb4_general_ci
            );

            -- mb3 has the Replacement Character.
            -- mb4 has some non-BMP character.
            insert into t12 values (
             _utf8mb3 0xEFBFBD,
             _utf8mb4 0xF09D8C86);
            {code}

            {code:sql}
            set names utf8mb4;
            MariaDB [test]> select mb3, hex(mb3), mb4, hex(mb4) from t12;
            +------+----------+------+----------+
            | mb3 | hex(mb3) | mb4 | hex(mb4) |
            +------+----------+------+----------+
            | � | EFBFBD | 𝌆 | F09D8C86 |
            +------+----------+------+----------+
            1 row in set (0.001 sec)
            {code}

            Non-bmp character is equal to the Replacement Character:
            {code}
            MariaDB [test]> select mb3=mb4 from t12;
            +---------+
            | mb3=mb4 |
            +---------+
            | 1 |
            +---------+
            1 row in set (0.004 sec)
            {code}

            If one converts UTF8MB4 to MB3, MariaDB will replace non-BMP characters with '?' (Question Mark), NOT with replacement character:
            {code}
            MariaDB [test]> select convert(mb4 using utf8mb3) from t12;
            +----------------------------+
            | convert(mb4 using utf8mb3) |
            +----------------------------+
            | ? |
            +----------------------------+
            1 row in set, 1 warning (0.004 sec)

            MariaDB [test]> show warnings;
            +---------+------+------------------------------------------------------------+
            | Level | Code | Message |
            +---------+------+------------------------------------------------------------+
            | Warning | 1977 | Cannot convert 'utf8mb4' character 0xF09D8C86 to 'utf8mb3' |
            +---------+------+------------------------------------------------------------+
            {code}

            h2. Suggested solution #1: rewrite

            So, currently, when one calls fix_fields() for
            {code:sql}
              mb3_key_col= mb4_col
            {code}
            it will be converted into
            {code:sql}
              convert(mb3_key_col using utf8mb4) = mb4_col
            {code}

            The optimizer only considers ref access for column references, that is, for
            {code:sql}
               mb3_key_col= ...
            {code}
            The idea is to add a rewritten equality, that is, convert "mb3_key_col=mb4_col" into
            {code:sql}
              convert(mb3_key_col using utf8mb4) = mb4_col -- AS BEFORE
              AND
              mb3_key_col= convert_with_replacement(mb4_col, replacement_char) -- ADDED PART
            {code}

            The {{convert_with_replacement()}} is like {{CONVERT()}}, except that it
            * replaces non-converted symbols with the Replacement Character.
            * doesn't emit any warnings.

            Advantage of this approach:
            - the fix is not intrusive. We don't need to change the way string comparisons are done.
            - the fix covers all optimizer. Range optimizer will be able to handle {{utf8mb3_col= utf8mb4_const}}.

            Disadvantages:
            # One condition becomes two. We can work around this by making {{make_join_select()}} ignore the "ADDED PART".
            # What if "mb4_col" is a complex expression. How do we make a copy of it to use in the ADDED PART? (Can we use Item_direct_ref here?)

            h2. Suggested solution #2: change the way string comparisons are done

            Currently, fix_fields() code will inject {{CONVERT()}} call if character sets do not match.
            The idea is not not inject it, and instead allow string comparisons to use different character sets. They would dynamically check if the charset is the right one and convert if it is not..
            (Problem: it's not clear how intrusive this will be).

            h1. Task setting
            Consider this example:

            {code:sql}
            create table t1 (
              a varchar(32) collate utf8mb3_general_ci,
              key(a)
            );
            insert into t1 select seq from seq_1_to_10000;

            create table t2 (
              a varchar(32) collate utf8mb4_general_ci,
              key(a)
            );
            insert into t2 select seq from seq_1_to_10000;
            {code}

            The optimizer is able to use the index on t2.a for ref access:

            {code}
            explain select * from t1 straight_join t2 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t2 | ref | a | a | 131 | func | 1 | Using where; Using index |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            {code}

            But not the index on t1.a:
            {code}
            explain select * from t2 straight_join t1 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | 1 | SIMPLE | t2 | index | a | a | 131 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using where; Using index; Using join buffer (flat, BNL join) |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            {code}

            This is because internally t1.a=t2.a is converted into:

            {code}
            convert(t1.a using utf8mb4) = t2.a
            {code}

            This MDEV is about making it possible to use ref access on t2.a.

            h2. From utf8mb3_val=utf8mb4_val to index lookups

            When we have
            {code}
            utf8mb3_key_col = utf8mb4_value
            {code}
            the normal way to compare is to convert the mb3 value into mb4 and then doing the comparison.

            But what if we want to use
            {code}
            INDEX(utf8mb3_key_col)
            {code}
            to find the matching value?

            UTF8-MB4 consists of
            * BMP (Basic Multi-lingual Plane) characters. These can be represented in UTF8MB3.
            * Characters from other Planes, named Supplementary or Astral planes.

            Consider a lookup value that is in UTF8MB4.
            * If the characters are in BMP, we can just make the lookup.
            * What if there are characters outside BMP? They *may* or may not compare as equal to characters in BMP. It depends on the collation.

            In {{utf8mb4_general_ci}}, all characters outside the BMP compare as equal to each other
            and equal to the {{U+FFFD}} Replacement Character, which is in BMP (In the code, search for {{MY_CS_REPLACEMENT_CHARACTER}}).

            A demo
            {code:sql}
            create table t12 (
              mb3 varchar(10) collate utf8mb3_general_ci,
              mb4 varchar(10) collate utf8mb4_general_ci
            );

            -- mb3 has the Replacement Character.
            -- mb4 has some non-BMP character.
            insert into t12 values (
             _utf8mb3 0xEFBFBD,
             _utf8mb4 0xF09D8C86);
            {code}

            {code:sql}
            set names utf8mb4;
            MariaDB [test]> select mb3, hex(mb3), mb4, hex(mb4) from t12;
            +------+----------+------+----------+
            | mb3 | hex(mb3) | mb4 | hex(mb4) |
            +------+----------+------+----------+
            | � | EFBFBD | 𝌆 | F09D8C86 |
            +------+----------+------+----------+
            1 row in set (0.001 sec)
            {code}

            Non-bmp character is equal to the Replacement Character:
            {code}
            MariaDB [test]> select mb3=mb4 from t12;
            +---------+
            | mb3=mb4 |
            +---------+
            | 1 |
            +---------+
            1 row in set (0.004 sec)
            {code}

            If one converts UTF8MB4 to MB3, MariaDB will replace non-BMP characters with '?' (Question Mark), NOT with replacement character:
            {code}
            MariaDB [test]> select convert(mb4 using utf8mb3) from t12;
            +----------------------------+
            | convert(mb4 using utf8mb3) |
            +----------------------------+
            | ? |
            +----------------------------+
            1 row in set, 1 warning (0.004 sec)

            MariaDB [test]> show warnings;
            +---------+------+------------------------------------------------------------+
            | Level | Code | Message |
            +---------+------+------------------------------------------------------------+
            | Warning | 1977 | Cannot convert 'utf8mb4' character 0xF09D8C86 to 'utf8mb3' |
            +---------+------+------------------------------------------------------------+
            {code}

            h2. Suggested solution #1: rewrite

            So, currently, when one calls fix_fields() for
            {code:sql}
              mb3_key_col= mb4_col
            {code}
            it will be converted into
            {code:sql}
              convert(mb3_key_col using utf8mb4) = mb4_col
            {code}

            The optimizer only considers ref access for column references, that is, for
            {code:sql}
               mb3_key_col= ...
            {code}
            The idea is to add a rewritten equality, that is, convert "mb3_key_col=mb4_col" into
            {code:sql}
              convert(mb3_key_col using utf8mb4) = mb4_col -- AS BEFORE
              AND
              mb3_key_col= convert_with_replacement(mb4_col, replacement_char) -- ADDED PART
            {code}

            The {{convert_with_replacement()}} is like {{CONVERT()}}, except that it
            * replaces non-converted symbols with the Replacement Character.
            * doesn't emit any warnings.

            Advantage of this approach:
            - the fix is not intrusive. We don't need to change the way string comparisons are done.
            - the fix covers all optimizer. Range optimizer will be able to handle {{utf8mb3_col= utf8mb4_const}}.

            Disadvantages:
            # One condition becomes two. We can work around this by making {{make_join_select()}} ignore the "ADDED PART".
            # What if "mb4_col" is a complex expression. How do we make a copy of it to use in the ADDED PART? (Can we use Item_direct_ref here?)

            h2. Suggested solution #2: change the way string comparisons are done

            Currently, fix_fields() code will inject {{CONVERT()}} call if character sets do not match.
            The idea is not not inject it, and instead allow string comparisons to use different character sets. They would dynamically check if the charset is the right one and convert if it is not..
            (Problem: it's not clear how intrusive this will be, it might be hard to put this behavior behind an @@optimizer_switch flag).

            monty Michael Widenius made changes -
            julien.fritsch Julien Fritsch made changes -
            Assignee Sergei Petrunia [ psergey ]
            serg Sergei Golubchik made changes -
            julien.fritsch Julien Fritsch made changes -
            Fix Version/s 10.4 [ 22408 ]
            Fix Version/s 10.5 [ 23123 ]
            Fix Version/s 10.6 [ 24028 ]
            psergei Sergei Petrunia added a comment - - edited

            Debugging bar 's patch:

            MariaDB [test]> explain select * from t1 join t2 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+-----------+-------+--------------------------+
            | id   | select_type | table | type  | possible_keys | key  | key_len | ref       | rows  | Extra                    |
            +------+-------------+-------+-------+---------------+------+---------+-----------+-------+--------------------------+
            |    1 | SIMPLE      | t2    | index | a             | a    | 131     | NULL      | 10000 | Using where; Using index |
            |    1 | SIMPLE      | t1    | ref   | a             | a    | 99      | test.t2.a | 1     | Using where; Using index |
            +------+-------------+-------+-------+---------------+------+---------+-----------+-------+--------------------------+
            

            The CONVERT() function in not injected, but ref access on t2.a=... is not considered:

                        "ref_optimizer_key_uses": [
                          {
                            "table": "t1",
                            "field": "a",
                            "equals": "t2.a",
                            "null_rejecting": true
                          }
                        ]
            

            We skip it here:

              Field_longstr::cmp_to_string_with_stricter_collation(const Item_bool_func *cond,
                                                                   const Item *item) const
              { 
            =>  return cmp_is_done_using_type_handler_of_this(cond, item) &&
                       (charset() == cond->compare_collation() ||
                        cond->compare_collation()->state & MY_CS_BINSORT);
              }
            

            here, cmp_is_done_using_type_handler_of_this(...) = true

            but charset() = my_charset_utf8mb4_general_ci
            while cond->compare_collation()=my_charset_utf8mb3_general_ci

            Is it correct that cond->compare_collation()=my_charset_utf8mb3_general_ci ? We should compare in MB4...

            psergei Sergei Petrunia added a comment - - edited Debugging bar 's patch: MariaDB [test]> explain select * from t1 join t2 on t1.a=t2.a; +------+-------------+-------+-------+---------------+------+---------+-----------+-------+--------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +------+-------------+-------+-------+---------------+------+---------+-----------+-------+--------------------------+ | 1 | SIMPLE | t2 | index | a | a | 131 | NULL | 10000 | Using where; Using index | | 1 | SIMPLE | t1 | ref | a | a | 99 | test.t2.a | 1 | Using where; Using index | +------+-------------+-------+-------+---------------+------+---------+-----------+-------+--------------------------+ The CONVERT() function in not injected, but ref access on t2.a=... is not considered: "ref_optimizer_key_uses": [ { "table": "t1", "field": "a", "equals": "t2.a", "null_rejecting": true } ] We skip it here: Field_longstr::cmp_to_string_with_stricter_collation( const Item_bool_func *cond, const Item *item) const { => return cmp_is_done_using_type_handler_of_this(cond, item) && (charset() == cond->compare_collation() || cond->compare_collation()->state & MY_CS_BINSORT); } here, cmp_is_done_using_type_handler_of_this(...) = true but charset() = my_charset_utf8mb4_general_ci while cond->compare_collation()=my_charset_utf8mb3_general_ci Is it correct that cond->compare_collation()=my_charset_utf8mb3_general_ci ? We should compare in MB4...
            psergei Sergei Petrunia added a comment - - edited

            Made cond->compare_collation() be the MB4.

            Now, got to the point where the join query constructs a lookup key.
            It's here:

            (gdb) wher 12
              #0  my_convert_fix (to_cs=0x555557a4ab20 <my_charset_utf8mb3_general_ci>, to=0x7fff1801c7f3 "", to_length=96, from_cs=0x555557a4b060 <my_charset_utf8mb4_general_ci>,    from=0x7fff180286ea "1", from_length=1, nchars=32, copy_status=0x7fffec9a1a00, conv_status=0x7fffec9a1a10) at /home/psergey/dev-git2/10.6-charset-fixes/strings/ctype.c:   1291  
              #1  0x0000555555fa0511 in String_copier::well_formed_copy (this=0x7fffec9a1a00, to_cs=0x555557a4ab20 <my_charset_utf8mb3_general_ci>, to=0x7fff1801c7f3 "",              to_length=96, from_cs=0x555557a4b060 <my_charset_utf8mb4_general_ci>, from=0x7fff180286ea "1", from_length=1, nchars=32) at /home/psergey/dev-git2/10.6-charset-fixes/sql/ sql_string.cc:1148
              #2  0x0000555556237dba in Field_longstr::well_formed_copy_with_check (this=0x7fff18999190, to=0x7fff1801c7f3 "", to_length=96, from_cs=0x555557a4b060                    <my_charset_utf8mb4_general_ci>, from=0x7fff180286ea "1", from_length=1, nchars=32, count_spaces=true, copy_length=0x7fffec9a1a78) at /home/psergey/dev-git2/10.6-charset- fixes/sql/field.h:2196
              #3  0x0000555556228f40 in Field_varstring::store (this=0x7fff18999190, from=0x7fff180286ea "1", length=1, cs=0x555557a4b060 <my_charset_utf8mb4_general_ci>) at /home/   psergey/dev-git2/10.6-charset-fixes/sql/field.cc:7870
              #4  0x000055555623d3f0 in Field::do_field_string (copy=0x7fff1801c918) at /home/psergey/dev-git2/10.6-charset-fixes/sql/field_conv.cc:372
              #5  0x000055555623cdd6 in do_copy_null (copy=0x7fff1801c918) at /home/psergey/dev-git2/10.6-charset-fixes/sql/field_conv.cc:246
              #6  0x0000555555f5a3fe in store_key_field::copy_inner (this=0x7fff1801c8f0) at /home/psergey/dev-git2/10.6-charset-fixes/sql/sql_select.h:1980
              #7  0x0000555555f5a166 in store_key::copy (this=0x7fff1801c8f0, thd=0x7fff18000d78) at /home/psergey/dev-git2/10.6-charset-fixes/sql/sql_select.h:1924
              #8  0x0000555555f43448 in cp_buffer_from_ref (thd=0x7fff18000d78, table=0x7fff180207f8, ref=0x7fff1801bec8) at /home/psergey/dev-git2/10.6-charset-fixes/sql/sql_select. cc:25586
              #9  0x0000555555f3ae1a in join_read_always_key (tab=0x7fff1801bc90) at /home/psergey/dev-git2/10.6-charset-fixes/sql/sql_select.cc:22474
              #10 0x0000555555f39138 in sub_select (join=0x7fff18019490, join_tab=0x7fff1801bc90, end_of_records=false) at /home/psergey/dev-git2/10.6-charset-fixes/sql/sql_select.cc:21700
              #11 0x0000555555f399ca in evaluate_join_record (join=0x7fff18019490, join_tab=0x7fff1801b8d0, error=0) at /home/psergey/dev-git2/10.6-charset-fixes/sql/sql_select.cc:   21933 
            

            Some of the functions along the stack trace should be the one where we use a different key construction algorithm.

            Should it be Field::do_field_string ?

            psergei Sergei Petrunia added a comment - - edited Made cond->compare_collation() be the MB4. Now, got to the point where the join query constructs a lookup key. It's here: (gdb) wher 12 #0 my_convert_fix (to_cs=0x555557a4ab20 <my_charset_utf8mb3_general_ci>, to=0x7fff1801c7f3 "", to_length=96, from_cs=0x555557a4b060 <my_charset_utf8mb4_general_ci>, from=0x7fff180286ea "1", from_length=1, nchars=32, copy_status=0x7fffec9a1a00, conv_status=0x7fffec9a1a10) at /home/psergey/dev-git2/10.6-charset-fixes/strings/ctype.c: 1291 #1 0x0000555555fa0511 in String_copier::well_formed_copy (this=0x7fffec9a1a00, to_cs=0x555557a4ab20 <my_charset_utf8mb3_general_ci>, to=0x7fff1801c7f3 "", to_length=96, from_cs=0x555557a4b060 <my_charset_utf8mb4_general_ci>, from=0x7fff180286ea "1", from_length=1, nchars=32) at /home/psergey/dev-git2/10.6-charset-fixes/sql/ sql_string.cc:1148 #2 0x0000555556237dba in Field_longstr::well_formed_copy_with_check (this=0x7fff18999190, to=0x7fff1801c7f3 "", to_length=96, from_cs=0x555557a4b060 <my_charset_utf8mb4_general_ci>, from=0x7fff180286ea "1", from_length=1, nchars=32, count_spaces=true, copy_length=0x7fffec9a1a78) at /home/psergey/dev-git2/10.6-charset- fixes/sql/field.h:2196 #3 0x0000555556228f40 in Field_varstring::store (this=0x7fff18999190, from=0x7fff180286ea "1", length=1, cs=0x555557a4b060 <my_charset_utf8mb4_general_ci>) at /home/ psergey/dev-git2/10.6-charset-fixes/sql/field.cc:7870 #4 0x000055555623d3f0 in Field::do_field_string (copy=0x7fff1801c918) at /home/psergey/dev-git2/10.6-charset-fixes/sql/field_conv.cc:372 #5 0x000055555623cdd6 in do_copy_null (copy=0x7fff1801c918) at /home/psergey/dev-git2/10.6-charset-fixes/sql/field_conv.cc:246 #6 0x0000555555f5a3fe in store_key_field::copy_inner (this=0x7fff1801c8f0) at /home/psergey/dev-git2/10.6-charset-fixes/sql/sql_select.h:1980 #7 0x0000555555f5a166 in store_key::copy (this=0x7fff1801c8f0, thd=0x7fff18000d78) at /home/psergey/dev-git2/10.6-charset-fixes/sql/sql_select.h:1924 #8 0x0000555555f43448 in cp_buffer_from_ref (thd=0x7fff18000d78, table=0x7fff180207f8, ref=0x7fff1801bec8) at /home/psergey/dev-git2/10.6-charset-fixes/sql/sql_select. cc:25586 #9 0x0000555555f3ae1a in join_read_always_key (tab=0x7fff1801bc90) at /home/psergey/dev-git2/10.6-charset-fixes/sql/sql_select.cc:22474 #10 0x0000555555f39138 in sub_select (join=0x7fff18019490, join_tab=0x7fff1801bc90, end_of_records=false) at /home/psergey/dev-git2/10.6-charset-fixes/sql/sql_select.cc:21700 #11 0x0000555555f399ca in evaluate_join_record (join=0x7fff18019490, join_tab=0x7fff1801b8d0, error=0) at /home/psergey/dev-git2/10.6-charset-fixes/sql/sql_select.cc: 21933 Some of the functions along the stack trace should be the one where we use a different key construction algorithm. Should it be Field::do_field_string ?

            Attempt to look at how constants will be handled:

            set names utf8mb4; 
            explain select * from t1 where t1.a='abc';
            +------+-------------+-------+------+---------------+------+---------+-------+------+--------------------------+
            | id   | select_type | table | type | possible_keys | key  | key_len | ref   | rows | Extra                    |
            +------+-------------+-------+------+---------------+------+---------+-------+------+--------------------------+
            |    1 | SIMPLE      | t1    | ref  | a             | a    | 99      | const | 1    | Using where; Using index |
            +------+-------------+-------+------+---------------+------+---------+-------+------+--------------------------+
            

            Here, the logic in Item_func_or_sum::agg_arg_charsets_for_comparison converts 'abc' to utf8mb3. If conversion fails, an error is produced:

            MariaDB [test]> select * from t1 where t1.a='ab𝌆';
            ERROR 1271 (HY000): Illegal mix of collations for operation '='
            

            psergei Sergei Petrunia added a comment - Attempt to look at how constants will be handled: set names utf8mb4; explain select * from t1 where t1.a='abc'; +------+-------------+-------+------+---------------+------+---------+-------+------+--------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +------+-------------+-------+------+---------------+------+---------+-------+------+--------------------------+ | 1 | SIMPLE | t1 | ref | a | a | 99 | const | 1 | Using where; Using index | +------+-------------+-------+------+---------------+------+---------+-------+------+--------------------------+ Here, the logic in Item_func_or_sum::agg_arg_charsets_for_comparison converts 'abc' to utf8mb3. If conversion fails, an error is produced: MariaDB [test]> select * from t1 where t1.a='ab𝌆'; ERROR 1271 (HY000): Illegal mix of collations for operation '='
            psergei Sergei Petrunia added a comment - - edited

            Stack trace when using an Item value for lookup...

              #0  my_convert_fix (to_cs=0x555557a4ab20 <my_charset_utf8mb3_general_ci>, to=0x7fff1801bb7b "", to_length=96, from_cs=0x555557a4b060 <my_charset_utf8mb4_general_ci>, from=0x7fffec9a1a90 "11",       from_length=2, nchars=32, copy_status=0x7fffec9a19a0, conv_status=0x7fffec9a19b0) at /home/psergey/dev-git2/10.6-charset-fixes/strings/ctype.c:1294
              #1  0x0000555555fa0511 in String_copier::well_formed_copy (this=0x7fffec9a19a0, to_cs=0x555557a4ab20 <my_charset_utf8mb3_general_ci>, to=0x7fff1801bb7b "", to_length=96, from_cs=0x555557a4b060      <my_charset_utf8mb4_general_ci>, from=0x7fffec9a1a90 "11", from_length=2, nchars=32) at /home/psergey/dev-git2/10.6-charset-fixes/sql/sql_string.cc:1148
              #2  0x0000555556237dba in Field_longstr::well_formed_copy_with_check (this=0x7fff1801bca0, to=0x7fff1801bb7b "", to_length=96, from_cs=0x555557a4b060 <my_charset_utf8mb4_general_ci>,                from=0x7fffec9a1a90 "11", from_length=2, nchars=32, count_spaces=true, copy_length=0x7fffec9a1a18) at /home/psergey/dev-git2/10.6-charset-fixes/sql/field.h:2196
              #3  0x0000555556228f40 in Field_varstring::store (this=0x7fff1801bca0, from=0x7fffec9a1a90 "11", length=2, cs=0x555557a4b060 <my_charset_utf8mb4_general_ci>) at /home/psergey/dev-git2/10.6-charset- fixes/sql/field.cc:7870
              #4  0x000055555628372a in Item::save_str_in_field (this=0x7fff18018338, field=0x7fff1801bca0, no_conversions=true) at /home/psergey/dev-git2/10.6-charset-fixes/sql/item.cc:6777
              #5  0x000055555612e7b6 in Type_handler_string_result::Item_save_in_field (this=0x555557a8d640 <type_handler_varchar>, item=0x7fff18018338, field=0x7fff1801bca0, no_conversions=true) at /home/       psergey/dev-git2/10.6-charset-fixes/sql/sql_type.cc:4324
              #6  0x00005555562839a1 in Item::save_in_field (this=0x7fff18018338, field=0x7fff1801bca0, no_conversions=true) at /home/psergey/dev-git2/10.6-charset-fixes/sql/item.cc:6815
              #7  0x0000555555f5a634 in store_key_item::copy_inner (this=0x7fff1801bc68) at /home/psergey/dev-git2/10.6-charset-fixes/sql/sql_select.h:2031
              #8  0x0000555555f5a166 in store_key::copy (this=0x7fff1801bc68, thd=0x7fff18000d78) at /home/psergey/dev-git2/10.6-charset-fixes/sql/sql_select.h:1924
              #9  0x0000555555f43448 in cp_buffer_from_ref (thd=0x7fff18000d78, table=0x7fff180207f8, ref=0x7fff1899d2c8) at /home/psergey/dev-git2/10.6-charset-fixes/sql/sql_select.cc:25586
              #10 0x0000555555f3ae1a in join_read_always_key (tab=0x7fff1899d090) at /home/psergey/dev-git2/10.6-charset-fixes/sql/sql_select.cc:22474
            

            Comparing with field-to-field copy:

              #0  my_convert_fix 
              #1  String_copier::well_formed_copy 
              #2  Field_longstr::well_formed_copy_with_check 
              #3  Field_varstring::store 
              #4  Item::save_str_in_field
              #5  Type_handler_string_result::Item_save_in_field
            

            vs

              #0  my_convert_fix 
              #1  String_copier::well_formed_copy 
              #2  Field_longstr::well_formed_copy_with_check 
              #3  Field_varstring::store 
              #4  Field::do_field_string 
              #5  do_copy_null
            

            psergei Sergei Petrunia added a comment - - edited Stack trace when using an Item value for lookup... #0 my_convert_fix (to_cs=0x555557a4ab20 <my_charset_utf8mb3_general_ci>, to=0x7fff1801bb7b "", to_length=96, from_cs=0x555557a4b060 <my_charset_utf8mb4_general_ci>, from=0x7fffec9a1a90 "11", from_length=2, nchars=32, copy_status=0x7fffec9a19a0, conv_status=0x7fffec9a19b0) at /home/psergey/dev-git2/10.6-charset-fixes/strings/ctype.c:1294 #1 0x0000555555fa0511 in String_copier::well_formed_copy (this=0x7fffec9a19a0, to_cs=0x555557a4ab20 <my_charset_utf8mb3_general_ci>, to=0x7fff1801bb7b "", to_length=96, from_cs=0x555557a4b060 <my_charset_utf8mb4_general_ci>, from=0x7fffec9a1a90 "11", from_length=2, nchars=32) at /home/psergey/dev-git2/10.6-charset-fixes/sql/sql_string.cc:1148 #2 0x0000555556237dba in Field_longstr::well_formed_copy_with_check (this=0x7fff1801bca0, to=0x7fff1801bb7b "", to_length=96, from_cs=0x555557a4b060 <my_charset_utf8mb4_general_ci>, from=0x7fffec9a1a90 "11", from_length=2, nchars=32, count_spaces=true, copy_length=0x7fffec9a1a18) at /home/psergey/dev-git2/10.6-charset-fixes/sql/field.h:2196 #3 0x0000555556228f40 in Field_varstring::store (this=0x7fff1801bca0, from=0x7fffec9a1a90 "11", length=2, cs=0x555557a4b060 <my_charset_utf8mb4_general_ci>) at /home/psergey/dev-git2/10.6-charset- fixes/sql/field.cc:7870 #4 0x000055555628372a in Item::save_str_in_field (this=0x7fff18018338, field=0x7fff1801bca0, no_conversions=true) at /home/psergey/dev-git2/10.6-charset-fixes/sql/item.cc:6777 #5 0x000055555612e7b6 in Type_handler_string_result::Item_save_in_field (this=0x555557a8d640 <type_handler_varchar>, item=0x7fff18018338, field=0x7fff1801bca0, no_conversions=true) at /home/ psergey/dev-git2/10.6-charset-fixes/sql/sql_type.cc:4324 #6 0x00005555562839a1 in Item::save_in_field (this=0x7fff18018338, field=0x7fff1801bca0, no_conversions=true) at /home/psergey/dev-git2/10.6-charset-fixes/sql/item.cc:6815 #7 0x0000555555f5a634 in store_key_item::copy_inner (this=0x7fff1801bc68) at /home/psergey/dev-git2/10.6-charset-fixes/sql/sql_select.h:2031 #8 0x0000555555f5a166 in store_key::copy (this=0x7fff1801bc68, thd=0x7fff18000d78) at /home/psergey/dev-git2/10.6-charset-fixes/sql/sql_select.h:1924 #9 0x0000555555f43448 in cp_buffer_from_ref (thd=0x7fff18000d78, table=0x7fff180207f8, ref=0x7fff1899d2c8) at /home/psergey/dev-git2/10.6-charset-fixes/sql/sql_select.cc:25586 #10 0x0000555555f3ae1a in join_read_always_key (tab=0x7fff1899d090) at /home/psergey/dev-git2/10.6-charset-fixes/sql/sql_select.cc:22474 Comparing with field-to-field copy: #0 my_convert_fix #1 String_copier::well_formed_copy #2 Field_longstr::well_formed_copy_with_check #3 Field_varstring::store #4 Item::save_str_in_field #5 Type_handler_string_result::Item_save_in_field vs #0 my_convert_fix #1 String_copier::well_formed_copy #2 Field_longstr::well_formed_copy_with_check #3 Field_varstring::store #4 Field::do_field_string #5 do_copy_null
            psergei Sergei Petrunia added a comment - - edited

            Range Optimizer:

            Applicability is checked here (applicable):

              #0  Field_longstr::cmp_to_string_with_stricter_collation (this=0x7fff18020dc0, cond=0x7fff68015a08, item=0x7fff680157c8) at /home/psergey/dev-git2/10.6-charset-fixes/sql/field.cc:7434
              #1  0x0000555556227922 in Field_longstr::can_optimize_range (this=0x7fff18020dc0, cond=0x7fff68015a08, item=0x7fff680157c8, is_eq_func=true) at /home/psergey/dev-git2/10.6-charset-fixes/sql/field.cc:7475
              #2  0x00005555564043bd in Field::can_optimize_scalar_range (this=0x7fff18020dc0, param=0x7fffec956890, key_part=0x7fff68022440, cond=0x7fff68015a08, op=SCALAR_CMP_EQ, value=0x7fff680157c8) at /home/psergey/dev-git2/10.6-charset-fixes/sql/opt_range.cc:8834
              #3  0x0000555556404b55 in Field_str::get_mm_leaf (this=0x7fff18020dc0, prm=0x7fffec956890, key_part=0x7fff68022440, cond=0x7fff68015a08, op=SCALAR_CMP_EQ, value=0x7fff680157c8) at /home/psergey/dev-git2/10.6-charset-fixes/sql/opt_range.cc:8952
              #4  0x00005555564042d1 in Item_bool_func::get_mm_leaf (this=0x7fff68015a08, param=0x7fffec956890, field=0x7fff18020dc0, key_part=0x7fff68022440, functype=Item_func::EQ_FUNC, value=0x7fff680157c8) at /home/psergey/dev-git2/10.6-charset-fixes/sql/opt_range.cc:8818
              #5  0x000055555640371c in Item_bool_func::get_mm_parts (this=0x7fff68015a08, param=0x7fffec956890, field=0x7fff18020dc0, type=Item_func::EQ_FUNC, value=0x7fff680157c8) at /home/psergey/dev-git2/10.6-charset-fixes/sql/opt_range.cc:8653
              #6  0x0000555555e15dd7 in Item_bool_func2_with_rev::get_func_mm_tree (this=0x7fff68015a08, param=0x7fffec956890, field=0x7fff18020dc0, value=0x7fff680157c8) at /home/psergey/dev-git2/10.6-charset-fixes/sql/item_cmpfunc.h:497
              #7  0x0000555556402529 in Item_bool_func::get_full_func_mm_tree (this=0x7fff68015a08, param=0x7fffec956890, field_item=0x7fff680158f0, value=0x7fff680157c8) at /home/psergey/dev-git2/10.6-charset-fixes/sql/opt_range.cc:8312
              #8  0x0000555555e15a16 in Item_bool_func::get_full_func_mm_tree_for_args (this=0x7fff68015a08, param=0x7fffec956890, item=0x7fff680158f0, value=0x7fff680157c8) at /home/psergey/dev-git2/10.6-charset-fixes/sql/item_cmpfunc.h:208
              #9  0x0000555555e15ffb in Item_bool_func2_with_rev::get_mm_tree (this=0x7fff68015a08, param=0x7fffec956890, cond_ptr=0x7fff68019b08) at /home/psergey/dev-git2/10.6-charset-fixes/sql/item_cmpfunc.h:525
              #10 0x00005555563f429b in SQL_SELECT::test_quick_select (this=0x7fff68019b00, thd=0x7fff68001228, keys_to_use={buffer = {1}}, prev_tables=0, limit=18446744073709551615, force_quick_range=false, ordered_output=false, remove_false_parts_of_where=true, only_single_index_range_scan=false) at /home/psergey/dev-git2/10.6-charset-fixes/sql/opt_range.cc:2892
            

            And then the lookup tuple is constructed here:

            (gdb) wher
              #0  my_convert_fix (to_cs=0x555557a4ab20 <my_charset_utf8mb3_general_ci>, to=0x7fff18020caa "", to_length=96,               from_cs=0x555557a4b060 <my_charset_utf8mb4_general_ci>, from=0x7fff189a2d0e "1", from_length=1, nchars=32,                    copy_status=0x7fffec955d00, conv_status=0x7fffec955d10) at /home/psergey/dev-git2/10.6-charset-fixes/strings/ctype.c:1291
              #1  0x0000555555fa0511 in String_copier::well_formed_copy (this=0x7fffec955d00, to_cs=0x555557a4ab20                        <my_charset_utf8mb3_general_ci>, to=0x7fff18020caa "", to_length=96, from_cs=0x555557a4b060 <my_charset_utf8mb4_general_ci>,  from=0x7fff189a2d0e "1", from_length=1, nchars=32) at /home/psergey/dev-git2/10.6-charset-fixes/sql/sql_string.cc:1148
              #2  0x0000555556237dba in Field_longstr::well_formed_copy_with_check (this=0x7fff18020dc0, to=0x7fff18020caa "",            to_length=96, from_cs=0x555557a4b060 <my_charset_utf8mb4_general_ci>, from=0x7fff189a2d0e "1", from_length=1, nchars=32,      count_spaces=true, copy_length=0x7fffec955d78) at /home/psergey/dev-git2/10.6-charset-fixes/sql/field.h:2196
              #3  0x0000555556228f40 in Field_varstring::store (this=0x7fff18020dc0, from=0x7fff189a2d0e "1", length=1, cs=0x555557a4b060 <my_charset_utf8mb4_general_ci>) at /home/psergey/dev-git2/10.6-charset-fixes/sql/field.cc:7870
              #4  0x000055555604de43 in Field::save_in_field_str (this=0x7fff189a2f38, to=0x7fff18020dc0) at /home/psergey/dev-git2/10.6- charset-fixes/sql/field.h:760
              #5  0x000055555604f43d in Field_str::save_in_field (this=0x7fff189a2f38, to=0x7fff18020dc0) at /home/psergey/dev-git2/10.6- charset-fixes/sql/field.h:2114
              #6  0x000055555604e012 in Field::store_field (this=0x7fff18020dc0, from=0x7fff189a2f38) at /home/psergey/dev-git2/10.6-charset-fixes/sql/field.h:922
              #7  0x000055555623edc5 in field_conv_incompatible (to=0x7fff18020dc0, from=0x7fff189a2f38) at /home/psergey/dev-git2/10.6-charset-fixes/sql/field_conv.cc:899
              #8  0x000055555623ee21 in field_conv (to=0x7fff18020dc0, from=0x7fff189a2f38) at /home/psergey/dev-git2/10.6-charset-fixes/sql/field_conv.cc:910
              #9  0x000055555628334d in save_field_in_field (from=0x7fff189a2f38, null_value=0x7fff6801582c, to=0x7fff18020dc0, no_conversions=true) at /home/psergey/dev-git2/10.6-charset-fixes/sql/item.cc:6662
              #10 0x0000555556283583 in Item_field::save_in_field (this=0x7fff680157c8, to=0x7fff18020dc0, no_conversions=true) at /home/psergey/dev-git2/10.6-charset-fixes/sql/item.cc:6712
              #11 0x0000555556272831 in Item::save_in_field_no_warnings (this=0x7fff680157c8, field=0x7fff18020dc0, no_conversions=true) at /home/psergey/dev-git2/10.6-charset-fixes/sql/item.cc:1514
              #12 0x0000555556404b85 in Field_str::get_mm_leaf (this=0x7fff18020dc0, prm=0x7fffec956890, key_part=0x7fff68022440, cond=0x7fff68015a08, op=SCALAR_CMP_EQ, value=0x7fff680157c8) at /home/psergey/dev-git2/10.6-charset-fixes/sql/opt_range.cc:8954
              #13 0x00005555564042d1 in Item_bool_func::get_mm_leaf (this=0x7fff68015a08, param=0x7fffec956890, field=0x7fff18020dc0, key_part=0x7fff68022440, functype=Item_func::EQ_FUNC, value=0x7fff680157c8) at /home/psergey/dev-git2/10.6-charset-fixes/sql/opt_range.cc:8818
              #14 0x000055555640371c in Item_bool_func::get_mm_parts (this=0x7fff68015a08, param=0x7fffec956890, field=0x7fff18020dc0, type=Item_func::EQ_FUNC, value=0x7fff680157c8) at /home/psergey/dev-git2/10.6-charset-fixes/sql/opt_range.cc:8653
              #15 0x0000555555e15dd7 in Item_bool_func2_with_rev::get_func_mm_tree (this=0x7fff68015a08, param=0x7fffec956890, field=0x7fff18020dc0, value=0x7fff680157c8) at /home/psergey/dev-git2/10.6-charset-fixes/sql/item_cmpfunc.h:497
              #16 0x0000555556402529 in Item_bool_func::get_full_func_mm_tree (this=0x7fff68015a08, param=0x7fffec956890, field_item=0x7fff680158f0, value=0x7fff680157c8) at /home/psergey/dev-git2/10.6-charset-fixes/sql/opt_range.cc:8312
              #17 0x0000555555e15a16 in Item_bool_func::get_full_func_mm_tree_for_args (this=0x7fff68015a08, param=0x7fffec956890, item=0x7fff680158f0, value=0x7fff680157c8) at /home/psergey/dev-git2/10.6-charset-fixes/sql/item_cmpfunc.h:208
              #18 0x0000555555e15ffb in Item_bool_func2_with_rev::get_mm_tree (this=0x7fff68015a08, param=0x7fffec956890, cond_ptr=0x7fff68019b08) at /home/psergey/dev-git2/10.6-charset-fixes/sql/item_cmpfunc.h:525
              #19 0x00005555563f429b in SQL_SELECT::test_quick_select (this=0x7fff68019b00, thd=0x7fff68001228, keys_to_use={buffer = {1}}, prev_tables=0, limit=18446744073709551615, force_quick_range=false, ordered_output=false, remove_false_parts_of_where=true, only_single_index_range_scan=false) at /home/psergey/dev-git2/10.6-charset-fixes/sql/opt_range.cc:2892
            

              #0  my_convert_fix
              #1  0x0000555555fa0511 in String_copier::well_formed_copy
              #2  0x0000555556237dba in Field_longstr::well_formed_copy_with_check
              #3  0x0000555556228f40 in Field_varstring::store
              #4  0x000055555604de43 in Field::save_in_field_str
              #5  0x000055555604f43d in Field_str::save_in_field
              #6  0x000055555604e012 in Field::store_field
              #7  0x000055555623edc5 in field_conv_incompatible
              #8  0x000055555623ee21 in field_conv
              #9  0x000055555628334d in save_field_in_field
              #10 0x0000555556283583 in Item_field::save_in_field
              #11 0x0000555556272831 in Item::save_in_field_no_warnings
              #12 0x0000555556404b85 in Field_str::get_mm_leaf
              #13 0x00005555564042d1 in Item_bool_func::get_mm_leaf
            

            psergei Sergei Petrunia added a comment - - edited Range Optimizer: Applicability is checked here (applicable): #0 Field_longstr::cmp_to_string_with_stricter_collation (this=0x7fff18020dc0, cond=0x7fff68015a08, item=0x7fff680157c8) at /home/psergey/dev-git2/10.6-charset-fixes/sql/field.cc:7434 #1 0x0000555556227922 in Field_longstr::can_optimize_range (this=0x7fff18020dc0, cond=0x7fff68015a08, item=0x7fff680157c8, is_eq_func=true) at /home/psergey/dev-git2/10.6-charset-fixes/sql/field.cc:7475 #2 0x00005555564043bd in Field::can_optimize_scalar_range (this=0x7fff18020dc0, param=0x7fffec956890, key_part=0x7fff68022440, cond=0x7fff68015a08, op=SCALAR_CMP_EQ, value=0x7fff680157c8) at /home/psergey/dev-git2/10.6-charset-fixes/sql/opt_range.cc:8834 #3 0x0000555556404b55 in Field_str::get_mm_leaf (this=0x7fff18020dc0, prm=0x7fffec956890, key_part=0x7fff68022440, cond=0x7fff68015a08, op=SCALAR_CMP_EQ, value=0x7fff680157c8) at /home/psergey/dev-git2/10.6-charset-fixes/sql/opt_range.cc:8952 #4 0x00005555564042d1 in Item_bool_func::get_mm_leaf (this=0x7fff68015a08, param=0x7fffec956890, field=0x7fff18020dc0, key_part=0x7fff68022440, functype=Item_func::EQ_FUNC, value=0x7fff680157c8) at /home/psergey/dev-git2/10.6-charset-fixes/sql/opt_range.cc:8818 #5 0x000055555640371c in Item_bool_func::get_mm_parts (this=0x7fff68015a08, param=0x7fffec956890, field=0x7fff18020dc0, type=Item_func::EQ_FUNC, value=0x7fff680157c8) at /home/psergey/dev-git2/10.6-charset-fixes/sql/opt_range.cc:8653 #6 0x0000555555e15dd7 in Item_bool_func2_with_rev::get_func_mm_tree (this=0x7fff68015a08, param=0x7fffec956890, field=0x7fff18020dc0, value=0x7fff680157c8) at /home/psergey/dev-git2/10.6-charset-fixes/sql/item_cmpfunc.h:497 #7 0x0000555556402529 in Item_bool_func::get_full_func_mm_tree (this=0x7fff68015a08, param=0x7fffec956890, field_item=0x7fff680158f0, value=0x7fff680157c8) at /home/psergey/dev-git2/10.6-charset-fixes/sql/opt_range.cc:8312 #8 0x0000555555e15a16 in Item_bool_func::get_full_func_mm_tree_for_args (this=0x7fff68015a08, param=0x7fffec956890, item=0x7fff680158f0, value=0x7fff680157c8) at /home/psergey/dev-git2/10.6-charset-fixes/sql/item_cmpfunc.h:208 #9 0x0000555555e15ffb in Item_bool_func2_with_rev::get_mm_tree (this=0x7fff68015a08, param=0x7fffec956890, cond_ptr=0x7fff68019b08) at /home/psergey/dev-git2/10.6-charset-fixes/sql/item_cmpfunc.h:525 #10 0x00005555563f429b in SQL_SELECT::test_quick_select (this=0x7fff68019b00, thd=0x7fff68001228, keys_to_use={buffer = {1}}, prev_tables=0, limit=18446744073709551615, force_quick_range=false, ordered_output=false, remove_false_parts_of_where=true, only_single_index_range_scan=false) at /home/psergey/dev-git2/10.6-charset-fixes/sql/opt_range.cc:2892 And then the lookup tuple is constructed here: (gdb) wher #0 my_convert_fix (to_cs=0x555557a4ab20 <my_charset_utf8mb3_general_ci>, to=0x7fff18020caa "", to_length=96, from_cs=0x555557a4b060 <my_charset_utf8mb4_general_ci>, from=0x7fff189a2d0e "1", from_length=1, nchars=32, copy_status=0x7fffec955d00, conv_status=0x7fffec955d10) at /home/psergey/dev-git2/10.6-charset-fixes/strings/ctype.c:1291 #1 0x0000555555fa0511 in String_copier::well_formed_copy (this=0x7fffec955d00, to_cs=0x555557a4ab20 <my_charset_utf8mb3_general_ci>, to=0x7fff18020caa "", to_length=96, from_cs=0x555557a4b060 <my_charset_utf8mb4_general_ci>, from=0x7fff189a2d0e "1", from_length=1, nchars=32) at /home/psergey/dev-git2/10.6-charset-fixes/sql/sql_string.cc:1148 #2 0x0000555556237dba in Field_longstr::well_formed_copy_with_check (this=0x7fff18020dc0, to=0x7fff18020caa "", to_length=96, from_cs=0x555557a4b060 <my_charset_utf8mb4_general_ci>, from=0x7fff189a2d0e "1", from_length=1, nchars=32, count_spaces=true, copy_length=0x7fffec955d78) at /home/psergey/dev-git2/10.6-charset-fixes/sql/field.h:2196 #3 0x0000555556228f40 in Field_varstring::store (this=0x7fff18020dc0, from=0x7fff189a2d0e "1", length=1, cs=0x555557a4b060 <my_charset_utf8mb4_general_ci>) at /home/psergey/dev-git2/10.6-charset-fixes/sql/field.cc:7870 #4 0x000055555604de43 in Field::save_in_field_str (this=0x7fff189a2f38, to=0x7fff18020dc0) at /home/psergey/dev-git2/10.6- charset-fixes/sql/field.h:760 #5 0x000055555604f43d in Field_str::save_in_field (this=0x7fff189a2f38, to=0x7fff18020dc0) at /home/psergey/dev-git2/10.6- charset-fixes/sql/field.h:2114 #6 0x000055555604e012 in Field::store_field (this=0x7fff18020dc0, from=0x7fff189a2f38) at /home/psergey/dev-git2/10.6-charset-fixes/sql/field.h:922 #7 0x000055555623edc5 in field_conv_incompatible (to=0x7fff18020dc0, from=0x7fff189a2f38) at /home/psergey/dev-git2/10.6-charset-fixes/sql/field_conv.cc:899 #8 0x000055555623ee21 in field_conv (to=0x7fff18020dc0, from=0x7fff189a2f38) at /home/psergey/dev-git2/10.6-charset-fixes/sql/field_conv.cc:910 #9 0x000055555628334d in save_field_in_field (from=0x7fff189a2f38, null_value=0x7fff6801582c, to=0x7fff18020dc0, no_conversions=true) at /home/psergey/dev-git2/10.6-charset-fixes/sql/item.cc:6662 #10 0x0000555556283583 in Item_field::save_in_field (this=0x7fff680157c8, to=0x7fff18020dc0, no_conversions=true) at /home/psergey/dev-git2/10.6-charset-fixes/sql/item.cc:6712 #11 0x0000555556272831 in Item::save_in_field_no_warnings (this=0x7fff680157c8, field=0x7fff18020dc0, no_conversions=true) at /home/psergey/dev-git2/10.6-charset-fixes/sql/item.cc:1514 #12 0x0000555556404b85 in Field_str::get_mm_leaf (this=0x7fff18020dc0, prm=0x7fffec956890, key_part=0x7fff68022440, cond=0x7fff68015a08, op=SCALAR_CMP_EQ, value=0x7fff680157c8) at /home/psergey/dev-git2/10.6-charset-fixes/sql/opt_range.cc:8954 #13 0x00005555564042d1 in Item_bool_func::get_mm_leaf (this=0x7fff68015a08, param=0x7fffec956890, field=0x7fff18020dc0, key_part=0x7fff68022440, functype=Item_func::EQ_FUNC, value=0x7fff680157c8) at /home/psergey/dev-git2/10.6-charset-fixes/sql/opt_range.cc:8818 #14 0x000055555640371c in Item_bool_func::get_mm_parts (this=0x7fff68015a08, param=0x7fffec956890, field=0x7fff18020dc0, type=Item_func::EQ_FUNC, value=0x7fff680157c8) at /home/psergey/dev-git2/10.6-charset-fixes/sql/opt_range.cc:8653 #15 0x0000555555e15dd7 in Item_bool_func2_with_rev::get_func_mm_tree (this=0x7fff68015a08, param=0x7fffec956890, field=0x7fff18020dc0, value=0x7fff680157c8) at /home/psergey/dev-git2/10.6-charset-fixes/sql/item_cmpfunc.h:497 #16 0x0000555556402529 in Item_bool_func::get_full_func_mm_tree (this=0x7fff68015a08, param=0x7fffec956890, field_item=0x7fff680158f0, value=0x7fff680157c8) at /home/psergey/dev-git2/10.6-charset-fixes/sql/opt_range.cc:8312 #17 0x0000555555e15a16 in Item_bool_func::get_full_func_mm_tree_for_args (this=0x7fff68015a08, param=0x7fffec956890, item=0x7fff680158f0, value=0x7fff680157c8) at /home/psergey/dev-git2/10.6-charset-fixes/sql/item_cmpfunc.h:208 #18 0x0000555555e15ffb in Item_bool_func2_with_rev::get_mm_tree (this=0x7fff68015a08, param=0x7fffec956890, cond_ptr=0x7fff68019b08) at /home/psergey/dev-git2/10.6-charset-fixes/sql/item_cmpfunc.h:525 #19 0x00005555563f429b in SQL_SELECT::test_quick_select (this=0x7fff68019b00, thd=0x7fff68001228, keys_to_use={buffer = {1}}, prev_tables=0, limit=18446744073709551615, force_quick_range=false, ordered_output=false, remove_false_parts_of_where=true, only_single_index_range_scan=false) at /home/psergey/dev-git2/10.6-charset-fixes/sql/opt_range.cc:2892 #0 my_convert_fix #1 0x0000555555fa0511 in String_copier::well_formed_copy #2 0x0000555556237dba in Field_longstr::well_formed_copy_with_check #3 0x0000555556228f40 in Field_varstring::store #4 0x000055555604de43 in Field::save_in_field_str #5 0x000055555604f43d in Field_str::save_in_field #6 0x000055555604e012 in Field::store_field #7 0x000055555623edc5 in field_conv_incompatible #8 0x000055555623ee21 in field_conv #9 0x000055555628334d in save_field_in_field #10 0x0000555556283583 in Item_field::save_in_field #11 0x0000555556272831 in Item::save_in_field_no_warnings #12 0x0000555556404b85 in Field_str::get_mm_leaf #13 0x00005555564042d1 in Item_bool_func::get_mm_leaf

            The common part of trace between execution variants and the range optimizer is:

              #0  my_convert_fix
              #1  String_copier::well_formed_copy
              #2  Field_longstr::well_formed_copy_with_check
              #3  Field_varstring::store
            

            Would it be safe to put "Charset narrowing" into one of these functions?

            psergei Sergei Petrunia added a comment - The common part of trace between execution variants and the range optimizer is: #0 my_convert_fix #1 String_copier::well_formed_copy #2 Field_longstr::well_formed_copy_with_check #3 Field_varstring::store Would it be safe to put "Charset narrowing" into one of these functions?
            psergei Sergei Petrunia made changes -
            Description h1. Task setting
            Consider this example:

            {code:sql}
            create table t1 (
              a varchar(32) collate utf8mb3_general_ci,
              key(a)
            );
            insert into t1 select seq from seq_1_to_10000;

            create table t2 (
              a varchar(32) collate utf8mb4_general_ci,
              key(a)
            );
            insert into t2 select seq from seq_1_to_10000;
            {code}

            The optimizer is able to use the index on t2.a for ref access:

            {code}
            explain select * from t1 straight_join t2 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t2 | ref | a | a | 131 | func | 1 | Using where; Using index |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            {code}

            But not the index on t1.a:
            {code}
            explain select * from t2 straight_join t1 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | 1 | SIMPLE | t2 | index | a | a | 131 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using where; Using index; Using join buffer (flat, BNL join) |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            {code}

            This is because internally t1.a=t2.a is converted into:

            {code}
            convert(t1.a using utf8mb4) = t2.a
            {code}

            This MDEV is about making it possible to use ref access on t2.a.

            h2. From utf8mb3_val=utf8mb4_val to index lookups

            When we have
            {code}
            utf8mb3_key_col = utf8mb4_value
            {code}
            the normal way to compare is to convert the mb3 value into mb4 and then doing the comparison.

            But what if we want to use
            {code}
            INDEX(utf8mb3_key_col)
            {code}
            to find the matching value?

            UTF8-MB4 consists of
            * BMP (Basic Multi-lingual Plane) characters. These can be represented in UTF8MB3.
            * Characters from other Planes, named Supplementary or Astral planes.

            Consider a lookup value that is in UTF8MB4.
            * If the characters are in BMP, we can just make the lookup.
            * What if there are characters outside BMP? They *may* or may not compare as equal to characters in BMP. It depends on the collation.

            In {{utf8mb4_general_ci}}, all characters outside the BMP compare as equal to each other
            and equal to the {{U+FFFD}} Replacement Character, which is in BMP (In the code, search for {{MY_CS_REPLACEMENT_CHARACTER}}).

            A demo
            {code:sql}
            create table t12 (
              mb3 varchar(10) collate utf8mb3_general_ci,
              mb4 varchar(10) collate utf8mb4_general_ci
            );

            -- mb3 has the Replacement Character.
            -- mb4 has some non-BMP character.
            insert into t12 values (
             _utf8mb3 0xEFBFBD,
             _utf8mb4 0xF09D8C86);
            {code}

            {code:sql}
            set names utf8mb4;
            MariaDB [test]> select mb3, hex(mb3), mb4, hex(mb4) from t12;
            +------+----------+------+----------+
            | mb3 | hex(mb3) | mb4 | hex(mb4) |
            +------+----------+------+----------+
            | � | EFBFBD | 𝌆 | F09D8C86 |
            +------+----------+------+----------+
            1 row in set (0.001 sec)
            {code}

            Non-bmp character is equal to the Replacement Character:
            {code}
            MariaDB [test]> select mb3=mb4 from t12;
            +---------+
            | mb3=mb4 |
            +---------+
            | 1 |
            +---------+
            1 row in set (0.004 sec)
            {code}

            If one converts UTF8MB4 to MB3, MariaDB will replace non-BMP characters with '?' (Question Mark), NOT with replacement character:
            {code}
            MariaDB [test]> select convert(mb4 using utf8mb3) from t12;
            +----------------------------+
            | convert(mb4 using utf8mb3) |
            +----------------------------+
            | ? |
            +----------------------------+
            1 row in set, 1 warning (0.004 sec)

            MariaDB [test]> show warnings;
            +---------+------+------------------------------------------------------------+
            | Level | Code | Message |
            +---------+------+------------------------------------------------------------+
            | Warning | 1977 | Cannot convert 'utf8mb4' character 0xF09D8C86 to 'utf8mb3' |
            +---------+------+------------------------------------------------------------+
            {code}

            h2. Suggested solution #1: rewrite

            So, currently, when one calls fix_fields() for
            {code:sql}
              mb3_key_col= mb4_col
            {code}
            it will be converted into
            {code:sql}
              convert(mb3_key_col using utf8mb4) = mb4_col
            {code}

            The optimizer only considers ref access for column references, that is, for
            {code:sql}
               mb3_key_col= ...
            {code}
            The idea is to add a rewritten equality, that is, convert "mb3_key_col=mb4_col" into
            {code:sql}
              convert(mb3_key_col using utf8mb4) = mb4_col -- AS BEFORE
              AND
              mb3_key_col= convert_with_replacement(mb4_col, replacement_char) -- ADDED PART
            {code}

            The {{convert_with_replacement()}} is like {{CONVERT()}}, except that it
            * replaces non-converted symbols with the Replacement Character.
            * doesn't emit any warnings.

            Advantage of this approach:
            - the fix is not intrusive. We don't need to change the way string comparisons are done.
            - the fix covers all optimizer. Range optimizer will be able to handle {{utf8mb3_col= utf8mb4_const}}.

            Disadvantages:
            # One condition becomes two. We can work around this by making {{make_join_select()}} ignore the "ADDED PART".
            # What if "mb4_col" is a complex expression. How do we make a copy of it to use in the ADDED PART? (Can we use Item_direct_ref here?)

            h2. Suggested solution #2: change the way string comparisons are done

            Currently, fix_fields() code will inject {{CONVERT()}} call if character sets do not match.
            The idea is not not inject it, and instead allow string comparisons to use different character sets. They would dynamically check if the charset is the right one and convert if it is not..
            (Problem: it's not clear how intrusive this will be, it might be hard to put this behavior behind an @@optimizer_switch flag).

            h1. Task setting
            Consider this example:

            {code:sql}
            create table t1 (
              a varchar(32) collate utf8mb3_general_ci,
              key(a)
            );
            insert into t1 select seq from seq_1_to_10000;

            create table t2 (
              a varchar(32) collate utf8mb4_general_ci,
              key(a)
            );
            insert into t2 select seq from seq_1_to_10000;
            {code}

            The optimizer is able to use the index on t2.a for ref access:

            {code}
            explain select * from t1 straight_join t2 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t2 | ref | a | a | 131 | func | 1 | Using where; Using index |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            {code}

            But not the index on t1.a:
            {code}
            explain select * from t2 straight_join t1 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | 1 | SIMPLE | t2 | index | a | a | 131 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using where; Using index; Using join buffer (flat, BNL join) |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            {code}

            This is because internally t1.a=t2.a is converted into:

            {code}
            convert(t1.a using utf8mb4) = t2.a
            {code}

            This MDEV is about making it possible to use ref access on t2.a.

            h2. From utf8mb3_val=utf8mb4_val to index lookups

            When we have
            {code}
            utf8mb3_key_col = utf8mb4_value
            {code}
            the normal way to compare is to convert the mb3 value into mb4 and then doing the comparison.

            But what if we want to use
            {code}
            INDEX(utf8mb3_key_col)
            {code}
            to find the matching value?

            UTF8-MB4 consists of
            * BMP (Basic Multi-lingual Plane) characters. These can be represented in UTF8MB3.
            * Characters from other Planes, named Supplementary or Astral planes.

            Consider a lookup value that is in UTF8MB4.
            * If the characters are in BMP, we can just make the lookup.
            * What if there are characters outside BMP? They *may* or may not compare as equal to characters in BMP. It depends on the collation.

            In {{utf8mb4_general_ci}}, all characters outside the BMP compare as equal to each other
            and equal to the {{U+FFFD}} Replacement Character, which is in BMP (In the code, search for {{MY_CS_REPLACEMENT_CHARACTER}}).

            A demo
            {code:sql}
            create table t12 (
              mb3 varchar(10) collate utf8mb3_general_ci,
              mb4 varchar(10) collate utf8mb4_general_ci
            );

            -- mb3 has the Replacement Character.
            -- mb4 has some non-BMP character.
            insert into t12 values (
             _utf8mb3 0xEFBFBD,
             _utf8mb4 0xF09D8C86);
            {code}

            {code:sql}
            set names utf8mb4;
            MariaDB [test]> select mb3, hex(mb3), mb4, hex(mb4) from t12;
            +------+----------+------+----------+
            | mb3 | hex(mb3) | mb4 | hex(mb4) |
            +------+----------+------+----------+
            | � | EFBFBD | 𝌆 | F09D8C86 |
            +------+----------+------+----------+
            1 row in set (0.001 sec)
            {code}

            Non-bmp character is equal to the Replacement Character:
            {code}
            MariaDB [test]> select mb3=mb4 from t12;
            +---------+
            | mb3=mb4 |
            +---------+
            | 1 |
            +---------+
            1 row in set (0.004 sec)
            {code}

            If one converts UTF8MB4 to MB3, MariaDB will replace non-BMP characters with '?' (Question Mark), NOT with replacement character:
            {code}
            MariaDB [test]> select convert(mb4 using utf8mb3) from t12;
            +----------------------------+
            | convert(mb4 using utf8mb3) |
            +----------------------------+
            | ? |
            +----------------------------+
            1 row in set, 1 warning (0.004 sec)

            MariaDB [test]> show warnings;
            +---------+------+------------------------------------------------------------+
            | Level | Code | Message |
            +---------+------+------------------------------------------------------------+
            | Warning | 1977 | Cannot convert 'utf8mb4' character 0xF09D8C86 to 'utf8mb3' |
            +---------+------+------------------------------------------------------------+
            {code}

            h2. Suggested solution #1: rewrite

            So, currently, when one calls fix_fields() for
            {code:sql}
              mb3_key_col= mb4_col
            {code}
            it will be converted into
            {code:sql}
              convert(mb3_key_col using utf8mb4) = mb4_col
            {code}

            The optimizer only considers ref access for column references, that is, for
            {code:sql}
               mb3_key_col= ...
            {code}
            The idea is to add a rewritten equality, that is, convert "mb3_key_col=mb4_col" into
            {code:sql}
              convert(mb3_key_col using utf8mb4) = mb4_col -- AS BEFORE
              AND
              mb3_key_col= convert_with_replacement(mb4_col, replacement_char) -- ADDED PART
            {code}

            The {{convert_with_replacement()}} is like {{CONVERT()}}, except that it
            * replaces non-converted symbols with the Replacement Character.
            * doesn't emit any warnings.

            Advantage of this approach:
            - the fix is not intrusive. We don't need to change the way string comparisons are done.
            - the fix covers all optimizer. Range optimizer will be able to handle {{utf8mb3_col= utf8mb4_const}}.

            Disadvantages:
            # One condition becomes two. We can work around this by making {{make_join_select()}} ignore the "ADDED PART".
            # What if "mb4_col" is a complex expression. How do we make a copy of it to use in the ADDED PART? (Can we use Item_direct_ref here?)

            h2. Suggested solution #2: change the way string comparisons are done

            Currently, fix_fields() code will inject {{CONVERT()}} call if character sets do not match.
            The idea is not not inject it, and instead allow string comparisons to use different character sets. They would dynamically check if the charset is the right one and convert if it is not..
            (Problem: it's not clear how intrusive this will be, it might be hard to put this behavior behind an @@optimizer_switch flag).

            h2. The optimization must be switchable

            One must be able to turn the new functionality on and off. Possible ways to do it:
            * {{optimizer_switch}} flag.
            * {{old_mode}} variable.

            Possible issues with optimizer_switch flag:
            * we only want the optimization to be switchable in the stable versions. New versions should have it always ON, there is no need for a switch.
            * we might need to check the flag on prepare stage (where the current code adds {{CONVERT(...}}} call). Optimizer stuff at the prepare stage is not good.
            psergei Sergei Petrunia made changes -
            Description h1. Task setting
            Consider this example:

            {code:sql}
            create table t1 (
              a varchar(32) collate utf8mb3_general_ci,
              key(a)
            );
            insert into t1 select seq from seq_1_to_10000;

            create table t2 (
              a varchar(32) collate utf8mb4_general_ci,
              key(a)
            );
            insert into t2 select seq from seq_1_to_10000;
            {code}

            The optimizer is able to use the index on t2.a for ref access:

            {code}
            explain select * from t1 straight_join t2 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t2 | ref | a | a | 131 | func | 1 | Using where; Using index |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            {code}

            But not the index on t1.a:
            {code}
            explain select * from t2 straight_join t1 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | 1 | SIMPLE | t2 | index | a | a | 131 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using where; Using index; Using join buffer (flat, BNL join) |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            {code}

            This is because internally t1.a=t2.a is converted into:

            {code}
            convert(t1.a using utf8mb4) = t2.a
            {code}

            This MDEV is about making it possible to use ref access on t2.a.

            h2. From utf8mb3_val=utf8mb4_val to index lookups

            When we have
            {code}
            utf8mb3_key_col = utf8mb4_value
            {code}
            the normal way to compare is to convert the mb3 value into mb4 and then doing the comparison.

            But what if we want to use
            {code}
            INDEX(utf8mb3_key_col)
            {code}
            to find the matching value?

            UTF8-MB4 consists of
            * BMP (Basic Multi-lingual Plane) characters. These can be represented in UTF8MB3.
            * Characters from other Planes, named Supplementary or Astral planes.

            Consider a lookup value that is in UTF8MB4.
            * If the characters are in BMP, we can just make the lookup.
            * What if there are characters outside BMP? They *may* or may not compare as equal to characters in BMP. It depends on the collation.

            In {{utf8mb4_general_ci}}, all characters outside the BMP compare as equal to each other
            and equal to the {{U+FFFD}} Replacement Character, which is in BMP (In the code, search for {{MY_CS_REPLACEMENT_CHARACTER}}).

            A demo
            {code:sql}
            create table t12 (
              mb3 varchar(10) collate utf8mb3_general_ci,
              mb4 varchar(10) collate utf8mb4_general_ci
            );

            -- mb3 has the Replacement Character.
            -- mb4 has some non-BMP character.
            insert into t12 values (
             _utf8mb3 0xEFBFBD,
             _utf8mb4 0xF09D8C86);
            {code}

            {code:sql}
            set names utf8mb4;
            MariaDB [test]> select mb3, hex(mb3), mb4, hex(mb4) from t12;
            +------+----------+------+----------+
            | mb3 | hex(mb3) | mb4 | hex(mb4) |
            +------+----------+------+----------+
            | � | EFBFBD | 𝌆 | F09D8C86 |
            +------+----------+------+----------+
            1 row in set (0.001 sec)
            {code}

            Non-bmp character is equal to the Replacement Character:
            {code}
            MariaDB [test]> select mb3=mb4 from t12;
            +---------+
            | mb3=mb4 |
            +---------+
            | 1 |
            +---------+
            1 row in set (0.004 sec)
            {code}

            If one converts UTF8MB4 to MB3, MariaDB will replace non-BMP characters with '?' (Question Mark), NOT with replacement character:
            {code}
            MariaDB [test]> select convert(mb4 using utf8mb3) from t12;
            +----------------------------+
            | convert(mb4 using utf8mb3) |
            +----------------------------+
            | ? |
            +----------------------------+
            1 row in set, 1 warning (0.004 sec)

            MariaDB [test]> show warnings;
            +---------+------+------------------------------------------------------------+
            | Level | Code | Message |
            +---------+------+------------------------------------------------------------+
            | Warning | 1977 | Cannot convert 'utf8mb4' character 0xF09D8C86 to 'utf8mb3' |
            +---------+------+------------------------------------------------------------+
            {code}

            h2. Suggested solution #1: rewrite

            So, currently, when one calls fix_fields() for
            {code:sql}
              mb3_key_col= mb4_col
            {code}
            it will be converted into
            {code:sql}
              convert(mb3_key_col using utf8mb4) = mb4_col
            {code}

            The optimizer only considers ref access for column references, that is, for
            {code:sql}
               mb3_key_col= ...
            {code}
            The idea is to add a rewritten equality, that is, convert "mb3_key_col=mb4_col" into
            {code:sql}
              convert(mb3_key_col using utf8mb4) = mb4_col -- AS BEFORE
              AND
              mb3_key_col= convert_with_replacement(mb4_col, replacement_char) -- ADDED PART
            {code}

            The {{convert_with_replacement()}} is like {{CONVERT()}}, except that it
            * replaces non-converted symbols with the Replacement Character.
            * doesn't emit any warnings.

            Advantage of this approach:
            - the fix is not intrusive. We don't need to change the way string comparisons are done.
            - the fix covers all optimizer. Range optimizer will be able to handle {{utf8mb3_col= utf8mb4_const}}.

            Disadvantages:
            # One condition becomes two. We can work around this by making {{make_join_select()}} ignore the "ADDED PART".
            # What if "mb4_col" is a complex expression. How do we make a copy of it to use in the ADDED PART? (Can we use Item_direct_ref here?)

            h2. Suggested solution #2: change the way string comparisons are done

            Currently, fix_fields() code will inject {{CONVERT()}} call if character sets do not match.
            The idea is not not inject it, and instead allow string comparisons to use different character sets. They would dynamically check if the charset is the right one and convert if it is not..
            (Problem: it's not clear how intrusive this will be, it might be hard to put this behavior behind an @@optimizer_switch flag).

            h2. The optimization must be switchable

            One must be able to turn the new functionality on and off. Possible ways to do it:
            * {{optimizer_switch}} flag.
            * {{old_mode}} variable.

            Possible issues with optimizer_switch flag:
            * we only want the optimization to be switchable in the stable versions. New versions should have it always ON, there is no need for a switch.
            * we might need to check the flag on prepare stage (where the current code adds {{CONVERT(...}}} call). Optimizer stuff at the prepare stage is not good.
            h1. Task setting
            Consider this example:

            {code:sql}
            create table t1 (
              a varchar(32) collate utf8mb3_general_ci,
              key(a)
            );
            insert into t1 select seq from seq_1_to_10000;

            create table t2 (
              a varchar(32) collate utf8mb4_general_ci,
              key(a)
            );
            insert into t2 select seq from seq_1_to_10000;
            {code}

            The optimizer is able to use the index on t2.a for ref access:

            {code}
            explain select * from t1 straight_join t2 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t2 | ref | a | a | 131 | func | 1 | Using where; Using index |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            {code}

            But not the index on t1.a:
            {code}
            explain select * from t2 straight_join t1 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | 1 | SIMPLE | t2 | index | a | a | 131 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using where; Using index; Using join buffer (flat, BNL join) |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            {code}

            This is because internally t1.a=t2.a is converted into:

            {code}
            convert(t1.a using utf8mb4) = t2.a
            {code}

            This MDEV is about making it possible to use ref access on t2.a.

            h2. From utf8mb3_val=utf8mb4_val to index lookups

            When we have
            {code}
            utf8mb3_key_col = utf8mb4_value
            {code}
            the normal way to compare is to convert the mb3 value into mb4 and then doing the comparison.

            But what if we want to use
            {code}
            INDEX(utf8mb3_key_col)
            {code}
            to find the matching value?

            UTF8-MB4 consists of
            * BMP (Basic Multi-lingual Plane) characters. These can be represented in UTF8MB3.
            * Characters from other Planes, named Supplementary or Astral planes.

            Consider a lookup value that is in UTF8MB4.
            * If the characters are in BMP, we can just make the lookup.
            * What if there are characters outside BMP? They *may* or may not compare as equal to characters in BMP. It depends on the collation.

            In {{utf8mb4_general_ci}}, all characters outside the BMP compare as equal to each other
            and equal to the {{U+FFFD}} Replacement Character, which is in BMP (In the code, search for {{MY_CS_REPLACEMENT_CHARACTER}}).

            A demo
            {code:sql}
            create table t12 (
              mb3 varchar(10) collate utf8mb3_general_ci,
              mb4 varchar(10) collate utf8mb4_general_ci
            );

            -- mb3 has the Replacement Character.
            -- mb4 has some non-BMP character.
            insert into t12 values (
             _utf8mb3 0xEFBFBD,
             _utf8mb4 0xF09D8C86);
            {code}

            {code:sql}
            set names utf8mb4;
            MariaDB [test]> select mb3, hex(mb3), mb4, hex(mb4) from t12;
            +------+----------+------+----------+
            | mb3 | hex(mb3) | mb4 | hex(mb4) |
            +------+----------+------+----------+
            | � | EFBFBD | 𝌆 | F09D8C86 |
            +------+----------+------+----------+
            1 row in set (0.001 sec)
            {code}

            Non-bmp character is equal to the Replacement Character:
            {code}
            MariaDB [test]> select mb3=mb4 from t12;
            +---------+
            | mb3=mb4 |
            +---------+
            | 1 |
            +---------+
            1 row in set (0.004 sec)
            {code}

            If one converts UTF8MB4 to MB3, MariaDB will replace non-BMP characters with '?' (Question Mark), NOT with replacement character:
            {code}
            MariaDB [test]> select convert(mb4 using utf8mb3) from t12;
            +----------------------------+
            | convert(mb4 using utf8mb3) |
            +----------------------------+
            | ? |
            +----------------------------+
            1 row in set, 1 warning (0.004 sec)

            MariaDB [test]> show warnings;
            +---------+------+------------------------------------------------------------+
            | Level | Code | Message |
            +---------+------+------------------------------------------------------------+
            | Warning | 1977 | Cannot convert 'utf8mb4' character 0xF09D8C86 to 'utf8mb3' |
            +---------+------+------------------------------------------------------------+
            {code}

            h2. Suggested solution #1: rewrite

            So, currently, when one calls fix_fields() for
            {code:sql}
              mb3_key_col= mb4_col
            {code}
            it will be converted into
            {code:sql}
              convert(mb3_key_col using utf8mb4) = mb4_col
            {code}

            The optimizer only considers ref access for column references, that is, for
            {code:sql}
               mb3_key_col= ...
            {code}
            The idea is to add a rewritten equality, that is, convert "mb3_key_col=mb4_col" into
            {code:sql}
              convert(mb3_key_col using utf8mb4) = mb4_col -- AS BEFORE
              AND
              mb3_key_col= convert_with_replacement(mb4_col, replacement_char) -- ADDED PART
            {code}

            The {{convert_with_replacement()}} is like {{CONVERT()}}, except that it
            * replaces non-converted symbols with the Replacement Character.
            * doesn't emit any warnings.

            Advantage of this approach:
            - the fix is not intrusive. We don't need to change the way string comparisons are done.
            - the fix covers all optimizer. Range optimizer will be able to handle {{utf8mb3_col= utf8mb4_const}}.

            Disadvantages:
            # One condition becomes two. We can work around this by making {{make_join_select()}} ignore the "ADDED PART".
            # What if "mb4_col" is a complex expression. How do we make a copy of it to use in the ADDED PART? (Can we use Item_direct_ref here?)

            h2. Suggested solution #2: change the way string comparisons are done

            Currently, fix_fields() code will inject {{CONVERT()}} call if character sets do not match.
            The idea is not not inject it, and instead allow string comparisons to use different character sets. They would dynamically check if the charset is the right one and convert if it is not..

            h2. The optimization must be switchable

            One must be able to turn the new functionality on and off. Possible ways to do it:
            * {{optimizer_switch}} flag.
            * {{old_mode}} variable.

            Possible issues with optimizer_switch flag:
            * we only want the optimization to be switchable in the stable versions. New versions should have it always ON, there is no need for a switch.
            * we might need to check the flag on prepare stage (where the current code adds {{CONVERT(...}}} call). Optimizer stuff at the prepare stage is not good.

            Tried to add this logic into String_copier::well_formed_copy() :

            +    if (from_cs == &my_charset_utf8mb4_general_ci &&
            +      to_cs == &my_charset_utf8mb3_general_ci)
            +      // Do charset narrowing
            

            This causes test failures. For example, the statement in ctype_utf8mb4.test

            ALTER TABLE t1 CONVERT TO CHARACTER SET utf8;
            

            starts to use narrowing and no longer produces the {{Incorrect value ... }} warnings.

            psergei Sergei Petrunia added a comment - Tried to add this logic into String_copier::well_formed_copy() : + if (from_cs == &my_charset_utf8mb4_general_ci && + to_cs == &my_charset_utf8mb3_general_ci) + // Do charset narrowing This causes test failures. For example, the statement in ctype_utf8mb4.test ALTER TABLE t1 CONVERT TO CHARACTER SET utf8; starts to use narrowing and no longer produces the {{Incorrect value ... }} warnings.

            The branch:
            https://github.com/MariaDB/server/tree/bb-10.6-mdev32113
            now has a commit with an unfinished patch for aproach #2:

            https://github.com/MariaDB/server/commit/2fb84b63fb1514c684a5d254ff0df12bb592fa95

            The commit comment

            ...
            There are multiple places where conversion is done:
            1. CharsetNarrowing#1: class store_key_field. Uses Field::do_field_string.
            2. CharsetNarrowing#2: class store_key_item, copy_inner()
            3. CharsetNarrowing#3: class store_key_item_const
            4. CharsetNarrowing#4: in Field_str::get_mm_leaf()
            ...
            

            lists the places where I had to put in (or plan to put in) charset narrowing hooks.

            psergei Sergei Petrunia added a comment - The branch: https://github.com/MariaDB/server/tree/bb-10.6-mdev32113 now has a commit with an unfinished patch for aproach #2: https://github.com/MariaDB/server/commit/2fb84b63fb1514c684a5d254ff0df12bb592fa95 The commit comment ... There are multiple places where conversion is done: 1. CharsetNarrowing#1: class store_key_field. Uses Field::do_field_string. 2. CharsetNarrowing#2: class store_key_item, copy_inner() 3. CharsetNarrowing#3: class store_key_item_const 4. CharsetNarrowing#4: in Field_str::get_mm_leaf() ... lists the places where I had to put in (or plan to put in) charset narrowing hooks.
            julien.fritsch Julien Fritsch made changes -
            Status Open [ 1 ] In Progress [ 3 ]

            A much smaller nearly-finished patch implementing approach #1:
            https://github.com/MariaDB/server/tree/bb-10.6-mdev32113-variant1

            psergei Sergei Petrunia added a comment - A much smaller nearly-finished patch implementing approach #1: https://github.com/MariaDB/server/tree/bb-10.6-mdev32113-variant1
            serg Sergei Golubchik made changes -
            Issue Type Bug [ 1 ] Task [ 3 ]
            serg Sergei Golubchik made changes -
            Fix Version/s 10.4 [ 22408 ]
            Fix Version/s 10.5 [ 23123 ]
            serg Sergei Golubchik made changes -
            Component/s Character Sets [ 10801 ]
            Component/s Optimizer [ 10200 ]
            psergei Sergei Petrunia made changes -
            Description h1. Task setting
            Consider this example:

            {code:sql}
            create table t1 (
              a varchar(32) collate utf8mb3_general_ci,
              key(a)
            );
            insert into t1 select seq from seq_1_to_10000;

            create table t2 (
              a varchar(32) collate utf8mb4_general_ci,
              key(a)
            );
            insert into t2 select seq from seq_1_to_10000;
            {code}

            The optimizer is able to use the index on t2.a for ref access:

            {code}
            explain select * from t1 straight_join t2 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t2 | ref | a | a | 131 | func | 1 | Using where; Using index |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            {code}

            But not the index on t1.a:
            {code}
            explain select * from t2 straight_join t1 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | 1 | SIMPLE | t2 | index | a | a | 131 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using where; Using index; Using join buffer (flat, BNL join) |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            {code}

            This is because internally t1.a=t2.a is converted into:

            {code}
            convert(t1.a using utf8mb4) = t2.a
            {code}

            This MDEV is about making it possible to use ref access on t2.a.

            h2. From utf8mb3_val=utf8mb4_val to index lookups

            When we have
            {code}
            utf8mb3_key_col = utf8mb4_value
            {code}
            the normal way to compare is to convert the mb3 value into mb4 and then doing the comparison.

            But what if we want to use
            {code}
            INDEX(utf8mb3_key_col)
            {code}
            to find the matching value?

            UTF8-MB4 consists of
            * BMP (Basic Multi-lingual Plane) characters. These can be represented in UTF8MB3.
            * Characters from other Planes, named Supplementary or Astral planes.

            Consider a lookup value that is in UTF8MB4.
            * If the characters are in BMP, we can just make the lookup.
            * What if there are characters outside BMP? They *may* or may not compare as equal to characters in BMP. It depends on the collation.

            In {{utf8mb4_general_ci}}, all characters outside the BMP compare as equal to each other
            and equal to the {{U+FFFD}} Replacement Character, which is in BMP (In the code, search for {{MY_CS_REPLACEMENT_CHARACTER}}).

            A demo
            {code:sql}
            create table t12 (
              mb3 varchar(10) collate utf8mb3_general_ci,
              mb4 varchar(10) collate utf8mb4_general_ci
            );

            -- mb3 has the Replacement Character.
            -- mb4 has some non-BMP character.
            insert into t12 values (
             _utf8mb3 0xEFBFBD,
             _utf8mb4 0xF09D8C86);
            {code}

            {code:sql}
            set names utf8mb4;
            MariaDB [test]> select mb3, hex(mb3), mb4, hex(mb4) from t12;
            +------+----------+------+----------+
            | mb3 | hex(mb3) | mb4 | hex(mb4) |
            +------+----------+------+----------+
            | � | EFBFBD | 𝌆 | F09D8C86 |
            +------+----------+------+----------+
            1 row in set (0.001 sec)
            {code}

            Non-bmp character is equal to the Replacement Character:
            {code}
            MariaDB [test]> select mb3=mb4 from t12;
            +---------+
            | mb3=mb4 |
            +---------+
            | 1 |
            +---------+
            1 row in set (0.004 sec)
            {code}

            If one converts UTF8MB4 to MB3, MariaDB will replace non-BMP characters with '?' (Question Mark), NOT with replacement character:
            {code}
            MariaDB [test]> select convert(mb4 using utf8mb3) from t12;
            +----------------------------+
            | convert(mb4 using utf8mb3) |
            +----------------------------+
            | ? |
            +----------------------------+
            1 row in set, 1 warning (0.004 sec)

            MariaDB [test]> show warnings;
            +---------+------+------------------------------------------------------------+
            | Level | Code | Message |
            +---------+------+------------------------------------------------------------+
            | Warning | 1977 | Cannot convert 'utf8mb4' character 0xF09D8C86 to 'utf8mb3' |
            +---------+------+------------------------------------------------------------+
            {code}

            h2. Suggested solution #1: rewrite

            So, currently, when one calls fix_fields() for
            {code:sql}
              mb3_key_col= mb4_col
            {code}
            it will be converted into
            {code:sql}
              convert(mb3_key_col using utf8mb4) = mb4_col
            {code}

            The optimizer only considers ref access for column references, that is, for
            {code:sql}
               mb3_key_col= ...
            {code}
            The idea is to add a rewritten equality, that is, convert "mb3_key_col=mb4_col" into
            {code:sql}
              convert(mb3_key_col using utf8mb4) = mb4_col -- AS BEFORE
              AND
              mb3_key_col= convert_with_replacement(mb4_col, replacement_char) -- ADDED PART
            {code}

            The {{convert_with_replacement()}} is like {{CONVERT()}}, except that it
            * replaces non-converted symbols with the Replacement Character.
            * doesn't emit any warnings.

            Advantage of this approach:
            - the fix is not intrusive. We don't need to change the way string comparisons are done.
            - the fix covers all optimizer. Range optimizer will be able to handle {{utf8mb3_col= utf8mb4_const}}.

            Disadvantages:
            # One condition becomes two. We can work around this by making {{make_join_select()}} ignore the "ADDED PART".
            # What if "mb4_col" is a complex expression. How do we make a copy of it to use in the ADDED PART? (Can we use Item_direct_ref here?)

            h2. Suggested solution #2: change the way string comparisons are done

            Currently, fix_fields() code will inject {{CONVERT()}} call if character sets do not match.
            The idea is not not inject it, and instead allow string comparisons to use different character sets. They would dynamically check if the charset is the right one and convert if it is not..

            h2. The optimization must be switchable

            One must be able to turn the new functionality on and off. Possible ways to do it:
            * {{optimizer_switch}} flag.
            * {{old_mode}} variable.

            Possible issues with optimizer_switch flag:
            * we only want the optimization to be switchable in the stable versions. New versions should have it always ON, there is no need for a switch.
            * we might need to check the flag on prepare stage (where the current code adds {{CONVERT(...}}} call). Optimizer stuff at the prepare stage is not good.
            h1. Task setting
            Consider this example:

            {code:sql}
            create table t1 (
              a varchar(32) collate utf8mb3_general_ci,
              key(a)
            );
            insert into t1 select seq from seq_1_to_10000;

            create table t2 (
              a varchar(32) collate utf8mb4_general_ci,
              key(a)
            );
            insert into t2 select seq from seq_1_to_10000;
            {code}

            The optimizer is able to use the index on t2.a for ref access:

            {code}
            explain select * from t1 straight_join t2 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t2 | ref | a | a | 131 | func | 1 | Using where; Using index |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            {code}

            But not the index on t1.a:
            {code}
            explain select * from t2 straight_join t1 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | 1 | SIMPLE | t2 | index | a | a | 131 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using where; Using index; Using join buffer (flat, BNL join) |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            {code}

            This is because internally t1.a=t2.a is converted into:

            {code}
            convert(t1.a using utf8mb4) = t2.a
            {code}

            This MDEV is about making it possible to use ref access on t2.a.

            h2. From utf8mb3_val=utf8mb4_val to index lookups

            When we have
            {code}
            utf8mb3_key_col = utf8mb4_value
            {code}
            the normal way to compare is to convert the mb3 value into mb4 and then do the comparison.

            But what if we want to use
            {code}
            INDEX(utf8mb3_key_col)
            {code}
            to find the matching value?

            UTF8-MB4 consists of
            * BMP (Basic Multi-lingual Plane) characters. These can be represented in UTF8MB3.
            * Characters from other Planes, named Supplementary or Astral planes.

            Consider a lookup value that is in UTF8MB4.
            * If the characters are in BMP, we can just make the lookup.
            * What if there are characters outside BMP? They *may* or may not compare as equal to characters in BMP. It depends on the collation.

            In {{utf8mb4_general_ci}}, all characters outside the BMP compare as equal to each other
            and equal to the {{U+FFFD}} Replacement Character, which is in BMP (In the code, search for {{MY_CS_REPLACEMENT_CHARACTER}}).

            A demo
            {code:sql}
            create table t12 (
              mb3 varchar(10) collate utf8mb3_general_ci,
              mb4 varchar(10) collate utf8mb4_general_ci
            );

            -- mb3 has the Replacement Character.
            -- mb4 has some non-BMP character.
            insert into t12 values (
             _utf8mb3 0xEFBFBD,
             _utf8mb4 0xF09D8C86);
            {code}

            {code:sql}
            set names utf8mb4;
            MariaDB [test]> select mb3, hex(mb3), mb4, hex(mb4) from t12;
            +------+----------+------+----------+
            | mb3 | hex(mb3) | mb4 | hex(mb4) |
            +------+----------+------+----------+
            | � | EFBFBD | 𝌆 | F09D8C86 |
            +------+----------+------+----------+
            1 row in set (0.001 sec)
            {code}

            Non-bmp character is equal to the Replacement Character:
            {code}
            MariaDB [test]> select mb3=mb4 from t12;
            +---------+
            | mb3=mb4 |
            +---------+
            | 1 |
            +---------+
            1 row in set (0.004 sec)
            {code}

            If one converts UTF8MB4 to MB3, MariaDB will replace non-BMP characters with '?' (Question Mark), NOT with replacement character:
            {code}
            MariaDB [test]> select convert(mb4 using utf8mb3) from t12;
            +----------------------------+
            | convert(mb4 using utf8mb3) |
            +----------------------------+
            | ? |
            +----------------------------+
            1 row in set, 1 warning (0.004 sec)

            MariaDB [test]> show warnings;
            +---------+------+------------------------------------------------------------+
            | Level | Code | Message |
            +---------+------+------------------------------------------------------------+
            | Warning | 1977 | Cannot convert 'utf8mb4' character 0xF09D8C86 to 'utf8mb3' |
            +---------+------+------------------------------------------------------------+
            {code}

            h2. Suggested solution #1: rewrite

            So, currently, when one calls fix_fields() for
            {code:sql}
              mb3_key_col= mb4_col
            {code}
            it will be converted into
            {code:sql}
              convert(mb3_key_col using utf8mb4) = mb4_col
            {code}

            The optimizer only considers ref access for column references, that is, for
            {code:sql}
               mb3_key_col= ...
            {code}
            The idea is to add a rewritten equality, that is, convert "mb3_key_col=mb4_col" into
            {code:sql}
              convert(mb3_key_col using utf8mb4) = mb4_col -- AS BEFORE
              AND
              mb3_key_col= convert_with_replacement(mb4_col, replacement_char) -- ADDED PART
            {code}

            The {{convert_with_replacement()}} is like {{CONVERT()}}, except that it
            * replaces non-converted symbols with the Replacement Character.
            * doesn't emit any warnings.

            Advantage of this approach:
            - the fix is not intrusive. We don't need to change the way string comparisons are done.
            - the fix covers all optimizer. Range optimizer will be able to handle {{utf8mb3_col= utf8mb4_const}}.

            Disadvantages:
            # One condition becomes two. We can work around this by making {{make_join_select()}} ignore the "ADDED PART".
            # What if "mb4_col" is a complex expression. How do we make a copy of it to use in the ADDED PART? (Can we use Item_direct_ref here?)

            **See below for link to an almost-finished patch. This solution receives pushback because "it's a hack"**

            h2. Suggested solution #2: change the way string comparisons are done

            Currently, fix_fields() code will inject {{CONVERT()}} call if character sets do not match.
            The idea is not not inject it, and instead allow string comparisons to use different character sets.
            They would dynamically check if the charset is the right one and convert if it is not..

            **See below for link to a patch with attempt to implement it**
            The problem is that we cannot set the conversion code to do "Charset Narrowing" whenever we're converting from MB4 to MB3. There are cases where the server is converting from MB4 to MB3 and replaces non-BMP characters with '?' (question mark). We need to keep this behavior.

            I've made an attempt to add hooks to only perform Charset Narrowing when we're producing an index lookup value.
            I had to add hooks in 4 places already:
            {code}
            There are multiple places where conversion is done:
            1. CharsetNarrowing#1: class store_key_field. Uses Field::do_field_string.
            2. CharsetNarrowing#2: class store_key_item, copy_inner()
            3. CharsetNarrowing#3: class store_key_item_const
            4. CharsetNarrowing#4: in Field_str::get_mm_leaf()
            ...
            {code}
            The patch seems to be very intrusive and risky.

            h2. Suggested solution #3: pseudo-charset

            h1. Other stuff
            h2. The optimization must be switchable

            One must be able to turn the new functionality on and off. Possible ways to do it:
            * {{optimizer_switch}} flag.
            * {{old_mode}} variable.

            Possible issues with optimizer_switch flag:
            * we only want the optimization to be switchable in the stable versions. New versions should have it always ON, there is no need for a switch.
            * we might need to check the flag on prepare stage (where the current code adds {{CONVERT(...}}} call). Optimizer stuff at the prepare stage is not good.
            psergei Sergei Petrunia made changes -
            Description h1. Task setting
            Consider this example:

            {code:sql}
            create table t1 (
              a varchar(32) collate utf8mb3_general_ci,
              key(a)
            );
            insert into t1 select seq from seq_1_to_10000;

            create table t2 (
              a varchar(32) collate utf8mb4_general_ci,
              key(a)
            );
            insert into t2 select seq from seq_1_to_10000;
            {code}

            The optimizer is able to use the index on t2.a for ref access:

            {code}
            explain select * from t1 straight_join t2 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t2 | ref | a | a | 131 | func | 1 | Using where; Using index |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            {code}

            But not the index on t1.a:
            {code}
            explain select * from t2 straight_join t1 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | 1 | SIMPLE | t2 | index | a | a | 131 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using where; Using index; Using join buffer (flat, BNL join) |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            {code}

            This is because internally t1.a=t2.a is converted into:

            {code}
            convert(t1.a using utf8mb4) = t2.a
            {code}

            This MDEV is about making it possible to use ref access on t2.a.

            h2. From utf8mb3_val=utf8mb4_val to index lookups

            When we have
            {code}
            utf8mb3_key_col = utf8mb4_value
            {code}
            the normal way to compare is to convert the mb3 value into mb4 and then do the comparison.

            But what if we want to use
            {code}
            INDEX(utf8mb3_key_col)
            {code}
            to find the matching value?

            UTF8-MB4 consists of
            * BMP (Basic Multi-lingual Plane) characters. These can be represented in UTF8MB3.
            * Characters from other Planes, named Supplementary or Astral planes.

            Consider a lookup value that is in UTF8MB4.
            * If the characters are in BMP, we can just make the lookup.
            * What if there are characters outside BMP? They *may* or may not compare as equal to characters in BMP. It depends on the collation.

            In {{utf8mb4_general_ci}}, all characters outside the BMP compare as equal to each other
            and equal to the {{U+FFFD}} Replacement Character, which is in BMP (In the code, search for {{MY_CS_REPLACEMENT_CHARACTER}}).

            A demo
            {code:sql}
            create table t12 (
              mb3 varchar(10) collate utf8mb3_general_ci,
              mb4 varchar(10) collate utf8mb4_general_ci
            );

            -- mb3 has the Replacement Character.
            -- mb4 has some non-BMP character.
            insert into t12 values (
             _utf8mb3 0xEFBFBD,
             _utf8mb4 0xF09D8C86);
            {code}

            {code:sql}
            set names utf8mb4;
            MariaDB [test]> select mb3, hex(mb3), mb4, hex(mb4) from t12;
            +------+----------+------+----------+
            | mb3 | hex(mb3) | mb4 | hex(mb4) |
            +------+----------+------+----------+
            | � | EFBFBD | 𝌆 | F09D8C86 |
            +------+----------+------+----------+
            1 row in set (0.001 sec)
            {code}

            Non-bmp character is equal to the Replacement Character:
            {code}
            MariaDB [test]> select mb3=mb4 from t12;
            +---------+
            | mb3=mb4 |
            +---------+
            | 1 |
            +---------+
            1 row in set (0.004 sec)
            {code}

            If one converts UTF8MB4 to MB3, MariaDB will replace non-BMP characters with '?' (Question Mark), NOT with replacement character:
            {code}
            MariaDB [test]> select convert(mb4 using utf8mb3) from t12;
            +----------------------------+
            | convert(mb4 using utf8mb3) |
            +----------------------------+
            | ? |
            +----------------------------+
            1 row in set, 1 warning (0.004 sec)

            MariaDB [test]> show warnings;
            +---------+------+------------------------------------------------------------+
            | Level | Code | Message |
            +---------+------+------------------------------------------------------------+
            | Warning | 1977 | Cannot convert 'utf8mb4' character 0xF09D8C86 to 'utf8mb3' |
            +---------+------+------------------------------------------------------------+
            {code}

            h2. Suggested solution #1: rewrite

            So, currently, when one calls fix_fields() for
            {code:sql}
              mb3_key_col= mb4_col
            {code}
            it will be converted into
            {code:sql}
              convert(mb3_key_col using utf8mb4) = mb4_col
            {code}

            The optimizer only considers ref access for column references, that is, for
            {code:sql}
               mb3_key_col= ...
            {code}
            The idea is to add a rewritten equality, that is, convert "mb3_key_col=mb4_col" into
            {code:sql}
              convert(mb3_key_col using utf8mb4) = mb4_col -- AS BEFORE
              AND
              mb3_key_col= convert_with_replacement(mb4_col, replacement_char) -- ADDED PART
            {code}

            The {{convert_with_replacement()}} is like {{CONVERT()}}, except that it
            * replaces non-converted symbols with the Replacement Character.
            * doesn't emit any warnings.

            Advantage of this approach:
            - the fix is not intrusive. We don't need to change the way string comparisons are done.
            - the fix covers all optimizer. Range optimizer will be able to handle {{utf8mb3_col= utf8mb4_const}}.

            Disadvantages:
            # One condition becomes two. We can work around this by making {{make_join_select()}} ignore the "ADDED PART".
            # What if "mb4_col" is a complex expression. How do we make a copy of it to use in the ADDED PART? (Can we use Item_direct_ref here?)

            **See below for link to an almost-finished patch. This solution receives pushback because "it's a hack"**

            h2. Suggested solution #2: change the way string comparisons are done

            Currently, fix_fields() code will inject {{CONVERT()}} call if character sets do not match.
            The idea is not not inject it, and instead allow string comparisons to use different character sets.
            They would dynamically check if the charset is the right one and convert if it is not..

            **See below for link to a patch with attempt to implement it**
            The problem is that we cannot set the conversion code to do "Charset Narrowing" whenever we're converting from MB4 to MB3. There are cases where the server is converting from MB4 to MB3 and replaces non-BMP characters with '?' (question mark). We need to keep this behavior.

            I've made an attempt to add hooks to only perform Charset Narrowing when we're producing an index lookup value.
            I had to add hooks in 4 places already:
            {code}
            There are multiple places where conversion is done:
            1. CharsetNarrowing#1: class store_key_field. Uses Field::do_field_string.
            2. CharsetNarrowing#2: class store_key_item, copy_inner()
            3. CharsetNarrowing#3: class store_key_item_const
            4. CharsetNarrowing#4: in Field_str::get_mm_leaf()
            ...
            {code}
            The patch seems to be very intrusive and risky.

            h2. Suggested solution #3: pseudo-charset

            h1. Other stuff
            h2. The optimization must be switchable

            One must be able to turn the new functionality on and off. Possible ways to do it:
            * {{optimizer_switch}} flag.
            * {{old_mode}} variable.

            Possible issues with optimizer_switch flag:
            * we only want the optimization to be switchable in the stable versions. New versions should have it always ON, there is no need for a switch.
            * we might need to check the flag on prepare stage (where the current code adds {{CONVERT(...}}} call). Optimizer stuff at the prepare stage is not good.
            h1. Contents
            {code}
            1.Task setting
            2. From utf8mb3_val=utf8mb4_val to index lookups
            3. NEW: Can we handle non-equality comparisons too?
            4. How to implement this
            4.1 Suggested solution #1: rewrite
            4.2 Suggested solution #2: change the way string comparisons are done
            4.3 Suggested solution #3: pseudo-charset
            5. Other stuff
            5.1 The optimization must be switchable
            {code}

            h1. 1. Task setting
            Consider this example:

            {code:sql}
            create table t1 (
              a varchar(32) collate utf8mb3_general_ci,
              key(a)
            );
            insert into t1 select seq from seq_1_to_10000;

            create table t2 (
              a varchar(32) collate utf8mb4_general_ci,
              key(a)
            );
            insert into t2 select seq from seq_1_to_10000;
            {code}

            The optimizer is able to use the index on t2.a for ref access:

            {code}
            explain select * from t1 straight_join t2 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t2 | ref | a | a | 131 | func | 1 | Using where; Using index |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            {code}

            But not the index on t1.a:
            {code}
            explain select * from t2 straight_join t1 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | 1 | SIMPLE | t2 | index | a | a | 131 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using where; Using index; Using join buffer (flat, BNL join) |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            {code}

            This is because internally t1.a=t2.a is converted into:

            {code}
            convert(t1.a using utf8mb4) = t2.a
            {code}

            This MDEV is about making it possible to use ref access on t2.a.

            h2. 2. From utf8mb3_val=utf8mb4_val to index lookups

            When we have
            {code}
            utf8mb3_key_col = utf8mb4_value
            {code}
            the normal way to compare is to convert the mb3 value into mb4 and then do the comparison.

            But what if we want to use
            {code}
            INDEX(utf8mb3_key_col)
            {code}
            to find the matching value?

            UTF8-MB4 consists of
            * BMP (Basic Multi-lingual Plane) characters. These can be represented in UTF8MB3.
            * Characters from other Planes, named Supplementary or Astral planes.

            Consider a lookup value that is in UTF8MB4.
            * If the characters are in BMP, we can just make the lookup.
            * What if there are characters outside BMP? They *may* or may not compare as equal to characters in BMP. It depends on the collation.

            In {{utf8mb4_general_ci}}, all characters outside the BMP compare as equal to each other
            and equal to the {{U+FFFD}} Replacement Character, which is in BMP (In the code, search for {{MY_CS_REPLACEMENT_CHARACTER}}).

            A demo
            {code:sql}
            create table t12 (
              mb3 varchar(10) collate utf8mb3_general_ci,
              mb4 varchar(10) collate utf8mb4_general_ci
            );

            -- mb3 has the Replacement Character.
            -- mb4 has some non-BMP character.
            insert into t12 values (
             _utf8mb3 0xEFBFBD,
             _utf8mb4 0xF09D8C86);
            {code}

            {code:sql}
            set names utf8mb4;
            MariaDB [test]> select mb3, hex(mb3), mb4, hex(mb4) from t12;
            +------+----------+------+----------+
            | mb3 | hex(mb3) | mb4 | hex(mb4) |
            +------+----------+------+----------+
            | � | EFBFBD | 𝌆 | F09D8C86 |
            +------+----------+------+----------+
            1 row in set (0.001 sec)
            {code}

            Non-bmp character is equal to the Replacement Character:
            {code}
            MariaDB [test]> select mb3=mb4 from t12;
            +---------+
            | mb3=mb4 |
            +---------+
            | 1 |
            +---------+
            1 row in set (0.004 sec)
            {code}

            If one converts UTF8MB4 to MB3, MariaDB will replace non-BMP characters with '?' (Question Mark), NOT with replacement character:
            {code}
            MariaDB [test]> select convert(mb4 using utf8mb3) from t12;
            +----------------------------+
            | convert(mb4 using utf8mb3) |
            +----------------------------+
            | ? |
            +----------------------------+
            1 row in set, 1 warning (0.004 sec)

            MariaDB [test]> show warnings;
            +---------+------+------------------------------------------------------------+
            | Level | Code | Message |
            +---------+------+------------------------------------------------------------+
            | Warning | 1977 | Cannot convert 'utf8mb4' character 0xF09D8C86 to 'utf8mb3' |
            +---------+------+------------------------------------------------------------+
            {code}

            h2. 3. (NEW) NEW: Can we handle non-equality comparisons too?

            It seems like we could. If all non-BMP characters are equal between each other and are equal to one BMP character, why not?

            h2. 4. How to implement this
            h3. 4.1 Suggested solution #1: rewrite

            So, currently, when one calls fix_fields() for
            {code:sql}
              mb3_key_col= mb4_col
            {code}
            it will be converted into
            {code:sql}
              convert(mb3_key_col using utf8mb4) = mb4_col
            {code}

            The optimizer only considers ref access for column references, that is, for
            {code:sql}
               mb3_key_col= ...
            {code}
            The idea is to add a rewritten equality, that is, convert "mb3_key_col=mb4_col" into
            {code:sql}
              convert(mb3_key_col using utf8mb4) = mb4_col -- AS BEFORE
              AND
              mb3_key_col= convert_with_replacement(mb4_col, replacement_char) -- ADDED PART
            {code}

            The {{convert_with_replacement()}} is like {{CONVERT()}}, except that it
            * replaces non-converted symbols with the Replacement Character.
            * doesn't emit any warnings.

            Advantage of this approach:
            - the fix is not intrusive. We don't need to change the way string comparisons are done.
            - the fix covers all optimizer. Range optimizer will be able to handle {{utf8mb3_col= utf8mb4_const}}.

            Disadvantages:
            # One condition becomes two. We can work around this by making {{make_join_select()}} ignore the "ADDED PART".
            # What if "mb4_col" is a complex expression. How do we make a copy of it to use in the ADDED PART? (Can we use Item_direct_ref here?)

            **See below for link to an almost-finished patch. This solution receives pushback because "it's a hack"**

            h3. 4.2 Suggested solution #2: change the way string comparisons are done

            Currently, fix_fields() code will inject {{CONVERT()}} call if character sets do not match.
            The idea is not not inject it, and instead allow string comparisons to use different character sets.
            They would dynamically check if the charset is the right one and convert if it is not..

            **See below for link to a patch with attempt to implement it**
            The problem is that we cannot set the conversion code to do "Charset Narrowing" whenever we're converting from MB4 to MB3. There are cases where the server is converting from MB4 to MB3 and replaces non-BMP characters with '?' (question mark). We need to keep this behavior.

            I've made an attempt to add hooks to only perform Charset Narrowing when we're producing an index lookup value.
            I had to add hooks in 4 places already:
            {code}
            There are multiple places where conversion is done:
            1. CharsetNarrowing#1: class store_key_field. Uses Field::do_field_string.
            2. CharsetNarrowing#2: class store_key_item, copy_inner()
            3. CharsetNarrowing#3: class store_key_item_const
            4. CharsetNarrowing#4: in Field_str::get_mm_leaf()
            ...
            {code}
            The patch seems to be very intrusive and risky.

            h3. 4.3 Suggested solution #3: pseudo-charset

            h2. 5. Other stuff
            h3. 5.1 The optimization must be switchable

            One must be able to turn the new functionality on and off. Possible ways to do it:
            * {{optimizer_switch}} flag.
            * {{old_mode}} variable.

            Possible issues with optimizer_switch flag:
            * we only want the optimization to be switchable in the stable versions. New versions should have it always ON, there is no need for a switch.
            * we might need to check the flag on prepare stage (where the current code adds {{CONVERT(...}}} call). Optimizer stuff at the prepare stage is not good.
            psergei Sergei Petrunia made changes -
            Description h1. Contents
            {code}
            1.Task setting
            2. From utf8mb3_val=utf8mb4_val to index lookups
            3. NEW: Can we handle non-equality comparisons too?
            4. How to implement this
            4.1 Suggested solution #1: rewrite
            4.2 Suggested solution #2: change the way string comparisons are done
            4.3 Suggested solution #3: pseudo-charset
            5. Other stuff
            5.1 The optimization must be switchable
            {code}

            h1. 1. Task setting
            Consider this example:

            {code:sql}
            create table t1 (
              a varchar(32) collate utf8mb3_general_ci,
              key(a)
            );
            insert into t1 select seq from seq_1_to_10000;

            create table t2 (
              a varchar(32) collate utf8mb4_general_ci,
              key(a)
            );
            insert into t2 select seq from seq_1_to_10000;
            {code}

            The optimizer is able to use the index on t2.a for ref access:

            {code}
            explain select * from t1 straight_join t2 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t2 | ref | a | a | 131 | func | 1 | Using where; Using index |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            {code}

            But not the index on t1.a:
            {code}
            explain select * from t2 straight_join t1 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | 1 | SIMPLE | t2 | index | a | a | 131 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using where; Using index; Using join buffer (flat, BNL join) |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            {code}

            This is because internally t1.a=t2.a is converted into:

            {code}
            convert(t1.a using utf8mb4) = t2.a
            {code}

            This MDEV is about making it possible to use ref access on t2.a.

            h2. 2. From utf8mb3_val=utf8mb4_val to index lookups

            When we have
            {code}
            utf8mb3_key_col = utf8mb4_value
            {code}
            the normal way to compare is to convert the mb3 value into mb4 and then do the comparison.

            But what if we want to use
            {code}
            INDEX(utf8mb3_key_col)
            {code}
            to find the matching value?

            UTF8-MB4 consists of
            * BMP (Basic Multi-lingual Plane) characters. These can be represented in UTF8MB3.
            * Characters from other Planes, named Supplementary or Astral planes.

            Consider a lookup value that is in UTF8MB4.
            * If the characters are in BMP, we can just make the lookup.
            * What if there are characters outside BMP? They *may* or may not compare as equal to characters in BMP. It depends on the collation.

            In {{utf8mb4_general_ci}}, all characters outside the BMP compare as equal to each other
            and equal to the {{U+FFFD}} Replacement Character, which is in BMP (In the code, search for {{MY_CS_REPLACEMENT_CHARACTER}}).

            A demo
            {code:sql}
            create table t12 (
              mb3 varchar(10) collate utf8mb3_general_ci,
              mb4 varchar(10) collate utf8mb4_general_ci
            );

            -- mb3 has the Replacement Character.
            -- mb4 has some non-BMP character.
            insert into t12 values (
             _utf8mb3 0xEFBFBD,
             _utf8mb4 0xF09D8C86);
            {code}

            {code:sql}
            set names utf8mb4;
            MariaDB [test]> select mb3, hex(mb3), mb4, hex(mb4) from t12;
            +------+----------+------+----------+
            | mb3 | hex(mb3) | mb4 | hex(mb4) |
            +------+----------+------+----------+
            | � | EFBFBD | 𝌆 | F09D8C86 |
            +------+----------+------+----------+
            1 row in set (0.001 sec)
            {code}

            Non-bmp character is equal to the Replacement Character:
            {code}
            MariaDB [test]> select mb3=mb4 from t12;
            +---------+
            | mb3=mb4 |
            +---------+
            | 1 |
            +---------+
            1 row in set (0.004 sec)
            {code}

            If one converts UTF8MB4 to MB3, MariaDB will replace non-BMP characters with '?' (Question Mark), NOT with replacement character:
            {code}
            MariaDB [test]> select convert(mb4 using utf8mb3) from t12;
            +----------------------------+
            | convert(mb4 using utf8mb3) |
            +----------------------------+
            | ? |
            +----------------------------+
            1 row in set, 1 warning (0.004 sec)

            MariaDB [test]> show warnings;
            +---------+------+------------------------------------------------------------+
            | Level | Code | Message |
            +---------+------+------------------------------------------------------------+
            | Warning | 1977 | Cannot convert 'utf8mb4' character 0xF09D8C86 to 'utf8mb3' |
            +---------+------+------------------------------------------------------------+
            {code}

            h2. 3. (NEW) NEW: Can we handle non-equality comparisons too?

            It seems like we could. If all non-BMP characters are equal between each other and are equal to one BMP character, why not?

            h2. 4. How to implement this
            h3. 4.1 Suggested solution #1: rewrite

            So, currently, when one calls fix_fields() for
            {code:sql}
              mb3_key_col= mb4_col
            {code}
            it will be converted into
            {code:sql}
              convert(mb3_key_col using utf8mb4) = mb4_col
            {code}

            The optimizer only considers ref access for column references, that is, for
            {code:sql}
               mb3_key_col= ...
            {code}
            The idea is to add a rewritten equality, that is, convert "mb3_key_col=mb4_col" into
            {code:sql}
              convert(mb3_key_col using utf8mb4) = mb4_col -- AS BEFORE
              AND
              mb3_key_col= convert_with_replacement(mb4_col, replacement_char) -- ADDED PART
            {code}

            The {{convert_with_replacement()}} is like {{CONVERT()}}, except that it
            * replaces non-converted symbols with the Replacement Character.
            * doesn't emit any warnings.

            Advantage of this approach:
            - the fix is not intrusive. We don't need to change the way string comparisons are done.
            - the fix covers all optimizer. Range optimizer will be able to handle {{utf8mb3_col= utf8mb4_const}}.

            Disadvantages:
            # One condition becomes two. We can work around this by making {{make_join_select()}} ignore the "ADDED PART".
            # What if "mb4_col" is a complex expression. How do we make a copy of it to use in the ADDED PART? (Can we use Item_direct_ref here?)

            **See below for link to an almost-finished patch. This solution receives pushback because "it's a hack"**

            h3. 4.2 Suggested solution #2: change the way string comparisons are done

            Currently, fix_fields() code will inject {{CONVERT()}} call if character sets do not match.
            The idea is not not inject it, and instead allow string comparisons to use different character sets.
            They would dynamically check if the charset is the right one and convert if it is not..

            **See below for link to a patch with attempt to implement it**
            The problem is that we cannot set the conversion code to do "Charset Narrowing" whenever we're converting from MB4 to MB3. There are cases where the server is converting from MB4 to MB3 and replaces non-BMP characters with '?' (question mark). We need to keep this behavior.

            I've made an attempt to add hooks to only perform Charset Narrowing when we're producing an index lookup value.
            I had to add hooks in 4 places already:
            {code}
            There are multiple places where conversion is done:
            1. CharsetNarrowing#1: class store_key_field. Uses Field::do_field_string.
            2. CharsetNarrowing#2: class store_key_item, copy_inner()
            3. CharsetNarrowing#3: class store_key_item_const
            4. CharsetNarrowing#4: in Field_str::get_mm_leaf()
            ...
            {code}
            The patch seems to be very intrusive and risky.

            h3. 4.3 Suggested solution #3: pseudo-charset

            h2. 5. Other stuff
            h3. 5.1 The optimization must be switchable

            One must be able to turn the new functionality on and off. Possible ways to do it:
            * {{optimizer_switch}} flag.
            * {{old_mode}} variable.

            Possible issues with optimizer_switch flag:
            * we only want the optimization to be switchable in the stable versions. New versions should have it always ON, there is no need for a switch.
            * we might need to check the flag on prepare stage (where the current code adds {{CONVERT(...}}} call). Optimizer stuff at the prepare stage is not good.
            {toc:printable=true|style=square|maxLevel=2|indent=5px|minLevel=2|class=bigpink|exclude=[1//2]|type=list|outline=true|include=.*}

            h1. Contents
            {code}
            1.Task setting
            2. From utf8mb3_val=utf8mb4_val to index lookups
            3. NEW: Can we handle non-equality comparisons too?
            4. How to implement this
            4.1 Suggested solution #1: rewrite
            4.2 Suggested solution #2: change the way string comparisons are done
            4.3 Suggested solution #3: pseudo-charset
            5. Other stuff
            5.1 The optimization must be switchable
            {code}

            h1. 1. Task setting
            Consider this example:

            {code:sql}
            create table t1 (
              a varchar(32) collate utf8mb3_general_ci,
              key(a)
            );
            insert into t1 select seq from seq_1_to_10000;

            create table t2 (
              a varchar(32) collate utf8mb4_general_ci,
              key(a)
            );
            insert into t2 select seq from seq_1_to_10000;
            {code}

            The optimizer is able to use the index on t2.a for ref access:

            {code}
            explain select * from t1 straight_join t2 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t2 | ref | a | a | 131 | func | 1 | Using where; Using index |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            {code}

            But not the index on t1.a:
            {code}
            explain select * from t2 straight_join t1 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | 1 | SIMPLE | t2 | index | a | a | 131 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using where; Using index; Using join buffer (flat, BNL join) |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            {code}

            This is because internally t1.a=t2.a is converted into:

            {code}
            convert(t1.a using utf8mb4) = t2.a
            {code}

            This MDEV is about making it possible to use ref access on t2.a.

            h2. 2. From utf8mb3_val=utf8mb4_val to index lookups

            When we have
            {code}
            utf8mb3_key_col = utf8mb4_value
            {code}
            the normal way to compare is to convert the mb3 value into mb4 and then do the comparison.

            But what if we want to use
            {code}
            INDEX(utf8mb3_key_col)
            {code}
            to find the matching value?

            UTF8-MB4 consists of
            * BMP (Basic Multi-lingual Plane) characters. These can be represented in UTF8MB3.
            * Characters from other Planes, named Supplementary or Astral planes.

            Consider a lookup value that is in UTF8MB4.
            * If the characters are in BMP, we can just make the lookup.
            * What if there are characters outside BMP? They *may* or may not compare as equal to characters in BMP. It depends on the collation.

            In {{utf8mb4_general_ci}}, all characters outside the BMP compare as equal to each other
            and equal to the {{U+FFFD}} Replacement Character, which is in BMP (In the code, search for {{MY_CS_REPLACEMENT_CHARACTER}}).

            A demo
            {code:sql}
            create table t12 (
              mb3 varchar(10) collate utf8mb3_general_ci,
              mb4 varchar(10) collate utf8mb4_general_ci
            );

            -- mb3 has the Replacement Character.
            -- mb4 has some non-BMP character.
            insert into t12 values (
             _utf8mb3 0xEFBFBD,
             _utf8mb4 0xF09D8C86);
            {code}

            {code:sql}
            set names utf8mb4;
            MariaDB [test]> select mb3, hex(mb3), mb4, hex(mb4) from t12;
            +------+----------+------+----------+
            | mb3 | hex(mb3) | mb4 | hex(mb4) |
            +------+----------+------+----------+
            | � | EFBFBD | 𝌆 | F09D8C86 |
            +------+----------+------+----------+
            1 row in set (0.001 sec)
            {code}

            Non-bmp character is equal to the Replacement Character:
            {code}
            MariaDB [test]> select mb3=mb4 from t12;
            +---------+
            | mb3=mb4 |
            +---------+
            | 1 |
            +---------+
            1 row in set (0.004 sec)
            {code}

            If one converts UTF8MB4 to MB3, MariaDB will replace non-BMP characters with '?' (Question Mark), NOT with replacement character:
            {code}
            MariaDB [test]> select convert(mb4 using utf8mb3) from t12;
            +----------------------------+
            | convert(mb4 using utf8mb3) |
            +----------------------------+
            | ? |
            +----------------------------+
            1 row in set, 1 warning (0.004 sec)

            MariaDB [test]> show warnings;
            +---------+------+------------------------------------------------------------+
            | Level | Code | Message |
            +---------+------+------------------------------------------------------------+
            | Warning | 1977 | Cannot convert 'utf8mb4' character 0xF09D8C86 to 'utf8mb3' |
            +---------+------+------------------------------------------------------------+
            {code}

            h2. 3. (NEW) NEW: Can we handle non-equality comparisons too?

            It seems like we could. If all non-BMP characters are equal between each other and are equal to one BMP character, why not?

            h2. 4. How to implement this
            h3. 4.1 Suggested solution #1: rewrite

            So, currently, when one calls fix_fields() for
            {code:sql}
              mb3_key_col= mb4_col
            {code}
            it will be converted into
            {code:sql}
              convert(mb3_key_col using utf8mb4) = mb4_col
            {code}

            The optimizer only considers ref access for column references, that is, for
            {code:sql}
               mb3_key_col= ...
            {code}
            The idea is to add a rewritten equality, that is, convert "mb3_key_col=mb4_col" into
            {code:sql}
              convert(mb3_key_col using utf8mb4) = mb4_col -- AS BEFORE
              AND
              mb3_key_col= convert_with_replacement(mb4_col, replacement_char) -- ADDED PART
            {code}

            The {{convert_with_replacement()}} is like {{CONVERT()}}, except that it
            * replaces non-converted symbols with the Replacement Character.
            * doesn't emit any warnings.

            Advantage of this approach:
            - the fix is not intrusive. We don't need to change the way string comparisons are done.
            - the fix covers all optimizer. Range optimizer will be able to handle {{utf8mb3_col= utf8mb4_const}}.

            Disadvantages:
            # One condition becomes two. We can work around this by making {{make_join_select()}} ignore the "ADDED PART".
            # What if "mb4_col" is a complex expression. How do we make a copy of it to use in the ADDED PART? (Can we use Item_direct_ref here?)

            **See below for link to an almost-finished patch. This solution receives pushback because "it's a hack"**

            h3. 4.2 Suggested solution #2: change the way string comparisons are done

            Currently, fix_fields() code will inject {{CONVERT()}} call if character sets do not match.
            The idea is not not inject it, and instead allow string comparisons to use different character sets.
            They would dynamically check if the charset is the right one and convert if it is not..

            **See below for link to a patch with attempt to implement it**
            The problem is that we cannot set the conversion code to do "Charset Narrowing" whenever we're converting from MB4 to MB3. There are cases where the server is converting from MB4 to MB3 and replaces non-BMP characters with '?' (question mark). We need to keep this behavior.

            I've made an attempt to add hooks to only perform Charset Narrowing when we're producing an index lookup value.
            I had to add hooks in 4 places already:
            {code}
            There are multiple places where conversion is done:
            1. CharsetNarrowing#1: class store_key_field. Uses Field::do_field_string.
            2. CharsetNarrowing#2: class store_key_item, copy_inner()
            3. CharsetNarrowing#3: class store_key_item_const
            4. CharsetNarrowing#4: in Field_str::get_mm_leaf()
            ...
            {code}
            The patch seems to be very intrusive and risky.

            h3. 4.3 Suggested solution #3: pseudo-charset

            h2. 5. Other stuff
            h3. 5.1 The optimization must be switchable

            One must be able to turn the new functionality on and off. Possible ways to do it:
            * {{optimizer_switch}} flag.
            * {{old_mode}} variable.

            Possible issues with optimizer_switch flag:
            * we only want the optimization to be switchable in the stable versions. New versions should have it always ON, there is no need for a switch.
            * we might need to check the flag on prepare stage (where the current code adds {{CONVERT(...}}} call). Optimizer stuff at the prepare stage is not good.
            psergei Sergei Petrunia made changes -
            Description {toc:printable=true|style=square|maxLevel=2|indent=5px|minLevel=2|class=bigpink|exclude=[1//2]|type=list|outline=true|include=.*}

            h1. Contents
            {code}
            1.Task setting
            2. From utf8mb3_val=utf8mb4_val to index lookups
            3. NEW: Can we handle non-equality comparisons too?
            4. How to implement this
            4.1 Suggested solution #1: rewrite
            4.2 Suggested solution #2: change the way string comparisons are done
            4.3 Suggested solution #3: pseudo-charset
            5. Other stuff
            5.1 The optimization must be switchable
            {code}

            h1. 1. Task setting
            Consider this example:

            {code:sql}
            create table t1 (
              a varchar(32) collate utf8mb3_general_ci,
              key(a)
            );
            insert into t1 select seq from seq_1_to_10000;

            create table t2 (
              a varchar(32) collate utf8mb4_general_ci,
              key(a)
            );
            insert into t2 select seq from seq_1_to_10000;
            {code}

            The optimizer is able to use the index on t2.a for ref access:

            {code}
            explain select * from t1 straight_join t2 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t2 | ref | a | a | 131 | func | 1 | Using where; Using index |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            {code}

            But not the index on t1.a:
            {code}
            explain select * from t2 straight_join t1 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | 1 | SIMPLE | t2 | index | a | a | 131 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using where; Using index; Using join buffer (flat, BNL join) |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            {code}

            This is because internally t1.a=t2.a is converted into:

            {code}
            convert(t1.a using utf8mb4) = t2.a
            {code}

            This MDEV is about making it possible to use ref access on t2.a.

            h2. 2. From utf8mb3_val=utf8mb4_val to index lookups

            When we have
            {code}
            utf8mb3_key_col = utf8mb4_value
            {code}
            the normal way to compare is to convert the mb3 value into mb4 and then do the comparison.

            But what if we want to use
            {code}
            INDEX(utf8mb3_key_col)
            {code}
            to find the matching value?

            UTF8-MB4 consists of
            * BMP (Basic Multi-lingual Plane) characters. These can be represented in UTF8MB3.
            * Characters from other Planes, named Supplementary or Astral planes.

            Consider a lookup value that is in UTF8MB4.
            * If the characters are in BMP, we can just make the lookup.
            * What if there are characters outside BMP? They *may* or may not compare as equal to characters in BMP. It depends on the collation.

            In {{utf8mb4_general_ci}}, all characters outside the BMP compare as equal to each other
            and equal to the {{U+FFFD}} Replacement Character, which is in BMP (In the code, search for {{MY_CS_REPLACEMENT_CHARACTER}}).

            A demo
            {code:sql}
            create table t12 (
              mb3 varchar(10) collate utf8mb3_general_ci,
              mb4 varchar(10) collate utf8mb4_general_ci
            );

            -- mb3 has the Replacement Character.
            -- mb4 has some non-BMP character.
            insert into t12 values (
             _utf8mb3 0xEFBFBD,
             _utf8mb4 0xF09D8C86);
            {code}

            {code:sql}
            set names utf8mb4;
            MariaDB [test]> select mb3, hex(mb3), mb4, hex(mb4) from t12;
            +------+----------+------+----------+
            | mb3 | hex(mb3) | mb4 | hex(mb4) |
            +------+----------+------+----------+
            | � | EFBFBD | 𝌆 | F09D8C86 |
            +------+----------+------+----------+
            1 row in set (0.001 sec)
            {code}

            Non-bmp character is equal to the Replacement Character:
            {code}
            MariaDB [test]> select mb3=mb4 from t12;
            +---------+
            | mb3=mb4 |
            +---------+
            | 1 |
            +---------+
            1 row in set (0.004 sec)
            {code}

            If one converts UTF8MB4 to MB3, MariaDB will replace non-BMP characters with '?' (Question Mark), NOT with replacement character:
            {code}
            MariaDB [test]> select convert(mb4 using utf8mb3) from t12;
            +----------------------------+
            | convert(mb4 using utf8mb3) |
            +----------------------------+
            | ? |
            +----------------------------+
            1 row in set, 1 warning (0.004 sec)

            MariaDB [test]> show warnings;
            +---------+------+------------------------------------------------------------+
            | Level | Code | Message |
            +---------+------+------------------------------------------------------------+
            | Warning | 1977 | Cannot convert 'utf8mb4' character 0xF09D8C86 to 'utf8mb3' |
            +---------+------+------------------------------------------------------------+
            {code}

            h2. 3. (NEW) NEW: Can we handle non-equality comparisons too?

            It seems like we could. If all non-BMP characters are equal between each other and are equal to one BMP character, why not?

            h2. 4. How to implement this
            h3. 4.1 Suggested solution #1: rewrite

            So, currently, when one calls fix_fields() for
            {code:sql}
              mb3_key_col= mb4_col
            {code}
            it will be converted into
            {code:sql}
              convert(mb3_key_col using utf8mb4) = mb4_col
            {code}

            The optimizer only considers ref access for column references, that is, for
            {code:sql}
               mb3_key_col= ...
            {code}
            The idea is to add a rewritten equality, that is, convert "mb3_key_col=mb4_col" into
            {code:sql}
              convert(mb3_key_col using utf8mb4) = mb4_col -- AS BEFORE
              AND
              mb3_key_col= convert_with_replacement(mb4_col, replacement_char) -- ADDED PART
            {code}

            The {{convert_with_replacement()}} is like {{CONVERT()}}, except that it
            * replaces non-converted symbols with the Replacement Character.
            * doesn't emit any warnings.

            Advantage of this approach:
            - the fix is not intrusive. We don't need to change the way string comparisons are done.
            - the fix covers all optimizer. Range optimizer will be able to handle {{utf8mb3_col= utf8mb4_const}}.

            Disadvantages:
            # One condition becomes two. We can work around this by making {{make_join_select()}} ignore the "ADDED PART".
            # What if "mb4_col" is a complex expression. How do we make a copy of it to use in the ADDED PART? (Can we use Item_direct_ref here?)

            **See below for link to an almost-finished patch. This solution receives pushback because "it's a hack"**

            h3. 4.2 Suggested solution #2: change the way string comparisons are done

            Currently, fix_fields() code will inject {{CONVERT()}} call if character sets do not match.
            The idea is not not inject it, and instead allow string comparisons to use different character sets.
            They would dynamically check if the charset is the right one and convert if it is not..

            **See below for link to a patch with attempt to implement it**
            The problem is that we cannot set the conversion code to do "Charset Narrowing" whenever we're converting from MB4 to MB3. There are cases where the server is converting from MB4 to MB3 and replaces non-BMP characters with '?' (question mark). We need to keep this behavior.

            I've made an attempt to add hooks to only perform Charset Narrowing when we're producing an index lookup value.
            I had to add hooks in 4 places already:
            {code}
            There are multiple places where conversion is done:
            1. CharsetNarrowing#1: class store_key_field. Uses Field::do_field_string.
            2. CharsetNarrowing#2: class store_key_item, copy_inner()
            3. CharsetNarrowing#3: class store_key_item_const
            4. CharsetNarrowing#4: in Field_str::get_mm_leaf()
            ...
            {code}
            The patch seems to be very intrusive and risky.

            h3. 4.3 Suggested solution #3: pseudo-charset

            h2. 5. Other stuff
            h3. 5.1 The optimization must be switchable

            One must be able to turn the new functionality on and off. Possible ways to do it:
            * {{optimizer_switch}} flag.
            * {{old_mode}} variable.

            Possible issues with optimizer_switch flag:
            * we only want the optimization to be switchable in the stable versions. New versions should have it always ON, there is no need for a switch.
            * we might need to check the flag on prepare stage (where the current code adds {{CONVERT(...}}} call). Optimizer stuff at the prepare stage is not good.
            {toc}

            h1. Contents
            {code}
            1.Task setting
            2. From utf8mb3_val=utf8mb4_val to index lookups
            3. NEW: Can we handle non-equality comparisons too?
            4. How to implement this
            4.1 Suggested solution #1: rewrite
            4.2 Suggested solution #2: change the way string comparisons are done
            4.3 Suggested solution #3: pseudo-charset
            5. Other stuff
            5.1 The optimization must be switchable
            {code}

            h1. 1. Task setting
            Consider this example:

            {code:sql}
            create table t1 (
              a varchar(32) collate utf8mb3_general_ci,
              key(a)
            );
            insert into t1 select seq from seq_1_to_10000;

            create table t2 (
              a varchar(32) collate utf8mb4_general_ci,
              key(a)
            );
            insert into t2 select seq from seq_1_to_10000;
            {code}

            The optimizer is able to use the index on t2.a for ref access:

            {code}
            explain select * from t1 straight_join t2 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t2 | ref | a | a | 131 | func | 1 | Using where; Using index |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            {code}

            But not the index on t1.a:
            {code}
            explain select * from t2 straight_join t1 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | 1 | SIMPLE | t2 | index | a | a | 131 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using where; Using index; Using join buffer (flat, BNL join) |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            {code}

            This is because internally t1.a=t2.a is converted into:

            {code}
            convert(t1.a using utf8mb4) = t2.a
            {code}

            This MDEV is about making it possible to use ref access on t2.a.

            h2. 2. From utf8mb3_val=utf8mb4_val to index lookups

            When we have
            {code}
            utf8mb3_key_col = utf8mb4_value
            {code}
            the normal way to compare is to convert the mb3 value into mb4 and then do the comparison.

            But what if we want to use
            {code}
            INDEX(utf8mb3_key_col)
            {code}
            to find the matching value?

            UTF8-MB4 consists of
            * BMP (Basic Multi-lingual Plane) characters. These can be represented in UTF8MB3.
            * Characters from other Planes, named Supplementary or Astral planes.

            Consider a lookup value that is in UTF8MB4.
            * If the characters are in BMP, we can just make the lookup.
            * What if there are characters outside BMP? They *may* or may not compare as equal to characters in BMP. It depends on the collation.

            In {{utf8mb4_general_ci}}, all characters outside the BMP compare as equal to each other
            and equal to the {{U+FFFD}} Replacement Character, which is in BMP (In the code, search for {{MY_CS_REPLACEMENT_CHARACTER}}).

            A demo
            {code:sql}
            create table t12 (
              mb3 varchar(10) collate utf8mb3_general_ci,
              mb4 varchar(10) collate utf8mb4_general_ci
            );

            -- mb3 has the Replacement Character.
            -- mb4 has some non-BMP character.
            insert into t12 values (
             _utf8mb3 0xEFBFBD,
             _utf8mb4 0xF09D8C86);
            {code}

            {code:sql}
            set names utf8mb4;
            MariaDB [test]> select mb3, hex(mb3), mb4, hex(mb4) from t12;
            +------+----------+------+----------+
            | mb3 | hex(mb3) | mb4 | hex(mb4) |
            +------+----------+------+----------+
            | � | EFBFBD | 𝌆 | F09D8C86 |
            +------+----------+------+----------+
            1 row in set (0.001 sec)
            {code}

            Non-bmp character is equal to the Replacement Character:
            {code}
            MariaDB [test]> select mb3=mb4 from t12;
            +---------+
            | mb3=mb4 |
            +---------+
            | 1 |
            +---------+
            1 row in set (0.004 sec)
            {code}

            If one converts UTF8MB4 to MB3, MariaDB will replace non-BMP characters with '?' (Question Mark), NOT with replacement character:
            {code}
            MariaDB [test]> select convert(mb4 using utf8mb3) from t12;
            +----------------------------+
            | convert(mb4 using utf8mb3) |
            +----------------------------+
            | ? |
            +----------------------------+
            1 row in set, 1 warning (0.004 sec)

            MariaDB [test]> show warnings;
            +---------+------+------------------------------------------------------------+
            | Level | Code | Message |
            +---------+------+------------------------------------------------------------+
            | Warning | 1977 | Cannot convert 'utf8mb4' character 0xF09D8C86 to 'utf8mb3' |
            +---------+------+------------------------------------------------------------+
            {code}

            h2. 3. (NEW) NEW: Can we handle non-equality comparisons too?

            It seems like we could. If all non-BMP characters are equal between each other and are equal to one BMP character, why not?

            h2. 4. How to implement this
            h3. 4.1 Suggested solution #1: rewrite

            So, currently, when one calls fix_fields() for
            {code:sql}
              mb3_key_col= mb4_col
            {code}
            it will be converted into
            {code:sql}
              convert(mb3_key_col using utf8mb4) = mb4_col
            {code}

            The optimizer only considers ref access for column references, that is, for
            {code:sql}
               mb3_key_col= ...
            {code}
            The idea is to add a rewritten equality, that is, convert "mb3_key_col=mb4_col" into
            {code:sql}
              convert(mb3_key_col using utf8mb4) = mb4_col -- AS BEFORE
              AND
              mb3_key_col= convert_with_replacement(mb4_col, replacement_char) -- ADDED PART
            {code}

            The {{convert_with_replacement()}} is like {{CONVERT()}}, except that it
            * replaces non-converted symbols with the Replacement Character.
            * doesn't emit any warnings.

            Advantage of this approach:
            - the fix is not intrusive. We don't need to change the way string comparisons are done.
            - the fix covers all optimizer. Range optimizer will be able to handle {{utf8mb3_col= utf8mb4_const}}.

            Disadvantages:
            # One condition becomes two. We can work around this by making {{make_join_select()}} ignore the "ADDED PART".
            # What if "mb4_col" is a complex expression. How do we make a copy of it to use in the ADDED PART? (Can we use Item_direct_ref here?)

            **See below for link to an almost-finished patch. This solution receives pushback because "it's a hack"**

            h3. 4.2 Suggested solution #2: change the way string comparisons are done

            Currently, fix_fields() code will inject {{CONVERT()}} call if character sets do not match.
            The idea is not not inject it, and instead allow string comparisons to use different character sets.
            They would dynamically check if the charset is the right one and convert if it is not..

            **See below for link to a patch with attempt to implement it**
            The problem is that we cannot set the conversion code to do "Charset Narrowing" whenever we're converting from MB4 to MB3. There are cases where the server is converting from MB4 to MB3 and replaces non-BMP characters with '?' (question mark). We need to keep this behavior.

            I've made an attempt to add hooks to only perform Charset Narrowing when we're producing an index lookup value.
            I had to add hooks in 4 places already:
            {code}
            There are multiple places where conversion is done:
            1. CharsetNarrowing#1: class store_key_field. Uses Field::do_field_string.
            2. CharsetNarrowing#2: class store_key_item, copy_inner()
            3. CharsetNarrowing#3: class store_key_item_const
            4. CharsetNarrowing#4: in Field_str::get_mm_leaf()
            ...
            {code}
            The patch seems to be very intrusive and risky.

            h3. 4.3 Suggested solution #3: pseudo-charset

            h2. 5. Other stuff
            h3. 5.1 The optimization must be switchable

            One must be able to turn the new functionality on and off. Possible ways to do it:
            * {{optimizer_switch}} flag.
            * {{old_mode}} variable.

            Possible issues with optimizer_switch flag:
            * we only want the optimization to be switchable in the stable versions. New versions should have it always ON, there is no need for a switch.
            * we might need to check the flag on prepare stage (where the current code adds {{CONVERT(...}}} call). Optimizer stuff at the prepare stage is not good.
            psergei Sergei Petrunia made changes -
            Description {toc}

            h1. Contents
            {code}
            1.Task setting
            2. From utf8mb3_val=utf8mb4_val to index lookups
            3. NEW: Can we handle non-equality comparisons too?
            4. How to implement this
            4.1 Suggested solution #1: rewrite
            4.2 Suggested solution #2: change the way string comparisons are done
            4.3 Suggested solution #3: pseudo-charset
            5. Other stuff
            5.1 The optimization must be switchable
            {code}

            h1. 1. Task setting
            Consider this example:

            {code:sql}
            create table t1 (
              a varchar(32) collate utf8mb3_general_ci,
              key(a)
            );
            insert into t1 select seq from seq_1_to_10000;

            create table t2 (
              a varchar(32) collate utf8mb4_general_ci,
              key(a)
            );
            insert into t2 select seq from seq_1_to_10000;
            {code}

            The optimizer is able to use the index on t2.a for ref access:

            {code}
            explain select * from t1 straight_join t2 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t2 | ref | a | a | 131 | func | 1 | Using where; Using index |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            {code}

            But not the index on t1.a:
            {code}
            explain select * from t2 straight_join t1 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | 1 | SIMPLE | t2 | index | a | a | 131 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using where; Using index; Using join buffer (flat, BNL join) |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            {code}

            This is because internally t1.a=t2.a is converted into:

            {code}
            convert(t1.a using utf8mb4) = t2.a
            {code}

            This MDEV is about making it possible to use ref access on t2.a.

            h2. 2. From utf8mb3_val=utf8mb4_val to index lookups

            When we have
            {code}
            utf8mb3_key_col = utf8mb4_value
            {code}
            the normal way to compare is to convert the mb3 value into mb4 and then do the comparison.

            But what if we want to use
            {code}
            INDEX(utf8mb3_key_col)
            {code}
            to find the matching value?

            UTF8-MB4 consists of
            * BMP (Basic Multi-lingual Plane) characters. These can be represented in UTF8MB3.
            * Characters from other Planes, named Supplementary or Astral planes.

            Consider a lookup value that is in UTF8MB4.
            * If the characters are in BMP, we can just make the lookup.
            * What if there are characters outside BMP? They *may* or may not compare as equal to characters in BMP. It depends on the collation.

            In {{utf8mb4_general_ci}}, all characters outside the BMP compare as equal to each other
            and equal to the {{U+FFFD}} Replacement Character, which is in BMP (In the code, search for {{MY_CS_REPLACEMENT_CHARACTER}}).

            A demo
            {code:sql}
            create table t12 (
              mb3 varchar(10) collate utf8mb3_general_ci,
              mb4 varchar(10) collate utf8mb4_general_ci
            );

            -- mb3 has the Replacement Character.
            -- mb4 has some non-BMP character.
            insert into t12 values (
             _utf8mb3 0xEFBFBD,
             _utf8mb4 0xF09D8C86);
            {code}

            {code:sql}
            set names utf8mb4;
            MariaDB [test]> select mb3, hex(mb3), mb4, hex(mb4) from t12;
            +------+----------+------+----------+
            | mb3 | hex(mb3) | mb4 | hex(mb4) |
            +------+----------+------+----------+
            | � | EFBFBD | 𝌆 | F09D8C86 |
            +------+----------+------+----------+
            1 row in set (0.001 sec)
            {code}

            Non-bmp character is equal to the Replacement Character:
            {code}
            MariaDB [test]> select mb3=mb4 from t12;
            +---------+
            | mb3=mb4 |
            +---------+
            | 1 |
            +---------+
            1 row in set (0.004 sec)
            {code}

            If one converts UTF8MB4 to MB3, MariaDB will replace non-BMP characters with '?' (Question Mark), NOT with replacement character:
            {code}
            MariaDB [test]> select convert(mb4 using utf8mb3) from t12;
            +----------------------------+
            | convert(mb4 using utf8mb3) |
            +----------------------------+
            | ? |
            +----------------------------+
            1 row in set, 1 warning (0.004 sec)

            MariaDB [test]> show warnings;
            +---------+------+------------------------------------------------------------+
            | Level | Code | Message |
            +---------+------+------------------------------------------------------------+
            | Warning | 1977 | Cannot convert 'utf8mb4' character 0xF09D8C86 to 'utf8mb3' |
            +---------+------+------------------------------------------------------------+
            {code}

            h2. 3. (NEW) NEW: Can we handle non-equality comparisons too?

            It seems like we could. If all non-BMP characters are equal between each other and are equal to one BMP character, why not?

            h2. 4. How to implement this
            h3. 4.1 Suggested solution #1: rewrite

            So, currently, when one calls fix_fields() for
            {code:sql}
              mb3_key_col= mb4_col
            {code}
            it will be converted into
            {code:sql}
              convert(mb3_key_col using utf8mb4) = mb4_col
            {code}

            The optimizer only considers ref access for column references, that is, for
            {code:sql}
               mb3_key_col= ...
            {code}
            The idea is to add a rewritten equality, that is, convert "mb3_key_col=mb4_col" into
            {code:sql}
              convert(mb3_key_col using utf8mb4) = mb4_col -- AS BEFORE
              AND
              mb3_key_col= convert_with_replacement(mb4_col, replacement_char) -- ADDED PART
            {code}

            The {{convert_with_replacement()}} is like {{CONVERT()}}, except that it
            * replaces non-converted symbols with the Replacement Character.
            * doesn't emit any warnings.

            Advantage of this approach:
            - the fix is not intrusive. We don't need to change the way string comparisons are done.
            - the fix covers all optimizer. Range optimizer will be able to handle {{utf8mb3_col= utf8mb4_const}}.

            Disadvantages:
            # One condition becomes two. We can work around this by making {{make_join_select()}} ignore the "ADDED PART".
            # What if "mb4_col" is a complex expression. How do we make a copy of it to use in the ADDED PART? (Can we use Item_direct_ref here?)

            **See below for link to an almost-finished patch. This solution receives pushback because "it's a hack"**

            h3. 4.2 Suggested solution #2: change the way string comparisons are done

            Currently, fix_fields() code will inject {{CONVERT()}} call if character sets do not match.
            The idea is not not inject it, and instead allow string comparisons to use different character sets.
            They would dynamically check if the charset is the right one and convert if it is not..

            **See below for link to a patch with attempt to implement it**
            The problem is that we cannot set the conversion code to do "Charset Narrowing" whenever we're converting from MB4 to MB3. There are cases where the server is converting from MB4 to MB3 and replaces non-BMP characters with '?' (question mark). We need to keep this behavior.

            I've made an attempt to add hooks to only perform Charset Narrowing when we're producing an index lookup value.
            I had to add hooks in 4 places already:
            {code}
            There are multiple places where conversion is done:
            1. CharsetNarrowing#1: class store_key_field. Uses Field::do_field_string.
            2. CharsetNarrowing#2: class store_key_item, copy_inner()
            3. CharsetNarrowing#3: class store_key_item_const
            4. CharsetNarrowing#4: in Field_str::get_mm_leaf()
            ...
            {code}
            The patch seems to be very intrusive and risky.

            h3. 4.3 Suggested solution #3: pseudo-charset

            h2. 5. Other stuff
            h3. 5.1 The optimization must be switchable

            One must be able to turn the new functionality on and off. Possible ways to do it:
            * {{optimizer_switch}} flag.
            * {{old_mode}} variable.

            Possible issues with optimizer_switch flag:
            * we only want the optimization to be switchable in the stable versions. New versions should have it always ON, there is no need for a switch.
            * we might need to check the flag on prepare stage (where the current code adds {{CONVERT(...}}} call). Optimizer stuff at the prepare stage is not good.
            h1. Contents
            {code}
            1.Task setting
            2. From utf8mb3_val=utf8mb4_val to index lookups
            3. NEW: Can we handle non-equality comparisons too?
            4. How to implement this
            4.1 Suggested solution #1: rewrite
            4.2 Suggested solution #2: change the way string comparisons are done
            4.3 Suggested solution #3: pseudo-charset
            5. Other stuff
            5.1 The optimization must be switchable
            {code}

            h1. 1. Task setting
            Consider this example:

            {code:sql}
            create table t1 (
              a varchar(32) collate utf8mb3_general_ci,
              key(a)
            );
            insert into t1 select seq from seq_1_to_10000;

            create table t2 (
              a varchar(32) collate utf8mb4_general_ci,
              key(a)
            );
            insert into t2 select seq from seq_1_to_10000;
            {code}

            The optimizer is able to use the index on t2.a for ref access:

            {code}
            explain select * from t1 straight_join t2 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t2 | ref | a | a | 131 | func | 1 | Using where; Using index |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            {code}

            But not the index on t1.a:
            {code}
            explain select * from t2 straight_join t1 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | 1 | SIMPLE | t2 | index | a | a | 131 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using where; Using index; Using join buffer (flat, BNL join) |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            {code}

            This is because internally t1.a=t2.a is converted into:

            {code}
            convert(t1.a using utf8mb4) = t2.a
            {code}

            This MDEV is about making it possible to use ref access on t2.a.

            h2. 2. From utf8mb3_val=utf8mb4_val to index lookups

            When we have
            {code}
            utf8mb3_key_col = utf8mb4_value
            {code}
            the normal way to compare is to convert the mb3 value into mb4 and then do the comparison.

            But what if we want to use
            {code}
            INDEX(utf8mb3_key_col)
            {code}
            to find the matching value?

            UTF8-MB4 consists of
            * BMP (Basic Multi-lingual Plane) characters. These can be represented in UTF8MB3.
            * Characters from other Planes, named Supplementary or Astral planes.

            Consider a lookup value that is in UTF8MB4.
            * If the characters are in BMP, we can just make the lookup.
            * What if there are characters outside BMP? They *may* or may not compare as equal to characters in BMP. It depends on the collation.

            In {{utf8mb4_general_ci}}, all characters outside the BMP compare as equal to each other
            and equal to the {{U+FFFD}} Replacement Character, which is in BMP (In the code, search for {{MY_CS_REPLACEMENT_CHARACTER}}).

            A demo
            {code:sql}
            create table t12 (
              mb3 varchar(10) collate utf8mb3_general_ci,
              mb4 varchar(10) collate utf8mb4_general_ci
            );

            -- mb3 has the Replacement Character.
            -- mb4 has some non-BMP character.
            insert into t12 values (
             _utf8mb3 0xEFBFBD,
             _utf8mb4 0xF09D8C86);
            {code}

            {code:sql}
            set names utf8mb4;
            MariaDB [test]> select mb3, hex(mb3), mb4, hex(mb4) from t12;
            +------+----------+------+----------+
            | mb3 | hex(mb3) | mb4 | hex(mb4) |
            +------+----------+------+----------+
            | � | EFBFBD | 𝌆 | F09D8C86 |
            +------+----------+------+----------+
            1 row in set (0.001 sec)
            {code}

            Non-bmp character is equal to the Replacement Character:
            {code}
            MariaDB [test]> select mb3=mb4 from t12;
            +---------+
            | mb3=mb4 |
            +---------+
            | 1 |
            +---------+
            1 row in set (0.004 sec)
            {code}

            If one converts UTF8MB4 to MB3, MariaDB will replace non-BMP characters with '?' (Question Mark), NOT with replacement character:
            {code}
            MariaDB [test]> select convert(mb4 using utf8mb3) from t12;
            +----------------------------+
            | convert(mb4 using utf8mb3) |
            +----------------------------+
            | ? |
            +----------------------------+
            1 row in set, 1 warning (0.004 sec)

            MariaDB [test]> show warnings;
            +---------+------+------------------------------------------------------------+
            | Level | Code | Message |
            +---------+------+------------------------------------------------------------+
            | Warning | 1977 | Cannot convert 'utf8mb4' character 0xF09D8C86 to 'utf8mb3' |
            +---------+------+------------------------------------------------------------+
            {code}

            h2. 3. (NEW) NEW: Can we handle non-equality comparisons too?

            It seems like we could. If all non-BMP characters are equal between each other and are equal to one BMP character, why not?

            h2. 4. How to implement this
            h3. 4.1 Suggested solution #1: rewrite

            So, currently, when one calls fix_fields() for
            {code:sql}
              mb3_key_col= mb4_col
            {code}
            it will be converted into
            {code:sql}
              convert(mb3_key_col using utf8mb4) = mb4_col
            {code}

            The optimizer only considers ref access for column references, that is, for
            {code:sql}
               mb3_key_col= ...
            {code}
            The idea is to add a rewritten equality, that is, convert "mb3_key_col=mb4_col" into
            {code:sql}
              convert(mb3_key_col using utf8mb4) = mb4_col -- AS BEFORE
              AND
              mb3_key_col= convert_with_replacement(mb4_col, replacement_char) -- ADDED PART
            {code}

            The {{convert_with_replacement()}} is like {{CONVERT()}}, except that it
            * replaces non-converted symbols with the Replacement Character.
            * doesn't emit any warnings.

            Advantage of this approach:
            - the fix is not intrusive. We don't need to change the way string comparisons are done.
            - the fix covers all optimizer. Range optimizer will be able to handle {{utf8mb3_col= utf8mb4_const}}.

            Disadvantages:
            # One condition becomes two. We can work around this by making {{make_join_select()}} ignore the "ADDED PART".
            # What if "mb4_col" is a complex expression. How do we make a copy of it to use in the ADDED PART? (Can we use Item_direct_ref here?)

            **See below for link to an almost-finished patch. This solution receives pushback because "it's a hack"**

            h3. 4.2 Suggested solution #2: change the way string comparisons are done

            Currently, fix_fields() code will inject {{CONVERT()}} call if character sets do not match.
            The idea is not not inject it, and instead allow string comparisons to use different character sets.
            They would dynamically check if the charset is the right one and convert if it is not..

            **See below for link to a patch with attempt to implement it**
            The problem is that we cannot set the conversion code to do "Charset Narrowing" whenever we're converting from MB4 to MB3. There are cases where the server is converting from MB4 to MB3 and replaces non-BMP characters with '?' (question mark). We need to keep this behavior.

            I've made an attempt to add hooks to only perform Charset Narrowing when we're producing an index lookup value.
            I had to add hooks in 4 places already:
            {code}
            There are multiple places where conversion is done:
            1. CharsetNarrowing#1: class store_key_field. Uses Field::do_field_string.
            2. CharsetNarrowing#2: class store_key_item, copy_inner()
            3. CharsetNarrowing#3: class store_key_item_const
            4. CharsetNarrowing#4: in Field_str::get_mm_leaf()
            ...
            {code}
            The patch seems to be very intrusive and risky.

            h3. 4.3 Suggested solution #3: pseudo-charset

            h2. 5. Other stuff
            h3. 5.1 The optimization must be switchable

            One must be able to turn the new functionality on and off. Possible ways to do it:
            * {{optimizer_switch}} flag.
            * {{old_mode}} variable.

            Possible issues with optimizer_switch flag:
            * we only want the optimization to be switchable in the stable versions. New versions should have it always ON, there is no need for a switch.
            * we might need to check the flag on prepare stage (where the current code adds {{CONVERT(...}}} call). Optimizer stuff at the prepare stage is not good.
            psergei Sergei Petrunia made changes -
            Description h1. Contents
            {code}
            1.Task setting
            2. From utf8mb3_val=utf8mb4_val to index lookups
            3. NEW: Can we handle non-equality comparisons too?
            4. How to implement this
            4.1 Suggested solution #1: rewrite
            4.2 Suggested solution #2: change the way string comparisons are done
            4.3 Suggested solution #3: pseudo-charset
            5. Other stuff
            5.1 The optimization must be switchable
            {code}

            h1. 1. Task setting
            Consider this example:

            {code:sql}
            create table t1 (
              a varchar(32) collate utf8mb3_general_ci,
              key(a)
            );
            insert into t1 select seq from seq_1_to_10000;

            create table t2 (
              a varchar(32) collate utf8mb4_general_ci,
              key(a)
            );
            insert into t2 select seq from seq_1_to_10000;
            {code}

            The optimizer is able to use the index on t2.a for ref access:

            {code}
            explain select * from t1 straight_join t2 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t2 | ref | a | a | 131 | func | 1 | Using where; Using index |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            {code}

            But not the index on t1.a:
            {code}
            explain select * from t2 straight_join t1 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | 1 | SIMPLE | t2 | index | a | a | 131 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using where; Using index; Using join buffer (flat, BNL join) |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            {code}

            This is because internally t1.a=t2.a is converted into:

            {code}
            convert(t1.a using utf8mb4) = t2.a
            {code}

            This MDEV is about making it possible to use ref access on t2.a.

            h2. 2. From utf8mb3_val=utf8mb4_val to index lookups

            When we have
            {code}
            utf8mb3_key_col = utf8mb4_value
            {code}
            the normal way to compare is to convert the mb3 value into mb4 and then do the comparison.

            But what if we want to use
            {code}
            INDEX(utf8mb3_key_col)
            {code}
            to find the matching value?

            UTF8-MB4 consists of
            * BMP (Basic Multi-lingual Plane) characters. These can be represented in UTF8MB3.
            * Characters from other Planes, named Supplementary or Astral planes.

            Consider a lookup value that is in UTF8MB4.
            * If the characters are in BMP, we can just make the lookup.
            * What if there are characters outside BMP? They *may* or may not compare as equal to characters in BMP. It depends on the collation.

            In {{utf8mb4_general_ci}}, all characters outside the BMP compare as equal to each other
            and equal to the {{U+FFFD}} Replacement Character, which is in BMP (In the code, search for {{MY_CS_REPLACEMENT_CHARACTER}}).

            A demo
            {code:sql}
            create table t12 (
              mb3 varchar(10) collate utf8mb3_general_ci,
              mb4 varchar(10) collate utf8mb4_general_ci
            );

            -- mb3 has the Replacement Character.
            -- mb4 has some non-BMP character.
            insert into t12 values (
             _utf8mb3 0xEFBFBD,
             _utf8mb4 0xF09D8C86);
            {code}

            {code:sql}
            set names utf8mb4;
            MariaDB [test]> select mb3, hex(mb3), mb4, hex(mb4) from t12;
            +------+----------+------+----------+
            | mb3 | hex(mb3) | mb4 | hex(mb4) |
            +------+----------+------+----------+
            | � | EFBFBD | 𝌆 | F09D8C86 |
            +------+----------+------+----------+
            1 row in set (0.001 sec)
            {code}

            Non-bmp character is equal to the Replacement Character:
            {code}
            MariaDB [test]> select mb3=mb4 from t12;
            +---------+
            | mb3=mb4 |
            +---------+
            | 1 |
            +---------+
            1 row in set (0.004 sec)
            {code}

            If one converts UTF8MB4 to MB3, MariaDB will replace non-BMP characters with '?' (Question Mark), NOT with replacement character:
            {code}
            MariaDB [test]> select convert(mb4 using utf8mb3) from t12;
            +----------------------------+
            | convert(mb4 using utf8mb3) |
            +----------------------------+
            | ? |
            +----------------------------+
            1 row in set, 1 warning (0.004 sec)

            MariaDB [test]> show warnings;
            +---------+------+------------------------------------------------------------+
            | Level | Code | Message |
            +---------+------+------------------------------------------------------------+
            | Warning | 1977 | Cannot convert 'utf8mb4' character 0xF09D8C86 to 'utf8mb3' |
            +---------+------+------------------------------------------------------------+
            {code}

            h2. 3. (NEW) NEW: Can we handle non-equality comparisons too?

            It seems like we could. If all non-BMP characters are equal between each other and are equal to one BMP character, why not?

            h2. 4. How to implement this
            h3. 4.1 Suggested solution #1: rewrite

            So, currently, when one calls fix_fields() for
            {code:sql}
              mb3_key_col= mb4_col
            {code}
            it will be converted into
            {code:sql}
              convert(mb3_key_col using utf8mb4) = mb4_col
            {code}

            The optimizer only considers ref access for column references, that is, for
            {code:sql}
               mb3_key_col= ...
            {code}
            The idea is to add a rewritten equality, that is, convert "mb3_key_col=mb4_col" into
            {code:sql}
              convert(mb3_key_col using utf8mb4) = mb4_col -- AS BEFORE
              AND
              mb3_key_col= convert_with_replacement(mb4_col, replacement_char) -- ADDED PART
            {code}

            The {{convert_with_replacement()}} is like {{CONVERT()}}, except that it
            * replaces non-converted symbols with the Replacement Character.
            * doesn't emit any warnings.

            Advantage of this approach:
            - the fix is not intrusive. We don't need to change the way string comparisons are done.
            - the fix covers all optimizer. Range optimizer will be able to handle {{utf8mb3_col= utf8mb4_const}}.

            Disadvantages:
            # One condition becomes two. We can work around this by making {{make_join_select()}} ignore the "ADDED PART".
            # What if "mb4_col" is a complex expression. How do we make a copy of it to use in the ADDED PART? (Can we use Item_direct_ref here?)

            **See below for link to an almost-finished patch. This solution receives pushback because "it's a hack"**

            h3. 4.2 Suggested solution #2: change the way string comparisons are done

            Currently, fix_fields() code will inject {{CONVERT()}} call if character sets do not match.
            The idea is not not inject it, and instead allow string comparisons to use different character sets.
            They would dynamically check if the charset is the right one and convert if it is not..

            **See below for link to a patch with attempt to implement it**
            The problem is that we cannot set the conversion code to do "Charset Narrowing" whenever we're converting from MB4 to MB3. There are cases where the server is converting from MB4 to MB3 and replaces non-BMP characters with '?' (question mark). We need to keep this behavior.

            I've made an attempt to add hooks to only perform Charset Narrowing when we're producing an index lookup value.
            I had to add hooks in 4 places already:
            {code}
            There are multiple places where conversion is done:
            1. CharsetNarrowing#1: class store_key_field. Uses Field::do_field_string.
            2. CharsetNarrowing#2: class store_key_item, copy_inner()
            3. CharsetNarrowing#3: class store_key_item_const
            4. CharsetNarrowing#4: in Field_str::get_mm_leaf()
            ...
            {code}
            The patch seems to be very intrusive and risky.

            h3. 4.3 Suggested solution #3: pseudo-charset

            h2. 5. Other stuff
            h3. 5.1 The optimization must be switchable

            One must be able to turn the new functionality on and off. Possible ways to do it:
            * {{optimizer_switch}} flag.
            * {{old_mode}} variable.

            Possible issues with optimizer_switch flag:
            * we only want the optimization to be switchable in the stable versions. New versions should have it always ON, there is no need for a switch.
            * we might need to check the flag on prepare stage (where the current code adds {{CONVERT(...}}} call). Optimizer stuff at the prepare stage is not good.
            h1. Contents
            {code}
            1.Task setting
            2. From utf8mb3_val=utf8mb4_val to index lookups
            3. NEW: Can we handle non-equality comparisons too?
            4. How to implement this
            4.1 Suggested solution #1: rewrite
            4.2 Suggested solution #2: change the way string comparisons are done
            4.3 Suggested solution #3: pseudo-charset
            5. Other stuff
            5.1 The optimization must be switchable
            {code}

            h1. 1. Task setting
            Consider this example:

            {code:sql}
            create table t1 (
              a varchar(32) collate utf8mb3_general_ci,
              key(a)
            );
            insert into t1 select seq from seq_1_to_10000;

            create table t2 (
              a varchar(32) collate utf8mb4_general_ci,
              key(a)
            );
            insert into t2 select seq from seq_1_to_10000;
            {code}

            The optimizer is able to use the index on t2.a for ref access:

            {code}
            explain select * from t1 straight_join t2 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t2 | ref | a | a | 131 | func | 1 | Using where; Using index |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
            {code}

            But not the index on t1.a:
            {code}
            explain select * from t2 straight_join t1 on t1.a=t2.a;
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            | 1 | SIMPLE | t2 | index | a | a | 131 | NULL | 10000 | Using index |
            | 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using where; Using index; Using join buffer (flat, BNL join) |
            +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
            {code}

            This is because internally t1.a=t2.a is converted into:

            {code}
            convert(t1.a using utf8mb4) = t2.a
            {code}

            This MDEV is about making it possible to use ref access on t2.a.

            h2. 2. From utf8mb3_val=utf8mb4_val to index lookups

            When we have
            {code}
            utf8mb3_key_col = utf8mb4_value
            {code}
            the normal way to compare is to convert the mb3 value into mb4 and then do the comparison.

            But what if we want to use
            {code}
            INDEX(utf8mb3_key_col)
            {code}
            to find the matching value?

            UTF8-MB4 consists of
            * BMP (Basic Multi-lingual Plane) characters. These can be represented in UTF8MB3.
            * Characters from other Planes, named Supplementary or Astral planes.

            Consider a lookup value that is in UTF8MB4.
            * If the characters are in BMP, we can just make the lookup.
            * What if there are characters outside BMP? They *may* or may not compare as equal to characters in BMP. It depends on the collation.

            In {{utf8mb4_general_ci}}, all characters outside the BMP compare as equal to each other
            and equal to the {{U+FFFD}} Replacement Character, which is in BMP (In the code, search for {{MY_CS_REPLACEMENT_CHARACTER}}).

            A demo
            {code:sql}
            create table t12 (
              mb3 varchar(10) collate utf8mb3_general_ci,
              mb4 varchar(10) collate utf8mb4_general_ci
            );

            -- mb3 has the Replacement Character.
            -- mb4 has some non-BMP character.
            insert into t12 values (
             _utf8mb3 0xEFBFBD,
             _utf8mb4 0xF09D8C86);
            {code}

            {code:sql}
            set names utf8mb4;
            MariaDB [test]> select mb3, hex(mb3), mb4, hex(mb4) from t12;
            +------+----------+------+----------+
            | mb3 | hex(mb3) | mb4 | hex(mb4) |
            +------+----------+------+----------+
            | � | EFBFBD | 𝌆 | F09D8C86 |
            +------+----------+------+----------+
            1 row in set (0.001 sec)
            {code}

            Non-bmp character is equal to the Replacement Character:
            {code}
            MariaDB [test]> select mb3=mb4 from t12;
            +---------+
            | mb3=mb4 |
            +---------+
            | 1 |
            +---------+
            1 row in set (0.004 sec)
            {code}

            If one converts UTF8MB4 to MB3, MariaDB will replace non-BMP characters with '?' (Question Mark), NOT with replacement character:
            {code}
            MariaDB [test]> select convert(mb4 using utf8mb3) from t12;
            +----------------------------+
            | convert(mb4 using utf8mb3) |
            +----------------------------+
            | ? |
            +----------------------------+
            1 row in set, 1 warning (0.004 sec)

            MariaDB [test]> show warnings;
            +---------+------+------------------------------------------------------------+
            | Level | Code | Message |
            +---------+------+------------------------------------------------------------+
            | Warning | 1977 | Cannot convert 'utf8mb4' character 0xF09D8C86 to 'utf8mb3' |
            +---------+------+------------------------------------------------------------+
            {code}

            h2. 3. (NEW) NEW: Can we handle non-equality comparisons too?

            It seems like we could. If all non-BMP characters are equal between each other and are equal to one BMP character, why not?

            h2. 4. How to implement this
            h3. 4.1 Suggested solution #1: rewrite

            So, currently, when one calls fix_fields() for
            {code:sql}
              mb3_key_col= mb4_col
            {code}
            it will be converted into
            {code:sql}
              convert(mb3_key_col using utf8mb4) = mb4_col
            {code}

            The optimizer only considers ref access for column references, that is, for
            {code:sql}
               mb3_key_col= ...
            {code}
            The idea is to add a rewritten equality, that is, convert "mb3_key_col=mb4_col" into
            {code:sql}
              convert(mb3_key_col using utf8mb4) = mb4_col -- AS BEFORE
              AND
              mb3_key_col= convert_with_replacement(mb4_col, replacement_char) -- ADDED PART
            {code}

            The {{convert_with_replacement()}} is like {{CONVERT()}}, except that it
            * replaces non-converted symbols with the Replacement Character.
            * doesn't emit any warnings.

            Advantage of this approach:
            - the fix is not intrusive. We don't need to change the way string comparisons are done.
            - the fix covers all optimizer. Range optimizer will be able to handle {{utf8mb3_col= utf8mb4_const}}.

            Disadvantages:
            # One condition becomes two. We can work around this by making {{make_join_select()}} ignore the "ADDED PART".
            # What if "mb4_col" is a complex expression. How do we make a copy of it to use in the ADDED PART? (Can we use Item_direct_ref here?)

            **See below for link to an almost-finished patch. This solution receives pushback because "it's a hack"**

            h3. 4.2 Suggested solution #2: change the way string comparisons are done

            Currently, fix_fields() code will inject {{CONVERT()}} call if character sets do not match.
            The idea is not not inject it, and instead allow string comparisons to use different character sets.
            They would dynamically check if the charset is the right one and convert if it is not..

            **See below for link to a patch with attempt to implement it**
            The problem is that we cannot set the conversion code to do "Charset Narrowing" whenever we're converting from MB4 to MB3. There are cases where the server is converting from MB4 to MB3 and replaces non-BMP characters with '?' (question mark). We need to keep this behavior.

            I've made an attempt to add hooks to only perform Charset Narrowing when we're producing an index lookup value.
            I had to add hooks in 4 places already:
            {code}
            There are multiple places where conversion is done:
            1. CharsetNarrowing#1: class store_key_field. Uses Field::do_field_string.
            2. CharsetNarrowing#2: class store_key_item, copy_inner()
            3. CharsetNarrowing#3: class store_key_item_const
            4. CharsetNarrowing#4: in Field_str::get_mm_leaf()
            ...
            {code}
            The patch seems to be very intrusive and risky.

            h3. 4.3 Suggested solution #3: pseudo-charset
            Basic idea: So we're comparing

              mb3col = mb4col

            We need to handle two operations:

            1. Compare the values. We can just assume that mb3col has utf8mb4 content in it.

              const char *mb3col_data= ...; // This is in UTF8MB3
              const char *mb4col_data= ...; // This is in UTF8mb4.
              const char *left_value_in_mb4= utf8mb3_data; // Ok
              res= compare(left_value_in_mb4, mb4col_data, &my_charset_utf8mb4_general_ci);

            2. Take MB4 data and create MB3 index lookup key from it.

            Narrowing is done by using a special "Narrowing charset":

            We take the source utf8mb4_general_ci but change mb_wc function to
            replace non-Bmp characters with MY_CS_REPLACEMENT_CHAR.

            Todo: is there really a single place we could replace the charset?



            h2. 5. Other stuff
            h3. 5.1 The optimization must be switchable

            One must be able to turn the new functionality on and off. Possible ways to do it:
            * {{optimizer_switch}} flag.
            * {{old_mode}} variable.

            Possible issues with optimizer_switch flag:
            * we only want the optimization to be switchable in the stable versions. New versions should have it always ON, there is no need for a switch.
            * we might need to check the flag on prepare stage (where the current code adds {{CONVERT(...}}} call). Optimizer stuff at the prepare stage is not good.
            psergei Sergei Petrunia added a comment - - edited

            If we implement Suggestion #3 and do Charset Narrowing using a modified Character Set... will we have a single place to replace the charset?

            An equality for ref access

            t10.mb4=t1.mb3
            

            is represented as

            Item_field(Field_varstring(mb3)) == Item_field(Field_varstring(mb4))
            

            I don't think we can modify the charset in Field_varstring() object. It is an object from table->field[N] and other parts of the query may have references to it.

            We can modif the charset of the Item_field object.

            But that won't help: the code in create_ref_for_key() actually analyzes what kind of Item it is getting value from. In case it is Item_field (or Item_*ref(... Item_field... )), it will setup copying between Field objects:

              #0  Field_varstring::get_copy_func (this=0x7fffa801c608, from=0x7fffa89a3c30) at /home/psergey/dev-git2/10.6-charset-fixes-v3/sql/field_conv.cc:811
              #1  0x000055555604e022 in Field::get_copy_func_to (this=0x7fffa89a3c30, to=0x7fffa801c608) at /home/psergey/dev-git2/10.6-charset-fixes-v3/sql/field.h:919
              #2  0x000055555623e664 in Copy_field::set (this=0x7fffa801c588, to=0x7fffa801c608, from=0x7fffa89a3c30, save=false) at /home/psergey/dev-git2/10.6-charset-fixes-v3/sql/field_conv.cc:755
              #3  0x0000555555f5a31e in store_key_field::store_key_field (this=0x7fffa801c560, thd=0x7fffa8000d78, to_field_arg=0x7fffa8036a38, ptr=0x7fffa801c461 "", null_ptr_arg=0x7fffa801c460 "", length=96, from_field=0x7fffa89a3c30, name_arg=0x7fffa801c550 "test.t10.mb4") at /home/psergey/dev-git2/10.6-charset-fixes-v3/sql/sql_select.h:1952
              #4  0x0000555555f1e8cd in get_store_key (thd=0x7fffa8000d78, keyuse=0x7fffa801ae60, used_tables=0, key_part=0x7fffa80375d8, key_buff=0x7fffa801c460 "", maybe_null=1) at /home/psergey/dev-git2/10.6-charset-fixes-v3/sql/sql_select.cc:11798
              #5  0x0000555555f1e352 in create_ref_for_key (join=0x7fffa8018ca8, j=0x7fffa801b900, org_keyuse=0x7fffa801ae60, allow_full_scan=true, used_tables=4611686018427387907) at /home/psergey/dev-git2/10.6-charset-fixes-v3/sql/sql_select.cc:11703
              #6  0x0000555555f1cfc9 in JOIN::get_best_combination (this=0x7fffa8018ca8) at /home/psergey/dev-git2/10.6-charset-fixes-v3/sql/sql_select.cc:11362
            

            psergei Sergei Petrunia added a comment - - edited If we implement Suggestion #3 and do Charset Narrowing using a modified Character Set... will we have a single place to replace the charset? An equality for ref access t10.mb4=t1.mb3 is represented as Item_field(Field_varstring(mb3)) == Item_field(Field_varstring(mb4)) I don't think we can modify the charset in Field_varstring() object. It is an object from table->field [N] and other parts of the query may have references to it. We can modif the charset of the Item_field object. But that won't help: the code in create_ref_for_key() actually analyzes what kind of Item it is getting value from. In case it is Item_field (or Item_*ref(... Item_field... )), it will setup copying between Field objects: #0 Field_varstring::get_copy_func (this=0x7fffa801c608, from=0x7fffa89a3c30) at /home/psergey/dev-git2/10.6-charset-fixes-v3/sql/field_conv.cc:811 #1 0x000055555604e022 in Field::get_copy_func_to (this=0x7fffa89a3c30, to=0x7fffa801c608) at /home/psergey/dev-git2/10.6-charset-fixes-v3/sql/field.h:919 #2 0x000055555623e664 in Copy_field::set (this=0x7fffa801c588, to=0x7fffa801c608, from=0x7fffa89a3c30, save=false) at /home/psergey/dev-git2/10.6-charset-fixes-v3/sql/field_conv.cc:755 #3 0x0000555555f5a31e in store_key_field::store_key_field (this=0x7fffa801c560, thd=0x7fffa8000d78, to_field_arg=0x7fffa8036a38, ptr=0x7fffa801c461 "", null_ptr_arg=0x7fffa801c460 "", length=96, from_field=0x7fffa89a3c30, name_arg=0x7fffa801c550 "test.t10.mb4") at /home/psergey/dev-git2/10.6-charset-fixes-v3/sql/sql_select.h:1952 #4 0x0000555555f1e8cd in get_store_key (thd=0x7fffa8000d78, keyuse=0x7fffa801ae60, used_tables=0, key_part=0x7fffa80375d8, key_buff=0x7fffa801c460 "", maybe_null=1) at /home/psergey/dev-git2/10.6-charset-fixes-v3/sql/sql_select.cc:11798 #5 0x0000555555f1e352 in create_ref_for_key (join=0x7fffa8018ca8, j=0x7fffa801b900, org_keyuse=0x7fffa801ae60, allow_full_scan=true, used_tables=4611686018427387907) at /home/psergey/dev-git2/10.6-charset-fixes-v3/sql/sql_select.cc:11703 #6 0x0000555555f1cfc9 in JOIN::get_best_combination (this=0x7fffa8018ca8) at /home/psergey/dev-git2/10.6-charset-fixes-v3/sql/sql_select.cc:11362

            Note:
            For some queries, one can see question marks in the Optimizer Trace

                          "attached_conditions_summary": [
                            {
                              "table": "t1",
                              "attached": "'????' = t1.mb3"
                            }
            

            This is bad but it's not related to this bug.
            Optimizer Trace is in UTF8MB3 (we should make it MB4). When an UTF8MB4 value is written into UTF8MB3 string, non-BMP characters get replaced by question marks.

            psergei Sergei Petrunia added a comment - Note: For some queries, one can see question marks in the Optimizer Trace "attached_conditions_summary": [ { "table": "t1", "attached": "'????' = t1.mb3" } This is bad but it's not related to this bug. Optimizer Trace is in UTF8MB3 (we should make it MB4). When an UTF8MB4 value is written into UTF8MB3 string, non-BMP characters get replaced by question marks.
            psergei Sergei Petrunia added a comment - Current code is in : https://github.com/MariaDB/server/tree/bb-10.6-mdev32113-variant3

            Patch for review: https://github.com/MariaDB/server/commit/3bd8810117aefbe08a4895416a126e50c021447a
            pushed into bb-10.6-mdev32113-variant3.

            psergei Sergei Petrunia added a comment - Patch for review: https://github.com/MariaDB/server/commit/3bd8810117aefbe08a4895416a126e50c021447a pushed into bb-10.6-mdev32113-variant3 .
            psergei Sergei Petrunia made changes -
            Assignee Sergei Petrunia [ psergey ] Michael Widenius [ monty ]
            Status In Progress [ 3 ] In Review [ 10002 ]
            psergei Sergei Petrunia added a comment - Patch addressing review input: https://github.com/MariaDB/server/commit/c81db721cad473eaea716228b1380fa827795f81
            lstartseva Lena Startseva added a comment - - edited

            Testing of variant3 is done, if there is no changes, it's Ok to push

            lstartseva Lena Startseva added a comment - - edited Testing of variant3 is done, if there is no changes, it's Ok to push

            Review approved

            monty Michael Widenius added a comment - Review approved
            monty Michael Widenius made changes -
            Status In Review [ 10002 ] Stalled [ 10000 ]
            monty Michael Widenius made changes -
            Assignee Michael Widenius [ monty ] Sergei Petrunia [ psergey ]
            psergei Sergei Petrunia made changes -
            Fix Version/s 10.6.16 [ 29014 ]
            Fix Version/s 10.6 [ 24028 ]
            Resolution Fixed [ 1 ]
            Status Stalled [ 10000 ] Closed [ 6 ]
            psergei Sergei Petrunia made changes -
            Fix Version/s 10.6.16 [ 29014 ]
            psergei Sergei Petrunia made changes -
            Fix Version/s 10.6.16 [ 29014 ]
            Fix Version/s 10.10.7 [ 29018 ]
            Fix Version/s 10.11.6 [ 29020 ]
            Fix Version/s 11.0.4 [ 29021 ]
            Fix Version/s 11.1.3 [ 29023 ]
            Fix Version/s 11.2.2 [ 29035 ]
            Fix Version/s 11.3.1 [ 29416 ]
            psergei Sergei Petrunia made changes -
            monty Michael Widenius made changes -
            psergei Sergei Petrunia added a comment - The docs are here: https://mariadb.com/kb/en/charset-narrowing-optimization/
            mariadb-jira-automation Jira Automation (IT) made changes -
            Zendesk Related Tickets 178303 113257

            People

              psergei Sergei Petrunia
              psergei Sergei Petrunia
              Votes:
              1 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.