Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-32113

utf8mb3_key_col=utf8mb4_value cannot be used for ref access

    XMLWordPrintable

Details

    Description

      Contents

      1.Task setting
      2. From utf8mb3_val=utf8mb4_val to index lookups
      3. NEW: Can we handle non-equality comparisons too?
      4. How to implement this
      4.1 Suggested solution #1: rewrite
      4.2 Suggested solution #2: change the way string comparisons are done
      4.3 Suggested solution #3: pseudo-charset
      5. Other stuff
      5.1 The optimization must be switchable
      

      1. Task setting

      Consider this example:

      create table t1 (
        a varchar(32) collate utf8mb3_general_ci,
        key(a)
      );
      insert into t1 select seq from seq_1_to_10000;
       
      create table t2 (
        a varchar(32) collate utf8mb4_general_ci,
        key(a)
      );
      insert into t2 select seq from seq_1_to_10000;
      

      The optimizer is able to use the index on t2.a for ref access:

      explain select * from t1 straight_join t2 on t1.a=t2.a;
      +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
      | id   | select_type | table | type  | possible_keys | key  | key_len | ref  | rows  | Extra                    |
      +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
      |    1 | SIMPLE      | t1    | index | NULL          | a    | 99      | NULL | 10000 | Using index              |
      |    1 | SIMPLE      | t2    | ref   | a             | a    | 131     | func | 1     | Using where; Using index |
      +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
      

      But not the index on t1.a:

      explain select * from t2 straight_join t1 on t1.a=t2.a;
      +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
      | id   | select_type | table | type  | possible_keys | key  | key_len | ref  | rows  | Extra                                                        |
      +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
      |    1 | SIMPLE      | t2    | index | a             | a    | 131     | NULL | 10000 | Using index                                                  |
      |    1 | SIMPLE      | t1    | index | NULL          | a    | 99      | NULL | 10000 | Using where; Using index; Using join buffer (flat, BNL join) |
      +------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
      

      This is because internally t1.a=t2.a is converted into:

      convert(t1.a using utf8mb4) = t2.a
      

      This MDEV is about making it possible to use ref access on t2.a.

      2. From utf8mb3_val=utf8mb4_val to index lookups

      When we have

      utf8mb3_key_col = utf8mb4_value
      

      the normal way to compare is to convert the mb3 value into mb4 and then do the comparison.

      But what if we want to use

      INDEX(utf8mb3_key_col) 
      

      to find the matching value?

      UTF8-MB4 consists of

      • BMP (Basic Multi-lingual Plane) characters. These can be represented in UTF8MB3.
      • Characters from other Planes, named Supplementary or Astral planes.

      Consider a lookup value that is in UTF8MB4.

      • If the characters are in BMP, we can just make the lookup.
      • What if there are characters outside BMP? They may or may not compare as equal to characters in BMP. It depends on the collation.

      In utf8mb4_general_ci, all characters outside the BMP compare as equal to each other
      and equal to the U+FFFD Replacement Character, which is in BMP (In the code, search for MY_CS_REPLACEMENT_CHARACTER).

      A demo

      create table t12 (
        mb3 varchar(10) collate utf8mb3_general_ci,
        mb4 varchar(10) collate utf8mb4_general_ci
      );
       
      -- mb3 has the Replacement Character. 
      -- mb4 has some non-BMP character.
      insert into t12 values (
       _utf8mb3 0xEFBFBD,
       _utf8mb4 0xF09D8C86);
      

      set names utf8mb4;
      MariaDB [test]> select mb3, hex(mb3), mb4, hex(mb4) from t12;
      +------+----------+------+----------+
      | mb3  | hex(mb3) | mb4  | hex(mb4) |
      +------+----------+------+----------+
      | �    | EFBFBD   | 𝌆     | F09D8C86 |
      +------+----------+------+----------+
      1 row in set (0.001 sec)
      

      Non-bmp character is equal to the Replacement Character:

      MariaDB [test]> select mb3=mb4 from t12;
      +---------+
      | mb3=mb4 |
      +---------+
      |       1 |
      +---------+
      1 row in set (0.004 sec)
      

      If one converts UTF8MB4 to MB3, MariaDB will replace non-BMP characters with '?' (Question Mark), NOT with replacement character:

      MariaDB [test]> select convert(mb4 using utf8mb3) from t12;
      +----------------------------+
      | convert(mb4 using utf8mb3) |
      +----------------------------+
      | ?                          |
      +----------------------------+
      1 row in set, 1 warning (0.004 sec)
       
      MariaDB [test]> show warnings;
      +---------+------+------------------------------------------------------------+
      | Level   | Code | Message                                                    |
      +---------+------+------------------------------------------------------------+
      | Warning | 1977 | Cannot convert 'utf8mb4' character 0xF09D8C86 to 'utf8mb3' |
      +---------+------+------------------------------------------------------------+
      

      3. (NEW) NEW: Can we handle non-equality comparisons too?

      It seems like we could. If all non-BMP characters are equal between each other and are equal to one BMP character, why not?

      4. How to implement this

      4.1 Suggested solution #1: rewrite

      So, currently, when one calls fix_fields() for

        mb3_key_col= mb4_col 
      

      it will be converted into

        convert(mb3_key_col using utf8mb4) = mb4_col
      

      The optimizer only considers ref access for column references, that is, for

         mb3_key_col= ...
      

      The idea is to add a rewritten equality, that is, convert "mb3_key_col=mb4_col" into

        convert(mb3_key_col using utf8mb4) = mb4_col -- AS BEFORE
        AND 
        mb3_key_col= convert_with_replacement(mb4_col, replacement_char) -- ADDED PART
      

      The convert_with_replacement() is like CONVERT(), except that it

      • replaces non-converted symbols with the Replacement Character.
      • doesn't emit any warnings.

      Advantage of this approach:

      • the fix is not intrusive. We don't need to change the way string comparisons are done.
      • the fix covers all optimizer. Range optimizer will be able to handle utf8mb3_col= utf8mb4_const.

      Disadvantages:

      1. One condition becomes two. We can work around this by making make_join_select() ignore the "ADDED PART".
      2. What if "mb4_col" is a complex expression. How do we make a copy of it to use in the ADDED PART? (Can we use Item_direct_ref here?)

      *See below for link to an almost-finished patch. This solution receives pushback because "it's a hack"*

      4.2 Suggested solution #2: change the way string comparisons are done

      Currently, fix_fields() code will inject CONVERT() call if character sets do not match.
      The idea is not not inject it, and instead allow string comparisons to use different character sets.
      They would dynamically check if the charset is the right one and convert if it is not..

      *See below for link to a patch with attempt to implement it*
      The problem is that we cannot set the conversion code to do "Charset Narrowing" whenever we're converting from MB4 to MB3. There are cases where the server is converting from MB4 to MB3 and replaces non-BMP characters with '?' (question mark). We need to keep this behavior.

      I've made an attempt to add hooks to only perform Charset Narrowing when we're producing an index lookup value.
      I had to add hooks in 4 places already:

      There are multiple places where conversion is done:
      1. CharsetNarrowing#1: class store_key_field. Uses Field::do_field_string.
      2. CharsetNarrowing#2: class store_key_item, copy_inner()
      3. CharsetNarrowing#3: class store_key_item_const
      4. CharsetNarrowing#4: in Field_str::get_mm_leaf()
      ...
      

      The patch seems to be very intrusive and risky.

      4.3 Suggested solution #3: pseudo-charset

      Basic idea: So we're comparing

      mb3col = mb4col

      We need to handle two operations:

      1. Compare the values. We can just assume that mb3col has utf8mb4 content in it.

      const char *mb3col_data= ...; // This is in UTF8MB3
      const char *mb4col_data= ...; // This is in UTF8mb4.
      const char *left_value_in_mb4= utf8mb3_data; // Ok
      res= compare(left_value_in_mb4, mb4col_data, &my_charset_utf8mb4_general_ci);

      2. Take MB4 data and create MB3 index lookup key from it.

      Narrowing is done by using a special "Narrowing charset":

      We take the source utf8mb4_general_ci but change mb_wc function to
      replace non-Bmp characters with MY_CS_REPLACEMENT_CHAR.

      Todo: is there really a single place we could replace the charset?

      5. Other stuff

      5.1 The optimization must be switchable

      One must be able to turn the new functionality on and off. Possible ways to do it:

      • optimizer_switch flag.
      • old_mode variable.

      Possible issues with optimizer_switch flag:

      • we only want the optimization to be switchable in the stable versions. New versions should have it always ON, there is no need for a switch.
      • we might need to check the flag on prepare stage (where the current code adds CONVERT(...} call). Optimizer stuff at the prepare stage is not good.

      Attachments

        Issue Links

          Activity

            People

              psergei Sergei Petrunia
              psergei Sergei Petrunia
              Votes:
              1 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.