Details
-
Task
-
Status: Closed (View Workflow)
-
Major
-
Resolution: Fixed
-
None
Description
Contents
1.Task setting
|
2. From utf8mb3_val=utf8mb4_val to index lookups
|
3. NEW: Can we handle non-equality comparisons too?
|
4. How to implement this
|
4.1 Suggested solution #1: rewrite
|
4.2 Suggested solution #2: change the way string comparisons are done
|
4.3 Suggested solution #3: pseudo-charset
|
5. Other stuff
|
5.1 The optimization must be switchable
|
1. Task setting
Consider this example:
create table t1 ( |
a varchar(32) collate utf8mb3_general_ci, |
key(a) |
);
|
insert into t1 select seq from seq_1_to_10000; |
 |
create table t2 ( |
a varchar(32) collate utf8mb4_general_ci, |
key(a) |
);
|
insert into t2 select seq from seq_1_to_10000; |
The optimizer is able to use the index on t2.a for ref access:
explain select * from t1 straight_join t2 on t1.a=t2.a;
|
+------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
|
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
|
+------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
|
| 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using index |
|
| 1 | SIMPLE | t2 | ref | a | a | 131 | func | 1 | Using where; Using index |
|
+------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
|
But not the index on t1.a:
explain select * from t2 straight_join t1 on t1.a=t2.a;
|
+------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
|
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
|
+------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
|
| 1 | SIMPLE | t2 | index | a | a | 131 | NULL | 10000 | Using index |
|
| 1 | SIMPLE | t1 | index | NULL | a | 99 | NULL | 10000 | Using where; Using index; Using join buffer (flat, BNL join) |
|
+------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+
|
This is because internally t1.a=t2.a is converted into:
convert(t1.a using utf8mb4) = t2.a
|
This MDEV is about making it possible to use ref access on t2.a.
2. From utf8mb3_val=utf8mb4_val to index lookups
When we have
utf8mb3_key_col = utf8mb4_value
|
the normal way to compare is to convert the mb3 value into mb4 and then do the comparison.
But what if we want to use
INDEX(utf8mb3_key_col)
|
to find the matching value?
UTF8-MB4 consists of
- BMP (Basic Multi-lingual Plane) characters. These can be represented in UTF8MB3.
- Characters from other Planes, named Supplementary or Astral planes.
Consider a lookup value that is in UTF8MB4.
- If the characters are in BMP, we can just make the lookup.
- What if there are characters outside BMP? They may or may not compare as equal to characters in BMP. It depends on the collation.
In utf8mb4_general_ci, all characters outside the BMP compare as equal to each other
and equal to the U+FFFD Replacement Character, which is in BMP (In the code, search for MY_CS_REPLACEMENT_CHARACTER).
A demo
create table t12 ( |
mb3 varchar(10) collate utf8mb3_general_ci, |
mb4 varchar(10) collate utf8mb4_general_ci |
);
|
 |
-- mb3 has the Replacement Character.
|
-- mb4 has some non-BMP character.
|
insert into t12 values ( |
_utf8mb3 0xEFBFBD,
|
_utf8mb4 0xF09D8C86);
|
set names utf8mb4; |
MariaDB [test]> select mb3, hex(mb3), mb4, hex(mb4) from t12; |
+------+----------+------+----------+ |
| mb3 | hex(mb3) | mb4 | hex(mb4) |
|
+------+----------+------+----------+ |
| � | EFBFBD | 𝌆 | F09D8C86 |
|
+------+----------+------+----------+ |
1 row in set (0.001 sec) |
Non-bmp character is equal to the Replacement Character:
MariaDB [test]> select mb3=mb4 from t12;
|
+---------+
|
| mb3=mb4 |
|
+---------+
|
| 1 |
|
+---------+
|
1 row in set (0.004 sec)
|
If one converts UTF8MB4 to MB3, MariaDB will replace non-BMP characters with '?' (Question Mark), NOT with replacement character:
MariaDB [test]> select convert(mb4 using utf8mb3) from t12;
|
+----------------------------+
|
| convert(mb4 using utf8mb3) |
|
+----------------------------+
|
| ? |
|
+----------------------------+
|
1 row in set, 1 warning (0.004 sec)
|
 |
MariaDB [test]> show warnings;
|
+---------+------+------------------------------------------------------------+
|
| Level | Code | Message |
|
+---------+------+------------------------------------------------------------+
|
| Warning | 1977 | Cannot convert 'utf8mb4' character 0xF09D8C86 to 'utf8mb3' |
|
+---------+------+------------------------------------------------------------+
|
3. (NEW) NEW: Can we handle non-equality comparisons too?
It seems like we could. If all non-BMP characters are equal between each other and are equal to one BMP character, why not?
4. How to implement this
4.1 Suggested solution #1: rewrite
So, currently, when one calls fix_fields() for
mb3_key_col= mb4_col
|
it will be converted into
convert(mb3_key_col using utf8mb4) = mb4_col |
The optimizer only considers ref access for column references, that is, for
mb3_key_col= ...
|
The idea is to add a rewritten equality, that is, convert "mb3_key_col=mb4_col" into
convert(mb3_key_col using utf8mb4) = mb4_col -- AS BEFORE |
AND |
mb3_key_col= convert_with_replacement(mb4_col, replacement_char) -- ADDED PART |
The convert_with_replacement() is like CONVERT(), except that it
- replaces non-converted symbols with the Replacement Character.
- doesn't emit any warnings.
Advantage of this approach:
- the fix is not intrusive. We don't need to change the way string comparisons are done.
- the fix covers all optimizer. Range optimizer will be able to handle utf8mb3_col= utf8mb4_const.
Disadvantages:
- One condition becomes two. We can work around this by making make_join_select() ignore the "ADDED PART".
- What if "mb4_col" is a complex expression. How do we make a copy of it to use in the ADDED PART? (Can we use Item_direct_ref here?)
*See below for link to an almost-finished patch. This solution receives pushback because "it's a hack"*
4.2 Suggested solution #2: change the way string comparisons are done
Currently, fix_fields() code will inject CONVERT() call if character sets do not match.
The idea is not not inject it, and instead allow string comparisons to use different character sets.
They would dynamically check if the charset is the right one and convert if it is not..
*See below for link to a patch with attempt to implement it*
The problem is that we cannot set the conversion code to do "Charset Narrowing" whenever we're converting from MB4 to MB3. There are cases where the server is converting from MB4 to MB3 and replaces non-BMP characters with '?' (question mark). We need to keep this behavior.
I've made an attempt to add hooks to only perform Charset Narrowing when we're producing an index lookup value.
I had to add hooks in 4 places already:
There are multiple places where conversion is done:
|
1. CharsetNarrowing#1: class store_key_field. Uses Field::do_field_string.
|
2. CharsetNarrowing#2: class store_key_item, copy_inner()
|
3. CharsetNarrowing#3: class store_key_item_const
|
4. CharsetNarrowing#4: in Field_str::get_mm_leaf()
|
...
|
The patch seems to be very intrusive and risky.
4.3 Suggested solution #3: pseudo-charset
Basic idea: So we're comparing
mb3col = mb4col
We need to handle two operations:
1. Compare the values. We can just assume that mb3col has utf8mb4 content in it.
const char *mb3col_data= ...; // This is in UTF8MB3
const char *mb4col_data= ...; // This is in UTF8mb4.
const char *left_value_in_mb4= utf8mb3_data; // Ok
res= compare(left_value_in_mb4, mb4col_data, &my_charset_utf8mb4_general_ci);
2. Take MB4 data and create MB3 index lookup key from it.
Narrowing is done by using a special "Narrowing charset":
We take the source utf8mb4_general_ci but change mb_wc function to
replace non-Bmp characters with MY_CS_REPLACEMENT_CHAR.
Todo: is there really a single place we could replace the charset?
5. Other stuff
5.1 The optimization must be switchable
One must be able to turn the new functionality on and off. Possible ways to do it:
- optimizer_switch flag.
- old_mode variable.
Possible issues with optimizer_switch flag:
- we only want the optimization to be switchable in the stable versions. New versions should have it always ON, there is no need for a switch.
- we might need to check the flag on prepare stage (where the current code adds CONVERT(...} call). Optimizer stuff at the prepare stage is not good.
Attachments
Issue Links
- includes
-
MDEV-8872 Performance regressions with utf8mb4 vs utf8 in WordPress
- Closed
- relates to
-
MDEV-8872 Performance regressions with utf8mb4 vs utf8 in WordPress
- Closed
-
MDEV-32127 direct comparison of strings in different charsets
- Open
-
MDEV-32148 Inefficient WHERE timestamp_column=datetime_const_expr
- Closed
-
MDEV-32526 Enable Charset Narrowing by default starting from 11.3
- Closed