Details

Type: Task
Status: Closed (View Workflow)
Priority: Major
Resolution: Fixed
Fix Version/s: 10.6.16, 10.10.7, 10.11.6, 11.0.4, 11.1.3, 11.2.2, 11.3.1
Component/s: Character Sets, Optimizer
Labels:
None

Description

1.Task setting

2. From utf8mb3_val=utf8mb4_val to index lookups

3. NEW: Can we handle non-equality comparisons too?

4. How to implement this

4.1 Suggested solution #1: rewrite

4.2 Suggested solution #2: change the way string comparisons are done

4.3 Suggested solution #3: pseudo-charset

5. Other stuff

5.1 The optimization must be switchable

1. Task setting

Consider this example:

create table t1 (

  a varchar(32) collate utf8mb3_general_ci,

  key(a)

);

insert into t1 select seq from seq_1_to_10000;

create table t2 (

  a varchar(32) collate utf8mb4_general_ci,

  key(a)

);

insert into t2 select seq from seq_1_to_10000;

The optimizer is able to use the index on t2.a for ref access:

explain select * from t1 straight_join t2 on t1.a=t2.a;

+------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+

| id   | select_type | table | type  | possible_keys | key  | key_len | ref  | rows  | Extra                    |

+------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+

|    1 | SIMPLE      | t1    | index | NULL          | a    | 99      | NULL | 10000 | Using index              |

|    1 | SIMPLE      | t2    | ref   | a             | a    | 131     | func | 1     | Using where; Using index |

+------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+

But not the index on t1.a:

explain select * from t2 straight_join t1 on t1.a=t2.a;

+------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+

| id   | select_type | table | type  | possible_keys | key  | key_len | ref  | rows  | Extra                                                        |

+------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+

|    1 | SIMPLE      | t2    | index | a             | a    | 131     | NULL | 10000 | Using index                                                  |

|    1 | SIMPLE      | t1    | index | NULL          | a    | 99      | NULL | 10000 | Using where; Using index; Using join buffer (flat, BNL join) |

+------+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------------------------------------------+

This is because internally t1.a=t2.a is converted into:

convert(t1.a using utf8mb4) = t2.a

This MDEV is about making it possible to use ref access on t2.a.

2. From utf8mb3_val=utf8mb4_val to index lookups

When we have

utf8mb3_key_col = utf8mb4_value

the normal way to compare is to convert the mb3 value into mb4 and then do the comparison.

But what if we want to use

INDEX(utf8mb3_key_col)

to find the matching value?

UTF8-MB4 consists of

BMP (Basic Multi-lingual Plane) characters. These can be represented in UTF8MB3.
Characters from other Planes, named Supplementary or Astral planes.

Consider a lookup value that is in UTF8MB4.

If the characters are in BMP, we can just make the lookup.
What if there are characters outside BMP? They may or may not compare as equal to characters in BMP. It depends on the collation.

In utf8mb4_general_ci, all characters outside the BMP compare as equal to each other
and equal to the U+FFFD Replacement Character, which is in BMP (In the code, search for MY_CS_REPLACEMENT_CHARACTER).

A demo

create table t12 (

  mb3 varchar(10) collate utf8mb3_general_ci,

  mb4 varchar(10) collate utf8mb4_general_ci

);

-- mb3 has the Replacement Character.

-- mb4 has some non-BMP character.

insert into t12 values (

 _utf8mb3 0xEFBFBD,

 _utf8mb4 0xF09D8C86);

set names utf8mb4;

MariaDB [test]> select mb3, hex(mb3), mb4, hex(mb4) from t12;

+------+----------+------+----------+

| mb3  | hex(mb3) | mb4  | hex(mb4) |

+------+----------+------+----------+

| �    | EFBFBD   | 𝌆     | F09D8C86 |

+------+----------+------+----------+

1 row in set (0.001 sec)

Non-bmp character is equal to the Replacement Character:

MariaDB [test]> select mb3=mb4 from t12;

+---------+

| mb3=mb4 |

+---------+

|       1 |

+---------+

1 row in set (0.004 sec)

If one converts UTF8MB4 to MB3, MariaDB will replace non-BMP characters with '?' (Question Mark), NOT with replacement character:

MariaDB [test]> select convert(mb4 using utf8mb3) from t12;

+----------------------------+

| convert(mb4 using utf8mb3) |

+----------------------------+

| ?                          |

+----------------------------+

1 row in set, 1 warning (0.004 sec)

MariaDB [test]> show warnings;

+---------+------+------------------------------------------------------------+

| Level   | Code | Message                                                    |

+---------+------+------------------------------------------------------------+

| Warning | 1977 | Cannot convert 'utf8mb4' character 0xF09D8C86 to 'utf8mb3' |

+---------+------+------------------------------------------------------------+

3. (NEW) NEW: Can we handle non-equality comparisons too?

It seems like we could. If all non-BMP characters are equal between each other and are equal to one BMP character, why not?

4. How to implement this

4.1 Suggested solution #1: rewrite

So, currently, when one calls fix_fields() for

  mb3_key_col= mb4_col

it will be converted into

  convert(mb3_key_col using utf8mb4) = mb4_col

The optimizer only considers ref access for column references, that is, for

   mb3_key_col= ...

The idea is to add a rewritten equality, that is, convert "mb3_key_col=mb4_col" into

  convert(mb3_key_col using utf8mb4) = mb4_col -- AS BEFORE

AND

  mb3_key_col= convert_with_replacement(mb4_col, replacement_char) -- ADDED PART

The convert_with_replacement() is like CONVERT(), except that it

replaces non-converted symbols with the Replacement Character.
doesn't emit any warnings.

Advantage of this approach:

the fix is not intrusive. We don't need to change the way string comparisons are done.
the fix covers all optimizer. Range optimizer will be able to handle utf8mb3_col= utf8mb4_const.

Disadvantages:

One condition becomes two. We can work around this by making make_join_select() ignore the "ADDED PART".
What if "mb4_col" is a complex expression. How do we make a copy of it to use in the ADDED PART? (Can we use Item_direct_ref here?)

*See below for link to an almost-finished patch. This solution receives pushback because "it's a hack"*

4.2 Suggested solution #2: change the way string comparisons are done

Currently, fix_fields() code will inject CONVERT() call if character sets do not match.
The idea is not not inject it, and instead allow string comparisons to use different character sets.
They would dynamically check if the charset is the right one and convert if it is not..

*See below for link to a patch with attempt to implement it*
The problem is that we cannot set the conversion code to do "Charset Narrowing" whenever we're converting from MB4 to MB3. There are cases where the server is converting from MB4 to MB3 and replaces non-BMP characters with '?' (question mark). We need to keep this behavior.

I've made an attempt to add hooks to only perform Charset Narrowing when we're producing an index lookup value.
I had to add hooks in 4 places already:

There are multiple places where conversion is done:

1. CharsetNarrowing#1: class store_key_field. Uses Field::do_field_string.

2. CharsetNarrowing#2: class store_key_item, copy_inner()

3. CharsetNarrowing#3: class store_key_item_const

4. CharsetNarrowing#4: in Field_str::get_mm_leaf()

...

The patch seems to be very intrusive and risky.

4.3 Suggested solution #3: pseudo-charset

Basic idea: So we're comparing

mb3col = mb4col

We need to handle two operations:

1. Compare the values. We can just assume that mb3col has utf8mb4 content in it.

const char *mb3col_data= ...; // This is in UTF8MB3
const char *mb4col_data= ...; // This is in UTF8mb4.
const char *left_value_in_mb4= utf8mb3_data; // Ok
res= compare(left_value_in_mb4, mb4col_data, &my_charset_utf8mb4_general_ci);

2. Take MB4 data and create MB3 index lookup key from it.

Narrowing is done by using a special "Narrowing charset":

We take the source utf8mb4_general_ci but change mb_wc function to
replace non-Bmp characters with MY_CS_REPLACEMENT_CHAR.

Todo: is there really a single place we could replace the charset?

5. Other stuff

5.1 The optimization must be switchable

One must be able to turn the new functionality on and off. Possible ways to do it:

optimizer_switch flag.
old_mode variable.

Possible issues with optimizer_switch flag:

we only want the optimization to be switchable in the stable versions. New versions should have it always ON, there is no need for a switch.
we might need to check the flag on prepare stage (where the current code adds CONVERT(...} call). Optimizer stuff at the prepare stage is not good.

Attachments

Issue Links

includes

MDEV-8872 Performance regressions with utf8mb4 vs utf8 in WordPress

Closed

relates to

MDEV-8872 Performance regressions with utf8mb4 vs utf8 in WordPress

Closed

MDEV-32127 direct comparison of strings in different charsets

Open

MDEV-32148 Inefficient WHERE timestamp_column=datetime_const_expr

Closed

MDEV-32526 Enable Charset Narrowing by default starting from 11.3

Closed

Activity

Ascending order - Click to sort in descending order

View 12 older comments

Sergei Petrunia added a comment - 2023-10-10 12:11

Patch for review: https://github.com/MariaDB/server/commit/3bd8810117aefbe08a4895416a126e50c021447a
pushed into bb-10.6-mdev32113-variant3.

Sergei Petrunia added a comment - 2023-10-10 12:11 Patch for review: https://github.com/MariaDB/server/commit/3bd8810117aefbe08a4895416a126e50c021447a pushed into bb-10.6-mdev32113-variant3 .

Sergei Petrunia added a comment - 2023-10-17 09:05

Patch addressing review input: https://github.com/MariaDB/server/commit/c81db721cad473eaea716228b1380fa827795f81

Sergei Petrunia added a comment - 2023-10-17 09:05 Patch addressing review input: https://github.com/MariaDB/server/commit/c81db721cad473eaea716228b1380fa827795f81

Lena Startseva added a comment - 2023-10-19 11:49 - edited

Testing of variant3 is done, if there is no changes, it's Ok to push

Lena Startseva added a comment - 2023-10-19 11:49 - edited Testing of variant3 is done, if there is no changes, it's Ok to push

Michael Widenius added a comment - 2023-10-19 14:25

Review approved

Michael Widenius added a comment - 2023-10-19 14:25 Review approved

Sergei Petrunia added a comment - 2023-12-05 11:49

The docs are here: https://mariadb.com/kb/en/charset-narrowing-optimization/

Sergei Petrunia added a comment - 2023-12-05 11:49 The docs are here: https://mariadb.com/kb/en/charset-narrowing-optimization/

People

Assignee:: Sergei Petrunia

Reporter:: Sergei Petrunia

Votes:: 1 Vote for this issue

Watchers:: 10 Start watching this issue

Dates

Created:: 2023-09-06 12:28

Updated:: 2024-07-07 19:22

Resolved:: 2023-10-19 17:22

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server

utf8mb3_key_col=utf8mb4_value cannot be used for ref access