[MDEV-32127] direct comparison of strings in different charsets

Details

Type: Task
Status: Open
Priority: Major
Resolution: Unresolved
Fix Version/s: None
Component/s: Character Sets
Labels: None

Description

Since 10.10.1 (~~MDEV-27009~~) MariaDB has detached collations from character sets.

A character set defines a repertoire of characters, which sequence of bytes represents each character, what properties each character has, and so on. Examples of character sets are utf8mb3, ucs2, latin1, and sjis.
A collation defines how sequences of characters are compared. Examples of collations are uca1400_latvian_ai_ci, latin1_german2_ci, sjis_japanese_ci.

Some collations apply only to one character set and have the character set as a part of the collation name (e.g. latin1_german2_ci and sjis_japanese_ci), others (like uca1400_latvian_ai_ci) apply to many different character sets.

When MariaDB internally needs to compare two strings, it first needs to have both of them in the same character set. Thus, when two expressions (items) are compared, MariaDB first determines which character set and collation they should be compared in, then wraps them, as needed, in a CONVERT(expr USING charset) function. During execution it then gets both expression results already in the same character set and compares them.
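The convert-then-compare flow described above can be sketched in Python. This is only an illustration, not the server's code: the function name `compare_via_conversion` is hypothetical, Python codec names stand in for MariaDB character sets, and a simple case-insensitive comparison stands in for a real collation.

```python
def compare_via_conversion(a: bytes, a_charset: str,
                           b: bytes, b_charset: str,
                           common: str = "utf-8") -> int:
    """Return -1, 0 or 1 after converting both inputs to one character set.

    Hypothetical helper: `common` plays the role of the character set the
    server picked for the comparison.
    """
    # Step 1: the analogue of wrapping each side in CONVERT(expr USING charset).
    # Both strings are fully re-encoded before any comparison happens.
    a_conv = a.decode(a_charset).encode(common)
    b_conv = b.decode(b_charset).encode(common)

    # Step 2: compare the two results, which are now in the same character set.
    # Case folding is a toy stand-in for a case-insensitive collation.
    key_a = a_conv.decode(common).casefold()
    key_b = b_conv.decode(common).casefold()
    return (key_a > key_b) - (key_a < key_b)
```

Note that both inputs are converted in full even if the comparison could be decided by the first character.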

Note that for this to succeed, the server must determine a single collation that can be used to compare the results of the expressions; this is logically unavoidable. But strictly speaking, there is no need to convert both results to the same character set. If they are in different character sets but can be compared according to one specific collation, then this collation must apply to both character sets. This is only true for the UCA collations. Comparison in a UCA collation generally works like this:

read the next weight from the first string (read characters as needed, convert to the weight)
read the next weight from the second string
compare weights
repeat

Converting characters to weights is a character-set-dependent operation, but it is very easy to modify the above loop to use a different character set for the first and the second string; this won't affect how weights are compared.
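A minimal Python sketch of the weight-by-weight loop, with each string read in its own character set, might look like the following. Everything here is an assumption made for illustration: `weights` and `compare_mixed` are hypothetical names, Python codecs stand in for MariaDB character sets, and the "weight" is just the code point of the case-folded character, not a real UCA weight.

```python
import codecs
from itertools import zip_longest
from typing import Iterator


def weights(data: bytes, charset: str) -> Iterator[int]:
    """Lazily yield toy weights, reading characters from `data` as needed."""
    decoder = codecs.getincrementaldecoder(charset)()
    last = len(data) - 1
    for i in range(len(data)):
        # Feed one byte at a time; the decoder returns "" until a complete
        # character in this character set has been read.
        for ch in decoder.decode(data[i:i + 1], final=(i == last)):
            # One character may map to several weights (e.g. "ß" -> "ss").
            for folded in ch.casefold():
                yield ord(folded)  # toy weight, NOT a real UCA weight


def compare_mixed(a: bytes, a_charset: str, b: bytes, b_charset: str) -> int:
    """Compare two strings in different character sets without converting either."""
    # Read the next weight from each string, compare, repeat.
    # fillvalue=-1 makes an exhausted string sort before a longer one.
    pairs = zip_longest(weights(a, a_charset), weights(b, b_charset), fillvalue=-1)
    for wa, wb in pairs:
        if wa != wb:
            return -1 if wa < wb else 1
    return 0
```

Only the weight reader depends on the character set; the comparison loop sees nothing but weights, so it is the same for any pair of character sets.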

Comparing strings directly in this way would eliminate fragile expression tree rewrites and expensive character set conversions.

Issue Links

relates to: MDEV-32113 utf8mb3_key_col=utf8mb4_value cannot be used for ref access (Closed)

People

Assignee: Alexander Barkov

Reporter: Sergei Golubchik

Votes: 0

Watchers: 1

Dates

Created: 2023-09-07 18:50

Updated: 2023-09-07 18:50

MariaDB Server