[MDEV-32127] direct comparison of strings in different charsets Created: 2023-09-07 Updated: 2023-09-07 |
|
| Status: | Open |
| Project: | MariaDB Server |
| Component/s: | Character Sets |
| Fix Version/s: | None |
| Type: | Task | Priority: | Major |
| Reporter: | Sergei Golubchik | Assignee: | Alexander Barkov |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||
| Description |
|
Since 10.10.1 ( A character set defines characters: which sequence of bytes represents which character, which properties each character has, etc. Examples of character sets are utf8mb3, ucs2, latin1, sjis. Some collations apply only to one character set and have the character set as part of the collation name (e.g. latin1_german2_ci and sjis_japanese_ci); others (like uca1400_latvian_ai_ci) apply to many different character sets.

When MariaDB internally needs to compare two strings, it first needs to have them both in the same character set. Thus, when two expressions (items) are compared, MariaDB first determines which character set and collation they should be compared in, then wraps them, as needed, in a CONVERT(expr USING charset) function. During execution it then gets both expression results already in the same character set and compares them.

Note that for this to succeed, the server must determine one single collation that can be used to compare the results of the expressions; this is logically unavoidable. But strictly speaking, there is no need to convert both results to the same character set. If they are in different character sets but can be compared according to one specific collation, then this collation must apply to both character sets. This is only true for the UCA collations.

Comparison in a UCA collation generally works as a loop: it takes the next character from each string, converts each character to its collation weight, and compares the two weights, stopping at the first difference.
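As a rough illustration of such a weight-based comparison loop, here is a minimal Python sketch, not the server's actual code. The names `toy_weight` and `compare_same_charset` are hypothetical, and the weight function is a toy stand-in (case folding plus accent stripping) for real UCA weight tables. Both inputs are assumed to be in the same character set already, as they would be after the CONVERT() wrapping.

```python
import unicodedata


def toy_weight(ch: str) -> int:
    # Hypothetical stand-in for a UCA primary weight: fold case, drop accents.
    # Real UCA weights come from collation tables, not from code points.
    return ord(unicodedata.normalize("NFD", ch.casefold())[0])


def compare_same_charset(s1: bytes, s2: bytes, charset: str) -> int:
    """Return <0, 0 or >0, comparing the strings weight by weight."""
    chars1 = iter(s1.decode(charset))
    chars2 = iter(s2.decode(charset))
    while True:
        c1 = next(chars1, None)
        c2 = next(chars2, None)
        if c1 is None or c2 is None:
            # One or both strings ended: the shorter one sorts first.
            return (c1 is not None) - (c2 is not None)
        w1, w2 = toy_weight(c1), toy_weight(c2)
        if w1 != w2:
            return w1 - w2
```

With this toy weight function, `compare_same_charset("Äb".encode("utf-8"), "ab".encode("utf-8"), "utf-8")` returns 0, because both strings produce the same weight sequence.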
Converting characters to weights is a character-set-dependent operation, but it is very easy to modify the comparison loop to use different character sets for the first and the second string; this does not affect how the weights are compared. Doing so would eliminate fragile expression tree rewrites and expensive character set conversions. |
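The proposed change might look roughly like the following Python sketch, in which each string is scanned with its own character set and only the weights are compared. All names here (`characters`, `compare_mixed`, `toy_weight`) are hypothetical, the weight function is a toy substitute for real UCA weights, and Python codec names stand in for MariaDB character sets.

```python
import codecs
import unicodedata


def toy_weight(ch: str) -> int:
    # Hypothetical stand-in for a UCA primary weight: fold case, drop accents.
    return ord(unicodedata.normalize("NFD", ch.casefold())[0])


def characters(data: bytes, charset: str):
    """Yield characters one at a time, decoding lazily from `charset`."""
    decoder = codecs.getincrementaldecoder(charset)()
    for byte in data:
        # Returns an empty string until a complete character is available.
        yield from decoder.decode(bytes([byte]))
    yield from decoder.decode(b"", final=True)


def compare_mixed(s1: bytes, cs1: str, s2: bytes, cs2: str) -> int:
    """Compare two strings in different character sets, weight by weight."""
    chars1 = characters(s1, cs1)  # scanner for the first string's charset
    chars2 = characters(s2, cs2)  # scanner for the second string's charset
    while True:
        c1 = next(chars1, None)
        c2 = next(chars2, None)
        if c1 is None or c2 is None:
            # One or both strings ended: the shorter one sorts first.
            return (c1 is not None) - (c2 is not None)
        w1, w2 = toy_weight(c1), toy_weight(c2)
        if w1 != w2:
            return w1 - w2
```

For example, `compare_mixed("Ärger".encode("latin-1"), "latin-1", "arger".encode("utf-8"), "utf-8")` returns 0 without either string being converted to the other's character set up front. Only the per-string scanner depends on the character set; the weight comparison itself is unchanged.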