[MDEV-27009] Add UCA-14.0.0 collations Created: 2021-11-09 Updated: 2023-11-21 Resolved: 2022-08-10 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Character Sets |
| Fix Version/s: | 10.10.1 |
| Type: | Task | Priority: | Critical |
| Reporter: | Alexander Barkov | Assignee: | Alexander Barkov |
| Resolution: | Fixed | Votes: | 3 |
| Labels: | Preview_10.10 | ||
| Issue Links: |
|
| Description |
|
In order to make utf8mb4 the default character set (see MDEV-19123), we need a reasonable default collation for utf8mb4. utf8mb4_general_ci is not a good choice: it compares all supplementary characters (code points in the range U+10000 to U+10FFFF) as equal to each other. Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either: these standards are more than 10 years old.

Let's add a number of Unicode Collation Algorithm (UCA) collations based on Unicode-14.0.0 (released in September 2021), so that we can later make one of them the default for utf8mb4 (under terms of MDEV-19123).

Under terms of this task we'll add the "root" collation, as well as all language-specific collations that exist for the old Unicode version.

Character sets

New collations will be added for these Unicode character sets:
Tailoring (language-specific collations)

We'll add collations for all 22 tailorings that exist for the old UCA-4.0.0 collations, except Thai.
Note that with the built-in contractions supported in UCA-14.0.0 collations, a separate Thai collation won't be needed. Thai will work with the default root collation.

Built-in contractions

New collations will include built-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt

There are 939 built-in contractions in UCA-14.0.0. To mention a few of them:
Note that the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built-in contractions.

Performance

The patch adding UCA-14.0.0 collations will be pushed together with these patches improving the performance:
Customization flags

Under terms of this task we'll add support for the following flags:
Any arbitrary combination of these flags will be possible. That means eight combinations will be added for each tailoring (each language). For example, for the Czech language, the following collations will be available:
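The count of eight is what three independent on/off flags give (2 x 2 x 2). A minimal sketch in Python that enumerates such combinations for one tailoring. The suffix spellings (`nopad`, `ai`/`as`, `ci`/`cs`) and their order are assumptions modelled on the `uca1400_as_ci` name used in this task; the helper `flag_combinations` is hypothetical, and MDEV-27210 remains the authoritative naming reference.

```python
from itertools import product


def flag_combinations(tailoring):
    """Enumerate short collation names for one tailoring (illustrative only)."""
    # Assumed flag spellings: padding, accent sensitivity, case sensitivity.
    pad_options = ["", "nopad_"]
    accent_options = ["ai", "as"]
    case_options = ["ci", "cs"]
    return [
        f"uca1400_{tailoring}_{pad}{accent}_{case}"
        for pad, accent, case in product(pad_options, accent_options, case_options)
    ]


czech_collations = flag_combinations("czech")
for name in czech_collations:
    print(name)
```

Each additional binary flag would double the number of collations per tailoring, which is why the flag set directly affects how many IDs have to be reserved.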
Naming convention

The naming convention for the new collations is described in MDEV-27210.

Short collation name

The character set name in the collation name will be optional in all syntactic constructs:
Notice, it will be enough to specify just uca1400_as_ci without having to type the full names:
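The short-to-full name mapping can be pictured as prefixing the short name with the character set that is effective in the given context. A minimal sketch in Python; the helper `resolve_collation_name` is hypothetical and only illustrates the naming rule, not the server's actual resolution logic.

```python
def resolve_collation_name(name, charset):
    """Return the full collation name for `name` in the context of `charset`."""
    prefix = charset + "_"
    if name.startswith(prefix):
        # Already a full name for this character set: keep it unchanged.
        return name
    # Short name: prepend the character set effective in this context.
    return prefix + name


print(resolve_collation_name("uca1400_as_ci", "utf8mb4"))
print(resolve_collation_name("utf8mb4_uca1400_as_ci", "utf8mb4"))
```

Both calls yield `utf8mb4_uca1400_as_ci`, matching the rule that SHOW CREATE output always carries the character set prefix.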
The full name will be detected automatically according to the character set effective in the given context.

SHOW CREATE statements

SHOW CREATE statements will display long collation names. For example, SHOW CREATE for the table created using the CREATE TABLE statement will produce the following output:
Notice, the COLLATE clause will contain the full collation name, including the character set prefix.

INFORMATION_SCHEMA.COLLATIONS changes

The following columns in INFORMATION_SCHEMA.COLLATIONS will be changed to be NULL-able:
All new UCA1400 collations will be displayed as follows:
For old collations nothing will change. For example, the output from this query:
will look like this:
The idea is that short collation names will be applicable to multiple character sets.

INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY changes

In addition to the columns COLLATION_NAME and CHARACTER_SET_NAME that currently exist in the table INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY, three new columns will be added:
So the new structure for INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY will look like this:
The column COLLATION_NAME will display:
So for example, the output from this query:
will look roughly like this:
New ID ranges

We'll use the range 2048-4095 for the new UCA-14.0.0 collations. The ID will encode:
into 12 bits as follows:
where
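As a rough illustration of packing several fields into a 12-bit ID in the range 2048-4095, here is a sketch in Python. The field widths and positions used here (a fixed top bit, 3 bits for a character set ID, 5 bits for a tailoring ID, 3 bits for flags) are assumptions made for the example only, not the layout defined by this task; the 5-bit tailoring width is merely consistent with the binary tailoring IDs quoted in the comments. The names `make_collation_id` and `split_collation_id` are hypothetical.

```python
ID_BASE = 2048  # bit 11 set: the lowest ID in the 2048-4095 range


def make_collation_id(charset_id, tailoring_id, flags):
    """Pack the fields into one ID (assumed layout: 1 | CCC | TTTTT | FFF)."""
    if not 0 <= charset_id < 8:
        raise ValueError("charset_id must fit in 3 bits")
    if not 0 <= tailoring_id < 32:
        raise ValueError("tailoring_id must fit in 5 bits")
    if not 0 <= flags < 8:
        raise ValueError("flags must fit in 3 bits")
    return ID_BASE | (charset_id << 8) | (tailoring_id << 3) | flags


def split_collation_id(collation_id):
    """Unpack an ID produced by make_collation_id (same assumed layout)."""
    if not 2048 <= collation_id <= 4095:
        raise ValueError("ID is outside the 2048-4095 range")
    return {
        "charset_id": (collation_id >> 8) & 0b111,
        "tailoring_id": (collation_id >> 3) & 0b11111,
        "flags": collation_id & 0b111,
    }


lowest = make_collation_id(0, 0, 0)
highest = make_collation_id(7, 31, 7)
print(lowest, highest)
print(split_collation_id(make_collation_id(0, 24, 0)))
```

Under this assumed layout the lowest and highest encodable IDs are exactly 2048 and 4095, and a 5-bit tailoring field leaves room for values up to 31.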
|
| Comments |
| Comment by Vladislav Vaintroub [ 2021-11-09 ] | ||
|
We also talked, within the natural sort discussion, about several things. I do remember that only case-insensitive would not be enough, and case-insensitive was requested several times. Apart from that, we talked about how to add different properties without creating a new collation (level=1,2,3, numeric, nopad, locale=de_DE!). This might expand the scope of this MDEV, but let's not forget about it, it is quite important. I personally feel like level 3 (accent-case-insensitive) is long due for good sort. As for utf32 and utf16 collations, I feel like this is not important at all: in 2021 Unicode is utf8, and anything else is legacy. | ||
| Comment by John Bilicki [ 2021-12-23 ] | ||
|
Definitely utf8mb4_unicode_1400_ci instead of utf8mb4_1400_ai_ci, as MySQL is obsolete due to Oracle. It sounds like case-sensitive is the way to go. "Level 3 (accent-case-insensitive) is long due for good sort" sounds like a good idea. I think UTF8 collations should be the focus for now; UTF16 and UTF32 could be added afterwards, though I wouldn't consider them legacy, considering the potential extreme need for language expansion in regards to an entirely different subject pending ~two decades from now. If there is a slowdown, I imagine that anyone it affects could just elect to use a lower-level Unicode standard, and anyone who truly needs a higher standard would likely be able to justify the hardware costs to compensate. | ||
| Comment by Alexander Barkov [ 2022-03-09 ] | ||
|
Hello serg, Please review a set of patches implementing UCA-14.0.0 collations in Thanks. | ||
| Comment by Alexander Barkov [ 2022-05-25 ] | ||
|
serg, please review a new version here: | ||
| Comment by Alexander Barkov [ 2022-06-15 ] | ||
|
elenst, please test this task. | ||
| Comment by Lena Startseva [ 2022-07-12 ] | ||
|
I'm not sure that it is a bug, but incorrect IDs are generated for "croatian" and "vietnamese". According to the new rule for ID ranges:
The task says that we have 22 tailorings (default + 21 languages), but "croatian" has tailoring ID "11000" (24 in decimal) and "vietnamese" has "10111" (23 in decimal). Also, MDEV-27210 describes more flags than the current task, but their potential presence is not accounted for in the IDs. UPD:
| ||
| Comment by Lena Startseva [ 2022-08-08 ] | ||
|
Ok to push |