Details
-
Task
-
Status: Closed (View Workflow)
-
Critical
-
Resolution: Fixed
Description
In order to make utf8mb4 the default character set (see MDEV-19123), we need a reasonable default collation for utf8mb4.
utf8mb4_general_ci is not good — it compares all supplementary characters (with code points in the range U10000 to U10FFFF) as equal to each other.
In 2021 it's not acceptable — there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table.
Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either — these standards are more than 10 years old.
Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of MDEV-19123).
Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version.
Character sets
New collations will be added for these Unicode character sets
- utf8mb3
- utf8mb4
- ucs2
- utf16
- utf32
Tailoring (language specific collations)
We'll add collations for all 22 tailorings that exists for the old UCA-4.0.0 collations, except Thai.
- icelandic
- latvian
- romanian
- slovenian
- polish
- estonian
- spanish
- swedish
- turkish
- czech
- danish
- lithuanian
- slovak
- spanish2
- roman
- persian
- esperanto
- hungarian
- sinhala
- german2
- croatian
- vietnamese
Note, with built-in contractions that will be supported in UCA-14.0.0 collations, a separate Thai collation won't be needed. Thai will work with the default root collation.
Built-in contractions
New collations will include built-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt
There are 939 built-in contractions in UCA-14.0.0.
To mention a few of them:
0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI>
|
0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI>
|
0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI>
|
0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI>
|
0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI>
|
Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built-in contractions.
This will be a step towards better Unicode Collation Algorithm compliance. With built-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation.
Performance
The patch adding UCA-14.0.0 collations will be pushed together with these patches improving the performance:
- MDEV-27266 Improve UCA collation performance for utf8mb3 and utf8mb4
- MDEV-27265 Improve contraction performance in UCA collations
Customization flags
Under terms of this task we'll add support for the following flags
- pad/nopad
- accent sensitivity
- case sensititivy
Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available:
+---------------------------+
|
| collation_name |
|
+---------------------------+
|
| uca1400_czech_ai_ci |
|
| uca1400_czech_ai_cs |
|
| uca1400_czech_as_ci |
|
| uca1400_czech_as_cs |
|
| uca1400_czech_nopad_ai_ci |
|
| uca1400_czech_nopad_ai_cs |
|
| uca1400_czech_nopad_as_ci |
|
| uca1400_czech_nopad_as_cs |
|
+---------------------------+
|
Naming convention
Naming convention for the new collations is described in MDEV-27210.
Short collation name
The character set name in the collation name will be optional in all syntactic constructs:
CREATE TABLE t1 ( |
a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci, |
b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci |
);
|
Notice, it will be enough to specify just uca1400_as_ci without having to type the full names:
- utf8mb4_uca1400_as_ci for the column "a", or
- utf16_uca1400_as_ci for the column "b".
The full name will be detected automatically according to the character set effective in the given context.
SHOW CREATE statements
SHOW CREATE statements will display long collations names.
For example, SHOW CREATE for the table created using the CREATE TABLE statement will produce the following output:
SHOW CREATE TABLE t1; |
+-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|
| Table | Create Table |
|
+-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|
| t1 | CREATE TABLE `t1` (
|
`a` char(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_as_ci DEFAULT NULL,
|
`b` char(10) CHARACTER SET utf16 COLLATE utf16_uca1400_as_ci DEFAULT NULL
|
) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
|
+-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|
Notice, the COLLATE clause will contain the full collation name, including the character set prefix.
But this will possibly change in later versions to display short collation names.
INFORMATION_SCHEMA.COLLATIONS changes
The following columns in INFORMATION_SCHEMA.COLLATIONS will be changed to be NULL-able:
- CHARACTER_SET_NAME
- ID
- IS_DEFAULT
All new UCA1400 collations will be displayed as follows:
- The COLLATION_NAME column will display the short name, without character set prefix.
- The three NULL-able column mentioned above will display NULL
For old collations nothing will change.
For example, the output from this query:
SELECT * FROM INFORMATION_SCHEMA.COLLATIONS |
WHERE COLLATION_NAME LIKE '%czech%'; |
will look like this:
+---------------------------+--------------------+------+------------+-------------+---------+
|
| COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN |
|
+---------------------------+--------------------+------+------------+-------------+---------+
|
| latin2_czech_cs | latin2 | 2 | | Yes | 4 |
|
| cp1250_czech_cs | cp1250 | 34 | | Yes | 2 |
|
| utf8mb3_czech_ci | utf8mb3 | 202 | | Yes | 8 |
|
| ucs2_czech_ci | ucs2 | 138 | | Yes | 8 |
|
| utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 |
|
| uca1400_czech_ai_ci | NULL | NULL | NULL | Yes | 8 |
|
| uca1400_czech_ai_cs | NULL | NULL | NULL | Yes | 8 |
|
| uca1400_czech_as_ci | NULL | NULL | NULL | Yes | 8 |
|
| uca1400_czech_as_cs | NULL | NULL | NULL | Yes | 8 |
|
| uca1400_czech_nopad_ai_ci | NULL | NULL | NULL | Yes | 8 |
|
| uca1400_czech_nopad_ai_cs | NULL | NULL | NULL | Yes | 8 |
|
| uca1400_czech_nopad_as_ci | NULL | NULL | NULL | Yes | 8 |
|
| uca1400_czech_nopad_as_cs | NULL | NULL | NULL | Yes | 8 |
|
| utf16_czech_ci | utf16 | 111 | | Yes | 8 |
|
| utf32_czech_ci | utf32 | 170 | | Yes | 8 |
|
+---------------------------+--------------------+------+------------+-------------+---------+
|
The idea is that short collation names will be applicable to multiple character sets.
We'll eventually remove columns CHARACTER_SET_NAME, ID, IS_DEFAULT from the table COLLATIONS. We'll keep them for now only for easier upgrades.
INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY changes
In addition to columns COLLATION_NAME and CHARACTER_SET_NAME currently existing in the table INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY, three new columns will be added:
- FULL_COLLATION_NAME
- ID
- IS_DEFAULT
So the new structure for INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY will look like this:
CREATE TEMPORARY TABLE `COLLATION_CHARACTER_SET_APPLICABILITY` ( |
`COLLATION_NAME` varchar(64) NOT NULL, |
`CHARACTER_SET_NAME` varchar(32) NOT NULL, |
`FULL_COLLATION_NAME` varchar(64) NOT NULL, |
`ID` bigint(11) NOT NULL, |
`IS_DEFAULT` varchar(3) NOT NULL |
) ENGINE=MEMORY DEFAULT CHARSET=utf8mb3; |
The column COLLATION_NAME will display:
- short names, without the character set name prefix, for new UCA1400 collations.
- long names for all old collations
So for example, the output from this query:
SELECT * FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY |
WHERE COLLATION_NAME LIKE '%czech%cs'; |
will look about like this:
+---------------------------+--------------------+-----------------------------------+------+------------+
|
| COLLATION_NAME | CHARACTER_SET_NAME | FULL_COLLATION_NAME | ID | IS_DEFAULT |
|
+---------------------------+--------------------+-----------------------------------+------+------------+
|
| latin2_czech_cs | latin2 | latin2_czech_cs | 2 | |
|
| cp1250_czech_cs | cp1250 | cp1250_czech_cs | 34 | |
|
| uca1400_czech_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_ai_cs | 2129 | |
|
| uca1400_czech_as_cs | utf8mb3 | utf8mb3_uca1400_czech_as_cs | 2131 | |
|
| uca1400_czech_nopad_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_ai_cs | 2133 | |
|
| uca1400_czech_nopad_as_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_as_cs | 2135 | |
|
| uca1400_czech_ai_cs | ucs2 | ucs2_uca1400_czech_ai_cs | 2641 | |
|
| uca1400_czech_as_cs | ucs2 | ucs2_uca1400_czech_as_cs | 2643 | |
|
| uca1400_czech_nopad_ai_cs | ucs2 | ucs2_uca1400_czech_nopad_ai_cs | 2645 | |
|
| uca1400_czech_nopad_as_cs | ucs2 | ucs2_uca1400_czech_nopad_as_cs | 2647 | |
|
| uca1400_czech_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_ai_cs | 2385 | |
|
| uca1400_czech_as_cs | utf8mb4 | utf8mb4_uca1400_czech_as_cs | 2387 | |
|
| uca1400_czech_nopad_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_ai_cs | 2389 | |
|
| uca1400_czech_nopad_as_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_as_cs | 2391 | |
|
| uca1400_czech_ai_cs | utf16 | utf16_uca1400_czech_ai_cs | 2897 | |
|
| uca1400_czech_as_cs | utf16 | utf16_uca1400_czech_as_cs | 2899 | |
|
| uca1400_czech_nopad_ai_cs | utf16 | utf16_uca1400_czech_nopad_ai_cs | 2901 | |
|
| uca1400_czech_nopad_as_cs | utf16 | utf16_uca1400_czech_nopad_as_cs | 2903 | |
|
| uca1400_czech_ai_cs | utf32 | utf32_uca1400_czech_ai_cs | 3153 | |
|
| uca1400_czech_as_cs | utf32 | utf32_uca1400_czech_as_cs | 3155 | |
|
| uca1400_czech_nopad_ai_cs | utf32 | utf32_uca1400_czech_nopad_ai_cs | 3157 | |
|
| uca1400_czech_nopad_as_cs | utf32 | utf32_uca1400_czech_nopad_as_cs | 3159 | |
|
+---------------------------+--------------------+-----------------------------------+------+------------+
|
New ID ranges
We'll use the range 2048-4095 for new UCA-14.0.0 collation.
The ID will encode:
- character set
- tailoring
- flags
into 12 bits as follows:
1ccc tttt tPST
|
where
- ccc (3 bits) - the character set ID
|
(0 - utf8mb3, 1 - utf8mb4, 2 - ucs2, 3 - utf16, 4 - utf32)
|
|
- ttttt (5 bits) - the tailoring ID
|
(0 - default, 1 - icelandic, 2 - latvian, etc)
|
|
- P (1 bit) - the PAD flag (0 - PAD, 1 - NOPAD)
|
|
- S (1 bit) - Secondary level
|
0 - disabled (accent insensitive)
|
1 - enabled (accent sensitive)
|
|
- T (1 bit) - Tertiary level
|
0 - disabled (case insensitive)
|
1 - enabled (case sensitive)
|
Attachments
Issue Links
- blocks
-
MDEV-19123 Change default charset from latin1 to utf8mb4
-
- Closed
-
-
MDEV-23465 Implement a collation for identifiers
-
- Closed
-
- causes
-
MDEV-29763 ER_EVENT_STORE_FAILED upon creating event under some uca140_xxx_nopad_xx_xx collations
-
- Open
-
-
MDEV-29776 collation_connection and db_collation are too short in mysql.proc and mysql.event
-
- Closed
-
-
MDEV-31608 MySQL.Data can not connect to MariaDB 11 due a collations with a NULL id.
-
- Closed
-
-
MDEV-32816 Make "show collations" data not-nullable
-
- Closed
-
-
MDEV-34478 SHOW COLLATION shows strange NULLs
-
- Closed
-
-
MDEV-34631 Internal Unicode 14.0 collation character sets are NULL
-
- Closed
-
-
MDEV-35538 UBSAN: nullptr-with-offset: runtime error: applying zero offset to null pointer in check_rules and in init_weight_level
-
- Closed
-
-
MDEV-35876 speedup collation/charset lookup
-
- In Testing
-
- is blocked by
-
MDEV-27042 UCA: Resetting contractions to ignorable does not work well
-
- Closed
-
-
MDEV-27154 allkeys.txt based tests for Unicode-4.0.0 and 5.2.0
-
- Closed
-
-
MDEV-27743 Remove Lex::charset
-
- Closed
-
-
MDEV-27896 Wrong result upon `COLLATE latin1_bin CHARACTER SET latin1` on the table or the database level
-
- Closed
-
- is duplicated by
-
MDEV-12339 Adding case sensitive utf8 collation to MariaDB
-
- Closed
-
- relates to
-
MDEV-10132 utf8_thai_520_w2 collation
-
- Closed
-
-
MDEV-23400 Add UCA case sensitive accent sensitive collations for Unicode character sets
-
- Closed
-
-
MDEV-27195 SIGSEGV in Table_scope_and_contents_source_st::vers_check_system_fields
-
- Closed
-
-
MDEV-27210 New naming convention for UCA collations
-
- Open
-
-
MDEV-27265 Improve contraction performance in UCA collations
-
- Closed
-
-
MDEV-27266 Improve UCA collation performance for utf8mb3 and utf8mb4
-
- Closed
-
-
MDEV-27712 Reduce the size of Lex_length_and_dec_st from 16 to 8
-
- Closed
-
-
MDEV-27896 Wrong result upon `COLLATE latin1_bin CHARACTER SET latin1` on the table or the database level
-
- Closed
-
-
MDEV-27906 CREATE TABLE/DATABASE .. CHARSET .. COLLATE is not consistent on errors
-
- Closed
-
-
MDEV-30577 Case folding for uca1400 collations is not up to date
-
- Closed
-
-
MDEV-34808 Update HeidiSQL to v12.8
-
- Closed
-
-
MDEV-23400 Add UCA case sensitive accent sensitive collations for Unicode character sets
-
- Closed
-
-
MDEV-27690 Crash on `CHARACTER SET csname COLLATE DEFAULT` in column definition
-
- Closed
-
-
MDEV-27696 Json table columns accept redundant COLLATE syntax
-
- Closed
-
-
MDEV-27782 Wrong columns when using table level `CHARACTER SET utf8mb4 COLLATE DEFAULT`
-
- Closed
-
-
MDEV-27853 Wrong data type on column `COLLATE DEFAULT` and table `COLLATE some_non_default_collation`
-
- Closed
-
Activity
Field | Original Value | New Value |
---|---|---|
Description |
In order to make utf8mb4 the default character set (see utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other. In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table. Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old. Let's add utf8mb4_unicode_1400_ci - a Unicode collation algoritm collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of Open questions: - Should we add _unicode_1400_ci for ucs2, utf16, utf32 at the same time? - Which naming should we use: -- utf8mb4_unicode_1400_ci (similar to MariaDB's utf8mb4_unicode_520_ci) -- utf8mb4_1400_ai_ci (utf8mb4_0900_ai_ci (similar to MySQL's utf8mb4_0900_ai_ci) -- Some other naming |
In order to make utf8mb4 the default character set (see utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other. In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table. Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old. Let's add utf8mb4_unicode_1400_ci - a Unicode collation algoritm collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of Open questions: - To avoid slow down on switch from utf8mb4_general_ci to the new collation, should we refactor the current code implementation somehow, or is the slow down tolerable? - Should we add _unicode_1400_ci for ucs2, utf16, utf32 at the same time? - Which naming should we use: -- utf8mb4_unicode_1400_ci (similar to MariaDB's utf8mb4_unicode_520_ci) -- utf8mb4_1400_ai_ci (utf8mb4_0900_ai_ci (similar to MySQL's utf8mb4_0900_ai_ci) -- Some other naming |
Description |
In order to make utf8mb4 the default character set (see utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other. In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table. Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old. Let's add utf8mb4_unicode_1400_ci - a Unicode collation algoritm collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of Open questions: - To avoid slow down on switch from utf8mb4_general_ci to the new collation, should we refactor the current code implementation somehow, or is the slow down tolerable? - Should we add _unicode_1400_ci for ucs2, utf16, utf32 at the same time? - Which naming should we use: -- utf8mb4_unicode_1400_ci (similar to MariaDB's utf8mb4_unicode_520_ci) -- utf8mb4_1400_ai_ci (utf8mb4_0900_ai_ci (similar to MySQL's utf8mb4_0900_ai_ci) -- Some other naming |
In order to make utf8mb4 the default character set (see utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other. In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table. Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old. Let's add utf8mb4_unicode_1400_ci - a Unicode collation algoritm collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of Open questions: - To avoid the slow down which will happen when switching from the simple collation utf8mb4_general_ci to the new UCA based collation, should we refactor the current code implementation somehow? or is the slow down tolerable? - Should we add _unicode_1400_ci for ucs2, utf16, utf32 at the same time? - Which naming should we use: -- utf8mb4_unicode_1400_ci (similar to MariaDB's utf8mb4_unicode_520_ci) -- utf8mb4_1400_ai_ci (utf8mb4_0900_ai_ci (similar to MySQL's utf8mb4_0900_ai_ci) -- Some other naming |
Description |
In order to make utf8mb4 the default character set (see utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other. In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table. Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old. Let's add utf8mb4_unicode_1400_ci - a Unicode collation algoritm collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of Open questions: - To avoid the slow down which will happen when switching from the simple collation utf8mb4_general_ci to the new UCA based collation, should we refactor the current code implementation somehow? or is the slow down tolerable? - Should we add _unicode_1400_ci for ucs2, utf16, utf32 at the same time? - Which naming should we use: -- utf8mb4_unicode_1400_ci (similar to MariaDB's utf8mb4_unicode_520_ci) -- utf8mb4_1400_ai_ci (utf8mb4_0900_ai_ci (similar to MySQL's utf8mb4_0900_ai_ci) -- Some other naming |
In order to make utf8mb4 the default character set (see utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other. In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table. Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old. Let's add utf8mb4_unicode_1400_ci - a Unicode collation algoritm collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of Under terms of this task we'll add only the "root" collation. All language specific Unicode-14.0.0 based collations will be added under a separate task later. Open questions: - To avoid the slow down which will happen when switching from the simple collation utf8mb4_general_ci to the new UCA based collation, should we refactor the current code implementation somehow? or is the slow down tolerable? - Should we add _unicode_1400_ci for ucs2, utf16, utf32 at the same time? - Which naming should we use: -- utf8mb4_unicode_1400_ci (similar to MariaDB's utf8mb4_unicode_520_ci) -- utf8mb4_1400_ai_ci (utf8mb4_0900_ai_ci (similar to MySQL's utf8mb4_0900_ai_ci) -- Some other naming |
Description |
In order to make utf8mb4 the default character set (see utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other. In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table. Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old. Let's add utf8mb4_unicode_1400_ci - a Unicode collation algoritm collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of Under terms of this task we'll add only the "root" collation. All language specific Unicode-14.0.0 based collations will be added under a separate task later. Open questions: - To avoid the slow down which will happen when switching from the simple collation utf8mb4_general_ci to the new UCA based collation, should we refactor the current code implementation somehow? or is the slow down tolerable? - Should we add _unicode_1400_ci for ucs2, utf16, utf32 at the same time? - Which naming should we use: -- utf8mb4_unicode_1400_ci (similar to MariaDB's utf8mb4_unicode_520_ci) -- utf8mb4_1400_ai_ci (utf8mb4_0900_ai_ci (similar to MySQL's utf8mb4_0900_ai_ci) -- Some other naming |
In order to make utf8mb4 the default character set (see utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other. In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table. Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old. Let's add utf8mb4_unicode_1400_ci - a Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of Under terms of this task we'll add only the "root" collation. All language specific Unicode-14.0.0 based collations will be added under a separate task later. Open questions: - To avoid the slow down which will happen when switching from the simple collation utf8mb4_general_ci to the new UCA based collation, should we refactor the current code implementation somehow? or is the slow down tolerable? - Should we add _unicode_1400_ci for ucs2, utf16, utf32 at the same time? - Which naming should we use: -- utf8mb4_unicode_1400_ci (similar to MariaDB's utf8mb4_unicode_520_ci) -- utf8mb4_1400_ai_ci (utf8mb4_0900_ai_ci (similar to MySQL's utf8mb4_0900_ai_ci) -- Some other naming |
Link |
This issue blocks |
Link |
This issue blocks |
Priority | Major [ 3 ] | Critical [ 2 ] |
Status | Open [ 1 ] | In Progress [ 3 ] |
Link |
This issue is blocked by |
Link |
This issue relates to |
Link |
This issue is blocked by |
Link |
This issue relates to |
Workflow | MariaDB v3 [ 127183 ] | MariaDB v4 [ 131854 ] |
Status | In Progress [ 3 ] | In Testing [ 10301 ] |
Assignee | Alexander Barkov [ bar ] | Elena Stepanova [ elenst ] |
Link |
This issue relates to |
Assignee | Elena Stepanova [ elenst ] | Ramesh Sivaraman [ JIRAUSER48189 ] |
Link | This issue relates to MDEV-27210 [ MDEV-27210 ] |
Link |
This issue relates to |
Link |
This issue is blocked by |
Link |
This issue is blocked by |
Link |
This issue relates to |
Link |
This issue relates to |
Status | In Testing [ 10301 ] | Stalled [ 10000 ] |
Assignee | Ramesh Sivaraman [ JIRAUSER48189 ] | Alexander Barkov [ bar ] |
Link |
This issue relates to |
Fix Version/s | 10.9 [ 26905 ] | |
Fix Version/s | 10.8 [ 26121 ] |
Link |
This issue relates to |
Link |
This issue relates to |
Link |
This issue relates to |
Link |
This issue relates to |
Link |
This issue relates to |
Link |
This issue relates to |
Link |
This issue relates to |
Link |
This issue is blocked by |
Link |
This issue relates to |
Link |
This issue is blocked by |
Link |
This issue relates to |
Status | Stalled [ 10000 ] | In Progress [ 3 ] |
Assignee | Alexander Barkov [ bar ] | Sergei Golubchik [ serg ] |
Status | In Progress [ 3 ] | In Review [ 10002 ] |
Description |
In order to make utf8mb4 the default character set (see utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other. In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table. Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old. Let's add utf8mb4_unicode_1400_ci - a Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of Under terms of this task we'll add only the "root" collation. All language specific Unicode-14.0.0 based collations will be added under a separate task later. Open questions: - To avoid the slow down which will happen when switching from the simple collation utf8mb4_general_ci to the new UCA based collation, should we refactor the current code implementation somehow? or is the slow down tolerable? - Should we add _unicode_1400_ci for ucs2, utf16, utf32 at the same time? - Which naming should we use: -- utf8mb4_unicode_1400_ci (similar to MariaDB's utf8mb4_unicode_520_ci) -- utf8mb4_1400_ai_ci (utf8mb4_0900_ai_ci (similar to MySQL's utf8mb4_0900_ai_ci) -- Some other naming |
In order to make utf8mb4 the default character set (see utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other. In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table. Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old. Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version. h2. Character sets New collations will be added for these Unicode character sets - utf8mb3 - utf8mb4 - ucs2 - utf16 - utf32 h2. Built-in contractions New collations will include built-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt There are 939 built-in contractions in UCA-14.0.0. To mention a few of them: {noformat} 0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI> 0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI> 0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI> 0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI> 0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI> {noformat} h2. Performance The patch adding UCA-14.0.0 collations will be pushed together with these patches improving the performance: - - h2. Customization flags Under terms of this task we'll add support for the following flags - pad/nopad - accent sensitivity - case sensititivy Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available: {noformat} +---------------------------+ | collation_name | +---------------------------+ | uca1400_czech_ai_ci | | uca1400_czech_ai_cs | | uca1400_czech_as_ci | | uca1400_czech_as_cs | | uca1400_czech_nopad_ai_ci | | uca1400_czech_nopad_ai_cs | | uca1400_czech_nopad_as_ci | | uca1400_czech_nopad_as_cs | +---------------------------+ {noformat} h2. Naming convention Naming convention for the new collations is described in MDEV-27210. h2. Short collation name The character set name in the collation name will be optional in all syntactic constructs: {code:sql} CREATE TABLE t1 ( a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci, b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci ); {code} Notice, it will be enough to specify just uca1400_as_ci without having to type the full names: - utf8mb4_uca1400_as_ci for the column "a", or - utf16_uca1400_as_ci for the column "b". The full name will be detected automatically according to the effective character set. |
Description |
In order to make utf8mb4 the default character set (see utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other. In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table. Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old. Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version. h2. Character sets New collations will be added for these Unicode character sets - utf8mb3 - utf8mb4 - ucs2 - utf16 - utf32 h2. Built-in contractions New collations will include built-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt There are 939 built-in contractions in UCA-14.0.0. To mention a few of them: {noformat} 0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI> 0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI> 0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI> 0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI> 0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI> {noformat} h2. Performance The patch adding UCA-14.0.0 collations will be pushed together with these patches improving the performance: - - h2. Customization flags Under terms of this task we'll add support for the following flags - pad/nopad - accent sensitivity - case sensititivy Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available: {noformat} +---------------------------+ | collation_name | +---------------------------+ | uca1400_czech_ai_ci | | uca1400_czech_ai_cs | | uca1400_czech_as_ci | | uca1400_czech_as_cs | | uca1400_czech_nopad_ai_ci | | uca1400_czech_nopad_ai_cs | | uca1400_czech_nopad_as_ci | | uca1400_czech_nopad_as_cs | +---------------------------+ {noformat} h2. Naming convention Naming convention for the new collations is described in MDEV-27210. h2. Short collation name The character set name in the collation name will be optional in all syntactic constructs: {code:sql} CREATE TABLE t1 ( a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci, b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci ); {code} Notice, it will be enough to specify just uca1400_as_ci without having to type the full names: - utf8mb4_uca1400_as_ci for the column "a", or - utf16_uca1400_as_ci for the column "b". The full name will be detected automatically according to the effective character set. |
In order to make utf8mb4 the default character set (see utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other. In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table. Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old. Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version. h2. Character sets New collations will be added for these Unicode character sets - utf8mb3 - utf8mb4 - ucs2 - utf16 - utf32 h2. Built-in contractions New collations will include built-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt There are 939 built-in contractions in UCA-14.0.0. To mention a few of them: {noformat} 0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI> 0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI> 0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI> 0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI> 0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI> {noformat} Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built-in contractions. This will be a step towards better Unicode Collation Algorithm compliance. With built-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation. h2. Performance The patch adding UCA-14.0.0 collations will be pushed together with these patches improving the performance: - - h2. Customization flags Under terms of this task we'll add support for the following flags - pad/nopad - accent sensitivity - case sensititivy Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available: {noformat} +---------------------------+ | collation_name | +---------------------------+ | uca1400_czech_ai_ci | | uca1400_czech_ai_cs | | uca1400_czech_as_ci | | uca1400_czech_as_cs | | uca1400_czech_nopad_ai_ci | | uca1400_czech_nopad_ai_cs | | uca1400_czech_nopad_as_ci | | uca1400_czech_nopad_as_cs | +---------------------------+ {noformat} h2. Naming convention Naming convention for the new collations is described in MDEV-27210. h2. Short collation name The character set name in the collation name will be optional in all syntactic constructs: {code:sql} CREATE TABLE t1 ( a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci, b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci ); {code} Notice, it will be enough to specify just uca1400_as_ci without having to type the full names: - utf8mb4_uca1400_as_ci for the column "a", or - utf16_uca1400_as_ci for the column "b". The full name will be detected automatically according to the effective character set. |
Description |
In order to make utf8mb4 the default character set (see utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other. In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table. Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old. Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version. h2. Character sets New collations will be added for these Unicode character sets - utf8mb3 - utf8mb4 - ucs2 - utf16 - utf32 h2. Built-in contractions New collations will include built-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt There are 939 built-in contractions in UCA-14.0.0. To mention a few of them: {noformat} 0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI> 0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI> 0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI> 0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI> 0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI> {noformat} Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built-in contractions. This will be a step towards better Unicode Collation Algorithm compliance. With built-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation. h2. Performance The patch adding UCA-14.0.0 collations will be pushed together with these patches improving the performance: - - h2. Customization flags Under terms of this task we'll add support for the following flags - pad/nopad - accent sensitivity - case sensititivy Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available: {noformat} +---------------------------+ | collation_name | +---------------------------+ | uca1400_czech_ai_ci | | uca1400_czech_ai_cs | | uca1400_czech_as_ci | | uca1400_czech_as_cs | | uca1400_czech_nopad_ai_ci | | uca1400_czech_nopad_ai_cs | | uca1400_czech_nopad_as_ci | | uca1400_czech_nopad_as_cs | +---------------------------+ {noformat} h2. Naming convention Naming convention for the new collations is described in MDEV-27210. h2. Short collation name The character set name in the collation name will be optional in all syntactic constructs: {code:sql} CREATE TABLE t1 ( a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci, b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci ); {code} Notice, it will be enough to specify just uca1400_as_ci without having to type the full names: - utf8mb4_uca1400_as_ci for the column "a", or - utf16_uca1400_as_ci for the column "b". The full name will be detected automatically according to the effective character set. |
In order to make utf8mb4 the default character set (see utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other. In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table. Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old. Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version. h2. Character sets New collations will be added for these Unicode character sets - utf8mb3 - utf8mb4 - ucs2 - utf16 - utf32 h2. Built-in contractions New collations will include built-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt There are 939 built-in contractions in UCA-14.0.0. To mention a few of them: {noformat} 0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI> 0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI> 0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI> 0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI> 0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI> {noformat} Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built-in contractions. This will be a step towards better Unicode Collation Algorithm compliance. With built-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation. h2. Performance The patch adding UCA-14.0.0 collations will be pushed together with these patches improving the performance: - - h2. Customization flags Under terms of this task we'll add support for the following flags - pad/nopad - accent sensitivity - case sensititivy Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available: {noformat} +---------------------------+ | collation_name | +---------------------------+ | uca1400_czech_ai_ci | | uca1400_czech_ai_cs | | uca1400_czech_as_ci | | uca1400_czech_as_cs | | uca1400_czech_nopad_ai_ci | | uca1400_czech_nopad_ai_cs | | uca1400_czech_nopad_as_ci | | uca1400_czech_nopad_as_cs | +---------------------------+ {noformat} h2. Naming convention Naming convention for the new collations is described in MDEV-27210. h2. Short collation name The character set name in the collation name will be optional in all syntactic constructs: {code:sql} CREATE TABLE t1 ( a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci, b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci ); {code} Notice, it will be enough to specify just uca1400_as_ci without having to type the full names: - utf8mb4_uca1400_as_ci for the column "a", or - utf16_uca1400_as_ci for the column "b". The full name will be detected automatically according to the effective character set. h2. INFORMATION_SCHEMA.COLLATIONS changes The following columns in INFORMATION_SCHEMA.COLLATIONS will be changed to be NULL-able: - CHARACTER_SET_NAME - ID - IS_DEFAULT All new UCA1400 collations will be displayed as follows: - The COLLATION_NAME column will display the short name, without character set prefix. - The three NULL-able column mentioned above will display NULL For old collations nothing will change. For example, the output from this query: {code:sql} SELECT * FROM INFORMATION_SCHEMA.COLLATIONS WHERE COLLATION_NAME LIKE '%czech%'; {code} will look like this: {noformat} +---------------------------+--------------------+------+------------+-------------+---------+ | COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN | +---------------------------+--------------------+------+------------+-------------+---------+ | latin2_czech_cs | latin2 | 2 | | Yes | 4 | | cp1250_czech_cs | cp1250 | 34 | | Yes | 2 | | utf8mb3_czech_ci | utf8mb3 | 202 | | Yes | 8 | | ucs2_czech_ci | ucs2 | 138 | | Yes | 8 | | utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 | | uca1400_czech_ai_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_ai_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_as_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_as_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_ai_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_ai_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_as_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_as_cs | NULL | NULL | NULL | Yes | 8 | | utf16_czech_ci | utf16 | 111 | | Yes | 8 | | utf32_czech_ci | utf32 | 170 | | Yes | 8 | +---------------------------+--------------------+------+------------+-------------+---------+ {noformat} The idea is that short collation names will be applicable to multiple character sets. We'll eventually remove columns CHARACTER_SET_NAME, ID, IS_DEFAULT from the table COLLATIONS. We'll keep them for now only for easier upgrades. |
Description |
In order to make utf8mb4 the default character set (see utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other. In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table. Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old. Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version. h2. Character sets New collations will be added for these Unicode character sets - utf8mb3 - utf8mb4 - ucs2 - utf16 - utf32 h2. Built-in contractions New collations will include built-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt There are 939 built-in contractions in UCA-14.0.0. To mention a few of them: {noformat} 0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI> 0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI> 0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI> 0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI> 0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI> {noformat} Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built-in contractions. This will be a step towards better Unicode Collation Algorithm compliance. With built-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation. h2. Performance The patch adding UCA-14.0.0 collations will be pushed together with these patches improving the performance: - - h2. Customization flags Under terms of this task we'll add support for the following flags - pad/nopad - accent sensitivity - case sensititivy Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available: {noformat} +---------------------------+ | collation_name | +---------------------------+ | uca1400_czech_ai_ci | | uca1400_czech_ai_cs | | uca1400_czech_as_ci | | uca1400_czech_as_cs | | uca1400_czech_nopad_ai_ci | | uca1400_czech_nopad_ai_cs | | uca1400_czech_nopad_as_ci | | uca1400_czech_nopad_as_cs | +---------------------------+ {noformat} h2. Naming convention Naming convention for the new collations is described in MDEV-27210. h2. Short collation name The character set name in the collation name will be optional in all syntactic constructs: {code:sql} CREATE TABLE t1 ( a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci, b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci ); {code} Notice, it will be enough to specify just uca1400_as_ci without having to type the full names: - utf8mb4_uca1400_as_ci for the column "a", or - utf16_uca1400_as_ci for the column "b". The full name will be detected automatically according to the effective character set. h2. INFORMATION_SCHEMA.COLLATIONS changes The following columns in INFORMATION_SCHEMA.COLLATIONS will be changed to be NULL-able: - CHARACTER_SET_NAME - ID - IS_DEFAULT All new UCA1400 collations will be displayed as follows: - The COLLATION_NAME column will display the short name, without character set prefix. - The three NULL-able column mentioned above will display NULL For old collations nothing will change. For example, the output from this query: {code:sql} SELECT * FROM INFORMATION_SCHEMA.COLLATIONS WHERE COLLATION_NAME LIKE '%czech%'; {code} will look like this: {noformat} +---------------------------+--------------------+------+------------+-------------+---------+ | COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN | +---------------------------+--------------------+------+------------+-------------+---------+ | latin2_czech_cs | latin2 | 2 | | Yes | 4 | | cp1250_czech_cs | cp1250 | 34 | | Yes | 2 | | utf8mb3_czech_ci | utf8mb3 | 202 | | Yes | 8 | | ucs2_czech_ci | ucs2 | 138 | | Yes | 8 | | utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 | | uca1400_czech_ai_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_ai_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_as_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_as_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_ai_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_ai_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_as_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_as_cs | NULL | NULL | NULL | Yes | 8 | | utf16_czech_ci | utf16 | 111 | | Yes | 8 | | utf32_czech_ci | utf32 | 170 | | Yes | 8 | +---------------------------+--------------------+------+------------+-------------+---------+ {noformat} The idea is that short collation names will be applicable to multiple character sets. We'll eventually remove columns CHARACTER_SET_NAME, ID, IS_DEFAULT from the table COLLATIONS. We'll keep them for now only for easier upgrades. |
In order to make utf8mb4 the default character set (see utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other. In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table. Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old. Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version. h2. Character sets New collations will be added for these Unicode character sets - utf8mb3 - utf8mb4 - ucs2 - utf16 - utf32 h2. Built-in contractions New collations will include built-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt There are 939 built-in contractions in UCA-14.0.0. To mention a few of them: {noformat} 0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI> 0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI> 0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI> 0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI> 0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI> {noformat} Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built-in contractions. This will be a step towards better Unicode Collation Algorithm compliance. With built-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation. h2. Performance The patch adding UCA-14.0.0 collations will be pushed together with these patches improving the performance: - - h2. Customization flags Under terms of this task we'll add support for the following flags - pad/nopad - accent sensitivity - case sensititivy Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available: {noformat} +---------------------------+ | collation_name | +---------------------------+ | uca1400_czech_ai_ci | | uca1400_czech_ai_cs | | uca1400_czech_as_ci | | uca1400_czech_as_cs | | uca1400_czech_nopad_ai_ci | | uca1400_czech_nopad_ai_cs | | uca1400_czech_nopad_as_ci | | uca1400_czech_nopad_as_cs | +---------------------------+ {noformat} h2. Naming convention Naming convention for the new collations is described in MDEV-27210. h2. Short collation name The character set name in the collation name will be optional in all syntactic constructs: {code:sql} CREATE TABLE t1 ( a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci, b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci ); {code} Notice, it will be enough to specify just uca1400_as_ci without having to type the full names: - utf8mb4_uca1400_as_ci for the column "a", or - utf16_uca1400_as_ci for the column "b". The full name will be detected automatically according to the effective character set. h2. SHOW CREATE statements {{SHOW CREATE}} statements will display long collations names. For example, {{SHOW CREATE}} for the table created using the CREATE TABLE statement will produce the following output: {code:sql} SHOW CREATE TABLE t1; {code} {noformat} +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Table | Create Table | +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | t1 | CREATE TABLE `t1` ( `a` char(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_as_ci DEFAULT NULL, `b` char(10) CHARACTER SET utf16 COLLATE utf16_uca1400_as_ci DEFAULT NULL ) ENGINE=InnoDB DEFAULT CHARSET=latin1 | +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ {noformat} Notice, the COLLATE clause will contain the full collation name, including the character set prefix. But this will possibly change in later versions to display short collation names. h2. INFORMATION_SCHEMA.COLLATIONS changes The following columns in INFORMATION_SCHEMA.COLLATIONS will be changed to be NULL-able: - CHARACTER_SET_NAME - ID - IS_DEFAULT All new UCA1400 collations will be displayed as follows: - The COLLATION_NAME column will display the short name, without character set prefix. - The three NULL-able column mentioned above will display NULL For old collations nothing will change. For example, the output from this query: {code:sql} SELECT * FROM INFORMATION_SCHEMA.COLLATIONS WHERE COLLATION_NAME LIKE '%czech%'; {code} will look like this: {noformat} +---------------------------+--------------------+------+------------+-------------+---------+ | COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN | +---------------------------+--------------------+------+------------+-------------+---------+ | latin2_czech_cs | latin2 | 2 | | Yes | 4 | | cp1250_czech_cs | cp1250 | 34 | | Yes | 2 | | utf8mb3_czech_ci | utf8mb3 | 202 | | Yes | 8 | | ucs2_czech_ci | ucs2 | 138 | | Yes | 8 | | utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 | | uca1400_czech_ai_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_ai_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_as_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_as_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_ai_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_ai_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_as_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_as_cs | NULL | NULL | NULL | Yes | 8 | | utf16_czech_ci | utf16 | 111 | | Yes | 8 | | utf32_czech_ci | utf32 | 170 | | Yes | 8 | +---------------------------+--------------------+------+------------+-------------+---------+ {noformat} The idea is that short collation names will be applicable to multiple character sets. We'll eventually remove columns CHARACTER_SET_NAME, ID, IS_DEFAULT from the table COLLATIONS. We'll keep them for now only for easier upgrades. |
Description |
In order to make utf8mb4 the default character set (see utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other. In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table. Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old. Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version. h2. Character sets New collations will be added for these Unicode character sets - utf8mb3 - utf8mb4 - ucs2 - utf16 - utf32 h2. Built-in contractions New collations will include built-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt There are 939 built-in contractions in UCA-14.0.0. To mention a few of them: {noformat} 0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI> 0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI> 0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI> 0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI> 0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI> {noformat} Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built-in contractions. This will be a step towards better Unicode Collation Algorithm compliance. With built-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation. h2. Performance The patch adding UCA-14.0.0 collations will be pushed together with these patches improving the performance: - - h2. Customization flags Under terms of this task we'll add support for the following flags - pad/nopad - accent sensitivity - case sensititivy Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available: {noformat} +---------------------------+ | collation_name | +---------------------------+ | uca1400_czech_ai_ci | | uca1400_czech_ai_cs | | uca1400_czech_as_ci | | uca1400_czech_as_cs | | uca1400_czech_nopad_ai_ci | | uca1400_czech_nopad_ai_cs | | uca1400_czech_nopad_as_ci | | uca1400_czech_nopad_as_cs | +---------------------------+ {noformat} h2. Naming convention Naming convention for the new collations is described in MDEV-27210. h2. Short collation name The character set name in the collation name will be optional in all syntactic constructs: {code:sql} CREATE TABLE t1 ( a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci, b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci ); {code} Notice, it will be enough to specify just uca1400_as_ci without having to type the full names: - utf8mb4_uca1400_as_ci for the column "a", or - utf16_uca1400_as_ci for the column "b". The full name will be detected automatically according to the effective character set. h2. SHOW CREATE statements {{SHOW CREATE}} statements will display long collations names. For example, {{SHOW CREATE}} for the table created using the CREATE TABLE statement will produce the following output: {code:sql} SHOW CREATE TABLE t1; {code} {noformat} +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Table | Create Table | +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | t1 | CREATE TABLE `t1` ( `a` char(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_as_ci DEFAULT NULL, `b` char(10) CHARACTER SET utf16 COLLATE utf16_uca1400_as_ci DEFAULT NULL ) ENGINE=InnoDB DEFAULT CHARSET=latin1 | +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ {noformat} Notice, the COLLATE clause will contain the full collation name, including the character set prefix. But this will possibly change in later versions to display short collation names. h2. INFORMATION_SCHEMA.COLLATIONS changes The following columns in INFORMATION_SCHEMA.COLLATIONS will be changed to be NULL-able: - CHARACTER_SET_NAME - ID - IS_DEFAULT All new UCA1400 collations will be displayed as follows: - The COLLATION_NAME column will display the short name, without character set prefix. - The three NULL-able column mentioned above will display NULL For old collations nothing will change. For example, the output from this query: {code:sql} SELECT * FROM INFORMATION_SCHEMA.COLLATIONS WHERE COLLATION_NAME LIKE '%czech%'; {code} will look like this: {noformat} +---------------------------+--------------------+------+------------+-------------+---------+ | COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN | +---------------------------+--------------------+------+------------+-------------+---------+ | latin2_czech_cs | latin2 | 2 | | Yes | 4 | | cp1250_czech_cs | cp1250 | 34 | | Yes | 2 | | utf8mb3_czech_ci | utf8mb3 | 202 | | Yes | 8 | | ucs2_czech_ci | ucs2 | 138 | | Yes | 8 | | utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 | | uca1400_czech_ai_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_ai_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_as_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_as_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_ai_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_ai_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_as_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_as_cs | NULL | NULL | NULL | Yes | 8 | | utf16_czech_ci | utf16 | 111 | | Yes | 8 | | utf32_czech_ci | utf32 | 170 | | Yes | 8 | +---------------------------+--------------------+------+------------+-------------+---------+ {noformat} The idea is that short collation names will be applicable to multiple character sets. We'll eventually remove columns CHARACTER_SET_NAME, ID, IS_DEFAULT from the table COLLATIONS. We'll keep them for now only for easier upgrades. |
In order to make utf8mb4 the default character set (see utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other. In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table. Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old. Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version. h2. Character sets New collations will be added for these Unicode character sets - utf8mb3 - utf8mb4 - ucs2 - utf16 - utf32 h2. Built-in contractions New collations will include built-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt There are 939 built-in contractions in UCA-14.0.0. To mention a few of them: {noformat} 0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI> 0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI> 0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI> 0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI> 0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI> {noformat} Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built-in contractions. This will be a step towards better Unicode Collation Algorithm compliance. With built-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation. h2. Performance The patch adding UCA-14.0.0 collations will be pushed together with these patches improving the performance: - - h2. Customization flags Under terms of this task we'll add support for the following flags - pad/nopad - accent sensitivity - case sensititivy Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available: {noformat} +---------------------------+ | collation_name | +---------------------------+ | uca1400_czech_ai_ci | | uca1400_czech_ai_cs | | uca1400_czech_as_ci | | uca1400_czech_as_cs | | uca1400_czech_nopad_ai_ci | | uca1400_czech_nopad_ai_cs | | uca1400_czech_nopad_as_ci | | uca1400_czech_nopad_as_cs | +---------------------------+ {noformat} h2. Naming convention Naming convention for the new collations is described in MDEV-27210. h2. Short collation name The character set name in the collation name will be optional in all syntactic constructs: {code:sql} CREATE TABLE t1 ( a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci, b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci ); {code} Notice, it will be enough to specify just uca1400_as_ci without having to type the full names: - utf8mb4_uca1400_as_ci for the column "a", or - utf16_uca1400_as_ci for the column "b". The full name will be detected automatically according to the effective character set. h2. SHOW CREATE statements {{SHOW CREATE}} statements will display long collations names. For example, {{SHOW CREATE}} for the table created using the CREATE TABLE statement will produce the following output: {code:sql} SHOW CREATE TABLE t1; {code} {noformat} +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Table | Create Table | +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | t1 | CREATE TABLE `t1` ( `a` char(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_as_ci DEFAULT NULL, `b` char(10) CHARACTER SET utf16 COLLATE utf16_uca1400_as_ci DEFAULT NULL ) ENGINE=InnoDB DEFAULT CHARSET=latin1 | +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ {noformat} Notice, the COLLATE clause will contain the full collation name, including the character set prefix. But this will possibly change in later versions to display short collation names. h2. INFORMATION_SCHEMA.COLLATIONS changes The following columns in INFORMATION_SCHEMA.COLLATIONS will be changed to be NULL-able: - CHARACTER_SET_NAME - ID - IS_DEFAULT All new UCA1400 collations will be displayed as follows: - The COLLATION_NAME column will display the short name, without character set prefix. - The three NULL-able column mentioned above will display NULL For old collations nothing will change. For example, the output from this query: {code:sql} SELECT * FROM INFORMATION_SCHEMA.COLLATIONS WHERE COLLATION_NAME LIKE '%czech%'; {code} will look like this: {noformat} +---------------------------+--------------------+------+------------+-------------+---------+ | COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN | +---------------------------+--------------------+------+------------+-------------+---------+ | latin2_czech_cs | latin2 | 2 | | Yes | 4 | | cp1250_czech_cs | cp1250 | 34 | | Yes | 2 | | utf8mb3_czech_ci | utf8mb3 | 202 | | Yes | 8 | | ucs2_czech_ci | ucs2 | 138 | | Yes | 8 | | utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 | | uca1400_czech_ai_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_ai_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_as_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_as_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_ai_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_ai_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_as_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_as_cs | NULL | NULL | NULL | Yes | 8 | | utf16_czech_ci | utf16 | 111 | | Yes | 8 | | utf32_czech_ci | utf32 | 170 | | Yes | 8 | +---------------------------+--------------------+------+------------+-------------+---------+ {noformat} The idea is that short collation names will be applicable to multiple character sets. We'll eventually remove columns CHARACTER_SET_NAME, ID, IS_DEFAULT from the table COLLATIONS. We'll keep them for now only for easier upgrades. h2. INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY changes In addition to columns COLLATION_NAME and CHARACTER_SET_NAME currently existing in the table INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY, three new columns will be added: - FULL_COLLATION_NAME - ID - IS_DEFAULT So the new structure for INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY will look like this: {code:sql} CREATE TEMPORARY TABLE `COLLATION_CHARACTER_SET_APPLICABILITY` ( `COLLATION_NAME` varchar(64) NOT NULL, `CHARACTER_SET_NAME` varchar(32) NOT NULL, `FULL_COLLATION_NAME` varchar(64) NOT NULL, `ID` bigint(11) NOT NULL, `IS_DEFAULT` varchar(3) NOT NULL ) ENGINE=MEMORY DEFAULT CHARSET=utf8mb3; {code} The column COLLATION_NAME will display: - short names, without the character set name prefix, for new UCA1400 collations. - long names for all old collations So for example, the output from this query: {code:sql} SELECT * FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY WHERE COLLATION_NAME LIKE '%czech%cs'; {code} will looks about like this: {noformat} +---------------------------+--------------------+-----------------------------------+------+------------+ | COLLATION_NAME | CHARACTER_SET_NAME | FULL_COLLATION_NAME | ID | IS_DEFAULT | +---------------------------+--------------------+-----------------------------------+------+------------+ | latin2_czech_cs | latin2 | latin2_czech_cs | 2 | | | cp1250_czech_cs | cp1250 | cp1250_czech_cs | 34 | | | uca1400_czech_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_ai_cs | 2129 | | | uca1400_czech_as_cs | utf8mb3 | utf8mb3_uca1400_czech_as_cs | 2131 | | | uca1400_czech_nopad_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_ai_cs | 2133 | | | uca1400_czech_nopad_as_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_as_cs | 2135 | | | uca1400_czech_ai_cs | ucs2 | ucs2_uca1400_czech_ai_cs | 2641 | | | uca1400_czech_as_cs | ucs2 | ucs2_uca1400_czech_as_cs | 2643 | | | uca1400_czech_nopad_ai_cs | ucs2 | ucs2_uca1400_czech_nopad_ai_cs | 2645 | | | uca1400_czech_nopad_as_cs | ucs2 | ucs2_uca1400_czech_nopad_as_cs | 2647 | | | uca1400_czech_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_ai_cs | 2385 | | | uca1400_czech_as_cs | utf8mb4 | utf8mb4_uca1400_czech_as_cs | 2387 | | | uca1400_czech_nopad_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_ai_cs | 2389 | | | uca1400_czech_nopad_as_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_as_cs | 2391 | | | uca1400_czech_ai_cs | utf16 | utf16_uca1400_czech_ai_cs | 2897 | | | uca1400_czech_as_cs | utf16 | utf16_uca1400_czech_as_cs | 2899 | | | uca1400_czech_nopad_ai_cs | utf16 | utf16_uca1400_czech_nopad_ai_cs | 2901 | | | uca1400_czech_nopad_as_cs | utf16 | utf16_uca1400_czech_nopad_as_cs | 2903 | | | uca1400_czech_ai_cs | utf32 | utf32_uca1400_czech_ai_cs | 3153 | | | uca1400_czech_as_cs | utf32 | utf32_uca1400_czech_as_cs | 3155 | | | uca1400_czech_nopad_ai_cs | utf32 | utf32_uca1400_czech_nopad_ai_cs | 3157 | | | uca1400_czech_nopad_as_cs | utf32 | utf32_uca1400_czech_nopad_as_cs | 3159 | | +---------------------------+--------------------+-----------------------------------+------+------------+ {noformat} |
Description |
In order to make utf8mb4 the default character set (see utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other. In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table. Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old. Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version. h2. Character sets New collations will be added for these Unicode character sets - utf8mb3 - utf8mb4 - ucs2 - utf16 - utf32 h2. Built-in contractions New collations will include built-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt There are 939 built-in contractions in UCA-14.0.0. To mention a few of them: {noformat} 0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI> 0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI> 0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI> 0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI> 0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI> {noformat} Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built-in contractions. This will be a step towards better Unicode Collation Algorithm compliance. With built-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation. h2. Performance The patch adding UCA-14.0.0 collations will be pushed together with these patches improving the performance: - - h2. Customization flags Under terms of this task we'll add support for the following flags - pad/nopad - accent sensitivity - case sensititivy Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available: {noformat} +---------------------------+ | collation_name | +---------------------------+ | uca1400_czech_ai_ci | | uca1400_czech_ai_cs | | uca1400_czech_as_ci | | uca1400_czech_as_cs | | uca1400_czech_nopad_ai_ci | | uca1400_czech_nopad_ai_cs | | uca1400_czech_nopad_as_ci | | uca1400_czech_nopad_as_cs | +---------------------------+ {noformat} h2. Naming convention Naming convention for the new collations is described in MDEV-27210. h2. Short collation name The character set name in the collation name will be optional in all syntactic constructs: {code:sql} CREATE TABLE t1 ( a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci, b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci ); {code} Notice, it will be enough to specify just uca1400_as_ci without having to type the full names: - utf8mb4_uca1400_as_ci for the column "a", or - utf16_uca1400_as_ci for the column "b". The full name will be detected automatically according to the effective character set. h2. SHOW CREATE statements {{SHOW CREATE}} statements will display long collations names. For example, {{SHOW CREATE}} for the table created using the CREATE TABLE statement will produce the following output: {code:sql} SHOW CREATE TABLE t1; {code} {noformat} +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Table | Create Table | +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | t1 | CREATE TABLE `t1` ( `a` char(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_as_ci DEFAULT NULL, `b` char(10) CHARACTER SET utf16 COLLATE utf16_uca1400_as_ci DEFAULT NULL ) ENGINE=InnoDB DEFAULT CHARSET=latin1 | +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ {noformat} Notice, the COLLATE clause will contain the full collation name, including the character set prefix. But this will possibly change in later versions to display short collation names. h2. INFORMATION_SCHEMA.COLLATIONS changes The following columns in INFORMATION_SCHEMA.COLLATIONS will be changed to be NULL-able: - CHARACTER_SET_NAME - ID - IS_DEFAULT All new UCA1400 collations will be displayed as follows: - The COLLATION_NAME column will display the short name, without character set prefix. - The three NULL-able column mentioned above will display NULL For old collations nothing will change. For example, the output from this query: {code:sql} SELECT * FROM INFORMATION_SCHEMA.COLLATIONS WHERE COLLATION_NAME LIKE '%czech%'; {code} will look like this: {noformat} +---------------------------+--------------------+------+------------+-------------+---------+ | COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN | +---------------------------+--------------------+------+------------+-------------+---------+ | latin2_czech_cs | latin2 | 2 | | Yes | 4 | | cp1250_czech_cs | cp1250 | 34 | | Yes | 2 | | utf8mb3_czech_ci | utf8mb3 | 202 | | Yes | 8 | | ucs2_czech_ci | ucs2 | 138 | | Yes | 8 | | utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 | | uca1400_czech_ai_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_ai_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_as_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_as_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_ai_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_ai_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_as_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_as_cs | NULL | NULL | NULL | Yes | 8 | | utf16_czech_ci | utf16 | 111 | | Yes | 8 | | utf32_czech_ci | utf32 | 170 | | Yes | 8 | +---------------------------+--------------------+------+------------+-------------+---------+ {noformat} The idea is that short collation names will be applicable to multiple character sets. We'll eventually remove columns CHARACTER_SET_NAME, ID, IS_DEFAULT from the table COLLATIONS. We'll keep them for now only for easier upgrades. h2. INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY changes In addition to columns COLLATION_NAME and CHARACTER_SET_NAME currently existing in the table INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY, three new columns will be added: - FULL_COLLATION_NAME - ID - IS_DEFAULT So the new structure for INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY will look like this: {code:sql} CREATE TEMPORARY TABLE `COLLATION_CHARACTER_SET_APPLICABILITY` ( `COLLATION_NAME` varchar(64) NOT NULL, `CHARACTER_SET_NAME` varchar(32) NOT NULL, `FULL_COLLATION_NAME` varchar(64) NOT NULL, `ID` bigint(11) NOT NULL, `IS_DEFAULT` varchar(3) NOT NULL ) ENGINE=MEMORY DEFAULT CHARSET=utf8mb3; {code} The column COLLATION_NAME will display: - short names, without the character set name prefix, for new UCA1400 collations. - long names for all old collations So for example, the output from this query: {code:sql} SELECT * FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY WHERE COLLATION_NAME LIKE '%czech%cs'; {code} will looks about like this: {noformat} +---------------------------+--------------------+-----------------------------------+------+------------+ | COLLATION_NAME | CHARACTER_SET_NAME | FULL_COLLATION_NAME | ID | IS_DEFAULT | +---------------------------+--------------------+-----------------------------------+------+------------+ | latin2_czech_cs | latin2 | latin2_czech_cs | 2 | | | cp1250_czech_cs | cp1250 | cp1250_czech_cs | 34 | | | uca1400_czech_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_ai_cs | 2129 | | | uca1400_czech_as_cs | utf8mb3 | utf8mb3_uca1400_czech_as_cs | 2131 | | | uca1400_czech_nopad_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_ai_cs | 2133 | | | uca1400_czech_nopad_as_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_as_cs | 2135 | | | uca1400_czech_ai_cs | ucs2 | ucs2_uca1400_czech_ai_cs | 2641 | | | uca1400_czech_as_cs | ucs2 | ucs2_uca1400_czech_as_cs | 2643 | | | uca1400_czech_nopad_ai_cs | ucs2 | ucs2_uca1400_czech_nopad_ai_cs | 2645 | | | uca1400_czech_nopad_as_cs | ucs2 | ucs2_uca1400_czech_nopad_as_cs | 2647 | | | uca1400_czech_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_ai_cs | 2385 | | | uca1400_czech_as_cs | utf8mb4 | utf8mb4_uca1400_czech_as_cs | 2387 | | | uca1400_czech_nopad_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_ai_cs | 2389 | | | uca1400_czech_nopad_as_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_as_cs | 2391 | | | uca1400_czech_ai_cs | utf16 | utf16_uca1400_czech_ai_cs | 2897 | | | uca1400_czech_as_cs | utf16 | utf16_uca1400_czech_as_cs | 2899 | | | uca1400_czech_nopad_ai_cs | utf16 | utf16_uca1400_czech_nopad_ai_cs | 2901 | | | uca1400_czech_nopad_as_cs | utf16 | utf16_uca1400_czech_nopad_as_cs | 2903 | | | uca1400_czech_ai_cs | utf32 | utf32_uca1400_czech_ai_cs | 3153 | | | uca1400_czech_as_cs | utf32 | utf32_uca1400_czech_as_cs | 3155 | | | uca1400_czech_nopad_ai_cs | utf32 | utf32_uca1400_czech_nopad_ai_cs | 3157 | | | uca1400_czech_nopad_as_cs | utf32 | utf32_uca1400_czech_nopad_as_cs | 3159 | | +---------------------------+--------------------+-----------------------------------+------+------------+ {noformat} |
In order to make utf8mb4 the default character set (see utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other. In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table. Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old. Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version. h2. Character sets New collations will be added for these Unicode character sets - utf8mb3 - utf8mb4 - ucs2 - utf16 - utf32 h2. Built-in contractions New collations will include built-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt There are 939 built-in contractions in UCA-14.0.0. To mention a few of them: {noformat} 0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI> 0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI> 0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI> 0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI> 0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI> {noformat} Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built-in contractions. This will be a step towards better Unicode Collation Algorithm compliance. With built-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation. h2. Performance The patch adding UCA-14.0.0 collations will be pushed together with these patches improving the performance: - - h2. Customization flags Under terms of this task we'll add support for the following flags - pad/nopad - accent sensitivity - case sensititivy Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available: {noformat} +---------------------------+ | collation_name | +---------------------------+ | uca1400_czech_ai_ci | | uca1400_czech_ai_cs | | uca1400_czech_as_ci | | uca1400_czech_as_cs | | uca1400_czech_nopad_ai_ci | | uca1400_czech_nopad_ai_cs | | uca1400_czech_nopad_as_ci | | uca1400_czech_nopad_as_cs | +---------------------------+ {noformat} h2. Naming convention Naming convention for the new collations is described in MDEV-27210. h2. Short collation name The character set name in the collation name will be optional in all syntactic constructs: {code:sql} CREATE TABLE t1 ( a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci, b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci ); {code} Notice, it will be enough to specify just uca1400_as_ci without having to type the full names: - utf8mb4_uca1400_as_ci for the column "a", or - utf16_uca1400_as_ci for the column "b". The full name will be detected automatically according to the effective character set. h2. SHOW CREATE statements {{SHOW CREATE}} statements will display long collations names. For example, {{SHOW CREATE}} for the table created using the CREATE TABLE statement will produce the following output: {code:sql} SHOW CREATE TABLE t1; {code} {noformat} +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Table | Create Table | +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | t1 | CREATE TABLE `t1` ( `a` char(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_as_ci DEFAULT NULL, `b` char(10) CHARACTER SET utf16 COLLATE utf16_uca1400_as_ci DEFAULT NULL ) ENGINE=InnoDB DEFAULT CHARSET=latin1 | +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ {noformat} Notice, the COLLATE clause will contain the full collation name, including the character set prefix. But this will possibly change in later versions to display short collation names. h2. INFORMATION_SCHEMA.COLLATIONS changes The following columns in INFORMATION_SCHEMA.COLLATIONS will be changed to be NULL-able: - CHARACTER_SET_NAME - ID - IS_DEFAULT All new UCA1400 collations will be displayed as follows: - The COLLATION_NAME column will display the short name, without character set prefix. - The three NULL-able column mentioned above will display NULL For old collations nothing will change. For example, the output from this query: {code:sql} SELECT * FROM INFORMATION_SCHEMA.COLLATIONS WHERE COLLATION_NAME LIKE '%czech%'; {code} will look like this: {noformat} +---------------------------+--------------------+------+------------+-------------+---------+ | COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN | +---------------------------+--------------------+------+------------+-------------+---------+ | latin2_czech_cs | latin2 | 2 | | Yes | 4 | | cp1250_czech_cs | cp1250 | 34 | | Yes | 2 | | utf8mb3_czech_ci | utf8mb3 | 202 | | Yes | 8 | | ucs2_czech_ci | ucs2 | 138 | | Yes | 8 | | utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 | | uca1400_czech_ai_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_ai_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_as_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_as_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_ai_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_ai_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_as_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_as_cs | NULL | NULL | NULL | Yes | 8 | | utf16_czech_ci | utf16 | 111 | | Yes | 8 | | utf32_czech_ci | utf32 | 170 | | Yes | 8 | +---------------------------+--------------------+------+------------+-------------+---------+ {noformat} The idea is that short collation names will be applicable to multiple character sets. We'll eventually remove columns CHARACTER_SET_NAME, ID, IS_DEFAULT from the table COLLATIONS. We'll keep them for now only for easier upgrades. h2. INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY changes In addition to columns COLLATION_NAME and CHARACTER_SET_NAME currently existing in the table INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY, three new columns will be added: - FULL_COLLATION_NAME - ID - IS_DEFAULT So the new structure for INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY will look like this: {code:sql} CREATE TEMPORARY TABLE `COLLATION_CHARACTER_SET_APPLICABILITY` ( `COLLATION_NAME` varchar(64) NOT NULL, `CHARACTER_SET_NAME` varchar(32) NOT NULL, `FULL_COLLATION_NAME` varchar(64) NOT NULL, `ID` bigint(11) NOT NULL, `IS_DEFAULT` varchar(3) NOT NULL ) ENGINE=MEMORY DEFAULT CHARSET=utf8mb3; {code} The column COLLATION_NAME will display: - short names, without the character set name prefix, for new UCA1400 collations. - long names for all old collations So for example, the output from this query: {code:sql} SELECT * FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY WHERE COLLATION_NAME LIKE '%czech%cs'; {code} will look about like this: {noformat} +---------------------------+--------------------+-----------------------------------+------+------------+ | COLLATION_NAME | CHARACTER_SET_NAME | FULL_COLLATION_NAME | ID | IS_DEFAULT | +---------------------------+--------------------+-----------------------------------+------+------------+ | latin2_czech_cs | latin2 | latin2_czech_cs | 2 | | | cp1250_czech_cs | cp1250 | cp1250_czech_cs | 34 | | | uca1400_czech_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_ai_cs | 2129 | | | uca1400_czech_as_cs | utf8mb3 | utf8mb3_uca1400_czech_as_cs | 2131 | | | uca1400_czech_nopad_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_ai_cs | 2133 | | | uca1400_czech_nopad_as_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_as_cs | 2135 | | | uca1400_czech_ai_cs | ucs2 | ucs2_uca1400_czech_ai_cs | 2641 | | | uca1400_czech_as_cs | ucs2 | ucs2_uca1400_czech_as_cs | 2643 | | | uca1400_czech_nopad_ai_cs | ucs2 | ucs2_uca1400_czech_nopad_ai_cs | 2645 | | | uca1400_czech_nopad_as_cs | ucs2 | ucs2_uca1400_czech_nopad_as_cs | 2647 | | | uca1400_czech_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_ai_cs | 2385 | | | uca1400_czech_as_cs | utf8mb4 | utf8mb4_uca1400_czech_as_cs | 2387 | | | uca1400_czech_nopad_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_ai_cs | 2389 | | | uca1400_czech_nopad_as_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_as_cs | 2391 | | | uca1400_czech_ai_cs | utf16 | utf16_uca1400_czech_ai_cs | 2897 | | | uca1400_czech_as_cs | utf16 | utf16_uca1400_czech_as_cs | 2899 | | | uca1400_czech_nopad_ai_cs | utf16 | utf16_uca1400_czech_nopad_ai_cs | 2901 | | | uca1400_czech_nopad_as_cs | utf16 | utf16_uca1400_czech_nopad_as_cs | 2903 | | | uca1400_czech_ai_cs | utf32 | utf32_uca1400_czech_ai_cs | 3153 | | | uca1400_czech_as_cs | utf32 | utf32_uca1400_czech_as_cs | 3155 | | | uca1400_czech_nopad_ai_cs | utf32 | utf32_uca1400_czech_nopad_ai_cs | 3157 | | | uca1400_czech_nopad_as_cs | utf32 | utf32_uca1400_czech_nopad_as_cs | 3159 | | +---------------------------+--------------------+-----------------------------------+------+------------+ {noformat} |
Description |
In order to make utf8mb4 the default character set (see utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other. In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table. Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old. Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version. h2. Character sets New collations will be added for these Unicode character sets - utf8mb3 - utf8mb4 - ucs2 - utf16 - utf32 h2. Built-in contractions New collations will include built-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt There are 939 built-in contractions in UCA-14.0.0. To mention a few of them: {noformat} 0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI> 0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI> 0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI> 0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI> 0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI> {noformat} Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built-in contractions. This will be a step towards better Unicode Collation Algorithm compliance. With built-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation. h2. Performance The patch adding UCA-14.0.0 collations will be pushed together with these patches improving the performance: - - h2. Customization flags Under terms of this task we'll add support for the following flags - pad/nopad - accent sensitivity - case sensititivy Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available: {noformat} +---------------------------+ | collation_name | +---------------------------+ | uca1400_czech_ai_ci | | uca1400_czech_ai_cs | | uca1400_czech_as_ci | | uca1400_czech_as_cs | | uca1400_czech_nopad_ai_ci | | uca1400_czech_nopad_ai_cs | | uca1400_czech_nopad_as_ci | | uca1400_czech_nopad_as_cs | +---------------------------+ {noformat} h2. Naming convention Naming convention for the new collations is described in MDEV-27210. h2. Short collation name The character set name in the collation name will be optional in all syntactic constructs: {code:sql} CREATE TABLE t1 ( a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci, b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci ); {code} Notice, it will be enough to specify just uca1400_as_ci without having to type the full names: - utf8mb4_uca1400_as_ci for the column "a", or - utf16_uca1400_as_ci for the column "b". The full name will be detected automatically according to the effective character set. h2. SHOW CREATE statements {{SHOW CREATE}} statements will display long collations names. For example, {{SHOW CREATE}} for the table created using the CREATE TABLE statement will produce the following output: {code:sql} SHOW CREATE TABLE t1; {code} {noformat} +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Table | Create Table | +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | t1 | CREATE TABLE `t1` ( `a` char(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_as_ci DEFAULT NULL, `b` char(10) CHARACTER SET utf16 COLLATE utf16_uca1400_as_ci DEFAULT NULL ) ENGINE=InnoDB DEFAULT CHARSET=latin1 | +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ {noformat} Notice, the COLLATE clause will contain the full collation name, including the character set prefix. But this will possibly change in later versions to display short collation names. h2. INFORMATION_SCHEMA.COLLATIONS changes The following columns in INFORMATION_SCHEMA.COLLATIONS will be changed to be NULL-able: - CHARACTER_SET_NAME - ID - IS_DEFAULT All new UCA1400 collations will be displayed as follows: - The COLLATION_NAME column will display the short name, without character set prefix. - The three NULL-able column mentioned above will display NULL For old collations nothing will change. For example, the output from this query: {code:sql} SELECT * FROM INFORMATION_SCHEMA.COLLATIONS WHERE COLLATION_NAME LIKE '%czech%'; {code} will look like this: {noformat} +---------------------------+--------------------+------+------------+-------------+---------+ | COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN | +---------------------------+--------------------+------+------------+-------------+---------+ | latin2_czech_cs | latin2 | 2 | | Yes | 4 | | cp1250_czech_cs | cp1250 | 34 | | Yes | 2 | | utf8mb3_czech_ci | utf8mb3 | 202 | | Yes | 8 | | ucs2_czech_ci | ucs2 | 138 | | Yes | 8 | | utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 | | uca1400_czech_ai_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_ai_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_as_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_as_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_ai_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_ai_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_as_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_as_cs | NULL | NULL | NULL | Yes | 8 | | utf16_czech_ci | utf16 | 111 | | Yes | 8 | | utf32_czech_ci | utf32 | 170 | | Yes | 8 | +---------------------------+--------------------+------+------------+-------------+---------+ {noformat} The idea is that short collation names will be applicable to multiple character sets. We'll eventually remove columns CHARACTER_SET_NAME, ID, IS_DEFAULT from the table COLLATIONS. We'll keep them for now only for easier upgrades. h2. INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY changes In addition to columns COLLATION_NAME and CHARACTER_SET_NAME currently existing in the table INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY, three new columns will be added: - FULL_COLLATION_NAME - ID - IS_DEFAULT So the new structure for INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY will look like this: {code:sql} CREATE TEMPORARY TABLE `COLLATION_CHARACTER_SET_APPLICABILITY` ( `COLLATION_NAME` varchar(64) NOT NULL, `CHARACTER_SET_NAME` varchar(32) NOT NULL, `FULL_COLLATION_NAME` varchar(64) NOT NULL, `ID` bigint(11) NOT NULL, `IS_DEFAULT` varchar(3) NOT NULL ) ENGINE=MEMORY DEFAULT CHARSET=utf8mb3; {code} The column COLLATION_NAME will display: - short names, without the character set name prefix, for new UCA1400 collations. - long names for all old collations So for example, the output from this query: {code:sql} SELECT * FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY WHERE COLLATION_NAME LIKE '%czech%cs'; {code} will look about like this: {noformat} +---------------------------+--------------------+-----------------------------------+------+------------+ | COLLATION_NAME | CHARACTER_SET_NAME | FULL_COLLATION_NAME | ID | IS_DEFAULT | +---------------------------+--------------------+-----------------------------------+------+------------+ | latin2_czech_cs | latin2 | latin2_czech_cs | 2 | | | cp1250_czech_cs | cp1250 | cp1250_czech_cs | 34 | | | uca1400_czech_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_ai_cs | 2129 | | | uca1400_czech_as_cs | utf8mb3 | utf8mb3_uca1400_czech_as_cs | 2131 | | | uca1400_czech_nopad_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_ai_cs | 2133 | | | uca1400_czech_nopad_as_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_as_cs | 2135 | | | uca1400_czech_ai_cs | ucs2 | ucs2_uca1400_czech_ai_cs | 2641 | | | uca1400_czech_as_cs | ucs2 | ucs2_uca1400_czech_as_cs | 2643 | | | uca1400_czech_nopad_ai_cs | ucs2 | ucs2_uca1400_czech_nopad_ai_cs | 2645 | | | uca1400_czech_nopad_as_cs | ucs2 | ucs2_uca1400_czech_nopad_as_cs | 2647 | | | uca1400_czech_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_ai_cs | 2385 | | | uca1400_czech_as_cs | utf8mb4 | utf8mb4_uca1400_czech_as_cs | 2387 | | | uca1400_czech_nopad_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_ai_cs | 2389 | | | uca1400_czech_nopad_as_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_as_cs | 2391 | | | uca1400_czech_ai_cs | utf16 | utf16_uca1400_czech_ai_cs | 2897 | | | uca1400_czech_as_cs | utf16 | utf16_uca1400_czech_as_cs | 2899 | | | uca1400_czech_nopad_ai_cs | utf16 | utf16_uca1400_czech_nopad_ai_cs | 2901 | | | uca1400_czech_nopad_as_cs | utf16 | utf16_uca1400_czech_nopad_as_cs | 2903 | | | uca1400_czech_ai_cs | utf32 | utf32_uca1400_czech_ai_cs | 3153 | | | uca1400_czech_as_cs | utf32 | utf32_uca1400_czech_as_cs | 3155 | | | uca1400_czech_nopad_ai_cs | utf32 | utf32_uca1400_czech_nopad_ai_cs | 3157 | | | uca1400_czech_nopad_as_cs | utf32 | utf32_uca1400_czech_nopad_as_cs | 3159 | | +---------------------------+--------------------+-----------------------------------+------+------------+ {noformat} |
In order to make utf8mb4 the default character set (see utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other. In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table. Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old. Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version. h2. Character sets New collations will be added for these Unicode character sets - utf8mb3 - utf8mb4 - ucs2 - utf16 - utf32 h2. Built-in contractions New collations will include built-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt There are 939 built-in contractions in UCA-14.0.0. To mention a few of them: {noformat} 0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI> 0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI> 0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI> 0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI> 0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI> {noformat} Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built-in contractions. This will be a step towards better Unicode Collation Algorithm compliance. With built-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation. h2. Performance The patch adding UCA-14.0.0 collations will be pushed together with these patches improving the performance: - - h2. Customization flags Under terms of this task we'll add support for the following flags - pad/nopad - accent sensitivity - case sensititivy Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available: {noformat} +---------------------------+ | collation_name | +---------------------------+ | uca1400_czech_ai_ci | | uca1400_czech_ai_cs | | uca1400_czech_as_ci | | uca1400_czech_as_cs | | uca1400_czech_nopad_ai_ci | | uca1400_czech_nopad_ai_cs | | uca1400_czech_nopad_as_ci | | uca1400_czech_nopad_as_cs | +---------------------------+ {noformat} h2. Naming convention Naming convention for the new collations is described in MDEV-27210. h2. Short collation name The character set name in the collation name will be optional in all syntactic constructs: {code:sql} CREATE TABLE t1 ( a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci, b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci ); {code} Notice, it will be enough to specify just uca1400_as_ci without having to type the full names: - utf8mb4_uca1400_as_ci for the column "a", or - utf16_uca1400_as_ci for the column "b". The full name will be detected automatically according to the effective character set. h2. SHOW CREATE statements {{SHOW CREATE}} statements will display long collations names. For example, {{SHOW CREATE}} for the table created using the CREATE TABLE statement will produce the following output: {code:sql} SHOW CREATE TABLE t1; {code} {noformat} +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Table | Create Table | +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | t1 | CREATE TABLE `t1` ( `a` char(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_as_ci DEFAULT NULL, `b` char(10) CHARACTER SET utf16 COLLATE utf16_uca1400_as_ci DEFAULT NULL ) ENGINE=InnoDB DEFAULT CHARSET=latin1 | +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ {noformat} Notice, the COLLATE clause will contain the full collation name, including the character set prefix. But this will possibly change in later versions to display short collation names. h2. INFORMATION_SCHEMA.COLLATIONS changes The following columns in INFORMATION_SCHEMA.COLLATIONS will be changed to be NULL-able: - CHARACTER_SET_NAME - ID - IS_DEFAULT All new UCA1400 collations will be displayed as follows: - The COLLATION_NAME column will display the short name, without character set prefix. - The three NULL-able column mentioned above will display NULL For old collations nothing will change. For example, the output from this query: {code:sql} SELECT * FROM INFORMATION_SCHEMA.COLLATIONS WHERE COLLATION_NAME LIKE '%czech%'; {code} will look like this: {noformat} +---------------------------+--------------------+------+------------+-------------+---------+ | COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN | +---------------------------+--------------------+------+------------+-------------+---------+ | latin2_czech_cs | latin2 | 2 | | Yes | 4 | | cp1250_czech_cs | cp1250 | 34 | | Yes | 2 | | utf8mb3_czech_ci | utf8mb3 | 202 | | Yes | 8 | | ucs2_czech_ci | ucs2 | 138 | | Yes | 8 | | utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 | | uca1400_czech_ai_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_ai_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_as_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_as_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_ai_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_ai_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_as_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_as_cs | NULL | NULL | NULL | Yes | 8 | | utf16_czech_ci | utf16 | 111 | | Yes | 8 | | utf32_czech_ci | utf32 | 170 | | Yes | 8 | +---------------------------+--------------------+------+------------+-------------+---------+ {noformat} The idea is that short collation names will be applicable to multiple character sets. We'll eventually remove columns CHARACTER_SET_NAME, ID, IS_DEFAULT from the table COLLATIONS. We'll keep them for now only for easier upgrades. h2. INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY changes In addition to columns COLLATION_NAME and CHARACTER_SET_NAME currently existing in the table INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY, three new columns will be added: - FULL_COLLATION_NAME - ID - IS_DEFAULT So the new structure for INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY will look like this: {code:sql} CREATE TEMPORARY TABLE `COLLATION_CHARACTER_SET_APPLICABILITY` ( `COLLATION_NAME` varchar(64) NOT NULL, `CHARACTER_SET_NAME` varchar(32) NOT NULL, `FULL_COLLATION_NAME` varchar(64) NOT NULL, `ID` bigint(11) NOT NULL, `IS_DEFAULT` varchar(3) NOT NULL ) ENGINE=MEMORY DEFAULT CHARSET=utf8mb3; {code} The column COLLATION_NAME will display: - short names, without the character set name prefix, for new UCA1400 collations. - long names for all old collations So for example, the output from this query: {code:sql} SELECT * FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY WHERE COLLATION_NAME LIKE '%czech%cs'; {code} will look about like this: {noformat} +---------------------------+--------------------+-----------------------------------+------+------------+ | COLLATION_NAME | CHARACTER_SET_NAME | FULL_COLLATION_NAME | ID | IS_DEFAULT | +---------------------------+--------------------+-----------------------------------+------+------------+ | latin2_czech_cs | latin2 | latin2_czech_cs | 2 | | | cp1250_czech_cs | cp1250 | cp1250_czech_cs | 34 | | | uca1400_czech_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_ai_cs | 2129 | | | uca1400_czech_as_cs | utf8mb3 | utf8mb3_uca1400_czech_as_cs | 2131 | | | uca1400_czech_nopad_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_ai_cs | 2133 | | | uca1400_czech_nopad_as_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_as_cs | 2135 | | | uca1400_czech_ai_cs | ucs2 | ucs2_uca1400_czech_ai_cs | 2641 | | | uca1400_czech_as_cs | ucs2 | ucs2_uca1400_czech_as_cs | 2643 | | | uca1400_czech_nopad_ai_cs | ucs2 | ucs2_uca1400_czech_nopad_ai_cs | 2645 | | | uca1400_czech_nopad_as_cs | ucs2 | ucs2_uca1400_czech_nopad_as_cs | 2647 | | | uca1400_czech_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_ai_cs | 2385 | | | uca1400_czech_as_cs | utf8mb4 | utf8mb4_uca1400_czech_as_cs | 2387 | | | uca1400_czech_nopad_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_ai_cs | 2389 | | | uca1400_czech_nopad_as_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_as_cs | 2391 | | | uca1400_czech_ai_cs | utf16 | utf16_uca1400_czech_ai_cs | 2897 | | | uca1400_czech_as_cs | utf16 | utf16_uca1400_czech_as_cs | 2899 | | | uca1400_czech_nopad_ai_cs | utf16 | utf16_uca1400_czech_nopad_ai_cs | 2901 | | | uca1400_czech_nopad_as_cs | utf16 | utf16_uca1400_czech_nopad_as_cs | 2903 | | | uca1400_czech_ai_cs | utf32 | utf32_uca1400_czech_ai_cs | 3153 | | | uca1400_czech_as_cs | utf32 | utf32_uca1400_czech_as_cs | 3155 | | | uca1400_czech_nopad_ai_cs | utf32 | utf32_uca1400_czech_nopad_ai_cs | 3157 | | | uca1400_czech_nopad_as_cs | utf32 | utf32_uca1400_czech_nopad_as_cs | 3159 | | +---------------------------+--------------------+-----------------------------------+------+------------+ {noformat} |
Description |
In order to make utf8mb4 the default character set (see utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other. In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table. Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old. Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version. h2. Character sets New collations will be added for these Unicode character sets - utf8mb3 - utf8mb4 - ucs2 - utf16 - utf32 h2. Built-in contractions New collations will include built-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt There are 939 built-in contractions in UCA-14.0.0. To mention a few of them: {noformat} 0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI> 0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI> 0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI> 0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI> 0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI> {noformat} Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built-in contractions. This will be a step towards better Unicode Collation Algorithm compliance. With built-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation. h2. Performance The patch adding UCA-14.0.0 collations will be pushed together with these patches improving the performance: - - h2. Customization flags Under terms of this task we'll add support for the following flags - pad/nopad - accent sensitivity - case sensititivy Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available: {noformat} +---------------------------+ | collation_name | +---------------------------+ | uca1400_czech_ai_ci | | uca1400_czech_ai_cs | | uca1400_czech_as_ci | | uca1400_czech_as_cs | | uca1400_czech_nopad_ai_ci | | uca1400_czech_nopad_ai_cs | | uca1400_czech_nopad_as_ci | | uca1400_czech_nopad_as_cs | +---------------------------+ {noformat} h2. Naming convention Naming convention for the new collations is described in MDEV-27210. h2. Short collation name The character set name in the collation name will be optional in all syntactic constructs: {code:sql} CREATE TABLE t1 ( a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci, b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci ); {code} Notice, it will be enough to specify just uca1400_as_ci without having to type the full names: - utf8mb4_uca1400_as_ci for the column "a", or - utf16_uca1400_as_ci for the column "b". The full name will be detected automatically according to the effective character set. h2. SHOW CREATE statements {{SHOW CREATE}} statements will display long collations names. For example, {{SHOW CREATE}} for the table created using the CREATE TABLE statement will produce the following output: {code:sql} SHOW CREATE TABLE t1; {code} {noformat} +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Table | Create Table | +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | t1 | CREATE TABLE `t1` ( `a` char(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_as_ci DEFAULT NULL, `b` char(10) CHARACTER SET utf16 COLLATE utf16_uca1400_as_ci DEFAULT NULL ) ENGINE=InnoDB DEFAULT CHARSET=latin1 | +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ {noformat} Notice, the COLLATE clause will contain the full collation name, including the character set prefix. But this will possibly change in later versions to display short collation names. h2. INFORMATION_SCHEMA.COLLATIONS changes The following columns in INFORMATION_SCHEMA.COLLATIONS will be changed to be NULL-able: - CHARACTER_SET_NAME - ID - IS_DEFAULT All new UCA1400 collations will be displayed as follows: - The COLLATION_NAME column will display the short name, without character set prefix. - The three NULL-able column mentioned above will display NULL For old collations nothing will change. For example, the output from this query: {code:sql} SELECT * FROM INFORMATION_SCHEMA.COLLATIONS WHERE COLLATION_NAME LIKE '%czech%'; {code} will look like this: {noformat} +---------------------------+--------------------+------+------------+-------------+---------+ | COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN | +---------------------------+--------------------+------+------------+-------------+---------+ | latin2_czech_cs | latin2 | 2 | | Yes | 4 | | cp1250_czech_cs | cp1250 | 34 | | Yes | 2 | | utf8mb3_czech_ci | utf8mb3 | 202 | | Yes | 8 | | ucs2_czech_ci | ucs2 | 138 | | Yes | 8 | | utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 | | uca1400_czech_ai_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_ai_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_as_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_as_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_ai_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_ai_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_as_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_as_cs | NULL | NULL | NULL | Yes | 8 | | utf16_czech_ci | utf16 | 111 | | Yes | 8 | | utf32_czech_ci | utf32 | 170 | | Yes | 8 | +---------------------------+--------------------+------+------------+-------------+---------+ {noformat} The idea is that short collation names will be applicable to multiple character sets. We'll eventually remove columns CHARACTER_SET_NAME, ID, IS_DEFAULT from the table COLLATIONS. We'll keep them for now only for easier upgrades. h2. INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY changes In addition to columns COLLATION_NAME and CHARACTER_SET_NAME currently existing in the table INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY, three new columns will be added: - FULL_COLLATION_NAME - ID - IS_DEFAULT So the new structure for INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY will look like this: {code:sql} CREATE TEMPORARY TABLE `COLLATION_CHARACTER_SET_APPLICABILITY` ( `COLLATION_NAME` varchar(64) NOT NULL, `CHARACTER_SET_NAME` varchar(32) NOT NULL, `FULL_COLLATION_NAME` varchar(64) NOT NULL, `ID` bigint(11) NOT NULL, `IS_DEFAULT` varchar(3) NOT NULL ) ENGINE=MEMORY DEFAULT CHARSET=utf8mb3; {code} The column COLLATION_NAME will display: - short names, without the character set name prefix, for new UCA1400 collations. - long names for all old collations So for example, the output from this query: {code:sql} SELECT * FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY WHERE COLLATION_NAME LIKE '%czech%cs'; {code} will look about like this: {noformat} +---------------------------+--------------------+-----------------------------------+------+------------+ | COLLATION_NAME | CHARACTER_SET_NAME | FULL_COLLATION_NAME | ID | IS_DEFAULT | +---------------------------+--------------------+-----------------------------------+------+------------+ | latin2_czech_cs | latin2 | latin2_czech_cs | 2 | | | cp1250_czech_cs | cp1250 | cp1250_czech_cs | 34 | | | uca1400_czech_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_ai_cs | 2129 | | | uca1400_czech_as_cs | utf8mb3 | utf8mb3_uca1400_czech_as_cs | 2131 | | | uca1400_czech_nopad_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_ai_cs | 2133 | | | uca1400_czech_nopad_as_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_as_cs | 2135 | | | uca1400_czech_ai_cs | ucs2 | ucs2_uca1400_czech_ai_cs | 2641 | | | uca1400_czech_as_cs | ucs2 | ucs2_uca1400_czech_as_cs | 2643 | | | uca1400_czech_nopad_ai_cs | ucs2 | ucs2_uca1400_czech_nopad_ai_cs | 2645 | | | uca1400_czech_nopad_as_cs | ucs2 | ucs2_uca1400_czech_nopad_as_cs | 2647 | | | uca1400_czech_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_ai_cs | 2385 | | | uca1400_czech_as_cs | utf8mb4 | utf8mb4_uca1400_czech_as_cs | 2387 | | | uca1400_czech_nopad_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_ai_cs | 2389 | | | uca1400_czech_nopad_as_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_as_cs | 2391 | | | uca1400_czech_ai_cs | utf16 | utf16_uca1400_czech_ai_cs | 2897 | | | uca1400_czech_as_cs | utf16 | utf16_uca1400_czech_as_cs | 2899 | | | uca1400_czech_nopad_ai_cs | utf16 | utf16_uca1400_czech_nopad_ai_cs | 2901 | | | uca1400_czech_nopad_as_cs | utf16 | utf16_uca1400_czech_nopad_as_cs | 2903 | | | uca1400_czech_ai_cs | utf32 | utf32_uca1400_czech_ai_cs | 3153 | | | uca1400_czech_as_cs | utf32 | utf32_uca1400_czech_as_cs | 3155 | | | uca1400_czech_nopad_ai_cs | utf32 | utf32_uca1400_czech_nopad_ai_cs | 3157 | | | uca1400_czech_nopad_as_cs | utf32 | utf32_uca1400_czech_nopad_as_cs | 3159 | | +---------------------------+--------------------+-----------------------------------+------+------------+ {noformat} |
In order to make utf8mb4 the default character set (see utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other. In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table. Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old. Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version. h2. Character sets New collations will be added for these Unicode character sets - utf8mb3 - utf8mb4 - ucs2 - utf16 - utf32 h2. Built-in contractions New collations will include built-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt There are 939 built-in contractions in UCA-14.0.0. To mention a few of them: {noformat} 0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI> 0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI> 0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI> 0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI> 0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI> {noformat} Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built-in contractions. This will be a step towards better Unicode Collation Algorithm compliance. With built-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation. h2. Performance The patch adding UCA-14.0.0 collations will be pushed together with these patches improving the performance: - - h2. Customization flags Under terms of this task we'll add support for the following flags - pad/nopad - accent sensitivity - case sensititivy Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available: {noformat} +---------------------------+ | collation_name | +---------------------------+ | uca1400_czech_ai_ci | | uca1400_czech_ai_cs | | uca1400_czech_as_ci | | uca1400_czech_as_cs | | uca1400_czech_nopad_ai_ci | | uca1400_czech_nopad_ai_cs | | uca1400_czech_nopad_as_ci | | uca1400_czech_nopad_as_cs | +---------------------------+ {noformat} h2. Naming convention Naming convention for the new collations is described in MDEV-27210. h2. Short collation name The character set name in the collation name will be optional in all syntactic constructs: {code:sql} CREATE TABLE t1 ( a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci, b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci ); {code} Notice, it will be enough to specify just uca1400_as_ci without having to type the full names: - utf8mb4_uca1400_as_ci for the column "a", or - utf16_uca1400_as_ci for the column "b". The full name will be detected automatically according to the effective character set. h2. SHOW CREATE statements {{SHOW CREATE}} statements will display long collations names. For example, {{SHOW CREATE}} for the table created using the CREATE TABLE statement will produce the following output: {code:sql} SHOW CREATE TABLE t1; {code} {noformat} +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Table | Create Table | +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | t1 | CREATE TABLE `t1` ( `a` char(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_as_ci DEFAULT NULL, `b` char(10) CHARACTER SET utf16 COLLATE utf16_uca1400_as_ci DEFAULT NULL ) ENGINE=InnoDB DEFAULT CHARSET=latin1 | +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ {noformat} Notice, the COLLATE clause will contain the full collation name, including the character set prefix. But this will possibly change in later versions to display short collation names. h2. INFORMATION_SCHEMA.COLLATIONS changes The following columns in INFORMATION_SCHEMA.COLLATIONS will be changed to be NULL-able: - CHARACTER_SET_NAME - ID - IS_DEFAULT All new UCA1400 collations will be displayed as follows: - The COLLATION_NAME column will display the short name, without character set prefix. - The three NULL-able column mentioned above will display NULL For old collations nothing will change. For example, the output from this query: {code:sql} SELECT * FROM INFORMATION_SCHEMA.COLLATIONS WHERE COLLATION_NAME LIKE '%czech%'; {code} will look like this: {noformat} +---------------------------+--------------------+------+------------+-------------+---------+ | COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN | +---------------------------+--------------------+------+------------+-------------+---------+ | latin2_czech_cs | latin2 | 2 | | Yes | 4 | | cp1250_czech_cs | cp1250 | 34 | | Yes | 2 | | utf8mb3_czech_ci | utf8mb3 | 202 | | Yes | 8 | | ucs2_czech_ci | ucs2 | 138 | | Yes | 8 | | utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 | | uca1400_czech_ai_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_ai_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_as_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_as_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_ai_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_ai_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_as_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_as_cs | NULL | NULL | NULL | Yes | 8 | | utf16_czech_ci | utf16 | 111 | | Yes | 8 | | utf32_czech_ci | utf32 | 170 | | Yes | 8 | +---------------------------+--------------------+------+------------+-------------+---------+ {noformat} The idea is that short collation names will be applicable to multiple character sets. We'll eventually remove columns CHARACTER_SET_NAME, ID, IS_DEFAULT from the table COLLATIONS. We'll keep them for now only for easier upgrades. h2. INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY changes In addition to columns COLLATION_NAME and CHARACTER_SET_NAME currently existing in the table INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY, three new columns will be added: - FULL_COLLATION_NAME - ID - IS_DEFAULT So the new structure for INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY will look like this: {code:sql} CREATE TEMPORARY TABLE `COLLATION_CHARACTER_SET_APPLICABILITY` ( `COLLATION_NAME` varchar(64) NOT NULL, `CHARACTER_SET_NAME` varchar(32) NOT NULL, `FULL_COLLATION_NAME` varchar(64) NOT NULL, `ID` bigint(11) NOT NULL, `IS_DEFAULT` varchar(3) NOT NULL ) ENGINE=MEMORY DEFAULT CHARSET=utf8mb3; {code} The column COLLATION_NAME will display: - short names, without the character set name prefix, for new UCA1400 collations. - long names for all old collations So for example, the output from this query: {code:sql} SELECT * FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY WHERE COLLATION_NAME LIKE '%czech%cs'; {code} will look about like this: {noformat} +---------------------------+--------------------+-----------------------------------+------+------------+ | COLLATION_NAME | CHARACTER_SET_NAME | FULL_COLLATION_NAME | ID | IS_DEFAULT | +---------------------------+--------------------+-----------------------------------+------+------------+ | latin2_czech_cs | latin2 | latin2_czech_cs | 2 | | | cp1250_czech_cs | cp1250 | cp1250_czech_cs | 34 | | | uca1400_czech_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_ai_cs | 2129 | | | uca1400_czech_as_cs | utf8mb3 | utf8mb3_uca1400_czech_as_cs | 2131 | | | uca1400_czech_nopad_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_ai_cs | 2133 | | | uca1400_czech_nopad_as_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_as_cs | 2135 | | | uca1400_czech_ai_cs | ucs2 | ucs2_uca1400_czech_ai_cs | 2641 | | | uca1400_czech_as_cs | ucs2 | ucs2_uca1400_czech_as_cs | 2643 | | | uca1400_czech_nopad_ai_cs | ucs2 | ucs2_uca1400_czech_nopad_ai_cs | 2645 | | | uca1400_czech_nopad_as_cs | ucs2 | ucs2_uca1400_czech_nopad_as_cs | 2647 | | | uca1400_czech_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_ai_cs | 2385 | | | uca1400_czech_as_cs | utf8mb4 | utf8mb4_uca1400_czech_as_cs | 2387 | | | uca1400_czech_nopad_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_ai_cs | 2389 | | | uca1400_czech_nopad_as_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_as_cs | 2391 | | | uca1400_czech_ai_cs | utf16 | utf16_uca1400_czech_ai_cs | 2897 | | | uca1400_czech_as_cs | utf16 | utf16_uca1400_czech_as_cs | 2899 | | | uca1400_czech_nopad_ai_cs | utf16 | utf16_uca1400_czech_nopad_ai_cs | 2901 | | | uca1400_czech_nopad_as_cs | utf16 | utf16_uca1400_czech_nopad_as_cs | 2903 | | | uca1400_czech_ai_cs | utf32 | utf32_uca1400_czech_ai_cs | 3153 | | | uca1400_czech_as_cs | utf32 | utf32_uca1400_czech_as_cs | 3155 | | | uca1400_czech_nopad_ai_cs | utf32 | utf32_uca1400_czech_nopad_ai_cs | 3157 | | | uca1400_czech_nopad_as_cs | utf32 | utf32_uca1400_czech_nopad_as_cs | 3159 | | +---------------------------+--------------------+-----------------------------------+------+------------+ {noformat} |
Description |
In order to make utf8mb4 the default character set (see utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other. In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table. Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old. Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version. h2. Character sets New collations will be added for these Unicode character sets - utf8mb3 - utf8mb4 - ucs2 - utf16 - utf32 h2. Built-in contractions New collations will include built-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt There are 939 built-in contractions in UCA-14.0.0. To mention a few of them: {noformat} 0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI> 0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI> 0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI> 0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI> 0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI> {noformat} Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built-in contractions. This will be a step towards better Unicode Collation Algorithm compliance. With built-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation. h2. Performance The patch adding UCA-14.0.0 collations will be pushed together with these patches improving the performance: - - h2. Customization flags Under terms of this task we'll add support for the following flags - pad/nopad - accent sensitivity - case sensititivy Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available: {noformat} +---------------------------+ | collation_name | +---------------------------+ | uca1400_czech_ai_ci | | uca1400_czech_ai_cs | | uca1400_czech_as_ci | | uca1400_czech_as_cs | | uca1400_czech_nopad_ai_ci | | uca1400_czech_nopad_ai_cs | | uca1400_czech_nopad_as_ci | | uca1400_czech_nopad_as_cs | +---------------------------+ {noformat} h2. Naming convention Naming convention for the new collations is described in MDEV-27210. h2. Short collation name The character set name in the collation name will be optional in all syntactic constructs: {code:sql} CREATE TABLE t1 ( a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci, b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci ); {code} Notice, it will be enough to specify just uca1400_as_ci without having to type the full names: - utf8mb4_uca1400_as_ci for the column "a", or - utf16_uca1400_as_ci for the column "b". The full name will be detected automatically according to the effective character set. h2. SHOW CREATE statements {{SHOW CREATE}} statements will display long collations names. For example, {{SHOW CREATE}} for the table created using the CREATE TABLE statement will produce the following output: {code:sql} SHOW CREATE TABLE t1; {code} {noformat} +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Table | Create Table | +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | t1 | CREATE TABLE `t1` ( `a` char(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_as_ci DEFAULT NULL, `b` char(10) CHARACTER SET utf16 COLLATE utf16_uca1400_as_ci DEFAULT NULL ) ENGINE=InnoDB DEFAULT CHARSET=latin1 | +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ {noformat} Notice, the COLLATE clause will contain the full collation name, including the character set prefix. But this will possibly change in later versions to display short collation names. h2. INFORMATION_SCHEMA.COLLATIONS changes The following columns in INFORMATION_SCHEMA.COLLATIONS will be changed to be NULL-able: - CHARACTER_SET_NAME - ID - IS_DEFAULT All new UCA1400 collations will be displayed as follows: - The COLLATION_NAME column will display the short name, without character set prefix. - The three NULL-able column mentioned above will display NULL For old collations nothing will change. For example, the output from this query: {code:sql} SELECT * FROM INFORMATION_SCHEMA.COLLATIONS WHERE COLLATION_NAME LIKE '%czech%'; {code} will look like this: {noformat} +---------------------------+--------------------+------+------------+-------------+---------+ | COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN | +---------------------------+--------------------+------+------------+-------------+---------+ | latin2_czech_cs | latin2 | 2 | | Yes | 4 | | cp1250_czech_cs | cp1250 | 34 | | Yes | 2 | | utf8mb3_czech_ci | utf8mb3 | 202 | | Yes | 8 | | ucs2_czech_ci | ucs2 | 138 | | Yes | 8 | | utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 | | uca1400_czech_ai_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_ai_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_as_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_as_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_ai_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_ai_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_as_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_as_cs | NULL | NULL | NULL | Yes | 8 | | utf16_czech_ci | utf16 | 111 | | Yes | 8 | | utf32_czech_ci | utf32 | 170 | | Yes | 8 | +---------------------------+--------------------+------+------------+-------------+---------+ {noformat} The idea is that short collation names will be applicable to multiple character sets. We'll eventually remove columns CHARACTER_SET_NAME, ID, IS_DEFAULT from the table COLLATIONS. We'll keep them for now only for easier upgrades. h2. INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY changes In addition to columns COLLATION_NAME and CHARACTER_SET_NAME currently existing in the table INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY, three new columns will be added: - FULL_COLLATION_NAME - ID - IS_DEFAULT So the new structure for INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY will look like this: {code:sql} CREATE TEMPORARY TABLE `COLLATION_CHARACTER_SET_APPLICABILITY` ( `COLLATION_NAME` varchar(64) NOT NULL, `CHARACTER_SET_NAME` varchar(32) NOT NULL, `FULL_COLLATION_NAME` varchar(64) NOT NULL, `ID` bigint(11) NOT NULL, `IS_DEFAULT` varchar(3) NOT NULL ) ENGINE=MEMORY DEFAULT CHARSET=utf8mb3; {code} The column COLLATION_NAME will display: - short names, without the character set name prefix, for new UCA1400 collations. - long names for all old collations So for example, the output from this query: {code:sql} SELECT * FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY WHERE COLLATION_NAME LIKE '%czech%cs'; {code} will look about like this: {noformat} +---------------------------+--------------------+-----------------------------------+------+------------+ | COLLATION_NAME | CHARACTER_SET_NAME | FULL_COLLATION_NAME | ID | IS_DEFAULT | +---------------------------+--------------------+-----------------------------------+------+------------+ | latin2_czech_cs | latin2 | latin2_czech_cs | 2 | | | cp1250_czech_cs | cp1250 | cp1250_czech_cs | 34 | | | uca1400_czech_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_ai_cs | 2129 | | | uca1400_czech_as_cs | utf8mb3 | utf8mb3_uca1400_czech_as_cs | 2131 | | | uca1400_czech_nopad_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_ai_cs | 2133 | | | uca1400_czech_nopad_as_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_as_cs | 2135 | | | uca1400_czech_ai_cs | ucs2 | ucs2_uca1400_czech_ai_cs | 2641 | | | uca1400_czech_as_cs | ucs2 | ucs2_uca1400_czech_as_cs | 2643 | | | uca1400_czech_nopad_ai_cs | ucs2 | ucs2_uca1400_czech_nopad_ai_cs | 2645 | | | uca1400_czech_nopad_as_cs | ucs2 | ucs2_uca1400_czech_nopad_as_cs | 2647 | | | uca1400_czech_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_ai_cs | 2385 | | | uca1400_czech_as_cs | utf8mb4 | utf8mb4_uca1400_czech_as_cs | 2387 | | | uca1400_czech_nopad_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_ai_cs | 2389 | | | uca1400_czech_nopad_as_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_as_cs | 2391 | | | uca1400_czech_ai_cs | utf16 | utf16_uca1400_czech_ai_cs | 2897 | | | uca1400_czech_as_cs | utf16 | utf16_uca1400_czech_as_cs | 2899 | | | uca1400_czech_nopad_ai_cs | utf16 | utf16_uca1400_czech_nopad_ai_cs | 2901 | | | uca1400_czech_nopad_as_cs | utf16 | utf16_uca1400_czech_nopad_as_cs | 2903 | | | uca1400_czech_ai_cs | utf32 | utf32_uca1400_czech_ai_cs | 3153 | | | uca1400_czech_as_cs | utf32 | utf32_uca1400_czech_as_cs | 3155 | | | uca1400_czech_nopad_ai_cs | utf32 | utf32_uca1400_czech_nopad_ai_cs | 3157 | | | uca1400_czech_nopad_as_cs | utf32 | utf32_uca1400_czech_nopad_as_cs | 3159 | | +---------------------------+--------------------+-----------------------------------+------+------------+ {noformat} |
In order to make utf8mb4 the default character set (see utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other. In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table. Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old. Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version. h2. Character sets New collations will be added for these Unicode character sets - utf8mb3 - utf8mb4 - ucs2 - utf16 - utf32 h2. Built-in contractions New collations will include built-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt There are 939 built-in contractions in UCA-14.0.0. To mention a few of them: {noformat} 0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI> 0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI> 0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI> 0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI> 0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI> {noformat} Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built-in contractions. This will be a step towards better Unicode Collation Algorithm compliance. With built-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation. h2. Performance The patch adding UCA-14.0.0 collations will be pushed together with these patches improving the performance: - - h2. Customization flags Under terms of this task we'll add support for the following flags - pad/nopad - accent sensitivity - case sensititivy Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available: {noformat} +---------------------------+ | collation_name | +---------------------------+ | uca1400_czech_ai_ci | | uca1400_czech_ai_cs | | uca1400_czech_as_ci | | uca1400_czech_as_cs | | uca1400_czech_nopad_ai_ci | | uca1400_czech_nopad_ai_cs | | uca1400_czech_nopad_as_ci | | uca1400_czech_nopad_as_cs | +---------------------------+ {noformat} h2. Naming convention Naming convention for the new collations is described in MDEV-27210. h2. Short collation name The character set name in the collation name will be optional in all syntactic constructs: {code:sql} CREATE TABLE t1 ( a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci, b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci ); {code} Notice, it will be enough to specify just uca1400_as_ci without having to type the full names: - utf8mb4_uca1400_as_ci for the column "a", or - utf16_uca1400_as_ci for the column "b". The full name will be detected automatically according to the effective character set. h2. SHOW CREATE statements {{SHOW CREATE}} statements will display long collations names. For example, {{SHOW CREATE}} for the table created using the CREATE TABLE statement will produce the following output: {code:sql} SHOW CREATE TABLE t1; {code} {noformat} +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Table | Create Table | +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | t1 | CREATE TABLE `t1` ( `a` char(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_as_ci DEFAULT NULL, `b` char(10) CHARACTER SET utf16 COLLATE utf16_uca1400_as_ci DEFAULT NULL ) ENGINE=InnoDB DEFAULT CHARSET=latin1 | +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ {noformat} Notice, the COLLATE clause will contain the full collation name, including the character set prefix. But this will possibly change in later versions to display short collation names. h2. INFORMATION_SCHEMA.COLLATIONS changes The following columns in INFORMATION_SCHEMA.COLLATIONS will be changed to be NULL-able: - CHARACTER_SET_NAME - ID - IS_DEFAULT All new UCA1400 collations will be displayed as follows: - The COLLATION_NAME column will display the short name, without character set prefix. - The three NULL-able column mentioned above will display NULL For old collations nothing will change. For example, the output from this query: {code:sql} SELECT * FROM INFORMATION_SCHEMA.COLLATIONS WHERE COLLATION_NAME LIKE '%czech%'; {code} will look like this: {noformat} +---------------------------+--------------------+------+------------+-------------+---------+ | COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN | +---------------------------+--------------------+------+------------+-------------+---------+ | latin2_czech_cs | latin2 | 2 | | Yes | 4 | | cp1250_czech_cs | cp1250 | 34 | | Yes | 2 | | utf8mb3_czech_ci | utf8mb3 | 202 | | Yes | 8 | | ucs2_czech_ci | ucs2 | 138 | | Yes | 8 | | utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 | | uca1400_czech_ai_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_ai_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_as_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_as_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_ai_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_ai_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_as_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_as_cs | NULL | NULL | NULL | Yes | 8 | | utf16_czech_ci | utf16 | 111 | | Yes | 8 | | utf32_czech_ci | utf32 | 170 | | Yes | 8 | +---------------------------+--------------------+------+------------+-------------+---------+ {noformat} The idea is that short collation names will be applicable to multiple character sets. We'll eventually remove columns CHARACTER_SET_NAME, ID, IS_DEFAULT from the table COLLATIONS. We'll keep them for now only for easier upgrades. h2. INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY changes In addition to columns COLLATION_NAME and CHARACTER_SET_NAME currently existing in the table INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY, three new columns will be added: - FULL_COLLATION_NAME - ID - IS_DEFAULT So the new structure for INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY will look like this: {code:sql} CREATE TEMPORARY TABLE `COLLATION_CHARACTER_SET_APPLICABILITY` ( `COLLATION_NAME` varchar(64) NOT NULL, `CHARACTER_SET_NAME` varchar(32) NOT NULL, `FULL_COLLATION_NAME` varchar(64) NOT NULL, `ID` bigint(11) NOT NULL, `IS_DEFAULT` varchar(3) NOT NULL ) ENGINE=MEMORY DEFAULT CHARSET=utf8mb3; {code} The column COLLATION_NAME will display: - short names, without the character set name prefix, for new UCA1400 collations. - long names for all old collations So for example, the output from this query: {code:sql} SELECT * FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY WHERE COLLATION_NAME LIKE '%czech%cs'; {code} will look about like this: {noformat} +---------------------------+--------------------+-----------------------------------+------+------------+ | COLLATION_NAME | CHARACTER_SET_NAME | FULL_COLLATION_NAME | ID | IS_DEFAULT | +---------------------------+--------------------+-----------------------------------+------+------------+ | latin2_czech_cs | latin2 | latin2_czech_cs | 2 | | | cp1250_czech_cs | cp1250 | cp1250_czech_cs | 34 | | | uca1400_czech_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_ai_cs | 2129 | | | uca1400_czech_as_cs | utf8mb3 | utf8mb3_uca1400_czech_as_cs | 2131 | | | uca1400_czech_nopad_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_ai_cs | 2133 | | | uca1400_czech_nopad_as_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_as_cs | 2135 | | | uca1400_czech_ai_cs | ucs2 | ucs2_uca1400_czech_ai_cs | 2641 | | | uca1400_czech_as_cs | ucs2 | ucs2_uca1400_czech_as_cs | 2643 | | | uca1400_czech_nopad_ai_cs | ucs2 | ucs2_uca1400_czech_nopad_ai_cs | 2645 | | | uca1400_czech_nopad_as_cs | ucs2 | ucs2_uca1400_czech_nopad_as_cs | 2647 | | | uca1400_czech_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_ai_cs | 2385 | | | uca1400_czech_as_cs | utf8mb4 | utf8mb4_uca1400_czech_as_cs | 2387 | | | uca1400_czech_nopad_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_ai_cs | 2389 | | | uca1400_czech_nopad_as_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_as_cs | 2391 | | | uca1400_czech_ai_cs | utf16 | utf16_uca1400_czech_ai_cs | 2897 | | | uca1400_czech_as_cs | utf16 | utf16_uca1400_czech_as_cs | 2899 | | | uca1400_czech_nopad_ai_cs | utf16 | utf16_uca1400_czech_nopad_ai_cs | 2901 | | | uca1400_czech_nopad_as_cs | utf16 | utf16_uca1400_czech_nopad_as_cs | 2903 | | | uca1400_czech_ai_cs | utf32 | utf32_uca1400_czech_ai_cs | 3153 | | | uca1400_czech_as_cs | utf32 | utf32_uca1400_czech_as_cs | 3155 | | | uca1400_czech_nopad_ai_cs | utf32 | utf32_uca1400_czech_nopad_ai_cs | 3157 | | | uca1400_czech_nopad_as_cs | utf32 | utf32_uca1400_czech_nopad_as_cs | 3159 | | +---------------------------+--------------------+-----------------------------------+------+------------+ {noformat} h2. New ID ranges We'll use the range 2048-4095 for new UCA-14.0.0 collation. The ID will encode: - character set - tailoring - flags into 12 bits as follows: {noformat} 1ccc tttt tPST {noformat} where {noformat} - ccc (3 bits) - the character set ID (0 - utf8mb3, 1 - utf8mb4, 2 - ucs2, 3 - utf16, 4 - utf32) - ttttt (5 bits) - the tailoring ID (0 - default, 1 - icelandic, 2 - latvian, etc) - P (1 bit) - the PAD flag (0 - PAD, 1 - NOPAD) - S (1 bit) - Secondary level 0 - disabled (accent insensitive) 1 - enabled (accent sensitive) - T (1 bit) - Tertiary level 0 - disabled (case insensitive) 1 - enabled (case sensitive) {noformat} |
Description |
In order to make utf8mb4 the default character set (see utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other. In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table. Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old. Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version. h2. Character sets New collations will be added for these Unicode character sets - utf8mb3 - utf8mb4 - ucs2 - utf16 - utf32 h2. Built-in contractions New collations will include built-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt There are 939 built-in contractions in UCA-14.0.0. To mention a few of them: {noformat} 0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI> 0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI> 0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI> 0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI> 0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI> {noformat} Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built-in contractions. This will be a step towards better Unicode Collation Algorithm compliance. With built-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation. h2. Performance The patch adding UCA-14.0.0 collations will be pushed together with these patches improving the performance: - - h2. Customization flags Under terms of this task we'll add support for the following flags - pad/nopad - accent sensitivity - case sensititivy Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available: {noformat} +---------------------------+ | collation_name | +---------------------------+ | uca1400_czech_ai_ci | | uca1400_czech_ai_cs | | uca1400_czech_as_ci | | uca1400_czech_as_cs | | uca1400_czech_nopad_ai_ci | | uca1400_czech_nopad_ai_cs | | uca1400_czech_nopad_as_ci | | uca1400_czech_nopad_as_cs | +---------------------------+ {noformat} h2. Naming convention Naming convention for the new collations is described in MDEV-27210. h2. Short collation name The character set name in the collation name will be optional in all syntactic constructs: {code:sql} CREATE TABLE t1 ( a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci, b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci ); {code} Notice, it will be enough to specify just uca1400_as_ci without having to type the full names: - utf8mb4_uca1400_as_ci for the column "a", or - utf16_uca1400_as_ci for the column "b". The full name will be detected automatically according to the effective character set. h2. SHOW CREATE statements {{SHOW CREATE}} statements will display long collations names. For example, {{SHOW CREATE}} for the table created using the CREATE TABLE statement will produce the following output: {code:sql} SHOW CREATE TABLE t1; {code} {noformat} +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Table | Create Table | +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | t1 | CREATE TABLE `t1` ( `a` char(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_as_ci DEFAULT NULL, `b` char(10) CHARACTER SET utf16 COLLATE utf16_uca1400_as_ci DEFAULT NULL ) ENGINE=InnoDB DEFAULT CHARSET=latin1 | +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ {noformat} Notice, the COLLATE clause will contain the full collation name, including the character set prefix. But this will possibly change in later versions to display short collation names. h2. INFORMATION_SCHEMA.COLLATIONS changes The following columns in INFORMATION_SCHEMA.COLLATIONS will be changed to be NULL-able: - CHARACTER_SET_NAME - ID - IS_DEFAULT All new UCA1400 collations will be displayed as follows: - The COLLATION_NAME column will display the short name, without character set prefix. - The three NULL-able column mentioned above will display NULL For old collations nothing will change. For example, the output from this query: {code:sql} SELECT * FROM INFORMATION_SCHEMA.COLLATIONS WHERE COLLATION_NAME LIKE '%czech%'; {code} will look like this: {noformat} +---------------------------+--------------------+------+------------+-------------+---------+ | COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN | +---------------------------+--------------------+------+------------+-------------+---------+ | latin2_czech_cs | latin2 | 2 | | Yes | 4 | | cp1250_czech_cs | cp1250 | 34 | | Yes | 2 | | utf8mb3_czech_ci | utf8mb3 | 202 | | Yes | 8 | | ucs2_czech_ci | ucs2 | 138 | | Yes | 8 | | utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 | | uca1400_czech_ai_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_ai_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_as_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_as_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_ai_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_ai_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_as_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_as_cs | NULL | NULL | NULL | Yes | 8 | | utf16_czech_ci | utf16 | 111 | | Yes | 8 | | utf32_czech_ci | utf32 | 170 | | Yes | 8 | +---------------------------+--------------------+------+------------+-------------+---------+ {noformat} The idea is that short collation names will be applicable to multiple character sets. We'll eventually remove columns CHARACTER_SET_NAME, ID, IS_DEFAULT from the table COLLATIONS. We'll keep them for now only for easier upgrades. h2. INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY changes In addition to columns COLLATION_NAME and CHARACTER_SET_NAME currently existing in the table INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY, three new columns will be added: - FULL_COLLATION_NAME - ID - IS_DEFAULT So the new structure for INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY will look like this: {code:sql} CREATE TEMPORARY TABLE `COLLATION_CHARACTER_SET_APPLICABILITY` ( `COLLATION_NAME` varchar(64) NOT NULL, `CHARACTER_SET_NAME` varchar(32) NOT NULL, `FULL_COLLATION_NAME` varchar(64) NOT NULL, `ID` bigint(11) NOT NULL, `IS_DEFAULT` varchar(3) NOT NULL ) ENGINE=MEMORY DEFAULT CHARSET=utf8mb3; {code} The column COLLATION_NAME will display: - short names, without the character set name prefix, for new UCA1400 collations. - long names for all old collations So for example, the output from this query: {code:sql} SELECT * FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY WHERE COLLATION_NAME LIKE '%czech%cs'; {code} will look about like this: {noformat} +---------------------------+--------------------+-----------------------------------+------+------------+ | COLLATION_NAME | CHARACTER_SET_NAME | FULL_COLLATION_NAME | ID | IS_DEFAULT | +---------------------------+--------------------+-----------------------------------+------+------------+ | latin2_czech_cs | latin2 | latin2_czech_cs | 2 | | | cp1250_czech_cs | cp1250 | cp1250_czech_cs | 34 | | | uca1400_czech_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_ai_cs | 2129 | | | uca1400_czech_as_cs | utf8mb3 | utf8mb3_uca1400_czech_as_cs | 2131 | | | uca1400_czech_nopad_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_ai_cs | 2133 | | | uca1400_czech_nopad_as_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_as_cs | 2135 | | | uca1400_czech_ai_cs | ucs2 | ucs2_uca1400_czech_ai_cs | 2641 | | | uca1400_czech_as_cs | ucs2 | ucs2_uca1400_czech_as_cs | 2643 | | | uca1400_czech_nopad_ai_cs | ucs2 | ucs2_uca1400_czech_nopad_ai_cs | 2645 | | | uca1400_czech_nopad_as_cs | ucs2 | ucs2_uca1400_czech_nopad_as_cs | 2647 | | | uca1400_czech_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_ai_cs | 2385 | | | uca1400_czech_as_cs | utf8mb4 | utf8mb4_uca1400_czech_as_cs | 2387 | | | uca1400_czech_nopad_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_ai_cs | 2389 | | | uca1400_czech_nopad_as_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_as_cs | 2391 | | | uca1400_czech_ai_cs | utf16 | utf16_uca1400_czech_ai_cs | 2897 | | | uca1400_czech_as_cs | utf16 | utf16_uca1400_czech_as_cs | 2899 | | | uca1400_czech_nopad_ai_cs | utf16 | utf16_uca1400_czech_nopad_ai_cs | 2901 | | | uca1400_czech_nopad_as_cs | utf16 | utf16_uca1400_czech_nopad_as_cs | 2903 | | | uca1400_czech_ai_cs | utf32 | utf32_uca1400_czech_ai_cs | 3153 | | | uca1400_czech_as_cs | utf32 | utf32_uca1400_czech_as_cs | 3155 | | | uca1400_czech_nopad_ai_cs | utf32 | utf32_uca1400_czech_nopad_ai_cs | 3157 | | | uca1400_czech_nopad_as_cs | utf32 | utf32_uca1400_czech_nopad_as_cs | 3159 | | +---------------------------+--------------------+-----------------------------------+------+------------+ {noformat} h2. New ID ranges We'll use the range 2048-4095 for new UCA-14.0.0 collation. The ID will encode: - character set - tailoring - flags into 12 bits as follows: {noformat} 1ccc tttt tPST {noformat} where {noformat} - ccc (3 bits) - the character set ID (0 - utf8mb3, 1 - utf8mb4, 2 - ucs2, 3 - utf16, 4 - utf32) - ttttt (5 bits) - the tailoring ID (0 - default, 1 - icelandic, 2 - latvian, etc) - P (1 bit) - the PAD flag (0 - PAD, 1 - NOPAD) - S (1 bit) - Secondary level 0 - disabled (accent insensitive) 1 - enabled (accent sensitive) - T (1 bit) - Tertiary level 0 - disabled (case insensitive) 1 - enabled (case sensitive) {noformat} |
In order to make utf8mb4 the default character set (see utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other. In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table. Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old. Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version. h2. Character sets New collations will be added for these Unicode character sets - utf8mb3 - utf8mb4 - ucs2 - utf16 - utf32 h2. Built-in contractions New collations will include built-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt There are 939 built-in contractions in UCA-14.0.0. To mention a few of them: {noformat} 0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI> 0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI> 0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI> 0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI> 0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI> {noformat} Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built-in contractions. This will be a step towards better Unicode Collation Algorithm compliance. With built-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation. h2. Performance The patch adding UCA-14.0.0 collations will be pushed together with these patches improving the performance: - - h2. Customization flags Under terms of this task we'll add support for the following flags - pad/nopad - accent sensitivity - case sensititivy Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available: {noformat} +---------------------------+ | collation_name | +---------------------------+ | uca1400_czech_ai_ci | | uca1400_czech_ai_cs | | uca1400_czech_as_ci | | uca1400_czech_as_cs | | uca1400_czech_nopad_ai_ci | | uca1400_czech_nopad_ai_cs | | uca1400_czech_nopad_as_ci | | uca1400_czech_nopad_as_cs | +---------------------------+ {noformat} h2. Naming convention Naming convention for the new collations is described in MDEV-27210. h2. Short collation name The character set name in the collation name will be optional in all syntactic constructs: {code:sql} CREATE TABLE t1 ( a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci, b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci ); {code} Notice, it will be enough to specify just uca1400_as_ci without having to type the full names: - utf8mb4_uca1400_as_ci for the column "a", or - utf16_uca1400_as_ci for the column "b". The full name will be detected automatically according to the character set effective in the given context. h2. SHOW CREATE statements {{SHOW CREATE}} statements will display long collations names. For example, {{SHOW CREATE}} for the table created using the CREATE TABLE statement will produce the following output: {code:sql} SHOW CREATE TABLE t1; {code} {noformat} +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Table | Create Table | +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | t1 | CREATE TABLE `t1` ( `a` char(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_as_ci DEFAULT NULL, `b` char(10) CHARACTER SET utf16 COLLATE utf16_uca1400_as_ci DEFAULT NULL ) ENGINE=InnoDB DEFAULT CHARSET=latin1 | +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ {noformat} Notice, the COLLATE clause will contain the full collation name, including the character set prefix. But this will possibly change in later versions to display short collation names. h2. INFORMATION_SCHEMA.COLLATIONS changes The following columns in INFORMATION_SCHEMA.COLLATIONS will be changed to be NULL-able: - CHARACTER_SET_NAME - ID - IS_DEFAULT All new UCA1400 collations will be displayed as follows: - The COLLATION_NAME column will display the short name, without character set prefix. - The three NULL-able column mentioned above will display NULL For old collations nothing will change. For example, the output from this query: {code:sql} SELECT * FROM INFORMATION_SCHEMA.COLLATIONS WHERE COLLATION_NAME LIKE '%czech%'; {code} will look like this: {noformat} +---------------------------+--------------------+------+------------+-------------+---------+ | COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN | +---------------------------+--------------------+------+------------+-------------+---------+ | latin2_czech_cs | latin2 | 2 | | Yes | 4 | | cp1250_czech_cs | cp1250 | 34 | | Yes | 2 | | utf8mb3_czech_ci | utf8mb3 | 202 | | Yes | 8 | | ucs2_czech_ci | ucs2 | 138 | | Yes | 8 | | utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 | | uca1400_czech_ai_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_ai_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_as_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_as_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_ai_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_ai_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_as_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_as_cs | NULL | NULL | NULL | Yes | 8 | | utf16_czech_ci | utf16 | 111 | | Yes | 8 | | utf32_czech_ci | utf32 | 170 | | Yes | 8 | +---------------------------+--------------------+------+------------+-------------+---------+ {noformat} The idea is that short collation names will be applicable to multiple character sets. We'll eventually remove columns CHARACTER_SET_NAME, ID, IS_DEFAULT from the table COLLATIONS. We'll keep them for now only for easier upgrades. h2. INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY changes In addition to columns COLLATION_NAME and CHARACTER_SET_NAME currently existing in the table INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY, three new columns will be added: - FULL_COLLATION_NAME - ID - IS_DEFAULT So the new structure for INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY will look like this: {code:sql} CREATE TEMPORARY TABLE `COLLATION_CHARACTER_SET_APPLICABILITY` ( `COLLATION_NAME` varchar(64) NOT NULL, `CHARACTER_SET_NAME` varchar(32) NOT NULL, `FULL_COLLATION_NAME` varchar(64) NOT NULL, `ID` bigint(11) NOT NULL, `IS_DEFAULT` varchar(3) NOT NULL ) ENGINE=MEMORY DEFAULT CHARSET=utf8mb3; {code} The column COLLATION_NAME will display: - short names, without the character set name prefix, for new UCA1400 collations. - long names for all old collations So for example, the output from this query: {code:sql} SELECT * FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY WHERE COLLATION_NAME LIKE '%czech%cs'; {code} will look about like this: {noformat} +---------------------------+--------------------+-----------------------------------+------+------------+ | COLLATION_NAME | CHARACTER_SET_NAME | FULL_COLLATION_NAME | ID | IS_DEFAULT | +---------------------------+--------------------+-----------------------------------+------+------------+ | latin2_czech_cs | latin2 | latin2_czech_cs | 2 | | | cp1250_czech_cs | cp1250 | cp1250_czech_cs | 34 | | | uca1400_czech_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_ai_cs | 2129 | | | uca1400_czech_as_cs | utf8mb3 | utf8mb3_uca1400_czech_as_cs | 2131 | | | uca1400_czech_nopad_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_ai_cs | 2133 | | | uca1400_czech_nopad_as_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_as_cs | 2135 | | | uca1400_czech_ai_cs | ucs2 | ucs2_uca1400_czech_ai_cs | 2641 | | | uca1400_czech_as_cs | ucs2 | ucs2_uca1400_czech_as_cs | 2643 | | | uca1400_czech_nopad_ai_cs | ucs2 | ucs2_uca1400_czech_nopad_ai_cs | 2645 | | | uca1400_czech_nopad_as_cs | ucs2 | ucs2_uca1400_czech_nopad_as_cs | 2647 | | | uca1400_czech_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_ai_cs | 2385 | | | uca1400_czech_as_cs | utf8mb4 | utf8mb4_uca1400_czech_as_cs | 2387 | | | uca1400_czech_nopad_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_ai_cs | 2389 | | | uca1400_czech_nopad_as_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_as_cs | 2391 | | | uca1400_czech_ai_cs | utf16 | utf16_uca1400_czech_ai_cs | 2897 | | | uca1400_czech_as_cs | utf16 | utf16_uca1400_czech_as_cs | 2899 | | | uca1400_czech_nopad_ai_cs | utf16 | utf16_uca1400_czech_nopad_ai_cs | 2901 | | | uca1400_czech_nopad_as_cs | utf16 | utf16_uca1400_czech_nopad_as_cs | 2903 | | | uca1400_czech_ai_cs | utf32 | utf32_uca1400_czech_ai_cs | 3153 | | | uca1400_czech_as_cs | utf32 | utf32_uca1400_czech_as_cs | 3155 | | | uca1400_czech_nopad_ai_cs | utf32 | utf32_uca1400_czech_nopad_ai_cs | 3157 | | | uca1400_czech_nopad_as_cs | utf32 | utf32_uca1400_czech_nopad_as_cs | 3159 | | +---------------------------+--------------------+-----------------------------------+------+------------+ {noformat} h2. New ID ranges We'll use the range 2048-4095 for new UCA-14.0.0 collation. The ID will encode: - character set - tailoring - flags into 12 bits as follows: {noformat} 1ccc tttt tPST {noformat} where {noformat} - ccc (3 bits) - the character set ID (0 - utf8mb3, 1 - utf8mb4, 2 - ucs2, 3 - utf16, 4 - utf32) - ttttt (5 bits) - the tailoring ID (0 - default, 1 - icelandic, 2 - latvian, etc) - P (1 bit) - the PAD flag (0 - PAD, 1 - NOPAD) - S (1 bit) - Secondary level 0 - disabled (accent insensitive) 1 - enabled (accent sensitive) - T (1 bit) - Tertiary level 0 - disabled (case insensitive) 1 - enabled (case sensitive) {noformat} |
Description |
In order to make utf8mb4 the default character set (see utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other. In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table. Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old. Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version. h2. Character sets New collations will be added for these Unicode character sets - utf8mb3 - utf8mb4 - ucs2 - utf16 - utf32 h2. Built-in contractions New collations will include built-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt There are 939 built-in contractions in UCA-14.0.0. To mention a few of them: {noformat} 0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI> 0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI> 0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI> 0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI> 0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI> {noformat} Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built-in contractions. This will be a step towards better Unicode Collation Algorithm compliance. With built-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation. h2. Performance The patch adding UCA-14.0.0 collations will be pushed together with these patches improving the performance: - - h2. Customization flags Under terms of this task we'll add support for the following flags - pad/nopad - accent sensitivity - case sensititivy Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available: {noformat} +---------------------------+ | collation_name | +---------------------------+ | uca1400_czech_ai_ci | | uca1400_czech_ai_cs | | uca1400_czech_as_ci | | uca1400_czech_as_cs | | uca1400_czech_nopad_ai_ci | | uca1400_czech_nopad_ai_cs | | uca1400_czech_nopad_as_ci | | uca1400_czech_nopad_as_cs | +---------------------------+ {noformat} h2. Naming convention Naming convention for the new collations is described in MDEV-27210. h2. Short collation name The character set name in the collation name will be optional in all syntactic constructs: {code:sql} CREATE TABLE t1 ( a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci, b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci ); {code} Notice, it will be enough to specify just uca1400_as_ci without having to type the full names: - utf8mb4_uca1400_as_ci for the column "a", or - utf16_uca1400_as_ci for the column "b". The full name will be detected automatically according to the character set effective in the given context. h2. SHOW CREATE statements {{SHOW CREATE}} statements will display long collations names. For example, {{SHOW CREATE}} for the table created using the CREATE TABLE statement will produce the following output: {code:sql} SHOW CREATE TABLE t1; {code} {noformat} +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Table | Create Table | +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | t1 | CREATE TABLE `t1` ( `a` char(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_as_ci DEFAULT NULL, `b` char(10) CHARACTER SET utf16 COLLATE utf16_uca1400_as_ci DEFAULT NULL ) ENGINE=InnoDB DEFAULT CHARSET=latin1 | +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ {noformat} Notice, the COLLATE clause will contain the full collation name, including the character set prefix. But this will possibly change in later versions to display short collation names. h2. INFORMATION_SCHEMA.COLLATIONS changes The following columns in INFORMATION_SCHEMA.COLLATIONS will be changed to be NULL-able: - CHARACTER_SET_NAME - ID - IS_DEFAULT All new UCA1400 collations will be displayed as follows: - The COLLATION_NAME column will display the short name, without character set prefix. - The three NULL-able column mentioned above will display NULL For old collations nothing will change. For example, the output from this query: {code:sql} SELECT * FROM INFORMATION_SCHEMA.COLLATIONS WHERE COLLATION_NAME LIKE '%czech%'; {code} will look like this: {noformat} +---------------------------+--------------------+------+------------+-------------+---------+ | COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN | +---------------------------+--------------------+------+------------+-------------+---------+ | latin2_czech_cs | latin2 | 2 | | Yes | 4 | | cp1250_czech_cs | cp1250 | 34 | | Yes | 2 | | utf8mb3_czech_ci | utf8mb3 | 202 | | Yes | 8 | | ucs2_czech_ci | ucs2 | 138 | | Yes | 8 | | utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 | | uca1400_czech_ai_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_ai_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_as_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_as_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_ai_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_ai_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_as_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_as_cs | NULL | NULL | NULL | Yes | 8 | | utf16_czech_ci | utf16 | 111 | | Yes | 8 | | utf32_czech_ci | utf32 | 170 | | Yes | 8 | +---------------------------+--------------------+------+------------+-------------+---------+ {noformat} The idea is that short collation names will be applicable to multiple character sets. We'll eventually remove columns CHARACTER_SET_NAME, ID, IS_DEFAULT from the table COLLATIONS. We'll keep them for now only for easier upgrades. h2. INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY changes In addition to columns COLLATION_NAME and CHARACTER_SET_NAME currently existing in the table INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY, three new columns will be added: - FULL_COLLATION_NAME - ID - IS_DEFAULT So the new structure for INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY will look like this: {code:sql} CREATE TEMPORARY TABLE `COLLATION_CHARACTER_SET_APPLICABILITY` ( `COLLATION_NAME` varchar(64) NOT NULL, `CHARACTER_SET_NAME` varchar(32) NOT NULL, `FULL_COLLATION_NAME` varchar(64) NOT NULL, `ID` bigint(11) NOT NULL, `IS_DEFAULT` varchar(3) NOT NULL ) ENGINE=MEMORY DEFAULT CHARSET=utf8mb3; {code} The column COLLATION_NAME will display: - short names, without the character set name prefix, for new UCA1400 collations. - long names for all old collations So for example, the output from this query: {code:sql} SELECT * FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY WHERE COLLATION_NAME LIKE '%czech%cs'; {code} will look about like this: {noformat} +---------------------------+--------------------+-----------------------------------+------+------------+ | COLLATION_NAME | CHARACTER_SET_NAME | FULL_COLLATION_NAME | ID | IS_DEFAULT | +---------------------------+--------------------+-----------------------------------+------+------------+ | latin2_czech_cs | latin2 | latin2_czech_cs | 2 | | | cp1250_czech_cs | cp1250 | cp1250_czech_cs | 34 | | | uca1400_czech_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_ai_cs | 2129 | | | uca1400_czech_as_cs | utf8mb3 | utf8mb3_uca1400_czech_as_cs | 2131 | | | uca1400_czech_nopad_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_ai_cs | 2133 | | | uca1400_czech_nopad_as_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_as_cs | 2135 | | | uca1400_czech_ai_cs | ucs2 | ucs2_uca1400_czech_ai_cs | 2641 | | | uca1400_czech_as_cs | ucs2 | ucs2_uca1400_czech_as_cs | 2643 | | | uca1400_czech_nopad_ai_cs | ucs2 | ucs2_uca1400_czech_nopad_ai_cs | 2645 | | | uca1400_czech_nopad_as_cs | ucs2 | ucs2_uca1400_czech_nopad_as_cs | 2647 | | | uca1400_czech_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_ai_cs | 2385 | | | uca1400_czech_as_cs | utf8mb4 | utf8mb4_uca1400_czech_as_cs | 2387 | | | uca1400_czech_nopad_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_ai_cs | 2389 | | | uca1400_czech_nopad_as_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_as_cs | 2391 | | | uca1400_czech_ai_cs | utf16 | utf16_uca1400_czech_ai_cs | 2897 | | | uca1400_czech_as_cs | utf16 | utf16_uca1400_czech_as_cs | 2899 | | | uca1400_czech_nopad_ai_cs | utf16 | utf16_uca1400_czech_nopad_ai_cs | 2901 | | | uca1400_czech_nopad_as_cs | utf16 | utf16_uca1400_czech_nopad_as_cs | 2903 | | | uca1400_czech_ai_cs | utf32 | utf32_uca1400_czech_ai_cs | 3153 | | | uca1400_czech_as_cs | utf32 | utf32_uca1400_czech_as_cs | 3155 | | | uca1400_czech_nopad_ai_cs | utf32 | utf32_uca1400_czech_nopad_ai_cs | 3157 | | | uca1400_czech_nopad_as_cs | utf32 | utf32_uca1400_czech_nopad_as_cs | 3159 | | +---------------------------+--------------------+-----------------------------------+------+------------+ {noformat} h2. New ID ranges We'll use the range 2048-4095 for new UCA-14.0.0 collation. The ID will encode: - character set - tailoring - flags into 12 bits as follows: {noformat} 1ccc tttt tPST {noformat} where {noformat} - ccc (3 bits) - the character set ID (0 - utf8mb3, 1 - utf8mb4, 2 - ucs2, 3 - utf16, 4 - utf32) - ttttt (5 bits) - the tailoring ID (0 - default, 1 - icelandic, 2 - latvian, etc) - P (1 bit) - the PAD flag (0 - PAD, 1 - NOPAD) - S (1 bit) - Secondary level 0 - disabled (accent insensitive) 1 - enabled (accent sensitive) - T (1 bit) - Tertiary level 0 - disabled (case insensitive) 1 - enabled (case sensitive) {noformat} |
In order to make utf8mb4 the default character set (see utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other. In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table. Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old. Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version. h2. Character sets New collations will be added for these Unicode character sets - utf8mb3 - utf8mb4 - ucs2 - utf16 - utf32 h2. Tailoring (language specific collations) We'll add collations for all tailorings that exists for the old UCA-4.0.0 collations, except Thai: - icelandic - latvian - romanian - slovenian - polish - estonian - spanish - swedish - turkish - czech - danish - lithuanian - slovak - spanish2 - roman - persian - esperanto - hungarian - sinhala - german2 - croatian - vietnamese Note, with built-in contractions that will be supported in UCA-14.0.0 collations, a separate Thai collation won't be needed. Thai will work with the default root collation. h2. Built-in contractions New collations will include built-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt There are 939 built-in contractions in UCA-14.0.0. To mention a few of them: {noformat} 0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI> 0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI> 0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI> 0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI> 0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI> {noformat} Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built-in contractions. This will be a step towards better Unicode Collation Algorithm compliance. With built-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation. h2. Performance The patch adding UCA-14.0.0 collations will be pushed together with these patches improving the performance: - - h2. Customization flags Under terms of this task we'll add support for the following flags - pad/nopad - accent sensitivity - case sensititivy Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available: {noformat} +---------------------------+ | collation_name | +---------------------------+ | uca1400_czech_ai_ci | | uca1400_czech_ai_cs | | uca1400_czech_as_ci | | uca1400_czech_as_cs | | uca1400_czech_nopad_ai_ci | | uca1400_czech_nopad_ai_cs | | uca1400_czech_nopad_as_ci | | uca1400_czech_nopad_as_cs | +---------------------------+ {noformat} h2. Naming convention Naming convention for the new collations is described in MDEV-27210. h2. Short collation name The character set name in the collation name will be optional in all syntactic constructs: {code:sql} CREATE TABLE t1 ( a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci, b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci ); {code} Notice, it will be enough to specify just uca1400_as_ci without having to type the full names: - utf8mb4_uca1400_as_ci for the column "a", or - utf16_uca1400_as_ci for the column "b". The full name will be detected automatically according to the character set effective in the given context. h2. SHOW CREATE statements {{SHOW CREATE}} statements will display long collations names. For example, {{SHOW CREATE}} for the table created using the CREATE TABLE statement will produce the following output: {code:sql} SHOW CREATE TABLE t1; {code} {noformat} +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Table | Create Table | +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | t1 | CREATE TABLE `t1` ( `a` char(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_as_ci DEFAULT NULL, `b` char(10) CHARACTER SET utf16 COLLATE utf16_uca1400_as_ci DEFAULT NULL ) ENGINE=InnoDB DEFAULT CHARSET=latin1 | +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ {noformat} Notice, the COLLATE clause will contain the full collation name, including the character set prefix. But this will possibly change in later versions to display short collation names. h2. INFORMATION_SCHEMA.COLLATIONS changes The following columns in INFORMATION_SCHEMA.COLLATIONS will be changed to be NULL-able: - CHARACTER_SET_NAME - ID - IS_DEFAULT All new UCA1400 collations will be displayed as follows: - The COLLATION_NAME column will display the short name, without character set prefix. - The three NULL-able column mentioned above will display NULL For old collations nothing will change. For example, the output from this query: {code:sql} SELECT * FROM INFORMATION_SCHEMA.COLLATIONS WHERE COLLATION_NAME LIKE '%czech%'; {code} will look like this: {noformat} +---------------------------+--------------------+------+------------+-------------+---------+ | COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN | +---------------------------+--------------------+------+------------+-------------+---------+ | latin2_czech_cs | latin2 | 2 | | Yes | 4 | | cp1250_czech_cs | cp1250 | 34 | | Yes | 2 | | utf8mb3_czech_ci | utf8mb3 | 202 | | Yes | 8 | | ucs2_czech_ci | ucs2 | 138 | | Yes | 8 | | utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 | | uca1400_czech_ai_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_ai_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_as_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_as_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_ai_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_ai_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_as_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_as_cs | NULL | NULL | NULL | Yes | 8 | | utf16_czech_ci | utf16 | 111 | | Yes | 8 | | utf32_czech_ci | utf32 | 170 | | Yes | 8 | +---------------------------+--------------------+------+------------+-------------+---------+ {noformat} The idea is that short collation names will be applicable to multiple character sets. We'll eventually remove columns CHARACTER_SET_NAME, ID, IS_DEFAULT from the table COLLATIONS. We'll keep them for now only for easier upgrades. h2. INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY changes In addition to columns COLLATION_NAME and CHARACTER_SET_NAME currently existing in the table INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY, three new columns will be added: - FULL_COLLATION_NAME - ID - IS_DEFAULT So the new structure for INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY will look like this: {code:sql} CREATE TEMPORARY TABLE `COLLATION_CHARACTER_SET_APPLICABILITY` ( `COLLATION_NAME` varchar(64) NOT NULL, `CHARACTER_SET_NAME` varchar(32) NOT NULL, `FULL_COLLATION_NAME` varchar(64) NOT NULL, `ID` bigint(11) NOT NULL, `IS_DEFAULT` varchar(3) NOT NULL ) ENGINE=MEMORY DEFAULT CHARSET=utf8mb3; {code} The column COLLATION_NAME will display: - short names, without the character set name prefix, for new UCA1400 collations. - long names for all old collations So for example, the output from this query: {code:sql} SELECT * FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY WHERE COLLATION_NAME LIKE '%czech%cs'; {code} will look about like this: {noformat} +---------------------------+--------------------+-----------------------------------+------+------------+ | COLLATION_NAME | CHARACTER_SET_NAME | FULL_COLLATION_NAME | ID | IS_DEFAULT | +---------------------------+--------------------+-----------------------------------+------+------------+ | latin2_czech_cs | latin2 | latin2_czech_cs | 2 | | | cp1250_czech_cs | cp1250 | cp1250_czech_cs | 34 | | | uca1400_czech_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_ai_cs | 2129 | | | uca1400_czech_as_cs | utf8mb3 | utf8mb3_uca1400_czech_as_cs | 2131 | | | uca1400_czech_nopad_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_ai_cs | 2133 | | | uca1400_czech_nopad_as_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_as_cs | 2135 | | | uca1400_czech_ai_cs | ucs2 | ucs2_uca1400_czech_ai_cs | 2641 | | | uca1400_czech_as_cs | ucs2 | ucs2_uca1400_czech_as_cs | 2643 | | | uca1400_czech_nopad_ai_cs | ucs2 | ucs2_uca1400_czech_nopad_ai_cs | 2645 | | | uca1400_czech_nopad_as_cs | ucs2 | ucs2_uca1400_czech_nopad_as_cs | 2647 | | | uca1400_czech_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_ai_cs | 2385 | | | uca1400_czech_as_cs | utf8mb4 | utf8mb4_uca1400_czech_as_cs | 2387 | | | uca1400_czech_nopad_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_ai_cs | 2389 | | | uca1400_czech_nopad_as_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_as_cs | 2391 | | | uca1400_czech_ai_cs | utf16 | utf16_uca1400_czech_ai_cs | 2897 | | | uca1400_czech_as_cs | utf16 | utf16_uca1400_czech_as_cs | 2899 | | | uca1400_czech_nopad_ai_cs | utf16 | utf16_uca1400_czech_nopad_ai_cs | 2901 | | | uca1400_czech_nopad_as_cs | utf16 | utf16_uca1400_czech_nopad_as_cs | 2903 | | | uca1400_czech_ai_cs | utf32 | utf32_uca1400_czech_ai_cs | 3153 | | | uca1400_czech_as_cs | utf32 | utf32_uca1400_czech_as_cs | 3155 | | | uca1400_czech_nopad_ai_cs | utf32 | utf32_uca1400_czech_nopad_ai_cs | 3157 | | | uca1400_czech_nopad_as_cs | utf32 | utf32_uca1400_czech_nopad_as_cs | 3159 | | +---------------------------+--------------------+-----------------------------------+------+------------+ {noformat} h2. New ID ranges We'll use the range 2048-4095 for new UCA-14.0.0 collation. The ID will encode: - character set - tailoring - flags into 12 bits as follows: {noformat} 1ccc tttt tPST {noformat} where {noformat} - ccc (3 bits) - the character set ID (0 - utf8mb3, 1 - utf8mb4, 2 - ucs2, 3 - utf16, 4 - utf32) - ttttt (5 bits) - the tailoring ID (0 - default, 1 - icelandic, 2 - latvian, etc) - P (1 bit) - the PAD flag (0 - PAD, 1 - NOPAD) - S (1 bit) - Secondary level 0 - disabled (accent insensitive) 1 - enabled (accent sensitive) - T (1 bit) - Tertiary level 0 - disabled (case insensitive) 1 - enabled (case sensitive) {noformat} |
Description |
In order to make utf8mb4 the default character set (see utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other. In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table. Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old. Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version. h2. Character sets New collations will be added for these Unicode character sets - utf8mb3 - utf8mb4 - ucs2 - utf16 - utf32 h2. Tailoring (language specific collations) We'll add collations for all tailorings that exists for the old UCA-4.0.0 collations, except Thai: - icelandic - latvian - romanian - slovenian - polish - estonian - spanish - swedish - turkish - czech - danish - lithuanian - slovak - spanish2 - roman - persian - esperanto - hungarian - sinhala - german2 - croatian - vietnamese Note, with built-in contractions that will be supported in UCA-14.0.0 collations, a separate Thai collation won't be needed. Thai will work with the default root collation. h2. Built-in contractions New collations will include built-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt There are 939 built-in contractions in UCA-14.0.0. To mention a few of them: {noformat} 0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI> 0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI> 0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI> 0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI> 0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI> {noformat} Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built-in contractions. This will be a step towards better Unicode Collation Algorithm compliance. With built-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation. h2. Performance The patch adding UCA-14.0.0 collations will be pushed together with these patches improving the performance: - - h2. Customization flags Under terms of this task we'll add support for the following flags - pad/nopad - accent sensitivity - case sensititivy Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available: {noformat} +---------------------------+ | collation_name | +---------------------------+ | uca1400_czech_ai_ci | | uca1400_czech_ai_cs | | uca1400_czech_as_ci | | uca1400_czech_as_cs | | uca1400_czech_nopad_ai_ci | | uca1400_czech_nopad_ai_cs | | uca1400_czech_nopad_as_ci | | uca1400_czech_nopad_as_cs | +---------------------------+ {noformat} h2. Naming convention Naming convention for the new collations is described in MDEV-27210. h2. Short collation name The character set name in the collation name will be optional in all syntactic constructs: {code:sql} CREATE TABLE t1 ( a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci, b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci ); {code} Notice, it will be enough to specify just uca1400_as_ci without having to type the full names: - utf8mb4_uca1400_as_ci for the column "a", or - utf16_uca1400_as_ci for the column "b". The full name will be detected automatically according to the character set effective in the given context. h2. SHOW CREATE statements {{SHOW CREATE}} statements will display long collations names. For example, {{SHOW CREATE}} for the table created using the CREATE TABLE statement will produce the following output: {code:sql} SHOW CREATE TABLE t1; {code} {noformat} +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Table | Create Table | +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | t1 | CREATE TABLE `t1` ( `a` char(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_as_ci DEFAULT NULL, `b` char(10) CHARACTER SET utf16 COLLATE utf16_uca1400_as_ci DEFAULT NULL ) ENGINE=InnoDB DEFAULT CHARSET=latin1 | +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ {noformat} Notice, the COLLATE clause will contain the full collation name, including the character set prefix. But this will possibly change in later versions to display short collation names. h2. INFORMATION_SCHEMA.COLLATIONS changes The following columns in INFORMATION_SCHEMA.COLLATIONS will be changed to be NULL-able: - CHARACTER_SET_NAME - ID - IS_DEFAULT All new UCA1400 collations will be displayed as follows: - The COLLATION_NAME column will display the short name, without character set prefix. - The three NULL-able column mentioned above will display NULL For old collations nothing will change. For example, the output from this query: {code:sql} SELECT * FROM INFORMATION_SCHEMA.COLLATIONS WHERE COLLATION_NAME LIKE '%czech%'; {code} will look like this: {noformat} +---------------------------+--------------------+------+------------+-------------+---------+ | COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN | +---------------------------+--------------------+------+------------+-------------+---------+ | latin2_czech_cs | latin2 | 2 | | Yes | 4 | | cp1250_czech_cs | cp1250 | 34 | | Yes | 2 | | utf8mb3_czech_ci | utf8mb3 | 202 | | Yes | 8 | | ucs2_czech_ci | ucs2 | 138 | | Yes | 8 | | utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 | | uca1400_czech_ai_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_ai_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_as_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_as_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_ai_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_ai_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_as_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_as_cs | NULL | NULL | NULL | Yes | 8 | | utf16_czech_ci | utf16 | 111 | | Yes | 8 | | utf32_czech_ci | utf32 | 170 | | Yes | 8 | +---------------------------+--------------------+------+------------+-------------+---------+ {noformat} The idea is that short collation names will be applicable to multiple character sets. We'll eventually remove columns CHARACTER_SET_NAME, ID, IS_DEFAULT from the table COLLATIONS. We'll keep them for now only for easier upgrades. h2. INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY changes In addition to columns COLLATION_NAME and CHARACTER_SET_NAME currently existing in the table INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY, three new columns will be added: - FULL_COLLATION_NAME - ID - IS_DEFAULT So the new structure for INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY will look like this: {code:sql} CREATE TEMPORARY TABLE `COLLATION_CHARACTER_SET_APPLICABILITY` ( `COLLATION_NAME` varchar(64) NOT NULL, `CHARACTER_SET_NAME` varchar(32) NOT NULL, `FULL_COLLATION_NAME` varchar(64) NOT NULL, `ID` bigint(11) NOT NULL, `IS_DEFAULT` varchar(3) NOT NULL ) ENGINE=MEMORY DEFAULT CHARSET=utf8mb3; {code} The column COLLATION_NAME will display: - short names, without the character set name prefix, for new UCA1400 collations. - long names for all old collations So for example, the output from this query: {code:sql} SELECT * FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY WHERE COLLATION_NAME LIKE '%czech%cs'; {code} will look about like this: {noformat} +---------------------------+--------------------+-----------------------------------+------+------------+ | COLLATION_NAME | CHARACTER_SET_NAME | FULL_COLLATION_NAME | ID | IS_DEFAULT | +---------------------------+--------------------+-----------------------------------+------+------------+ | latin2_czech_cs | latin2 | latin2_czech_cs | 2 | | | cp1250_czech_cs | cp1250 | cp1250_czech_cs | 34 | | | uca1400_czech_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_ai_cs | 2129 | | | uca1400_czech_as_cs | utf8mb3 | utf8mb3_uca1400_czech_as_cs | 2131 | | | uca1400_czech_nopad_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_ai_cs | 2133 | | | uca1400_czech_nopad_as_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_as_cs | 2135 | | | uca1400_czech_ai_cs | ucs2 | ucs2_uca1400_czech_ai_cs | 2641 | | | uca1400_czech_as_cs | ucs2 | ucs2_uca1400_czech_as_cs | 2643 | | | uca1400_czech_nopad_ai_cs | ucs2 | ucs2_uca1400_czech_nopad_ai_cs | 2645 | | | uca1400_czech_nopad_as_cs | ucs2 | ucs2_uca1400_czech_nopad_as_cs | 2647 | | | uca1400_czech_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_ai_cs | 2385 | | | uca1400_czech_as_cs | utf8mb4 | utf8mb4_uca1400_czech_as_cs | 2387 | | | uca1400_czech_nopad_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_ai_cs | 2389 | | | uca1400_czech_nopad_as_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_as_cs | 2391 | | | uca1400_czech_ai_cs | utf16 | utf16_uca1400_czech_ai_cs | 2897 | | | uca1400_czech_as_cs | utf16 | utf16_uca1400_czech_as_cs | 2899 | | | uca1400_czech_nopad_ai_cs | utf16 | utf16_uca1400_czech_nopad_ai_cs | 2901 | | | uca1400_czech_nopad_as_cs | utf16 | utf16_uca1400_czech_nopad_as_cs | 2903 | | | uca1400_czech_ai_cs | utf32 | utf32_uca1400_czech_ai_cs | 3153 | | | uca1400_czech_as_cs | utf32 | utf32_uca1400_czech_as_cs | 3155 | | | uca1400_czech_nopad_ai_cs | utf32 | utf32_uca1400_czech_nopad_ai_cs | 3157 | | | uca1400_czech_nopad_as_cs | utf32 | utf32_uca1400_czech_nopad_as_cs | 3159 | | +---------------------------+--------------------+-----------------------------------+------+------------+ {noformat} h2. New ID ranges We'll use the range 2048-4095 for new UCA-14.0.0 collation. The ID will encode: - character set - tailoring - flags into 12 bits as follows: {noformat} 1ccc tttt tPST {noformat} where {noformat} - ccc (3 bits) - the character set ID (0 - utf8mb3, 1 - utf8mb4, 2 - ucs2, 3 - utf16, 4 - utf32) - ttttt (5 bits) - the tailoring ID (0 - default, 1 - icelandic, 2 - latvian, etc) - P (1 bit) - the PAD flag (0 - PAD, 1 - NOPAD) - S (1 bit) - Secondary level 0 - disabled (accent insensitive) 1 - enabled (accent sensitive) - T (1 bit) - Tertiary level 0 - disabled (case insensitive) 1 - enabled (case sensitive) {noformat} |
In order to make utf8mb4 the default character set (see utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other. In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table. Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old. Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version. h2. Character sets New collations will be added for these Unicode character sets - utf8mb3 - utf8mb4 - ucs2 - utf16 - utf32 h2. Tailoring (language specific collations) We'll add collations for all 22 tailorings that exists for the old UCA-4.0.0 collations, except Thai. - icelandic - latvian - romanian - slovenian - polish - estonian - spanish - swedish - turkish - czech - danish - lithuanian - slovak - spanish2 - roman - persian - esperanto - hungarian - sinhala - german2 - croatian - vietnamese Note, with built-in contractions that will be supported in UCA-14.0.0 collations, a separate Thai collation won't be needed. Thai will work with the default root collation. h2. Built-in contractions New collations will include built-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt There are 939 built-in contractions in UCA-14.0.0. To mention a few of them: {noformat} 0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI> 0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI> 0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI> 0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI> 0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI> {noformat} Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built-in contractions. This will be a step towards better Unicode Collation Algorithm compliance. With built-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation. h2. Performance The patch adding UCA-14.0.0 collations will be pushed together with these patches improving the performance: - - h2. Customization flags Under terms of this task we'll add support for the following flags - pad/nopad - accent sensitivity - case sensititivy Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available: {noformat} +---------------------------+ | collation_name | +---------------------------+ | uca1400_czech_ai_ci | | uca1400_czech_ai_cs | | uca1400_czech_as_ci | | uca1400_czech_as_cs | | uca1400_czech_nopad_ai_ci | | uca1400_czech_nopad_ai_cs | | uca1400_czech_nopad_as_ci | | uca1400_czech_nopad_as_cs | +---------------------------+ {noformat} h2. Naming convention Naming convention for the new collations is described in MDEV-27210. h2. Short collation name The character set name in the collation name will be optional in all syntactic constructs: {code:sql} CREATE TABLE t1 ( a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci, b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci ); {code} Notice, it will be enough to specify just uca1400_as_ci without having to type the full names: - utf8mb4_uca1400_as_ci for the column "a", or - utf16_uca1400_as_ci for the column "b". The full name will be detected automatically according to the character set effective in the given context. h2. SHOW CREATE statements {{SHOW CREATE}} statements will display long collations names. For example, {{SHOW CREATE}} for the table created using the CREATE TABLE statement will produce the following output: {code:sql} SHOW CREATE TABLE t1; {code} {noformat} +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Table | Create Table | +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | t1 | CREATE TABLE `t1` ( `a` char(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_as_ci DEFAULT NULL, `b` char(10) CHARACTER SET utf16 COLLATE utf16_uca1400_as_ci DEFAULT NULL ) ENGINE=InnoDB DEFAULT CHARSET=latin1 | +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ {noformat} Notice, the COLLATE clause will contain the full collation name, including the character set prefix. But this will possibly change in later versions to display short collation names. h2. INFORMATION_SCHEMA.COLLATIONS changes The following columns in INFORMATION_SCHEMA.COLLATIONS will be changed to be NULL-able: - CHARACTER_SET_NAME - ID - IS_DEFAULT All new UCA1400 collations will be displayed as follows: - The COLLATION_NAME column will display the short name, without character set prefix. - The three NULL-able column mentioned above will display NULL For old collations nothing will change. For example, the output from this query: {code:sql} SELECT * FROM INFORMATION_SCHEMA.COLLATIONS WHERE COLLATION_NAME LIKE '%czech%'; {code} will look like this: {noformat} +---------------------------+--------------------+------+------------+-------------+---------+ | COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN | +---------------------------+--------------------+------+------------+-------------+---------+ | latin2_czech_cs | latin2 | 2 | | Yes | 4 | | cp1250_czech_cs | cp1250 | 34 | | Yes | 2 | | utf8mb3_czech_ci | utf8mb3 | 202 | | Yes | 8 | | ucs2_czech_ci | ucs2 | 138 | | Yes | 8 | | utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 | | uca1400_czech_ai_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_ai_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_as_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_as_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_ai_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_ai_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_as_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_as_cs | NULL | NULL | NULL | Yes | 8 | | utf16_czech_ci | utf16 | 111 | | Yes | 8 | | utf32_czech_ci | utf32 | 170 | | Yes | 8 | +---------------------------+--------------------+------+------------+-------------+---------+ {noformat} The idea is that short collation names will be applicable to multiple character sets. We'll eventually remove columns CHARACTER_SET_NAME, ID, IS_DEFAULT from the table COLLATIONS. We'll keep them for now only for easier upgrades. h2. INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY changes In addition to columns COLLATION_NAME and CHARACTER_SET_NAME currently existing in the table INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY, three new columns will be added: - FULL_COLLATION_NAME - ID - IS_DEFAULT So the new structure for INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY will look like this: {code:sql} CREATE TEMPORARY TABLE `COLLATION_CHARACTER_SET_APPLICABILITY` ( `COLLATION_NAME` varchar(64) NOT NULL, `CHARACTER_SET_NAME` varchar(32) NOT NULL, `FULL_COLLATION_NAME` varchar(64) NOT NULL, `ID` bigint(11) NOT NULL, `IS_DEFAULT` varchar(3) NOT NULL ) ENGINE=MEMORY DEFAULT CHARSET=utf8mb3; {code} The column COLLATION_NAME will display: - short names, without the character set name prefix, for new UCA1400 collations. - long names for all old collations So for example, the output from this query: {code:sql} SELECT * FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY WHERE COLLATION_NAME LIKE '%czech%cs'; {code} will look about like this: {noformat} +---------------------------+--------------------+-----------------------------------+------+------------+ | COLLATION_NAME | CHARACTER_SET_NAME | FULL_COLLATION_NAME | ID | IS_DEFAULT | +---------------------------+--------------------+-----------------------------------+------+------------+ | latin2_czech_cs | latin2 | latin2_czech_cs | 2 | | | cp1250_czech_cs | cp1250 | cp1250_czech_cs | 34 | | | uca1400_czech_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_ai_cs | 2129 | | | uca1400_czech_as_cs | utf8mb3 | utf8mb3_uca1400_czech_as_cs | 2131 | | | uca1400_czech_nopad_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_ai_cs | 2133 | | | uca1400_czech_nopad_as_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_as_cs | 2135 | | | uca1400_czech_ai_cs | ucs2 | ucs2_uca1400_czech_ai_cs | 2641 | | | uca1400_czech_as_cs | ucs2 | ucs2_uca1400_czech_as_cs | 2643 | | | uca1400_czech_nopad_ai_cs | ucs2 | ucs2_uca1400_czech_nopad_ai_cs | 2645 | | | uca1400_czech_nopad_as_cs | ucs2 | ucs2_uca1400_czech_nopad_as_cs | 2647 | | | uca1400_czech_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_ai_cs | 2385 | | | uca1400_czech_as_cs | utf8mb4 | utf8mb4_uca1400_czech_as_cs | 2387 | | | uca1400_czech_nopad_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_ai_cs | 2389 | | | uca1400_czech_nopad_as_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_as_cs | 2391 | | | uca1400_czech_ai_cs | utf16 | utf16_uca1400_czech_ai_cs | 2897 | | | uca1400_czech_as_cs | utf16 | utf16_uca1400_czech_as_cs | 2899 | | | uca1400_czech_nopad_ai_cs | utf16 | utf16_uca1400_czech_nopad_ai_cs | 2901 | | | uca1400_czech_nopad_as_cs | utf16 | utf16_uca1400_czech_nopad_as_cs | 2903 | | | uca1400_czech_ai_cs | utf32 | utf32_uca1400_czech_ai_cs | 3153 | | | uca1400_czech_as_cs | utf32 | utf32_uca1400_czech_as_cs | 3155 | | | uca1400_czech_nopad_ai_cs | utf32 | utf32_uca1400_czech_nopad_ai_cs | 3157 | | | uca1400_czech_nopad_as_cs | utf32 | utf32_uca1400_czech_nopad_as_cs | 3159 | | +---------------------------+--------------------+-----------------------------------+------+------------+ {noformat} h2. New ID ranges We'll use the range 2048-4095 for new UCA-14.0.0 collation. The ID will encode: - character set - tailoring - flags into 12 bits as follows: {noformat} 1ccc tttt tPST {noformat} where {noformat} - ccc (3 bits) - the character set ID (0 - utf8mb3, 1 - utf8mb4, 2 - ucs2, 3 - utf16, 4 - utf32) - ttttt (5 bits) - the tailoring ID (0 - default, 1 - icelandic, 2 - latvian, etc) - P (1 bit) - the PAD flag (0 - PAD, 1 - NOPAD) - S (1 bit) - Secondary level 0 - disabled (accent insensitive) 1 - enabled (accent sensitive) - T (1 bit) - Tertiary level 0 - disabled (case insensitive) 1 - enabled (case sensitive) {noformat} |
Fix Version/s | 10.10 [ 27530 ] | |
Fix Version/s | 10.9 [ 26905 ] |
Assignee | Sergei Golubchik [ serg ] | Alexander Barkov [ bar ] |
Status | In Review [ 10002 ] | Stalled [ 10000 ] |
Description |
In order to make utf8mb4 the default character set (see utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other. In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table. Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old. Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version. h2. Character sets New collations will be added for these Unicode character sets - utf8mb3 - utf8mb4 - ucs2 - utf16 - utf32 h2. Tailoring (language specific collations) We'll add collations for all 22 tailorings that exists for the old UCA-4.0.0 collations, except Thai. - icelandic - latvian - romanian - slovenian - polish - estonian - spanish - swedish - turkish - czech - danish - lithuanian - slovak - spanish2 - roman - persian - esperanto - hungarian - sinhala - german2 - croatian - vietnamese Note, with built-in contractions that will be supported in UCA-14.0.0 collations, a separate Thai collation won't be needed. Thai will work with the default root collation. h2. Built-in contractions New collations will include built-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt There are 939 built-in contractions in UCA-14.0.0. To mention a few of them: {noformat} 0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI> 0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI> 0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI> 0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI> 0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI> {noformat} Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built-in contractions. This will be a step towards better Unicode Collation Algorithm compliance. With built-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation. h2. Performance The patch adding UCA-14.0.0 collations will be pushed together with these patches improving the performance: - - h2. Customization flags Under terms of this task we'll add support for the following flags - pad/nopad - accent sensitivity - case sensititivy Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available: {noformat} +---------------------------+ | collation_name | +---------------------------+ | uca1400_czech_ai_ci | | uca1400_czech_ai_cs | | uca1400_czech_as_ci | | uca1400_czech_as_cs | | uca1400_czech_nopad_ai_ci | | uca1400_czech_nopad_ai_cs | | uca1400_czech_nopad_as_ci | | uca1400_czech_nopad_as_cs | +---------------------------+ {noformat} h2. Naming convention Naming convention for the new collations is described in MDEV-27210. h2. Short collation name The character set name in the collation name will be optional in all syntactic constructs: {code:sql} CREATE TABLE t1 ( a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci, b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci ); {code} Notice, it will be enough to specify just uca1400_as_ci without having to type the full names: - utf8mb4_uca1400_as_ci for the column "a", or - utf16_uca1400_as_ci for the column "b". The full name will be detected automatically according to the character set effective in the given context. h2. SHOW CREATE statements {{SHOW CREATE}} statements will display long collations names. For example, {{SHOW CREATE}} for the table created using the CREATE TABLE statement will produce the following output: {code:sql} SHOW CREATE TABLE t1; {code} {noformat} +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Table | Create Table | +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | t1 | CREATE TABLE `t1` ( `a` char(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_as_ci DEFAULT NULL, `b` char(10) CHARACTER SET utf16 COLLATE utf16_uca1400_as_ci DEFAULT NULL ) ENGINE=InnoDB DEFAULT CHARSET=latin1 | +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ {noformat} Notice, the COLLATE clause will contain the full collation name, including the character set prefix. But this will possibly change in later versions to display short collation names. h2. INFORMATION_SCHEMA.COLLATIONS changes The following columns in INFORMATION_SCHEMA.COLLATIONS will be changed to be NULL-able: - CHARACTER_SET_NAME - ID - IS_DEFAULT All new UCA1400 collations will be displayed as follows: - The COLLATION_NAME column will display the short name, without character set prefix. - The three NULL-able column mentioned above will display NULL For old collations nothing will change. For example, the output from this query: {code:sql} SELECT * FROM INFORMATION_SCHEMA.COLLATIONS WHERE COLLATION_NAME LIKE '%czech%'; {code} will look like this: {noformat} +---------------------------+--------------------+------+------------+-------------+---------+ | COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN | +---------------------------+--------------------+------+------------+-------------+---------+ | latin2_czech_cs | latin2 | 2 | | Yes | 4 | | cp1250_czech_cs | cp1250 | 34 | | Yes | 2 | | utf8mb3_czech_ci | utf8mb3 | 202 | | Yes | 8 | | ucs2_czech_ci | ucs2 | 138 | | Yes | 8 | | utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 | | uca1400_czech_ai_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_ai_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_as_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_as_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_ai_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_ai_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_as_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_as_cs | NULL | NULL | NULL | Yes | 8 | | utf16_czech_ci | utf16 | 111 | | Yes | 8 | | utf32_czech_ci | utf32 | 170 | | Yes | 8 | +---------------------------+--------------------+------+------------+-------------+---------+ {noformat} The idea is that short collation names will be applicable to multiple character sets. We'll eventually remove columns CHARACTER_SET_NAME, ID, IS_DEFAULT from the table COLLATIONS. We'll keep them for now only for easier upgrades. h2. INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY changes In addition to columns COLLATION_NAME and CHARACTER_SET_NAME currently existing in the table INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY, three new columns will be added: - FULL_COLLATION_NAME - ID - IS_DEFAULT So the new structure for INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY will look like this: {code:sql} CREATE TEMPORARY TABLE `COLLATION_CHARACTER_SET_APPLICABILITY` ( `COLLATION_NAME` varchar(64) NOT NULL, `CHARACTER_SET_NAME` varchar(32) NOT NULL, `FULL_COLLATION_NAME` varchar(64) NOT NULL, `ID` bigint(11) NOT NULL, `IS_DEFAULT` varchar(3) NOT NULL ) ENGINE=MEMORY DEFAULT CHARSET=utf8mb3; {code} The column COLLATION_NAME will display: - short names, without the character set name prefix, for new UCA1400 collations. - long names for all old collations So for example, the output from this query: {code:sql} SELECT * FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY WHERE COLLATION_NAME LIKE '%czech%cs'; {code} will look about like this: {noformat} +---------------------------+--------------------+-----------------------------------+------+------------+ | COLLATION_NAME | CHARACTER_SET_NAME | FULL_COLLATION_NAME | ID | IS_DEFAULT | +---------------------------+--------------------+-----------------------------------+------+------------+ | latin2_czech_cs | latin2 | latin2_czech_cs | 2 | | | cp1250_czech_cs | cp1250 | cp1250_czech_cs | 34 | | | uca1400_czech_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_ai_cs | 2129 | | | uca1400_czech_as_cs | utf8mb3 | utf8mb3_uca1400_czech_as_cs | 2131 | | | uca1400_czech_nopad_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_ai_cs | 2133 | | | uca1400_czech_nopad_as_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_as_cs | 2135 | | | uca1400_czech_ai_cs | ucs2 | ucs2_uca1400_czech_ai_cs | 2641 | | | uca1400_czech_as_cs | ucs2 | ucs2_uca1400_czech_as_cs | 2643 | | | uca1400_czech_nopad_ai_cs | ucs2 | ucs2_uca1400_czech_nopad_ai_cs | 2645 | | | uca1400_czech_nopad_as_cs | ucs2 | ucs2_uca1400_czech_nopad_as_cs | 2647 | | | uca1400_czech_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_ai_cs | 2385 | | | uca1400_czech_as_cs | utf8mb4 | utf8mb4_uca1400_czech_as_cs | 2387 | | | uca1400_czech_nopad_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_ai_cs | 2389 | | | uca1400_czech_nopad_as_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_as_cs | 2391 | | | uca1400_czech_ai_cs | utf16 | utf16_uca1400_czech_ai_cs | 2897 | | | uca1400_czech_as_cs | utf16 | utf16_uca1400_czech_as_cs | 2899 | | | uca1400_czech_nopad_ai_cs | utf16 | utf16_uca1400_czech_nopad_ai_cs | 2901 | | | uca1400_czech_nopad_as_cs | utf16 | utf16_uca1400_czech_nopad_as_cs | 2903 | | | uca1400_czech_ai_cs | utf32 | utf32_uca1400_czech_ai_cs | 3153 | | | uca1400_czech_as_cs | utf32 | utf32_uca1400_czech_as_cs | 3155 | | | uca1400_czech_nopad_ai_cs | utf32 | utf32_uca1400_czech_nopad_ai_cs | 3157 | | | uca1400_czech_nopad_as_cs | utf32 | utf32_uca1400_czech_nopad_as_cs | 3159 | | +---------------------------+--------------------+-----------------------------------+------+------------+ {noformat} h2. New ID ranges We'll use the range 2048-4095 for new UCA-14.0.0 collation. The ID will encode: - character set - tailoring - flags into 12 bits as follows: {noformat} 1ccc tttt tPST {noformat} where {noformat} - ccc (3 bits) - the character set ID (0 - utf8mb3, 1 - utf8mb4, 2 - ucs2, 3 - utf16, 4 - utf32) - ttttt (5 bits) - the tailoring ID (0 - default, 1 - icelandic, 2 - latvian, etc) - P (1 bit) - the PAD flag (0 - PAD, 1 - NOPAD) - S (1 bit) - Secondary level 0 - disabled (accent insensitive) 1 - enabled (accent sensitive) - T (1 bit) - Tertiary level 0 - disabled (case insensitive) 1 - enabled (case sensitive) {noformat} |
In order to make utf8mb4 the default character set (see MDEV\-19123), we need a reasonable default collation for utf8mb4. utf8mb4_general_ci is not good \- it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other. In 2021 it's not acceptable \- there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table. Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520\_ci (Unicode-5.2.0 based) is not reasonable either \- these standards are more than 10 years old. Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version. h2. Character sets New collations will be added for these Unicode character sets \- utf8mb3 \- utf8mb4 \- ucs2 \- utf16 \- utf32 h2. Tailoring (language specific collations) We'll add collations for all 22 tailorings that exists for the old UCA\-4.0.0 collations, except Thai. \- icelandic \- latvian \- romanian \- slovenian \- polish \- estonian \- spanish \- swedish \- turkish \- czech \- danish \- lithuanian \- slovak \- spanish2 \- roman \- persian \- esperanto \- hungarian \- sinhala \- german2 \- croatian \- vietnamese Note, with built-in contractions that will be supported in UCA-14.0.0 collations, a separate Thai collation won't be needed. Thai will work with the default root collation. h2. Built\-in contractions New collations will include built\-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt There are 939 built-in contractions in UCA-14.0.0. To mention a few of them: {noformat} 0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI> 0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI> 0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI> 0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI> 0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI> {noformat} Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built\-in contractions. This will be a step towards better Unicode Collation Algorithm compliance. With built\-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation. h2. Performance The patch adding UCA\-14.0.0 collations will be pushed together with these patches improving the performance: - - h2. Customization flags Under terms of this task we'll add support for the following flags \- pad/nopad \- accent sensitivity \- case sensititivy Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available: {noformat} +---------------------------+ | collation_name | +---------------------------+ | uca1400_czech_ai_ci | | uca1400_czech_ai_cs | | uca1400_czech_as_ci | | uca1400_czech_as_cs | | uca1400_czech_nopad_ai_ci | | uca1400_czech_nopad_ai_cs | | uca1400_czech_nopad_as_ci | | uca1400_czech_nopad_as_cs | +---------------------------+ {noformat} h2. Naming convention Naming convention for the new collations is described in MDEV\-27210. h2. Short collation name The character set name in the collation name will be optional in all syntactic constructs: {code:sql} CREATE TABLE t1 ( a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci, b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci ); {code} Notice, it will be enough to specify just uca1400_as_ci without having to type the full names: \- utf8mb4_uca1400_as\_ci for the column "a", or \- utf16_uca1400_as\_ci for the column "b". The full name will be detected automatically according to the character set effective in the given context. h2. SHOW CREATE statements {{SHOW CREATE}} statements will display long collations names. For example, {{SHOW CREATE}} for the table created using the CREATE TABLE statement will produce the following output: {code:sql} SHOW CREATE TABLE t1; {code} {noformat} +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Table | Create Table | +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | t1 | CREATE TABLE `t1` ( `a` char(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_as_ci DEFAULT NULL, `b` char(10) CHARACTER SET utf16 COLLATE utf16_uca1400_as_ci DEFAULT NULL ) ENGINE=InnoDB DEFAULT CHARSET=latin1 | +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ {noformat} Notice, the COLLATE clause will contain the full collation name, including the character set prefix. But this will possibly change in later versions to display short collation names. h2. INFORMATION\_SCHEMA.COLLATIONS changes The following columns in INFORMATION\_SCHEMA.COLLATIONS will be changed to be NULL\-able: \- CHARACTER_SET_NAME \- ID \- IS\_DEFAULT All new UCA1400 collations will be displayed as follows: \- The COLLATION\_NAME column will display the short name, without character set prefix. - The three NULL-able column mentioned above will display NULL For old collations nothing will change. For example, the output from this query: {code:sql} SELECT * FROM INFORMATION_SCHEMA.COLLATIONS WHERE COLLATION_NAME LIKE '%czech%'; {code} will look like this: {noformat} +---------------------------+--------------------+------+------------+-------------+---------+ | COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN | +---------------------------+--------------------+------+------------+-------------+---------+ | latin2_czech_cs | latin2 | 2 | | Yes | 4 | | cp1250_czech_cs | cp1250 | 34 | | Yes | 2 | | utf8mb3_czech_ci | utf8mb3 | 202 | | Yes | 8 | | ucs2_czech_ci | ucs2 | 138 | | Yes | 8 | | utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 | | uca1400_czech_ai_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_ai_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_as_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_as_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_ai_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_ai_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_as_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_as_cs | NULL | NULL | NULL | Yes | 8 | | utf16_czech_ci | utf16 | 111 | | Yes | 8 | | utf32_czech_ci | utf32 | 170 | | Yes | 8 | +---------------------------+--------------------+------+------------+-------------+---------+ {noformat} The idea is that short collation names will be applicable to multiple character sets. We'll eventually remove columns CHARACTER_SET_NAME, ID, IS\_DEFAULT from the table COLLATIONS. We'll keep them for now only for easier upgrades. h2. INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY changes In addition to columns COLLATION_NAME and CHARACTER_SET_NAME currently existing in the table INFORMATION_SCHEMA.COLLATION_CHARACTER_SET\_APPLICABILITY, three new columns will be added: \- FULL_COLLATION_NAME \- ID \- IS\_DEFAULT So the new structure for INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY will look like this: {code:sql} CREATE TEMPORARY TABLE `COLLATION_CHARACTER_SET_APPLICABILITY` ( `COLLATION_NAME` varchar(64) NOT NULL, `CHARACTER_SET_NAME` varchar(32) NOT NULL, `FULL_COLLATION_NAME` varchar(64) NOT NULL, `ID` bigint(11) NOT NULL, `IS_DEFAULT` varchar(3) NOT NULL ) ENGINE=MEMORY DEFAULT CHARSET=utf8mb3; {code} The column COLLATION\_NAME will display: \- short names, without the character set name prefix, for new UCA1400 collations. \- long names for all old collations So for example, the output from this query: {code:sql} SELECT * FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY WHERE COLLATION_NAME LIKE '%czech%cs'; {code} will look about like this: {noformat} +---------------------------+--------------------+-----------------------------------+------+------------+ | COLLATION_NAME | CHARACTER_SET_NAME | FULL_COLLATION_NAME | ID | IS_DEFAULT | +---------------------------+--------------------+-----------------------------------+------+------------+ | latin2_czech_cs | latin2 | latin2_czech_cs | 2 | | | cp1250_czech_cs | cp1250 | cp1250_czech_cs | 34 | | | uca1400_czech_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_ai_cs | 2129 | | | uca1400_czech_as_cs | utf8mb3 | utf8mb3_uca1400_czech_as_cs | 2131 | | | uca1400_czech_nopad_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_ai_cs | 2133 | | | uca1400_czech_nopad_as_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_as_cs | 2135 | | | uca1400_czech_ai_cs | ucs2 | ucs2_uca1400_czech_ai_cs | 2641 | | | uca1400_czech_as_cs | ucs2 | ucs2_uca1400_czech_as_cs | 2643 | | | uca1400_czech_nopad_ai_cs | ucs2 | ucs2_uca1400_czech_nopad_ai_cs | 2645 | | | uca1400_czech_nopad_as_cs | ucs2 | ucs2_uca1400_czech_nopad_as_cs | 2647 | | | uca1400_czech_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_ai_cs | 2385 | | | uca1400_czech_as_cs | utf8mb4 | utf8mb4_uca1400_czech_as_cs | 2387 | | | uca1400_czech_nopad_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_ai_cs | 2389 | | | uca1400_czech_nopad_as_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_as_cs | 2391 | | | uca1400_czech_ai_cs | utf16 | utf16_uca1400_czech_ai_cs | 2897 | | | uca1400_czech_as_cs | utf16 | utf16_uca1400_czech_as_cs | 2899 | | | uca1400_czech_nopad_ai_cs | utf16 | utf16_uca1400_czech_nopad_ai_cs | 2901 | | | uca1400_czech_nopad_as_cs | utf16 | utf16_uca1400_czech_nopad_as_cs | 2903 | | | uca1400_czech_ai_cs | utf32 | utf32_uca1400_czech_ai_cs | 3153 | | | uca1400_czech_as_cs | utf32 | utf32_uca1400_czech_as_cs | 3155 | | | uca1400_czech_nopad_ai_cs | utf32 | utf32_uca1400_czech_nopad_ai_cs | 3157 | | | uca1400_czech_nopad_as_cs | utf32 | utf32_uca1400_czech_nopad_as_cs | 3159 | | +---------------------------+--------------------+-----------------------------------+------+------------+ {noformat} h2. New ID ranges We'll use the range 2048-4095 for new UCA-14.0.0 collation. The ID will encode: \- character set \- tailoring \- flags into 12 bits as follows: {noformat} 1ccc tttt tPST {noformat} where {noformat} - ccc (3 bits) - the character set ID (0 - utf8mb3, 1 - utf8mb4, 2 - ucs2, 3 - utf16, 4 - utf32) - ttttt (5 bits) - the tailoring ID (0 - default, 1 - icelandic, 2 - latvian, etc) - P (1 bit) - the PAD flag (0 - PAD, 1 - NOPAD) - S (1 bit) - Secondary level 0 - disabled (accent insensitive) 1 - enabled (accent sensitive) - T (1 bit) - Tertiary level 0 - disabled (case insensitive) 1 - enabled (case sensitive) {noformat} |
Description |
In order to make utf8mb4 the default character set (see MDEV\-19123), we need a reasonable default collation for utf8mb4. utf8mb4_general_ci is not good \- it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other. In 2021 it's not acceptable \- there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table. Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520\_ci (Unicode-5.2.0 based) is not reasonable either \- these standards are more than 10 years old. Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version. h2. Character sets New collations will be added for these Unicode character sets \- utf8mb3 \- utf8mb4 \- ucs2 \- utf16 \- utf32 h2. Tailoring (language specific collations) We'll add collations for all 22 tailorings that exists for the old UCA\-4.0.0 collations, except Thai. \- icelandic \- latvian \- romanian \- slovenian \- polish \- estonian \- spanish \- swedish \- turkish \- czech \- danish \- lithuanian \- slovak \- spanish2 \- roman \- persian \- esperanto \- hungarian \- sinhala \- german2 \- croatian \- vietnamese Note, with built-in contractions that will be supported in UCA-14.0.0 collations, a separate Thai collation won't be needed. Thai will work with the default root collation. h2. Built\-in contractions New collations will include built\-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt There are 939 built-in contractions in UCA-14.0.0. To mention a few of them: {noformat} 0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI> 0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI> 0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI> 0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI> 0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI> {noformat} Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built\-in contractions. This will be a step towards better Unicode Collation Algorithm compliance. With built\-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation. h2. Performance The patch adding UCA\-14.0.0 collations will be pushed together with these patches improving the performance: - - h2. Customization flags Under terms of this task we'll add support for the following flags \- pad/nopad \- accent sensitivity \- case sensititivy Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available: {noformat} +---------------------------+ | collation_name | +---------------------------+ | uca1400_czech_ai_ci | | uca1400_czech_ai_cs | | uca1400_czech_as_ci | | uca1400_czech_as_cs | | uca1400_czech_nopad_ai_ci | | uca1400_czech_nopad_ai_cs | | uca1400_czech_nopad_as_ci | | uca1400_czech_nopad_as_cs | +---------------------------+ {noformat} h2. Naming convention Naming convention for the new collations is described in MDEV\-27210. h2. Short collation name The character set name in the collation name will be optional in all syntactic constructs: {code:sql} CREATE TABLE t1 ( a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci, b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci ); {code} Notice, it will be enough to specify just uca1400_as_ci without having to type the full names: \- utf8mb4_uca1400_as\_ci for the column "a", or \- utf16_uca1400_as\_ci for the column "b". The full name will be detected automatically according to the character set effective in the given context. h2. SHOW CREATE statements {{SHOW CREATE}} statements will display long collations names. For example, {{SHOW CREATE}} for the table created using the CREATE TABLE statement will produce the following output: {code:sql} SHOW CREATE TABLE t1; {code} {noformat} +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Table | Create Table | +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | t1 | CREATE TABLE `t1` ( `a` char(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_as_ci DEFAULT NULL, `b` char(10) CHARACTER SET utf16 COLLATE utf16_uca1400_as_ci DEFAULT NULL ) ENGINE=InnoDB DEFAULT CHARSET=latin1 | +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ {noformat} Notice, the COLLATE clause will contain the full collation name, including the character set prefix. But this will possibly change in later versions to display short collation names. h2. INFORMATION\_SCHEMA.COLLATIONS changes The following columns in INFORMATION\_SCHEMA.COLLATIONS will be changed to be NULL\-able: \- CHARACTER_SET_NAME \- ID \- IS\_DEFAULT All new UCA1400 collations will be displayed as follows: \- The COLLATION\_NAME column will display the short name, without character set prefix. - The three NULL-able column mentioned above will display NULL For old collations nothing will change. For example, the output from this query: {code:sql} SELECT * FROM INFORMATION_SCHEMA.COLLATIONS WHERE COLLATION_NAME LIKE '%czech%'; {code} will look like this: {noformat} +---------------------------+--------------------+------+------------+-------------+---------+ | COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN | +---------------------------+--------------------+------+------------+-------------+---------+ | latin2_czech_cs | latin2 | 2 | | Yes | 4 | | cp1250_czech_cs | cp1250 | 34 | | Yes | 2 | | utf8mb3_czech_ci | utf8mb3 | 202 | | Yes | 8 | | ucs2_czech_ci | ucs2 | 138 | | Yes | 8 | | utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 | | uca1400_czech_ai_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_ai_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_as_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_as_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_ai_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_ai_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_as_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_as_cs | NULL | NULL | NULL | Yes | 8 | | utf16_czech_ci | utf16 | 111 | | Yes | 8 | | utf32_czech_ci | utf32 | 170 | | Yes | 8 | +---------------------------+--------------------+------+------------+-------------+---------+ {noformat} The idea is that short collation names will be applicable to multiple character sets. We'll eventually remove columns CHARACTER_SET_NAME, ID, IS\_DEFAULT from the table COLLATIONS. We'll keep them for now only for easier upgrades. h2. INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY changes In addition to columns COLLATION_NAME and CHARACTER_SET_NAME currently existing in the table INFORMATION_SCHEMA.COLLATION_CHARACTER_SET\_APPLICABILITY, three new columns will be added: \- FULL_COLLATION_NAME \- ID \- IS\_DEFAULT So the new structure for INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY will look like this: {code:sql} CREATE TEMPORARY TABLE `COLLATION_CHARACTER_SET_APPLICABILITY` ( `COLLATION_NAME` varchar(64) NOT NULL, `CHARACTER_SET_NAME` varchar(32) NOT NULL, `FULL_COLLATION_NAME` varchar(64) NOT NULL, `ID` bigint(11) NOT NULL, `IS_DEFAULT` varchar(3) NOT NULL ) ENGINE=MEMORY DEFAULT CHARSET=utf8mb3; {code} The column COLLATION\_NAME will display: \- short names, without the character set name prefix, for new UCA1400 collations. \- long names for all old collations So for example, the output from this query: {code:sql} SELECT * FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY WHERE COLLATION_NAME LIKE '%czech%cs'; {code} will look about like this: {noformat} +---------------------------+--------------------+-----------------------------------+------+------------+ | COLLATION_NAME | CHARACTER_SET_NAME | FULL_COLLATION_NAME | ID | IS_DEFAULT | +---------------------------+--------------------+-----------------------------------+------+------------+ | latin2_czech_cs | latin2 | latin2_czech_cs | 2 | | | cp1250_czech_cs | cp1250 | cp1250_czech_cs | 34 | | | uca1400_czech_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_ai_cs | 2129 | | | uca1400_czech_as_cs | utf8mb3 | utf8mb3_uca1400_czech_as_cs | 2131 | | | uca1400_czech_nopad_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_ai_cs | 2133 | | | uca1400_czech_nopad_as_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_as_cs | 2135 | | | uca1400_czech_ai_cs | ucs2 | ucs2_uca1400_czech_ai_cs | 2641 | | | uca1400_czech_as_cs | ucs2 | ucs2_uca1400_czech_as_cs | 2643 | | | uca1400_czech_nopad_ai_cs | ucs2 | ucs2_uca1400_czech_nopad_ai_cs | 2645 | | | uca1400_czech_nopad_as_cs | ucs2 | ucs2_uca1400_czech_nopad_as_cs | 2647 | | | uca1400_czech_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_ai_cs | 2385 | | | uca1400_czech_as_cs | utf8mb4 | utf8mb4_uca1400_czech_as_cs | 2387 | | | uca1400_czech_nopad_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_ai_cs | 2389 | | | uca1400_czech_nopad_as_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_as_cs | 2391 | | | uca1400_czech_ai_cs | utf16 | utf16_uca1400_czech_ai_cs | 2897 | | | uca1400_czech_as_cs | utf16 | utf16_uca1400_czech_as_cs | 2899 | | | uca1400_czech_nopad_ai_cs | utf16 | utf16_uca1400_czech_nopad_ai_cs | 2901 | | | uca1400_czech_nopad_as_cs | utf16 | utf16_uca1400_czech_nopad_as_cs | 2903 | | | uca1400_czech_ai_cs | utf32 | utf32_uca1400_czech_ai_cs | 3153 | | | uca1400_czech_as_cs | utf32 | utf32_uca1400_czech_as_cs | 3155 | | | uca1400_czech_nopad_ai_cs | utf32 | utf32_uca1400_czech_nopad_ai_cs | 3157 | | | uca1400_czech_nopad_as_cs | utf32 | utf32_uca1400_czech_nopad_as_cs | 3159 | | +---------------------------+--------------------+-----------------------------------+------+------------+ {noformat} h2. New ID ranges We'll use the range 2048-4095 for new UCA-14.0.0 collation. The ID will encode: \- character set \- tailoring \- flags into 12 bits as follows: {noformat} 1ccc tttt tPST {noformat} where {noformat} - ccc (3 bits) - the character set ID (0 - utf8mb3, 1 - utf8mb4, 2 - ucs2, 3 - utf16, 4 - utf32) - ttttt (5 bits) - the tailoring ID (0 - default, 1 - icelandic, 2 - latvian, etc) - P (1 bit) - the PAD flag (0 - PAD, 1 - NOPAD) - S (1 bit) - Secondary level 0 - disabled (accent insensitive) 1 - enabled (accent sensitive) - T (1 bit) - Tertiary level 0 - disabled (case insensitive) 1 - enabled (case sensitive) {noformat} |
In order to make utf8mb4 the default character set (see utf8mb4_general_ci is not good — it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other. In 2021 it's not acceptable — there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table. Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520\_ci (Unicode-5.2.0 based) is not reasonable either — these standards are more than 10 years old. Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version. h2. Character sets New collations will be added for these Unicode character sets - utf8mb3 - utf8mb4 - ucs2 - utf16 - utf32 h2. Tailoring (language specific collations) We'll add collations for all 22 tailorings that exists for the old UCA-4.0.0 collations, except Thai. - icelandic - latvian - romanian - slovenian - polish - estonian - spanish - swedish - turkish - czech - danish - lithuanian - slovak - spanish2 - roman - persian - esperanto - hungarian - sinhala - german2 - croatian - vietnamese Note, with built-in contractions that will be supported in UCA-14.0.0 collations, a separate Thai collation won't be needed. Thai will work with the default root collation. h2. Built-in contractions New collations will include built\-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt There are 939 built-in contractions in UCA-14.0.0. To mention a few of them: {noformat} 0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI> 0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI> 0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI> 0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI> 0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI> {noformat} Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built\-in contractions. This will be a step towards better Unicode Collation Algorithm compliance. With built\-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation. h2. Performance The patch adding UCA\-14.0.0 collations will be pushed together with these patches improving the performance: - - h2. Customization flags Under terms of this task we'll add support for the following flags - pad/nopad - accent sensitivity - case sensititivy Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available: {noformat} +---------------------------+ | collation_name | +---------------------------+ | uca1400_czech_ai_ci | | uca1400_czech_ai_cs | | uca1400_czech_as_ci | | uca1400_czech_as_cs | | uca1400_czech_nopad_ai_ci | | uca1400_czech_nopad_ai_cs | | uca1400_czech_nopad_as_ci | | uca1400_czech_nopad_as_cs | +---------------------------+ {noformat} h2. Naming convention Naming convention for the new collations is described in MDEV-27210. h2. Short collation name The character set name in the collation name will be optional in all syntactic constructs: {code:sql} CREATE TABLE t1 ( a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci, b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci ); {code} Notice, it will be enough to specify just uca1400_as_ci without having to type the full names: - utf8mb4_uca1400_as\_ci for the column "a", or - utf16_uca1400_as\_ci for the column "b". The full name will be detected automatically according to the character set effective in the given context. h2. SHOW CREATE statements {{SHOW CREATE}} statements will display long collations names. For example, {{SHOW CREATE}} for the table created using the CREATE TABLE statement will produce the following output: {code:sql} SHOW CREATE TABLE t1; {code} {noformat} +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Table | Create Table | +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | t1 | CREATE TABLE `t1` ( `a` char(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_as_ci DEFAULT NULL, `b` char(10) CHARACTER SET utf16 COLLATE utf16_uca1400_as_ci DEFAULT NULL ) ENGINE=InnoDB DEFAULT CHARSET=latin1 | +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ {noformat} Notice, the COLLATE clause will contain the full collation name, including the character set prefix. But this will possibly change in later versions to display short collation names. h2. INFORMATION\_SCHEMA.COLLATIONS changes The following columns in INFORMATION\_SCHEMA.COLLATIONS will be changed to be NULL\-able: - CHARACTER_SET_NAME - ID - IS\_DEFAULT All new UCA1400 collations will be displayed as follows: - The COLLATION\_NAME column will display the short name, without character set prefix. - The three NULL-able column mentioned above will display NULL For old collations nothing will change. For example, the output from this query: {code:sql} SELECT * FROM INFORMATION_SCHEMA.COLLATIONS WHERE COLLATION_NAME LIKE '%czech%'; {code} will look like this: {noformat} +---------------------------+--------------------+------+------------+-------------+---------+ | COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN | +---------------------------+--------------------+------+------------+-------------+---------+ | latin2_czech_cs | latin2 | 2 | | Yes | 4 | | cp1250_czech_cs | cp1250 | 34 | | Yes | 2 | | utf8mb3_czech_ci | utf8mb3 | 202 | | Yes | 8 | | ucs2_czech_ci | ucs2 | 138 | | Yes | 8 | | utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 | | uca1400_czech_ai_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_ai_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_as_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_as_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_ai_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_ai_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_as_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_as_cs | NULL | NULL | NULL | Yes | 8 | | utf16_czech_ci | utf16 | 111 | | Yes | 8 | | utf32_czech_ci | utf32 | 170 | | Yes | 8 | +---------------------------+--------------------+------+------------+-------------+---------+ {noformat} The idea is that short collation names will be applicable to multiple character sets. We'll eventually remove columns CHARACTER_SET_NAME, ID, IS\_DEFAULT from the table COLLATIONS. We'll keep them for now only for easier upgrades. h2. INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY changes In addition to columns COLLATION_NAME and CHARACTER_SET_NAME currently existing in the table INFORMATION_SCHEMA.COLLATION_CHARACTER_SET\_APPLICABILITY, three new columns will be added: - FULL_COLLATION_NAME - ID - IS\_DEFAULT So the new structure for INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY will look like this: {code:sql} CREATE TEMPORARY TABLE `COLLATION_CHARACTER_SET_APPLICABILITY` ( `COLLATION_NAME` varchar(64) NOT NULL, `CHARACTER_SET_NAME` varchar(32) NOT NULL, `FULL_COLLATION_NAME` varchar(64) NOT NULL, `ID` bigint(11) NOT NULL, `IS_DEFAULT` varchar(3) NOT NULL ) ENGINE=MEMORY DEFAULT CHARSET=utf8mb3; {code} The column COLLATION\_NAME will display: - short names, without the character set name prefix, for new UCA1400 collations. - long names for all old collations So for example, the output from this query: {code:sql} SELECT * FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY WHERE COLLATION_NAME LIKE '%czech%cs'; {code} will look about like this: {noformat} +---------------------------+--------------------+-----------------------------------+------+------------+ | COLLATION_NAME | CHARACTER_SET_NAME | FULL_COLLATION_NAME | ID | IS_DEFAULT | +---------------------------+--------------------+-----------------------------------+------+------------+ | latin2_czech_cs | latin2 | latin2_czech_cs | 2 | | | cp1250_czech_cs | cp1250 | cp1250_czech_cs | 34 | | | uca1400_czech_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_ai_cs | 2129 | | | uca1400_czech_as_cs | utf8mb3 | utf8mb3_uca1400_czech_as_cs | 2131 | | | uca1400_czech_nopad_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_ai_cs | 2133 | | | uca1400_czech_nopad_as_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_as_cs | 2135 | | | uca1400_czech_ai_cs | ucs2 | ucs2_uca1400_czech_ai_cs | 2641 | | | uca1400_czech_as_cs | ucs2 | ucs2_uca1400_czech_as_cs | 2643 | | | uca1400_czech_nopad_ai_cs | ucs2 | ucs2_uca1400_czech_nopad_ai_cs | 2645 | | | uca1400_czech_nopad_as_cs | ucs2 | ucs2_uca1400_czech_nopad_as_cs | 2647 | | | uca1400_czech_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_ai_cs | 2385 | | | uca1400_czech_as_cs | utf8mb4 | utf8mb4_uca1400_czech_as_cs | 2387 | | | uca1400_czech_nopad_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_ai_cs | 2389 | | | uca1400_czech_nopad_as_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_as_cs | 2391 | | | uca1400_czech_ai_cs | utf16 | utf16_uca1400_czech_ai_cs | 2897 | | | uca1400_czech_as_cs | utf16 | utf16_uca1400_czech_as_cs | 2899 | | | uca1400_czech_nopad_ai_cs | utf16 | utf16_uca1400_czech_nopad_ai_cs | 2901 | | | uca1400_czech_nopad_as_cs | utf16 | utf16_uca1400_czech_nopad_as_cs | 2903 | | | uca1400_czech_ai_cs | utf32 | utf32_uca1400_czech_ai_cs | 3153 | | | uca1400_czech_as_cs | utf32 | utf32_uca1400_czech_as_cs | 3155 | | | uca1400_czech_nopad_ai_cs | utf32 | utf32_uca1400_czech_nopad_ai_cs | 3157 | | | uca1400_czech_nopad_as_cs | utf32 | utf32_uca1400_czech_nopad_as_cs | 3159 | | +---------------------------+--------------------+-----------------------------------+------+------------+ {noformat} h2. New ID ranges We'll use the range 2048-4095 for new UCA-14.0.0 collation. The ID will encode: - character set - tailoring - flags into 12 bits as follows: {noformat} 1ccc tttt tPST {noformat} where {noformat} - ccc (3 bits) - the character set ID (0 - utf8mb3, 1 - utf8mb4, 2 - ucs2, 3 - utf16, 4 - utf32) - ttttt (5 bits) - the tailoring ID (0 - default, 1 - icelandic, 2 - latvian, etc) - P (1 bit) - the PAD flag (0 - PAD, 1 - NOPAD) - S (1 bit) - Secondary level 0 - disabled (accent insensitive) 1 - enabled (accent sensitive) - T (1 bit) - Tertiary level 0 - disabled (case insensitive) 1 - enabled (case sensitive) {noformat} |
Assignee | Alexander Barkov [ bar ] | Sergei Golubchik [ serg ] |
Status | Stalled [ 10000 ] | In Review [ 10002 ] |
Assignee | Sergei Golubchik [ serg ] | Alexander Barkov [ bar ] |
Status | In Review [ 10002 ] | Stalled [ 10000 ] |
Status | Stalled [ 10000 ] | In Testing [ 10301 ] |
Assignee | Alexander Barkov [ bar ] | Elena Stepanova [ elenst ] |
Assignee | Elena Stepanova [ elenst ] | Lena Startseva [ JIRAUSER50478 ] |
Status | In Testing [ 10301 ] | Stalled [ 10000 ] |
Assignee | Lena Startseva [ JIRAUSER50478 ] | Alexander Barkov [ bar ] |
Description |
In order to make utf8mb4 the default character set (see utf8mb4_general_ci is not good — it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other. In 2021 it's not acceptable — there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table. Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520\_ci (Unicode-5.2.0 based) is not reasonable either — these standards are more than 10 years old. Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version. h2. Character sets New collations will be added for these Unicode character sets - utf8mb3 - utf8mb4 - ucs2 - utf16 - utf32 h2. Tailoring (language specific collations) We'll add collations for all 22 tailorings that exists for the old UCA-4.0.0 collations, except Thai. - icelandic - latvian - romanian - slovenian - polish - estonian - spanish - swedish - turkish - czech - danish - lithuanian - slovak - spanish2 - roman - persian - esperanto - hungarian - sinhala - german2 - croatian - vietnamese Note, with built-in contractions that will be supported in UCA-14.0.0 collations, a separate Thai collation won't be needed. Thai will work with the default root collation. h2. Built-in contractions New collations will include built\-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt There are 939 built-in contractions in UCA-14.0.0. To mention a few of them: {noformat} 0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI> 0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI> 0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI> 0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI> 0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI> {noformat} Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built\-in contractions. This will be a step towards better Unicode Collation Algorithm compliance. With built\-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation. h2. Performance The patch adding UCA\-14.0.0 collations will be pushed together with these patches improving the performance: - - h2. Customization flags Under terms of this task we'll add support for the following flags - pad/nopad - accent sensitivity - case sensititivy Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available: {noformat} +---------------------------+ | collation_name | +---------------------------+ | uca1400_czech_ai_ci | | uca1400_czech_ai_cs | | uca1400_czech_as_ci | | uca1400_czech_as_cs | | uca1400_czech_nopad_ai_ci | | uca1400_czech_nopad_ai_cs | | uca1400_czech_nopad_as_ci | | uca1400_czech_nopad_as_cs | +---------------------------+ {noformat} h2. Naming convention Naming convention for the new collations is described in MDEV-27210. h2. Short collation name The character set name in the collation name will be optional in all syntactic constructs: {code:sql} CREATE TABLE t1 ( a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci, b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci ); {code} Notice, it will be enough to specify just uca1400_as_ci without having to type the full names: - utf8mb4_uca1400_as\_ci for the column "a", or - utf16_uca1400_as\_ci for the column "b". The full name will be detected automatically according to the character set effective in the given context. h2. SHOW CREATE statements {{SHOW CREATE}} statements will display long collations names. For example, {{SHOW CREATE}} for the table created using the CREATE TABLE statement will produce the following output: {code:sql} SHOW CREATE TABLE t1; {code} {noformat} +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Table | Create Table | +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | t1 | CREATE TABLE `t1` ( `a` char(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_as_ci DEFAULT NULL, `b` char(10) CHARACTER SET utf16 COLLATE utf16_uca1400_as_ci DEFAULT NULL ) ENGINE=InnoDB DEFAULT CHARSET=latin1 | +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ {noformat} Notice, the COLLATE clause will contain the full collation name, including the character set prefix. But this will possibly change in later versions to display short collation names. h2. INFORMATION\_SCHEMA.COLLATIONS changes The following columns in INFORMATION\_SCHEMA.COLLATIONS will be changed to be NULL\-able: - CHARACTER_SET_NAME - ID - IS\_DEFAULT All new UCA1400 collations will be displayed as follows: - The COLLATION\_NAME column will display the short name, without character set prefix. - The three NULL-able column mentioned above will display NULL For old collations nothing will change. For example, the output from this query: {code:sql} SELECT * FROM INFORMATION_SCHEMA.COLLATIONS WHERE COLLATION_NAME LIKE '%czech%'; {code} will look like this: {noformat} +---------------------------+--------------------+------+------------+-------------+---------+ | COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN | +---------------------------+--------------------+------+------------+-------------+---------+ | latin2_czech_cs | latin2 | 2 | | Yes | 4 | | cp1250_czech_cs | cp1250 | 34 | | Yes | 2 | | utf8mb3_czech_ci | utf8mb3 | 202 | | Yes | 8 | | ucs2_czech_ci | ucs2 | 138 | | Yes | 8 | | utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 | | uca1400_czech_ai_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_ai_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_as_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_as_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_ai_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_ai_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_as_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_as_cs | NULL | NULL | NULL | Yes | 8 | | utf16_czech_ci | utf16 | 111 | | Yes | 8 | | utf32_czech_ci | utf32 | 170 | | Yes | 8 | +---------------------------+--------------------+------+------------+-------------+---------+ {noformat} The idea is that short collation names will be applicable to multiple character sets. We'll eventually remove columns CHARACTER_SET_NAME, ID, IS\_DEFAULT from the table COLLATIONS. We'll keep them for now only for easier upgrades. h2. INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY changes In addition to columns COLLATION_NAME and CHARACTER_SET_NAME currently existing in the table INFORMATION_SCHEMA.COLLATION_CHARACTER_SET\_APPLICABILITY, three new columns will be added: - FULL_COLLATION_NAME - ID - IS\_DEFAULT So the new structure for INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY will look like this: {code:sql} CREATE TEMPORARY TABLE `COLLATION_CHARACTER_SET_APPLICABILITY` ( `COLLATION_NAME` varchar(64) NOT NULL, `CHARACTER_SET_NAME` varchar(32) NOT NULL, `FULL_COLLATION_NAME` varchar(64) NOT NULL, `ID` bigint(11) NOT NULL, `IS_DEFAULT` varchar(3) NOT NULL ) ENGINE=MEMORY DEFAULT CHARSET=utf8mb3; {code} The column COLLATION\_NAME will display: - short names, without the character set name prefix, for new UCA1400 collations. - long names for all old collations So for example, the output from this query: {code:sql} SELECT * FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY WHERE COLLATION_NAME LIKE '%czech%cs'; {code} will look about like this: {noformat} +---------------------------+--------------------+-----------------------------------+------+------------+ | COLLATION_NAME | CHARACTER_SET_NAME | FULL_COLLATION_NAME | ID | IS_DEFAULT | +---------------------------+--------------------+-----------------------------------+------+------------+ | latin2_czech_cs | latin2 | latin2_czech_cs | 2 | | | cp1250_czech_cs | cp1250 | cp1250_czech_cs | 34 | | | uca1400_czech_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_ai_cs | 2129 | | | uca1400_czech_as_cs | utf8mb3 | utf8mb3_uca1400_czech_as_cs | 2131 | | | uca1400_czech_nopad_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_ai_cs | 2133 | | | uca1400_czech_nopad_as_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_as_cs | 2135 | | | uca1400_czech_ai_cs | ucs2 | ucs2_uca1400_czech_ai_cs | 2641 | | | uca1400_czech_as_cs | ucs2 | ucs2_uca1400_czech_as_cs | 2643 | | | uca1400_czech_nopad_ai_cs | ucs2 | ucs2_uca1400_czech_nopad_ai_cs | 2645 | | | uca1400_czech_nopad_as_cs | ucs2 | ucs2_uca1400_czech_nopad_as_cs | 2647 | | | uca1400_czech_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_ai_cs | 2385 | | | uca1400_czech_as_cs | utf8mb4 | utf8mb4_uca1400_czech_as_cs | 2387 | | | uca1400_czech_nopad_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_ai_cs | 2389 | | | uca1400_czech_nopad_as_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_as_cs | 2391 | | | uca1400_czech_ai_cs | utf16 | utf16_uca1400_czech_ai_cs | 2897 | | | uca1400_czech_as_cs | utf16 | utf16_uca1400_czech_as_cs | 2899 | | | uca1400_czech_nopad_ai_cs | utf16 | utf16_uca1400_czech_nopad_ai_cs | 2901 | | | uca1400_czech_nopad_as_cs | utf16 | utf16_uca1400_czech_nopad_as_cs | 2903 | | | uca1400_czech_ai_cs | utf32 | utf32_uca1400_czech_ai_cs | 3153 | | | uca1400_czech_as_cs | utf32 | utf32_uca1400_czech_as_cs | 3155 | | | uca1400_czech_nopad_ai_cs | utf32 | utf32_uca1400_czech_nopad_ai_cs | 3157 | | | uca1400_czech_nopad_as_cs | utf32 | utf32_uca1400_czech_nopad_as_cs | 3159 | | +---------------------------+--------------------+-----------------------------------+------+------------+ {noformat} h2. New ID ranges We'll use the range 2048-4095 for new UCA-14.0.0 collation. The ID will encode: - character set - tailoring - flags into 12 bits as follows: {noformat} 1ccc tttt tPST {noformat} where {noformat} - ccc (3 bits) - the character set ID (0 - utf8mb3, 1 - utf8mb4, 2 - ucs2, 3 - utf16, 4 - utf32) - ttttt (5 bits) - the tailoring ID (0 - default, 1 - icelandic, 2 - latvian, etc) - P (1 bit) - the PAD flag (0 - PAD, 1 - NOPAD) - S (1 bit) - Secondary level 0 - disabled (accent insensitive) 1 - enabled (accent sensitive) - T (1 bit) - Tertiary level 0 - disabled (case insensitive) 1 - enabled (case sensitive) {noformat} |
In order to make utf8mb4 the default character set (see MDEV\-19123), we need a reasonable default collation for utf8mb4. utf8mb4_general_ci is not good — it compares all supplementary characters (with code points in the range U10000 to U10FFFF) as equal to each other. In 2021 it's not acceptable — there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table. Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520\_ci (Unicode-5.2.0 based) is not reasonable either — these standards are more than 10 years old. Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version. h2. Character sets New collations will be added for these Unicode character sets * utf8mb3 * utf8mb4 * ucs2 * utf16 * utf32 h2. Tailoring (language specific collations) We'll add collations for all 22 tailorings that exists for the old UCA\-4.0.0 collations, except Thai. * icelandic * latvian * romanian * slovenian * polish * estonian * spanish * swedish * turkish * czech * danish * lithuanian * slovak * spanish2 * roman * persian * esperanto * hungarian * sinhala * german2 * croatian * vietnamese Note, with built-in contractions that will be supported in UCA-14.0.0 collations, a separate Thai collation won't be needed. Thai will work with the default root collation. h2. Built\-in contractions New collations will include built\-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt There are 939 built-in contractions in UCA-14.0.0. To mention a few of them: {noformat} 0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI> 0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI> 0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI> 0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI> 0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI> {noformat} Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built\-in contractions. This will be a step towards better Unicode Collation Algorithm compliance. With built\-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation. h2. Performance The patch adding UCA\-14.0.0 collations will be pushed together with these patches improving the performance: * MDEV\-27266 Improve UCA collation performance for utf8mb3 and utf8mb4 * MDEV\-27265 Improve contraction performance in UCA collations h2. Customization flags Under terms of this task we'll add support for the following flags * pad/nopad * accent sensitivity * case sensititivy Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available: {noformat} +---------------------------+ | collation_name | +---------------------------+ | uca1400_czech_ai_ci | | uca1400_czech_ai_cs | | uca1400_czech_as_ci | | uca1400_czech_as_cs | | uca1400_czech_nopad_ai_ci | | uca1400_czech_nopad_ai_cs | | uca1400_czech_nopad_as_ci | | uca1400_czech_nopad_as_cs | +---------------------------+ {noformat} h2. Naming convention Naming convention for the new collations is described in MDEV\-27210. h2. Short collation name The character set name in the collation name will be optional in all syntactic constructs: {code:sql} CREATE TABLE t1 ( a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci, b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci ); {code} Notice, it will be enough to specify just uca1400_as_ci without having to type the full names: * utf8mb4_uca1400_as\_ci for the column "a", or * utf16_uca1400_as\_ci for the column "b". The full name will be detected automatically according to the character set effective in the given context. h2. SHOW CREATE statements {{SHOW CREATE}} statements will display long collations names. For example, {{SHOW CREATE}} for the table created using the CREATE TABLE statement will produce the following output: {code:sql} SHOW CREATE TABLE t1; {code} {noformat} +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Table | Create Table | +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | t1 | CREATE TABLE `t1` ( `a` char(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_as_ci DEFAULT NULL, `b` char(10) CHARACTER SET utf16 COLLATE utf16_uca1400_as_ci DEFAULT NULL ) ENGINE=InnoDB DEFAULT CHARSET=latin1 | +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ {noformat} Notice, the COLLATE clause will contain the full collation name, including the character set prefix. But this will possibly change in later versions to display short collation names. h2. INFORMATION\_SCHEMA.COLLATIONS changes The following columns in INFORMATION\_SCHEMA.COLLATIONS will be changed to be NULL\-able: * CHARACTER_SET_NAME * ID * IS\_DEFAULT All new UCA1400 collations will be displayed as follows: * The COLLATION\_NAME column will display the short name, without character set prefix. * The three NULL\-able column mentioned above will display NULL For old collations nothing will change. For example, the output from this query: {code:sql} SELECT * FROM INFORMATION_SCHEMA.COLLATIONS WHERE COLLATION_NAME LIKE '%czech%'; {code} will look like this: {noformat} +---------------------------+--------------------+------+------------+-------------+---------+ | COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN | +---------------------------+--------------------+------+------------+-------------+---------+ | latin2_czech_cs | latin2 | 2 | | Yes | 4 | | cp1250_czech_cs | cp1250 | 34 | | Yes | 2 | | utf8mb3_czech_ci | utf8mb3 | 202 | | Yes | 8 | | ucs2_czech_ci | ucs2 | 138 | | Yes | 8 | | utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 | | uca1400_czech_ai_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_ai_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_as_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_as_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_ai_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_ai_cs | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_as_ci | NULL | NULL | NULL | Yes | 8 | | uca1400_czech_nopad_as_cs | NULL | NULL | NULL | Yes | 8 | | utf16_czech_ci | utf16 | 111 | | Yes | 8 | | utf32_czech_ci | utf32 | 170 | | Yes | 8 | +---------------------------+--------------------+------+------------+-------------+---------+ {noformat} The idea is that short collation names will be applicable to multiple character sets. We'll eventually remove columns CHARACTER_SET_NAME, ID, IS\_DEFAULT from the table COLLATIONS. We'll keep them for now only for easier upgrades. h2. INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY changes In addition to columns COLLATION_NAME and CHARACTER_SET_NAME currently existing in the table INFORMATION_SCHEMA.COLLATION_CHARACTER_SET\_APPLICABILITY, three new columns will be added: * FULL_COLLATION_NAME * ID * IS\_DEFAULT So the new structure for INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY will look like this: {code:sql} CREATE TEMPORARY TABLE `COLLATION_CHARACTER_SET_APPLICABILITY` ( `COLLATION_NAME` varchar(64) NOT NULL, `CHARACTER_SET_NAME` varchar(32) NOT NULL, `FULL_COLLATION_NAME` varchar(64) NOT NULL, `ID` bigint(11) NOT NULL, `IS_DEFAULT` varchar(3) NOT NULL ) ENGINE=MEMORY DEFAULT CHARSET=utf8mb3; {code} The column COLLATION\_NAME will display: * short names, without the character set name prefix, for new UCA1400 collations. * long names for all old collations So for example, the output from this query: {code:sql} SELECT * FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY WHERE COLLATION_NAME LIKE '%czech%cs'; {code} will look about like this: {noformat} +---------------------------+--------------------+-----------------------------------+------+------------+ | COLLATION_NAME | CHARACTER_SET_NAME | FULL_COLLATION_NAME | ID | IS_DEFAULT | +---------------------------+--------------------+-----------------------------------+------+------------+ | latin2_czech_cs | latin2 | latin2_czech_cs | 2 | | | cp1250_czech_cs | cp1250 | cp1250_czech_cs | 34 | | | uca1400_czech_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_ai_cs | 2129 | | | uca1400_czech_as_cs | utf8mb3 | utf8mb3_uca1400_czech_as_cs | 2131 | | | uca1400_czech_nopad_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_ai_cs | 2133 | | | uca1400_czech_nopad_as_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_as_cs | 2135 | | | uca1400_czech_ai_cs | ucs2 | ucs2_uca1400_czech_ai_cs | 2641 | | | uca1400_czech_as_cs | ucs2 | ucs2_uca1400_czech_as_cs | 2643 | | | uca1400_czech_nopad_ai_cs | ucs2 | ucs2_uca1400_czech_nopad_ai_cs | 2645 | | | uca1400_czech_nopad_as_cs | ucs2 | ucs2_uca1400_czech_nopad_as_cs | 2647 | | | uca1400_czech_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_ai_cs | 2385 | | | uca1400_czech_as_cs | utf8mb4 | utf8mb4_uca1400_czech_as_cs | 2387 | | | uca1400_czech_nopad_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_ai_cs | 2389 | | | uca1400_czech_nopad_as_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_as_cs | 2391 | | | uca1400_czech_ai_cs | utf16 | utf16_uca1400_czech_ai_cs | 2897 | | | uca1400_czech_as_cs | utf16 | utf16_uca1400_czech_as_cs | 2899 | | | uca1400_czech_nopad_ai_cs | utf16 | utf16_uca1400_czech_nopad_ai_cs | 2901 | | | uca1400_czech_nopad_as_cs | utf16 | utf16_uca1400_czech_nopad_as_cs | 2903 | | | uca1400_czech_ai_cs | utf32 | utf32_uca1400_czech_ai_cs | 3153 | | | uca1400_czech_as_cs | utf32 | utf32_uca1400_czech_as_cs | 3155 | | | uca1400_czech_nopad_ai_cs | utf32 | utf32_uca1400_czech_nopad_ai_cs | 3157 | | | uca1400_czech_nopad_as_cs | utf32 | utf32_uca1400_czech_nopad_as_cs | 3159 | | +---------------------------+--------------------+-----------------------------------+------+------------+ {noformat} h2. New ID ranges We'll use the range 2048-4095 for new UCA-14.0.0 collation. The ID will encode: * character set * tailoring * flags into 12 bits as follows: {noformat} 1ccc tttt tPST {noformat} where {noformat} - ccc (3 bits) - the character set ID (0 - utf8mb3, 1 - utf8mb4, 2 - ucs2, 3 - utf16, 4 - utf32) - ttttt (5 bits) - the tailoring ID (0 - default, 1 - icelandic, 2 - latvian, etc) - P (1 bit) - the PAD flag (0 - PAD, 1 - NOPAD) - S (1 bit) - Secondary level 0 - disabled (accent insensitive) 1 - enabled (accent sensitive) - T (1 bit) - Tertiary level 0 - disabled (case insensitive) 1 - enabled (case sensitive) {noformat} |
Fix Version/s | 10.10.1 [ 27913 ] | |
Fix Version/s | 10.10 [ 27530 ] | |
Resolution | Fixed [ 1 ] | |
Status | Stalled [ 10000 ] | Closed [ 6 ] |
Link |
This issue is duplicated by |
Link |
This issue is blocked by |
Link |
This issue relates to |
Link |
This issue is blocked by |
Link | This issue causes MDEV-29763 [ MDEV-29763 ] |
Link |
This issue causes |
Link |
This issue relates to |
Link |
This issue relates to |
Link |
This issue relates to |
Labels | Preview_10.10 |
Link |
This issue causes |
Link |
This issue causes |
Link |
This issue causes |
Zendesk Related Tickets | 172810 154143 125537 136805 |
Link |
This issue causes |
Link |
This issue relates to |
Link | This issue causes MDEV-35876 [ MDEV-35876 ] |
Link |
This issue causes |
We also talked , within the natural sort discussion, about several things. I do remember that only case-insensitive would not be enough, and case-insensitive was requested several times. Apart of that, we talked how to add different properties without creating a new collation (level=1,2,3, numeric,nopad, locale=de_DE!). This might expand the scope of this MDEV, but let's not forget about it, it is quite important. I personally feel like level 3 (accent-case-insensitive) is long due for good sort. As for utf32 and utf16 collation, I feel like this is not important at all, in 2021 Unicode is utf8, and anything else is legacy