Details

    Description

      In order to make utf8mb4 the default character set (see MDEV-19123), we need a reasonable default collation for utf8mb4.

      utf8mb4_general_ci is not good — it compares all supplementary characters (with code points in the range U10000 to U10FFFF) as equal to each other.
      In 2021 it's not acceptable — there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table.

      Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either — these standards are more than 10 years old.

      Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of MDEV-19123).

      Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version.

      Character sets

      New collations will be added for these Unicode character sets

      • utf8mb3
      • utf8mb4
      • ucs2
      • utf16
      • utf32

      Tailoring (language specific collations)

      We'll add collations for all 22 tailorings that exists for the old UCA-4.0.0 collations, except Thai.

      • icelandic
      • latvian
      • romanian
      • slovenian
      • polish
      • estonian
      • spanish
      • swedish
      • turkish
      • czech
      • danish
      • lithuanian
      • slovak
      • spanish2
      • roman
      • persian
      • esperanto
      • hungarian
      • sinhala
      • german2
      • croatian
      • vietnamese

      Note, with built-in contractions that will be supported in UCA-14.0.0 collations, a separate Thai collation won't be needed. Thai will work with the default root collation.

      Built-in contractions

      New collations will include built-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt

      There are 939 built-in contractions in UCA-14.0.0.

      To mention a few of them:

      0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI>
      0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI>
      0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI>
      0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI>
      0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI>

      Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built-in contractions.
      This will be a step towards better Unicode Collation Algorithm compliance. With built-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation.

      Performance

      The patch adding UCA-14.0.0 collations will be pushed together with these patches improving the performance:

      • MDEV-27266 Improve UCA collation performance for utf8mb3 and utf8mb4
      • MDEV-27265 Improve contraction performance in UCA collations

      Customization flags

      Under terms of this task we'll add support for the following flags

      • pad/nopad
      • accent sensitivity
      • case sensititivy

      Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available:

      +---------------------------+
      | collation_name            |
      +---------------------------+
      | uca1400_czech_ai_ci       |
      | uca1400_czech_ai_cs       |
      | uca1400_czech_as_ci       |
      | uca1400_czech_as_cs       |
      | uca1400_czech_nopad_ai_ci |
      | uca1400_czech_nopad_ai_cs |
      | uca1400_czech_nopad_as_ci |
      | uca1400_czech_nopad_as_cs |
      +---------------------------+

      Naming convention

      Naming convention for the new collations is described in MDEV-27210.

      Short collation name

      The character set name in the collation name will be optional in all syntactic constructs:

      CREATE TABLE t1 (
        a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci,
        b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci
      );

      Notice, it will be enough to specify just uca1400_as_ci without having to type the full names:

      • utf8mb4_uca1400_as_ci for the column "a", or
      • utf16_uca1400_as_ci for the column "b".

      The full name will be detected automatically according to the character set effective in the given context.

      SHOW CREATE statements

      SHOW CREATE statements will display long collations names.

      For example, SHOW CREATE for the table created using the CREATE TABLE statement will produce the following output:

      SHOW CREATE TABLE t1;

      +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
      | Table | Create Table                                                                                                                                                                                                            |
      +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
      | t1    | CREATE TABLE `t1` (
        `a` char(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_as_ci DEFAULT NULL,
        `b` char(10) CHARACTER SET utf16 COLLATE utf16_uca1400_as_ci DEFAULT NULL
      ) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
      +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

      Notice, the COLLATE clause will contain the full collation name, including the character set prefix.
      But this will possibly change in later versions to display short collation names.

      INFORMATION_SCHEMA.COLLATIONS changes

      The following columns in INFORMATION_SCHEMA.COLLATIONS will be changed to be NULL-able:

      • CHARACTER_SET_NAME
      • ID
      • IS_DEFAULT

      All new UCA1400 collations will be displayed as follows:

      • The COLLATION_NAME column will display the short name, without character set prefix.
      • The three NULL-able column mentioned above will display NULL

      For old collations nothing will change.

      For example, the output from this query:

      SELECT * FROM INFORMATION_SCHEMA.COLLATIONS
      WHERE COLLATION_NAME LIKE '%czech%';

      will look like this:

      +---------------------------+--------------------+------+------------+-------------+---------+
      | COLLATION_NAME            | CHARACTER_SET_NAME | ID   | IS_DEFAULT | IS_COMPILED | SORTLEN |
      +---------------------------+--------------------+------+------------+-------------+---------+
      | latin2_czech_cs           | latin2             |    2 |            | Yes         |       4 |
      | cp1250_czech_cs           | cp1250             |   34 |            | Yes         |       2 |
      | utf8mb3_czech_ci          | utf8mb3            |  202 |            | Yes         |       8 |
      | ucs2_czech_ci             | ucs2               |  138 |            | Yes         |       8 |
      | utf8mb4_czech_ci          | utf8mb4            |  234 |            | Yes         |       8 |
      | uca1400_czech_ai_ci       | NULL               | NULL | NULL       | Yes         |       8 |
      | uca1400_czech_ai_cs       | NULL               | NULL | NULL       | Yes         |       8 |
      | uca1400_czech_as_ci       | NULL               | NULL | NULL       | Yes         |       8 |
      | uca1400_czech_as_cs       | NULL               | NULL | NULL       | Yes         |       8 |
      | uca1400_czech_nopad_ai_ci | NULL               | NULL | NULL       | Yes         |       8 |
      | uca1400_czech_nopad_ai_cs | NULL               | NULL | NULL       | Yes         |       8 |
      | uca1400_czech_nopad_as_ci | NULL               | NULL | NULL       | Yes         |       8 |
      | uca1400_czech_nopad_as_cs | NULL               | NULL | NULL       | Yes         |       8 |
      | utf16_czech_ci            | utf16              |  111 |            | Yes         |       8 |
      | utf32_czech_ci            | utf32              |  170 |            | Yes         |       8 |
      +---------------------------+--------------------+------+------------+-------------+---------+

      The idea is that short collation names will be applicable to multiple character sets.
      We'll eventually remove columns CHARACTER_SET_NAME, ID, IS_DEFAULT from the table COLLATIONS. We'll keep them for now only for easier upgrades.

      INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY changes

      In addition to columns COLLATION_NAME and CHARACTER_SET_NAME currently existing in the table INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY, three new columns will be added:

      • FULL_COLLATION_NAME
      • ID
      • IS_DEFAULT

      So the new structure for INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY will look like this:

      CREATE TEMPORARY TABLE `COLLATION_CHARACTER_SET_APPLICABILITY` (
        `COLLATION_NAME` varchar(64) NOT NULL,
        `CHARACTER_SET_NAME` varchar(32) NOT NULL,
        `FULL_COLLATION_NAME` varchar(64) NOT NULL,
        `ID` bigint(11) NOT NULL,
        `IS_DEFAULT` varchar(3) NOT NULL
      ) ENGINE=MEMORY DEFAULT CHARSET=utf8mb3;

      The column COLLATION_NAME will display:

      • short names, without the character set name prefix, for new UCA1400 collations.
      • long names for all old collations

      So for example, the output from this query:

      SELECT * FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY
      WHERE COLLATION_NAME LIKE '%czech%cs';

      will look about like this:

      +---------------------------+--------------------+-----------------------------------+------+------------+
      | COLLATION_NAME            | CHARACTER_SET_NAME | FULL_COLLATION_NAME               | ID   | IS_DEFAULT |
      +---------------------------+--------------------+-----------------------------------+------+------------+
      | latin2_czech_cs           | latin2             | latin2_czech_cs                   |    2 |            |
      | cp1250_czech_cs           | cp1250             | cp1250_czech_cs                   |   34 |            |
      | uca1400_czech_ai_cs       | utf8mb3            | utf8mb3_uca1400_czech_ai_cs       | 2129 |            |
      | uca1400_czech_as_cs       | utf8mb3            | utf8mb3_uca1400_czech_as_cs       | 2131 |            |
      | uca1400_czech_nopad_ai_cs | utf8mb3            | utf8mb3_uca1400_czech_nopad_ai_cs | 2133 |            |
      | uca1400_czech_nopad_as_cs | utf8mb3            | utf8mb3_uca1400_czech_nopad_as_cs | 2135 |            |
      | uca1400_czech_ai_cs       | ucs2               | ucs2_uca1400_czech_ai_cs          | 2641 |            |
      | uca1400_czech_as_cs       | ucs2               | ucs2_uca1400_czech_as_cs          | 2643 |            |
      | uca1400_czech_nopad_ai_cs | ucs2               | ucs2_uca1400_czech_nopad_ai_cs    | 2645 |            |
      | uca1400_czech_nopad_as_cs | ucs2               | ucs2_uca1400_czech_nopad_as_cs    | 2647 |            |
      | uca1400_czech_ai_cs       | utf8mb4            | utf8mb4_uca1400_czech_ai_cs       | 2385 |            |
      | uca1400_czech_as_cs       | utf8mb4            | utf8mb4_uca1400_czech_as_cs       | 2387 |            |
      | uca1400_czech_nopad_ai_cs | utf8mb4            | utf8mb4_uca1400_czech_nopad_ai_cs | 2389 |            |
      | uca1400_czech_nopad_as_cs | utf8mb4            | utf8mb4_uca1400_czech_nopad_as_cs | 2391 |            |
      | uca1400_czech_ai_cs       | utf16              | utf16_uca1400_czech_ai_cs         | 2897 |            |
      | uca1400_czech_as_cs       | utf16              | utf16_uca1400_czech_as_cs         | 2899 |            |
      | uca1400_czech_nopad_ai_cs | utf16              | utf16_uca1400_czech_nopad_ai_cs   | 2901 |            |
      | uca1400_czech_nopad_as_cs | utf16              | utf16_uca1400_czech_nopad_as_cs   | 2903 |            |
      | uca1400_czech_ai_cs       | utf32              | utf32_uca1400_czech_ai_cs         | 3153 |            |
      | uca1400_czech_as_cs       | utf32              | utf32_uca1400_czech_as_cs         | 3155 |            |
      | uca1400_czech_nopad_ai_cs | utf32              | utf32_uca1400_czech_nopad_ai_cs   | 3157 |            |
      | uca1400_czech_nopad_as_cs | utf32              | utf32_uca1400_czech_nopad_as_cs   | 3159 |            |
      +---------------------------+--------------------+-----------------------------------+------+------------+

      New ID ranges

      We'll use the range 2048-4095 for new UCA-14.0.0 collation.

      The ID will encode:

      • character set
      • tailoring
      • flags

      into 12 bits as follows:

      1ccc tttt tPST

      where

      - ccc   (3 bits) - the character set ID
                         (0 - utf8mb3, 1 - utf8mb4, 2 - ucs2, 3 - utf16, 4 - utf32)
       
      - ttttt (5 bits) - the tailoring ID
                         (0 - default, 1 - icelandic, 2 - latvian, etc)
       
      - P     (1 bit)  - the PAD flag (0 - PAD, 1 - NOPAD)
       
      - S     (1 bit)  - Secondary level
                         0 - disabled (accent insensitive)
                         1 - enabled  (accent sensitive)
       
      - T     (1 bit)  - Tertiary level
                         0 -  disabled (case insensitive)
                         1 -  enabled  (case sensitive)

      Attachments

        Issue Links

          Activity

            bar Alexander Barkov created issue -
            bar Alexander Barkov made changes -
            Field Original Value New Value
            Description In order to make utf8mb4 the default character set (see MDEV-19123), we need a reasonable default collation for utf8mb4.

            utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other.
            In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table.

            Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old.

            Let's add utf8mb4_unicode_1400_ci - a Unicode collation algoritm collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of MDEV-19123).

            Open questions:
            - Should we add _unicode_1400_ci for ucs2, utf16, utf32 at the same time?
            - Which naming should we use:
            -- utf8mb4_unicode_1400_ci (similar to MariaDB's utf8mb4_unicode_520_ci)
            -- utf8mb4_1400_ai_ci (utf8mb4_0900_ai_ci (similar to MySQL's utf8mb4_0900_ai_ci)
            -- Some other naming
            In order to make utf8mb4 the default character set (see MDEV-19123), we need a reasonable default collation for utf8mb4.

            utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other.
            In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table.

            Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old.

            Let's add utf8mb4_unicode_1400_ci - a Unicode collation algoritm collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of MDEV-19123).

            Open questions:
            - To avoid slow down on switch from utf8mb4_general_ci to the new collation, should we refactor the current code implementation somehow, or is the slow down tolerable?
            - Should we add _unicode_1400_ci for ucs2, utf16, utf32 at the same time?
            - Which naming should we use:
            -- utf8mb4_unicode_1400_ci (similar to MariaDB's utf8mb4_unicode_520_ci)
            -- utf8mb4_1400_ai_ci (utf8mb4_0900_ai_ci (similar to MySQL's utf8mb4_0900_ai_ci)
            -- Some other naming
            bar Alexander Barkov made changes -
            Description In order to make utf8mb4 the default character set (see MDEV-19123), we need a reasonable default collation for utf8mb4.

            utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other.
            In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table.

            Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old.

            Let's add utf8mb4_unicode_1400_ci - a Unicode collation algoritm collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of MDEV-19123).

            Open questions:
            - To avoid slow down on switch from utf8mb4_general_ci to the new collation, should we refactor the current code implementation somehow, or is the slow down tolerable?
            - Should we add _unicode_1400_ci for ucs2, utf16, utf32 at the same time?
            - Which naming should we use:
            -- utf8mb4_unicode_1400_ci (similar to MariaDB's utf8mb4_unicode_520_ci)
            -- utf8mb4_1400_ai_ci (utf8mb4_0900_ai_ci (similar to MySQL's utf8mb4_0900_ai_ci)
            -- Some other naming
            In order to make utf8mb4 the default character set (see MDEV-19123), we need a reasonable default collation for utf8mb4.

            utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other.
            In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table.

            Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old.

            Let's add utf8mb4_unicode_1400_ci - a Unicode collation algoritm collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of MDEV-19123).

            Open questions:
            - To avoid the slow down which will happen when switching from the simple collation utf8mb4_general_ci to the new UCA based collation, should we refactor the current code implementation somehow? or is the slow down tolerable?
            - Should we add _unicode_1400_ci for ucs2, utf16, utf32 at the same time?
            - Which naming should we use:
            -- utf8mb4_unicode_1400_ci (similar to MariaDB's utf8mb4_unicode_520_ci)
            -- utf8mb4_1400_ai_ci (utf8mb4_0900_ai_ci (similar to MySQL's utf8mb4_0900_ai_ci)
            -- Some other naming
            bar Alexander Barkov made changes -
            Description In order to make utf8mb4 the default character set (see MDEV-19123), we need a reasonable default collation for utf8mb4.

            utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other.
            In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table.

            Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old.

            Let's add utf8mb4_unicode_1400_ci - a Unicode collation algoritm collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of MDEV-19123).

            Open questions:
            - To avoid the slow down which will happen when switching from the simple collation utf8mb4_general_ci to the new UCA based collation, should we refactor the current code implementation somehow? or is the slow down tolerable?
            - Should we add _unicode_1400_ci for ucs2, utf16, utf32 at the same time?
            - Which naming should we use:
            -- utf8mb4_unicode_1400_ci (similar to MariaDB's utf8mb4_unicode_520_ci)
            -- utf8mb4_1400_ai_ci (utf8mb4_0900_ai_ci (similar to MySQL's utf8mb4_0900_ai_ci)
            -- Some other naming
            In order to make utf8mb4 the default character set (see MDEV-19123), we need a reasonable default collation for utf8mb4.

            utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other.
            In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table.

            Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old.

            Let's add utf8mb4_unicode_1400_ci - a Unicode collation algoritm collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of MDEV-19123).

            Under terms of this task we'll add only the "root" collation. All language specific Unicode-14.0.0 based collations will be added under a separate task later.

            Open questions:
            - To avoid the slow down which will happen when switching from the simple collation utf8mb4_general_ci to the new UCA based collation, should we refactor the current code implementation somehow? or is the slow down tolerable?
            - Should we add _unicode_1400_ci for ucs2, utf16, utf32 at the same time?
            - Which naming should we use:
            -- utf8mb4_unicode_1400_ci (similar to MariaDB's utf8mb4_unicode_520_ci)
            -- utf8mb4_1400_ai_ci (utf8mb4_0900_ai_ci (similar to MySQL's utf8mb4_0900_ai_ci)
            -- Some other naming
            bar Alexander Barkov made changes -
            Description In order to make utf8mb4 the default character set (see MDEV-19123), we need a reasonable default collation for utf8mb4.

            utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other.
            In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table.

            Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old.

            Let's add utf8mb4_unicode_1400_ci - a Unicode collation algoritm collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of MDEV-19123).

            Under terms of this task we'll add only the "root" collation. All language specific Unicode-14.0.0 based collations will be added under a separate task later.

            Open questions:
            - To avoid the slow down which will happen when switching from the simple collation utf8mb4_general_ci to the new UCA based collation, should we refactor the current code implementation somehow? or is the slow down tolerable?
            - Should we add _unicode_1400_ci for ucs2, utf16, utf32 at the same time?
            - Which naming should we use:
            -- utf8mb4_unicode_1400_ci (similar to MariaDB's utf8mb4_unicode_520_ci)
            -- utf8mb4_1400_ai_ci (utf8mb4_0900_ai_ci (similar to MySQL's utf8mb4_0900_ai_ci)
            -- Some other naming
            In order to make utf8mb4 the default character set (see MDEV-19123), we need a reasonable default collation for utf8mb4.

            utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other.
            In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table.

            Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old.

            Let's add utf8mb4_unicode_1400_ci - a Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of MDEV-19123).

            Under terms of this task we'll add only the "root" collation. All language specific Unicode-14.0.0 based collations will be added under a separate task later.

            Open questions:
            - To avoid the slow down which will happen when switching from the simple collation utf8mb4_general_ci to the new UCA based collation, should we refactor the current code implementation somehow? or is the slow down tolerable?
            - Should we add _unicode_1400_ci for ucs2, utf16, utf32 at the same time?
            - Which naming should we use:
            -- utf8mb4_unicode_1400_ci (similar to MariaDB's utf8mb4_unicode_520_ci)
            -- utf8mb4_1400_ai_ci (utf8mb4_0900_ai_ci (similar to MySQL's utf8mb4_0900_ai_ci)
            -- Some other naming
            bar Alexander Barkov made changes -
            bar Alexander Barkov made changes -
            serg Sergei Golubchik made changes -
            Priority Major [ 3 ] Critical [ 2 ]
            wlad Vladislav Vaintroub added a comment - - edited

            We also talked , within the natural sort discussion, about several things. I do remember that only case-insensitive would not be enough, and case-insensitive was requested several times. Apart of that, we talked how to add different properties without creating a new collation (level=1,2,3, numeric,nopad, locale=de_DE!). This might expand the scope of this MDEV, but let's not forget about it, it is quite important. I personally feel like level 3 (accent-case-insensitive) is long due for good sort. As for utf32 and utf16 collation, I feel like this is not important at all, in 2021 Unicode is utf8, and anything else is legacy

            wlad Vladislav Vaintroub added a comment - - edited We also talked , within the natural sort discussion, about several things. I do remember that only case-insensitive would not be enough, and case-insensitive was requested several times. Apart of that, we talked how to add different properties without creating a new collation (level=1,2,3, numeric,nopad, locale=de_DE!). This might expand the scope of this MDEV, but let's not forget about it, it is quite important. I personally feel like level 3 (accent-case-insensitive) is long due for good sort. As for utf32 and utf16 collation, I feel like this is not important at all, in 2021 Unicode is utf8, and anything else is legacy
            bar Alexander Barkov made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            bar Alexander Barkov made changes -
            bar Alexander Barkov made changes -
            bar Alexander Barkov made changes -
            bar Alexander Barkov made changes -
            serg Sergei Golubchik made changes -
            Workflow MariaDB v3 [ 127183 ] MariaDB v4 [ 131854 ]
            serg Sergei Golubchik made changes -
            Status In Progress [ 3 ] In Testing [ 10301 ]
            serg Sergei Golubchik made changes -
            Assignee Alexander Barkov [ bar ] Elena Stepanova [ elenst ]
            ramesh Ramesh Sivaraman made changes -
            elenst Elena Stepanova made changes -
            Assignee Elena Stepanova [ elenst ] Ramesh Sivaraman [ JIRAUSER48189 ]
            bar Alexander Barkov made changes -
            bar Alexander Barkov made changes -
            bar Alexander Barkov made changes -
            bar Alexander Barkov made changes -
            Richard Richard Stracke made changes -
            Richard Richard Stracke made changes -
            JAB Creations John Bilicki added a comment -

            Definitely utf8mb4_unicode_1400_ci instead of utf8mb4_1400_ai_ci as MySQL is obsolete due to Oracle. It sounds like case-sensitive is the way to go. "Level 3 (accent-case-insensitive) is long due for good sort" sounds like a good idea. I think for now UTF8 collation should be the focus though UTF16 and UTF32 could be appended afterwards though I wouldn't consider them legacy considering the potential extreme need for language expansion in regards to an entirely different subject pending ~two decades from now. If there is a slow down I imagine that any one it effects could just elect to use a lower level Unicode standard and anyone who truly needs a higher standard would likely be able to justify the hardware costs to compensate.

            JAB Creations John Bilicki added a comment - Definitely utf8mb4_unicode_1400_ci instead of utf8mb4_1400_ai_ci as MySQL is obsolete due to Oracle. It sounds like case-sensitive is the way to go. "Level 3 (accent-case-insensitive) is long due for good sort" sounds like a good idea. I think for now UTF8 collation should be the focus though UTF16 and UTF32 could be appended afterwards though I wouldn't consider them legacy considering the potential extreme need for language expansion in regards to an entirely different subject pending ~two decades from now. If there is a slow down I imagine that any one it effects could just elect to use a lower level Unicode standard and anyone who truly needs a higher standard would likely be able to justify the hardware costs to compensate.
            ramesh Ramesh Sivaraman made changes -
            Status In Testing [ 10301 ] Stalled [ 10000 ]
            ramesh Ramesh Sivaraman made changes -
            Assignee Ramesh Sivaraman [ JIRAUSER48189 ] Alexander Barkov [ bar ]
            bar Alexander Barkov made changes -
            serg Sergei Golubchik made changes -
            Fix Version/s 10.9 [ 26905 ]
            Fix Version/s 10.8 [ 26121 ]
            bar Alexander Barkov made changes -
            bar Alexander Barkov made changes -
            bar Alexander Barkov made changes -
            bar Alexander Barkov made changes -
            bar Alexander Barkov made changes -
            bar Alexander Barkov made changes -
            bar Alexander Barkov made changes -
            julien.fritsch Julien Fritsch made changes -
            julien.fritsch Julien Fritsch made changes -
            julien.fritsch Julien Fritsch made changes -
            julien.fritsch Julien Fritsch made changes -
            bar Alexander Barkov made changes -
            Status Stalled [ 10000 ] In Progress [ 3 ]

            Hello serg,

            Please review a set of patches implementing UCA-14.0.0 collations in
            https://github.com/mariadb/server/tree/bb-10.9-bar

            Thanks.

            bar Alexander Barkov added a comment - Hello serg , Please review a set of patches implementing UCA-14.0.0 collations in https://github.com/mariadb/server/tree/bb-10.9-bar Thanks.
            bar Alexander Barkov made changes -
            Assignee Alexander Barkov [ bar ] Sergei Golubchik [ serg ]
            Status In Progress [ 3 ] In Review [ 10002 ]
            bar Alexander Barkov made changes -
            Description In order to make utf8mb4 the default character set (see MDEV-19123), we need a reasonable default collation for utf8mb4.

            utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other.
            In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table.

            Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old.

            Let's add utf8mb4_unicode_1400_ci - a Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of MDEV-19123).

            Under terms of this task we'll add only the "root" collation. All language specific Unicode-14.0.0 based collations will be added under a separate task later.

            Open questions:
            - To avoid the slow down which will happen when switching from the simple collation utf8mb4_general_ci to the new UCA based collation, should we refactor the current code implementation somehow? or is the slow down tolerable?
            - Should we add _unicode_1400_ci for ucs2, utf16, utf32 at the same time?
            - Which naming should we use:
            -- utf8mb4_unicode_1400_ci (similar to MariaDB's utf8mb4_unicode_520_ci)
            -- utf8mb4_1400_ai_ci (utf8mb4_0900_ai_ci (similar to MySQL's utf8mb4_0900_ai_ci)
            -- Some other naming
            In order to make utf8mb4 the default character set (see MDEV-19123), we need a reasonable default collation for utf8mb4.

            utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other.
            In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table.

            Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old.

            Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of MDEV-19123).

            Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version.

            h2. Character sets
            New collations will be added for these Unicode character sets
            - utf8mb3
            - utf8mb4
            - ucs2
            - utf16
            - utf32


            h2. Built-in contractions

            New collations will include built-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt

            There are 939 built-in contractions in UCA-14.0.0.

            To mention a few of them:
            {noformat}
            0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI>
            0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI>
            0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI>
            0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI>
            0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI>
            {noformat}


            h2. Performance
            The patch adding UCA-14.0.0 collations will be pushed together with these patches improving the performance:
            - MDEV-27266 Improve UCA collation performance for utf8mb3 and utf8mb4
            - MDEV-27265 Improve contraction performance in UCA collations

            h2. Customization flags
            Under terms of this task we'll add support for the following flags
            - pad/nopad
            - accent sensitivity
            - case sensititivy

            Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available:
            {noformat}
            +---------------------------+
            | collation_name |
            +---------------------------+
            | uca1400_czech_ai_ci |
            | uca1400_czech_ai_cs |
            | uca1400_czech_as_ci |
            | uca1400_czech_as_cs |
            | uca1400_czech_nopad_ai_ci |
            | uca1400_czech_nopad_ai_cs |
            | uca1400_czech_nopad_as_ci |
            | uca1400_czech_nopad_as_cs |
            +---------------------------+
            {noformat}


            h2. Naming convention

            Naming convention for the new collations is described in MDEV-27210.

            h2. Short collation name

            The character set name in the collation name will be optional in all syntactic constructs:

            {code:sql}
            CREATE TABLE t1 (
              a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci,
              b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci
            );
            {code}

            Notice, it will be enough to specify just uca1400_as_ci without having to type the full names:
            - utf8mb4_uca1400_as_ci for the column "a", or
            - utf16_uca1400_as_ci for the column "b".

            The full name will be detected automatically according to the effective character set.
            bar Alexander Barkov made changes -
            Description In order to make utf8mb4 the default character set (see MDEV-19123), we need a reasonable default collation for utf8mb4.

            utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other.
            In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table.

            Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old.

            Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of MDEV-19123).

            Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version.

            h2. Character sets
            New collations will be added for these Unicode character sets
            - utf8mb3
            - utf8mb4
            - ucs2
            - utf16
            - utf32


            h2. Built-in contractions

            New collations will include built-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt

            There are 939 built-in contractions in UCA-14.0.0.

            To mention a few of them:
            {noformat}
            0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI>
            0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI>
            0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI>
            0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI>
            0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI>
            {noformat}


            h2. Performance
            The patch adding UCA-14.0.0 collations will be pushed together with these patches improving the performance:
            - MDEV-27266 Improve UCA collation performance for utf8mb3 and utf8mb4
            - MDEV-27265 Improve contraction performance in UCA collations

            h2. Customization flags
            Under terms of this task we'll add support for the following flags
            - pad/nopad
            - accent sensitivity
            - case sensititivy

            Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available:
            {noformat}
            +---------------------------+
            | collation_name |
            +---------------------------+
            | uca1400_czech_ai_ci |
            | uca1400_czech_ai_cs |
            | uca1400_czech_as_ci |
            | uca1400_czech_as_cs |
            | uca1400_czech_nopad_ai_ci |
            | uca1400_czech_nopad_ai_cs |
            | uca1400_czech_nopad_as_ci |
            | uca1400_czech_nopad_as_cs |
            +---------------------------+
            {noformat}


            h2. Naming convention

            Naming convention for the new collations is described in MDEV-27210.

            h2. Short collation name

            The character set name in the collation name will be optional in all syntactic constructs:

            {code:sql}
            CREATE TABLE t1 (
              a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci,
              b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci
            );
            {code}

            Notice, it will be enough to specify just uca1400_as_ci without having to type the full names:
            - utf8mb4_uca1400_as_ci for the column "a", or
            - utf16_uca1400_as_ci for the column "b".

            The full name will be detected automatically according to the effective character set.
            In order to make utf8mb4 the default character set (see MDEV-19123), we need a reasonable default collation for utf8mb4.

            utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other.
            In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table.

            Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old.

            Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of MDEV-19123).

            Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version.

            h2. Character sets
            New collations will be added for these Unicode character sets
            - utf8mb3
            - utf8mb4
            - ucs2
            - utf16
            - utf32


            h2. Built-in contractions

            New collations will include built-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt

            There are 939 built-in contractions in UCA-14.0.0.

            To mention a few of them:
            {noformat}
            0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI>
            0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI>
            0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI>
            0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI>
            0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI>
            {noformat}

            Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built-in contractions.
            This will be a step towards better Unicode Collation Algorithm compliance. With built-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation.

            h2. Performance
            The patch adding UCA-14.0.0 collations will be pushed together with these patches improving the performance:
            - MDEV-27266 Improve UCA collation performance for utf8mb3 and utf8mb4
            - MDEV-27265 Improve contraction performance in UCA collations

            h2. Customization flags
            Under terms of this task we'll add support for the following flags
            - pad/nopad
            - accent sensitivity
            - case sensititivy

            Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available:
            {noformat}
            +---------------------------+
            | collation_name |
            +---------------------------+
            | uca1400_czech_ai_ci |
            | uca1400_czech_ai_cs |
            | uca1400_czech_as_ci |
            | uca1400_czech_as_cs |
            | uca1400_czech_nopad_ai_ci |
            | uca1400_czech_nopad_ai_cs |
            | uca1400_czech_nopad_as_ci |
            | uca1400_czech_nopad_as_cs |
            +---------------------------+
            {noformat}


            h2. Naming convention

            Naming convention for the new collations is described in MDEV-27210.

            h2. Short collation name

            The character set name in the collation name will be optional in all syntactic constructs:

            {code:sql}
            CREATE TABLE t1 (
              a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci,
              b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci
            );
            {code}

            Notice, it will be enough to specify just uca1400_as_ci without having to type the full names:
            - utf8mb4_uca1400_as_ci for the column "a", or
            - utf16_uca1400_as_ci for the column "b".

            The full name will be detected automatically according to the effective character set.
            bar Alexander Barkov made changes -
            Description In order to make utf8mb4 the default character set (see MDEV-19123), we need a reasonable default collation for utf8mb4.

            utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other.
            In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table.

            Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old.

            Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of MDEV-19123).

            Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version.

            h2. Character sets
            New collations will be added for these Unicode character sets
            - utf8mb3
            - utf8mb4
            - ucs2
            - utf16
            - utf32


            h2. Built-in contractions

            New collations will include built-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt

            There are 939 built-in contractions in UCA-14.0.0.

            To mention a few of them:
            {noformat}
            0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI>
            0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI>
            0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI>
            0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI>
            0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI>
            {noformat}

            Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built-in contractions.
            This will be a step towards better Unicode Collation Algorithm compliance. With built-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation.

            h2. Performance
            The patch adding UCA-14.0.0 collations will be pushed together with these patches improving the performance:
            - MDEV-27266 Improve UCA collation performance for utf8mb3 and utf8mb4
            - MDEV-27265 Improve contraction performance in UCA collations

            h2. Customization flags
            Under terms of this task we'll add support for the following flags
            - pad/nopad
            - accent sensitivity
            - case sensititivy

            Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available:
            {noformat}
            +---------------------------+
            | collation_name |
            +---------------------------+
            | uca1400_czech_ai_ci |
            | uca1400_czech_ai_cs |
            | uca1400_czech_as_ci |
            | uca1400_czech_as_cs |
            | uca1400_czech_nopad_ai_ci |
            | uca1400_czech_nopad_ai_cs |
            | uca1400_czech_nopad_as_ci |
            | uca1400_czech_nopad_as_cs |
            +---------------------------+
            {noformat}


            h2. Naming convention

            Naming convention for the new collations is described in MDEV-27210.

            h2. Short collation name

            The character set name in the collation name will be optional in all syntactic constructs:

            {code:sql}
            CREATE TABLE t1 (
              a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci,
              b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci
            );
            {code}

            Notice, it will be enough to specify just uca1400_as_ci without having to type the full names:
            - utf8mb4_uca1400_as_ci for the column "a", or
            - utf16_uca1400_as_ci for the column "b".

            The full name will be detected automatically according to the effective character set.
            In order to make utf8mb4 the default character set (see MDEV-19123), we need a reasonable default collation for utf8mb4.

            utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other.
            In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table.

            Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old.

            Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of MDEV-19123).

            Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version.

            h2. Character sets
            New collations will be added for these Unicode character sets
            - utf8mb3
            - utf8mb4
            - ucs2
            - utf16
            - utf32


            h2. Built-in contractions

            New collations will include built-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt

            There are 939 built-in contractions in UCA-14.0.0.

            To mention a few of them:
            {noformat}
            0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI>
            0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI>
            0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI>
            0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI>
            0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI>
            {noformat}

            Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built-in contractions.
            This will be a step towards better Unicode Collation Algorithm compliance. With built-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation.

            h2. Performance
            The patch adding UCA-14.0.0 collations will be pushed together with these patches improving the performance:
            - MDEV-27266 Improve UCA collation performance for utf8mb3 and utf8mb4
            - MDEV-27265 Improve contraction performance in UCA collations

            h2. Customization flags
            Under terms of this task we'll add support for the following flags
            - pad/nopad
            - accent sensitivity
            - case sensititivy

            Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available:
            {noformat}
            +---------------------------+
            | collation_name |
            +---------------------------+
            | uca1400_czech_ai_ci |
            | uca1400_czech_ai_cs |
            | uca1400_czech_as_ci |
            | uca1400_czech_as_cs |
            | uca1400_czech_nopad_ai_ci |
            | uca1400_czech_nopad_ai_cs |
            | uca1400_czech_nopad_as_ci |
            | uca1400_czech_nopad_as_cs |
            +---------------------------+
            {noformat}


            h2. Naming convention

            Naming convention for the new collations is described in MDEV-27210.

            h2. Short collation name

            The character set name in the collation name will be optional in all syntactic constructs:

            {code:sql}
            CREATE TABLE t1 (
              a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci,
              b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci
            );
            {code}

            Notice, it will be enough to specify just uca1400_as_ci without having to type the full names:
            - utf8mb4_uca1400_as_ci for the column "a", or
            - utf16_uca1400_as_ci for the column "b".

            The full name will be detected automatically according to the effective character set.

            h2. INFORMATION_SCHEMA.COLLATIONS changes

            The following columns in INFORMATION_SCHEMA.COLLATIONS will be changed to be NULL-able:

            - CHARACTER_SET_NAME
            - ID
            - IS_DEFAULT

            All new UCA1400 collations will be displayed as follows:

            - The COLLATION_NAME column will display the short name, without character set prefix.
            - The three NULL-able column mentioned above will display NULL

            For old collations nothing will change.

            For example, the output from this query:
            {code:sql}
            SELECT * FROM INFORMATION_SCHEMA.COLLATIONS WHERE COLLATION_NAME LIKE '%czech%';
            {code}
            will look like this:
            {noformat}
            +---------------------------+--------------------+------+------------+-------------+---------+
            | COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN |
            +---------------------------+--------------------+------+------------+-------------+---------+
            | latin2_czech_cs | latin2 | 2 | | Yes | 4 |
            | cp1250_czech_cs | cp1250 | 34 | | Yes | 2 |
            | utf8mb3_czech_ci | utf8mb3 | 202 | | Yes | 8 |
            | ucs2_czech_ci | ucs2 | 138 | | Yes | 8 |
            | utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 |
            | uca1400_czech_ai_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_ai_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_as_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_as_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_ai_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_ai_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_as_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_as_cs | NULL | NULL | NULL | Yes | 8 |
            | utf16_czech_ci | utf16 | 111 | | Yes | 8 |
            | utf32_czech_ci | utf32 | 170 | | Yes | 8 |
            +---------------------------+--------------------+------+------------+-------------+---------+
            {noformat}

            The idea is that short collation names will be applicable to multiple character sets.
            We'll eventually remove columns CHARACTER_SET_NAME, ID, IS_DEFAULT from the table COLLATIONS. We'll keep them for now only for easier upgrades.

            bar Alexander Barkov made changes -
            Description In order to make utf8mb4 the default character set (see MDEV-19123), we need a reasonable default collation for utf8mb4.

            utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other.
            In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table.

            Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old.

            Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of MDEV-19123).

            Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version.

            h2. Character sets
            New collations will be added for these Unicode character sets
            - utf8mb3
            - utf8mb4
            - ucs2
            - utf16
            - utf32


            h2. Built-in contractions

            New collations will include built-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt

            There are 939 built-in contractions in UCA-14.0.0.

            To mention a few of them:
            {noformat}
            0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI>
            0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI>
            0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI>
            0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI>
            0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI>
            {noformat}

            Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built-in contractions.
            This will be a step towards better Unicode Collation Algorithm compliance. With built-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation.

            h2. Performance
            The patch adding UCA-14.0.0 collations will be pushed together with these patches improving the performance:
            - MDEV-27266 Improve UCA collation performance for utf8mb3 and utf8mb4
            - MDEV-27265 Improve contraction performance in UCA collations

            h2. Customization flags
            Under terms of this task we'll add support for the following flags
            - pad/nopad
            - accent sensitivity
            - case sensititivy

            Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available:
            {noformat}
            +---------------------------+
            | collation_name |
            +---------------------------+
            | uca1400_czech_ai_ci |
            | uca1400_czech_ai_cs |
            | uca1400_czech_as_ci |
            | uca1400_czech_as_cs |
            | uca1400_czech_nopad_ai_ci |
            | uca1400_czech_nopad_ai_cs |
            | uca1400_czech_nopad_as_ci |
            | uca1400_czech_nopad_as_cs |
            +---------------------------+
            {noformat}


            h2. Naming convention

            Naming convention for the new collations is described in MDEV-27210.

            h2. Short collation name

            The character set name in the collation name will be optional in all syntactic constructs:

            {code:sql}
            CREATE TABLE t1 (
              a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci,
              b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci
            );
            {code}

            Notice, it will be enough to specify just uca1400_as_ci without having to type the full names:
            - utf8mb4_uca1400_as_ci for the column "a", or
            - utf16_uca1400_as_ci for the column "b".

            The full name will be detected automatically according to the effective character set.

            h2. INFORMATION_SCHEMA.COLLATIONS changes

            The following columns in INFORMATION_SCHEMA.COLLATIONS will be changed to be NULL-able:

            - CHARACTER_SET_NAME
            - ID
            - IS_DEFAULT

            All new UCA1400 collations will be displayed as follows:

            - The COLLATION_NAME column will display the short name, without character set prefix.
            - The three NULL-able column mentioned above will display NULL

            For old collations nothing will change.

            For example, the output from this query:
            {code:sql}
            SELECT * FROM INFORMATION_SCHEMA.COLLATIONS WHERE COLLATION_NAME LIKE '%czech%';
            {code}
            will look like this:
            {noformat}
            +---------------------------+--------------------+------+------------+-------------+---------+
            | COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN |
            +---------------------------+--------------------+------+------------+-------------+---------+
            | latin2_czech_cs | latin2 | 2 | | Yes | 4 |
            | cp1250_czech_cs | cp1250 | 34 | | Yes | 2 |
            | utf8mb3_czech_ci | utf8mb3 | 202 | | Yes | 8 |
            | ucs2_czech_ci | ucs2 | 138 | | Yes | 8 |
            | utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 |
            | uca1400_czech_ai_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_ai_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_as_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_as_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_ai_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_ai_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_as_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_as_cs | NULL | NULL | NULL | Yes | 8 |
            | utf16_czech_ci | utf16 | 111 | | Yes | 8 |
            | utf32_czech_ci | utf32 | 170 | | Yes | 8 |
            +---------------------------+--------------------+------+------------+-------------+---------+
            {noformat}

            The idea is that short collation names will be applicable to multiple character sets.
            We'll eventually remove columns CHARACTER_SET_NAME, ID, IS_DEFAULT from the table COLLATIONS. We'll keep them for now only for easier upgrades.

            In order to make utf8mb4 the default character set (see MDEV-19123), we need a reasonable default collation for utf8mb4.

            utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other.
            In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table.

            Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old.

            Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of MDEV-19123).

            Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version.

            h2. Character sets
            New collations will be added for these Unicode character sets
            - utf8mb3
            - utf8mb4
            - ucs2
            - utf16
            - utf32


            h2. Built-in contractions

            New collations will include built-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt

            There are 939 built-in contractions in UCA-14.0.0.

            To mention a few of them:
            {noformat}
            0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI>
            0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI>
            0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI>
            0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI>
            0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI>
            {noformat}

            Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built-in contractions.
            This will be a step towards better Unicode Collation Algorithm compliance. With built-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation.

            h2. Performance
            The patch adding UCA-14.0.0 collations will be pushed together with these patches improving the performance:
            - MDEV-27266 Improve UCA collation performance for utf8mb3 and utf8mb4
            - MDEV-27265 Improve contraction performance in UCA collations

            h2. Customization flags
            Under terms of this task we'll add support for the following flags
            - pad/nopad
            - accent sensitivity
            - case sensititivy

            Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available:
            {noformat}
            +---------------------------+
            | collation_name |
            +---------------------------+
            | uca1400_czech_ai_ci |
            | uca1400_czech_ai_cs |
            | uca1400_czech_as_ci |
            | uca1400_czech_as_cs |
            | uca1400_czech_nopad_ai_ci |
            | uca1400_czech_nopad_ai_cs |
            | uca1400_czech_nopad_as_ci |
            | uca1400_czech_nopad_as_cs |
            +---------------------------+
            {noformat}


            h2. Naming convention

            Naming convention for the new collations is described in MDEV-27210.

            h2. Short collation name

            The character set name in the collation name will be optional in all syntactic constructs:

            {code:sql}
            CREATE TABLE t1 (
              a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci,
              b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci
            );
            {code}

            Notice, it will be enough to specify just uca1400_as_ci without having to type the full names:
            - utf8mb4_uca1400_as_ci for the column "a", or
            - utf16_uca1400_as_ci for the column "b".

            The full name will be detected automatically according to the effective character set.

            h2. SHOW CREATE statements

            {{SHOW CREATE}} statements will display long collations names.

            For example, {{SHOW CREATE}} for the table created using the CREATE TABLE statement will produce the following output:

            {code:sql}
            SHOW CREATE TABLE t1;
            {code}
            {noformat}
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | Table | Create Table |
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | t1 | CREATE TABLE `t1` (
              `a` char(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_as_ci DEFAULT NULL,
              `b` char(10) CHARACTER SET utf16 COLLATE utf16_uca1400_as_ci DEFAULT NULL
            ) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            {noformat}
            Notice, the COLLATE clause will contain the full collation name, including the character set prefix.
            But this will possibly change in later versions to display short collation names.


            h2. INFORMATION_SCHEMA.COLLATIONS changes

            The following columns in INFORMATION_SCHEMA.COLLATIONS will be changed to be NULL-able:

            - CHARACTER_SET_NAME
            - ID
            - IS_DEFAULT

            All new UCA1400 collations will be displayed as follows:

            - The COLLATION_NAME column will display the short name, without character set prefix.
            - The three NULL-able column mentioned above will display NULL

            For old collations nothing will change.

            For example, the output from this query:
            {code:sql}
            SELECT * FROM INFORMATION_SCHEMA.COLLATIONS WHERE COLLATION_NAME LIKE '%czech%';
            {code}
            will look like this:
            {noformat}
            +---------------------------+--------------------+------+------------+-------------+---------+
            | COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN |
            +---------------------------+--------------------+------+------------+-------------+---------+
            | latin2_czech_cs | latin2 | 2 | | Yes | 4 |
            | cp1250_czech_cs | cp1250 | 34 | | Yes | 2 |
            | utf8mb3_czech_ci | utf8mb3 | 202 | | Yes | 8 |
            | ucs2_czech_ci | ucs2 | 138 | | Yes | 8 |
            | utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 |
            | uca1400_czech_ai_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_ai_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_as_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_as_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_ai_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_ai_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_as_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_as_cs | NULL | NULL | NULL | Yes | 8 |
            | utf16_czech_ci | utf16 | 111 | | Yes | 8 |
            | utf32_czech_ci | utf32 | 170 | | Yes | 8 |
            +---------------------------+--------------------+------+------------+-------------+---------+
            {noformat}

            The idea is that short collation names will be applicable to multiple character sets.
            We'll eventually remove columns CHARACTER_SET_NAME, ID, IS_DEFAULT from the table COLLATIONS. We'll keep them for now only for easier upgrades.

            bar Alexander Barkov made changes -
            Description In order to make utf8mb4 the default character set (see MDEV-19123), we need a reasonable default collation for utf8mb4.

            utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other.
            In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table.

            Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old.

            Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of MDEV-19123).

            Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version.

            h2. Character sets
            New collations will be added for these Unicode character sets
            - utf8mb3
            - utf8mb4
            - ucs2
            - utf16
            - utf32


            h2. Built-in contractions

            New collations will include built-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt

            There are 939 built-in contractions in UCA-14.0.0.

            To mention a few of them:
            {noformat}
            0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI>
            0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI>
            0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI>
            0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI>
            0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI>
            {noformat}

            Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built-in contractions.
            This will be a step towards better Unicode Collation Algorithm compliance. With built-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation.

            h2. Performance
            The patch adding UCA-14.0.0 collations will be pushed together with these patches improving the performance:
            - MDEV-27266 Improve UCA collation performance for utf8mb3 and utf8mb4
            - MDEV-27265 Improve contraction performance in UCA collations

            h2. Customization flags
            Under terms of this task we'll add support for the following flags
            - pad/nopad
            - accent sensitivity
            - case sensititivy

            Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available:
            {noformat}
            +---------------------------+
            | collation_name |
            +---------------------------+
            | uca1400_czech_ai_ci |
            | uca1400_czech_ai_cs |
            | uca1400_czech_as_ci |
            | uca1400_czech_as_cs |
            | uca1400_czech_nopad_ai_ci |
            | uca1400_czech_nopad_ai_cs |
            | uca1400_czech_nopad_as_ci |
            | uca1400_czech_nopad_as_cs |
            +---------------------------+
            {noformat}


            h2. Naming convention

            Naming convention for the new collations is described in MDEV-27210.

            h2. Short collation name

            The character set name in the collation name will be optional in all syntactic constructs:

            {code:sql}
            CREATE TABLE t1 (
              a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci,
              b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci
            );
            {code}

            Notice, it will be enough to specify just uca1400_as_ci without having to type the full names:
            - utf8mb4_uca1400_as_ci for the column "a", or
            - utf16_uca1400_as_ci for the column "b".

            The full name will be detected automatically according to the effective character set.

            h2. SHOW CREATE statements

            {{SHOW CREATE}} statements will display long collations names.

            For example, {{SHOW CREATE}} for the table created using the CREATE TABLE statement will produce the following output:

            {code:sql}
            SHOW CREATE TABLE t1;
            {code}
            {noformat}
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | Table | Create Table |
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | t1 | CREATE TABLE `t1` (
              `a` char(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_as_ci DEFAULT NULL,
              `b` char(10) CHARACTER SET utf16 COLLATE utf16_uca1400_as_ci DEFAULT NULL
            ) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            {noformat}
            Notice, the COLLATE clause will contain the full collation name, including the character set prefix.
            But this will possibly change in later versions to display short collation names.


            h2. INFORMATION_SCHEMA.COLLATIONS changes

            The following columns in INFORMATION_SCHEMA.COLLATIONS will be changed to be NULL-able:

            - CHARACTER_SET_NAME
            - ID
            - IS_DEFAULT

            All new UCA1400 collations will be displayed as follows:

            - The COLLATION_NAME column will display the short name, without character set prefix.
            - The three NULL-able column mentioned above will display NULL

            For old collations nothing will change.

            For example, the output from this query:
            {code:sql}
            SELECT * FROM INFORMATION_SCHEMA.COLLATIONS WHERE COLLATION_NAME LIKE '%czech%';
            {code}
            will look like this:
            {noformat}
            +---------------------------+--------------------+------+------------+-------------+---------+
            | COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN |
            +---------------------------+--------------------+------+------------+-------------+---------+
            | latin2_czech_cs | latin2 | 2 | | Yes | 4 |
            | cp1250_czech_cs | cp1250 | 34 | | Yes | 2 |
            | utf8mb3_czech_ci | utf8mb3 | 202 | | Yes | 8 |
            | ucs2_czech_ci | ucs2 | 138 | | Yes | 8 |
            | utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 |
            | uca1400_czech_ai_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_ai_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_as_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_as_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_ai_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_ai_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_as_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_as_cs | NULL | NULL | NULL | Yes | 8 |
            | utf16_czech_ci | utf16 | 111 | | Yes | 8 |
            | utf32_czech_ci | utf32 | 170 | | Yes | 8 |
            +---------------------------+--------------------+------+------------+-------------+---------+
            {noformat}

            The idea is that short collation names will be applicable to multiple character sets.
            We'll eventually remove columns CHARACTER_SET_NAME, ID, IS_DEFAULT from the table COLLATIONS. We'll keep them for now only for easier upgrades.

            In order to make utf8mb4 the default character set (see MDEV-19123), we need a reasonable default collation for utf8mb4.

            utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other.
            In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table.

            Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old.

            Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of MDEV-19123).

            Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version.

            h2. Character sets
            New collations will be added for these Unicode character sets
            - utf8mb3
            - utf8mb4
            - ucs2
            - utf16
            - utf32


            h2. Built-in contractions

            New collations will include built-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt

            There are 939 built-in contractions in UCA-14.0.0.

            To mention a few of them:
            {noformat}
            0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI>
            0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI>
            0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI>
            0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI>
            0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI>
            {noformat}

            Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built-in contractions.
            This will be a step towards better Unicode Collation Algorithm compliance. With built-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation.

            h2. Performance
            The patch adding UCA-14.0.0 collations will be pushed together with these patches improving the performance:
            - MDEV-27266 Improve UCA collation performance for utf8mb3 and utf8mb4
            - MDEV-27265 Improve contraction performance in UCA collations

            h2. Customization flags
            Under terms of this task we'll add support for the following flags
            - pad/nopad
            - accent sensitivity
            - case sensititivy

            Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available:
            {noformat}
            +---------------------------+
            | collation_name |
            +---------------------------+
            | uca1400_czech_ai_ci |
            | uca1400_czech_ai_cs |
            | uca1400_czech_as_ci |
            | uca1400_czech_as_cs |
            | uca1400_czech_nopad_ai_ci |
            | uca1400_czech_nopad_ai_cs |
            | uca1400_czech_nopad_as_ci |
            | uca1400_czech_nopad_as_cs |
            +---------------------------+
            {noformat}


            h2. Naming convention

            Naming convention for the new collations is described in MDEV-27210.

            h2. Short collation name

            The character set name in the collation name will be optional in all syntactic constructs:

            {code:sql}
            CREATE TABLE t1 (
              a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci,
              b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci
            );
            {code}

            Notice, it will be enough to specify just uca1400_as_ci without having to type the full names:
            - utf8mb4_uca1400_as_ci for the column "a", or
            - utf16_uca1400_as_ci for the column "b".

            The full name will be detected automatically according to the effective character set.

            h2. SHOW CREATE statements

            {{SHOW CREATE}} statements will display long collations names.

            For example, {{SHOW CREATE}} for the table created using the CREATE TABLE statement will produce the following output:

            {code:sql}
            SHOW CREATE TABLE t1;
            {code}
            {noformat}
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | Table | Create Table |
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | t1 | CREATE TABLE `t1` (
              `a` char(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_as_ci DEFAULT NULL,
              `b` char(10) CHARACTER SET utf16 COLLATE utf16_uca1400_as_ci DEFAULT NULL
            ) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            {noformat}
            Notice, the COLLATE clause will contain the full collation name, including the character set prefix.
            But this will possibly change in later versions to display short collation names.


            h2. INFORMATION_SCHEMA.COLLATIONS changes

            The following columns in INFORMATION_SCHEMA.COLLATIONS will be changed to be NULL-able:

            - CHARACTER_SET_NAME
            - ID
            - IS_DEFAULT

            All new UCA1400 collations will be displayed as follows:

            - The COLLATION_NAME column will display the short name, without character set prefix.
            - The three NULL-able column mentioned above will display NULL

            For old collations nothing will change.

            For example, the output from this query:
            {code:sql}
            SELECT * FROM INFORMATION_SCHEMA.COLLATIONS WHERE COLLATION_NAME LIKE '%czech%';
            {code}
            will look like this:
            {noformat}
            +---------------------------+--------------------+------+------------+-------------+---------+
            | COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN |
            +---------------------------+--------------------+------+------------+-------------+---------+
            | latin2_czech_cs | latin2 | 2 | | Yes | 4 |
            | cp1250_czech_cs | cp1250 | 34 | | Yes | 2 |
            | utf8mb3_czech_ci | utf8mb3 | 202 | | Yes | 8 |
            | ucs2_czech_ci | ucs2 | 138 | | Yes | 8 |
            | utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 |
            | uca1400_czech_ai_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_ai_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_as_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_as_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_ai_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_ai_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_as_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_as_cs | NULL | NULL | NULL | Yes | 8 |
            | utf16_czech_ci | utf16 | 111 | | Yes | 8 |
            | utf32_czech_ci | utf32 | 170 | | Yes | 8 |
            +---------------------------+--------------------+------+------------+-------------+---------+
            {noformat}

            The idea is that short collation names will be applicable to multiple character sets.
            We'll eventually remove columns CHARACTER_SET_NAME, ID, IS_DEFAULT from the table COLLATIONS. We'll keep them for now only for easier upgrades.

            h2. INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY changes

            In addition to columns COLLATION_NAME and CHARACTER_SET_NAME currently existing in the table INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY, three new columns will be added:
            - FULL_COLLATION_NAME
            - ID
            - IS_DEFAULT

            So the new structure for INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY will look like this:

            {code:sql}
            CREATE TEMPORARY TABLE `COLLATION_CHARACTER_SET_APPLICABILITY` (
              `COLLATION_NAME` varchar(64) NOT NULL,
              `CHARACTER_SET_NAME` varchar(32) NOT NULL,
              `FULL_COLLATION_NAME` varchar(64) NOT NULL,
              `ID` bigint(11) NOT NULL,
              `IS_DEFAULT` varchar(3) NOT NULL
            ) ENGINE=MEMORY DEFAULT CHARSET=utf8mb3;
            {code}


            The column COLLATION_NAME will display:
            - short names, without the character set name prefix, for new UCA1400 collations.
            - long names for all old collations


            So for example, the output from this query:
            {code:sql}
            SELECT * FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY WHERE COLLATION_NAME LIKE '%czech%cs';
            {code}
            will looks about like this:
            {noformat}
            +---------------------------+--------------------+-----------------------------------+------+------------+
            | COLLATION_NAME | CHARACTER_SET_NAME | FULL_COLLATION_NAME | ID | IS_DEFAULT |
            +---------------------------+--------------------+-----------------------------------+------+------------+
            | latin2_czech_cs | latin2 | latin2_czech_cs | 2 | |
            | cp1250_czech_cs | cp1250 | cp1250_czech_cs | 34 | |
            | uca1400_czech_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_ai_cs | 2129 | |
            | uca1400_czech_as_cs | utf8mb3 | utf8mb3_uca1400_czech_as_cs | 2131 | |
            | uca1400_czech_nopad_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_ai_cs | 2133 | |
            | uca1400_czech_nopad_as_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_as_cs | 2135 | |
            | uca1400_czech_ai_cs | ucs2 | ucs2_uca1400_czech_ai_cs | 2641 | |
            | uca1400_czech_as_cs | ucs2 | ucs2_uca1400_czech_as_cs | 2643 | |
            | uca1400_czech_nopad_ai_cs | ucs2 | ucs2_uca1400_czech_nopad_ai_cs | 2645 | |
            | uca1400_czech_nopad_as_cs | ucs2 | ucs2_uca1400_czech_nopad_as_cs | 2647 | |
            | uca1400_czech_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_ai_cs | 2385 | |
            | uca1400_czech_as_cs | utf8mb4 | utf8mb4_uca1400_czech_as_cs | 2387 | |
            | uca1400_czech_nopad_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_ai_cs | 2389 | |
            | uca1400_czech_nopad_as_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_as_cs | 2391 | |
            | uca1400_czech_ai_cs | utf16 | utf16_uca1400_czech_ai_cs | 2897 | |
            | uca1400_czech_as_cs | utf16 | utf16_uca1400_czech_as_cs | 2899 | |
            | uca1400_czech_nopad_ai_cs | utf16 | utf16_uca1400_czech_nopad_ai_cs | 2901 | |
            | uca1400_czech_nopad_as_cs | utf16 | utf16_uca1400_czech_nopad_as_cs | 2903 | |
            | uca1400_czech_ai_cs | utf32 | utf32_uca1400_czech_ai_cs | 3153 | |
            | uca1400_czech_as_cs | utf32 | utf32_uca1400_czech_as_cs | 3155 | |
            | uca1400_czech_nopad_ai_cs | utf32 | utf32_uca1400_czech_nopad_ai_cs | 3157 | |
            | uca1400_czech_nopad_as_cs | utf32 | utf32_uca1400_czech_nopad_as_cs | 3159 | |
            +---------------------------+--------------------+-----------------------------------+------+------------+
            {noformat}
            bar Alexander Barkov made changes -
            Description In order to make utf8mb4 the default character set (see MDEV-19123), we need a reasonable default collation for utf8mb4.

            utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other.
            In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table.

            Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old.

            Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of MDEV-19123).

            Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version.

            h2. Character sets
            New collations will be added for these Unicode character sets
            - utf8mb3
            - utf8mb4
            - ucs2
            - utf16
            - utf32


            h2. Built-in contractions

            New collations will include built-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt

            There are 939 built-in contractions in UCA-14.0.0.

            To mention a few of them:
            {noformat}
            0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI>
            0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI>
            0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI>
            0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI>
            0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI>
            {noformat}

            Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built-in contractions.
            This will be a step towards better Unicode Collation Algorithm compliance. With built-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation.

            h2. Performance
            The patch adding UCA-14.0.0 collations will be pushed together with these patches improving the performance:
            - MDEV-27266 Improve UCA collation performance for utf8mb3 and utf8mb4
            - MDEV-27265 Improve contraction performance in UCA collations

            h2. Customization flags
            Under terms of this task we'll add support for the following flags
            - pad/nopad
            - accent sensitivity
            - case sensititivy

            Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available:
            {noformat}
            +---------------------------+
            | collation_name |
            +---------------------------+
            | uca1400_czech_ai_ci |
            | uca1400_czech_ai_cs |
            | uca1400_czech_as_ci |
            | uca1400_czech_as_cs |
            | uca1400_czech_nopad_ai_ci |
            | uca1400_czech_nopad_ai_cs |
            | uca1400_czech_nopad_as_ci |
            | uca1400_czech_nopad_as_cs |
            +---------------------------+
            {noformat}


            h2. Naming convention

            Naming convention for the new collations is described in MDEV-27210.

            h2. Short collation name

            The character set name in the collation name will be optional in all syntactic constructs:

            {code:sql}
            CREATE TABLE t1 (
              a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci,
              b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci
            );
            {code}

            Notice, it will be enough to specify just uca1400_as_ci without having to type the full names:
            - utf8mb4_uca1400_as_ci for the column "a", or
            - utf16_uca1400_as_ci for the column "b".

            The full name will be detected automatically according to the effective character set.

            h2. SHOW CREATE statements

            {{SHOW CREATE}} statements will display long collations names.

            For example, {{SHOW CREATE}} for the table created using the CREATE TABLE statement will produce the following output:

            {code:sql}
            SHOW CREATE TABLE t1;
            {code}
            {noformat}
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | Table | Create Table |
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | t1 | CREATE TABLE `t1` (
              `a` char(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_as_ci DEFAULT NULL,
              `b` char(10) CHARACTER SET utf16 COLLATE utf16_uca1400_as_ci DEFAULT NULL
            ) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            {noformat}
            Notice, the COLLATE clause will contain the full collation name, including the character set prefix.
            But this will possibly change in later versions to display short collation names.


            h2. INFORMATION_SCHEMA.COLLATIONS changes

            The following columns in INFORMATION_SCHEMA.COLLATIONS will be changed to be NULL-able:

            - CHARACTER_SET_NAME
            - ID
            - IS_DEFAULT

            All new UCA1400 collations will be displayed as follows:

            - The COLLATION_NAME column will display the short name, without character set prefix.
            - The three NULL-able column mentioned above will display NULL

            For old collations nothing will change.

            For example, the output from this query:
            {code:sql}
            SELECT * FROM INFORMATION_SCHEMA.COLLATIONS WHERE COLLATION_NAME LIKE '%czech%';
            {code}
            will look like this:
            {noformat}
            +---------------------------+--------------------+------+------------+-------------+---------+
            | COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN |
            +---------------------------+--------------------+------+------------+-------------+---------+
            | latin2_czech_cs | latin2 | 2 | | Yes | 4 |
            | cp1250_czech_cs | cp1250 | 34 | | Yes | 2 |
            | utf8mb3_czech_ci | utf8mb3 | 202 | | Yes | 8 |
            | ucs2_czech_ci | ucs2 | 138 | | Yes | 8 |
            | utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 |
            | uca1400_czech_ai_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_ai_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_as_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_as_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_ai_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_ai_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_as_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_as_cs | NULL | NULL | NULL | Yes | 8 |
            | utf16_czech_ci | utf16 | 111 | | Yes | 8 |
            | utf32_czech_ci | utf32 | 170 | | Yes | 8 |
            +---------------------------+--------------------+------+------------+-------------+---------+
            {noformat}

            The idea is that short collation names will be applicable to multiple character sets.
            We'll eventually remove columns CHARACTER_SET_NAME, ID, IS_DEFAULT from the table COLLATIONS. We'll keep them for now only for easier upgrades.

            h2. INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY changes

            In addition to columns COLLATION_NAME and CHARACTER_SET_NAME currently existing in the table INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY, three new columns will be added:
            - FULL_COLLATION_NAME
            - ID
            - IS_DEFAULT

            So the new structure for INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY will look like this:

            {code:sql}
            CREATE TEMPORARY TABLE `COLLATION_CHARACTER_SET_APPLICABILITY` (
              `COLLATION_NAME` varchar(64) NOT NULL,
              `CHARACTER_SET_NAME` varchar(32) NOT NULL,
              `FULL_COLLATION_NAME` varchar(64) NOT NULL,
              `ID` bigint(11) NOT NULL,
              `IS_DEFAULT` varchar(3) NOT NULL
            ) ENGINE=MEMORY DEFAULT CHARSET=utf8mb3;
            {code}


            The column COLLATION_NAME will display:
            - short names, without the character set name prefix, for new UCA1400 collations.
            - long names for all old collations


            So for example, the output from this query:
            {code:sql}
            SELECT * FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY WHERE COLLATION_NAME LIKE '%czech%cs';
            {code}
            will looks about like this:
            {noformat}
            +---------------------------+--------------------+-----------------------------------+------+------------+
            | COLLATION_NAME | CHARACTER_SET_NAME | FULL_COLLATION_NAME | ID | IS_DEFAULT |
            +---------------------------+--------------------+-----------------------------------+------+------------+
            | latin2_czech_cs | latin2 | latin2_czech_cs | 2 | |
            | cp1250_czech_cs | cp1250 | cp1250_czech_cs | 34 | |
            | uca1400_czech_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_ai_cs | 2129 | |
            | uca1400_czech_as_cs | utf8mb3 | utf8mb3_uca1400_czech_as_cs | 2131 | |
            | uca1400_czech_nopad_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_ai_cs | 2133 | |
            | uca1400_czech_nopad_as_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_as_cs | 2135 | |
            | uca1400_czech_ai_cs | ucs2 | ucs2_uca1400_czech_ai_cs | 2641 | |
            | uca1400_czech_as_cs | ucs2 | ucs2_uca1400_czech_as_cs | 2643 | |
            | uca1400_czech_nopad_ai_cs | ucs2 | ucs2_uca1400_czech_nopad_ai_cs | 2645 | |
            | uca1400_czech_nopad_as_cs | ucs2 | ucs2_uca1400_czech_nopad_as_cs | 2647 | |
            | uca1400_czech_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_ai_cs | 2385 | |
            | uca1400_czech_as_cs | utf8mb4 | utf8mb4_uca1400_czech_as_cs | 2387 | |
            | uca1400_czech_nopad_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_ai_cs | 2389 | |
            | uca1400_czech_nopad_as_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_as_cs | 2391 | |
            | uca1400_czech_ai_cs | utf16 | utf16_uca1400_czech_ai_cs | 2897 | |
            | uca1400_czech_as_cs | utf16 | utf16_uca1400_czech_as_cs | 2899 | |
            | uca1400_czech_nopad_ai_cs | utf16 | utf16_uca1400_czech_nopad_ai_cs | 2901 | |
            | uca1400_czech_nopad_as_cs | utf16 | utf16_uca1400_czech_nopad_as_cs | 2903 | |
            | uca1400_czech_ai_cs | utf32 | utf32_uca1400_czech_ai_cs | 3153 | |
            | uca1400_czech_as_cs | utf32 | utf32_uca1400_czech_as_cs | 3155 | |
            | uca1400_czech_nopad_ai_cs | utf32 | utf32_uca1400_czech_nopad_ai_cs | 3157 | |
            | uca1400_czech_nopad_as_cs | utf32 | utf32_uca1400_czech_nopad_as_cs | 3159 | |
            +---------------------------+--------------------+-----------------------------------+------+------------+
            {noformat}
            In order to make utf8mb4 the default character set (see MDEV-19123), we need a reasonable default collation for utf8mb4.

            utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other.
            In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table.

            Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old.

            Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of MDEV-19123).

            Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version.

            h2. Character sets
            New collations will be added for these Unicode character sets
            - utf8mb3
            - utf8mb4
            - ucs2
            - utf16
            - utf32


            h2. Built-in contractions

            New collations will include built-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt

            There are 939 built-in contractions in UCA-14.0.0.

            To mention a few of them:
            {noformat}
            0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI>
            0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI>
            0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI>
            0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI>
            0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI>
            {noformat}

            Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built-in contractions.
            This will be a step towards better Unicode Collation Algorithm compliance. With built-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation.

            h2. Performance
            The patch adding UCA-14.0.0 collations will be pushed together with these patches improving the performance:
            - MDEV-27266 Improve UCA collation performance for utf8mb3 and utf8mb4
            - MDEV-27265 Improve contraction performance in UCA collations

            h2. Customization flags
            Under terms of this task we'll add support for the following flags
            - pad/nopad
            - accent sensitivity
            - case sensititivy

            Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available:
            {noformat}
            +---------------------------+
            | collation_name |
            +---------------------------+
            | uca1400_czech_ai_ci |
            | uca1400_czech_ai_cs |
            | uca1400_czech_as_ci |
            | uca1400_czech_as_cs |
            | uca1400_czech_nopad_ai_ci |
            | uca1400_czech_nopad_ai_cs |
            | uca1400_czech_nopad_as_ci |
            | uca1400_czech_nopad_as_cs |
            +---------------------------+
            {noformat}


            h2. Naming convention

            Naming convention for the new collations is described in MDEV-27210.

            h2. Short collation name

            The character set name in the collation name will be optional in all syntactic constructs:

            {code:sql}
            CREATE TABLE t1 (
              a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci,
              b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci
            );
            {code}

            Notice, it will be enough to specify just uca1400_as_ci without having to type the full names:
            - utf8mb4_uca1400_as_ci for the column "a", or
            - utf16_uca1400_as_ci for the column "b".

            The full name will be detected automatically according to the effective character set.

            h2. SHOW CREATE statements

            {{SHOW CREATE}} statements will display long collations names.

            For example, {{SHOW CREATE}} for the table created using the CREATE TABLE statement will produce the following output:

            {code:sql}
            SHOW CREATE TABLE t1;
            {code}
            {noformat}
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | Table | Create Table |
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | t1 | CREATE TABLE `t1` (
              `a` char(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_as_ci DEFAULT NULL,
              `b` char(10) CHARACTER SET utf16 COLLATE utf16_uca1400_as_ci DEFAULT NULL
            ) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            {noformat}
            Notice, the COLLATE clause will contain the full collation name, including the character set prefix.
            But this will possibly change in later versions to display short collation names.


            h2. INFORMATION_SCHEMA.COLLATIONS changes

            The following columns in INFORMATION_SCHEMA.COLLATIONS will be changed to be NULL-able:

            - CHARACTER_SET_NAME
            - ID
            - IS_DEFAULT

            All new UCA1400 collations will be displayed as follows:

            - The COLLATION_NAME column will display the short name, without character set prefix.
            - The three NULL-able column mentioned above will display NULL

            For old collations nothing will change.

            For example, the output from this query:
            {code:sql}
            SELECT * FROM INFORMATION_SCHEMA.COLLATIONS WHERE COLLATION_NAME LIKE '%czech%';
            {code}
            will look like this:
            {noformat}
            +---------------------------+--------------------+------+------------+-------------+---------+
            | COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN |
            +---------------------------+--------------------+------+------------+-------------+---------+
            | latin2_czech_cs | latin2 | 2 | | Yes | 4 |
            | cp1250_czech_cs | cp1250 | 34 | | Yes | 2 |
            | utf8mb3_czech_ci | utf8mb3 | 202 | | Yes | 8 |
            | ucs2_czech_ci | ucs2 | 138 | | Yes | 8 |
            | utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 |
            | uca1400_czech_ai_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_ai_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_as_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_as_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_ai_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_ai_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_as_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_as_cs | NULL | NULL | NULL | Yes | 8 |
            | utf16_czech_ci | utf16 | 111 | | Yes | 8 |
            | utf32_czech_ci | utf32 | 170 | | Yes | 8 |
            +---------------------------+--------------------+------+------------+-------------+---------+
            {noformat}

            The idea is that short collation names will be applicable to multiple character sets.
            We'll eventually remove columns CHARACTER_SET_NAME, ID, IS_DEFAULT from the table COLLATIONS. We'll keep them for now only for easier upgrades.

            h2. INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY changes

            In addition to columns COLLATION_NAME and CHARACTER_SET_NAME currently existing in the table INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY, three new columns will be added:
            - FULL_COLLATION_NAME
            - ID
            - IS_DEFAULT

            So the new structure for INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY will look like this:

            {code:sql}
            CREATE TEMPORARY TABLE `COLLATION_CHARACTER_SET_APPLICABILITY` (
              `COLLATION_NAME` varchar(64) NOT NULL,
              `CHARACTER_SET_NAME` varchar(32) NOT NULL,
              `FULL_COLLATION_NAME` varchar(64) NOT NULL,
              `ID` bigint(11) NOT NULL,
              `IS_DEFAULT` varchar(3) NOT NULL
            ) ENGINE=MEMORY DEFAULT CHARSET=utf8mb3;
            {code}


            The column COLLATION_NAME will display:
            - short names, without the character set name prefix, for new UCA1400 collations.
            - long names for all old collations


            So for example, the output from this query:
            {code:sql}
            SELECT * FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY WHERE COLLATION_NAME LIKE '%czech%cs';
            {code}
            will look about like this:
            {noformat}
            +---------------------------+--------------------+-----------------------------------+------+------------+
            | COLLATION_NAME | CHARACTER_SET_NAME | FULL_COLLATION_NAME | ID | IS_DEFAULT |
            +---------------------------+--------------------+-----------------------------------+------+------------+
            | latin2_czech_cs | latin2 | latin2_czech_cs | 2 | |
            | cp1250_czech_cs | cp1250 | cp1250_czech_cs | 34 | |
            | uca1400_czech_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_ai_cs | 2129 | |
            | uca1400_czech_as_cs | utf8mb3 | utf8mb3_uca1400_czech_as_cs | 2131 | |
            | uca1400_czech_nopad_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_ai_cs | 2133 | |
            | uca1400_czech_nopad_as_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_as_cs | 2135 | |
            | uca1400_czech_ai_cs | ucs2 | ucs2_uca1400_czech_ai_cs | 2641 | |
            | uca1400_czech_as_cs | ucs2 | ucs2_uca1400_czech_as_cs | 2643 | |
            | uca1400_czech_nopad_ai_cs | ucs2 | ucs2_uca1400_czech_nopad_ai_cs | 2645 | |
            | uca1400_czech_nopad_as_cs | ucs2 | ucs2_uca1400_czech_nopad_as_cs | 2647 | |
            | uca1400_czech_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_ai_cs | 2385 | |
            | uca1400_czech_as_cs | utf8mb4 | utf8mb4_uca1400_czech_as_cs | 2387 | |
            | uca1400_czech_nopad_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_ai_cs | 2389 | |
            | uca1400_czech_nopad_as_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_as_cs | 2391 | |
            | uca1400_czech_ai_cs | utf16 | utf16_uca1400_czech_ai_cs | 2897 | |
            | uca1400_czech_as_cs | utf16 | utf16_uca1400_czech_as_cs | 2899 | |
            | uca1400_czech_nopad_ai_cs | utf16 | utf16_uca1400_czech_nopad_ai_cs | 2901 | |
            | uca1400_czech_nopad_as_cs | utf16 | utf16_uca1400_czech_nopad_as_cs | 2903 | |
            | uca1400_czech_ai_cs | utf32 | utf32_uca1400_czech_ai_cs | 3153 | |
            | uca1400_czech_as_cs | utf32 | utf32_uca1400_czech_as_cs | 3155 | |
            | uca1400_czech_nopad_ai_cs | utf32 | utf32_uca1400_czech_nopad_ai_cs | 3157 | |
            | uca1400_czech_nopad_as_cs | utf32 | utf32_uca1400_czech_nopad_as_cs | 3159 | |
            +---------------------------+--------------------+-----------------------------------+------+------------+
            {noformat}
            bar Alexander Barkov made changes -
            Description In order to make utf8mb4 the default character set (see MDEV-19123), we need a reasonable default collation for utf8mb4.

            utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other.
            In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table.

            Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old.

            Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of MDEV-19123).

            Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version.

            h2. Character sets
            New collations will be added for these Unicode character sets
            - utf8mb3
            - utf8mb4
            - ucs2
            - utf16
            - utf32


            h2. Built-in contractions

            New collations will include built-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt

            There are 939 built-in contractions in UCA-14.0.0.

            To mention a few of them:
            {noformat}
            0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI>
            0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI>
            0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI>
            0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI>
            0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI>
            {noformat}

            Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built-in contractions.
            This will be a step towards better Unicode Collation Algorithm compliance. With built-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation.

            h2. Performance
            The patch adding UCA-14.0.0 collations will be pushed together with these patches improving the performance:
            - MDEV-27266 Improve UCA collation performance for utf8mb3 and utf8mb4
            - MDEV-27265 Improve contraction performance in UCA collations

            h2. Customization flags
            Under terms of this task we'll add support for the following flags
            - pad/nopad
            - accent sensitivity
            - case sensititivy

            Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available:
            {noformat}
            +---------------------------+
            | collation_name |
            +---------------------------+
            | uca1400_czech_ai_ci |
            | uca1400_czech_ai_cs |
            | uca1400_czech_as_ci |
            | uca1400_czech_as_cs |
            | uca1400_czech_nopad_ai_ci |
            | uca1400_czech_nopad_ai_cs |
            | uca1400_czech_nopad_as_ci |
            | uca1400_czech_nopad_as_cs |
            +---------------------------+
            {noformat}


            h2. Naming convention

            Naming convention for the new collations is described in MDEV-27210.

            h2. Short collation name

            The character set name in the collation name will be optional in all syntactic constructs:

            {code:sql}
            CREATE TABLE t1 (
              a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci,
              b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci
            );
            {code}

            Notice, it will be enough to specify just uca1400_as_ci without having to type the full names:
            - utf8mb4_uca1400_as_ci for the column "a", or
            - utf16_uca1400_as_ci for the column "b".

            The full name will be detected automatically according to the effective character set.

            h2. SHOW CREATE statements

            {{SHOW CREATE}} statements will display long collations names.

            For example, {{SHOW CREATE}} for the table created using the CREATE TABLE statement will produce the following output:

            {code:sql}
            SHOW CREATE TABLE t1;
            {code}
            {noformat}
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | Table | Create Table |
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | t1 | CREATE TABLE `t1` (
              `a` char(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_as_ci DEFAULT NULL,
              `b` char(10) CHARACTER SET utf16 COLLATE utf16_uca1400_as_ci DEFAULT NULL
            ) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            {noformat}
            Notice, the COLLATE clause will contain the full collation name, including the character set prefix.
            But this will possibly change in later versions to display short collation names.


            h2. INFORMATION_SCHEMA.COLLATIONS changes

            The following columns in INFORMATION_SCHEMA.COLLATIONS will be changed to be NULL-able:

            - CHARACTER_SET_NAME
            - ID
            - IS_DEFAULT

            All new UCA1400 collations will be displayed as follows:

            - The COLLATION_NAME column will display the short name, without character set prefix.
            - The three NULL-able column mentioned above will display NULL

            For old collations nothing will change.

            For example, the output from this query:
            {code:sql}
            SELECT * FROM INFORMATION_SCHEMA.COLLATIONS WHERE COLLATION_NAME LIKE '%czech%';
            {code}
            will look like this:
            {noformat}
            +---------------------------+--------------------+------+------------+-------------+---------+
            | COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN |
            +---------------------------+--------------------+------+------------+-------------+---------+
            | latin2_czech_cs | latin2 | 2 | | Yes | 4 |
            | cp1250_czech_cs | cp1250 | 34 | | Yes | 2 |
            | utf8mb3_czech_ci | utf8mb3 | 202 | | Yes | 8 |
            | ucs2_czech_ci | ucs2 | 138 | | Yes | 8 |
            | utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 |
            | uca1400_czech_ai_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_ai_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_as_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_as_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_ai_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_ai_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_as_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_as_cs | NULL | NULL | NULL | Yes | 8 |
            | utf16_czech_ci | utf16 | 111 | | Yes | 8 |
            | utf32_czech_ci | utf32 | 170 | | Yes | 8 |
            +---------------------------+--------------------+------+------------+-------------+---------+
            {noformat}

            The idea is that short collation names will be applicable to multiple character sets.
            We'll eventually remove columns CHARACTER_SET_NAME, ID, IS_DEFAULT from the table COLLATIONS. We'll keep them for now only for easier upgrades.

            h2. INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY changes

            In addition to columns COLLATION_NAME and CHARACTER_SET_NAME currently existing in the table INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY, three new columns will be added:
            - FULL_COLLATION_NAME
            - ID
            - IS_DEFAULT

            So the new structure for INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY will look like this:

            {code:sql}
            CREATE TEMPORARY TABLE `COLLATION_CHARACTER_SET_APPLICABILITY` (
              `COLLATION_NAME` varchar(64) NOT NULL,
              `CHARACTER_SET_NAME` varchar(32) NOT NULL,
              `FULL_COLLATION_NAME` varchar(64) NOT NULL,
              `ID` bigint(11) NOT NULL,
              `IS_DEFAULT` varchar(3) NOT NULL
            ) ENGINE=MEMORY DEFAULT CHARSET=utf8mb3;
            {code}


            The column COLLATION_NAME will display:
            - short names, without the character set name prefix, for new UCA1400 collations.
            - long names for all old collations


            So for example, the output from this query:
            {code:sql}
            SELECT * FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY WHERE COLLATION_NAME LIKE '%czech%cs';
            {code}
            will look about like this:
            {noformat}
            +---------------------------+--------------------+-----------------------------------+------+------------+
            | COLLATION_NAME | CHARACTER_SET_NAME | FULL_COLLATION_NAME | ID | IS_DEFAULT |
            +---------------------------+--------------------+-----------------------------------+------+------------+
            | latin2_czech_cs | latin2 | latin2_czech_cs | 2 | |
            | cp1250_czech_cs | cp1250 | cp1250_czech_cs | 34 | |
            | uca1400_czech_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_ai_cs | 2129 | |
            | uca1400_czech_as_cs | utf8mb3 | utf8mb3_uca1400_czech_as_cs | 2131 | |
            | uca1400_czech_nopad_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_ai_cs | 2133 | |
            | uca1400_czech_nopad_as_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_as_cs | 2135 | |
            | uca1400_czech_ai_cs | ucs2 | ucs2_uca1400_czech_ai_cs | 2641 | |
            | uca1400_czech_as_cs | ucs2 | ucs2_uca1400_czech_as_cs | 2643 | |
            | uca1400_czech_nopad_ai_cs | ucs2 | ucs2_uca1400_czech_nopad_ai_cs | 2645 | |
            | uca1400_czech_nopad_as_cs | ucs2 | ucs2_uca1400_czech_nopad_as_cs | 2647 | |
            | uca1400_czech_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_ai_cs | 2385 | |
            | uca1400_czech_as_cs | utf8mb4 | utf8mb4_uca1400_czech_as_cs | 2387 | |
            | uca1400_czech_nopad_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_ai_cs | 2389 | |
            | uca1400_czech_nopad_as_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_as_cs | 2391 | |
            | uca1400_czech_ai_cs | utf16 | utf16_uca1400_czech_ai_cs | 2897 | |
            | uca1400_czech_as_cs | utf16 | utf16_uca1400_czech_as_cs | 2899 | |
            | uca1400_czech_nopad_ai_cs | utf16 | utf16_uca1400_czech_nopad_ai_cs | 2901 | |
            | uca1400_czech_nopad_as_cs | utf16 | utf16_uca1400_czech_nopad_as_cs | 2903 | |
            | uca1400_czech_ai_cs | utf32 | utf32_uca1400_czech_ai_cs | 3153 | |
            | uca1400_czech_as_cs | utf32 | utf32_uca1400_czech_as_cs | 3155 | |
            | uca1400_czech_nopad_ai_cs | utf32 | utf32_uca1400_czech_nopad_ai_cs | 3157 | |
            | uca1400_czech_nopad_as_cs | utf32 | utf32_uca1400_czech_nopad_as_cs | 3159 | |
            +---------------------------+--------------------+-----------------------------------+------+------------+
            {noformat}
            In order to make utf8mb4 the default character set (see MDEV-19123), we need a reasonable default collation for utf8mb4.

            utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other.
            In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table.

            Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old.

            Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of MDEV-19123).

            Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version.

            h2. Character sets
            New collations will be added for these Unicode character sets
            - utf8mb3
            - utf8mb4
            - ucs2
            - utf16
            - utf32


            h2. Built-in contractions

            New collations will include built-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt

            There are 939 built-in contractions in UCA-14.0.0.

            To mention a few of them:
            {noformat}
            0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI>
            0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI>
            0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI>
            0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI>
            0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI>
            {noformat}

            Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built-in contractions.
            This will be a step towards better Unicode Collation Algorithm compliance. With built-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation.

            h2. Performance
            The patch adding UCA-14.0.0 collations will be pushed together with these patches improving the performance:
            - MDEV-27266 Improve UCA collation performance for utf8mb3 and utf8mb4
            - MDEV-27265 Improve contraction performance in UCA collations

            h2. Customization flags
            Under terms of this task we'll add support for the following flags
            - pad/nopad
            - accent sensitivity
            - case sensititivy

            Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available:
            {noformat}
            +---------------------------+
            | collation_name |
            +---------------------------+
            | uca1400_czech_ai_ci |
            | uca1400_czech_ai_cs |
            | uca1400_czech_as_ci |
            | uca1400_czech_as_cs |
            | uca1400_czech_nopad_ai_ci |
            | uca1400_czech_nopad_ai_cs |
            | uca1400_czech_nopad_as_ci |
            | uca1400_czech_nopad_as_cs |
            +---------------------------+
            {noformat}


            h2. Naming convention

            Naming convention for the new collations is described in MDEV-27210.

            h2. Short collation name

            The character set name in the collation name will be optional in all syntactic constructs:

            {code:sql}
            CREATE TABLE t1 (
              a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci,
              b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci
            );
            {code}

            Notice, it will be enough to specify just uca1400_as_ci without having to type the full names:
            - utf8mb4_uca1400_as_ci for the column "a", or
            - utf16_uca1400_as_ci for the column "b".

            The full name will be detected automatically according to the effective character set.

            h2. SHOW CREATE statements

            {{SHOW CREATE}} statements will display long collations names.

            For example, {{SHOW CREATE}} for the table created using the CREATE TABLE statement will produce the following output:

            {code:sql}
            SHOW CREATE TABLE t1;
            {code}
            {noformat}
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | Table | Create Table |
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | t1 | CREATE TABLE `t1` (
              `a` char(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_as_ci DEFAULT NULL,
              `b` char(10) CHARACTER SET utf16 COLLATE utf16_uca1400_as_ci DEFAULT NULL
            ) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            {noformat}
            Notice, the COLLATE clause will contain the full collation name, including the character set prefix.
            But this will possibly change in later versions to display short collation names.


            h2. INFORMATION_SCHEMA.COLLATIONS changes

            The following columns in INFORMATION_SCHEMA.COLLATIONS will be changed to be NULL-able:

            - CHARACTER_SET_NAME
            - ID
            - IS_DEFAULT

            All new UCA1400 collations will be displayed as follows:

            - The COLLATION_NAME column will display the short name, without character set prefix.
            - The three NULL-able column mentioned above will display NULL

            For old collations nothing will change.

            For example, the output from this query:
            {code:sql}
            SELECT * FROM INFORMATION_SCHEMA.COLLATIONS WHERE COLLATION_NAME LIKE '%czech%';
            {code}
            will look like this:
            {noformat}
            +---------------------------+--------------------+------+------------+-------------+---------+
            | COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN |
            +---------------------------+--------------------+------+------------+-------------+---------+
            | latin2_czech_cs | latin2 | 2 | | Yes | 4 |
            | cp1250_czech_cs | cp1250 | 34 | | Yes | 2 |
            | utf8mb3_czech_ci | utf8mb3 | 202 | | Yes | 8 |
            | ucs2_czech_ci | ucs2 | 138 | | Yes | 8 |
            | utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 |
            | uca1400_czech_ai_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_ai_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_as_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_as_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_ai_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_ai_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_as_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_as_cs | NULL | NULL | NULL | Yes | 8 |
            | utf16_czech_ci | utf16 | 111 | | Yes | 8 |
            | utf32_czech_ci | utf32 | 170 | | Yes | 8 |
            +---------------------------+--------------------+------+------------+-------------+---------+
            {noformat}

            The idea is that short collation names will be applicable to multiple character sets.
            We'll eventually remove columns CHARACTER_SET_NAME, ID, IS_DEFAULT from the table COLLATIONS. We'll keep them for now only for easier upgrades.

            h2. INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY changes

            In addition to columns COLLATION_NAME and CHARACTER_SET_NAME currently existing in the table INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY, three new columns will be added:
            - FULL_COLLATION_NAME
            - ID
            - IS_DEFAULT

            So the new structure for INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY will look like this:

            {code:sql}
            CREATE TEMPORARY TABLE `COLLATION_CHARACTER_SET_APPLICABILITY` (
              `COLLATION_NAME` varchar(64) NOT NULL,
              `CHARACTER_SET_NAME` varchar(32) NOT NULL,
              `FULL_COLLATION_NAME` varchar(64) NOT NULL,
              `ID` bigint(11) NOT NULL,
              `IS_DEFAULT` varchar(3) NOT NULL
            ) ENGINE=MEMORY DEFAULT CHARSET=utf8mb3;
            {code}


            The column COLLATION_NAME will display:
            - short names, without the character set name prefix, for new UCA1400 collations.
            - long names for all old collations


            So for example, the output from this query:
            {code:sql}
            SELECT * FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY
            WHERE COLLATION_NAME LIKE '%czech%cs';
            {code}
            will look about like this:
            {noformat}
            +---------------------------+--------------------+-----------------------------------+------+------------+
            | COLLATION_NAME | CHARACTER_SET_NAME | FULL_COLLATION_NAME | ID | IS_DEFAULT |
            +---------------------------+--------------------+-----------------------------------+------+------------+
            | latin2_czech_cs | latin2 | latin2_czech_cs | 2 | |
            | cp1250_czech_cs | cp1250 | cp1250_czech_cs | 34 | |
            | uca1400_czech_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_ai_cs | 2129 | |
            | uca1400_czech_as_cs | utf8mb3 | utf8mb3_uca1400_czech_as_cs | 2131 | |
            | uca1400_czech_nopad_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_ai_cs | 2133 | |
            | uca1400_czech_nopad_as_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_as_cs | 2135 | |
            | uca1400_czech_ai_cs | ucs2 | ucs2_uca1400_czech_ai_cs | 2641 | |
            | uca1400_czech_as_cs | ucs2 | ucs2_uca1400_czech_as_cs | 2643 | |
            | uca1400_czech_nopad_ai_cs | ucs2 | ucs2_uca1400_czech_nopad_ai_cs | 2645 | |
            | uca1400_czech_nopad_as_cs | ucs2 | ucs2_uca1400_czech_nopad_as_cs | 2647 | |
            | uca1400_czech_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_ai_cs | 2385 | |
            | uca1400_czech_as_cs | utf8mb4 | utf8mb4_uca1400_czech_as_cs | 2387 | |
            | uca1400_czech_nopad_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_ai_cs | 2389 | |
            | uca1400_czech_nopad_as_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_as_cs | 2391 | |
            | uca1400_czech_ai_cs | utf16 | utf16_uca1400_czech_ai_cs | 2897 | |
            | uca1400_czech_as_cs | utf16 | utf16_uca1400_czech_as_cs | 2899 | |
            | uca1400_czech_nopad_ai_cs | utf16 | utf16_uca1400_czech_nopad_ai_cs | 2901 | |
            | uca1400_czech_nopad_as_cs | utf16 | utf16_uca1400_czech_nopad_as_cs | 2903 | |
            | uca1400_czech_ai_cs | utf32 | utf32_uca1400_czech_ai_cs | 3153 | |
            | uca1400_czech_as_cs | utf32 | utf32_uca1400_czech_as_cs | 3155 | |
            | uca1400_czech_nopad_ai_cs | utf32 | utf32_uca1400_czech_nopad_ai_cs | 3157 | |
            | uca1400_czech_nopad_as_cs | utf32 | utf32_uca1400_czech_nopad_as_cs | 3159 | |
            +---------------------------+--------------------+-----------------------------------+------+------------+
            {noformat}
            bar Alexander Barkov made changes -
            Description In order to make utf8mb4 the default character set (see MDEV-19123), we need a reasonable default collation for utf8mb4.

            utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other.
            In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table.

            Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old.

            Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of MDEV-19123).

            Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version.

            h2. Character sets
            New collations will be added for these Unicode character sets
            - utf8mb3
            - utf8mb4
            - ucs2
            - utf16
            - utf32


            h2. Built-in contractions

            New collations will include built-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt

            There are 939 built-in contractions in UCA-14.0.0.

            To mention a few of them:
            {noformat}
            0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI>
            0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI>
            0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI>
            0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI>
            0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI>
            {noformat}

            Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built-in contractions.
            This will be a step towards better Unicode Collation Algorithm compliance. With built-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation.

            h2. Performance
            The patch adding UCA-14.0.0 collations will be pushed together with these patches improving the performance:
            - MDEV-27266 Improve UCA collation performance for utf8mb3 and utf8mb4
            - MDEV-27265 Improve contraction performance in UCA collations

            h2. Customization flags
            Under terms of this task we'll add support for the following flags
            - pad/nopad
            - accent sensitivity
            - case sensititivy

            Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available:
            {noformat}
            +---------------------------+
            | collation_name |
            +---------------------------+
            | uca1400_czech_ai_ci |
            | uca1400_czech_ai_cs |
            | uca1400_czech_as_ci |
            | uca1400_czech_as_cs |
            | uca1400_czech_nopad_ai_ci |
            | uca1400_czech_nopad_ai_cs |
            | uca1400_czech_nopad_as_ci |
            | uca1400_czech_nopad_as_cs |
            +---------------------------+
            {noformat}


            h2. Naming convention

            Naming convention for the new collations is described in MDEV-27210.

            h2. Short collation name

            The character set name in the collation name will be optional in all syntactic constructs:

            {code:sql}
            CREATE TABLE t1 (
              a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci,
              b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci
            );
            {code}

            Notice, it will be enough to specify just uca1400_as_ci without having to type the full names:
            - utf8mb4_uca1400_as_ci for the column "a", or
            - utf16_uca1400_as_ci for the column "b".

            The full name will be detected automatically according to the effective character set.

            h2. SHOW CREATE statements

            {{SHOW CREATE}} statements will display long collations names.

            For example, {{SHOW CREATE}} for the table created using the CREATE TABLE statement will produce the following output:

            {code:sql}
            SHOW CREATE TABLE t1;
            {code}
            {noformat}
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | Table | Create Table |
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | t1 | CREATE TABLE `t1` (
              `a` char(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_as_ci DEFAULT NULL,
              `b` char(10) CHARACTER SET utf16 COLLATE utf16_uca1400_as_ci DEFAULT NULL
            ) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            {noformat}
            Notice, the COLLATE clause will contain the full collation name, including the character set prefix.
            But this will possibly change in later versions to display short collation names.


            h2. INFORMATION_SCHEMA.COLLATIONS changes

            The following columns in INFORMATION_SCHEMA.COLLATIONS will be changed to be NULL-able:

            - CHARACTER_SET_NAME
            - ID
            - IS_DEFAULT

            All new UCA1400 collations will be displayed as follows:

            - The COLLATION_NAME column will display the short name, without character set prefix.
            - The three NULL-able column mentioned above will display NULL

            For old collations nothing will change.

            For example, the output from this query:
            {code:sql}
            SELECT * FROM INFORMATION_SCHEMA.COLLATIONS WHERE COLLATION_NAME LIKE '%czech%';
            {code}
            will look like this:
            {noformat}
            +---------------------------+--------------------+------+------------+-------------+---------+
            | COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN |
            +---------------------------+--------------------+------+------------+-------------+---------+
            | latin2_czech_cs | latin2 | 2 | | Yes | 4 |
            | cp1250_czech_cs | cp1250 | 34 | | Yes | 2 |
            | utf8mb3_czech_ci | utf8mb3 | 202 | | Yes | 8 |
            | ucs2_czech_ci | ucs2 | 138 | | Yes | 8 |
            | utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 |
            | uca1400_czech_ai_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_ai_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_as_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_as_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_ai_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_ai_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_as_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_as_cs | NULL | NULL | NULL | Yes | 8 |
            | utf16_czech_ci | utf16 | 111 | | Yes | 8 |
            | utf32_czech_ci | utf32 | 170 | | Yes | 8 |
            +---------------------------+--------------------+------+------------+-------------+---------+
            {noformat}

            The idea is that short collation names will be applicable to multiple character sets.
            We'll eventually remove columns CHARACTER_SET_NAME, ID, IS_DEFAULT from the table COLLATIONS. We'll keep them for now only for easier upgrades.

            h2. INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY changes

            In addition to columns COLLATION_NAME and CHARACTER_SET_NAME currently existing in the table INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY, three new columns will be added:
            - FULL_COLLATION_NAME
            - ID
            - IS_DEFAULT

            So the new structure for INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY will look like this:

            {code:sql}
            CREATE TEMPORARY TABLE `COLLATION_CHARACTER_SET_APPLICABILITY` (
              `COLLATION_NAME` varchar(64) NOT NULL,
              `CHARACTER_SET_NAME` varchar(32) NOT NULL,
              `FULL_COLLATION_NAME` varchar(64) NOT NULL,
              `ID` bigint(11) NOT NULL,
              `IS_DEFAULT` varchar(3) NOT NULL
            ) ENGINE=MEMORY DEFAULT CHARSET=utf8mb3;
            {code}


            The column COLLATION_NAME will display:
            - short names, without the character set name prefix, for new UCA1400 collations.
            - long names for all old collations


            So for example, the output from this query:
            {code:sql}
            SELECT * FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY
            WHERE COLLATION_NAME LIKE '%czech%cs';
            {code}
            will look about like this:
            {noformat}
            +---------------------------+--------------------+-----------------------------------+------+------------+
            | COLLATION_NAME | CHARACTER_SET_NAME | FULL_COLLATION_NAME | ID | IS_DEFAULT |
            +---------------------------+--------------------+-----------------------------------+------+------------+
            | latin2_czech_cs | latin2 | latin2_czech_cs | 2 | |
            | cp1250_czech_cs | cp1250 | cp1250_czech_cs | 34 | |
            | uca1400_czech_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_ai_cs | 2129 | |
            | uca1400_czech_as_cs | utf8mb3 | utf8mb3_uca1400_czech_as_cs | 2131 | |
            | uca1400_czech_nopad_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_ai_cs | 2133 | |
            | uca1400_czech_nopad_as_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_as_cs | 2135 | |
            | uca1400_czech_ai_cs | ucs2 | ucs2_uca1400_czech_ai_cs | 2641 | |
            | uca1400_czech_as_cs | ucs2 | ucs2_uca1400_czech_as_cs | 2643 | |
            | uca1400_czech_nopad_ai_cs | ucs2 | ucs2_uca1400_czech_nopad_ai_cs | 2645 | |
            | uca1400_czech_nopad_as_cs | ucs2 | ucs2_uca1400_czech_nopad_as_cs | 2647 | |
            | uca1400_czech_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_ai_cs | 2385 | |
            | uca1400_czech_as_cs | utf8mb4 | utf8mb4_uca1400_czech_as_cs | 2387 | |
            | uca1400_czech_nopad_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_ai_cs | 2389 | |
            | uca1400_czech_nopad_as_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_as_cs | 2391 | |
            | uca1400_czech_ai_cs | utf16 | utf16_uca1400_czech_ai_cs | 2897 | |
            | uca1400_czech_as_cs | utf16 | utf16_uca1400_czech_as_cs | 2899 | |
            | uca1400_czech_nopad_ai_cs | utf16 | utf16_uca1400_czech_nopad_ai_cs | 2901 | |
            | uca1400_czech_nopad_as_cs | utf16 | utf16_uca1400_czech_nopad_as_cs | 2903 | |
            | uca1400_czech_ai_cs | utf32 | utf32_uca1400_czech_ai_cs | 3153 | |
            | uca1400_czech_as_cs | utf32 | utf32_uca1400_czech_as_cs | 3155 | |
            | uca1400_czech_nopad_ai_cs | utf32 | utf32_uca1400_czech_nopad_ai_cs | 3157 | |
            | uca1400_czech_nopad_as_cs | utf32 | utf32_uca1400_czech_nopad_as_cs | 3159 | |
            +---------------------------+--------------------+-----------------------------------+------+------------+
            {noformat}
            In order to make utf8mb4 the default character set (see MDEV-19123), we need a reasonable default collation for utf8mb4.

            utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other.
            In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table.

            Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old.

            Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of MDEV-19123).

            Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version.

            h2. Character sets
            New collations will be added for these Unicode character sets
            - utf8mb3
            - utf8mb4
            - ucs2
            - utf16
            - utf32


            h2. Built-in contractions

            New collations will include built-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt

            There are 939 built-in contractions in UCA-14.0.0.

            To mention a few of them:
            {noformat}
            0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI>
            0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI>
            0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI>
            0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI>
            0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI>
            {noformat}

            Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built-in contractions.
            This will be a step towards better Unicode Collation Algorithm compliance. With built-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation.

            h2. Performance
            The patch adding UCA-14.0.0 collations will be pushed together with these patches improving the performance:
            - MDEV-27266 Improve UCA collation performance for utf8mb3 and utf8mb4
            - MDEV-27265 Improve contraction performance in UCA collations

            h2. Customization flags
            Under terms of this task we'll add support for the following flags
            - pad/nopad
            - accent sensitivity
            - case sensititivy

            Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available:
            {noformat}
            +---------------------------+
            | collation_name |
            +---------------------------+
            | uca1400_czech_ai_ci |
            | uca1400_czech_ai_cs |
            | uca1400_czech_as_ci |
            | uca1400_czech_as_cs |
            | uca1400_czech_nopad_ai_ci |
            | uca1400_czech_nopad_ai_cs |
            | uca1400_czech_nopad_as_ci |
            | uca1400_czech_nopad_as_cs |
            +---------------------------+
            {noformat}


            h2. Naming convention

            Naming convention for the new collations is described in MDEV-27210.

            h2. Short collation name

            The character set name in the collation name will be optional in all syntactic constructs:

            {code:sql}
            CREATE TABLE t1 (
              a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci,
              b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci
            );
            {code}

            Notice, it will be enough to specify just uca1400_as_ci without having to type the full names:
            - utf8mb4_uca1400_as_ci for the column "a", or
            - utf16_uca1400_as_ci for the column "b".

            The full name will be detected automatically according to the effective character set.

            h2. SHOW CREATE statements

            {{SHOW CREATE}} statements will display long collations names.

            For example, {{SHOW CREATE}} for the table created using the CREATE TABLE statement will produce the following output:

            {code:sql}
            SHOW CREATE TABLE t1;
            {code}
            {noformat}
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | Table | Create Table |
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | t1 | CREATE TABLE `t1` (
              `a` char(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_as_ci DEFAULT NULL,
              `b` char(10) CHARACTER SET utf16 COLLATE utf16_uca1400_as_ci DEFAULT NULL
            ) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            {noformat}
            Notice, the COLLATE clause will contain the full collation name, including the character set prefix.
            But this will possibly change in later versions to display short collation names.


            h2. INFORMATION_SCHEMA.COLLATIONS changes

            The following columns in INFORMATION_SCHEMA.COLLATIONS will be changed to be NULL-able:

            - CHARACTER_SET_NAME
            - ID
            - IS_DEFAULT

            All new UCA1400 collations will be displayed as follows:

            - The COLLATION_NAME column will display the short name, without character set prefix.
            - The three NULL-able column mentioned above will display NULL

            For old collations nothing will change.

            For example, the output from this query:
            {code:sql}
            SELECT * FROM INFORMATION_SCHEMA.COLLATIONS
            WHERE COLLATION_NAME LIKE '%czech%';
            {code}
            will look like this:
            {noformat}
            +---------------------------+--------------------+------+------------+-------------+---------+
            | COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN |
            +---------------------------+--------------------+------+------------+-------------+---------+
            | latin2_czech_cs | latin2 | 2 | | Yes | 4 |
            | cp1250_czech_cs | cp1250 | 34 | | Yes | 2 |
            | utf8mb3_czech_ci | utf8mb3 | 202 | | Yes | 8 |
            | ucs2_czech_ci | ucs2 | 138 | | Yes | 8 |
            | utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 |
            | uca1400_czech_ai_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_ai_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_as_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_as_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_ai_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_ai_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_as_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_as_cs | NULL | NULL | NULL | Yes | 8 |
            | utf16_czech_ci | utf16 | 111 | | Yes | 8 |
            | utf32_czech_ci | utf32 | 170 | | Yes | 8 |
            +---------------------------+--------------------+------+------------+-------------+---------+
            {noformat}

            The idea is that short collation names will be applicable to multiple character sets.
            We'll eventually remove columns CHARACTER_SET_NAME, ID, IS_DEFAULT from the table COLLATIONS. We'll keep them for now only for easier upgrades.

            h2. INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY changes

            In addition to columns COLLATION_NAME and CHARACTER_SET_NAME currently existing in the table INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY, three new columns will be added:
            - FULL_COLLATION_NAME
            - ID
            - IS_DEFAULT

            So the new structure for INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY will look like this:

            {code:sql}
            CREATE TEMPORARY TABLE `COLLATION_CHARACTER_SET_APPLICABILITY` (
              `COLLATION_NAME` varchar(64) NOT NULL,
              `CHARACTER_SET_NAME` varchar(32) NOT NULL,
              `FULL_COLLATION_NAME` varchar(64) NOT NULL,
              `ID` bigint(11) NOT NULL,
              `IS_DEFAULT` varchar(3) NOT NULL
            ) ENGINE=MEMORY DEFAULT CHARSET=utf8mb3;
            {code}


            The column COLLATION_NAME will display:
            - short names, without the character set name prefix, for new UCA1400 collations.
            - long names for all old collations


            So for example, the output from this query:
            {code:sql}
            SELECT * FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY
            WHERE COLLATION_NAME LIKE '%czech%cs';
            {code}
            will look about like this:
            {noformat}
            +---------------------------+--------------------+-----------------------------------+------+------------+
            | COLLATION_NAME | CHARACTER_SET_NAME | FULL_COLLATION_NAME | ID | IS_DEFAULT |
            +---------------------------+--------------------+-----------------------------------+------+------------+
            | latin2_czech_cs | latin2 | latin2_czech_cs | 2 | |
            | cp1250_czech_cs | cp1250 | cp1250_czech_cs | 34 | |
            | uca1400_czech_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_ai_cs | 2129 | |
            | uca1400_czech_as_cs | utf8mb3 | utf8mb3_uca1400_czech_as_cs | 2131 | |
            | uca1400_czech_nopad_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_ai_cs | 2133 | |
            | uca1400_czech_nopad_as_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_as_cs | 2135 | |
            | uca1400_czech_ai_cs | ucs2 | ucs2_uca1400_czech_ai_cs | 2641 | |
            | uca1400_czech_as_cs | ucs2 | ucs2_uca1400_czech_as_cs | 2643 | |
            | uca1400_czech_nopad_ai_cs | ucs2 | ucs2_uca1400_czech_nopad_ai_cs | 2645 | |
            | uca1400_czech_nopad_as_cs | ucs2 | ucs2_uca1400_czech_nopad_as_cs | 2647 | |
            | uca1400_czech_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_ai_cs | 2385 | |
            | uca1400_czech_as_cs | utf8mb4 | utf8mb4_uca1400_czech_as_cs | 2387 | |
            | uca1400_czech_nopad_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_ai_cs | 2389 | |
            | uca1400_czech_nopad_as_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_as_cs | 2391 | |
            | uca1400_czech_ai_cs | utf16 | utf16_uca1400_czech_ai_cs | 2897 | |
            | uca1400_czech_as_cs | utf16 | utf16_uca1400_czech_as_cs | 2899 | |
            | uca1400_czech_nopad_ai_cs | utf16 | utf16_uca1400_czech_nopad_ai_cs | 2901 | |
            | uca1400_czech_nopad_as_cs | utf16 | utf16_uca1400_czech_nopad_as_cs | 2903 | |
            | uca1400_czech_ai_cs | utf32 | utf32_uca1400_czech_ai_cs | 3153 | |
            | uca1400_czech_as_cs | utf32 | utf32_uca1400_czech_as_cs | 3155 | |
            | uca1400_czech_nopad_ai_cs | utf32 | utf32_uca1400_czech_nopad_ai_cs | 3157 | |
            | uca1400_czech_nopad_as_cs | utf32 | utf32_uca1400_czech_nopad_as_cs | 3159 | |
            +---------------------------+--------------------+-----------------------------------+------+------------+
            {noformat}
            bar Alexander Barkov made changes -
            Description In order to make utf8mb4 the default character set (see MDEV-19123), we need a reasonable default collation for utf8mb4.

            utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other.
            In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table.

            Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old.

            Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of MDEV-19123).

            Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version.

            h2. Character sets
            New collations will be added for these Unicode character sets
            - utf8mb3
            - utf8mb4
            - ucs2
            - utf16
            - utf32


            h2. Built-in contractions

            New collations will include built-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt

            There are 939 built-in contractions in UCA-14.0.0.

            To mention a few of them:
            {noformat}
            0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI>
            0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI>
            0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI>
            0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI>
            0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI>
            {noformat}

            Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built-in contractions.
            This will be a step towards better Unicode Collation Algorithm compliance. With built-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation.

            h2. Performance
            The patch adding UCA-14.0.0 collations will be pushed together with these patches improving the performance:
            - MDEV-27266 Improve UCA collation performance for utf8mb3 and utf8mb4
            - MDEV-27265 Improve contraction performance in UCA collations

            h2. Customization flags
            Under terms of this task we'll add support for the following flags
            - pad/nopad
            - accent sensitivity
            - case sensititivy

            Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available:
            {noformat}
            +---------------------------+
            | collation_name |
            +---------------------------+
            | uca1400_czech_ai_ci |
            | uca1400_czech_ai_cs |
            | uca1400_czech_as_ci |
            | uca1400_czech_as_cs |
            | uca1400_czech_nopad_ai_ci |
            | uca1400_czech_nopad_ai_cs |
            | uca1400_czech_nopad_as_ci |
            | uca1400_czech_nopad_as_cs |
            +---------------------------+
            {noformat}


            h2. Naming convention

            Naming convention for the new collations is described in MDEV-27210.

            h2. Short collation name

            The character set name in the collation name will be optional in all syntactic constructs:

            {code:sql}
            CREATE TABLE t1 (
              a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci,
              b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci
            );
            {code}

            Notice, it will be enough to specify just uca1400_as_ci without having to type the full names:
            - utf8mb4_uca1400_as_ci for the column "a", or
            - utf16_uca1400_as_ci for the column "b".

            The full name will be detected automatically according to the effective character set.

            h2. SHOW CREATE statements

            {{SHOW CREATE}} statements will display long collations names.

            For example, {{SHOW CREATE}} for the table created using the CREATE TABLE statement will produce the following output:

            {code:sql}
            SHOW CREATE TABLE t1;
            {code}
            {noformat}
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | Table | Create Table |
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | t1 | CREATE TABLE `t1` (
              `a` char(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_as_ci DEFAULT NULL,
              `b` char(10) CHARACTER SET utf16 COLLATE utf16_uca1400_as_ci DEFAULT NULL
            ) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            {noformat}
            Notice, the COLLATE clause will contain the full collation name, including the character set prefix.
            But this will possibly change in later versions to display short collation names.


            h2. INFORMATION_SCHEMA.COLLATIONS changes

            The following columns in INFORMATION_SCHEMA.COLLATIONS will be changed to be NULL-able:

            - CHARACTER_SET_NAME
            - ID
            - IS_DEFAULT

            All new UCA1400 collations will be displayed as follows:

            - The COLLATION_NAME column will display the short name, without character set prefix.
            - The three NULL-able column mentioned above will display NULL

            For old collations nothing will change.

            For example, the output from this query:
            {code:sql}
            SELECT * FROM INFORMATION_SCHEMA.COLLATIONS
            WHERE COLLATION_NAME LIKE '%czech%';
            {code}
            will look like this:
            {noformat}
            +---------------------------+--------------------+------+------------+-------------+---------+
            | COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN |
            +---------------------------+--------------------+------+------------+-------------+---------+
            | latin2_czech_cs | latin2 | 2 | | Yes | 4 |
            | cp1250_czech_cs | cp1250 | 34 | | Yes | 2 |
            | utf8mb3_czech_ci | utf8mb3 | 202 | | Yes | 8 |
            | ucs2_czech_ci | ucs2 | 138 | | Yes | 8 |
            | utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 |
            | uca1400_czech_ai_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_ai_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_as_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_as_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_ai_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_ai_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_as_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_as_cs | NULL | NULL | NULL | Yes | 8 |
            | utf16_czech_ci | utf16 | 111 | | Yes | 8 |
            | utf32_czech_ci | utf32 | 170 | | Yes | 8 |
            +---------------------------+--------------------+------+------------+-------------+---------+
            {noformat}

            The idea is that short collation names will be applicable to multiple character sets.
            We'll eventually remove columns CHARACTER_SET_NAME, ID, IS_DEFAULT from the table COLLATIONS. We'll keep them for now only for easier upgrades.

            h2. INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY changes

            In addition to columns COLLATION_NAME and CHARACTER_SET_NAME currently existing in the table INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY, three new columns will be added:
            - FULL_COLLATION_NAME
            - ID
            - IS_DEFAULT

            So the new structure for INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY will look like this:

            {code:sql}
            CREATE TEMPORARY TABLE `COLLATION_CHARACTER_SET_APPLICABILITY` (
              `COLLATION_NAME` varchar(64) NOT NULL,
              `CHARACTER_SET_NAME` varchar(32) NOT NULL,
              `FULL_COLLATION_NAME` varchar(64) NOT NULL,
              `ID` bigint(11) NOT NULL,
              `IS_DEFAULT` varchar(3) NOT NULL
            ) ENGINE=MEMORY DEFAULT CHARSET=utf8mb3;
            {code}


            The column COLLATION_NAME will display:
            - short names, without the character set name prefix, for new UCA1400 collations.
            - long names for all old collations


            So for example, the output from this query:
            {code:sql}
            SELECT * FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY
            WHERE COLLATION_NAME LIKE '%czech%cs';
            {code}
            will look about like this:
            {noformat}
            +---------------------------+--------------------+-----------------------------------+------+------------+
            | COLLATION_NAME | CHARACTER_SET_NAME | FULL_COLLATION_NAME | ID | IS_DEFAULT |
            +---------------------------+--------------------+-----------------------------------+------+------------+
            | latin2_czech_cs | latin2 | latin2_czech_cs | 2 | |
            | cp1250_czech_cs | cp1250 | cp1250_czech_cs | 34 | |
            | uca1400_czech_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_ai_cs | 2129 | |
            | uca1400_czech_as_cs | utf8mb3 | utf8mb3_uca1400_czech_as_cs | 2131 | |
            | uca1400_czech_nopad_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_ai_cs | 2133 | |
            | uca1400_czech_nopad_as_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_as_cs | 2135 | |
            | uca1400_czech_ai_cs | ucs2 | ucs2_uca1400_czech_ai_cs | 2641 | |
            | uca1400_czech_as_cs | ucs2 | ucs2_uca1400_czech_as_cs | 2643 | |
            | uca1400_czech_nopad_ai_cs | ucs2 | ucs2_uca1400_czech_nopad_ai_cs | 2645 | |
            | uca1400_czech_nopad_as_cs | ucs2 | ucs2_uca1400_czech_nopad_as_cs | 2647 | |
            | uca1400_czech_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_ai_cs | 2385 | |
            | uca1400_czech_as_cs | utf8mb4 | utf8mb4_uca1400_czech_as_cs | 2387 | |
            | uca1400_czech_nopad_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_ai_cs | 2389 | |
            | uca1400_czech_nopad_as_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_as_cs | 2391 | |
            | uca1400_czech_ai_cs | utf16 | utf16_uca1400_czech_ai_cs | 2897 | |
            | uca1400_czech_as_cs | utf16 | utf16_uca1400_czech_as_cs | 2899 | |
            | uca1400_czech_nopad_ai_cs | utf16 | utf16_uca1400_czech_nopad_ai_cs | 2901 | |
            | uca1400_czech_nopad_as_cs | utf16 | utf16_uca1400_czech_nopad_as_cs | 2903 | |
            | uca1400_czech_ai_cs | utf32 | utf32_uca1400_czech_ai_cs | 3153 | |
            | uca1400_czech_as_cs | utf32 | utf32_uca1400_czech_as_cs | 3155 | |
            | uca1400_czech_nopad_ai_cs | utf32 | utf32_uca1400_czech_nopad_ai_cs | 3157 | |
            | uca1400_czech_nopad_as_cs | utf32 | utf32_uca1400_czech_nopad_as_cs | 3159 | |
            +---------------------------+--------------------+-----------------------------------+------+------------+
            {noformat}
            In order to make utf8mb4 the default character set (see MDEV-19123), we need a reasonable default collation for utf8mb4.

            utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other.
            In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table.

            Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old.

            Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of MDEV-19123).

            Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version.

            h2. Character sets
            New collations will be added for these Unicode character sets
            - utf8mb3
            - utf8mb4
            - ucs2
            - utf16
            - utf32


            h2. Built-in contractions

            New collations will include built-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt

            There are 939 built-in contractions in UCA-14.0.0.

            To mention a few of them:
            {noformat}
            0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI>
            0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI>
            0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI>
            0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI>
            0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI>
            {noformat}

            Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built-in contractions.
            This will be a step towards better Unicode Collation Algorithm compliance. With built-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation.

            h2. Performance
            The patch adding UCA-14.0.0 collations will be pushed together with these patches improving the performance:
            - MDEV-27266 Improve UCA collation performance for utf8mb3 and utf8mb4
            - MDEV-27265 Improve contraction performance in UCA collations

            h2. Customization flags
            Under terms of this task we'll add support for the following flags
            - pad/nopad
            - accent sensitivity
            - case sensititivy

            Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available:
            {noformat}
            +---------------------------+
            | collation_name |
            +---------------------------+
            | uca1400_czech_ai_ci |
            | uca1400_czech_ai_cs |
            | uca1400_czech_as_ci |
            | uca1400_czech_as_cs |
            | uca1400_czech_nopad_ai_ci |
            | uca1400_czech_nopad_ai_cs |
            | uca1400_czech_nopad_as_ci |
            | uca1400_czech_nopad_as_cs |
            +---------------------------+
            {noformat}


            h2. Naming convention

            Naming convention for the new collations is described in MDEV-27210.

            h2. Short collation name

            The character set name in the collation name will be optional in all syntactic constructs:

            {code:sql}
            CREATE TABLE t1 (
              a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci,
              b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci
            );
            {code}

            Notice, it will be enough to specify just uca1400_as_ci without having to type the full names:
            - utf8mb4_uca1400_as_ci for the column "a", or
            - utf16_uca1400_as_ci for the column "b".

            The full name will be detected automatically according to the effective character set.

            h2. SHOW CREATE statements

            {{SHOW CREATE}} statements will display long collations names.

            For example, {{SHOW CREATE}} for the table created using the CREATE TABLE statement will produce the following output:

            {code:sql}
            SHOW CREATE TABLE t1;
            {code}
            {noformat}
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | Table | Create Table |
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | t1 | CREATE TABLE `t1` (
              `a` char(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_as_ci DEFAULT NULL,
              `b` char(10) CHARACTER SET utf16 COLLATE utf16_uca1400_as_ci DEFAULT NULL
            ) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            {noformat}
            Notice, the COLLATE clause will contain the full collation name, including the character set prefix.
            But this will possibly change in later versions to display short collation names.


            h2. INFORMATION_SCHEMA.COLLATIONS changes

            The following columns in INFORMATION_SCHEMA.COLLATIONS will be changed to be NULL-able:

            - CHARACTER_SET_NAME
            - ID
            - IS_DEFAULT

            All new UCA1400 collations will be displayed as follows:

            - The COLLATION_NAME column will display the short name, without character set prefix.
            - The three NULL-able column mentioned above will display NULL

            For old collations nothing will change.

            For example, the output from this query:
            {code:sql}
            SELECT * FROM INFORMATION_SCHEMA.COLLATIONS
            WHERE COLLATION_NAME LIKE '%czech%';
            {code}
            will look like this:
            {noformat}
            +---------------------------+--------------------+------+------------+-------------+---------+
            | COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN |
            +---------------------------+--------------------+------+------------+-------------+---------+
            | latin2_czech_cs | latin2 | 2 | | Yes | 4 |
            | cp1250_czech_cs | cp1250 | 34 | | Yes | 2 |
            | utf8mb3_czech_ci | utf8mb3 | 202 | | Yes | 8 |
            | ucs2_czech_ci | ucs2 | 138 | | Yes | 8 |
            | utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 |
            | uca1400_czech_ai_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_ai_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_as_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_as_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_ai_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_ai_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_as_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_as_cs | NULL | NULL | NULL | Yes | 8 |
            | utf16_czech_ci | utf16 | 111 | | Yes | 8 |
            | utf32_czech_ci | utf32 | 170 | | Yes | 8 |
            +---------------------------+--------------------+------+------------+-------------+---------+
            {noformat}

            The idea is that short collation names will be applicable to multiple character sets.
            We'll eventually remove columns CHARACTER_SET_NAME, ID, IS_DEFAULT from the table COLLATIONS. We'll keep them for now only for easier upgrades.

            h2. INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY changes

            In addition to columns COLLATION_NAME and CHARACTER_SET_NAME currently existing in the table INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY, three new columns will be added:
            - FULL_COLLATION_NAME
            - ID
            - IS_DEFAULT

            So the new structure for INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY will look like this:

            {code:sql}
            CREATE TEMPORARY TABLE `COLLATION_CHARACTER_SET_APPLICABILITY` (
              `COLLATION_NAME` varchar(64) NOT NULL,
              `CHARACTER_SET_NAME` varchar(32) NOT NULL,
              `FULL_COLLATION_NAME` varchar(64) NOT NULL,
              `ID` bigint(11) NOT NULL,
              `IS_DEFAULT` varchar(3) NOT NULL
            ) ENGINE=MEMORY DEFAULT CHARSET=utf8mb3;
            {code}


            The column COLLATION_NAME will display:
            - short names, without the character set name prefix, for new UCA1400 collations.
            - long names for all old collations


            So for example, the output from this query:
            {code:sql}
            SELECT * FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY
            WHERE COLLATION_NAME LIKE '%czech%cs';
            {code}
            will look about like this:
            {noformat}
            +---------------------------+--------------------+-----------------------------------+------+------------+
            | COLLATION_NAME | CHARACTER_SET_NAME | FULL_COLLATION_NAME | ID | IS_DEFAULT |
            +---------------------------+--------------------+-----------------------------------+------+------------+
            | latin2_czech_cs | latin2 | latin2_czech_cs | 2 | |
            | cp1250_czech_cs | cp1250 | cp1250_czech_cs | 34 | |
            | uca1400_czech_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_ai_cs | 2129 | |
            | uca1400_czech_as_cs | utf8mb3 | utf8mb3_uca1400_czech_as_cs | 2131 | |
            | uca1400_czech_nopad_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_ai_cs | 2133 | |
            | uca1400_czech_nopad_as_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_as_cs | 2135 | |
            | uca1400_czech_ai_cs | ucs2 | ucs2_uca1400_czech_ai_cs | 2641 | |
            | uca1400_czech_as_cs | ucs2 | ucs2_uca1400_czech_as_cs | 2643 | |
            | uca1400_czech_nopad_ai_cs | ucs2 | ucs2_uca1400_czech_nopad_ai_cs | 2645 | |
            | uca1400_czech_nopad_as_cs | ucs2 | ucs2_uca1400_czech_nopad_as_cs | 2647 | |
            | uca1400_czech_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_ai_cs | 2385 | |
            | uca1400_czech_as_cs | utf8mb4 | utf8mb4_uca1400_czech_as_cs | 2387 | |
            | uca1400_czech_nopad_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_ai_cs | 2389 | |
            | uca1400_czech_nopad_as_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_as_cs | 2391 | |
            | uca1400_czech_ai_cs | utf16 | utf16_uca1400_czech_ai_cs | 2897 | |
            | uca1400_czech_as_cs | utf16 | utf16_uca1400_czech_as_cs | 2899 | |
            | uca1400_czech_nopad_ai_cs | utf16 | utf16_uca1400_czech_nopad_ai_cs | 2901 | |
            | uca1400_czech_nopad_as_cs | utf16 | utf16_uca1400_czech_nopad_as_cs | 2903 | |
            | uca1400_czech_ai_cs | utf32 | utf32_uca1400_czech_ai_cs | 3153 | |
            | uca1400_czech_as_cs | utf32 | utf32_uca1400_czech_as_cs | 3155 | |
            | uca1400_czech_nopad_ai_cs | utf32 | utf32_uca1400_czech_nopad_ai_cs | 3157 | |
            | uca1400_czech_nopad_as_cs | utf32 | utf32_uca1400_czech_nopad_as_cs | 3159 | |
            +---------------------------+--------------------+-----------------------------------+------+------------+
            {noformat}


            h2. New ID ranges

            We'll use the range 2048-4095 for new UCA-14.0.0 collation.

            The ID will encode:
            - character set
            - tailoring
            - flags

            into 12 bits as follows:

            {noformat}
            1ccc tttt tPST
            {noformat}
            where
            {noformat}
            - ccc (3 bits) - the character set ID
                               (0 - utf8mb3, 1 - utf8mb4, 2 - ucs2, 3 - utf16, 4 - utf32)

            - ttttt (5 bits) - the tailoring ID
                               (0 - default, 1 - icelandic, 2 - latvian, etc)

            - P (1 bit) - the PAD flag (0 - PAD, 1 - NOPAD)

            - S (1 bit) - Secondary level
                               0 - disabled (accent insensitive)
                               1 - enabled (accent sensitive)

            - T (1 bit) - Tertiary level
                               0 - disabled (case insensitive)
                               1 - enabled (case sensitive)
            {noformat}
            bar Alexander Barkov made changes -
            Description In order to make utf8mb4 the default character set (see MDEV-19123), we need a reasonable default collation for utf8mb4.

            utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other.
            In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table.

            Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old.

            Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of MDEV-19123).

            Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version.

            h2. Character sets
            New collations will be added for these Unicode character sets
            - utf8mb3
            - utf8mb4
            - ucs2
            - utf16
            - utf32


            h2. Built-in contractions

            New collations will include built-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt

            There are 939 built-in contractions in UCA-14.0.0.

            To mention a few of them:
            {noformat}
            0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI>
            0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI>
            0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI>
            0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI>
            0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI>
            {noformat}

            Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built-in contractions.
            This will be a step towards better Unicode Collation Algorithm compliance. With built-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation.

            h2. Performance
            The patch adding UCA-14.0.0 collations will be pushed together with these patches improving the performance:
            - MDEV-27266 Improve UCA collation performance for utf8mb3 and utf8mb4
            - MDEV-27265 Improve contraction performance in UCA collations

            h2. Customization flags
            Under terms of this task we'll add support for the following flags
            - pad/nopad
            - accent sensitivity
            - case sensititivy

            Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available:
            {noformat}
            +---------------------------+
            | collation_name |
            +---------------------------+
            | uca1400_czech_ai_ci |
            | uca1400_czech_ai_cs |
            | uca1400_czech_as_ci |
            | uca1400_czech_as_cs |
            | uca1400_czech_nopad_ai_ci |
            | uca1400_czech_nopad_ai_cs |
            | uca1400_czech_nopad_as_ci |
            | uca1400_czech_nopad_as_cs |
            +---------------------------+
            {noformat}


            h2. Naming convention

            Naming convention for the new collations is described in MDEV-27210.

            h2. Short collation name

            The character set name in the collation name will be optional in all syntactic constructs:

            {code:sql}
            CREATE TABLE t1 (
              a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci,
              b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci
            );
            {code}

            Notice, it will be enough to specify just uca1400_as_ci without having to type the full names:
            - utf8mb4_uca1400_as_ci for the column "a", or
            - utf16_uca1400_as_ci for the column "b".

            The full name will be detected automatically according to the effective character set.

            h2. SHOW CREATE statements

            {{SHOW CREATE}} statements will display long collations names.

            For example, {{SHOW CREATE}} for the table created using the CREATE TABLE statement will produce the following output:

            {code:sql}
            SHOW CREATE TABLE t1;
            {code}
            {noformat}
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | Table | Create Table |
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | t1 | CREATE TABLE `t1` (
              `a` char(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_as_ci DEFAULT NULL,
              `b` char(10) CHARACTER SET utf16 COLLATE utf16_uca1400_as_ci DEFAULT NULL
            ) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            {noformat}
            Notice, the COLLATE clause will contain the full collation name, including the character set prefix.
            But this will possibly change in later versions to display short collation names.


            h2. INFORMATION_SCHEMA.COLLATIONS changes

            The following columns in INFORMATION_SCHEMA.COLLATIONS will be changed to be NULL-able:

            - CHARACTER_SET_NAME
            - ID
            - IS_DEFAULT

            All new UCA1400 collations will be displayed as follows:

            - The COLLATION_NAME column will display the short name, without character set prefix.
            - The three NULL-able column mentioned above will display NULL

            For old collations nothing will change.

            For example, the output from this query:
            {code:sql}
            SELECT * FROM INFORMATION_SCHEMA.COLLATIONS
            WHERE COLLATION_NAME LIKE '%czech%';
            {code}
            will look like this:
            {noformat}
            +---------------------------+--------------------+------+------------+-------------+---------+
            | COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN |
            +---------------------------+--------------------+------+------------+-------------+---------+
            | latin2_czech_cs | latin2 | 2 | | Yes | 4 |
            | cp1250_czech_cs | cp1250 | 34 | | Yes | 2 |
            | utf8mb3_czech_ci | utf8mb3 | 202 | | Yes | 8 |
            | ucs2_czech_ci | ucs2 | 138 | | Yes | 8 |
            | utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 |
            | uca1400_czech_ai_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_ai_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_as_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_as_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_ai_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_ai_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_as_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_as_cs | NULL | NULL | NULL | Yes | 8 |
            | utf16_czech_ci | utf16 | 111 | | Yes | 8 |
            | utf32_czech_ci | utf32 | 170 | | Yes | 8 |
            +---------------------------+--------------------+------+------------+-------------+---------+
            {noformat}

            The idea is that short collation names will be applicable to multiple character sets.
            We'll eventually remove columns CHARACTER_SET_NAME, ID, IS_DEFAULT from the table COLLATIONS. We'll keep them for now only for easier upgrades.

            h2. INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY changes

            In addition to columns COLLATION_NAME and CHARACTER_SET_NAME currently existing in the table INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY, three new columns will be added:
            - FULL_COLLATION_NAME
            - ID
            - IS_DEFAULT

            So the new structure for INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY will look like this:

            {code:sql}
            CREATE TEMPORARY TABLE `COLLATION_CHARACTER_SET_APPLICABILITY` (
              `COLLATION_NAME` varchar(64) NOT NULL,
              `CHARACTER_SET_NAME` varchar(32) NOT NULL,
              `FULL_COLLATION_NAME` varchar(64) NOT NULL,
              `ID` bigint(11) NOT NULL,
              `IS_DEFAULT` varchar(3) NOT NULL
            ) ENGINE=MEMORY DEFAULT CHARSET=utf8mb3;
            {code}


            The column COLLATION_NAME will display:
            - short names, without the character set name prefix, for new UCA1400 collations.
            - long names for all old collations


            So for example, the output from this query:
            {code:sql}
            SELECT * FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY
            WHERE COLLATION_NAME LIKE '%czech%cs';
            {code}
            will look about like this:
            {noformat}
            +---------------------------+--------------------+-----------------------------------+------+------------+
            | COLLATION_NAME | CHARACTER_SET_NAME | FULL_COLLATION_NAME | ID | IS_DEFAULT |
            +---------------------------+--------------------+-----------------------------------+------+------------+
            | latin2_czech_cs | latin2 | latin2_czech_cs | 2 | |
            | cp1250_czech_cs | cp1250 | cp1250_czech_cs | 34 | |
            | uca1400_czech_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_ai_cs | 2129 | |
            | uca1400_czech_as_cs | utf8mb3 | utf8mb3_uca1400_czech_as_cs | 2131 | |
            | uca1400_czech_nopad_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_ai_cs | 2133 | |
            | uca1400_czech_nopad_as_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_as_cs | 2135 | |
            | uca1400_czech_ai_cs | ucs2 | ucs2_uca1400_czech_ai_cs | 2641 | |
            | uca1400_czech_as_cs | ucs2 | ucs2_uca1400_czech_as_cs | 2643 | |
            | uca1400_czech_nopad_ai_cs | ucs2 | ucs2_uca1400_czech_nopad_ai_cs | 2645 | |
            | uca1400_czech_nopad_as_cs | ucs2 | ucs2_uca1400_czech_nopad_as_cs | 2647 | |
            | uca1400_czech_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_ai_cs | 2385 | |
            | uca1400_czech_as_cs | utf8mb4 | utf8mb4_uca1400_czech_as_cs | 2387 | |
            | uca1400_czech_nopad_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_ai_cs | 2389 | |
            | uca1400_czech_nopad_as_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_as_cs | 2391 | |
            | uca1400_czech_ai_cs | utf16 | utf16_uca1400_czech_ai_cs | 2897 | |
            | uca1400_czech_as_cs | utf16 | utf16_uca1400_czech_as_cs | 2899 | |
            | uca1400_czech_nopad_ai_cs | utf16 | utf16_uca1400_czech_nopad_ai_cs | 2901 | |
            | uca1400_czech_nopad_as_cs | utf16 | utf16_uca1400_czech_nopad_as_cs | 2903 | |
            | uca1400_czech_ai_cs | utf32 | utf32_uca1400_czech_ai_cs | 3153 | |
            | uca1400_czech_as_cs | utf32 | utf32_uca1400_czech_as_cs | 3155 | |
            | uca1400_czech_nopad_ai_cs | utf32 | utf32_uca1400_czech_nopad_ai_cs | 3157 | |
            | uca1400_czech_nopad_as_cs | utf32 | utf32_uca1400_czech_nopad_as_cs | 3159 | |
            +---------------------------+--------------------+-----------------------------------+------+------------+
            {noformat}


            h2. New ID ranges

            We'll use the range 2048-4095 for new UCA-14.0.0 collation.

            The ID will encode:
            - character set
            - tailoring
            - flags

            into 12 bits as follows:

            {noformat}
            1ccc tttt tPST
            {noformat}
            where
            {noformat}
            - ccc (3 bits) - the character set ID
                               (0 - utf8mb3, 1 - utf8mb4, 2 - ucs2, 3 - utf16, 4 - utf32)

            - ttttt (5 bits) - the tailoring ID
                               (0 - default, 1 - icelandic, 2 - latvian, etc)

            - P (1 bit) - the PAD flag (0 - PAD, 1 - NOPAD)

            - S (1 bit) - Secondary level
                               0 - disabled (accent insensitive)
                               1 - enabled (accent sensitive)

            - T (1 bit) - Tertiary level
                               0 - disabled (case insensitive)
                               1 - enabled (case sensitive)
            {noformat}
            In order to make utf8mb4 the default character set (see MDEV-19123), we need a reasonable default collation for utf8mb4.

            utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other.
            In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table.

            Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old.

            Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of MDEV-19123).

            Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version.

            h2. Character sets
            New collations will be added for these Unicode character sets
            - utf8mb3
            - utf8mb4
            - ucs2
            - utf16
            - utf32


            h2. Built-in contractions

            New collations will include built-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt

            There are 939 built-in contractions in UCA-14.0.0.

            To mention a few of them:
            {noformat}
            0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI>
            0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI>
            0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI>
            0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI>
            0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI>
            {noformat}

            Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built-in contractions.
            This will be a step towards better Unicode Collation Algorithm compliance. With built-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation.

            h2. Performance
            The patch adding UCA-14.0.0 collations will be pushed together with these patches improving the performance:
            - MDEV-27266 Improve UCA collation performance for utf8mb3 and utf8mb4
            - MDEV-27265 Improve contraction performance in UCA collations

            h2. Customization flags
            Under terms of this task we'll add support for the following flags
            - pad/nopad
            - accent sensitivity
            - case sensititivy

            Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available:
            {noformat}
            +---------------------------+
            | collation_name |
            +---------------------------+
            | uca1400_czech_ai_ci |
            | uca1400_czech_ai_cs |
            | uca1400_czech_as_ci |
            | uca1400_czech_as_cs |
            | uca1400_czech_nopad_ai_ci |
            | uca1400_czech_nopad_ai_cs |
            | uca1400_czech_nopad_as_ci |
            | uca1400_czech_nopad_as_cs |
            +---------------------------+
            {noformat}


            h2. Naming convention

            Naming convention for the new collations is described in MDEV-27210.

            h2. Short collation name

            The character set name in the collation name will be optional in all syntactic constructs:

            {code:sql}
            CREATE TABLE t1 (
              a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci,
              b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci
            );
            {code}

            Notice, it will be enough to specify just uca1400_as_ci without having to type the full names:
            - utf8mb4_uca1400_as_ci for the column "a", or
            - utf16_uca1400_as_ci for the column "b".

            The full name will be detected automatically according to the character set effective in the given context.

            h2. SHOW CREATE statements

            {{SHOW CREATE}} statements will display long collations names.

            For example, {{SHOW CREATE}} for the table created using the CREATE TABLE statement will produce the following output:

            {code:sql}
            SHOW CREATE TABLE t1;
            {code}
            {noformat}
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | Table | Create Table |
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | t1 | CREATE TABLE `t1` (
              `a` char(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_as_ci DEFAULT NULL,
              `b` char(10) CHARACTER SET utf16 COLLATE utf16_uca1400_as_ci DEFAULT NULL
            ) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            {noformat}
            Notice, the COLLATE clause will contain the full collation name, including the character set prefix.
            But this will possibly change in later versions to display short collation names.


            h2. INFORMATION_SCHEMA.COLLATIONS changes

            The following columns in INFORMATION_SCHEMA.COLLATIONS will be changed to be NULL-able:

            - CHARACTER_SET_NAME
            - ID
            - IS_DEFAULT

            All new UCA1400 collations will be displayed as follows:

            - The COLLATION_NAME column will display the short name, without character set prefix.
            - The three NULL-able column mentioned above will display NULL

            For old collations nothing will change.

            For example, the output from this query:
            {code:sql}
            SELECT * FROM INFORMATION_SCHEMA.COLLATIONS
            WHERE COLLATION_NAME LIKE '%czech%';
            {code}
            will look like this:
            {noformat}
            +---------------------------+--------------------+------+------------+-------------+---------+
            | COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN |
            +---------------------------+--------------------+------+------------+-------------+---------+
            | latin2_czech_cs | latin2 | 2 | | Yes | 4 |
            | cp1250_czech_cs | cp1250 | 34 | | Yes | 2 |
            | utf8mb3_czech_ci | utf8mb3 | 202 | | Yes | 8 |
            | ucs2_czech_ci | ucs2 | 138 | | Yes | 8 |
            | utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 |
            | uca1400_czech_ai_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_ai_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_as_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_as_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_ai_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_ai_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_as_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_as_cs | NULL | NULL | NULL | Yes | 8 |
            | utf16_czech_ci | utf16 | 111 | | Yes | 8 |
            | utf32_czech_ci | utf32 | 170 | | Yes | 8 |
            +---------------------------+--------------------+------+------------+-------------+---------+
            {noformat}

            The idea is that short collation names will be applicable to multiple character sets.
            We'll eventually remove columns CHARACTER_SET_NAME, ID, IS_DEFAULT from the table COLLATIONS. We'll keep them for now only for easier upgrades.

            h2. INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY changes

            In addition to columns COLLATION_NAME and CHARACTER_SET_NAME currently existing in the table INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY, three new columns will be added:
            - FULL_COLLATION_NAME
            - ID
            - IS_DEFAULT

            So the new structure for INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY will look like this:

            {code:sql}
            CREATE TEMPORARY TABLE `COLLATION_CHARACTER_SET_APPLICABILITY` (
              `COLLATION_NAME` varchar(64) NOT NULL,
              `CHARACTER_SET_NAME` varchar(32) NOT NULL,
              `FULL_COLLATION_NAME` varchar(64) NOT NULL,
              `ID` bigint(11) NOT NULL,
              `IS_DEFAULT` varchar(3) NOT NULL
            ) ENGINE=MEMORY DEFAULT CHARSET=utf8mb3;
            {code}


            The column COLLATION_NAME will display:
            - short names, without the character set name prefix, for new UCA1400 collations.
            - long names for all old collations


            So for example, the output from this query:
            {code:sql}
            SELECT * FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY
            WHERE COLLATION_NAME LIKE '%czech%cs';
            {code}
            will look about like this:
            {noformat}
            +---------------------------+--------------------+-----------------------------------+------+------------+
            | COLLATION_NAME | CHARACTER_SET_NAME | FULL_COLLATION_NAME | ID | IS_DEFAULT |
            +---------------------------+--------------------+-----------------------------------+------+------------+
            | latin2_czech_cs | latin2 | latin2_czech_cs | 2 | |
            | cp1250_czech_cs | cp1250 | cp1250_czech_cs | 34 | |
            | uca1400_czech_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_ai_cs | 2129 | |
            | uca1400_czech_as_cs | utf8mb3 | utf8mb3_uca1400_czech_as_cs | 2131 | |
            | uca1400_czech_nopad_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_ai_cs | 2133 | |
            | uca1400_czech_nopad_as_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_as_cs | 2135 | |
            | uca1400_czech_ai_cs | ucs2 | ucs2_uca1400_czech_ai_cs | 2641 | |
            | uca1400_czech_as_cs | ucs2 | ucs2_uca1400_czech_as_cs | 2643 | |
            | uca1400_czech_nopad_ai_cs | ucs2 | ucs2_uca1400_czech_nopad_ai_cs | 2645 | |
            | uca1400_czech_nopad_as_cs | ucs2 | ucs2_uca1400_czech_nopad_as_cs | 2647 | |
            | uca1400_czech_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_ai_cs | 2385 | |
            | uca1400_czech_as_cs | utf8mb4 | utf8mb4_uca1400_czech_as_cs | 2387 | |
            | uca1400_czech_nopad_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_ai_cs | 2389 | |
            | uca1400_czech_nopad_as_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_as_cs | 2391 | |
            | uca1400_czech_ai_cs | utf16 | utf16_uca1400_czech_ai_cs | 2897 | |
            | uca1400_czech_as_cs | utf16 | utf16_uca1400_czech_as_cs | 2899 | |
            | uca1400_czech_nopad_ai_cs | utf16 | utf16_uca1400_czech_nopad_ai_cs | 2901 | |
            | uca1400_czech_nopad_as_cs | utf16 | utf16_uca1400_czech_nopad_as_cs | 2903 | |
            | uca1400_czech_ai_cs | utf32 | utf32_uca1400_czech_ai_cs | 3153 | |
            | uca1400_czech_as_cs | utf32 | utf32_uca1400_czech_as_cs | 3155 | |
            | uca1400_czech_nopad_ai_cs | utf32 | utf32_uca1400_czech_nopad_ai_cs | 3157 | |
            | uca1400_czech_nopad_as_cs | utf32 | utf32_uca1400_czech_nopad_as_cs | 3159 | |
            +---------------------------+--------------------+-----------------------------------+------+------------+
            {noformat}


            h2. New ID ranges

            We'll use the range 2048-4095 for new UCA-14.0.0 collation.

            The ID will encode:
            - character set
            - tailoring
            - flags

            into 12 bits as follows:

            {noformat}
            1ccc tttt tPST
            {noformat}
            where
            {noformat}
            - ccc (3 bits) - the character set ID
                               (0 - utf8mb3, 1 - utf8mb4, 2 - ucs2, 3 - utf16, 4 - utf32)

            - ttttt (5 bits) - the tailoring ID
                               (0 - default, 1 - icelandic, 2 - latvian, etc)

            - P (1 bit) - the PAD flag (0 - PAD, 1 - NOPAD)

            - S (1 bit) - Secondary level
                               0 - disabled (accent insensitive)
                               1 - enabled (accent sensitive)

            - T (1 bit) - Tertiary level
                               0 - disabled (case insensitive)
                               1 - enabled (case sensitive)
            {noformat}
            bar Alexander Barkov made changes -
            Description In order to make utf8mb4 the default character set (see MDEV-19123), we need a reasonable default collation for utf8mb4.

            utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other.
            In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table.

            Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old.

            Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of MDEV-19123).

            Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version.

            h2. Character sets
            New collations will be added for these Unicode character sets
            - utf8mb3
            - utf8mb4
            - ucs2
            - utf16
            - utf32


            h2. Built-in contractions

            New collations will include built-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt

            There are 939 built-in contractions in UCA-14.0.0.

            To mention a few of them:
            {noformat}
            0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI>
            0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI>
            0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI>
            0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI>
            0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI>
            {noformat}

            Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built-in contractions.
            This will be a step towards better Unicode Collation Algorithm compliance. With built-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation.

            h2. Performance
            The patch adding UCA-14.0.0 collations will be pushed together with these patches improving the performance:
            - MDEV-27266 Improve UCA collation performance for utf8mb3 and utf8mb4
            - MDEV-27265 Improve contraction performance in UCA collations

            h2. Customization flags
            Under terms of this task we'll add support for the following flags
            - pad/nopad
            - accent sensitivity
            - case sensititivy

            Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available:
            {noformat}
            +---------------------------+
            | collation_name |
            +---------------------------+
            | uca1400_czech_ai_ci |
            | uca1400_czech_ai_cs |
            | uca1400_czech_as_ci |
            | uca1400_czech_as_cs |
            | uca1400_czech_nopad_ai_ci |
            | uca1400_czech_nopad_ai_cs |
            | uca1400_czech_nopad_as_ci |
            | uca1400_czech_nopad_as_cs |
            +---------------------------+
            {noformat}


            h2. Naming convention

            Naming convention for the new collations is described in MDEV-27210.

            h2. Short collation name

            The character set name in the collation name will be optional in all syntactic constructs:

            {code:sql}
            CREATE TABLE t1 (
              a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci,
              b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci
            );
            {code}

            Notice, it will be enough to specify just uca1400_as_ci without having to type the full names:
            - utf8mb4_uca1400_as_ci for the column "a", or
            - utf16_uca1400_as_ci for the column "b".

            The full name will be detected automatically according to the character set effective in the given context.

            h2. SHOW CREATE statements

            {{SHOW CREATE}} statements will display long collations names.

            For example, {{SHOW CREATE}} for the table created using the CREATE TABLE statement will produce the following output:

            {code:sql}
            SHOW CREATE TABLE t1;
            {code}
            {noformat}
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | Table | Create Table |
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | t1 | CREATE TABLE `t1` (
              `a` char(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_as_ci DEFAULT NULL,
              `b` char(10) CHARACTER SET utf16 COLLATE utf16_uca1400_as_ci DEFAULT NULL
            ) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            {noformat}
            Notice, the COLLATE clause will contain the full collation name, including the character set prefix.
            But this will possibly change in later versions to display short collation names.


            h2. INFORMATION_SCHEMA.COLLATIONS changes

            The following columns in INFORMATION_SCHEMA.COLLATIONS will be changed to be NULL-able:

            - CHARACTER_SET_NAME
            - ID
            - IS_DEFAULT

            All new UCA1400 collations will be displayed as follows:

            - The COLLATION_NAME column will display the short name, without character set prefix.
            - The three NULL-able column mentioned above will display NULL

            For old collations nothing will change.

            For example, the output from this query:
            {code:sql}
            SELECT * FROM INFORMATION_SCHEMA.COLLATIONS
            WHERE COLLATION_NAME LIKE '%czech%';
            {code}
            will look like this:
            {noformat}
            +---------------------------+--------------------+------+------------+-------------+---------+
            | COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN |
            +---------------------------+--------------------+------+------------+-------------+---------+
            | latin2_czech_cs | latin2 | 2 | | Yes | 4 |
            | cp1250_czech_cs | cp1250 | 34 | | Yes | 2 |
            | utf8mb3_czech_ci | utf8mb3 | 202 | | Yes | 8 |
            | ucs2_czech_ci | ucs2 | 138 | | Yes | 8 |
            | utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 |
            | uca1400_czech_ai_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_ai_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_as_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_as_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_ai_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_ai_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_as_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_as_cs | NULL | NULL | NULL | Yes | 8 |
            | utf16_czech_ci | utf16 | 111 | | Yes | 8 |
            | utf32_czech_ci | utf32 | 170 | | Yes | 8 |
            +---------------------------+--------------------+------+------------+-------------+---------+
            {noformat}

            The idea is that short collation names will be applicable to multiple character sets.
            We'll eventually remove columns CHARACTER_SET_NAME, ID, IS_DEFAULT from the table COLLATIONS. We'll keep them for now only for easier upgrades.

            h2. INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY changes

            In addition to columns COLLATION_NAME and CHARACTER_SET_NAME currently existing in the table INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY, three new columns will be added:
            - FULL_COLLATION_NAME
            - ID
            - IS_DEFAULT

            So the new structure for INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY will look like this:

            {code:sql}
            CREATE TEMPORARY TABLE `COLLATION_CHARACTER_SET_APPLICABILITY` (
              `COLLATION_NAME` varchar(64) NOT NULL,
              `CHARACTER_SET_NAME` varchar(32) NOT NULL,
              `FULL_COLLATION_NAME` varchar(64) NOT NULL,
              `ID` bigint(11) NOT NULL,
              `IS_DEFAULT` varchar(3) NOT NULL
            ) ENGINE=MEMORY DEFAULT CHARSET=utf8mb3;
            {code}


            The column COLLATION_NAME will display:
            - short names, without the character set name prefix, for new UCA1400 collations.
            - long names for all old collations


            So for example, the output from this query:
            {code:sql}
            SELECT * FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY
            WHERE COLLATION_NAME LIKE '%czech%cs';
            {code}
            will look about like this:
            {noformat}
            +---------------------------+--------------------+-----------------------------------+------+------------+
            | COLLATION_NAME | CHARACTER_SET_NAME | FULL_COLLATION_NAME | ID | IS_DEFAULT |
            +---------------------------+--------------------+-----------------------------------+------+------------+
            | latin2_czech_cs | latin2 | latin2_czech_cs | 2 | |
            | cp1250_czech_cs | cp1250 | cp1250_czech_cs | 34 | |
            | uca1400_czech_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_ai_cs | 2129 | |
            | uca1400_czech_as_cs | utf8mb3 | utf8mb3_uca1400_czech_as_cs | 2131 | |
            | uca1400_czech_nopad_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_ai_cs | 2133 | |
            | uca1400_czech_nopad_as_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_as_cs | 2135 | |
            | uca1400_czech_ai_cs | ucs2 | ucs2_uca1400_czech_ai_cs | 2641 | |
            | uca1400_czech_as_cs | ucs2 | ucs2_uca1400_czech_as_cs | 2643 | |
            | uca1400_czech_nopad_ai_cs | ucs2 | ucs2_uca1400_czech_nopad_ai_cs | 2645 | |
            | uca1400_czech_nopad_as_cs | ucs2 | ucs2_uca1400_czech_nopad_as_cs | 2647 | |
            | uca1400_czech_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_ai_cs | 2385 | |
            | uca1400_czech_as_cs | utf8mb4 | utf8mb4_uca1400_czech_as_cs | 2387 | |
            | uca1400_czech_nopad_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_ai_cs | 2389 | |
            | uca1400_czech_nopad_as_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_as_cs | 2391 | |
            | uca1400_czech_ai_cs | utf16 | utf16_uca1400_czech_ai_cs | 2897 | |
            | uca1400_czech_as_cs | utf16 | utf16_uca1400_czech_as_cs | 2899 | |
            | uca1400_czech_nopad_ai_cs | utf16 | utf16_uca1400_czech_nopad_ai_cs | 2901 | |
            | uca1400_czech_nopad_as_cs | utf16 | utf16_uca1400_czech_nopad_as_cs | 2903 | |
            | uca1400_czech_ai_cs | utf32 | utf32_uca1400_czech_ai_cs | 3153 | |
            | uca1400_czech_as_cs | utf32 | utf32_uca1400_czech_as_cs | 3155 | |
            | uca1400_czech_nopad_ai_cs | utf32 | utf32_uca1400_czech_nopad_ai_cs | 3157 | |
            | uca1400_czech_nopad_as_cs | utf32 | utf32_uca1400_czech_nopad_as_cs | 3159 | |
            +---------------------------+--------------------+-----------------------------------+------+------------+
            {noformat}


            h2. New ID ranges

            We'll use the range 2048-4095 for new UCA-14.0.0 collation.

            The ID will encode:
            - character set
            - tailoring
            - flags

            into 12 bits as follows:

            {noformat}
            1ccc tttt tPST
            {noformat}
            where
            {noformat}
            - ccc (3 bits) - the character set ID
                               (0 - utf8mb3, 1 - utf8mb4, 2 - ucs2, 3 - utf16, 4 - utf32)

            - ttttt (5 bits) - the tailoring ID
                               (0 - default, 1 - icelandic, 2 - latvian, etc)

            - P (1 bit) - the PAD flag (0 - PAD, 1 - NOPAD)

            - S (1 bit) - Secondary level
                               0 - disabled (accent insensitive)
                               1 - enabled (accent sensitive)

            - T (1 bit) - Tertiary level
                               0 - disabled (case insensitive)
                               1 - enabled (case sensitive)
            {noformat}
            In order to make utf8mb4 the default character set (see MDEV-19123), we need a reasonable default collation for utf8mb4.

            utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other.
            In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table.

            Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old.

            Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of MDEV-19123).

            Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version.

            h2. Character sets
            New collations will be added for these Unicode character sets
            - utf8mb3
            - utf8mb4
            - ucs2
            - utf16
            - utf32

            h2. Tailoring (language specific collations)

            We'll add collations for all tailorings that exists for the old UCA-4.0.0 collations, except Thai:

            - icelandic
            - latvian
            - romanian
            - slovenian
            - polish
            - estonian
            - spanish
            - swedish
            - turkish
            - czech
            - danish
            - lithuanian
            - slovak
            - spanish2
            - roman
            - persian
            - esperanto
            - hungarian
            - sinhala
            - german2
            - croatian
            - vietnamese

            Note, with built-in contractions that will be supported in UCA-14.0.0 collations, a separate Thai collation won't be needed. Thai will work with the default root collation.


            h2. Built-in contractions

            New collations will include built-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt

            There are 939 built-in contractions in UCA-14.0.0.

            To mention a few of them:
            {noformat}
            0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI>
            0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI>
            0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI>
            0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI>
            0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI>
            {noformat}

            Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built-in contractions.
            This will be a step towards better Unicode Collation Algorithm compliance. With built-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation.

            h2. Performance
            The patch adding UCA-14.0.0 collations will be pushed together with these patches improving the performance:
            - MDEV-27266 Improve UCA collation performance for utf8mb3 and utf8mb4
            - MDEV-27265 Improve contraction performance in UCA collations

            h2. Customization flags
            Under terms of this task we'll add support for the following flags
            - pad/nopad
            - accent sensitivity
            - case sensititivy

            Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available:
            {noformat}
            +---------------------------+
            | collation_name |
            +---------------------------+
            | uca1400_czech_ai_ci |
            | uca1400_czech_ai_cs |
            | uca1400_czech_as_ci |
            | uca1400_czech_as_cs |
            | uca1400_czech_nopad_ai_ci |
            | uca1400_czech_nopad_ai_cs |
            | uca1400_czech_nopad_as_ci |
            | uca1400_czech_nopad_as_cs |
            +---------------------------+
            {noformat}


            h2. Naming convention

            Naming convention for the new collations is described in MDEV-27210.

            h2. Short collation name

            The character set name in the collation name will be optional in all syntactic constructs:

            {code:sql}
            CREATE TABLE t1 (
              a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci,
              b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci
            );
            {code}

            Notice, it will be enough to specify just uca1400_as_ci without having to type the full names:
            - utf8mb4_uca1400_as_ci for the column "a", or
            - utf16_uca1400_as_ci for the column "b".

            The full name will be detected automatically according to the character set effective in the given context.

            h2. SHOW CREATE statements

            {{SHOW CREATE}} statements will display long collations names.

            For example, {{SHOW CREATE}} for the table created using the CREATE TABLE statement will produce the following output:

            {code:sql}
            SHOW CREATE TABLE t1;
            {code}
            {noformat}
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | Table | Create Table |
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | t1 | CREATE TABLE `t1` (
              `a` char(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_as_ci DEFAULT NULL,
              `b` char(10) CHARACTER SET utf16 COLLATE utf16_uca1400_as_ci DEFAULT NULL
            ) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            {noformat}
            Notice, the COLLATE clause will contain the full collation name, including the character set prefix.
            But this will possibly change in later versions to display short collation names.


            h2. INFORMATION_SCHEMA.COLLATIONS changes

            The following columns in INFORMATION_SCHEMA.COLLATIONS will be changed to be NULL-able:

            - CHARACTER_SET_NAME
            - ID
            - IS_DEFAULT

            All new UCA1400 collations will be displayed as follows:

            - The COLLATION_NAME column will display the short name, without character set prefix.
            - The three NULL-able column mentioned above will display NULL

            For old collations nothing will change.

            For example, the output from this query:
            {code:sql}
            SELECT * FROM INFORMATION_SCHEMA.COLLATIONS
            WHERE COLLATION_NAME LIKE '%czech%';
            {code}
            will look like this:
            {noformat}
            +---------------------------+--------------------+------+------------+-------------+---------+
            | COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN |
            +---------------------------+--------------------+------+------------+-------------+---------+
            | latin2_czech_cs | latin2 | 2 | | Yes | 4 |
            | cp1250_czech_cs | cp1250 | 34 | | Yes | 2 |
            | utf8mb3_czech_ci | utf8mb3 | 202 | | Yes | 8 |
            | ucs2_czech_ci | ucs2 | 138 | | Yes | 8 |
            | utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 |
            | uca1400_czech_ai_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_ai_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_as_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_as_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_ai_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_ai_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_as_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_as_cs | NULL | NULL | NULL | Yes | 8 |
            | utf16_czech_ci | utf16 | 111 | | Yes | 8 |
            | utf32_czech_ci | utf32 | 170 | | Yes | 8 |
            +---------------------------+--------------------+------+------------+-------------+---------+
            {noformat}

            The idea is that short collation names will be applicable to multiple character sets.
            We'll eventually remove columns CHARACTER_SET_NAME, ID, IS_DEFAULT from the table COLLATIONS. We'll keep them for now only for easier upgrades.

            h2. INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY changes

            In addition to columns COLLATION_NAME and CHARACTER_SET_NAME currently existing in the table INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY, three new columns will be added:
            - FULL_COLLATION_NAME
            - ID
            - IS_DEFAULT

            So the new structure for INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY will look like this:

            {code:sql}
            CREATE TEMPORARY TABLE `COLLATION_CHARACTER_SET_APPLICABILITY` (
              `COLLATION_NAME` varchar(64) NOT NULL,
              `CHARACTER_SET_NAME` varchar(32) NOT NULL,
              `FULL_COLLATION_NAME` varchar(64) NOT NULL,
              `ID` bigint(11) NOT NULL,
              `IS_DEFAULT` varchar(3) NOT NULL
            ) ENGINE=MEMORY DEFAULT CHARSET=utf8mb3;
            {code}


            The column COLLATION_NAME will display:
            - short names, without the character set name prefix, for new UCA1400 collations.
            - long names for all old collations


            So for example, the output from this query:
            {code:sql}
            SELECT * FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY
            WHERE COLLATION_NAME LIKE '%czech%cs';
            {code}
            will look about like this:
            {noformat}
            +---------------------------+--------------------+-----------------------------------+------+------------+
            | COLLATION_NAME | CHARACTER_SET_NAME | FULL_COLLATION_NAME | ID | IS_DEFAULT |
            +---------------------------+--------------------+-----------------------------------+------+------------+
            | latin2_czech_cs | latin2 | latin2_czech_cs | 2 | |
            | cp1250_czech_cs | cp1250 | cp1250_czech_cs | 34 | |
            | uca1400_czech_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_ai_cs | 2129 | |
            | uca1400_czech_as_cs | utf8mb3 | utf8mb3_uca1400_czech_as_cs | 2131 | |
            | uca1400_czech_nopad_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_ai_cs | 2133 | |
            | uca1400_czech_nopad_as_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_as_cs | 2135 | |
            | uca1400_czech_ai_cs | ucs2 | ucs2_uca1400_czech_ai_cs | 2641 | |
            | uca1400_czech_as_cs | ucs2 | ucs2_uca1400_czech_as_cs | 2643 | |
            | uca1400_czech_nopad_ai_cs | ucs2 | ucs2_uca1400_czech_nopad_ai_cs | 2645 | |
            | uca1400_czech_nopad_as_cs | ucs2 | ucs2_uca1400_czech_nopad_as_cs | 2647 | |
            | uca1400_czech_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_ai_cs | 2385 | |
            | uca1400_czech_as_cs | utf8mb4 | utf8mb4_uca1400_czech_as_cs | 2387 | |
            | uca1400_czech_nopad_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_ai_cs | 2389 | |
            | uca1400_czech_nopad_as_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_as_cs | 2391 | |
            | uca1400_czech_ai_cs | utf16 | utf16_uca1400_czech_ai_cs | 2897 | |
            | uca1400_czech_as_cs | utf16 | utf16_uca1400_czech_as_cs | 2899 | |
            | uca1400_czech_nopad_ai_cs | utf16 | utf16_uca1400_czech_nopad_ai_cs | 2901 | |
            | uca1400_czech_nopad_as_cs | utf16 | utf16_uca1400_czech_nopad_as_cs | 2903 | |
            | uca1400_czech_ai_cs | utf32 | utf32_uca1400_czech_ai_cs | 3153 | |
            | uca1400_czech_as_cs | utf32 | utf32_uca1400_czech_as_cs | 3155 | |
            | uca1400_czech_nopad_ai_cs | utf32 | utf32_uca1400_czech_nopad_ai_cs | 3157 | |
            | uca1400_czech_nopad_as_cs | utf32 | utf32_uca1400_czech_nopad_as_cs | 3159 | |
            +---------------------------+--------------------+-----------------------------------+------+------------+
            {noformat}


            h2. New ID ranges

            We'll use the range 2048-4095 for new UCA-14.0.0 collation.

            The ID will encode:
            - character set
            - tailoring
            - flags

            into 12 bits as follows:

            {noformat}
            1ccc tttt tPST
            {noformat}
            where
            {noformat}
            - ccc (3 bits) - the character set ID
                               (0 - utf8mb3, 1 - utf8mb4, 2 - ucs2, 3 - utf16, 4 - utf32)

            - ttttt (5 bits) - the tailoring ID
                               (0 - default, 1 - icelandic, 2 - latvian, etc)

            - P (1 bit) - the PAD flag (0 - PAD, 1 - NOPAD)

            - S (1 bit) - Secondary level
                               0 - disabled (accent insensitive)
                               1 - enabled (accent sensitive)

            - T (1 bit) - Tertiary level
                               0 - disabled (case insensitive)
                               1 - enabled (case sensitive)
            {noformat}
            bar Alexander Barkov made changes -
            Description In order to make utf8mb4 the default character set (see MDEV-19123), we need a reasonable default collation for utf8mb4.

            utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other.
            In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table.

            Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old.

            Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of MDEV-19123).

            Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version.

            h2. Character sets
            New collations will be added for these Unicode character sets
            - utf8mb3
            - utf8mb4
            - ucs2
            - utf16
            - utf32

            h2. Tailoring (language specific collations)

            We'll add collations for all tailorings that exists for the old UCA-4.0.0 collations, except Thai:

            - icelandic
            - latvian
            - romanian
            - slovenian
            - polish
            - estonian
            - spanish
            - swedish
            - turkish
            - czech
            - danish
            - lithuanian
            - slovak
            - spanish2
            - roman
            - persian
            - esperanto
            - hungarian
            - sinhala
            - german2
            - croatian
            - vietnamese

            Note, with built-in contractions that will be supported in UCA-14.0.0 collations, a separate Thai collation won't be needed. Thai will work with the default root collation.


            h2. Built-in contractions

            New collations will include built-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt

            There are 939 built-in contractions in UCA-14.0.0.

            To mention a few of them:
            {noformat}
            0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI>
            0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI>
            0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI>
            0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI>
            0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI>
            {noformat}

            Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built-in contractions.
            This will be a step towards better Unicode Collation Algorithm compliance. With built-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation.

            h2. Performance
            The patch adding UCA-14.0.0 collations will be pushed together with these patches improving the performance:
            - MDEV-27266 Improve UCA collation performance for utf8mb3 and utf8mb4
            - MDEV-27265 Improve contraction performance in UCA collations

            h2. Customization flags
            Under terms of this task we'll add support for the following flags
            - pad/nopad
            - accent sensitivity
            - case sensititivy

            Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available:
            {noformat}
            +---------------------------+
            | collation_name |
            +---------------------------+
            | uca1400_czech_ai_ci |
            | uca1400_czech_ai_cs |
            | uca1400_czech_as_ci |
            | uca1400_czech_as_cs |
            | uca1400_czech_nopad_ai_ci |
            | uca1400_czech_nopad_ai_cs |
            | uca1400_czech_nopad_as_ci |
            | uca1400_czech_nopad_as_cs |
            +---------------------------+
            {noformat}


            h2. Naming convention

            Naming convention for the new collations is described in MDEV-27210.

            h2. Short collation name

            The character set name in the collation name will be optional in all syntactic constructs:

            {code:sql}
            CREATE TABLE t1 (
              a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci,
              b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci
            );
            {code}

            Notice, it will be enough to specify just uca1400_as_ci without having to type the full names:
            - utf8mb4_uca1400_as_ci for the column "a", or
            - utf16_uca1400_as_ci for the column "b".

            The full name will be detected automatically according to the character set effective in the given context.

            h2. SHOW CREATE statements

            {{SHOW CREATE}} statements will display long collations names.

            For example, {{SHOW CREATE}} for the table created using the CREATE TABLE statement will produce the following output:

            {code:sql}
            SHOW CREATE TABLE t1;
            {code}
            {noformat}
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | Table | Create Table |
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | t1 | CREATE TABLE `t1` (
              `a` char(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_as_ci DEFAULT NULL,
              `b` char(10) CHARACTER SET utf16 COLLATE utf16_uca1400_as_ci DEFAULT NULL
            ) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            {noformat}
            Notice, the COLLATE clause will contain the full collation name, including the character set prefix.
            But this will possibly change in later versions to display short collation names.


            h2. INFORMATION_SCHEMA.COLLATIONS changes

            The following columns in INFORMATION_SCHEMA.COLLATIONS will be changed to be NULL-able:

            - CHARACTER_SET_NAME
            - ID
            - IS_DEFAULT

            All new UCA1400 collations will be displayed as follows:

            - The COLLATION_NAME column will display the short name, without character set prefix.
            - The three NULL-able column mentioned above will display NULL

            For old collations nothing will change.

            For example, the output from this query:
            {code:sql}
            SELECT * FROM INFORMATION_SCHEMA.COLLATIONS
            WHERE COLLATION_NAME LIKE '%czech%';
            {code}
            will look like this:
            {noformat}
            +---------------------------+--------------------+------+------------+-------------+---------+
            | COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN |
            +---------------------------+--------------------+------+------------+-------------+---------+
            | latin2_czech_cs | latin2 | 2 | | Yes | 4 |
            | cp1250_czech_cs | cp1250 | 34 | | Yes | 2 |
            | utf8mb3_czech_ci | utf8mb3 | 202 | | Yes | 8 |
            | ucs2_czech_ci | ucs2 | 138 | | Yes | 8 |
            | utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 |
            | uca1400_czech_ai_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_ai_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_as_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_as_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_ai_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_ai_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_as_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_as_cs | NULL | NULL | NULL | Yes | 8 |
            | utf16_czech_ci | utf16 | 111 | | Yes | 8 |
            | utf32_czech_ci | utf32 | 170 | | Yes | 8 |
            +---------------------------+--------------------+------+------------+-------------+---------+
            {noformat}

            The idea is that short collation names will be applicable to multiple character sets.
            We'll eventually remove columns CHARACTER_SET_NAME, ID, IS_DEFAULT from the table COLLATIONS. We'll keep them for now only for easier upgrades.

            h2. INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY changes

            In addition to columns COLLATION_NAME and CHARACTER_SET_NAME currently existing in the table INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY, three new columns will be added:
            - FULL_COLLATION_NAME
            - ID
            - IS_DEFAULT

            So the new structure for INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY will look like this:

            {code:sql}
            CREATE TEMPORARY TABLE `COLLATION_CHARACTER_SET_APPLICABILITY` (
              `COLLATION_NAME` varchar(64) NOT NULL,
              `CHARACTER_SET_NAME` varchar(32) NOT NULL,
              `FULL_COLLATION_NAME` varchar(64) NOT NULL,
              `ID` bigint(11) NOT NULL,
              `IS_DEFAULT` varchar(3) NOT NULL
            ) ENGINE=MEMORY DEFAULT CHARSET=utf8mb3;
            {code}


            The column COLLATION_NAME will display:
            - short names, without the character set name prefix, for new UCA1400 collations.
            - long names for all old collations


            So for example, the output from this query:
            {code:sql}
            SELECT * FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY
            WHERE COLLATION_NAME LIKE '%czech%cs';
            {code}
            will look about like this:
            {noformat}
            +---------------------------+--------------------+-----------------------------------+------+------------+
            | COLLATION_NAME | CHARACTER_SET_NAME | FULL_COLLATION_NAME | ID | IS_DEFAULT |
            +---------------------------+--------------------+-----------------------------------+------+------------+
            | latin2_czech_cs | latin2 | latin2_czech_cs | 2 | |
            | cp1250_czech_cs | cp1250 | cp1250_czech_cs | 34 | |
            | uca1400_czech_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_ai_cs | 2129 | |
            | uca1400_czech_as_cs | utf8mb3 | utf8mb3_uca1400_czech_as_cs | 2131 | |
            | uca1400_czech_nopad_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_ai_cs | 2133 | |
            | uca1400_czech_nopad_as_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_as_cs | 2135 | |
            | uca1400_czech_ai_cs | ucs2 | ucs2_uca1400_czech_ai_cs | 2641 | |
            | uca1400_czech_as_cs | ucs2 | ucs2_uca1400_czech_as_cs | 2643 | |
            | uca1400_czech_nopad_ai_cs | ucs2 | ucs2_uca1400_czech_nopad_ai_cs | 2645 | |
            | uca1400_czech_nopad_as_cs | ucs2 | ucs2_uca1400_czech_nopad_as_cs | 2647 | |
            | uca1400_czech_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_ai_cs | 2385 | |
            | uca1400_czech_as_cs | utf8mb4 | utf8mb4_uca1400_czech_as_cs | 2387 | |
            | uca1400_czech_nopad_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_ai_cs | 2389 | |
            | uca1400_czech_nopad_as_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_as_cs | 2391 | |
            | uca1400_czech_ai_cs | utf16 | utf16_uca1400_czech_ai_cs | 2897 | |
            | uca1400_czech_as_cs | utf16 | utf16_uca1400_czech_as_cs | 2899 | |
            | uca1400_czech_nopad_ai_cs | utf16 | utf16_uca1400_czech_nopad_ai_cs | 2901 | |
            | uca1400_czech_nopad_as_cs | utf16 | utf16_uca1400_czech_nopad_as_cs | 2903 | |
            | uca1400_czech_ai_cs | utf32 | utf32_uca1400_czech_ai_cs | 3153 | |
            | uca1400_czech_as_cs | utf32 | utf32_uca1400_czech_as_cs | 3155 | |
            | uca1400_czech_nopad_ai_cs | utf32 | utf32_uca1400_czech_nopad_ai_cs | 3157 | |
            | uca1400_czech_nopad_as_cs | utf32 | utf32_uca1400_czech_nopad_as_cs | 3159 | |
            +---------------------------+--------------------+-----------------------------------+------+------------+
            {noformat}


            h2. New ID ranges

            We'll use the range 2048-4095 for new UCA-14.0.0 collation.

            The ID will encode:
            - character set
            - tailoring
            - flags

            into 12 bits as follows:

            {noformat}
            1ccc tttt tPST
            {noformat}
            where
            {noformat}
            - ccc (3 bits) - the character set ID
                               (0 - utf8mb3, 1 - utf8mb4, 2 - ucs2, 3 - utf16, 4 - utf32)

            - ttttt (5 bits) - the tailoring ID
                               (0 - default, 1 - icelandic, 2 - latvian, etc)

            - P (1 bit) - the PAD flag (0 - PAD, 1 - NOPAD)

            - S (1 bit) - Secondary level
                               0 - disabled (accent insensitive)
                               1 - enabled (accent sensitive)

            - T (1 bit) - Tertiary level
                               0 - disabled (case insensitive)
                               1 - enabled (case sensitive)
            {noformat}
            In order to make utf8mb4 the default character set (see MDEV-19123), we need a reasonable default collation for utf8mb4.

            utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other.
            In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table.

            Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old.

            Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of MDEV-19123).

            Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version.

            h2. Character sets
            New collations will be added for these Unicode character sets
            - utf8mb3
            - utf8mb4
            - ucs2
            - utf16
            - utf32

            h2. Tailoring (language specific collations)

            We'll add collations for all 22 tailorings that exists for the old UCA-4.0.0 collations, except Thai.

            - icelandic
            - latvian
            - romanian
            - slovenian
            - polish
            - estonian
            - spanish
            - swedish
            - turkish
            - czech
            - danish
            - lithuanian
            - slovak
            - spanish2
            - roman
            - persian
            - esperanto
            - hungarian
            - sinhala
            - german2
            - croatian
            - vietnamese

            Note, with built-in contractions that will be supported in UCA-14.0.0 collations, a separate Thai collation won't be needed. Thai will work with the default root collation.


            h2. Built-in contractions

            New collations will include built-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt

            There are 939 built-in contractions in UCA-14.0.0.

            To mention a few of them:
            {noformat}
            0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI>
            0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI>
            0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI>
            0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI>
            0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI>
            {noformat}

            Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built-in contractions.
            This will be a step towards better Unicode Collation Algorithm compliance. With built-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation.

            h2. Performance
            The patch adding UCA-14.0.0 collations will be pushed together with these patches improving the performance:
            - MDEV-27266 Improve UCA collation performance for utf8mb3 and utf8mb4
            - MDEV-27265 Improve contraction performance in UCA collations

            h2. Customization flags
            Under terms of this task we'll add support for the following flags
            - pad/nopad
            - accent sensitivity
            - case sensititivy

            Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available:
            {noformat}
            +---------------------------+
            | collation_name |
            +---------------------------+
            | uca1400_czech_ai_ci |
            | uca1400_czech_ai_cs |
            | uca1400_czech_as_ci |
            | uca1400_czech_as_cs |
            | uca1400_czech_nopad_ai_ci |
            | uca1400_czech_nopad_ai_cs |
            | uca1400_czech_nopad_as_ci |
            | uca1400_czech_nopad_as_cs |
            +---------------------------+
            {noformat}


            h2. Naming convention

            Naming convention for the new collations is described in MDEV-27210.

            h2. Short collation name

            The character set name in the collation name will be optional in all syntactic constructs:

            {code:sql}
            CREATE TABLE t1 (
              a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci,
              b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci
            );
            {code}

            Notice, it will be enough to specify just uca1400_as_ci without having to type the full names:
            - utf8mb4_uca1400_as_ci for the column "a", or
            - utf16_uca1400_as_ci for the column "b".

            The full name will be detected automatically according to the character set effective in the given context.

            h2. SHOW CREATE statements

            {{SHOW CREATE}} statements will display long collations names.

            For example, {{SHOW CREATE}} for the table created using the CREATE TABLE statement will produce the following output:

            {code:sql}
            SHOW CREATE TABLE t1;
            {code}
            {noformat}
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | Table | Create Table |
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | t1 | CREATE TABLE `t1` (
              `a` char(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_as_ci DEFAULT NULL,
              `b` char(10) CHARACTER SET utf16 COLLATE utf16_uca1400_as_ci DEFAULT NULL
            ) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            {noformat}
            Notice, the COLLATE clause will contain the full collation name, including the character set prefix.
            But this will possibly change in later versions to display short collation names.


            h2. INFORMATION_SCHEMA.COLLATIONS changes

            The following columns in INFORMATION_SCHEMA.COLLATIONS will be changed to be NULL-able:

            - CHARACTER_SET_NAME
            - ID
            - IS_DEFAULT

            All new UCA1400 collations will be displayed as follows:

            - The COLLATION_NAME column will display the short name, without character set prefix.
            - The three NULL-able column mentioned above will display NULL

            For old collations nothing will change.

            For example, the output from this query:
            {code:sql}
            SELECT * FROM INFORMATION_SCHEMA.COLLATIONS
            WHERE COLLATION_NAME LIKE '%czech%';
            {code}
            will look like this:
            {noformat}
            +---------------------------+--------------------+------+------------+-------------+---------+
            | COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN |
            +---------------------------+--------------------+------+------------+-------------+---------+
            | latin2_czech_cs | latin2 | 2 | | Yes | 4 |
            | cp1250_czech_cs | cp1250 | 34 | | Yes | 2 |
            | utf8mb3_czech_ci | utf8mb3 | 202 | | Yes | 8 |
            | ucs2_czech_ci | ucs2 | 138 | | Yes | 8 |
            | utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 |
            | uca1400_czech_ai_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_ai_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_as_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_as_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_ai_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_ai_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_as_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_as_cs | NULL | NULL | NULL | Yes | 8 |
            | utf16_czech_ci | utf16 | 111 | | Yes | 8 |
            | utf32_czech_ci | utf32 | 170 | | Yes | 8 |
            +---------------------------+--------------------+------+------------+-------------+---------+
            {noformat}

            The idea is that short collation names will be applicable to multiple character sets.
            We'll eventually remove columns CHARACTER_SET_NAME, ID, IS_DEFAULT from the table COLLATIONS. We'll keep them for now only for easier upgrades.

            h2. INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY changes

            In addition to columns COLLATION_NAME and CHARACTER_SET_NAME currently existing in the table INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY, three new columns will be added:
            - FULL_COLLATION_NAME
            - ID
            - IS_DEFAULT

            So the new structure for INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY will look like this:

            {code:sql}
            CREATE TEMPORARY TABLE `COLLATION_CHARACTER_SET_APPLICABILITY` (
              `COLLATION_NAME` varchar(64) NOT NULL,
              `CHARACTER_SET_NAME` varchar(32) NOT NULL,
              `FULL_COLLATION_NAME` varchar(64) NOT NULL,
              `ID` bigint(11) NOT NULL,
              `IS_DEFAULT` varchar(3) NOT NULL
            ) ENGINE=MEMORY DEFAULT CHARSET=utf8mb3;
            {code}


            The column COLLATION_NAME will display:
            - short names, without the character set name prefix, for new UCA1400 collations.
            - long names for all old collations


            So for example, the output from this query:
            {code:sql}
            SELECT * FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY
            WHERE COLLATION_NAME LIKE '%czech%cs';
            {code}
            will look about like this:
            {noformat}
            +---------------------------+--------------------+-----------------------------------+------+------------+
            | COLLATION_NAME | CHARACTER_SET_NAME | FULL_COLLATION_NAME | ID | IS_DEFAULT |
            +---------------------------+--------------------+-----------------------------------+------+------------+
            | latin2_czech_cs | latin2 | latin2_czech_cs | 2 | |
            | cp1250_czech_cs | cp1250 | cp1250_czech_cs | 34 | |
            | uca1400_czech_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_ai_cs | 2129 | |
            | uca1400_czech_as_cs | utf8mb3 | utf8mb3_uca1400_czech_as_cs | 2131 | |
            | uca1400_czech_nopad_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_ai_cs | 2133 | |
            | uca1400_czech_nopad_as_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_as_cs | 2135 | |
            | uca1400_czech_ai_cs | ucs2 | ucs2_uca1400_czech_ai_cs | 2641 | |
            | uca1400_czech_as_cs | ucs2 | ucs2_uca1400_czech_as_cs | 2643 | |
            | uca1400_czech_nopad_ai_cs | ucs2 | ucs2_uca1400_czech_nopad_ai_cs | 2645 | |
            | uca1400_czech_nopad_as_cs | ucs2 | ucs2_uca1400_czech_nopad_as_cs | 2647 | |
            | uca1400_czech_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_ai_cs | 2385 | |
            | uca1400_czech_as_cs | utf8mb4 | utf8mb4_uca1400_czech_as_cs | 2387 | |
            | uca1400_czech_nopad_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_ai_cs | 2389 | |
            | uca1400_czech_nopad_as_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_as_cs | 2391 | |
            | uca1400_czech_ai_cs | utf16 | utf16_uca1400_czech_ai_cs | 2897 | |
            | uca1400_czech_as_cs | utf16 | utf16_uca1400_czech_as_cs | 2899 | |
            | uca1400_czech_nopad_ai_cs | utf16 | utf16_uca1400_czech_nopad_ai_cs | 2901 | |
            | uca1400_czech_nopad_as_cs | utf16 | utf16_uca1400_czech_nopad_as_cs | 2903 | |
            | uca1400_czech_ai_cs | utf32 | utf32_uca1400_czech_ai_cs | 3153 | |
            | uca1400_czech_as_cs | utf32 | utf32_uca1400_czech_as_cs | 3155 | |
            | uca1400_czech_nopad_ai_cs | utf32 | utf32_uca1400_czech_nopad_ai_cs | 3157 | |
            | uca1400_czech_nopad_as_cs | utf32 | utf32_uca1400_czech_nopad_as_cs | 3159 | |
            +---------------------------+--------------------+-----------------------------------+------+------------+
            {noformat}


            h2. New ID ranges

            We'll use the range 2048-4095 for new UCA-14.0.0 collation.

            The ID will encode:
            - character set
            - tailoring
            - flags

            into 12 bits as follows:

            {noformat}
            1ccc tttt tPST
            {noformat}
            where
            {noformat}
            - ccc (3 bits) - the character set ID
                               (0 - utf8mb3, 1 - utf8mb4, 2 - ucs2, 3 - utf16, 4 - utf32)

            - ttttt (5 bits) - the tailoring ID
                               (0 - default, 1 - icelandic, 2 - latvian, etc)

            - P (1 bit) - the PAD flag (0 - PAD, 1 - NOPAD)

            - S (1 bit) - Secondary level
                               0 - disabled (accent insensitive)
                               1 - enabled (accent sensitive)

            - T (1 bit) - Tertiary level
                               0 - disabled (case insensitive)
                               1 - enabled (case sensitive)
            {noformat}
            serg Sergei Golubchik made changes -
            Fix Version/s 10.10 [ 27530 ]
            Fix Version/s 10.9 [ 26905 ]
            serg Sergei Golubchik made changes -
            Assignee Sergei Golubchik [ serg ] Alexander Barkov [ bar ]
            Status In Review [ 10002 ] Stalled [ 10000 ]
            julien.fritsch Julien Fritsch made changes -
            Description In order to make utf8mb4 the default character set (see MDEV-19123), we need a reasonable default collation for utf8mb4.

            utf8mb4_general_ci is not good - it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other.
            In 2021 it's not acceptable - there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table.

            Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520_ci (Unicode-5.2.0 based) is not reasonable either - these standards are more than 10 years old.

            Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of MDEV-19123).

            Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version.

            h2. Character sets
            New collations will be added for these Unicode character sets
            - utf8mb3
            - utf8mb4
            - ucs2
            - utf16
            - utf32

            h2. Tailoring (language specific collations)

            We'll add collations for all 22 tailorings that exists for the old UCA-4.0.0 collations, except Thai.

            - icelandic
            - latvian
            - romanian
            - slovenian
            - polish
            - estonian
            - spanish
            - swedish
            - turkish
            - czech
            - danish
            - lithuanian
            - slovak
            - spanish2
            - roman
            - persian
            - esperanto
            - hungarian
            - sinhala
            - german2
            - croatian
            - vietnamese

            Note, with built-in contractions that will be supported in UCA-14.0.0 collations, a separate Thai collation won't be needed. Thai will work with the default root collation.


            h2. Built-in contractions

            New collations will include built-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt

            There are 939 built-in contractions in UCA-14.0.0.

            To mention a few of them:
            {noformat}
            0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI>
            0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI>
            0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI>
            0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI>
            0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI>
            {noformat}

            Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built-in contractions.
            This will be a step towards better Unicode Collation Algorithm compliance. With built-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation.

            h2. Performance
            The patch adding UCA-14.0.0 collations will be pushed together with these patches improving the performance:
            - MDEV-27266 Improve UCA collation performance for utf8mb3 and utf8mb4
            - MDEV-27265 Improve contraction performance in UCA collations

            h2. Customization flags
            Under terms of this task we'll add support for the following flags
            - pad/nopad
            - accent sensitivity
            - case sensititivy

            Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available:
            {noformat}
            +---------------------------+
            | collation_name |
            +---------------------------+
            | uca1400_czech_ai_ci |
            | uca1400_czech_ai_cs |
            | uca1400_czech_as_ci |
            | uca1400_czech_as_cs |
            | uca1400_czech_nopad_ai_ci |
            | uca1400_czech_nopad_ai_cs |
            | uca1400_czech_nopad_as_ci |
            | uca1400_czech_nopad_as_cs |
            +---------------------------+
            {noformat}


            h2. Naming convention

            Naming convention for the new collations is described in MDEV-27210.

            h2. Short collation name

            The character set name in the collation name will be optional in all syntactic constructs:

            {code:sql}
            CREATE TABLE t1 (
              a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci,
              b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci
            );
            {code}

            Notice, it will be enough to specify just uca1400_as_ci without having to type the full names:
            - utf8mb4_uca1400_as_ci for the column "a", or
            - utf16_uca1400_as_ci for the column "b".

            The full name will be detected automatically according to the character set effective in the given context.

            h2. SHOW CREATE statements

            {{SHOW CREATE}} statements will display long collations names.

            For example, {{SHOW CREATE}} for the table created using the CREATE TABLE statement will produce the following output:

            {code:sql}
            SHOW CREATE TABLE t1;
            {code}
            {noformat}
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | Table | Create Table |
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | t1 | CREATE TABLE `t1` (
              `a` char(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_as_ci DEFAULT NULL,
              `b` char(10) CHARACTER SET utf16 COLLATE utf16_uca1400_as_ci DEFAULT NULL
            ) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            {noformat}
            Notice, the COLLATE clause will contain the full collation name, including the character set prefix.
            But this will possibly change in later versions to display short collation names.


            h2. INFORMATION_SCHEMA.COLLATIONS changes

            The following columns in INFORMATION_SCHEMA.COLLATIONS will be changed to be NULL-able:

            - CHARACTER_SET_NAME
            - ID
            - IS_DEFAULT

            All new UCA1400 collations will be displayed as follows:

            - The COLLATION_NAME column will display the short name, without character set prefix.
            - The three NULL-able column mentioned above will display NULL

            For old collations nothing will change.

            For example, the output from this query:
            {code:sql}
            SELECT * FROM INFORMATION_SCHEMA.COLLATIONS
            WHERE COLLATION_NAME LIKE '%czech%';
            {code}
            will look like this:
            {noformat}
            +---------------------------+--------------------+------+------------+-------------+---------+
            | COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN |
            +---------------------------+--------------------+------+------------+-------------+---------+
            | latin2_czech_cs | latin2 | 2 | | Yes | 4 |
            | cp1250_czech_cs | cp1250 | 34 | | Yes | 2 |
            | utf8mb3_czech_ci | utf8mb3 | 202 | | Yes | 8 |
            | ucs2_czech_ci | ucs2 | 138 | | Yes | 8 |
            | utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 |
            | uca1400_czech_ai_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_ai_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_as_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_as_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_ai_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_ai_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_as_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_as_cs | NULL | NULL | NULL | Yes | 8 |
            | utf16_czech_ci | utf16 | 111 | | Yes | 8 |
            | utf32_czech_ci | utf32 | 170 | | Yes | 8 |
            +---------------------------+--------------------+------+------------+-------------+---------+
            {noformat}

            The idea is that short collation names will be applicable to multiple character sets.
            We'll eventually remove columns CHARACTER_SET_NAME, ID, IS_DEFAULT from the table COLLATIONS. We'll keep them for now only for easier upgrades.

            h2. INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY changes

            In addition to columns COLLATION_NAME and CHARACTER_SET_NAME currently existing in the table INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY, three new columns will be added:
            - FULL_COLLATION_NAME
            - ID
            - IS_DEFAULT

            So the new structure for INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY will look like this:

            {code:sql}
            CREATE TEMPORARY TABLE `COLLATION_CHARACTER_SET_APPLICABILITY` (
              `COLLATION_NAME` varchar(64) NOT NULL,
              `CHARACTER_SET_NAME` varchar(32) NOT NULL,
              `FULL_COLLATION_NAME` varchar(64) NOT NULL,
              `ID` bigint(11) NOT NULL,
              `IS_DEFAULT` varchar(3) NOT NULL
            ) ENGINE=MEMORY DEFAULT CHARSET=utf8mb3;
            {code}


            The column COLLATION_NAME will display:
            - short names, without the character set name prefix, for new UCA1400 collations.
            - long names for all old collations


            So for example, the output from this query:
            {code:sql}
            SELECT * FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY
            WHERE COLLATION_NAME LIKE '%czech%cs';
            {code}
            will look about like this:
            {noformat}
            +---------------------------+--------------------+-----------------------------------+------+------------+
            | COLLATION_NAME | CHARACTER_SET_NAME | FULL_COLLATION_NAME | ID | IS_DEFAULT |
            +---------------------------+--------------------+-----------------------------------+------+------------+
            | latin2_czech_cs | latin2 | latin2_czech_cs | 2 | |
            | cp1250_czech_cs | cp1250 | cp1250_czech_cs | 34 | |
            | uca1400_czech_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_ai_cs | 2129 | |
            | uca1400_czech_as_cs | utf8mb3 | utf8mb3_uca1400_czech_as_cs | 2131 | |
            | uca1400_czech_nopad_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_ai_cs | 2133 | |
            | uca1400_czech_nopad_as_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_as_cs | 2135 | |
            | uca1400_czech_ai_cs | ucs2 | ucs2_uca1400_czech_ai_cs | 2641 | |
            | uca1400_czech_as_cs | ucs2 | ucs2_uca1400_czech_as_cs | 2643 | |
            | uca1400_czech_nopad_ai_cs | ucs2 | ucs2_uca1400_czech_nopad_ai_cs | 2645 | |
            | uca1400_czech_nopad_as_cs | ucs2 | ucs2_uca1400_czech_nopad_as_cs | 2647 | |
            | uca1400_czech_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_ai_cs | 2385 | |
            | uca1400_czech_as_cs | utf8mb4 | utf8mb4_uca1400_czech_as_cs | 2387 | |
            | uca1400_czech_nopad_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_ai_cs | 2389 | |
            | uca1400_czech_nopad_as_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_as_cs | 2391 | |
            | uca1400_czech_ai_cs | utf16 | utf16_uca1400_czech_ai_cs | 2897 | |
            | uca1400_czech_as_cs | utf16 | utf16_uca1400_czech_as_cs | 2899 | |
            | uca1400_czech_nopad_ai_cs | utf16 | utf16_uca1400_czech_nopad_ai_cs | 2901 | |
            | uca1400_czech_nopad_as_cs | utf16 | utf16_uca1400_czech_nopad_as_cs | 2903 | |
            | uca1400_czech_ai_cs | utf32 | utf32_uca1400_czech_ai_cs | 3153 | |
            | uca1400_czech_as_cs | utf32 | utf32_uca1400_czech_as_cs | 3155 | |
            | uca1400_czech_nopad_ai_cs | utf32 | utf32_uca1400_czech_nopad_ai_cs | 3157 | |
            | uca1400_czech_nopad_as_cs | utf32 | utf32_uca1400_czech_nopad_as_cs | 3159 | |
            +---------------------------+--------------------+-----------------------------------+------+------------+
            {noformat}


            h2. New ID ranges

            We'll use the range 2048-4095 for new UCA-14.0.0 collation.

            The ID will encode:
            - character set
            - tailoring
            - flags

            into 12 bits as follows:

            {noformat}
            1ccc tttt tPST
            {noformat}
            where
            {noformat}
            - ccc (3 bits) - the character set ID
                               (0 - utf8mb3, 1 - utf8mb4, 2 - ucs2, 3 - utf16, 4 - utf32)

            - ttttt (5 bits) - the tailoring ID
                               (0 - default, 1 - icelandic, 2 - latvian, etc)

            - P (1 bit) - the PAD flag (0 - PAD, 1 - NOPAD)

            - S (1 bit) - Secondary level
                               0 - disabled (accent insensitive)
                               1 - enabled (accent sensitive)

            - T (1 bit) - Tertiary level
                               0 - disabled (case insensitive)
                               1 - enabled (case sensitive)
            {noformat}
            In order to make utf8mb4 the default character set (see MDEV\-19123), we need a reasonable default collation for utf8mb4.

            utf8mb4_general_ci is not good \- it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other.
            In 2021 it's not acceptable \- there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table.

            Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520\_ci (Unicode-5.2.0 based) is not reasonable either \- these standards are more than 10 years old.

            Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of MDEV-19123).

            Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version.

            h2. Character sets

            New collations will be added for these Unicode character sets
            \- utf8mb3
            \- utf8mb4
            \- ucs2
            \- utf16
            \- utf32

            h2. Tailoring (language specific collations)

            We'll add collations for all 22 tailorings that exists for the old UCA\-4.0.0 collations, except Thai.

            \- icelandic
            \- latvian
            \- romanian
            \- slovenian
            \- polish
            \- estonian
            \- spanish
            \- swedish
            \- turkish
            \- czech
            \- danish
            \- lithuanian
            \- slovak
            \- spanish2
            \- roman
            \- persian
            \- esperanto
            \- hungarian
            \- sinhala
            \- german2
            \- croatian
            \- vietnamese

            Note, with built-in contractions that will be supported in UCA-14.0.0 collations, a separate Thai collation won't be needed. Thai will work with the default root collation.

            h2. Built\-in contractions

            New collations will include built\-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt

            There are 939 built-in contractions in UCA-14.0.0.

            To mention a few of them:

            {noformat}
            0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI>
            0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI>
            0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI>
            0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI>
            0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI>
            {noformat}

            Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built\-in contractions.
            This will be a step towards better Unicode Collation Algorithm compliance. With built\-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation.

            h2. Performance

            The patch adding UCA\-14.0.0 collations will be pushed together with these patches improving the performance:
            - MDEV-27266 Improve UCA collation performance for utf8mb3 and utf8mb4
            - MDEV-27265 Improve contraction performance in UCA collations

            h2. Customization flags

            Under terms of this task we'll add support for the following flags
            \- pad/nopad
            \- accent sensitivity
            \- case sensititivy

            Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available:

            {noformat}
            +---------------------------+
            | collation_name |
            +---------------------------+
            | uca1400_czech_ai_ci |
            | uca1400_czech_ai_cs |
            | uca1400_czech_as_ci |
            | uca1400_czech_as_cs |
            | uca1400_czech_nopad_ai_ci |
            | uca1400_czech_nopad_ai_cs |
            | uca1400_czech_nopad_as_ci |
            | uca1400_czech_nopad_as_cs |
            +---------------------------+
            {noformat}

            h2. Naming convention

            Naming convention for the new collations is described in MDEV\-27210.

            h2. Short collation name

            The character set name in the collation name will be optional in all syntactic constructs:

            {code:sql}
            CREATE TABLE t1 (
              a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci,
              b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci
            );
            {code}

            Notice, it will be enough to specify just uca1400_as_ci without having to type the full names:
            \- utf8mb4_uca1400_as\_ci for the column "a", or
            \- utf16_uca1400_as\_ci for the column "b".

            The full name will be detected automatically according to the character set effective in the given context.

            h2. SHOW CREATE statements

            {{SHOW CREATE}} statements will display long collations names.

            For example, {{SHOW CREATE}} for the table created using the CREATE TABLE statement will produce the following output:

            {code:sql}
            SHOW CREATE TABLE t1;
            {code}

            {noformat}
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | Table | Create Table |
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | t1 | CREATE TABLE `t1` (
              `a` char(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_as_ci DEFAULT NULL,
              `b` char(10) CHARACTER SET utf16 COLLATE utf16_uca1400_as_ci DEFAULT NULL
            ) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            {noformat}

            Notice, the COLLATE clause will contain the full collation name, including the character set prefix.
            But this will possibly change in later versions to display short collation names.

            h2. INFORMATION\_SCHEMA.COLLATIONS changes

            The following columns in INFORMATION\_SCHEMA.COLLATIONS will be changed to be NULL\-able:

            \- CHARACTER_SET_NAME
            \- ID
            \- IS\_DEFAULT

            All new UCA1400 collations will be displayed as follows:

            \- The COLLATION\_NAME column will display the short name, without character set prefix.
            - The three NULL-able column mentioned above will display NULL

            For old collations nothing will change.

            For example, the output from this query:

            {code:sql}
            SELECT * FROM INFORMATION_SCHEMA.COLLATIONS
            WHERE COLLATION_NAME LIKE '%czech%';
            {code}

            will look like this:

            {noformat}
            +---------------------------+--------------------+------+------------+-------------+---------+
            | COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN |
            +---------------------------+--------------------+------+------------+-------------+---------+
            | latin2_czech_cs | latin2 | 2 | | Yes | 4 |
            | cp1250_czech_cs | cp1250 | 34 | | Yes | 2 |
            | utf8mb3_czech_ci | utf8mb3 | 202 | | Yes | 8 |
            | ucs2_czech_ci | ucs2 | 138 | | Yes | 8 |
            | utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 |
            | uca1400_czech_ai_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_ai_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_as_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_as_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_ai_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_ai_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_as_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_as_cs | NULL | NULL | NULL | Yes | 8 |
            | utf16_czech_ci | utf16 | 111 | | Yes | 8 |
            | utf32_czech_ci | utf32 | 170 | | Yes | 8 |
            +---------------------------+--------------------+------+------------+-------------+---------+
            {noformat}

            The idea is that short collation names will be applicable to multiple character sets.
            We'll eventually remove columns CHARACTER_SET_NAME, ID, IS\_DEFAULT from the table COLLATIONS. We'll keep them for now only for easier upgrades.

            h2. INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY changes

            In addition to columns COLLATION_NAME and CHARACTER_SET_NAME currently existing in the table INFORMATION_SCHEMA.COLLATION_CHARACTER_SET\_APPLICABILITY, three new columns will be added:
            \- FULL_COLLATION_NAME
            \- ID
            \- IS\_DEFAULT

            So the new structure for INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY will look like this:

            {code:sql}
            CREATE TEMPORARY TABLE `COLLATION_CHARACTER_SET_APPLICABILITY` (
              `COLLATION_NAME` varchar(64) NOT NULL,
              `CHARACTER_SET_NAME` varchar(32) NOT NULL,
              `FULL_COLLATION_NAME` varchar(64) NOT NULL,
              `ID` bigint(11) NOT NULL,
              `IS_DEFAULT` varchar(3) NOT NULL
            ) ENGINE=MEMORY DEFAULT CHARSET=utf8mb3;
            {code}

            The column COLLATION\_NAME will display:
            \- short names, without the character set name prefix, for new UCA1400 collations.
            \- long names for all old collations

            So for example, the output from this query:

            {code:sql}
            SELECT * FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY
            WHERE COLLATION_NAME LIKE '%czech%cs';
            {code}

            will look about like this:

            {noformat}
            +---------------------------+--------------------+-----------------------------------+------+------------+
            | COLLATION_NAME | CHARACTER_SET_NAME | FULL_COLLATION_NAME | ID | IS_DEFAULT |
            +---------------------------+--------------------+-----------------------------------+------+------------+
            | latin2_czech_cs | latin2 | latin2_czech_cs | 2 | |
            | cp1250_czech_cs | cp1250 | cp1250_czech_cs | 34 | |
            | uca1400_czech_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_ai_cs | 2129 | |
            | uca1400_czech_as_cs | utf8mb3 | utf8mb3_uca1400_czech_as_cs | 2131 | |
            | uca1400_czech_nopad_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_ai_cs | 2133 | |
            | uca1400_czech_nopad_as_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_as_cs | 2135 | |
            | uca1400_czech_ai_cs | ucs2 | ucs2_uca1400_czech_ai_cs | 2641 | |
            | uca1400_czech_as_cs | ucs2 | ucs2_uca1400_czech_as_cs | 2643 | |
            | uca1400_czech_nopad_ai_cs | ucs2 | ucs2_uca1400_czech_nopad_ai_cs | 2645 | |
            | uca1400_czech_nopad_as_cs | ucs2 | ucs2_uca1400_czech_nopad_as_cs | 2647 | |
            | uca1400_czech_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_ai_cs | 2385 | |
            | uca1400_czech_as_cs | utf8mb4 | utf8mb4_uca1400_czech_as_cs | 2387 | |
            | uca1400_czech_nopad_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_ai_cs | 2389 | |
            | uca1400_czech_nopad_as_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_as_cs | 2391 | |
            | uca1400_czech_ai_cs | utf16 | utf16_uca1400_czech_ai_cs | 2897 | |
            | uca1400_czech_as_cs | utf16 | utf16_uca1400_czech_as_cs | 2899 | |
            | uca1400_czech_nopad_ai_cs | utf16 | utf16_uca1400_czech_nopad_ai_cs | 2901 | |
            | uca1400_czech_nopad_as_cs | utf16 | utf16_uca1400_czech_nopad_as_cs | 2903 | |
            | uca1400_czech_ai_cs | utf32 | utf32_uca1400_czech_ai_cs | 3153 | |
            | uca1400_czech_as_cs | utf32 | utf32_uca1400_czech_as_cs | 3155 | |
            | uca1400_czech_nopad_ai_cs | utf32 | utf32_uca1400_czech_nopad_ai_cs | 3157 | |
            | uca1400_czech_nopad_as_cs | utf32 | utf32_uca1400_czech_nopad_as_cs | 3159 | |
            +---------------------------+--------------------+-----------------------------------+------+------------+
            {noformat}

            h2. New ID ranges

            We'll use the range 2048-4095 for new UCA-14.0.0 collation.

            The ID will encode:
            \- character set
            \- tailoring
            \- flags

            into 12 bits as follows:

            {noformat}
            1ccc tttt tPST
            {noformat}

            where

            {noformat}
            - ccc (3 bits) - the character set ID
                               (0 - utf8mb3, 1 - utf8mb4, 2 - ucs2, 3 - utf16, 4 - utf32)

            - ttttt (5 bits) - the tailoring ID
                               (0 - default, 1 - icelandic, 2 - latvian, etc)

            - P (1 bit) - the PAD flag (0 - PAD, 1 - NOPAD)

            - S (1 bit) - Secondary level
                               0 - disabled (accent insensitive)
                               1 - enabled (accent sensitive)

            - T (1 bit) - Tertiary level
                               0 - disabled (case insensitive)
                               1 - enabled (case sensitive)
            {noformat}
            serg Sergei Golubchik made changes -
            Description In order to make utf8mb4 the default character set (see MDEV\-19123), we need a reasonable default collation for utf8mb4.

            utf8mb4_general_ci is not good \- it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other.
            In 2021 it's not acceptable \- there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table.

            Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520\_ci (Unicode-5.2.0 based) is not reasonable either \- these standards are more than 10 years old.

            Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of MDEV-19123).

            Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version.

            h2. Character sets

            New collations will be added for these Unicode character sets
            \- utf8mb3
            \- utf8mb4
            \- ucs2
            \- utf16
            \- utf32

            h2. Tailoring (language specific collations)

            We'll add collations for all 22 tailorings that exists for the old UCA\-4.0.0 collations, except Thai.

            \- icelandic
            \- latvian
            \- romanian
            \- slovenian
            \- polish
            \- estonian
            \- spanish
            \- swedish
            \- turkish
            \- czech
            \- danish
            \- lithuanian
            \- slovak
            \- spanish2
            \- roman
            \- persian
            \- esperanto
            \- hungarian
            \- sinhala
            \- german2
            \- croatian
            \- vietnamese

            Note, with built-in contractions that will be supported in UCA-14.0.0 collations, a separate Thai collation won't be needed. Thai will work with the default root collation.

            h2. Built\-in contractions

            New collations will include built\-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt

            There are 939 built-in contractions in UCA-14.0.0.

            To mention a few of them:

            {noformat}
            0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI>
            0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI>
            0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI>
            0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI>
            0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI>
            {noformat}

            Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built\-in contractions.
            This will be a step towards better Unicode Collation Algorithm compliance. With built\-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation.

            h2. Performance

            The patch adding UCA\-14.0.0 collations will be pushed together with these patches improving the performance:
            - MDEV-27266 Improve UCA collation performance for utf8mb3 and utf8mb4
            - MDEV-27265 Improve contraction performance in UCA collations

            h2. Customization flags

            Under terms of this task we'll add support for the following flags
            \- pad/nopad
            \- accent sensitivity
            \- case sensititivy

            Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available:

            {noformat}
            +---------------------------+
            | collation_name |
            +---------------------------+
            | uca1400_czech_ai_ci |
            | uca1400_czech_ai_cs |
            | uca1400_czech_as_ci |
            | uca1400_czech_as_cs |
            | uca1400_czech_nopad_ai_ci |
            | uca1400_czech_nopad_ai_cs |
            | uca1400_czech_nopad_as_ci |
            | uca1400_czech_nopad_as_cs |
            +---------------------------+
            {noformat}

            h2. Naming convention

            Naming convention for the new collations is described in MDEV\-27210.

            h2. Short collation name

            The character set name in the collation name will be optional in all syntactic constructs:

            {code:sql}
            CREATE TABLE t1 (
              a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci,
              b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci
            );
            {code}

            Notice, it will be enough to specify just uca1400_as_ci without having to type the full names:
            \- utf8mb4_uca1400_as\_ci for the column "a", or
            \- utf16_uca1400_as\_ci for the column "b".

            The full name will be detected automatically according to the character set effective in the given context.

            h2. SHOW CREATE statements

            {{SHOW CREATE}} statements will display long collations names.

            For example, {{SHOW CREATE}} for the table created using the CREATE TABLE statement will produce the following output:

            {code:sql}
            SHOW CREATE TABLE t1;
            {code}

            {noformat}
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | Table | Create Table |
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | t1 | CREATE TABLE `t1` (
              `a` char(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_as_ci DEFAULT NULL,
              `b` char(10) CHARACTER SET utf16 COLLATE utf16_uca1400_as_ci DEFAULT NULL
            ) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            {noformat}

            Notice, the COLLATE clause will contain the full collation name, including the character set prefix.
            But this will possibly change in later versions to display short collation names.

            h2. INFORMATION\_SCHEMA.COLLATIONS changes

            The following columns in INFORMATION\_SCHEMA.COLLATIONS will be changed to be NULL\-able:

            \- CHARACTER_SET_NAME
            \- ID
            \- IS\_DEFAULT

            All new UCA1400 collations will be displayed as follows:

            \- The COLLATION\_NAME column will display the short name, without character set prefix.
            - The three NULL-able column mentioned above will display NULL

            For old collations nothing will change.

            For example, the output from this query:

            {code:sql}
            SELECT * FROM INFORMATION_SCHEMA.COLLATIONS
            WHERE COLLATION_NAME LIKE '%czech%';
            {code}

            will look like this:

            {noformat}
            +---------------------------+--------------------+------+------------+-------------+---------+
            | COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN |
            +---------------------------+--------------------+------+------------+-------------+---------+
            | latin2_czech_cs | latin2 | 2 | | Yes | 4 |
            | cp1250_czech_cs | cp1250 | 34 | | Yes | 2 |
            | utf8mb3_czech_ci | utf8mb3 | 202 | | Yes | 8 |
            | ucs2_czech_ci | ucs2 | 138 | | Yes | 8 |
            | utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 |
            | uca1400_czech_ai_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_ai_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_as_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_as_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_ai_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_ai_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_as_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_as_cs | NULL | NULL | NULL | Yes | 8 |
            | utf16_czech_ci | utf16 | 111 | | Yes | 8 |
            | utf32_czech_ci | utf32 | 170 | | Yes | 8 |
            +---------------------------+--------------------+------+------------+-------------+---------+
            {noformat}

            The idea is that short collation names will be applicable to multiple character sets.
            We'll eventually remove columns CHARACTER_SET_NAME, ID, IS\_DEFAULT from the table COLLATIONS. We'll keep them for now only for easier upgrades.

            h2. INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY changes

            In addition to columns COLLATION_NAME and CHARACTER_SET_NAME currently existing in the table INFORMATION_SCHEMA.COLLATION_CHARACTER_SET\_APPLICABILITY, three new columns will be added:
            \- FULL_COLLATION_NAME
            \- ID
            \- IS\_DEFAULT

            So the new structure for INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY will look like this:

            {code:sql}
            CREATE TEMPORARY TABLE `COLLATION_CHARACTER_SET_APPLICABILITY` (
              `COLLATION_NAME` varchar(64) NOT NULL,
              `CHARACTER_SET_NAME` varchar(32) NOT NULL,
              `FULL_COLLATION_NAME` varchar(64) NOT NULL,
              `ID` bigint(11) NOT NULL,
              `IS_DEFAULT` varchar(3) NOT NULL
            ) ENGINE=MEMORY DEFAULT CHARSET=utf8mb3;
            {code}

            The column COLLATION\_NAME will display:
            \- short names, without the character set name prefix, for new UCA1400 collations.
            \- long names for all old collations

            So for example, the output from this query:

            {code:sql}
            SELECT * FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY
            WHERE COLLATION_NAME LIKE '%czech%cs';
            {code}

            will look about like this:

            {noformat}
            +---------------------------+--------------------+-----------------------------------+------+------------+
            | COLLATION_NAME | CHARACTER_SET_NAME | FULL_COLLATION_NAME | ID | IS_DEFAULT |
            +---------------------------+--------------------+-----------------------------------+------+------------+
            | latin2_czech_cs | latin2 | latin2_czech_cs | 2 | |
            | cp1250_czech_cs | cp1250 | cp1250_czech_cs | 34 | |
            | uca1400_czech_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_ai_cs | 2129 | |
            | uca1400_czech_as_cs | utf8mb3 | utf8mb3_uca1400_czech_as_cs | 2131 | |
            | uca1400_czech_nopad_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_ai_cs | 2133 | |
            | uca1400_czech_nopad_as_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_as_cs | 2135 | |
            | uca1400_czech_ai_cs | ucs2 | ucs2_uca1400_czech_ai_cs | 2641 | |
            | uca1400_czech_as_cs | ucs2 | ucs2_uca1400_czech_as_cs | 2643 | |
            | uca1400_czech_nopad_ai_cs | ucs2 | ucs2_uca1400_czech_nopad_ai_cs | 2645 | |
            | uca1400_czech_nopad_as_cs | ucs2 | ucs2_uca1400_czech_nopad_as_cs | 2647 | |
            | uca1400_czech_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_ai_cs | 2385 | |
            | uca1400_czech_as_cs | utf8mb4 | utf8mb4_uca1400_czech_as_cs | 2387 | |
            | uca1400_czech_nopad_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_ai_cs | 2389 | |
            | uca1400_czech_nopad_as_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_as_cs | 2391 | |
            | uca1400_czech_ai_cs | utf16 | utf16_uca1400_czech_ai_cs | 2897 | |
            | uca1400_czech_as_cs | utf16 | utf16_uca1400_czech_as_cs | 2899 | |
            | uca1400_czech_nopad_ai_cs | utf16 | utf16_uca1400_czech_nopad_ai_cs | 2901 | |
            | uca1400_czech_nopad_as_cs | utf16 | utf16_uca1400_czech_nopad_as_cs | 2903 | |
            | uca1400_czech_ai_cs | utf32 | utf32_uca1400_czech_ai_cs | 3153 | |
            | uca1400_czech_as_cs | utf32 | utf32_uca1400_czech_as_cs | 3155 | |
            | uca1400_czech_nopad_ai_cs | utf32 | utf32_uca1400_czech_nopad_ai_cs | 3157 | |
            | uca1400_czech_nopad_as_cs | utf32 | utf32_uca1400_czech_nopad_as_cs | 3159 | |
            +---------------------------+--------------------+-----------------------------------+------+------------+
            {noformat}

            h2. New ID ranges

            We'll use the range 2048-4095 for new UCA-14.0.0 collation.

            The ID will encode:
            \- character set
            \- tailoring
            \- flags

            into 12 bits as follows:

            {noformat}
            1ccc tttt tPST
            {noformat}

            where

            {noformat}
            - ccc (3 bits) - the character set ID
                               (0 - utf8mb3, 1 - utf8mb4, 2 - ucs2, 3 - utf16, 4 - utf32)

            - ttttt (5 bits) - the tailoring ID
                               (0 - default, 1 - icelandic, 2 - latvian, etc)

            - P (1 bit) - the PAD flag (0 - PAD, 1 - NOPAD)

            - S (1 bit) - Secondary level
                               0 - disabled (accent insensitive)
                               1 - enabled (accent sensitive)

            - T (1 bit) - Tertiary level
                               0 - disabled (case insensitive)
                               1 - enabled (case sensitive)
            {noformat}
            In order to make utf8mb4 the default character set (see MDEV-19123), we need a reasonable default collation for utf8mb4.

            utf8mb4_general_ci is not good — it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other.
            In 2021 it's not acceptable — there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table.

            Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520\_ci (Unicode-5.2.0 based) is not reasonable either — these standards are more than 10 years old.

            Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of MDEV-19123).

            Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version.

            h2. Character sets

            New collations will be added for these Unicode character sets
            - utf8mb3
            - utf8mb4
            - ucs2
            - utf16
            - utf32

            h2. Tailoring (language specific collations)

            We'll add collations for all 22 tailorings that exists for the old UCA-4.0.0 collations, except Thai.

            - icelandic
            - latvian
            - romanian
            - slovenian
            - polish
            - estonian
            - spanish
            - swedish
            - turkish
            - czech
            - danish
            - lithuanian
            - slovak
            - spanish2
            - roman
            - persian
            - esperanto
            - hungarian
            - sinhala
            - german2
            - croatian
            - vietnamese

            Note, with built-in contractions that will be supported in UCA-14.0.0 collations, a separate Thai collation won't be needed. Thai will work with the default root collation.

            h2. Built-in contractions

            New collations will include built\-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt

            There are 939 built-in contractions in UCA-14.0.0.

            To mention a few of them:

            {noformat}
            0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI>
            0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI>
            0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI>
            0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI>
            0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI>
            {noformat}

            Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built\-in contractions.
            This will be a step towards better Unicode Collation Algorithm compliance. With built\-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation.

            h2. Performance

            The patch adding UCA\-14.0.0 collations will be pushed together with these patches improving the performance:
            - MDEV-27266 Improve UCA collation performance for utf8mb3 and utf8mb4
            - MDEV-27265 Improve contraction performance in UCA collations

            h2. Customization flags

            Under terms of this task we'll add support for the following flags
            - pad/nopad
            - accent sensitivity
            - case sensititivy

            Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available:

            {noformat}
            +---------------------------+
            | collation_name |
            +---------------------------+
            | uca1400_czech_ai_ci |
            | uca1400_czech_ai_cs |
            | uca1400_czech_as_ci |
            | uca1400_czech_as_cs |
            | uca1400_czech_nopad_ai_ci |
            | uca1400_czech_nopad_ai_cs |
            | uca1400_czech_nopad_as_ci |
            | uca1400_czech_nopad_as_cs |
            +---------------------------+
            {noformat}

            h2. Naming convention

            Naming convention for the new collations is described in MDEV-27210.

            h2. Short collation name

            The character set name in the collation name will be optional in all syntactic constructs:

            {code:sql}
            CREATE TABLE t1 (
              a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci,
              b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci
            );
            {code}

            Notice, it will be enough to specify just uca1400_as_ci without having to type the full names:
            - utf8mb4_uca1400_as\_ci for the column "a", or
            - utf16_uca1400_as\_ci for the column "b".

            The full name will be detected automatically according to the character set effective in the given context.

            h2. SHOW CREATE statements

            {{SHOW CREATE}} statements will display long collations names.

            For example, {{SHOW CREATE}} for the table created using the CREATE TABLE statement will produce the following output:

            {code:sql}
            SHOW CREATE TABLE t1;
            {code}

            {noformat}
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | Table | Create Table |
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | t1 | CREATE TABLE `t1` (
              `a` char(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_as_ci DEFAULT NULL,
              `b` char(10) CHARACTER SET utf16 COLLATE utf16_uca1400_as_ci DEFAULT NULL
            ) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            {noformat}

            Notice, the COLLATE clause will contain the full collation name, including the character set prefix.
            But this will possibly change in later versions to display short collation names.

            h2. INFORMATION\_SCHEMA.COLLATIONS changes

            The following columns in INFORMATION\_SCHEMA.COLLATIONS will be changed to be NULL\-able:

            - CHARACTER_SET_NAME
            - ID
            - IS\_DEFAULT

            All new UCA1400 collations will be displayed as follows:

            - The COLLATION\_NAME column will display the short name, without character set prefix.
            - The three NULL-able column mentioned above will display NULL

            For old collations nothing will change.

            For example, the output from this query:

            {code:sql}
            SELECT * FROM INFORMATION_SCHEMA.COLLATIONS
            WHERE COLLATION_NAME LIKE '%czech%';
            {code}

            will look like this:

            {noformat}
            +---------------------------+--------------------+------+------------+-------------+---------+
            | COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN |
            +---------------------------+--------------------+------+------------+-------------+---------+
            | latin2_czech_cs | latin2 | 2 | | Yes | 4 |
            | cp1250_czech_cs | cp1250 | 34 | | Yes | 2 |
            | utf8mb3_czech_ci | utf8mb3 | 202 | | Yes | 8 |
            | ucs2_czech_ci | ucs2 | 138 | | Yes | 8 |
            | utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 |
            | uca1400_czech_ai_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_ai_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_as_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_as_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_ai_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_ai_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_as_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_as_cs | NULL | NULL | NULL | Yes | 8 |
            | utf16_czech_ci | utf16 | 111 | | Yes | 8 |
            | utf32_czech_ci | utf32 | 170 | | Yes | 8 |
            +---------------------------+--------------------+------+------------+-------------+---------+
            {noformat}

            The idea is that short collation names will be applicable to multiple character sets.
            We'll eventually remove columns CHARACTER_SET_NAME, ID, IS\_DEFAULT from the table COLLATIONS. We'll keep them for now only for easier upgrades.

            h2. INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY changes

            In addition to columns COLLATION_NAME and CHARACTER_SET_NAME currently existing in the table INFORMATION_SCHEMA.COLLATION_CHARACTER_SET\_APPLICABILITY, three new columns will be added:
            - FULL_COLLATION_NAME
            - ID
            - IS\_DEFAULT

            So the new structure for INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY will look like this:

            {code:sql}
            CREATE TEMPORARY TABLE `COLLATION_CHARACTER_SET_APPLICABILITY` (
              `COLLATION_NAME` varchar(64) NOT NULL,
              `CHARACTER_SET_NAME` varchar(32) NOT NULL,
              `FULL_COLLATION_NAME` varchar(64) NOT NULL,
              `ID` bigint(11) NOT NULL,
              `IS_DEFAULT` varchar(3) NOT NULL
            ) ENGINE=MEMORY DEFAULT CHARSET=utf8mb3;
            {code}

            The column COLLATION\_NAME will display:
            - short names, without the character set name prefix, for new UCA1400 collations.
            - long names for all old collations

            So for example, the output from this query:

            {code:sql}
            SELECT * FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY
            WHERE COLLATION_NAME LIKE '%czech%cs';
            {code}

            will look about like this:

            {noformat}
            +---------------------------+--------------------+-----------------------------------+------+------------+
            | COLLATION_NAME | CHARACTER_SET_NAME | FULL_COLLATION_NAME | ID | IS_DEFAULT |
            +---------------------------+--------------------+-----------------------------------+------+------------+
            | latin2_czech_cs | latin2 | latin2_czech_cs | 2 | |
            | cp1250_czech_cs | cp1250 | cp1250_czech_cs | 34 | |
            | uca1400_czech_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_ai_cs | 2129 | |
            | uca1400_czech_as_cs | utf8mb3 | utf8mb3_uca1400_czech_as_cs | 2131 | |
            | uca1400_czech_nopad_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_ai_cs | 2133 | |
            | uca1400_czech_nopad_as_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_as_cs | 2135 | |
            | uca1400_czech_ai_cs | ucs2 | ucs2_uca1400_czech_ai_cs | 2641 | |
            | uca1400_czech_as_cs | ucs2 | ucs2_uca1400_czech_as_cs | 2643 | |
            | uca1400_czech_nopad_ai_cs | ucs2 | ucs2_uca1400_czech_nopad_ai_cs | 2645 | |
            | uca1400_czech_nopad_as_cs | ucs2 | ucs2_uca1400_czech_nopad_as_cs | 2647 | |
            | uca1400_czech_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_ai_cs | 2385 | |
            | uca1400_czech_as_cs | utf8mb4 | utf8mb4_uca1400_czech_as_cs | 2387 | |
            | uca1400_czech_nopad_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_ai_cs | 2389 | |
            | uca1400_czech_nopad_as_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_as_cs | 2391 | |
            | uca1400_czech_ai_cs | utf16 | utf16_uca1400_czech_ai_cs | 2897 | |
            | uca1400_czech_as_cs | utf16 | utf16_uca1400_czech_as_cs | 2899 | |
            | uca1400_czech_nopad_ai_cs | utf16 | utf16_uca1400_czech_nopad_ai_cs | 2901 | |
            | uca1400_czech_nopad_as_cs | utf16 | utf16_uca1400_czech_nopad_as_cs | 2903 | |
            | uca1400_czech_ai_cs | utf32 | utf32_uca1400_czech_ai_cs | 3153 | |
            | uca1400_czech_as_cs | utf32 | utf32_uca1400_czech_as_cs | 3155 | |
            | uca1400_czech_nopad_ai_cs | utf32 | utf32_uca1400_czech_nopad_ai_cs | 3157 | |
            | uca1400_czech_nopad_as_cs | utf32 | utf32_uca1400_czech_nopad_as_cs | 3159 | |
            +---------------------------+--------------------+-----------------------------------+------+------------+
            {noformat}

            h2. New ID ranges

            We'll use the range 2048-4095 for new UCA-14.0.0 collation.

            The ID will encode:
            - character set
            - tailoring
            - flags

            into 12 bits as follows:

            {noformat}
            1ccc tttt tPST
            {noformat}

            where

            {noformat}
            - ccc (3 bits) - the character set ID
                               (0 - utf8mb3, 1 - utf8mb4, 2 - ucs2, 3 - utf16, 4 - utf32)

            - ttttt (5 bits) - the tailoring ID
                               (0 - default, 1 - icelandic, 2 - latvian, etc)

            - P (1 bit) - the PAD flag (0 - PAD, 1 - NOPAD)

            - S (1 bit) - Secondary level
                               0 - disabled (accent insensitive)
                               1 - enabled (accent sensitive)

            - T (1 bit) - Tertiary level
                               0 - disabled (case insensitive)
                               1 - enabled (case sensitive)
            {noformat}
            bar Alexander Barkov added a comment - serg , please review a new version here: https://github.com/MariaDB/server/commits/bb-10.9-bar-uca14
            bar Alexander Barkov made changes -
            Assignee Alexander Barkov [ bar ] Sergei Golubchik [ serg ]
            Status Stalled [ 10000 ] In Review [ 10002 ]
            serg Sergei Golubchik made changes -
            Assignee Sergei Golubchik [ serg ] Alexander Barkov [ bar ]
            Status In Review [ 10002 ] Stalled [ 10000 ]
            bar Alexander Barkov made changes -
            Status Stalled [ 10000 ] In Testing [ 10301 ]
            bar Alexander Barkov made changes -
            Assignee Alexander Barkov [ bar ] Elena Stepanova [ elenst ]
            bar Alexander Barkov added a comment - - edited

            elenst, please test this task.
            It's in this branch: preview-10.10-uca14.
            Thanks.

            bar Alexander Barkov added a comment - - edited elenst , please test this task. It's in this branch: preview-10.10-uca14 . Thanks.
            elenst Elena Stepanova made changes -
            Assignee Elena Stepanova [ elenst ] Lena Startseva [ JIRAUSER50478 ]
            lstartseva Lena Startseva added a comment - - edited

            I'm not sure that it is a bug, but for "croatian" and "vietnamese" are generated incorrect ids. According new rule for ID ranges:

            - ttttt (5 bits) - the tailoring ID
                               (0 - default, 1 - icelandic, 2 - latvian, etc)
            

            In tasks says that we have 22 tailorings (default + 21 language), but "croatian" has tailoring ID "11000" (24 in decimal) and "vietnamese" has "10111" (23 in decimal).

            Also, in MDEV-27210 are described more flags than in current task, but their potential presence is not included in IDs.

            UPD:

            • A new range of IDs will be allocated for new flags
            • There are two special reserved IDs after "german2"
            lstartseva Lena Startseva added a comment - - edited I'm not sure that it is a bug, but for "croatian" and "vietnamese" are generated incorrect ids. According new rule for ID ranges: - ttttt (5 bits) - the tailoring ID (0 - default, 1 - icelandic, 2 - latvian, etc) In tasks says that we have 22 tailorings (default + 21 language), but "croatian" has tailoring ID "11000" (24 in decimal) and "vietnamese" has "10111" (23 in decimal). Also, in MDEV-27210 are described more flags than in current task, but their potential presence is not included in IDs. UPD: A new range of IDs will be allocated for new flags There are two special reserved IDs after "german2"
            lstartseva Lena Startseva made changes -
            Status In Testing [ 10301 ] Stalled [ 10000 ]
            lstartseva Lena Startseva made changes -
            Assignee Lena Startseva [ JIRAUSER50478 ] Alexander Barkov [ bar ]

            Ok to push

            lstartseva Lena Startseva added a comment - Ok to push
            AirFocus AirFocus made changes -
            Description In order to make utf8mb4 the default character set (see MDEV-19123), we need a reasonable default collation for utf8mb4.

            utf8mb4_general_ci is not good — it compares all supplementary characters (with code points in the range U+10000 to U+10FFFF) as equal to each other.
            In 2021 it's not acceptable — there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table.

            Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520\_ci (Unicode-5.2.0 based) is not reasonable either — these standards are more than 10 years old.

            Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of MDEV-19123).

            Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version.

            h2. Character sets

            New collations will be added for these Unicode character sets
            - utf8mb3
            - utf8mb4
            - ucs2
            - utf16
            - utf32

            h2. Tailoring (language specific collations)

            We'll add collations for all 22 tailorings that exists for the old UCA-4.0.0 collations, except Thai.

            - icelandic
            - latvian
            - romanian
            - slovenian
            - polish
            - estonian
            - spanish
            - swedish
            - turkish
            - czech
            - danish
            - lithuanian
            - slovak
            - spanish2
            - roman
            - persian
            - esperanto
            - hungarian
            - sinhala
            - german2
            - croatian
            - vietnamese

            Note, with built-in contractions that will be supported in UCA-14.0.0 collations, a separate Thai collation won't be needed. Thai will work with the default root collation.

            h2. Built-in contractions

            New collations will include built\-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt

            There are 939 built-in contractions in UCA-14.0.0.

            To mention a few of them:

            {noformat}
            0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI>
            0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI>
            0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI>
            0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI>
            0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI>
            {noformat}

            Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built\-in contractions.
            This will be a step towards better Unicode Collation Algorithm compliance. With built\-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation.

            h2. Performance

            The patch adding UCA\-14.0.0 collations will be pushed together with these patches improving the performance:
            - MDEV-27266 Improve UCA collation performance for utf8mb3 and utf8mb4
            - MDEV-27265 Improve contraction performance in UCA collations

            h2. Customization flags

            Under terms of this task we'll add support for the following flags
            - pad/nopad
            - accent sensitivity
            - case sensititivy

            Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available:

            {noformat}
            +---------------------------+
            | collation_name |
            +---------------------------+
            | uca1400_czech_ai_ci |
            | uca1400_czech_ai_cs |
            | uca1400_czech_as_ci |
            | uca1400_czech_as_cs |
            | uca1400_czech_nopad_ai_ci |
            | uca1400_czech_nopad_ai_cs |
            | uca1400_czech_nopad_as_ci |
            | uca1400_czech_nopad_as_cs |
            +---------------------------+
            {noformat}

            h2. Naming convention

            Naming convention for the new collations is described in MDEV-27210.

            h2. Short collation name

            The character set name in the collation name will be optional in all syntactic constructs:

            {code:sql}
            CREATE TABLE t1 (
              a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci,
              b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci
            );
            {code}

            Notice, it will be enough to specify just uca1400_as_ci without having to type the full names:
            - utf8mb4_uca1400_as\_ci for the column "a", or
            - utf16_uca1400_as\_ci for the column "b".

            The full name will be detected automatically according to the character set effective in the given context.

            h2. SHOW CREATE statements

            {{SHOW CREATE}} statements will display long collations names.

            For example, {{SHOW CREATE}} for the table created using the CREATE TABLE statement will produce the following output:

            {code:sql}
            SHOW CREATE TABLE t1;
            {code}

            {noformat}
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | Table | Create Table |
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | t1 | CREATE TABLE `t1` (
              `a` char(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_as_ci DEFAULT NULL,
              `b` char(10) CHARACTER SET utf16 COLLATE utf16_uca1400_as_ci DEFAULT NULL
            ) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            {noformat}

            Notice, the COLLATE clause will contain the full collation name, including the character set prefix.
            But this will possibly change in later versions to display short collation names.

            h2. INFORMATION\_SCHEMA.COLLATIONS changes

            The following columns in INFORMATION\_SCHEMA.COLLATIONS will be changed to be NULL\-able:

            - CHARACTER_SET_NAME
            - ID
            - IS\_DEFAULT

            All new UCA1400 collations will be displayed as follows:

            - The COLLATION\_NAME column will display the short name, without character set prefix.
            - The three NULL-able column mentioned above will display NULL

            For old collations nothing will change.

            For example, the output from this query:

            {code:sql}
            SELECT * FROM INFORMATION_SCHEMA.COLLATIONS
            WHERE COLLATION_NAME LIKE '%czech%';
            {code}

            will look like this:

            {noformat}
            +---------------------------+--------------------+------+------------+-------------+---------+
            | COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN |
            +---------------------------+--------------------+------+------------+-------------+---------+
            | latin2_czech_cs | latin2 | 2 | | Yes | 4 |
            | cp1250_czech_cs | cp1250 | 34 | | Yes | 2 |
            | utf8mb3_czech_ci | utf8mb3 | 202 | | Yes | 8 |
            | ucs2_czech_ci | ucs2 | 138 | | Yes | 8 |
            | utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 |
            | uca1400_czech_ai_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_ai_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_as_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_as_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_ai_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_ai_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_as_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_as_cs | NULL | NULL | NULL | Yes | 8 |
            | utf16_czech_ci | utf16 | 111 | | Yes | 8 |
            | utf32_czech_ci | utf32 | 170 | | Yes | 8 |
            +---------------------------+--------------------+------+------------+-------------+---------+
            {noformat}

            The idea is that short collation names will be applicable to multiple character sets.
            We'll eventually remove columns CHARACTER_SET_NAME, ID, IS\_DEFAULT from the table COLLATIONS. We'll keep them for now only for easier upgrades.

            h2. INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY changes

            In addition to columns COLLATION_NAME and CHARACTER_SET_NAME currently existing in the table INFORMATION_SCHEMA.COLLATION_CHARACTER_SET\_APPLICABILITY, three new columns will be added:
            - FULL_COLLATION_NAME
            - ID
            - IS\_DEFAULT

            So the new structure for INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY will look like this:

            {code:sql}
            CREATE TEMPORARY TABLE `COLLATION_CHARACTER_SET_APPLICABILITY` (
              `COLLATION_NAME` varchar(64) NOT NULL,
              `CHARACTER_SET_NAME` varchar(32) NOT NULL,
              `FULL_COLLATION_NAME` varchar(64) NOT NULL,
              `ID` bigint(11) NOT NULL,
              `IS_DEFAULT` varchar(3) NOT NULL
            ) ENGINE=MEMORY DEFAULT CHARSET=utf8mb3;
            {code}

            The column COLLATION\_NAME will display:
            - short names, without the character set name prefix, for new UCA1400 collations.
            - long names for all old collations

            So for example, the output from this query:

            {code:sql}
            SELECT * FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY
            WHERE COLLATION_NAME LIKE '%czech%cs';
            {code}

            will look about like this:

            {noformat}
            +---------------------------+--------------------+-----------------------------------+------+------------+
            | COLLATION_NAME | CHARACTER_SET_NAME | FULL_COLLATION_NAME | ID | IS_DEFAULT |
            +---------------------------+--------------------+-----------------------------------+------+------------+
            | latin2_czech_cs | latin2 | latin2_czech_cs | 2 | |
            | cp1250_czech_cs | cp1250 | cp1250_czech_cs | 34 | |
            | uca1400_czech_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_ai_cs | 2129 | |
            | uca1400_czech_as_cs | utf8mb3 | utf8mb3_uca1400_czech_as_cs | 2131 | |
            | uca1400_czech_nopad_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_ai_cs | 2133 | |
            | uca1400_czech_nopad_as_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_as_cs | 2135 | |
            | uca1400_czech_ai_cs | ucs2 | ucs2_uca1400_czech_ai_cs | 2641 | |
            | uca1400_czech_as_cs | ucs2 | ucs2_uca1400_czech_as_cs | 2643 | |
            | uca1400_czech_nopad_ai_cs | ucs2 | ucs2_uca1400_czech_nopad_ai_cs | 2645 | |
            | uca1400_czech_nopad_as_cs | ucs2 | ucs2_uca1400_czech_nopad_as_cs | 2647 | |
            | uca1400_czech_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_ai_cs | 2385 | |
            | uca1400_czech_as_cs | utf8mb4 | utf8mb4_uca1400_czech_as_cs | 2387 | |
            | uca1400_czech_nopad_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_ai_cs | 2389 | |
            | uca1400_czech_nopad_as_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_as_cs | 2391 | |
            | uca1400_czech_ai_cs | utf16 | utf16_uca1400_czech_ai_cs | 2897 | |
            | uca1400_czech_as_cs | utf16 | utf16_uca1400_czech_as_cs | 2899 | |
            | uca1400_czech_nopad_ai_cs | utf16 | utf16_uca1400_czech_nopad_ai_cs | 2901 | |
            | uca1400_czech_nopad_as_cs | utf16 | utf16_uca1400_czech_nopad_as_cs | 2903 | |
            | uca1400_czech_ai_cs | utf32 | utf32_uca1400_czech_ai_cs | 3153 | |
            | uca1400_czech_as_cs | utf32 | utf32_uca1400_czech_as_cs | 3155 | |
            | uca1400_czech_nopad_ai_cs | utf32 | utf32_uca1400_czech_nopad_ai_cs | 3157 | |
            | uca1400_czech_nopad_as_cs | utf32 | utf32_uca1400_czech_nopad_as_cs | 3159 | |
            +---------------------------+--------------------+-----------------------------------+------+------------+
            {noformat}

            h2. New ID ranges

            We'll use the range 2048-4095 for new UCA-14.0.0 collation.

            The ID will encode:
            - character set
            - tailoring
            - flags

            into 12 bits as follows:

            {noformat}
            1ccc tttt tPST
            {noformat}

            where

            {noformat}
            - ccc (3 bits) - the character set ID
                               (0 - utf8mb3, 1 - utf8mb4, 2 - ucs2, 3 - utf16, 4 - utf32)

            - ttttt (5 bits) - the tailoring ID
                               (0 - default, 1 - icelandic, 2 - latvian, etc)

            - P (1 bit) - the PAD flag (0 - PAD, 1 - NOPAD)

            - S (1 bit) - Secondary level
                               0 - disabled (accent insensitive)
                               1 - enabled (accent sensitive)

            - T (1 bit) - Tertiary level
                               0 - disabled (case insensitive)
                               1 - enabled (case sensitive)
            {noformat}
            In order to make utf8mb4 the default character set (see MDEV\-19123), we need a reasonable default collation for utf8mb4.

            utf8mb4_general_ci is not good — it compares all supplementary characters (with code points in the range U10000 to U10FFFF) as equal to each other.
            In 2021 it's not acceptable — there are around 16 thousand supplementary charcters that have explicit weights in the DUCET table.

            Changing the default to utf8mb4_unicode_ci (Unicode-4.0.0 based) or utf8mb4_unicode_520\_ci (Unicode-5.2.0 based) is not reasonable either — these standards are more than 10 years old.

            Let's add a number of Unicode collation algoritm (UCA) collation based on Unicode-14.0.0 (released in September 2021), so we can make it later the default for utf8mb4 (under terms of MDEV-19123).

            Under terms of this task we'll add the "root" collation, as well as all language specific collations that exist for the old Unicode version.

            h2. Character sets

            New collations will be added for these Unicode character sets

            * utf8mb3
            * utf8mb4
            * ucs2
            * utf16
            * utf32

            h2. Tailoring (language specific collations)

            We'll add collations for all 22 tailorings that exists for the old UCA\-4.0.0 collations, except Thai.

            * icelandic
            * latvian
            * romanian
            * slovenian
            * polish
            * estonian
            * spanish
            * swedish
            * turkish
            * czech
            * danish
            * lithuanian
            * slovak
            * spanish2
            * roman
            * persian
            * esperanto
            * hungarian
            * sinhala
            * german2
            * croatian
            * vietnamese

            Note, with built-in contractions that will be supported in UCA-14.0.0 collations, a separate Thai collation won't be needed. Thai will work with the default root collation.

            h2. Built\-in contractions

            New collations will include built\-in contractions from http://www.unicode.org/Public/UCA/14.0.0/allkeys.txt

            There are 939 built-in contractions in UCA-14.0.0.

            To mention a few of them:

            {noformat}
            0E40 0E01 ; [.3339.0020.0002][.3373.0020.0002] # <THAI CHARACTER SARA E, THAI CHARACTER KO KAI>
            0E41 0E01 ; [.3339.0020.0002][.3374.0020.0002] # <THAI CHARACTER SARA AE, THAI CHARACTER KO KAI>
            0E42 0E01 ; [.3339.0020.0002][.3375.0020.0002] # <THAI CHARACTER SARA O, THAI CHARACTER KO KAI>
            0E43 0E01 ; [.3339.0020.0002][.3376.0020.0002] # <THAI CHARACTER SARA AI MAIMUAN, THAI CHARACTER KO KAI>
            0E44 0E01 ; [.3339.0020.0002][.3377.0020.0002] # <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER KO KAI>
            {noformat}

            Note the old collations based on UCA-4.0.0 and UCA-5.2.0 did not support built\-in contractions.
            This will be a step towards better Unicode Collation Algorithm compliance. With built\-in contractions, some languages (e.g. Thai) won't need specific collations and will just work with the default "root" collation.

            h2. Performance

            The patch adding UCA\-14.0.0 collations will be pushed together with these patches improving the performance:

            * MDEV\-27266 Improve UCA collation performance for utf8mb3 and utf8mb4
            * MDEV\-27265 Improve contraction performance in UCA collations

            h2. Customization flags

            Under terms of this task we'll add support for the following flags

            * pad/nopad
            * accent sensitivity
            * case sensititivy

            Any arbitrary combination of these flags will be possible. That means, eight combinations for each tailoring (each language) will be added. For example, for Czech language, the following collations will be available:

            {noformat}
            +---------------------------+
            | collation_name |
            +---------------------------+
            | uca1400_czech_ai_ci |
            | uca1400_czech_ai_cs |
            | uca1400_czech_as_ci |
            | uca1400_czech_as_cs |
            | uca1400_czech_nopad_ai_ci |
            | uca1400_czech_nopad_ai_cs |
            | uca1400_czech_nopad_as_ci |
            | uca1400_czech_nopad_as_cs |
            +---------------------------+
            {noformat}

            h2. Naming convention

            Naming convention for the new collations is described in MDEV\-27210.

            h2. Short collation name

            The character set name in the collation name will be optional in all syntactic constructs:

            {code:sql}
            CREATE TABLE t1 (
              a CHAR(10) CHARACTER SET utf8mb4 COLLATE uca1400_as_ci,
              b CHAR(10) CHARACTER SET utf16 COLLATE uca1400_as_ci
            );
            {code}

            Notice, it will be enough to specify just uca1400_as_ci without having to type the full names:

            * utf8mb4_uca1400_as\_ci for the column "a", or
            * utf16_uca1400_as\_ci for the column "b".

            The full name will be detected automatically according to the character set effective in the given context.

            h2. SHOW CREATE statements

            {{SHOW CREATE}} statements will display long collations names.

            For example, {{SHOW CREATE}} for the table created using the CREATE TABLE statement will produce the following output:

            {code:sql}
            SHOW CREATE TABLE t1;
            {code}

            {noformat}
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | Table | Create Table |
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | t1 | CREATE TABLE `t1` (
              `a` char(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_as_ci DEFAULT NULL,
              `b` char(10) CHARACTER SET utf16 COLLATE utf16_uca1400_as_ci DEFAULT NULL
            ) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
            +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            {noformat}

            Notice, the COLLATE clause will contain the full collation name, including the character set prefix.
            But this will possibly change in later versions to display short collation names.

            h2. INFORMATION\_SCHEMA.COLLATIONS changes

            The following columns in INFORMATION\_SCHEMA.COLLATIONS will be changed to be NULL\-able:

            * CHARACTER_SET_NAME
            * ID
            * IS\_DEFAULT

            All new UCA1400 collations will be displayed as follows:

            * The COLLATION\_NAME column will display the short name, without character set prefix.
            * The three NULL\-able column mentioned above will display NULL

            For old collations nothing will change.

            For example, the output from this query:

            {code:sql}
            SELECT * FROM INFORMATION_SCHEMA.COLLATIONS
            WHERE COLLATION_NAME LIKE '%czech%';
            {code}

            will look like this:

            {noformat}
            +---------------------------+--------------------+------+------------+-------------+---------+
            | COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN |
            +---------------------------+--------------------+------+------------+-------------+---------+
            | latin2_czech_cs | latin2 | 2 | | Yes | 4 |
            | cp1250_czech_cs | cp1250 | 34 | | Yes | 2 |
            | utf8mb3_czech_ci | utf8mb3 | 202 | | Yes | 8 |
            | ucs2_czech_ci | ucs2 | 138 | | Yes | 8 |
            | utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 |
            | uca1400_czech_ai_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_ai_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_as_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_as_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_ai_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_ai_cs | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_as_ci | NULL | NULL | NULL | Yes | 8 |
            | uca1400_czech_nopad_as_cs | NULL | NULL | NULL | Yes | 8 |
            | utf16_czech_ci | utf16 | 111 | | Yes | 8 |
            | utf32_czech_ci | utf32 | 170 | | Yes | 8 |
            +---------------------------+--------------------+------+------------+-------------+---------+
            {noformat}

            The idea is that short collation names will be applicable to multiple character sets.
            We'll eventually remove columns CHARACTER_SET_NAME, ID, IS\_DEFAULT from the table COLLATIONS. We'll keep them for now only for easier upgrades.

            h2. INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY changes

            In addition to columns COLLATION_NAME and CHARACTER_SET_NAME currently existing in the table INFORMATION_SCHEMA.COLLATION_CHARACTER_SET\_APPLICABILITY, three new columns will be added:

            * FULL_COLLATION_NAME
            * ID
            * IS\_DEFAULT

            So the new structure for INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY will look like this:

            {code:sql}
            CREATE TEMPORARY TABLE `COLLATION_CHARACTER_SET_APPLICABILITY` (
              `COLLATION_NAME` varchar(64) NOT NULL,
              `CHARACTER_SET_NAME` varchar(32) NOT NULL,
              `FULL_COLLATION_NAME` varchar(64) NOT NULL,
              `ID` bigint(11) NOT NULL,
              `IS_DEFAULT` varchar(3) NOT NULL
            ) ENGINE=MEMORY DEFAULT CHARSET=utf8mb3;
            {code}

            The column COLLATION\_NAME will display:

            * short names, without the character set name prefix, for new UCA1400 collations.
            * long names for all old collations

            So for example, the output from this query:

            {code:sql}
            SELECT * FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY
            WHERE COLLATION_NAME LIKE '%czech%cs';
            {code}

            will look about like this:

            {noformat}
            +---------------------------+--------------------+-----------------------------------+------+------------+
            | COLLATION_NAME | CHARACTER_SET_NAME | FULL_COLLATION_NAME | ID | IS_DEFAULT |
            +---------------------------+--------------------+-----------------------------------+------+------------+
            | latin2_czech_cs | latin2 | latin2_czech_cs | 2 | |
            | cp1250_czech_cs | cp1250 | cp1250_czech_cs | 34 | |
            | uca1400_czech_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_ai_cs | 2129 | |
            | uca1400_czech_as_cs | utf8mb3 | utf8mb3_uca1400_czech_as_cs | 2131 | |
            | uca1400_czech_nopad_ai_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_ai_cs | 2133 | |
            | uca1400_czech_nopad_as_cs | utf8mb3 | utf8mb3_uca1400_czech_nopad_as_cs | 2135 | |
            | uca1400_czech_ai_cs | ucs2 | ucs2_uca1400_czech_ai_cs | 2641 | |
            | uca1400_czech_as_cs | ucs2 | ucs2_uca1400_czech_as_cs | 2643 | |
            | uca1400_czech_nopad_ai_cs | ucs2 | ucs2_uca1400_czech_nopad_ai_cs | 2645 | |
            | uca1400_czech_nopad_as_cs | ucs2 | ucs2_uca1400_czech_nopad_as_cs | 2647 | |
            | uca1400_czech_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_ai_cs | 2385 | |
            | uca1400_czech_as_cs | utf8mb4 | utf8mb4_uca1400_czech_as_cs | 2387 | |
            | uca1400_czech_nopad_ai_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_ai_cs | 2389 | |
            | uca1400_czech_nopad_as_cs | utf8mb4 | utf8mb4_uca1400_czech_nopad_as_cs | 2391 | |
            | uca1400_czech_ai_cs | utf16 | utf16_uca1400_czech_ai_cs | 2897 | |
            | uca1400_czech_as_cs | utf16 | utf16_uca1400_czech_as_cs | 2899 | |
            | uca1400_czech_nopad_ai_cs | utf16 | utf16_uca1400_czech_nopad_ai_cs | 2901 | |
            | uca1400_czech_nopad_as_cs | utf16 | utf16_uca1400_czech_nopad_as_cs | 2903 | |
            | uca1400_czech_ai_cs | utf32 | utf32_uca1400_czech_ai_cs | 3153 | |
            | uca1400_czech_as_cs | utf32 | utf32_uca1400_czech_as_cs | 3155 | |
            | uca1400_czech_nopad_ai_cs | utf32 | utf32_uca1400_czech_nopad_ai_cs | 3157 | |
            | uca1400_czech_nopad_as_cs | utf32 | utf32_uca1400_czech_nopad_as_cs | 3159 | |
            +---------------------------+--------------------+-----------------------------------+------+------------+
            {noformat}

            h2. New ID ranges

            We'll use the range 2048-4095 for new UCA-14.0.0 collation.

            The ID will encode:

            * character set
            * tailoring
            * flags

            into 12 bits as follows:

            {noformat}
            1ccc tttt tPST
            {noformat}

            where

            {noformat}
            - ccc (3 bits) - the character set ID
                               (0 - utf8mb3, 1 - utf8mb4, 2 - ucs2, 3 - utf16, 4 - utf32)

            - ttttt (5 bits) - the tailoring ID
                               (0 - default, 1 - icelandic, 2 - latvian, etc)

            - P (1 bit) - the PAD flag (0 - PAD, 1 - NOPAD)

            - S (1 bit) - Secondary level
                               0 - disabled (accent insensitive)
                               1 - enabled (accent sensitive)

            - T (1 bit) - Tertiary level
                               0 - disabled (case insensitive)
                               1 - enabled (case sensitive)
            {noformat}
            serg Sergei Golubchik made changes -
            Fix Version/s 10.10.1 [ 27913 ]
            Fix Version/s 10.10 [ 27530 ]
            Resolution Fixed [ 1 ]
            Status Stalled [ 10000 ] Closed [ 6 ]
            bar Alexander Barkov made changes -
            bar Alexander Barkov made changes -
            bar Alexander Barkov made changes -
            bar Alexander Barkov made changes -
            elenst Elena Stepanova made changes -
            bar Alexander Barkov made changes -
            bar Alexander Barkov made changes -
            bar Alexander Barkov made changes -
            bar Alexander Barkov made changes -
            ralf.gebhardt Ralf Gebhardt made changes -
            Labels Preview_10.10
            Moshe L Moshe L made changes -
            danblack Daniel Black made changes -
            serg Sergei Golubchik made changes -
            mariadb-jira-automation Jira Automation (IT) made changes -
            Zendesk Related Tickets 172810 154143 125537 136805
            serg Sergei Golubchik made changes -
            CaptainStarbuck Tony G made changes -
            serg Sergei Golubchik made changes -
            bar Alexander Barkov made changes -

            People

              bar Alexander Barkov
              bar Alexander Barkov
              Votes:
              3 Vote for this issue
              Watchers:
              13 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.