Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-23400

Add UCA case sensitive accent sensitive collations for Unicode character sets

    XMLWordPrintable

    Details

    • Type: Task
    • Status: Open (View Workflow)
    • Priority: Major
    • Resolution: Unresolved
    • Fix Version/s: None
    • Component/s: Character Sets
    • Labels:
      None

      Description

      As of the version 10.5.5, MariaDB support the following collations for Unicode character sets (using utf8 as an example):

      1. A simple accent insensitive and case insensitive collation utf8_general_ci, with one-to-one mapping between characters and their weights. It's very fast. It's not perfect in terms of linguistic sorting order, but works for some languages.
      2. A UCA accent insensitive and case insensitive collation utf8_unicode_ci (and its Unicode-5.2.0 version utf8_unicode_520_ci), with more complex mapping (one-to-zero,one-to-one,one-to-many,many-to-one,many-to-many). They provide a better sorting order than N1.
      3. A set of language specific accent insensitive and case insensitive collations, which use utf8_unicode_ci as a base and reorder a number of characters (e.g. utf8_german2_ci, utf8_spanish_ci, etc)
      4. A UCA accent sensitive case insensitive collation (with two weight levels) utf8_thai_520_w2.
      5. A binary collation, which is accent sensitive and case sensitive. It's extremely fast, but it orders according to the character code. So accented letters are sorted not near their non-accented counterparts.

      This script demonstrates the order provided by utf8_bin:

      CREATE OR REPLACE TABLE t1 (a VARCHAR(32) CHARACTER SET utf8 COLLATE utf8_bin);
      INSERT INTO t1 VALUES ('A'),('a');
      INSERT INTO t1 VALUES ('O'),('o');
      INSERT INTO t1 VALUES ('À'), ('Ä'), ('à'), ('ä');
      INSERT INTO t1 VALUES ('Ò'), ('Ö'), ('ò'), ('ö');
      SELECT * FROM t1 ORDER BY a;
      

      +------+
      | a    |
      +------+
      | A    |
      | O    |
      | a    |
      | o    |
      | À    |
      | Ä    |
      | Ò    |
      | Ö    |
      | à    |
      | ä    |
      | ò    |
      | ö    |
      +------+
      

      So MariaDB has collations with a good linguistic order for these comparison styles:

      • Accent insensitive and case sensitive (N2, N3)
      • Accent sensitive and case sensitive (N4)

      But it does not have collations with a good linguistic order for the case sensitive and accent sensitive comparison style.

      Let's implement good linguistic case sensitive and accent sensitive collations for Unicode character sets.

      Tentative names: xxx_unicode_520_w3 (where xxx is utf8, utf8mb4, ucs2, utf16, utf32).

      The new collations will use 3 levels of Unicode weights. It wil provide much better sorting order than utf8_bin.

      Small letters will appear before capital letters.

      Using the same data, the new collation will return records in the following order:

      +------+
      | a    |
      +------+
      | a    |
      | A    |
      | à    |
      | À    |
      | ä    |
      | Ä    |
      | o    |
      | O    |
      | ò    |
      | Ò    |
      | ö    |
      | Ö    |
      +------+
      

      Open questions:

      • Perhaps, instead of using Unicode-5.2.0, we should add the current Unicode version. But this will grow the server binary size with extra 1Mb.

        Attachments

          Issue Links

            Activity

              People

              Assignee:
              Unassigned Unassigned
              Reporter:
              bar Alexander Barkov
              Votes:
              1 Vote for this issue
              Watchers:
              7 Start watching this issue

                Dates

                Created:
                Updated:

                  Git Integration