  MariaDB Server / MDEV-23400

Add UCA case sensitive accent sensitive collations for Unicode character sets

Details

    Description

      As of version 10.5.5, MariaDB supports the following collations for Unicode character sets (using utf8 as an example):

      1. A simple accent insensitive and case insensitive collation utf8_general_ci, with a one-to-one mapping between characters and their weights. It is very fast. Its linguistic sorting order is not perfect, but it works for some languages.
      2. A UCA accent insensitive and case insensitive collation utf8_unicode_ci (and its Unicode-5.2.0 version utf8_unicode_520_ci), with more complex mappings (one-to-zero, one-to-one, one-to-many, many-to-one, many-to-many). These provide a better sorting order than item 1.
      3. A set of language specific accent insensitive and case insensitive collations, which use utf8_unicode_ci as a base and reorder a number of characters (e.g. utf8_german2_ci, utf8_spanish_ci, etc.).
      4. A UCA accent sensitive and case insensitive collation (with two weight levels) utf8_thai_520_w2.
      5. A binary collation, which is accent sensitive and case sensitive. It is extremely fast, but it orders strictly by character code, so accented letters do not sort near their unaccented counterparts.

      This script demonstrates the order provided by utf8_bin:

      CREATE OR REPLACE TABLE t1 (a VARCHAR(32) CHARACTER SET utf8 COLLATE utf8_bin);
      INSERT INTO t1 VALUES ('A'),('a');
      INSERT INTO t1 VALUES ('O'),('o');
      INSERT INTO t1 VALUES ('À'), ('Ä'), ('à'), ('ä');
      INSERT INTO t1 VALUES ('Ò'), ('Ö'), ('ò'), ('ö');
      SELECT * FROM t1 ORDER BY a;
      

      +------+
      | a    |
      +------+
      | A    |
      | O    |
      | a    |
      | o    |
      | À    |
      | Ä    |
      | Ò    |
      | Ö    |
      | à    |
      | ä    |
      | ò    |
      | ö    |
      +------+
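
      The binary order above can be reproduced outside the server. The following is a minimal sketch in Python (not MariaDB code), assuming that a binary collation simply compares the encoded bytes; the variable names are illustrative only.

```python
import unicodedata

# Letters from the script above, in insertion order.
# NFC normalization guarantees each accented letter is a single code point.
letters = [unicodedata.normalize("NFC", s) for s in
           ["A", "a", "O", "o", "À", "Ä", "à", "ä", "Ò", "Ö", "ò", "ö"]]

# Assumption: a binary collation compares the raw UTF-8 bytes.
# For UTF-8, byte order is the same as code point order.
binary_order = sorted(letters, key=lambda s: s.encode("utf-8"))

for ch in binary_order:
    # Show the code point next to each letter to make the order visible.
    print(f"U+{ord(ch):04X} {ch}")
```

      All unaccented uppercase letters come first, then unaccented lowercase letters, then the accented letters, which is why 'À' ends up far away from 'A'.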
      

      So MariaDB has collations with a good linguistic order for these comparison styles:

      • Accent insensitive and case insensitive (items 2 and 3)
      • Accent sensitive and case insensitive (item 4)

      But it does not have collations with a good linguistic order for the case sensitive and accent sensitive comparison style.

      Let's implement good linguistic case sensitive and accent sensitive collations for Unicode character sets.

      Tentative names: xxx_unicode_520_w3 (where xxx is utf8, utf8mb4, ucs2, utf16, utf32).

      The new collations will use 3 levels of Unicode weights. They will provide a much better sorting order than utf8_bin.

      Lowercase letters will sort before uppercase letters.

      Using the same data, the new collation will return records in the following order:

      +------+
      | a    |
      +------+
      | a    |
      | A    |
      | à    |
      | À    |
      | ä    |
      | Ä    |
      | o    |
      | O    |
      | ò    |
      | Ò    |
      | ö    |
      | Ö    |
      +------+
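
      The idea of comparing on three weight levels can be sketched as follows. This is a simplified illustration in Python, not the UCA algorithm and not MariaDB's implementation: it does not use the real Unicode weight tables, and the function name sketch_sort_key and its levels parameter are hypothetical. It assumes the base letter acts as the primary weight, the accent as the secondary weight, and the letter case as the tertiary weight, with lowercase first.

```python
import unicodedata

def sketch_sort_key(text, levels=3):
    """Build a simplified multi-level sort key (hypothetical helper)."""
    # Split each accented letter into base letter + combining accent marks.
    decomposed = unicodedata.normalize("NFD", text)
    base = [ch for ch in decomposed if not unicodedata.combining(ch)]
    marks = [ch for ch in decomposed if unicodedata.combining(ch)]

    primary = tuple(ch.lower() for ch in base)                # letter identity
    secondary = tuple(ord(ch) for ch in marks)                # accents
    tertiary = tuple(1 if ch.isupper() else 0 for ch in base) # lowercase first

    # Tuples compare element by element, so a lower level only matters
    # when all higher levels are equal.
    return (primary, secondary, tertiary)[:levels]

letters = ["A", "a", "O", "o", "À", "Ä", "à", "ä", "Ò", "Ö", "ò", "ö"]
three_level_order = sorted(letters, key=sketch_sort_key)
print(" ".join(three_level_order))
```

      With levels=2 the same helper treats 'a' and 'A' as equal but still distinguishes 'a' from 'à', which corresponds to the accent sensitive and case insensitive style; with levels=1 all six variants of a letter compare as equal.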
      

      Open questions:

      • Perhaps, instead of using Unicode-5.2.0, we should add the current Unicode version. However, this would increase the server binary size by an extra 1 MB.

            People

              bar Alexander Barkov