[MDEV-23400] Add UCA case sensitive accent sensitive collations for Unicode character sets Created: 2020-08-04  Updated: 2022-08-18  Resolved: 2022-08-18

Status: Closed
Project: MariaDB Server
Component/s: Character Sets
Fix Version/s: N/A

Type: Task Priority: Major
Reporter: Alexander Barkov Assignee: Alexander Barkov
Resolution: Duplicate Votes: 1
Labels: None

Issue Links:
Blocks
blocks MDEV-22870 Would like sort order for sql_mode=OR... Open
Relates
relates to MDEV-4742 A function for native natural sorting... Closed
relates to MDEV-27009 Add UCA-14.0.0 collations Closed
relates to MDEV-27009 Add UCA-14.0.0 collations Closed

 Description   

As of the version 10.5.5, MariaDB support the following collations for Unicode character sets (using utf8 as an example):

  1. A simple accent insensitive and case insensitive collation utf8_general_ci, with one-to-one mapping between characters and their weights. It's very fast. It's not perfect in terms of linguistic sorting order, but works for some languages.
  2. A UCA accent insensitive and case insensitive collation utf8_unicode_ci (and its Unicode-5.2.0 version utf8_unicode_520_ci), with more complex mapping (one-to-zero,one-to-one,one-to-many,many-to-one,many-to-many). They provide a better sorting order than N1.
  3. A set of language specific accent insensitive and case insensitive collations, which use utf8_unicode_ci as a base and reorder a number of characters (e.g. utf8_german2_ci, utf8_spanish_ci, etc)
  4. A UCA accent sensitive case insensitive collation (with two weight levels) utf8_thai_520_w2.
  5. A binary collation, which is accent sensitive and case sensitive. It's extremely fast, but it orders according to the character code. So accented letters are sorted not near their non-accented counterparts.

This script demonstrates the order provided by utf8_bin:

CREATE OR REPLACE TABLE t1 (a VARCHAR(32) CHARACTER SET utf8 COLLATE utf8_bin);
INSERT INTO t1 VALUES ('A'),('a');
INSERT INTO t1 VALUES ('O'),('o');
INSERT INTO t1 VALUES ('À'), ('Ä'), ('à'), ('ä');
INSERT INTO t1 VALUES ('Ò'), ('Ö'), ('ò'), ('ö');
SELECT * FROM t1 ORDER BY a;

+------+
| a    |
+------+
| A    |
| O    |
| a    |
| o    |
| À    |
| Ä    |
| Ò    |
| Ö    |
| à    |
| ä    |
| ò    |
| ö    |
+------+

So MariaDB has collations with a good linguistic order for these comparison styles:

  • Accent insensitive and case sensitive (N2, N3)
  • Accent sensitive and case sensitive (N4)

But it does not have collations with a good linguistic order for the case sensitive and accent sensitive comparison style.

Let's implement good linguistic case sensitive and accent sensitive collations for Unicode character sets.

Tentative names: xxx_unicode_520_w3 (where xxx is utf8, utf8mb4, ucs2, utf16, utf32).

The new collations will use 3 levels of Unicode weights. It wil provide much better sorting order than utf8_bin.

Small letters will appear before capital letters.

Using the same data, the new collation will return records in the following order:

+------+
| a    |
+------+
| a    |
| A    |
| à    |
| À    |
| ä    |
| Ä    |
| o    |
| O    |
| ò    |
| Ò    |
| ö    |
| Ö    |
+------+

Open questions:

  • Perhaps, instead of using Unicode-5.2.0, we should add the current Unicode version. But this will grow the server binary size with extra 1Mb.


 Comments   
Comment by Alexander Barkov [ 2022-08-18 ]

Case and accent sensitive collations were added into MariaDB-10.10 under terms of MDEV-27009

Generated at Thu Feb 08 09:22:07 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.