As of the version 10.5.5, MariaDB support the following collations for Unicode character sets (using utf8 as an example):
- A simple accent insensitive and case insensitive collation utf8_general_ci, with one-to-one mapping between characters and their weights. It's very fast. It's not perfect in terms of linguistic sorting order, but works for some languages.
- A UCA accent insensitive and case insensitive collation utf8_unicode_ci (and its Unicode-5.2.0 version utf8_unicode_520_ci), with more complex mapping (one-to-zero,one-to-one,one-to-many,many-to-one,many-to-many). They provide a better sorting order than N1.
- A set of language specific accent insensitive and case insensitive collations, which use utf8_unicode_ci as a base and reorder a number of characters (e.g. utf8_german2_ci, utf8_spanish_ci, etc)
- A UCA accent sensitive case insensitive collation (with two weight levels) utf8_thai_520_w2.
- A binary collation, which is accent sensitive and case sensitive. It's extremely fast, but it orders according to the character code. So accented letters are sorted not near their non-accented counterparts.
This script demonstrates the order provided by utf8_bin:
So MariaDB has collations with a good linguistic order for these comparison styles:
- Accent insensitive and case sensitive (N2, N3)
- Accent sensitive and case sensitive (N4)
But it does not have collations with a good linguistic order for the case sensitive and accent sensitive comparison style.
Let's implement good linguistic case sensitive and accent sensitive collations for Unicode character sets.
Tentative names: xxx_unicode_520_w3 (where xxx is utf8, utf8mb4, ucs2, utf16, utf32).
The new collations will use 3 levels of Unicode weights. It wil provide much better sorting order than utf8_bin.
Small letters will appear before capital letters.
Using the same data, the new collation will return records in the following order:
- Perhaps, instead of using Unicode-5.2.0, we should add the current Unicode version. But this will grow the server binary size with extra 1Mb.