Details
-
Task
-
Status: Closed (View Workflow)
-
Major
-
Resolution: Duplicate
-
None
Description
As of the version 10.5.5, MariaDB support the following collations for Unicode character sets (using utf8 as an example):
- A simple accent insensitive and case insensitive collation utf8_general_ci, with one-to-one mapping between characters and their weights. It's very fast. It's not perfect in terms of linguistic sorting order, but works for some languages.
- A UCA accent insensitive and case insensitive collation utf8_unicode_ci (and its Unicode-5.2.0 version utf8_unicode_520_ci), with more complex mapping (one-to-zero,one-to-one,one-to-many,many-to-one,many-to-many). They provide a better sorting order than N1.
- A set of language specific accent insensitive and case insensitive collations, which use utf8_unicode_ci as a base and reorder a number of characters (e.g. utf8_german2_ci, utf8_spanish_ci, etc)
- A UCA accent sensitive case insensitive collation (with two weight levels) utf8_thai_520_w2.
- A binary collation, which is accent sensitive and case sensitive. It's extremely fast, but it orders according to the character code. So accented letters are sorted not near their non-accented counterparts.
This script demonstrates the order provided by utf8_bin:
CREATE OR REPLACE TABLE t1 (a VARCHAR(32) CHARACTER SET utf8 COLLATE utf8_bin); |
INSERT INTO t1 VALUES ('A'),('a'); |
INSERT INTO t1 VALUES ('O'),('o'); |
INSERT INTO t1 VALUES ('À'), ('Ä'), ('à'), ('ä'); |
INSERT INTO t1 VALUES ('Ò'), ('Ö'), ('ò'), ('ö'); |
SELECT * FROM t1 ORDER BY a; |
+------+
|
| a |
|
+------+
|
| A |
|
| O |
|
| a |
|
| o |
|
| À |
|
| Ä |
|
| Ò |
|
| Ö |
|
| à |
|
| ä |
|
| ò |
|
| ö |
|
+------+
|
So MariaDB has collations with a good linguistic order for these comparison styles:
- Accent insensitive and case sensitive (N2, N3)
- Accent sensitive and case sensitive (N4)
But it does not have collations with a good linguistic order for the case sensitive and accent sensitive comparison style.
Let's implement good linguistic case sensitive and accent sensitive collations for Unicode character sets.
Tentative names: xxx_unicode_520_w3 (where xxx is utf8, utf8mb4, ucs2, utf16, utf32).
The new collations will use 3 levels of Unicode weights. It wil provide much better sorting order than utf8_bin.
Small letters will appear before capital letters.
Using the same data, the new collation will return records in the following order:
+------+
|
| a |
|
+------+
|
| a |
|
| A |
|
| à |
|
| À |
|
| ä |
|
| Ä |
|
| o |
|
| O |
|
| ò |
|
| Ò |
|
| ö |
|
| Ö |
|
+------+
|
Open questions:
- Perhaps, instead of using Unicode-5.2.0, we should add the current Unicode version. But this will grow the server binary size with extra 1Mb.
Attachments
Issue Links
- blocks
-
MDEV-22870 Would like sort order for sql_mode=ORACLE to match Oracle sort order
- Open
- relates to
-
MDEV-4742 A function for native natural sorting / natural compare
- Closed
-
MDEV-27009 Add UCA-14.0.0 collations
- Closed
-
MDEV-27009 Add UCA-14.0.0 collations
- Closed