[MDEV-23400] Add UCA case sensitive accent sensitive collations for Unicode character sets - Jira

Details

Type: Task
Status: Closed (View Workflow)
Priority: Major
Resolution: Duplicate
Fix Version/s: N/A
Component/s: Character Sets
Labels:
None

Description

As of the version 10.5.5, MariaDB support the following collations for Unicode character sets (using utf8 as an example):

A simple accent insensitive and case insensitive collation utf8_general_ci, with one-to-one mapping between characters and their weights. It's very fast. It's not perfect in terms of linguistic sorting order, but works for some languages.
A UCA accent insensitive and case insensitive collation utf8_unicode_ci (and its Unicode-5.2.0 version utf8_unicode_520_ci), with more complex mapping (one-to-zero,one-to-one,one-to-many,many-to-one,many-to-many). They provide a better sorting order than N1.
A set of language specific accent insensitive and case insensitive collations, which use utf8_unicode_ci as a base and reorder a number of characters (e.g. utf8_german2_ci, utf8_spanish_ci, etc)
A UCA accent sensitive case insensitive collation (with two weight levels) utf8_thai_520_w2.
A binary collation, which is accent sensitive and case sensitive. It's extremely fast, but it orders according to the character code. So accented letters are sorted not near their non-accented counterparts.

This script demonstrates the order provided by utf8_bin:

CREATE OR REPLACE TABLE t1 (a VARCHAR(32) CHARACTER SET utf8 COLLATE utf8_bin);

INSERT INTO t1 VALUES ('A'),('a');

INSERT INTO t1 VALUES ('O'),('o');

INSERT INTO t1 VALUES ('À'), ('Ä'), ('à'), ('ä');

INSERT INTO t1 VALUES ('Ò'), ('Ö'), ('ò'), ('ö');

SELECT * FROM t1 ORDER BY a;

+------+

| a    |

+------+

| A    |

| O    |

| a    |

| o    |

| À    |

| Ä    |

| Ò    |

| Ö    |

| à    |

| ä    |

| ò    |

| ö    |

+------+

So MariaDB has collations with a good linguistic order for these comparison styles:

Accent insensitive and case sensitive (N2, N3)
Accent sensitive and case sensitive (N4)

But it does not have collations with a good linguistic order for the case sensitive and accent sensitive comparison style.

Let's implement good linguistic case sensitive and accent sensitive collations for Unicode character sets.

Tentative names: xxx_unicode_520_w3 (where xxx is utf8, utf8mb4, ucs2, utf16, utf32).

The new collations will use 3 levels of Unicode weights. It wil provide much better sorting order than utf8_bin.

Small letters will appear before capital letters.

Using the same data, the new collation will return records in the following order:

+------+

| a    |

+------+

| a    |

| A    |

| à    |

| À    |

| ä    |

| Ä    |

| o    |

| O    |

| ò    |

| Ò    |

| ö    |

| Ö    |

+------+

Open questions:

Perhaps, instead of using Unicode-5.2.0, we should add the current Unicode version. But this will grow the server binary size with extra 1Mb.

Attachments

Issue Links

blocks

MDEV-22870 Would like sort order for sql_mode=ORACLE to match Oracle sort order

Open

relates to

MDEV-4742 A function for native natural sorting / natural compare

Closed

MDEV-27009 Add UCA-14.0.0 collations

Closed

MDEV-27009 Add UCA-14.0.0 collations

Closed

Activity

People

Assignee:: Alexander Barkov

Reporter:: Alexander Barkov

Votes:: 1 Vote for this issue

Watchers:: 7 Start watching this issue

Dates

Created:: 2020-08-04 13:34

Updated:: 2024-07-07 21:57

Resolved:: 2022-08-18 04:46

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server