[MDEV-27210] New naming convention for UCA collations - Jira

Details

Type: New Feature
Status: Open (View Workflow)
Priority: Major
Resolution: Unresolved
Fix Version/s: None
Component/s: Character Sets
Labels: None

Description

As of version 10.7, MariaDB understands the following flags in collation names:

_ci for case insensitive collations
_cs for case sensitive collations
_nopad_ for NO PAD collations

We eventually want to support all customizations (collation preferences) as described in:
https://unicode.org/reports/tr10/#Customization

This new naming convention will encode more flags inside collation names.

This new naming convention will be applied to newly added UCA-based collations. Old collation names will stay untouched.

Collation name structure

The whole collation name structure will consist of the following parts delimited by underscores:

Character set name
Unicode Collation Algorithm version: the letters "uca" followed by a two-digit major version, a one-digit minor version, and a one-digit patch version (e.g. uca1400 for Unicode-14.0.0).
Optional tailoring name (usually a language name). This part will be omitted if the collation is based on a UCA collation without any language specific rules.
Flags, as described below
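
As an illustration of the structure above, here is a minimal Python sketch that decodes the version token and splits a name into its parts. The helper names (parse_uca_version, split_collation_name) are hypothetical and not part of MariaDB. The sketch does not try to tell the optional tailoring name apart from the flags; it simply returns everything after the version token.

```python
import re

# Hypothetical helpers for illustration only; not MariaDB code.
# Token layout per the description: "uca" + 2-digit major + 1-digit minor + 1-digit patch.
UCA_TOKEN = re.compile(r"uca(\d{2})(\d)(\d)")


def parse_uca_version(token):
    """Decode a token such as 'uca1400' into (14, 0, 0); return None if it does not match."""
    match = UCA_TOKEN.fullmatch(token)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())


def split_collation_name(name):
    """Split a collation name into character set, UCA version and the remaining parts.

    The remaining parts hold the optional tailoring name followed by the flags.
    """
    parts = name.split("_")
    for index, part in enumerate(parts):
        version = parse_uca_version(part)
        # The version token must be preceded by a character set name.
        if version is not None and index > 0:
            return {
                "charset": "_".join(parts[:index]),
                "uca_version": version,
                "rest": parts[index + 1:],
            }
    raise ValueError(f"no UCA version token found in {name!r}")
```

For example, split_collation_name("utf8mb4_uca1400_czech_as_ci") yields the character set "utf8mb4", the version (14, 0, 0), and the remaining parts ["czech", "as", "ci"].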

PAD flags

_pad - PAD SPACE (default)
_nopad - NO PAD

Variable Weighting (punctuation) flags

_vn - "Variable non-ignorable" - handles variable characters on Level 1 (default)
_vs - "Variable shifted" - shifts punctuation from Level 1 to Level 4 and enables Level 4.
_vb - "Variable blanked" - variable collation elements are reset so that all weights (except for the identical level) are zero.

Accent sensitivity flags

_ai - Accent insensitive - disables Level 2.
_as - Accent sensitive - enables Level 2.

Case sensitivity flags

_ci - Case insensitive - disables Level 3.
_cs - Case sensitive - enables Level 3. Case difference is handled according to tertiary weight, together with fullwidth, circled, square forms. See https://unicode.org/reports/tr10/#Tertiary_Weight_Table for details.
_co - Case only - enables a dedicated Level 2.5 only consisting of the case characteristics (upper vs lower), without other tertiary weight forms.

Identity sensitivity flags

_ii - identity insensitive - disables Level 5 (default)
_is - identity sensitive - enables Level 5 (full binary equality)
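
The relationship between flags and comparison levels described in the sections above can be summarized with a small Python sketch. The function name enabled_levels is hypothetical. The sketch assumes that Level 1 is always compared and that a level whose enabling flag is absent stays disabled; neither assumption is stated explicitly in this task. The _vb flag is not modelled, because it changes weights rather than enabling a level.

```python
def enabled_levels(flags):
    """Return the comparison levels enabled by a list of flag tokens (without underscores).

    Hypothetical helper for illustration; assumes Level 1 is always compared
    and that a level is off unless its enabling flag is present.
    """
    present = set(flags)
    levels = [1.0]           # primary level, assumed always on
    if "as" in present:
        levels.append(2.0)   # accent sensitive: Level 2
    if "co" in present:
        levels.append(2.5)   # case only: dedicated Level 2.5
    if "cs" in present:
        levels.append(3.0)   # case sensitive: Level 3 (tertiary weights)
    if "vs" in present:
        levels.append(4.0)   # variable shifted: punctuation moved to Level 4
    if "is" in present:
        levels.append(5.0)   # identity sensitive: Level 5
    return levels
```

With this sketch, the flags as and ci enable Levels 1 and 2, while nopad, vs, ai, cs and is enable Levels 1, 3, 4 and 5.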

Canonical collation names

The collation name parser will understand flags only in the order described above, e.g.

_as_ci - correct
_ci_as - incorrect

The canonical names (i.e. as displayed in SHOW CREATE statements or I_S queries) will also print flags in the order described above.

The accent and case sensitivity flags will always be printed in canonical names, even with default values.

Other flags will be printed only if they have a non-default value.
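
The ordering and printing rules can be sketched in Python as follows. This is only an illustration under stated assumptions, not the server implementation: the names FLAG_GROUPS, DEFAULTS, flags_in_canonical_order and canonical_flags are hypothetical, and the sketch assumes that at most one flag per group may appear in a name.

```python
# Hypothetical tables for illustration; the group order follows the description.
FLAG_GROUPS = [
    ("pad", ["pad", "nopad"]),
    ("variable", ["vn", "vs", "vb"]),
    ("accent", ["ai", "as"]),
    ("case", ["ci", "cs", "co"]),
    ("identity", ["ii", "is"]),
]
# Groups with a documented default; accent and case have none and are always printed.
DEFAULTS = {"pad": "pad", "variable": "vn", "identity": "ii"}
ALWAYS_PRINTED = {"accent", "case"}


def flags_in_canonical_order(flags):
    """Return True if the flags follow the group order, with at most one flag per group."""
    position = 0
    for flag in flags:
        # Skip forward to the first remaining group that knows this flag.
        while position < len(FLAG_GROUPS) and flag not in FLAG_GROUPS[position][1]:
            position += 1
        if position == len(FLAG_GROUPS):
            return False  # unknown flag, repeated group, or out of order
        position += 1
    return True


def canonical_flags(settings):
    """Build the canonical flag suffix from a mapping of group name to flag."""
    printed = []
    for group, choices in FLAG_GROUPS:
        value = settings.get(group, DEFAULTS.get(group))
        if value not in choices:
            raise ValueError(f"missing or invalid value for {group!r}: {value!r}")
        # Accent and case are always printed; other flags only when non-default.
        if group in ALWAYS_PRINTED or value != DEFAULTS[group]:
            printed.append(value)
    return "_".join(printed)
```

Under these assumptions, flags_in_canonical_order(["as", "ci"]) is true and flags_in_canonical_order(["ci", "as"]) is false, and an explicitly requested default such as pad is dropped from the canonical suffix.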

Examples:

utf8mb4_uca1400_as_ci - a generic Unicode-14.0.0 collation, accent sensitive, case insensitive.
utf8mb4_uca1400_czech_nopad_vs_ai_cs_is - a Czech Unicode-14.0.0 collation, NO PAD, with punctuation shifted from Level 1 to Level 4, accent insensitive, case sensitive, identity sensitive.

Disclaimer

We won't implement all flags mentioned here in a single patch. They will be added in steps, as separate tasks.

Variable weighting and Identity sensitivity flags will most likely be implemented later than other flags.

Issue Links

relates to

MDEV-27009 Add UCA-14.0.0 collations (Closed)

links to

DB2 collation naming

ICU Collator naming scheme

Oracle UCA collation parameters

SQL Server collation naming

SQL Server Collation Naming (uppercase preference)

People

Assignee: Alexander Barkov

Reporter: Alexander Barkov

Votes: 1

Watchers: 7

Dates

Created: 2021-12-09 12:18

Updated: 2025-03-17 01:39

MariaDB Server