Details

Type: New Feature
Status: Stalled
Priority: Minor
Resolution: Unresolved
Fix Version/s: None
Component/s: Character Sets
Labels: None

PM Planning:
- PM_REJECTED

Description

Identifier names can't contain characters outside the BMP, i.e. they are restricted to utf8mb3.
Below is the relevant part of a Slack discussion on why this is so, and on a possible fix.

... discussion on character_set_system  and why it is utf8mb3...

....

bar Oct 13th, 2021 at 4:23 PM

@wlad yes, it's hard-coded. I think the biggest problem is to implement table-name-to-file-name encoding for non-BMP characters. Should be doable but needs some time.

5 replies

wlad  3 months ago

so, a surrogate pair won't do? like, @d801@dc37

bar  3 months ago

for characters that do not have lower/upper variants, it will do.

bar  3 months ago

It will actually do for characters that have lower/upper variants as well.

bar  3 months ago

Thanks for the good idea.
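
The surrogate-pair idea from the thread can be sketched in Python as follows. This is only an illustration, not the encoding MariaDB implements: the function names are hypothetical, and the `@xxxx` lowercase-hex form is simply copied from the `@d801@dc37` example in the discussion.

```python
def surrogate_pair(cp):
    """Split a non-BMP code point into its UTF-16 high and low surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a non-BMP code point: U+%04X" % cp)
    v = cp - 0x10000                  # 20-bit offset into the supplementary range
    high = 0xD800 + (v >> 10)         # top 10 bits
    low = 0xDC00 + (v & 0x3FF)        # bottom 10 bits
    return high, low


def encode_non_bmp(cp):
    """Hypothetical file name fragment: one @xxxx group per surrogate."""
    high, low = surrogate_pair(cp)
    return "@%04x@%04x" % (high, low)


print(encode_non_bmp(0x10437))  # @d801@dc37
```

U+10437 is the code point whose surrogate pair is D801 DC37, which matches the example given in the thread.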

Table name to file name extensions overview

We need to extend the encoding to support:

- new case folding rules in the BMP range that appeared between Unicode 3.0.0 (used in the first version of the encoding) and Unicode 14.0.0 (the current version in MariaDB)
- non-BMP characters in the range U+010000 to U+10FFFF without case folding
- non-BMP characters in the range U+010000 to U+10FFFF with case folding

Various proposals go in separate comments below.

Unicode planes allowed in identifiers

As of version 14.0.0 (and 16.0.0), the Unicode plane assignment looks as follows:

Plane   Code Range    Abbr      Name
------  ------------  --------  --------------------------------------
0       0000-FFFF     BMP       Basic Multilingual Plane
1       10000-1FFFF   SMP       Supplementary Multilingual Plane
2       20000-2FFFF   SIP       Supplementary Ideographic Plane
3       30000-3FFFF   TIP       Tertiary Ideographic Plane
4-13    40000-DFFFF   ---       unassigned
14      E0000-EFFFF   SSP       Supplementary Special-purpose Plane
15-16   F0000-10FFFF  SPUA-A/B  Supplementary Private Use Area planes

It is an open question whether we should support unassigned planes in identifiers (and in the table file name encoding), or limit support to assigned planes only.
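
If support were limited to assigned planes, the check could look roughly like the Python sketch below. The function names are hypothetical, and the set of assigned planes is taken directly from the table above rather than from any MariaDB source.

```python
# Planes listed as assigned in the table above: 0-3, 14, 15-16.
ASSIGNED_PLANES = {0, 1, 2, 3, 14, 15, 16}


def plane_of(cp):
    """Return the Unicode plane number (0..16) of a code point."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point: %#x" % cp)
    return cp >> 16  # each plane holds 0x10000 code points


def in_assigned_plane(cp):
    """True if the code point lies in a plane the table marks as assigned."""
    return plane_of(cp) in ASSIGNED_PLANES


print(plane_of(0x10437), in_assigned_plane(0x10437))  # 1 True
print(plane_of(0x40000), in_assigned_plane(0x40000))  # 4 False
```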

Characters with unsafe casefolding

Since version 3.0.0, Unicode has added case folding rules for a few characters that are not round-trip safe: UPPER(ch)<>UPPER(LOWER(ch))

These characters can be extracted using the following script:

CREATE OR REPLACE VIEW v1 AS
SELECT
  seq,
  char(seq using utf32) collate utf32_uca1400_ai_ci AS ch
FROM seq_1_to_1114111;

SELECT
  ch,
  hex(ch) AS cu,
  upper(ch) AS u,
  hex(upper(ch)) AS uc,
  upper(lower(ch)) AS u2,
  hex(upper(lower(ch))) AS u2c
FROM v1
WHERE upper(ch) collate utf32_bin<>upper(lower(ch)) collate utf32_bin;

+------+----------+------+----------+------+----------+
| ch   | cu       | u    | uc       | u2   | u2c      |
+------+----------+------+----------+------+----------+
| İ    | 00000130 | İ    | 00000130 | I    | 00000049 | LATIN CAPITAL LETTER I WITH DOT ABOVE
| ϴ    | 000003F4 | ϴ    | 000003F4 | Θ    | 00000398 | GREEK CAPITAL THETA SYMBOL
| ẞ    | 00001E9E | ẞ    | 00001E9E | ß    | 000000DF | LATIN CAPITAL LETTER SHARP S
| Ω    | 00002126 | Ω    | 00002126 | Ω    | 000003A9 | OHM SIGN
| K    | 0000212A | K    | 0000212A | K    | 0000004B | KELVIN SIGN
| Å    | 0000212B | Å    | 0000212B | Å    | 000000C5 | ANGSTROM SIGN
+------+----------+------+----------+------+----------+

Let's consider this pair as an example:

UPPER(U+2126 OHM SIGN) = U+2126 OHM SIGN
UPPER(LOWER(U+2126 OHM SIGN)) = U+03A9 GREEK CAPITAL LETTER OMEGA
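
The same round trip can be reproduced outside the server with a few lines of Python. This is a rough sketch only: `is_round_trip_safe` is a hypothetical helper, and Python's `str.upper()` / `str.lower()` use the Unicode data bundled with the interpreter and full case mappings, so results for other characters may differ from MariaDB's UPPER() and LOWER().

```python
def is_round_trip_safe(ch):
    """True if UPPER(ch) equals UPPER(LOWER(ch)) under Python's case mappings."""
    return ch.upper() == ch.lower().upper()


OHM_SIGN = "\u2126"

# OHM SIGN has no uppercase mapping of its own, so upper() leaves it unchanged.
print("U+%04X" % ord(OHM_SIGN.upper()))          # U+2126
# Its lowercase is GREEK SMALL LETTER OMEGA, whose uppercase is U+03A9.
print("U+%04X" % ord(OHM_SIGN.lower().upper()))  # U+03A9
print(is_round_trip_safe(OHM_SIGN))              # False
```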

There are two options for encoding these characters:

- As not having case folding. This preserves the exact character OHM SIGN, but OHM SIGN and GREEK SMALL LETTER OMEGA will be two distinct characters even on a case-insensitive file system.
- As having case folding. In this case OHM SIGN will be replaced with GREEK CAPITAL LETTER OMEGA, and it will be equal to GREEK SMALL LETTER OMEGA on a case-insensitive file system.
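
The practical difference between the two options can be modelled with a small Python sketch. Everything here is an assumption made for illustration: the `key_option_*` functions are hypothetical and stand for "what a case-insensitive file system would treat as the same name", and the set of unsafe characters is copied from the query result above. Python's full case mappings differ from that table for some of these characters (U+1E9E, for example), so only the OHM SIGN case is demonstrated.

```python
# Code points reported as not round-trip safe in the query result above.
UNSAFE = {"\u0130", "\u03F4", "\u1E9E", "\u2126", "\u212A", "\u212B"}


def key_option_1(ch):
    """Option 1: unsafe characters are treated as having no case folding."""
    if ch in UNSAFE:
        return ch                  # kept exactly, never folded
    return ch.lower()


def key_option_2(ch):
    """Option 2: unsafe characters are first replaced by UPPER(LOWER(ch))."""
    if ch in UNSAFE:
        ch = ch.lower().upper()    # OHM SIGN becomes GREEK CAPITAL LETTER OMEGA
    return ch.lower()


OHM_SIGN = "\u2126"
SMALL_OMEGA = "\u03c9"

print(key_option_1(OHM_SIGN) == key_option_1(SMALL_OMEGA))  # False
print(key_option_2(OHM_SIGN) == key_option_2(SMALL_OMEGA))  # True
```

Under option 1 the exact character survives but stays distinct from small omega; under option 2 the two compare equal, at the cost of no longer storing OHM SIGN itself.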

Attachments

screenshot.png
1.70 MB
2022-01-13 13:20

Issue Links

is blocked by

- MDEV-30556 UPPER() returns an empty string for U+0251 in Unicode-5.2.0+ collations for utf8 (Closed)
- MDEV-30577 Case folding for uca1400 collations is not up to date (Closed)
- MDEV-30661 UPPER() returns an empty string for U+0251 in uca1400 collations for utf8 (Closed)
- MDEV-31340 Remove MY_COLLATION_HANDLER::strcasecmp() (Closed)
- MDEV-31531 Remove my_casedn_str() and my_caseup_str() (Closed)
- MDEV-31606 Refactor check_db_name() to get a const argument (Closed)
- MDEV-31972 Change parameter of make_sp_name*() from LEX_CSTRING to Lex_ident_sys_st (Closed)
- MDEV-31978 Turn ok_for_lower_case_names() to a method in Lex_ident_fs (Closed)
- MDEV-32002 Remove my_casedn_str() in append_identifier() context (Closed)
- MDEV-32019 Replace my_casedn_str(local_buffer) to CharBuffer::copy_casedn() (Closed)
- MDEV-32081 Remove my_casedn_str() from get_canonical_filename() (Closed)
- MDEV-35255 Change the collation in INFORMATION_SCHEMA to utf8mb4_general1400_as_ci (Stalled)

relates to

- MDEV-19123 Change default charset from latin1 to utf8mb4 (Closed)
- MDEV-25829 Change default Unicode collation to uca1400_ai_ci (Closed)
- MDEV-27050 Utf8mb4 identifier is not working (Closed)
- MDEV-32904 smiley emoji (F09F9883) valid in utf8 but not utf8mb4 (Closed)

People

Assignee: Alexander Barkov

Reporter: Vladislav Vaintroub

Votes: 1

Watchers: 7

Dates

Created: 2022-01-13 13:21

Updated: 2026-06-18 04:42

Time Tracking

Estimated: 20d

Remaining: 19.75d

Logged:
