Details

Type: New Feature
Status: Stalled
Priority: Minor
Resolution: Unresolved
Fix Version/s: None
Component/s: Character Sets
Labels: None

Description

Identifier names can't contain characters outside of the BMP, i.e. they are restricted to utf8mb3.
Here is the relevant part of the Slack discussion on why that is so, and on a possible fix:

... discussion on character_set_system  and why it is utf8mb3...

....

bar Oct 13th, 2021 at 4:23 PM

@wlad yes, it's hard-coded. I think the biggest problem is to implement table-name-to-file-name encoding for non-BMP characters. Should be doable but needs some time.

5 replies

wlad  3 months ago

so, a surrogate pair won't do? like, @d801@dc37

bar  3 months ago

for characters that do not have lower/upper variants, it will do.

bar  3 months ago

It will actually do for characters that have lower/upper variants as well.

bar  3 months ago

Thanks for the good idea.
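
The surrogate-pair idea from the thread can be sketched as follows. This is only an illustration, assuming each UTF-16 surrogate is written in the @xxxx four-hex-digit form; the function names are hypothetical and not taken from the server code.

```python
def surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a non-BMP code point into its UTF-16 high and low surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a non-BMP code point")
    v = cp - 0x10000               # 20-bit value
    high = 0xD800 + (v >> 10)      # top 10 bits
    low = 0xDC00 + (v & 0x3FF)     # bottom 10 bits
    return high, low


def encode_as_surrogates(cp: int) -> str:
    """Hypothetical file-name form: each surrogate written as @xxxx."""
    high, low = surrogate_pair(cp)
    return f"@{high:04x}@{low:04x}"


print(encode_as_surrogates(0x10437))
```

With this sketch, the sequence @d801@dc37 quoted in the thread corresponds to the code point U+10437.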

Table name to file name extensions overview

We need to extend the encoding to support:

new case folding mappings in the BMP range that appeared between Unicode-3.0.0 (used in the first version of the encoding) and Unicode-14.0.0 (the current version in MariaDB).
non-BMP characters in the range U+010000 to U+10FFFF without case folding
non-BMP characters in the range U+010000 to U+10FFFF with case folding

Various proposals go in separate comments below.

Unicode planes allowed in identifiers

As of version 14.0.0 (and 16.0.0), the Unicode plane assignment is as follows:

PlaneN  Code Range    Abbr      Name

------  ------------  ----      --------------------------------------

0       0000-FFFF     BMP       Basic Multilingual Plane

1       10000-1FFFF   SMP       Supplementary Multilingual Plane

2       20000-2FFFF   SIP       Supplementary Ideographic Plane

3       30000-3FFFF   TIP       Tertiary Ideographic Plane

4-13    40000-DFFFF   ---       unassigned

14      E0000-EFFFF   SSP       Supplementary Special-purpose Plane

15-16   F0000-10FFFF  SPUA-A/B  Supplementary Private Use Area planes

It is an open question whether we should support unassigned planes in identifiers (and in table file name encoding), or should limit to assigned planes only.
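
As a small sketch of what "limit to assigned planes only" could mean in practice: the plane number is the code point shifted right by 16 bits, and it can be checked against the table above. The set and function names are hypothetical.

```python
# Planes listed as assigned in the table above (0-3, 14, 15-16).
ASSIGNED_PLANES = {0, 1, 2, 3, 14, 15, 16}


def plane_of(cp: int) -> int:
    """Return the Unicode plane number (0..16) of a code point."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    return cp >> 16


def in_assigned_plane(cp: int) -> bool:
    """True if the code point lies in a plane the table marks as assigned."""
    return plane_of(cp) in ASSIGNED_PLANES
```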

Characters with unsafe casefolding

Since version 3.0.0, Unicode has added casefolding rules for a few characters that are not round-trip safe: UPPER(ch)<>UPPER(LOWER(ch))

These characters can be extracted using the following script:

CREATE OR REPLACE VIEW v1 AS

SELECT

  seq,

  char(seq using utf32) collate utf32_uca1400_ai_ci AS ch

FROM seq_1_to_1114111;

SELECT

  ch,

  hex(ch) AS cu,

  upper(ch) AS u,

  hex(upper(ch)) AS uc,

  upper(lower(ch)) AS u2,

  hex(upper(lower(ch))) AS u2c

FROM v1

WHERE upper(ch) collate utf32_bin<>upper(lower(ch)) collate utf32_bin;

+------+----------+------+----------+------+----------+

| ch   | cu       | u    | uc       | u2   | u2c      |

+------+----------+------+----------+------+----------+

| İ    | 00000130 | İ    | 00000130 | I    | 00000049 | LATIN CAPITAL LETTER I WITH DOT ABOVE

| ϴ    | 000003F4 | ϴ    | 000003F4 | Θ    | 00000398 | GREEK CAPITAL THETA SYMBOL

| ẞ    | 00001E9E | ẞ    | 00001E9E | ß    | 000000DF | LATIN CAPITAL LETTER SHARP S

| Ω    | 00002126 | Ω    | 00002126 | Ω    | 000003A9 | OHM SIGN

| K    | 0000212A | K    | 0000212A | K    | 0000004B | KELVIN SIGN

| Å    | 0000212B | Å    | 0000212B | Å    | 000000C5 | ANGSTROM SIGN

+------+----------+------+----------+------+----------+

Let's consider this pair as an example:

UPPER(U+2126 OHM SIGN) = U+2126 OHM SIGN
UPPER(LOWER(U+2126 OHM SIGN)) = U+03A9 GREEK CAPITAL LETTER OMEGA

There are two options for encoding these characters:

As not having case folding. This preserves the exact character OHM SIGN. But OHM SIGN and GREEK SMALL LETTER OMEGA will be two distinct characters even on a case insensitive file system.
As having case folding. In this case OHM SIGN will be replaced by GREEK CAPITAL LETTER OMEGA. It will be equal to GREEK SMALL LETTER OMEGA on a case insensitive file system.
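
The round-trip condition UPPER(ch)<>UPPER(LOWER(ch)) can also be checked outside the server. The sketch below uses Python's built-in case conversion, which applies full case mappings from whatever Unicode version the interpreter bundles, so a full scan may report a different set than the SQL script; only the OHM SIGN and KELVIN SIGN cases are shown.

```python
def is_round_trip_safe(ch: str) -> bool:
    """True if upper-casing gives the same result before and after lower-casing."""
    return ch.upper() == ch.lower().upper()


OHM_SIGN = "\u2126"
KELVIN_SIGN = "\u212a"

# OHM SIGN has no upper-case mapping of its own, but its lower case
# (GREEK SMALL LETTER OMEGA) upper-cases to GREEK CAPITAL LETTER OMEGA.
print(is_round_trip_safe(OHM_SIGN), hex(ord(OHM_SIGN.lower().upper())))
```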

Attachments


screenshot.png
1.70 MB
2022-01-13 13:20

Issue Links

is blocked by

MDEV-30556 UPPER() returns an empty string for U+0251 in Unicode-5.2.0+ collations for utf8

Closed

MDEV-30577 Case folding for uca1400 collations is not up to date

Closed

MDEV-30661 UPPER() returns an empty string for U+0251 in uca1400 collations for utf8

Closed

MDEV-31340 Remove MY_COLLATION_HANDLER::strcasecmp()

Closed

MDEV-31531 Remove my_casedn_str() and my_caseup_str()

Closed

MDEV-31606 Refactor check_db_name() to get a const argument

Closed

MDEV-31972 Change parameter of make_sp_name*() from LEX_CSTRING to Lex_ident_sys_st

Closed

MDEV-31978 Turn ok_for_lower_case_names() to a method in Lex_ident_fs

Closed

MDEV-32002 Remove my_casedn_str() in append_identifier() context

Closed

MDEV-32019 Replace my_casedn_str(local_buffer) to CharBuffer::copy_casedn()

Closed

MDEV-32081 Remove my_casedn_str() from get_canonical_filename()

Closed

MDEV-35255 Change the collation in INFORMATION_SCHEMA to utf8mb4_general1400_as_ci

In Progress

relates to

MDEV-19123 Change default charset from latin1 to utf8mb4

Closed

MDEV-25829 Change default Unicode collation to uca1400_ai_ci

Closed

MDEV-32904 smiley emoji (F09F9883) valid in utf8 but not utf8mb4

Closed


Activity


Alexander Barkov added a comment - 2024-09-11 08:06 - edited

Table name to file name encoding extension, proposal #1

non-BMP Encoding without case folding

Let's encode non-BMP characters which do not have case folding as follows:

[@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v]

where:

 @        - the encoded character marker

 +        - the marker for non-BMP character without case folding

 [0-9a-v] - the first digit  (32 values)

 [0-9a-v] - the second digit (32 values)

 [0-9a-v] - the third digit  (32 values)

 [0-9a-v] - the fourth digit (32 values)

The total sequence length is 6 characters.

This encoding gives a total of 32*32*32*32 = 1048576 values.
It covers exactly all non-BMP characters U+010000 to U+10FFFF.

Examples

@+0000  - U+010000 = 0x10000 +   0*(32^3) +  0*(32^2) +  0*(32^1) +  0

@+1000  - U+018000 = 0x10000 +   1*(32^3) +  0*(32^2) +  0*(32^1) +  0

@+aaaa  - U+06294A = 0x10000 +  10*(32^3) + 10*(32^2) + 10*(32^1) + 10

@+vvvv  - U+10FFFF = 0x10000 +  31*(32^3) + 31*(32^2) + 31*(32^1) + 31
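
The examples above follow from plain base-32 arithmetic on the offset from 0x10000. A minimal sketch, with hypothetical function names, that reproduces them:

```python
DIGITS32 = "0123456789abcdefghijklmnopqrstuv"  # 32 digit characters, values 0..31


def encode_nonbmp_nofold(cp: int) -> str:
    """Encode a non-BMP code point as '@+' followed by four base-32 digits."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a non-BMP code point")
    v = cp - 0x10000
    digits = []
    for _ in range(4):
        digits.append(DIGITS32[v % 32])  # least significant digit first
        v //= 32
    return "@+" + "".join(reversed(digits))


def decode_nonbmp_nofold(seq: str) -> int:
    """Inverse of encode_nonbmp_nofold()."""
    if len(seq) != 6 or not seq.startswith("@+"):
        raise ValueError("not a '@+' sequence")
    v = 0
    for c in seq[2:]:
        v = v * 32 + DIGITS32.index(c)  # index() raises ValueError on a bad digit
    return 0x10000 + v
```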

BMP characters with new case folding mappings

Unicode version 14.0.0 has more casefolding mappings in addition to those that existed in Unicode-3.0.0 (used in the original version of the file name encoding).

Let's encode such characters with new casefolding as follows:

[@][0-9a-f][0-9a-f][g-v][0-9a-f]

where

 @        - the encoded character marker

 [0-9a-f] - the first digit  (16 values)

 [0-9a-f] - the second digit (16 values)

 [g-v]    - the third digit  (16 values) - determines upper or lower case

 [0-9a-f] - the fourth digit (16 values)

The total encoded sequence length is 5 characters.

The encoded sequence represents the Unicode code point of the lower case variant of a character.

The third digit [g-v] determines the case:

If it is in the range [g-v], then the character is in the lower case
If it is in the range [G-V], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire BMP range U+0000 to U+FFFF.

If in the future we switch to a new Unicode version with more casefolding mappings in the BMP range, we'll be able to encode all such characters.

The mapping between the third digit and its hex value:

0123456789abcdef - the hex value

GHIJKLMNOPQRSTUV - the third digit, upper case

ghijklmnopqrstuv - the third digit, lower case

So for example, the hex value of 7 corresponds to

the digit 'N' in case of an upper-case character
the digit 'n' in case of a lower-case character.

For example, Unicode-14.0.0 has the following new casefolding mapping which does not exist in the original file name encoding.

U+0180 LATIN SMALL LETTER B WITH STROKE

U+0243 LATIN CAPITAL LETTER B WITH STROKE

These characters will be encoded as:

@01o0 - the code point U+0180

@01O0 - the code point U+0243, or literally, UPPER case of the code point U+0180

Another example: Unicode-14.0.0 has the following new casefolding mapping:

U+0500 CYRILLIC CAPITAL LETTER KOMI DE

U+0501 CYRILLIC SMALL LETTER KOMI DE

These characters will be encoded as:

@05G1 - the code point U+0500, or literally, UPPER case of the code point U+0501

@05g1 - the code point U+0501
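
The two examples can be reproduced with a short sketch. It takes the code point of the lower case variant, writes it as four hex digits, and replaces the third digit with the corresponding letter from g..v (or G..V for the upper case variant). The function name is hypothetical.

```python
CASE_DIGITS = "ghijklmnopqrstuv"  # stands for hex values 0..15 in the third position


def encode_bmp_folded(lower_cp: int, upper: bool) -> str:
    """Encode a BMP character by the code point of its lower case variant.

    The case of the third digit tells whether the upper or the lower
    case variant is meant.
    """
    if not 0 <= lower_cp <= 0xFFFF:
        raise ValueError("not a BMP code point")
    h = f"{lower_cp:04x}"
    third = CASE_DIGITS[int(h[2], 16)]
    if upper:
        third = third.upper()
    return "@" + h[0] + h[1] + third + h[3]


print(encode_bmp_folded(0x0180, upper=False), encode_bmp_folded(0x0180, upper=True))
```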

non-BMP characters with case folding.

As of Unicode version 14.0.0 (and even in 16.0.0), casefolding data is present only in:

Plane-0 (BMP)
Plane-1 (U+10000..U+1FFFF).

Let's encode Plane-1 letters with casefolding as follows:

[@][0-9a-f][0-9a-f][g-v][g-v]

where

 @        - the encoded character marker

 [0-9a-f] - the first digit  (16 values)

 [0-9a-f] - the second digit (16 values)

 [g-v]    - the third digit  (16 values) - determines upper or lower case

 [g-v]    - the fourth digit (16 values)

The total encoded sequence length is 5 characters.

The third digit [g-v] determines the case:

If it is in the range [g-v], then the character is in the lower case
If it is in the range [G-V], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire Plane1 range U+10000 to U+1FFFF.

If in the future we switch to a new Unicode version (from 14.0.0) with more casefolding mappings in the Plane-1 range, we'll be able to encode all such characters.

Example. Unicode-14.0.0 has casefolding between the following characters:

U+10400 DESERET CAPITAL LETTER LONG I

U+10428 DESERET SMALL LETTER LONG I

They will be encoded as:

@04Io - the code point U+10400, or literally UPPER(code point U+10428)

@04io - the code point U+10428
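
A sketch of the Plane-1 pattern as it is defined above, with both the third and the fourth digit taken from g..v. Having a letter in the fourth position is what keeps these sequences apart from the BMP pattern, whose fourth digit is a plain hex digit. The function name is hypothetical.

```python
CASE_DIGITS = "ghijklmnopqrstuv"  # stands for hex values 0..15


def encode_plane1_folded(lower_cp: int, upper: bool) -> str:
    """Encode a Plane-1 character by the low 16 bits of its lower case variant."""
    if not 0x10000 <= lower_cp <= 0x1FFFF:
        raise ValueError("not a Plane-1 code point")
    h = f"{lower_cp - 0x10000:04x}"
    third = CASE_DIGITS[int(h[2], 16)]   # carries the case
    fourth = CASE_DIGITS[int(h[3], 16)]  # always written in lower case here
    if upper:
        third = third.upper()
    return "@" + h[0] + h[1] + third + fourth


# DESERET SMALL LETTER LONG I (U+10428) and its upper case variant (U+10400)
print(encode_plane1_folded(0x10428, upper=False))
print(encode_plane1_folded(0x10428, upper=True))
```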

Summary of the encoding components

After adding the mentioned extensions, the encoding will consist of the components:

Pattern                                 CodePoints               Comment

------------------------------------    -----------------------  ----------------

[@][0..9][g..z]                         10*20         = 200      BMP characters with 3.0.0 case folding

[@][g..z][0..9]                         20*10         = 200      BMP characters with 3.0.0 case folding

[@][g..z][a..z]                         20*26         = 520      BMP characters with 3.0.0 case folding

[@][@][a..z]                            1*26          = 26       BMP characters with 3.0.0 case folding

[@][a..z][@]                            1*26          = 26       BMP characters with 3.0.0 case folding

[@][a..f][g..z]                         6*20          = 120      Unused

[@][0-9a-f][0-9a-f][0-9a-f][0-9a-f]     16*16*16*16   = 65536    BMP characters without case folding

------------------------------------    -----------------------  ----------------

[@][0-9a-f][0-9a-f][g-v][0-9a-f]        16*16*16*16   = 65536    BMP characters with 14.0.0 case folding

[@][0-9a-f][0-9a-f][g-v][g-v]           16*16*16*16   = 65536    non-BMP characters with case folding (Plane 1 only)

[@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v]  1*32*32*32*32 = 1048576  non-BMP characters without case folding
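
To check that the components do not collide, the summary can be turned into a small classifier. This is a sketch only: the regular expressions are transcribed from the table, the Plane-1 fourth digit is taken as [g-v] as in the section that defines it, and the input is lower-cased first because letter case only selects the upper or lower variant.

```python
import re

# (pattern, description) pairs transcribed from the summary table.
PATTERNS = [
    (r"@\+[0-9a-v]{4}", "non-BMP without case folding"),
    (r"@[0-9a-f]{2}[g-v][g-v]", "non-BMP with case folding (Plane 1 only)"),
    (r"@[0-9a-f]{2}[g-v][0-9a-f]", "BMP with 14.0.0 case folding"),
    (r"@[0-9a-f]{4}", "BMP without case folding"),
    (r"@[a-f][g-z]", "unused"),
    (r"@([0-9][g-z]|[g-z][0-9]|[g-z][a-z]|@[a-z]|[a-z]@)",
     "BMP with 3.0.0 case folding"),
]


def classify(seq: str):
    """Return the description of the component a sequence belongs to, or None."""
    s = seq.lower()
    for pattern, description in PATTERNS:
        if re.fullmatch(pattern, s):
            return description
    return None
```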


Alexander Barkov added a comment - 2024-09-11 08:08 - edited

Table name to file name encoding extension, proposal #2

The old encoding has an unused range:

[@][a..f][g..z] 6*20= 120 combinations

The idea is to reuse this unused range for the new extensions.

BMP characters with new 14.0.0 casefolding

Let's encode characters with new casefolding as follows:

[@][a-b][g-v][0-9a-v][0-9a-v]

where

@         - the encoded character marker

[a-b]     - the first digit  (2 values)

[g-v]     - the second digit (16 values) - determines upper or lower case

[0-9a-v]  - the third digit  (32 values)

[0-9a-v]  - the fourth digit (32 values)

The total encoded sequence length is 5 characters.

The encoded sequence represents the Unicode code point of the lower case variant of a character.

The second digit [g-v] determines the case:

If it is in the range [g-v], then the character is in the lower case
If it is in the range [G-V], then the character is in the upper case

This encoding gives 2*16*32*32=32768 values, which covers the BMP range U+0000 to U+7FFF. Covering the entire BMP range U+0000 to U+FFFF would need four values for the first digit.

Examples:

@ah81   - U+501 = 0*16*32*32 + 1*32*32 + 8*32 + 1

@aH81   - U+500 = 0*16*32*32 + 1*32*32 + 8*32 + 1, or literally UPPER(U+501)
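
The arithmetic in these examples can be sketched as below, with hypothetical names. Note that with only 'a' and 'b' available as the first digit, the sketch can represent values up to 0x7FFF and rejects anything larger.

```python
DIGITS32 = "0123456789abcdefghijklmnopqrstuv"  # values 0..31
CASE_DIGITS = "ghijklmnopqrstuv"               # values 0..15, carries the case
FIRST_DIGITS = "ab"                            # values 0..1


def encode_bmp_folded_p2(lower_cp: int, upper: bool) -> str:
    """Encode the code point of the lower case variant, most significant digit first."""
    first, rest = divmod(lower_cp, 16 * 32 * 32)
    if lower_cp < 0 or first >= len(FIRST_DIGITS):
        raise ValueError("value does not fit into [a-b][g-v][0-9a-v][0-9a-v]")
    second, rest = divmod(rest, 32 * 32)
    third, fourth = divmod(rest, 32)
    case_digit = CASE_DIGITS[second]
    if upper:
        case_digit = case_digit.upper()
    return "@" + FIRST_DIGITS[first] + case_digit + DIGITS32[third] + DIGITS32[fourth]


print(encode_bmp_folded_p2(0x501, upper=False), encode_bmp_folded_p2(0x501, upper=True))
```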

Non-BMP characters with case folding

Let's encode non-BMP characters with casefolding as follows:

[@][c-f][g-v][0-9a-v][0-9a-v][0-9a-f]

where

@         - the encoded character marker

[c-f]     - the first digit  (4 values)

[g-v]     - the second digit (16 values) - determines upper or lower case

[0-9a-v]  - the third digit  (32 values)

[0-9a-v]  - the fourth digit (32 values)

[0-9a-f]  - the fifth digit  (16 values)

The total encoded sequence length is 6 characters.

The encoded sequence represents the Unicode code point of the lower case variant of a character.

The second digit [g-v] determines the case:

If it is in the range [g-v], then the character is in the lower case
If it is in the range [G-V], then the character is in the upper case

This encoding gives 4*16*32*32*16=1048576 values, which exactly covers the entire
non-BMP range U+010000 to U+10FFFF.
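
No examples are given for this pattern, so the following is only a sketch under an assumption: the five digits are positional, most significant first, over the value of the lower case code point minus 0x10000. The function name is hypothetical.

```python
DIGITS32 = "0123456789abcdefghijklmnopqrstuv"  # values 0..31
CASE_DIGITS = "ghijklmnopqrstuv"               # values 0..15, carries the case
FIRST_DIGITS = "cdef"                          # values 0..3
HEX_DIGITS = "0123456789abcdef"                # values 0..15


def encode_nonbmp_folded_p2(lower_cp: int, upper: bool) -> str:
    """Encode a non-BMP character by the code point of its lower case variant."""
    if not 0x10000 <= lower_cp <= 0x10FFFF:
        raise ValueError("not a non-BMP code point")
    v = lower_cp - 0x10000
    first, v = divmod(v, 16 * 32 * 32 * 16)  # assumed digit weights
    second, v = divmod(v, 32 * 32 * 16)
    third, v = divmod(v, 32 * 16)
    fourth, fifth = divmod(v, 16)
    case_digit = CASE_DIGITS[second]
    if upper:
        case_digit = case_digit.upper()
    return ("@" + FIRST_DIGITS[first] + case_digit
            + DIGITS32[third] + DIGITS32[fourth] + HEX_DIGITS[fifth])
```

Under that assumption, U+10428 DESERET SMALL LETTER LONG I would be written as @cg228 and its upper case variant U+10400 as @cG228.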

Non-BMP characters without folding

Let's encode non-BMP characters without casefolding as follows:

[@][c-f][g-v][0-9a-v][0-9a-v][g-v]

where

@         - the encoded character marker

[c-f]     - the first digit  (4 values)

[g-v]     - the second digit (16 values)

[0-9a-v]  - the third digit  (32 values)

[0-9a-v]  - the fourth digit (32 values)

[g-v]     - the fifth digit  (16 values)

The total encoded sequence length is 6 characters.

This encoding gives 4*16*32*32*16=1048576 values, which exactly covers the entire
non-BMP range U+010000 to U+10FFFF.

Examples

@cg00g   - U+010000 = 0x10000 + 0*16*32*32*16 +  0*32*32*16 +  0*32*16 +  0*16 +  0

@dg00g   - U+050000 = 0x10000 + 1*16*32*32*16 +  0*32*32*16 +  0*32*16 +  0*16 +  0

@eg00g   - U+090000 = 0x10000 + 2*16*32*32*16 +  0*32*32*16 +  0*32*16 +  0*16 +  0

@fg00g   - U+0D0000 = 0x10000 + 3*16*32*32*16 +  0*32*32*16 +  0*32*16 +  0*16 +  0

@fvvvv   - U+10FFFF = 0x10000 + 3*16*32*32*16 + 15*32*32*16 + 31*32*16 + 31*16 + 15
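
The digit weights used in these examples can be written out as a small encoder and decoder. The names are hypothetical; the sketch only reproduces the arithmetic shown above.

```python
DIGITS32 = "0123456789abcdefghijklmnopqrstuv"  # values 0..31
LETTER_DIGITS = "ghijklmnopqrstuv"             # values 0..15
FIRST_DIGITS = "cdef"                          # values 0..3


def encode_nonbmp_nofold_p2(cp: int) -> str:
    """Encode a non-BMP code point as [@][c-f][g-v][0-9a-v][0-9a-v][g-v]."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a non-BMP code point")
    v = cp - 0x10000
    first, v = divmod(v, 16 * 32 * 32 * 16)
    second, v = divmod(v, 32 * 32 * 16)
    third, v = divmod(v, 32 * 16)
    fourth, fifth = divmod(v, 16)
    return ("@" + FIRST_DIGITS[first] + LETTER_DIGITS[second]
            + DIGITS32[third] + DIGITS32[fourth] + LETTER_DIGITS[fifth])


def decode_nonbmp_nofold_p2(seq: str) -> int:
    """Inverse of encode_nonbmp_nofold_p2(); index() raises ValueError on a bad digit."""
    if len(seq) != 6 or seq[0] != "@":
        raise ValueError("not a 6-character '@' sequence")
    first = FIRST_DIGITS.index(seq[1])
    second = LETTER_DIGITS.index(seq[2])
    third = DIGITS32.index(seq[3])
    fourth = DIGITS32.index(seq[4])
    fifth = LETTER_DIGITS.index(seq[5])
    return 0x10000 + (((first * 16 + second) * 32 + third) * 32 + fourth) * 16 + fifth
```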

Summary

With this proposal the full summary of the encoding components will look as follows:

Pattern                                CodePoints               Comment

-------------------------------------  -----------------------  -------------------

[@][0..9][g..z]                        10*20         = 200      BMP characters with 3.0.0 case folding

[@][g..z][0..9]                        20*10         = 200      BMP characters with 3.0.0 case folding

[@][g..z][a..z]                        20*26         = 520      BMP characters with 3.0.0 case folding

[@][@][a..z]                           1*26          = 26       BMP characters with 3.0.0 case folding

[@][a..z][@]                           1*26          = 26       BMP characters with 3.0.0 case folding

[@][0-9a-f][0-9a-f][0-9a-f][0-9a-f]    16*16*16*16   = 65536    BMP characters without case folding

------------------------------------   -----------------------  ----------------

[@][a-b][g-v][0-9a-v][0-9a-v]          2*16*32*32    = 32768    BMP with new folding

[@][c-f][g-v][0-9a-v][0-9a-v][0-9a-f]  4*16*32*32*16 = 1048576  non-BMP with folding

[@][c-f][g-v][0-9a-v][0-9a-v][g-v]     4*16*32*32*16 = 1048576  non-BMP without folding

The advantages of this proposal:

"non-BMP with folding" covers all non-BMP characters in the range U+010000..U+10FFFF.
Does not introduce new characters into the alphabet


Alexander Barkov added a comment - 2024-10-25 09:44

Upgrade issues.

Suppose two BMP characters U+AAAA and U+BBBB:

were not case variants of the same character in the old encoding
but became case variants of the same character in Unicode-14.0.0

Then mariadb-upgrade should not touch tables with such characters and should display them as '#mdb1107#....', so the user can rename them manually.


Ralf Gebhardt added a comment - 2024-11-19 18:59

bar, in today's team lead call we decided to remove this change from the roadmap.

Being able to use utf8mb4 in identifiers has few use cases and little value compared to the possible drawbacks.


People

Assignee: Alexander Barkov

Reporter: Vladislav Vaintroub

Votes: 1

Watchers: 7

Dates

Created: 2022-01-13 13:21

Updated: 2024-11-28 16:59


MariaDB Server

Allow full utf8mb4 for identifiers
