Details
Type: New Feature
Status: Stalled
Priority: Minor
Resolution: Unresolved
Description
Identifier names can't contain characters outside of the BMP, i.e. they are restricted to utf8mb3.
Here is the relevant part of a Slack discussion on why this is so, and on a possible fix:
... discussion on character_set_system and why it is utf8mb3...
....
bar Oct 13th, 2021 at 4:23 PM
@wlad yes, it's hard-coded. I think the biggest problem is to implement table-name-to-file-name encoding for non-BMP characters. Should be doable but needs some time.
wlad 3 months ago
so, a surrogate pair won't do? like, @d801@dc37
bar 3 months ago
for characters that do not have lower/upper variants, it will do.
bar 3 months ago
It will actually do for characters that have lower/upper variants as well.
bar 3 months ago
Thanks for the good idea.
Table name to file name extensions overview
We need to extend the encoding to support:
- new case folding rules in the BMP range that appeared between Unicode-3.0.0 (used in the first version of the encoding) and Unicode-14.0.0 (the current version in MariaDB).
- non-BMP characters in the range U+010000 to U+10FFFF without case folding
- non-BMP characters in the range U+010000 to U+10FFFF with case folding
Various proposals go in separate comments below.
Unicode planes allowed in identifiers
As of version 14.0.0 (and 16.0.0), the Unicode plane assignment looks as follows:
PlaneN  Code Range    Abbr      Name
------  ------------  --------  --------------------------------------
0       0000-FFFF     BMP       Basic Multilingual Plane
1       10000-1FFFF   SMP       Supplementary Multilingual Plane
2       20000-2FFFF   SIP       Supplementary Ideographic Plane
3       30000-3FFFF   TIP       Tertiary Ideographic Plane
4-13    40000-DFFFF   ---       unassigned
14      E0000-EFFFF   SSP       Supplementary Special-purpose Plane
15-16   F0000-10FFFF  SPUA-A/B  Supplementary Private Use Area planes
It is an open question whether we should support unassigned planes in identifiers (and in the table file name encoding), or limit support to assigned planes only.
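To make the "assigned planes only" option concrete, here is a minimal sketch of a plane check. The plane table is copied from the list above; the function names are hypothetical, and whether such a check should be applied to identifiers at all is exactly the open question.

```python
# (first plane, last plane, abbreviation); None marks the unassigned planes.
PLANES = [
    (0, 0, "BMP"),
    (1, 1, "SMP"),
    (2, 2, "SIP"),
    (3, 3, "TIP"),
    (4, 13, None),
    (14, 14, "SSP"),
    (15, 16, "SPUA-A/B"),
]


def plane_of(cp):
    """Return the plane number (0..16) of a code point."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point: %#x" % cp)
    return cp >> 16


def is_in_assigned_plane(cp):
    """True if the code point belongs to a plane that has an assignment."""
    plane = plane_of(cp)
    for first, last, abbr in PLANES:
        if first <= plane <= last:
            return abbr is not None
    raise AssertionError("plane table does not cover plane %d" % plane)


print(plane_of(0x10437), is_in_assigned_plane(0x10437))
print(plane_of(0x40000), is_in_assigned_plane(0x40000))
```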
Characters with unsafe casefolding
Since version 3.0.0, Unicode has added casefolding rules for a few characters that are not round-trip safe: UPPER(ch)<>UPPER(LOWER(ch))
These characters can be extracted using the following script:
-- Enumerate every code point from 1 to 0x10FFFF as a utf32 character
CREATE OR REPLACE VIEW v1 AS
SELECT
  seq,
  char(seq using utf32) collate utf32_uca1400_ai_ci AS ch
FROM seq_1_to_1114111;

-- Keep the characters whose upper case changes after a lower/upper round trip
SELECT
  ch,
  hex(ch) AS cu,
  upper(ch) AS u,
  hex(upper(ch)) AS uc,
  upper(lower(ch)) AS u2,
  hex(upper(lower(ch))) AS u2c
FROM v1
WHERE upper(ch) collate utf32_bin<>upper(lower(ch)) collate utf32_bin;
+------+----------+------+----------+------+----------+
| ch   | cu       | u    | uc       | u2   | u2c      |
+------+----------+------+----------+------+----------+
| İ    | 00000130 | İ    | 00000130 | I    | 00000049 | LATIN CAPITAL LETTER I WITH DOT ABOVE
| ϴ    | 000003F4 | ϴ    | 000003F4 | Θ    | 00000398 | GREEK CAPITAL THETA SYMBOL
| ẞ    | 00001E9E | ẞ    | 00001E9E | ß    | 000000DF | LATIN CAPITAL LETTER SHARP S
| Ω    | 00002126 | Ω    | 00002126 | Ω    | 000003A9 | OHM SIGN
| K    | 0000212A | K    | 0000212A | K    | 0000004B | KELVIN SIGN
| Å    | 0000212B | Å    | 0000212B | Å    | 000000C5 | ANGSTROM SIGN
+------+----------+------+----------+------+----------+
Let's consider this pair as an example:
- UPPER(U+2126 OHM SIGN) = U+2126 OHM SIGN
- UPPER(LOWER(U+2126 OHM SIGN)) = U+03A9 GREEK CAPITAL LETTER OMEGA
There are two options for encoding these characters:
- As not having case folding. This preserves the exact character OHM SIGN, but OHM SIGN and GREEK SMALL LETTER OMEGA will be two distinct characters even on a case insensitive file system.
- As having case folding. In this case OHM SIGN will be replaced with GREEK CAPITAL LETTER OMEGA, which will be equal to GREEK SMALL LETTER OMEGA on a case insensitive file system.
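The round-trip condition can also be checked outside the server. The sketch below uses Python's built-in str.upper() and str.lower(), which apply the full case mappings of the interpreter's own Unicode version. That is an assumption that differs from MariaDB's collation-level casefolding, so a full scan in Python may report a different set of characters than the query above. The function name is hypothetical.

```python
def is_round_trip_unsafe(cp):
    """True if UPPER(ch) differs from UPPER(LOWER(ch)) for this code point."""
    ch = chr(cp)
    return ch.upper() != ch.lower().upper()


# The six code points reported by the SQL query.
REPORTED = [0x0130, 0x03F4, 0x1E9E, 0x2126, 0x212A, 0x212B]

for cp in REPORTED:
    print("U+%04X unsafe=%s" % (cp, is_round_trip_unsafe(cp)))

# OHM SIGN keeps itself under UPPER() but becomes GREEK CAPITAL LETTER OMEGA
# after a lower/upper round trip.
ohm = chr(0x2126)
print("U+%04X" % ord(ohm.upper()), "U+%04X" % ord(ohm.lower().upper()))
```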
Issue Links
- is blocked by
  - MDEV-30556 UPPER() returns an empty string for U+0251 in Unicode-5.2.0+ collations for utf8 (Closed)
  - MDEV-30577 Case folding for uca1400 collations is not up to date (Closed)
  - MDEV-30661 UPPER() returns an empty string for U+0251 in uca1400 collations for utf8 (Closed)
  - MDEV-31340 Remove MY_COLLATION_HANDLER::strcasecmp() (Closed)
  - MDEV-31531 Remove my_casedn_str() and my_caseup_str() (Closed)
  - MDEV-31606 Refactor check_db_name() to get a const argument (Closed)
  - MDEV-31972 Change parameter of make_sp_name*() from LEX_CSTRING to Lex_ident_sys_st (Closed)
  - MDEV-31978 Turn ok_for_lower_case_names() to a method in Lex_ident_fs (Closed)
  - MDEV-32002 Remove my_casedn_str() in append_identifier() context (Closed)
  - MDEV-32019 Replace my_casedn_str(local_buffer) to CharBuffer::copy_casedn() (Closed)
  - MDEV-32081 Remove my_casedn_str() from get_canonical_filename() (Closed)
  - MDEV-35255 Change the collation in INFORMATION_SCHEMA to utf8mb4_general1400_as_ci (In Progress)
- relates to
  - MDEV-19123 Change default charset from latin1 to utf8mb4 (Closed)
  - MDEV-25829 Change default Unicode collation to uca1400_ai_ci (Closed)
  - MDEV-32904 smiley emoji (F09F9883) valid in utf8 but not utf8mb4 (Closed)
Table name to file name encoding extension, proposal #1
non-BMP Encoding without case folding
Let's encode non-BMP characters which do not have case folding as follows:
[@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v]
where:
@ - the encoded character marker
+ - the marker for non-BMP character without case folding
[0-9a-v] - the first digit (32 values)
[0-9a-v] - the second digit (32 values)
[0-9a-v] - the third digit (32 values)
[0-9a-v] - the fourth digit (32 values)
The total sequence length is 6 characters.
This encoding gives a total of 32*32*32*32 = 1048576 values.
It covers exactly all non-BMP characters U+010000 to U+10FFFF (0x10FFFF - 0x10000 + 1 = 1048576).
Examples
@+0000 - U+010000 = 0x10000 + 0*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+1000 - U+018000 = 0x10000 + 1*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+aaaa - U+06294A = 0x10000 + 10*(32^3) + 10*(32^2) + 10*(32^1) + 10
@+vvvv - U+10FFFF = 0x10000 + 31*(32^3) + 31*(32^2) + 31*(32^1) + 31
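The arithmetic in these examples can be sketched as a pair of helpers. This is an illustration of the proposed scheme only, not code from the MariaDB source; the function names are hypothetical.

```python
# The 32 digits 0-9a-v, in value order.
DIGITS = "0123456789abcdefghijklmnopqrstuv"


def encode_non_bmp_no_casefold(cp):
    """Encode a non-BMP code point as @+ followed by four base-32 digits."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a non-BMP code point: %#x" % cp)
    value = cp - 0x10000
    digits = []
    for _ in range(4):
        digits.append(DIGITS[value % 32])
        value //= 32
    return "@+" + "".join(reversed(digits))


def decode_non_bmp_no_casefold(token):
    """Inverse of encode_non_bmp_no_casefold()."""
    if len(token) != 6 or not token.startswith("@+"):
        raise ValueError("not a @+dddd token: %r" % token)
    value = 0
    for ch in token[2:]:
        value = value * 32 + DIGITS.index(ch)  # ValueError on a bad digit
    return 0x10000 + value


for cp in (0x10000, 0x18000, 0x6294A, 0x10FFFF):
    print("U+%06X -> %s" % (cp, encode_non_bmp_no_casefold(cp)))
```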
BMP characters with new case folding mappings
Unicode version 14.0.0 has more casefolding mappings in addition to those that existed in Unicode-3.0.0 (used in the original version of the file name encoding).
Let's encode such characters with new casefolding as follows:
[@][0-9a-f][0-9a-f][g-v][0-9a-f]
where
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-v] - the third digit (16 values) - determines upper or lower case
[0-9a-f] - the fourth digit (16 values)
The total encoded sequence length is 5 characters.
The encoded sequence represents the Unicode code point of the lower case variant of a character.
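The third digit determines the case: a lower case letter [g-v] denotes the lower case variant of the character, and an upper case letter [G-V] denotes its upper case variant.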
This encoding gives 16*16*16*16=65536 values, which covers the entire BMP range U+0000 to U+FFFF.
If in the future we switch to a new Unicode version with more casefolding mapping in the BMP range, we'll be able to encode all such characters.
The mapping between the third digit and its hex value:
0123456789abcdef - the hex value
GHIJKLMNOPQRSTUV - the third digit, upper case
ghijklmnopqrstuv - the third digit, lower case
So, for example, the hex value 7 corresponds to N (upper case) or n (lower case).
For example, Unicode-14.0.0 has the following new casefolding mapping which does not exist in the original file name encoding.
U+0180 LATIN SMALL LETTER B WITH STROKE
U+0243 LATIN CAPITAL LETTER B WITH STROKE
These characters will be encoded as:
@01o0 - the code point U+0180
@01O0 - the code point U+0243, or literally, UPPER case of the code point U+0180
Another example: Unicode-14.0.0 has the following new casefolding mapping:
U+0500 CYRILLIC CAPITAL LETTER KOMI DE
U+0501 CYRILLIC SMALL LETTER KOMI DE
These characters will be encoded as:
@05G1 - the code point U+0500, or literally, UPPER case of the code point U+0501
@05g1 - the code point U+0501
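The two examples above can be reproduced with a short sketch. It takes the code point of the lower case variant plus a flag for the requested case, as described in the text. The function name is hypothetical, and the sketch does not check that the code point actually has a case pair.

```python
# Third-digit letters for the hex values 0..f, as in the mapping above.
THIRD_DIGIT_LOWER = "ghijklmnopqrstuv"
THIRD_DIGIT_UPPER = "GHIJKLMNOPQRSTUV"


def encode_bmp_casefold(lower_cp, upper):
    """Encode a BMP character by the code point of its lower case variant.

    upper=True selects the upper case variant of that character.
    """
    if not 0 <= lower_cp <= 0xFFFF:
        raise ValueError("not a BMP code point: %#x" % lower_cp)
    h = "%04x" % lower_cp
    table = THIRD_DIGIT_UPPER if upper else THIRD_DIGIT_LOWER
    third = table[int(h[2], 16)]
    return "@" + h[0] + h[1] + third + h[3]


print(encode_bmp_casefold(0x0180, upper=False))  # U+0180
print(encode_bmp_casefold(0x0180, upper=True))   # U+0243
print(encode_bmp_casefold(0x0501, upper=True))   # U+0500
print(encode_bmp_casefold(0x0501, upper=False))  # U+0501
```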
non-BMP characters with case folding.
As of Unicode version 14.0.0 (and even in 16.0.0), casefolding data for non-BMP characters is present only in Plane 1.
Let's encode Plane-1 letters with casefolding as follows:
[@][0-9a-f][0-9a-f][g-v][g-v]
where
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-v] - the third digit (16 values) - determines upper or lower case
[g-v] - the fourth digit (16 values)
The total encoded sequence length is 5 characters.
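The third digit determines the case: a lower case letter [g-v] denotes the lower case variant of the character, and an upper case letter [G-V] denotes its upper case variant.
This encoding gives 16*16*16*16=65536 values, which covers the entire Plane-1 range U+10000 to U+1FFFF.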
If in the future we switch to a new Unicode version (from 14.0.0) with more casefolding mapping in the Plane-1 range, we'll be able to encode all such characters.
Example. Unicode-14.0.0 has casefolding between the following characters:
U+10400 DESERET CAPITAL LETTER LONG I
U+10428 DESERET SMALL LETTER LONG I
They will be encoded as:
@04Io - the code point U+10400, or literally UPPER(code point U+10428)
@04io - the code point U+10428
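A sketch of the Plane-1 variant, following the pattern [@][0-9a-f][0-9a-f][g-v][g-v] literally. Two points are assumptions rather than statements from the proposal: the fourth digit uses the same hex-to-letter table as the third digit, and the fourth digit is always written in lower case. With these assumptions U+10428 becomes @04io. The function name is hypothetical.

```python
# Letters for the hex values 0..f (g=0, h=1, ..., v=f).
HEX_TO_LETTER = "ghijklmnopqrstuv"


def encode_plane1_casefold(lower_cp, upper):
    """Encode a Plane-1 character by the code point of its lower case variant.

    upper=True selects the upper case variant of that character.
    """
    if not 0x10000 <= lower_cp <= 0x1FFFF:
        raise ValueError("not a Plane-1 code point: %#x" % lower_cp)
    h = "%04x" % (lower_cp - 0x10000)      # offset inside Plane 1
    third = HEX_TO_LETTER[int(h[2], 16)]
    if upper:
        third = third.upper()              # the third digit carries the case
    fourth = HEX_TO_LETTER[int(h[3], 16)]  # assumption: same table, lower case
    return "@" + h[0] + h[1] + third + fourth


print(encode_plane1_casefold(0x10428, upper=False))
print(encode_plane1_casefold(0x10428, upper=True))
```

Taking the fourth digit from g-v keeps these tokens distinct from the BMP tokens of the form [@][0-9a-f][0-9a-f][g-v][0-9a-f].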
Summary of the encoding components
After adding the mentioned extensions, the encoding will consist of the components:
Pattern                                 CodePoints                 Comment
--------------------------------------  -------------------------  ----------------
[@][0..9][g..z]                         10*20 = 200                BMP characters with 3.0.0 case folding
[@][g..z][0..9]                         20*10 = 200                BMP characters with 3.0.0 case folding
[@][g..z][a..z]                         20*26 = 520                BMP characters with 3.0.0 case folding
[@][@][a..z]                            1*26 = 26                  BMP characters with 3.0.0 case folding
[@][a..z][@]                            1*26 = 26                  BMP characters with 3.0.0 case folding
[@][a..f][g..z]                         6*20 = 120                 Unused
[@][0-9a-f][0-9a-f][0-9a-f][0-9a-f]     16*16*16*16 = 65536        BMP characters without case folding
--------------------------------------  -------------------------  ----------------
[@][0-9a-f][0-9a-f][g-v][0-9a-f]        16*16*16*16 = 65536        BMP characters with 14.0.0 case folding
[@][0-9a-f][0-9a-f][g-v][g-v]           16*16*16*16 = 65536        non-BMP characters with case folding (Plane 1 only)
[@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v]  1*32*32*32*32 = 1048576    non-BMP characters without case folding
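As a sanity check that the components do not overlap, the patterns can be written as regular expressions and each token required to match exactly one of them. This is only a sketch: the labels and the function name are hypothetical, the third digit is allowed in either case ([g-v] or [G-V]) as in the mapping table of the proposal, and the Plane-1 pattern is assumed to use [g-v] for the fourth digit.

```python
import re

COMPONENTS = [
    (r"@[0-9][g-z]", "BMP, 3.0.0 case folding"),
    (r"@[g-z][0-9]", "BMP, 3.0.0 case folding"),
    (r"@[g-z][a-z]", "BMP, 3.0.0 case folding"),
    (r"@@[a-z]", "BMP, 3.0.0 case folding"),
    (r"@[a-z]@", "BMP, 3.0.0 case folding"),
    (r"@[a-f][g-z]", "unused"),
    (r"@[0-9a-f]{4}", "BMP, no case folding"),
    (r"@[0-9a-f]{2}[g-vG-V][0-9a-f]", "BMP, 14.0.0 case folding"),
    (r"@[0-9a-f]{2}[g-vG-V][g-v]", "Plane 1, case folding"),
    (r"@\+[0-9a-v]{4}", "non-BMP, no case folding"),
]


def classify(token):
    """Return the label of the single component that matches the token."""
    matches = [label for pattern, label in COMPONENTS
               if re.fullmatch(pattern, token)]
    if len(matches) != 1:
        raise ValueError("token %r matches %d components" % (token, len(matches)))
    return matches[0]


for token in ("@0428", "@01o0", "@05G1", "@04io", "@+vvvv", "@ag"):
    print(token, "->", classify(token))
```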