Vladislav Vaintroub created issue - 2022-01-13 13:21

Vladislav Vaintroub made changes - 2022-01-13 13:24

Field	Original Value	New Value
Description	Identifier names can't contain characters outside of BMP, i.e they are restricted to utf8mb3 Here is a relevant part of Slack discussion on why it is so, and on possible fix {noformat} bar Oct 13th, 2021 at 4:23 PM @wlad yes, it's hard-coded. I think the biggest problem is to implement table-name-to-file-name encoding for non-BMP characters. Should be doable but needs some time. 5 replies wlad 3 months ago so, a surrogate pair won't do? like, @d801@dc37 bar 3 months ago for characters that do not have lower/upper variants, it will do. bar 3 months ago It will actually do for characters that have lower/upper variants as well. bar 3 months ago Thanks for the good idea. {noformat}	Identifier names can't contain characters outside of BMP, i.e they are restricted to utf8mb3 Here is a relevant part of Slack discussion on why it is so, and on possible fix {noformat} ... discussion on character_set_system and why it is utf8mb3... .... bar Oct 13th, 2021 at 4:23 PM @wlad yes, it's hard-coded. I think the biggest problem is to implement table-name-to-file-name encoding for non-BMP characters. Should be doable but needs some time. 5 replies wlad 3 months ago so, a surrogate pair won't do? like, @d801@dc37 bar 3 months ago for characters that do not have lower/upper variants, it will do. bar 3 months ago It will actually do for characters that have lower/upper variants as well. bar 3 months ago Thanks for the good idea. {noformat}

Sergei Golubchik made changes - 2022-01-13 14:04

Link

This issue is part of ~~MDEV-19123~~ [ ~~MDEV-19123~~ ]

Sergei Golubchik made changes - 2022-04-06 10:48

Fix Version/s		10.10 [ 27530 ]
Fix Version/s	10.9 [ 26905 ]

Sergei Golubchik made changes - 2022-06-15 13:15

Fix Version/s		10.11 [ 27614 ]
Fix Version/s	10.10 [ 27530 ]

Sergei Golubchik made changes - 2022-08-20 19:50

Link

This issue blocks ~~MDEV-19123~~ [ ~~MDEV-19123~~ ]

Sergei Golubchik made changes - 2022-08-20 19:50

Link

This issue is part of ~~MDEV-19123~~ [ ~~MDEV-19123~~ ]

Sergei Golubchik made changes - 2022-08-20 19:51

Priority

Major [ 3 ]

Critical [ 2 ]

Alexander Barkov made changes - 2022-09-07 10:28

Link

This issue blocks ~~MDEV-19123~~ [ ~~MDEV-19123~~ ]

Alexander Barkov made changes - 2022-09-07 10:28

Link

This issue relates to ~~MDEV-19123~~ [ ~~MDEV-19123~~ ]

Ralf Gebhardt made changes - 2022-09-20 13:14

Fix Version/s		10.12 [ 28320 ]
Fix Version/s	10.11 [ 27614 ]

Julien Fritsch made changes - 2022-11-15 16:59

Link

This issue relates to ~~MDEV-25829~~ [ ~~MDEV-25829~~ ]

Alexander Barkov made changes - 2023-02-02 13:45

Status

Open [ 1 ]

In Progress [ 3 ]

Alexander Barkov made changes - 2023-02-03 12:59

Link

This issue is blocked by ~~MDEV-30556~~ [ ~~MDEV-30556~~ ]

Alexander Barkov made changes - 2023-02-06 06:11

Link

This issue is blocked by ~~MDEV-30577~~ [ ~~MDEV-30577~~ ]

Sergei Golubchik made changes - 2023-02-06 13:06

Fix Version/s		11.1 [ 28549 ]
Fix Version/s	11.0 [ 28320 ]

Alexander Barkov made changes - 2023-02-16 07:36

Link

This issue is blocked by ~~MDEV-30661~~ [ ~~MDEV-30661~~ ]

Julien Fritsch made changes - 2023-05-04 09:05

Fix Version/s		11.2 [ 28603 ]
Fix Version/s	11.1 [ 28549 ]

Alexander Barkov made changes - 2023-05-25 08:13

Link

This issue is blocked by ~~MDEV-31340~~ [ ~~MDEV-31340~~ ]

Alexander Barkov made changes - 2023-06-23 09:21

Link

This issue relates to ~~MDEV-31531~~ [ ~~MDEV-31531~~ ]

Alexander Barkov made changes - 2023-06-23 09:23

Link

This issue is blocked by ~~MDEV-31531~~ [ ~~MDEV-31531~~ ]

Alexander Barkov made changes - 2023-06-23 09:23

Link

This issue relates to ~~MDEV-31531~~ [ ~~MDEV-31531~~ ]

Alexander Barkov made changes - 2023-07-03 12:23

Link

This issue is blocked by ~~MDEV-31606~~ [ ~~MDEV-31606~~ ]

Ralf Gebhardt made changes - 2023-07-25 19:46

Fix Version/s		11.3 [ 28565 ]
Fix Version/s	11.2 [ 28603 ]

Alexander Barkov made changes - 2023-08-21 06:13

Link

This issue is blocked by ~~MDEV-31972~~ [ ~~MDEV-31972~~ ]

Alexander Barkov made changes - 2023-08-22 08:26

Link

This issue is blocked by ~~MDEV-31978~~ [ ~~MDEV-31978~~ ]

Alexander Barkov made changes - 2023-08-24 10:07

Link

This issue is blocked by ~~MDEV-32002~~ [ ~~MDEV-32002~~ ]

Alexander Barkov made changes - 2023-08-26 10:37

Link

This issue is blocked by ~~MDEV-32019~~ [ ~~MDEV-32019~~ ]

Alexander Barkov made changes - 2023-09-04 05:01

Link

This issue is blocked by ~~MDEV-32081~~ [ ~~MDEV-32081~~ ]

Sergei Golubchik made changes - 2023-09-17 18:02

Fix Version/s		11.4 [ 29301 ]
Fix Version/s	11.3 [ 28565 ]

Alexander Barkov made changes - 2023-11-10 09:45

Status

In Progress [ 3 ]

Stalled [ 10000 ]

Julien Fritsch made changes - 2023-11-30 16:30

Issue Type

Task [ 3 ]

New Feature [ 2 ]

Alexander Barkov made changes - 2023-12-02 10:35

Link

This issue relates to ~~MDEV-32904~~ [ ~~MDEV-32904~~ ]

Sergei Golubchik made changes - 2023-12-22 17:46

Fix Version/s		11.5 [ 29506 ]
Fix Version/s	11.4 [ 29301 ]

Sergei Golubchik made changes - 2024-03-19 18:33

Fix Version/s		11.6 [ 29515 ]
Fix Version/s	11.5 [ 29506 ]

Sergei Golubchik made changes - 2024-06-04 15:39

Fix Version/s		11.7 [ 29815 ]
Fix Version/s	11.6 [ 29515 ]

Alexander Barkov made changes - 2024-09-09 15:37

Description

Identifier names can't contain characters outside of BMP, i.e they are restricted to utf8mb3
Here is a relevant part of Slack discussion on why it is so, and on possible fix

{noformat}
... discussion on character_set_system and why it is utf8mb3...
....
bar Oct 13th, 2021 at 4:23 PM
@wlad yes, it's hard-coded. I think the biggest problem is to implement table-name-to-file-name encoding for non-BMP characters. Should be doable but needs some time.
5 replies

wlad 3 months ago
so, a surrogate pair won't do? like, @d801@dc37

bar 3 months ago
for characters that do not have lower/upper variants, it will do.

bar 3 months ago
It will actually do for characters that have lower/upper variants as well.

bar 3 months ago
Thanks for the good idea.
{noformat}

Identifier names can't contain characters outside of BMP, i.e they are restricted to utf8mb3
Here is a relevant part of Slack discussion on why it is so, and on possible fix

{noformat}
... discussion on character_set_system and why it is utf8mb3...
....
bar Oct 13th, 2021 at 4:23 PM
@wlad yes, it's hard-coded. I think the biggest problem is to implement table-name-to-file-name encoding for non-BMP characters. Should be doable but needs some time.
5 replies

wlad 3 months ago
so, a surrogate pair won't do? like, @d801@dc37

bar 3 months ago
for characters that do not have lower/upper variants, it will do.

bar 3 months ago
It will actually do for characters that have lower/upper variants as well.

bar 3 months ago
Thanks for the good idea.
{noformat}

h2. Table name to file name encoding extension

We need to extend the encoding to support:
- new case folding in the BMP range appeared since the first version of
the encoding.
- non-BMP characters in the range U+010000 to U+10FFFF without case folding
- non-BMP characters in the range U+010000 to U+10FFFF with case folding

h2. non-BMP Encoding without case folding

Let's encode non-BMP characters which do not have case folding as follows:
{noformat}
[@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v]
{noformat}
where:
{noformat}
@ - the encoded character marker
+ - the marker for non-BMP character without case folding
[0-9a-v] - the first digit (32 values)
[0-9a-v] - the second digit (32 values)
[0-9a-v] - the third digit (32 values)
[0-9a-v] - the fourth digit (32 values)
{noformat}
The total sequence length is 6 characters.

This encoding gives total 32*32*32*32 = 1048576 values
It covers exactly all non-BMP characters U+010000 to U+10FFFF.

h2. BMP characters with new case folding mappings

The Unicode version 14.0.0 has more casefolding mappings in addition
to those existed at the time when the first version of the table name
to file name encoding was introduced.

Let's encode such characters with new casefolding as follows:
{noformat}
[@][0-9a-f][0-9a-f][g-z][0-9a-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[0-9a-f] - the fourth digit (16 values)
{noformat}

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case:
- If it is in the range [G-Z], then the character is in the upper case:

This encoding give 16*16*16*16=65536 values, which covers the entire
BMP range U+0000 to U+FFFF.

If in the future we switch to a new Unicode version with more casefolding
mapping in the BMP range, we'll be able to encode all such characters.

h2. non-BMP characters with case folding.

As of Unicode version 16.0.0, casefolding data presents only in Plane0
(BMP) and Plane1 (U+10000..U+1FFFF).

Let's encode Plane1 letters with casefolding as follows:

{noformat}
[@][0-9a-f][0-9a-f][g-z][g-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[g-z] - the fourth digit (16 values)
{noformat}

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case:
- If it is in the range [G-Z], then the character is in the upper case:

This encoding give 16*16*16*16=65536 values, which covers the entire
Plane1 range U+10000 to U+1FFFF.

If in the future we switch to a new Unicode version with more casefolding
mapping in the Plane1 range, we'll be able to encode all such characters.

Alexander Barkov made changes - 2024-09-09 15:39

Description

Identifier names can't contain characters outside of BMP, i.e they are restricted to utf8mb3
Here is a relevant part of Slack discussion on why it is so, and on possible fix

{noformat}
... discussion on character_set_system and why it is utf8mb3...
....
bar Oct 13th, 2021 at 4:23 PM
@wlad yes, it's hard-coded. I think the biggest problem is to implement table-name-to-file-name encoding for non-BMP characters. Should be doable but needs some time.
5 replies

wlad 3 months ago
so, a surrogate pair won't do? like, @d801@dc37

bar 3 months ago
for characters that do not have lower/upper variants, it will do.

bar 3 months ago
It will actually do for characters that have lower/upper variants as well.

bar 3 months ago
Thanks for the good idea.
{noformat}

h2. Table name to file name encoding extension

We need to extend the encoding to support:
- new case folding in the BMP range appeared since the first version of
the encoding.
- non-BMP characters in the range U+010000 to U+10FFFF without case folding
- non-BMP characters in the range U+010000 to U+10FFFF with case folding

h2. non-BMP Encoding without case folding

Let's encode non-BMP characters which do not have case folding as follows:
{noformat}
[@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v]
{noformat}
where:
{noformat}
@ - the encoded character marker
+ - the marker for non-BMP character without case folding
[0-9a-v] - the first digit (32 values)
[0-9a-v] - the second digit (32 values)
[0-9a-v] - the third digit (32 values)
[0-9a-v] - the fourth digit (32 values)
{noformat}
The total sequence length is 6 characters.

This encoding gives total 32*32*32*32 = 1048576 values
It covers exactly all non-BMP characters U+010000 to U+10FFFF.

h2. BMP characters with new case folding mappings

The Unicode version 14.0.0 has more casefolding mappings in addition
to those existed at the time when the first version of the table name
to file name encoding was introduced.

Let's encode such characters with new casefolding as follows:
{noformat}
[@][0-9a-f][0-9a-f][g-z][0-9a-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[0-9a-f] - the fourth digit (16 values)
{noformat}

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case:
- If it is in the range [G-Z], then the character is in the upper case:

This encoding give 16*16*16*16=65536 values, which covers the entire
BMP range U+0000 to U+FFFF.

If in the future we switch to a new Unicode version with more casefolding
mapping in the BMP range, we'll be able to encode all such characters.

h2. non-BMP characters with case folding.

As of Unicode version 16.0.0, casefolding data presents only in Plane0
(BMP) and Plane1 (U+10000..U+1FFFF).

Let's encode Plane1 letters with casefolding as follows:

{noformat}
[@][0-9a-f][0-9a-f][g-z][g-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[g-z] - the fourth digit (16 values)
{noformat}

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case:
- If it is in the range [G-Z], then the character is in the upper case:

This encoding give 16*16*16*16=65536 values, which covers the entire
Plane1 range U+10000 to U+1FFFF.

If in the future we switch to a new Unicode version with more casefolding
mapping in the Plane1 range, we'll be able to encode all such characters.

Identifier names can't contain characters outside of BMP, i.e they are restricted to utf8mb3
Here is a relevant part of Slack discussion on why it is so, and on possible fix

{noformat}
... discussion on character_set_system and why it is utf8mb3...
....
bar Oct 13th, 2021 at 4:23 PM
@wlad yes, it's hard-coded. I think the biggest problem is to implement table-name-to-file-name encoding for non-BMP characters. Should be doable but needs some time.
5 replies

wlad 3 months ago
so, a surrogate pair won't do? like, @d801@dc37

bar 3 months ago
for characters that do not have lower/upper variants, it will do.

bar 3 months ago
It will actually do for characters that have lower/upper variants as well.

bar 3 months ago
Thanks for the good idea.
{noformat}

h2. Table name to file name encoding extension

We need to extend the encoding to support:
- new case folding in the BMP range appeared since the first version of
the encoding.
- non-BMP characters in the range U+010000 to U+10FFFF without case folding
- non-BMP characters in the range U+010000 to U+10FFFF with case folding

h2. non-BMP Encoding without case folding

Let's encode non-BMP characters which do not have case folding as follows:
{noformat}
[@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v]
{noformat}
where:
{noformat}
@ - the encoded character marker
+ - the marker for non-BMP character without case folding
[0-9a-v] - the first digit (32 values)
[0-9a-v] - the second digit (32 values)
[0-9a-v] - the third digit (32 values)
[0-9a-v] - the fourth digit (32 values)
{noformat}
The total sequence length is 6 characters.

This encoding gives total 32*32*32*32 = 1048576 values
It covers exactly all non-BMP characters U+010000 to U+10FFFF.

h2. BMP characters with new case folding mappings

The Unicode version 14.0.0 has more casefolding mappings in addition
to those existed at the time when the first version of the table name
to file name encoding was introduced.

Let's encode such characters with new casefolding as follows:
{noformat}
[@][0-9a-f][0-9a-f][g-z][0-9a-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[0-9a-f] - the fourth digit (16 values)
{noformat}

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case:
- If it is in the range [G-Z], then the character is in the upper case:

This encoding gives 16*16*16*16=65536 values, which covers the entire
BMP range U+0000 to U+FFFF.

If in the future we switch to a new Unicode version with more casefolding
mapping in the BMP range, we'll be able to encode all such characters.

h2. non-BMP characters with case folding.

As of Unicode version 16.0.0, casefolding data presents only in Plane0
(BMP) and Plane1 (U+10000..U+1FFFF).

Let's encode Plane1 letters with casefolding as follows:

{noformat}
[@][0-9a-f][0-9a-f][g-z][g-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[g-z] - the fourth digit (16 values)
{noformat}

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case:
- If it is in the range [G-Z], then the character is in the upper case:

This encoding give 16*16*16*16=65536 values, which covers the entire
Plane1 range U+10000 to U+1FFFF.

If in the future we switch to a new Unicode version with more casefolding
mapping in the Plane1 range, we'll be able to encode all such characters.

Alexander Barkov made changes - 2024-09-09 15:41

Description

Identifier names can't contain characters outside of BMP, i.e they are restricted to utf8mb3
Here is a relevant part of Slack discussion on why it is so, and on possible fix

{noformat}
... discussion on character_set_system and why it is utf8mb3...
....
bar Oct 13th, 2021 at 4:23 PM
@wlad yes, it's hard-coded. I think the biggest problem is to implement table-name-to-file-name encoding for non-BMP characters. Should be doable but needs some time.
5 replies

wlad 3 months ago
so, a surrogate pair won't do? like, @d801@dc37

bar 3 months ago
for characters that do not have lower/upper variants, it will do.

bar 3 months ago
It will actually do for characters that have lower/upper variants as well.

bar 3 months ago
Thanks for the good idea.
{noformat}

h2. Table name to file name encoding extension

We need to extend the encoding to support:
- new case folding in the BMP range appeared since the first version of
the encoding.
- non-BMP characters in the range U+010000 to U+10FFFF without case folding
- non-BMP characters in the range U+010000 to U+10FFFF with case folding

h2. non-BMP Encoding without case folding

Let's encode non-BMP characters which do not have case folding as follows:
{noformat}
[@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v]
{noformat}
where:
{noformat}
@ - the encoded character marker
+ - the marker for non-BMP character without case folding
[0-9a-v] - the first digit (32 values)
[0-9a-v] - the second digit (32 values)
[0-9a-v] - the third digit (32 values)
[0-9a-v] - the fourth digit (32 values)
{noformat}
The total sequence length is 6 characters.

This encoding gives total 32*32*32*32 = 1048576 values
It covers exactly all non-BMP characters U+010000 to U+10FFFF.

h2. BMP characters with new case folding mappings

The Unicode version 14.0.0 has more casefolding mappings in addition
to those existed at the time when the first version of the table name
to file name encoding was introduced.

Let's encode such characters with new casefolding as follows:
{noformat}
[@][0-9a-f][0-9a-f][g-z][0-9a-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[0-9a-f] - the fourth digit (16 values)
{noformat}

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case:
- If it is in the range [G-Z], then the character is in the upper case:

This encoding gives 16*16*16*16=65536 values, which covers the entire
BMP range U+0000 to U+FFFF.

If in the future we switch to a new Unicode version with more casefolding
mapping in the BMP range, we'll be able to encode all such characters.

h2. non-BMP characters with case folding.

As of Unicode version 16.0.0, casefolding data presents only in Plane0
(BMP) and Plane1 (U+10000..U+1FFFF).

Let's encode Plane1 letters with casefolding as follows:

{noformat}
[@][0-9a-f][0-9a-f][g-z][g-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[g-z] - the fourth digit (16 values)
{noformat}

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case:
- If it is in the range [G-Z], then the character is in the upper case:

This encoding give 16*16*16*16=65536 values, which covers the entire
Plane1 range U+10000 to U+1FFFF.

If in the future we switch to a new Unicode version with more casefolding
mapping in the Plane1 range, we'll be able to encode all such characters.

Identifier names can't contain characters outside of BMP, i.e they are restricted to utf8mb3
Here is a relevant part of Slack discussion on why it is so, and on possible fix

{noformat}
... discussion on character_set_system and why it is utf8mb3...
....
bar Oct 13th, 2021 at 4:23 PM
@wlad yes, it's hard-coded. I think the biggest problem is to implement table-name-to-file-name encoding for non-BMP characters. Should be doable but needs some time.
5 replies

wlad 3 months ago
so, a surrogate pair won't do? like, @d801@dc37

bar 3 months ago
for characters that do not have lower/upper variants, it will do.

bar 3 months ago
It will actually do for characters that have lower/upper variants as well.

bar 3 months ago
Thanks for the good idea.
{noformat}

h2. Table name to file name encoding extension

We need to extend the encoding to support:
- new case folding in the BMP range appeared since the first version of
the encoding.
- non-BMP characters in the range U+010000 to U+10FFFF without case folding
- non-BMP characters in the range U+010000 to U+10FFFF with case folding

h2. non-BMP Encoding without case folding

Let's encode non-BMP characters which do not have case folding as follows:
{noformat}
[@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v]
{noformat}
where:
{noformat}
@ - the encoded character marker
+ - the marker for non-BMP character without case folding
[0-9a-v] - the first digit (32 values)
[0-9a-v] - the second digit (32 values)
[0-9a-v] - the third digit (32 values)
[0-9a-v] - the fourth digit (32 values)
{noformat}
The total sequence length is 6 characters.

This encoding gives total 32*32*32*32 = 1048576 values
It covers exactly all non-BMP characters U+010000 to U+10FFFF.

h2. BMP characters with new case folding mappings

The Unicode version 14.0.0 has more casefolding mappings in addition
to those existed at the time when the first version of the table name
to file name encoding was introduced.

Let's encode such characters with new casefolding as follows:
{noformat}
[@][0-9a-f][0-9a-f][g-z][0-9a-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[0-9a-f] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case:
- If it is in the range [G-Z], then the character is in the upper case:

This encoding gives 16*16*16*16=65536 values, which covers the entire
BMP range U+0000 to U+FFFF.

If in the future we switch to a new Unicode version with more casefolding
mapping in the BMP range, we'll be able to encode all such characters.

h2. non-BMP characters with case folding.

As of Unicode version 16.0.0, casefolding data presents only in Plane0
(BMP) and Plane1 (U+10000..U+1FFFF).

Let's encode Plane1 letters with casefolding as follows:

{noformat}
[@][0-9a-f][0-9a-f][g-z][g-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[g-z] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case:
- If it is in the range [G-Z], then the character is in the upper case:

This encoding gives 16*16*16*16=65536 values, which covers the entire Plane1 range U+10000 to U+1FFFF.

If in the future we switch to a new Unicode version with more casefolding
mapping in the Plane1 range, we'll be able to encode all such characters.

Alexander Barkov made changes - 2024-09-10 12:49

Description

Identifier names can't contain characters outside of BMP, i.e they are restricted to utf8mb3
Here is a relevant part of Slack discussion on why it is so, and on possible fix

{noformat}
... discussion on character_set_system and why it is utf8mb3...
....
bar Oct 13th, 2021 at 4:23 PM
@wlad yes, it's hard-coded. I think the biggest problem is to implement table-name-to-file-name encoding for non-BMP characters. Should be doable but needs some time.
5 replies

wlad 3 months ago
so, a surrogate pair won't do? like, @d801@dc37

bar 3 months ago
for characters that do not have lower/upper variants, it will do.

bar 3 months ago
It will actually do for characters that have lower/upper variants as well.

bar 3 months ago
Thanks for the good idea.
{noformat}

h2. Table name to file name encoding extension

We need to extend the encoding to support:
- new case folding in the BMP range appeared since the first version of
the encoding.
- non-BMP characters in the range U+010000 to U+10FFFF without case folding
- non-BMP characters in the range U+010000 to U+10FFFF with case folding

h2. non-BMP Encoding without case folding

Let's encode non-BMP characters which do not have case folding as follows:
{noformat}
[@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v]
{noformat}
where:
{noformat}
@ - the encoded character marker
+ - the marker for non-BMP character without case folding
[0-9a-v] - the first digit (32 values)
[0-9a-v] - the second digit (32 values)
[0-9a-v] - the third digit (32 values)
[0-9a-v] - the fourth digit (32 values)
{noformat}
The total sequence length is 6 characters.

This encoding gives total 32*32*32*32 = 1048576 values
It covers exactly all non-BMP characters U+010000 to U+10FFFF.

h2. BMP characters with new case folding mappings

The Unicode version 14.0.0 has more casefolding mappings in addition
to those existed at the time when the first version of the table name
to file name encoding was introduced.

Let's encode such characters with new casefolding as follows:
{noformat}
[@][0-9a-f][0-9a-f][g-z][0-9a-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[0-9a-f] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case:
- If it is in the range [G-Z], then the character is in the upper case:

This encoding gives 16*16*16*16=65536 values, which covers the entire
BMP range U+0000 to U+FFFF.

If in the future we switch to a new Unicode version with more casefolding
mapping in the BMP range, we'll be able to encode all such characters.

h2. non-BMP characters with case folding.

As of Unicode version 16.0.0, casefolding data presents only in Plane0
(BMP) and Plane1 (U+10000..U+1FFFF).

Let's encode Plane1 letters with casefolding as follows:

{noformat}
[@][0-9a-f][0-9a-f][g-z][g-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[g-z] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case:
- If it is in the range [G-Z], then the character is in the upper case:

This encoding gives 16*16*16*16=65536 values, which covers the entire Plane1 range U+10000 to U+1FFFF.

If in the future we switch to a new Unicode version with more casefolding
mapping in the Plane1 range, we'll be able to encode all such characters.

Identifier names can't contain characters outside of BMP, i.e they are restricted to utf8mb3
Here is a relevant part of Slack discussion on why it is so, and on possible fix

{noformat}
... discussion on character_set_system and why it is utf8mb3...
....
bar Oct 13th, 2021 at 4:23 PM
@wlad yes, it's hard-coded. I think the biggest problem is to implement table-name-to-file-name encoding for non-BMP characters. Should be doable but needs some time.
5 replies

wlad 3 months ago
so, a surrogate pair won't do? like, @d801@dc37

bar 3 months ago
for characters that do not have lower/upper variants, it will do.

bar 3 months ago
It will actually do for characters that have lower/upper variants as well.

bar 3 months ago
Thanks for the good idea.
{noformat}

h2. Table name to file name encoding extension

We need to extend the encoding to support:
- new case folding in the BMP range appeared since the first version of
the encoding.
- non-BMP characters in the range U+010000 to U+10FFFF without case folding
- non-BMP characters in the range U+010000 to U+10FFFF with case folding

h2. non-BMP Encoding without case folding

Let's encode non-BMP characters which do not have case folding as follows:
{noformat}
[@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v]
{noformat}
where:
{noformat}
@ - the encoded character marker
+ - the marker for non-BMP character without case folding
[0-9a-v] - the first digit (32 values)
[0-9a-v] - the second digit (32 values)
[0-9a-v] - the third digit (32 values)
[0-9a-v] - the fourth digit (32 values)
{noformat}
The total sequence length is 6 characters.

This encoding gives total 32*32*32*32 = 1048576 values
It covers exactly all non-BMP characters U+010000 to U+10FFFF.

Examples
{noformat}
@+0000 - U+010000 = 0x10000 + 0*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+1000 - U+018000 = 0x10000 + 1*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+aaaa - U+06294A = 0x10000 + 10*(32^3) + 10*(32^2) + 10*(32^1) + 10
@+vvvv - U+10FFFF 0 0x10000 + 31*(32^3) + 31*(32^2) + 31*(32^1) + 31
{noformat}

h2. BMP characters with new case folding mappings

The Unicode version 14.0.0 has more casefolding mappings in addition
to those existed at the time when the first version of the table name
to file name encoding was introduced.

Let's encode such characters with new casefolding as follows:
{noformat}
[@][0-9a-f][0-9a-f][g-z][0-9a-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[0-9a-f] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case:
- If it is in the range [G-Z], then the character is in the upper case:

This encoding gives 16*16*16*16=65536 values, which covers the entire
BMP range U+0000 to U+FFFF.

If in the future we switch to a new Unicode version with more casefolding
mapping in the BMP range, we'll be able to encode all such characters.

h2. non-BMP characters with case folding.

As of Unicode version 16.0.0, casefolding data presents only in Plane0
(BMP) and Plane1 (U+10000..U+1FFFF).

Let's encode Plane1 letters with casefolding as follows:

{noformat}
[@][0-9a-f][0-9a-f][g-z][g-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[g-z] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case:
- If it is in the range [G-Z], then the character is in the upper case:

This encoding gives 16*16*16*16=65536 values, which covers the entire Plane1 range U+10000 to U+1FFFF.

If in the future we switch to a new Unicode version with more casefolding
mapping in the Plane1 range, we'll be able to encode all such characters.

Alexander Barkov made changes - 2024-09-10 12:52

Description

Identifier names can't contain characters outside of BMP, i.e they are restricted to utf8mb3
Here is a relevant part of Slack discussion on why it is so, and on possible fix

{noformat}
... discussion on character_set_system and why it is utf8mb3...
....
bar Oct 13th, 2021 at 4:23 PM
@wlad yes, it's hard-coded. I think the biggest problem is to implement table-name-to-file-name encoding for non-BMP characters. Should be doable but needs some time.
5 replies

wlad 3 months ago
so, a surrogate pair won't do? like, @d801@dc37

bar 3 months ago
for characters that do not have lower/upper variants, it will do.

bar 3 months ago
It will actually do for characters that have lower/upper variants as well.

bar 3 months ago
Thanks for the good idea.
{noformat}

h2. Table name to file name encoding extension

We need to extend the encoding to support:
- new case folding in the BMP range appeared since the first version of
the encoding.
- non-BMP characters in the range U+010000 to U+10FFFF without case folding
- non-BMP characters in the range U+010000 to U+10FFFF with case folding

h2. non-BMP Encoding without case folding

Let's encode non-BMP characters which do not have case folding as follows:
{noformat}
[@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v]
{noformat}
where:
{noformat}
@ - the encoded character marker
+ - the marker for non-BMP character without case folding
[0-9a-v] - the first digit (32 values)
[0-9a-v] - the second digit (32 values)
[0-9a-v] - the third digit (32 values)
[0-9a-v] - the fourth digit (32 values)
{noformat}
The total sequence length is 6 characters.

This encoding gives total 32*32*32*32 = 1048576 values
It covers exactly all non-BMP characters U+010000 to U+10FFFF.

Examples
{noformat}
@+0000 - U+010000 = 0x10000 + 0*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+1000 - U+018000 = 0x10000 + 1*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+aaaa - U+06294A = 0x10000 + 10*(32^3) + 10*(32^2) + 10*(32^1) + 10
@+vvvv - U+10FFFF 0 0x10000 + 31*(32^3) + 31*(32^2) + 31*(32^1) + 31
{noformat}

h2. BMP characters with new case folding mappings

The Unicode version 14.0.0 has more casefolding mappings in addition
to those existed at the time when the first version of the table name
to file name encoding was introduced.

Let's encode such characters with new casefolding as follows:
{noformat}
[@][0-9a-f][0-9a-f][g-z][0-9a-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[0-9a-f] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case:
- If it is in the range [G-Z], then the character is in the upper case:

This encoding gives 16*16*16*16=65536 values, which covers the entire
BMP range U+0000 to U+FFFF.

If in the future we switch to a new Unicode version with more casefolding
mapping in the BMP range, we'll be able to encode all such characters.

h2. non-BMP characters with case folding.

As of Unicode version 16.0.0, casefolding data presents only in Plane0
(BMP) and Plane1 (U+10000..U+1FFFF).

Let's encode Plane1 letters with casefolding as follows:

{noformat}
[@][0-9a-f][0-9a-f][g-z][g-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[g-z] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case:
- If it is in the range [G-Z], then the character is in the upper case:

This encoding gives 16*16*16*16=65536 values, which covers the entire Plane1 range U+10000 to U+1FFFF.

If in the future we switch to a new Unicode version with more casefolding
mapping in the Plane1 range, we'll be able to encode all such characters.

Identifier names can't contain characters outside of BMP, i.e they are restricted to utf8mb3
Here is a relevant part of Slack discussion on why it is so, and on possible fix

{noformat}
... discussion on character_set_system and why it is utf8mb3...
....
bar Oct 13th, 2021 at 4:23 PM
@wlad yes, it's hard-coded. I think the biggest problem is to implement table-name-to-file-name encoding for non-BMP characters. Should be doable but needs some time.
5 replies

wlad 3 months ago
so, a surrogate pair won't do? like, @d801@dc37

bar 3 months ago
for characters that do not have lower/upper variants, it will do.

bar 3 months ago
It will actually do for characters that have lower/upper variants as well.

bar 3 months ago
Thanks for the good idea.
{noformat}

h2. Table name to file name encoding extension

We need to extend the encoding to support:
- new case folding in the BMP range appeared since the first version of
the encoding.
- non-BMP characters in the range U+010000 to U+10FFFF without case folding
- non-BMP characters in the range U+010000 to U+10FFFF with case folding

h2. non-BMP Encoding without case folding

Let's encode non-BMP characters which do not have case folding as follows:
{noformat}
[@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v]
{noformat}
where:
{noformat}
@ - the encoded character marker
+ - the marker for non-BMP character without case folding
[0-9a-v] - the first digit (32 values)
[0-9a-v] - the second digit (32 values)
[0-9a-v] - the third digit (32 values)
[0-9a-v] - the fourth digit (32 values)
{noformat}
The total sequence length is 6 characters.

This encoding gives total 32*32*32*32 = 1048576 values
It covers exactly all non-BMP characters U+010000 to U+10FFFF.

Examples
{noformat}
@+0000 - U+010000 = 0x10000 + 0*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+1000 - U+018000 = 0x10000 + 1*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+aaaa - U+06294A = 0x10000 + 10*(32^3) + 10*(32^2) + 10*(32^1) + 10
@+vvvv - U+10FFFF 0 0x10000 + 31*(32^3) + 31*(32^2) + 31*(32^1) + 31
{noformat}

h2. BMP characters with new case folding mappings

The Unicode version 14.0.0 has more casefolding mappings in addition to those existed at the time when the first version of the table name to file name encoding was introduced.

Let's encode such characters with new casefolding as follows:
{noformat}
[@][0-9a-f][0-9a-f][g-z][0-9a-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[0-9a-f] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The code encoded sequence represents the Unicode code point of the lower case variant of a character.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case:
- If it is in the range [G-Z], then the character is in the upper case:

This encoding gives 16*16*16*16=65536 values, which covers the entire
BMP range U+0000 to U+FFFF.

If in the future we switch to a new Unicode version with more casefolding
mapping in the BMP range, we'll be able to encode all such characters.

h2. non-BMP characters with case folding.

As of Unicode version 16.0.0, casefolding data presents only in Plane0
(BMP) and Plane1 (U+10000..U+1FFFF).

Let's encode Plane1 letters with casefolding as follows:

{noformat}
[@][0-9a-f][0-9a-f][g-z][g-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[g-z] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case:
- If it is in the range [G-Z], then the character is in the upper case:

This encoding gives 16*16*16*16=65536 values, which covers the entire Plane1 range U+10000 to U+1FFFF.

If in the future we switch to a new Unicode version with more casefolding
mapping in the Plane1 range, we'll be able to encode all such characters.

Alexander Barkov made changes - 2024-09-10 12:57

Description

Identifier names can't contain characters outside of BMP, i.e they are restricted to utf8mb3
Here is a relevant part of Slack discussion on why it is so, and on possible fix

{noformat}
... discussion on character_set_system and why it is utf8mb3...
....
bar Oct 13th, 2021 at 4:23 PM
@wlad yes, it's hard-coded. I think the biggest problem is to implement table-name-to-file-name encoding for non-BMP characters. Should be doable but needs some time.
5 replies

wlad 3 months ago
so, a surrogate pair won't do? like, @d801@dc37

bar 3 months ago
for characters that do not have lower/upper variants, it will do.

bar 3 months ago
It will actually do for characters that have lower/upper variants as well.

bar 3 months ago
Thanks for the good idea.
{noformat}

h2. Table name to file name encoding extension

We need to extend the encoding to support:
- new case folding in the BMP range appeared since the first version of
the encoding.
- non-BMP characters in the range U+010000 to U+10FFFF without case folding
- non-BMP characters in the range U+010000 to U+10FFFF with case folding

h2. non-BMP Encoding without case folding

Let's encode non-BMP characters which do not have case folding as follows:
{noformat}
[@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v]
{noformat}
where:
{noformat}
@ - the encoded character marker
+ - the marker for non-BMP character without case folding
[0-9a-v] - the first digit (32 values)
[0-9a-v] - the second digit (32 values)
[0-9a-v] - the third digit (32 values)
[0-9a-v] - the fourth digit (32 values)
{noformat}
The total sequence length is 6 characters.

This encoding gives total 32*32*32*32 = 1048576 values
It covers exactly all non-BMP characters U+010000 to U+10FFFF.

Examples
{noformat}
@+0000 - U+010000 = 0x10000 + 0*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+1000 - U+018000 = 0x10000 + 1*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+aaaa - U+06294A = 0x10000 + 10*(32^3) + 10*(32^2) + 10*(32^1) + 10
@+vvvv - U+10FFFF 0 0x10000 + 31*(32^3) + 31*(32^2) + 31*(32^1) + 31
{noformat}

h2. BMP characters with new case folding mappings

The Unicode version 14.0.0 has more casefolding mappings in addition to those existed at the time when the first version of the table name to file name encoding was introduced.

Let's encode such characters with new casefolding as follows:
{noformat}
[@][0-9a-f][0-9a-f][g-z][0-9a-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[0-9a-f] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The code encoded sequence represents the Unicode code point of the lower case variant of a character.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case:
- If it is in the range [G-Z], then the character is in the upper case:

This encoding gives 16*16*16*16=65536 values, which covers the entire
BMP range U+0000 to U+FFFF.

If in the future we switch to a new Unicode version with more casefolding
mapping in the BMP range, we'll be able to encode all such characters.

h2. non-BMP characters with case folding.

As of Unicode version 16.0.0, casefolding data presents only in Plane0
(BMP) and Plane1 (U+10000..U+1FFFF).

Let's encode Plane1 letters with casefolding as follows:

{noformat}
[@][0-9a-f][0-9a-f][g-z][g-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[g-z] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case:
- If it is in the range [G-Z], then the character is in the upper case:

This encoding gives 16*16*16*16=65536 values, which covers the entire Plane1 range U+10000 to U+1FFFF.

If in the future we switch to a new Unicode version with more casefolding
mapping in the Plane1 range, we'll be able to encode all such characters.

Identifier names can't contain characters outside of BMP, i.e they are restricted to utf8mb3
Here is a relevant part of Slack discussion on why it is so, and on possible fix

{noformat}
... discussion on character_set_system and why it is utf8mb3...
....
bar Oct 13th, 2021 at 4:23 PM
@wlad yes, it's hard-coded. I think the biggest problem is to implement table-name-to-file-name encoding for non-BMP characters. Should be doable but needs some time.
5 replies

wlad 3 months ago
so, a surrogate pair won't do? like, @d801@dc37

bar 3 months ago
for characters that do not have lower/upper variants, it will do.

bar 3 months ago
It will actually do for characters that have lower/upper variants as well.

bar 3 months ago
Thanks for the good idea.
{noformat}

h2. Table name to file name encoding extension

We need to extend the encoding to support:
- new case folding in the BMP range appeared since the first version of
the encoding.
- non-BMP characters in the range U+010000 to U+10FFFF without case folding
- non-BMP characters in the range U+010000 to U+10FFFF with case folding

h2. non-BMP Encoding without case folding

Let's encode non-BMP characters which do not have case folding as follows:
{noformat}
[@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v]
{noformat}
where:
{noformat}
@ - the encoded character marker
+ - the marker for non-BMP character without case folding
[0-9a-v] - the first digit (32 values)
[0-9a-v] - the second digit (32 values)
[0-9a-v] - the third digit (32 values)
[0-9a-v] - the fourth digit (32 values)
{noformat}
The total sequence length is 6 characters.

This encoding gives total 32*32*32*32 = 1048576 values
It covers exactly all non-BMP characters U+010000 to U+10FFFF.

Examples
{noformat}
@+0000 - U+010000 = 0x10000 + 0*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+1000 - U+018000 = 0x10000 + 1*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+aaaa - U+06294A = 0x10000 + 10*(32^3) + 10*(32^2) + 10*(32^1) + 10
@+vvvv - U+10FFFF 0 0x10000 + 31*(32^3) + 31*(32^2) + 31*(32^1) + 31
{noformat}

h2. BMP characters with new case folding mappings

The Unicode version 14.0.0 has more casefolding mappings in addition to those existed at the time when the first version of the table name to file name encoding was introduced.

Let's encode such characters with new casefolding as follows:
{noformat}
[@][0-9a-f][0-9a-f][g-z][0-9a-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[0-9a-f] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The code encoded sequence represents the Unicode code point of the lower case variant of a character.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case
- If it is in the range [G-Z], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire
BMP range U+0000 to U+FFFF.

If in the future we switch to a new Unicode version with more casefolding
mapping in the BMP range, we'll be able to encode all such characters.

h2. non-BMP characters with case folding.

As of Unicode version 16.0.0, casefolding data presents only in Plane0
(BMP) and Plane1 (U+10000..U+1FFFF).

Let's encode Plane1 letters with casefolding as follows:

{noformat}
[@][0-9a-f][0-9a-f][g-z][g-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[g-z] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case
- If it is in the range [G-Z], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire Plane1 range U+10000 to U+1FFFF.

If in the future we switch to a new Unicode version with more casefolding
mapping in the Plane1 range, we'll be able to encode all such characters.

Alexander Barkov made changes - 2024-09-10 13:13

Description

Identifier names can't contain characters outside of BMP, i.e they are restricted to utf8mb3
Here is a relevant part of Slack discussion on why it is so, and on possible fix

{noformat}
... discussion on character_set_system and why it is utf8mb3...
....
bar Oct 13th, 2021 at 4:23 PM
@wlad yes, it's hard-coded. I think the biggest problem is to implement table-name-to-file-name encoding for non-BMP characters. Should be doable but needs some time.
5 replies

wlad 3 months ago
so, a surrogate pair won't do? like, @d801@dc37

bar 3 months ago
for characters that do not have lower/upper variants, it will do.

bar 3 months ago
It will actually do for characters that have lower/upper variants as well.

bar 3 months ago
Thanks for the good idea.
{noformat}

h2. Table name to file name encoding extension

We need to extend the encoding to support:
- new case folding in the BMP range appeared since the first version of
the encoding.
- non-BMP characters in the range U+010000 to U+10FFFF without case folding
- non-BMP characters in the range U+010000 to U+10FFFF with case folding

h2. non-BMP Encoding without case folding

Let's encode non-BMP characters which do not have case folding as follows:
{noformat}
[@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v]
{noformat}
where:
{noformat}
@ - the encoded character marker
+ - the marker for non-BMP character without case folding
[0-9a-v] - the first digit (32 values)
[0-9a-v] - the second digit (32 values)
[0-9a-v] - the third digit (32 values)
[0-9a-v] - the fourth digit (32 values)
{noformat}
The total sequence length is 6 characters.

This encoding gives total 32*32*32*32 = 1048576 values
It covers exactly all non-BMP characters U+010000 to U+10FFFF.

Examples
{noformat}
@+0000 - U+010000 = 0x10000 + 0*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+1000 - U+018000 = 0x10000 + 1*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+aaaa - U+06294A = 0x10000 + 10*(32^3) + 10*(32^2) + 10*(32^1) + 10
@+vvvv - U+10FFFF 0 0x10000 + 31*(32^3) + 31*(32^2) + 31*(32^1) + 31
{noformat}

h2. BMP characters with new case folding mappings

The Unicode version 14.0.0 has more casefolding mappings in addition to those existed at the time when the first version of the table name to file name encoding was introduced.

Let's encode such characters with new casefolding as follows:
{noformat}
[@][0-9a-f][0-9a-f][g-z][0-9a-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[0-9a-f] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The code encoded sequence represents the Unicode code point of the lower case variant of a character.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case
- If it is in the range [G-Z], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire
BMP range U+0000 to U+FFFF.

If in the future we switch to a new Unicode version with more casefolding
mapping in the BMP range, we'll be able to encode all such characters.

h2. non-BMP characters with case folding.

As of Unicode version 16.0.0, casefolding data presents only in Plane0
(BMP) and Plane1 (U+10000..U+1FFFF).

Let's encode Plane1 letters with casefolding as follows:

{noformat}
[@][0-9a-f][0-9a-f][g-z][g-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[g-z] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case
- If it is in the range [G-Z], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire Plane1 range U+10000 to U+1FFFF.

If in the future we switch to a new Unicode version with more casefolding
mapping in the Plane1 range, we'll be able to encode all such characters.

Identifier names can't contain characters outside of BMP, i.e they are restricted to utf8mb3
Here is a relevant part of Slack discussion on why it is so, and on possible fix

{noformat}
... discussion on character_set_system and why it is utf8mb3...
....
bar Oct 13th, 2021 at 4:23 PM
@wlad yes, it's hard-coded. I think the biggest problem is to implement table-name-to-file-name encoding for non-BMP characters. Should be doable but needs some time.
5 replies

wlad 3 months ago
so, a surrogate pair won't do? like, @d801@dc37

bar 3 months ago
for characters that do not have lower/upper variants, it will do.

bar 3 months ago
It will actually do for characters that have lower/upper variants as well.

bar 3 months ago
Thanks for the good idea.
{noformat}

h2. Table name to file name encoding extension

We need to extend the encoding to support:
- new case folding in the BMP range appeared since the first version of
the encoding.
- non-BMP characters in the range U+010000 to U+10FFFF without case folding
- non-BMP characters in the range U+010000 to U+10FFFF with case folding

h2. non-BMP Encoding without case folding

Let's encode non-BMP characters which do not have case folding as follows:
{noformat}
[@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v]
{noformat}
where:
{noformat}
@ - the encoded character marker
+ - the marker for non-BMP character without case folding
[0-9a-v] - the first digit (32 values)
[0-9a-v] - the second digit (32 values)
[0-9a-v] - the third digit (32 values)
[0-9a-v] - the fourth digit (32 values)
{noformat}
The total sequence length is 6 characters.

This encoding gives total 32*32*32*32 = 1048576 values
It covers exactly all non-BMP characters U+010000 to U+10FFFF.

Examples
{noformat}
@+0000 - U+010000 = 0x10000 + 0*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+1000 - U+018000 = 0x10000 + 1*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+aaaa - U+06294A = 0x10000 + 10*(32^3) + 10*(32^2) + 10*(32^1) + 10
@+vvvv - U+10FFFF 0 0x10000 + 31*(32^3) + 31*(32^2) + 31*(32^1) + 31
{noformat}

h2. BMP characters with new case folding mappings

The Unicode version 14.0.0 has more casefolding mappings in addition to those existed at the time when the first version of the table name to file name encoding was introduced.

Let's encode such characters with new casefolding as follows:
{noformat}
[@][0-9a-f][0-9a-f][g-z][0-9a-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[0-9a-f] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The code encoded sequence represents the Unicode code point of the lower case variant of a character.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case
- If it is in the range [G-Z], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire
BMP range U+0000 to U+FFFF.

If in the future we switch to a new Unicode version with more casefolding
mapping in the BMP range, we'll be able to encode all such characters.

The mapping beween the third digit and it hex value:
{noformat}
0123456789abcdef - the hex value
GHIJKLMNOPQRSTUV - the third digit, upper case
ghijklmnopqrstuv - the third digit, lower case
{noformat}
So for example, the hex value of 7 corresponds to
- the digit 'N' in case of a upper-case character
- the digit 'n' in case of a lower-case character.

For example, Unicode-14.0.0 has the following new casefolding mapping which
does not exist in the old file name encoding.

{noformat}
U+0180 LATIN SMALL LETTER B WITH STROKE
U+0243 LATIN CAPITAL LETTER B WITH STROKE
{noformat}

These characters will be encoded as:
{noformat}
@01o0 - the code point U+0180
@01O0 - the code point U+0243, or literally, UPPER case of the code point U+0180
{noformat}

Another example: Unicode-14.0.0 has the following new casefolding mapping:
{noformat}
U+0500 CYRILLIC CAPITAL LETTER KOMI DE
U+0501 CYRILLIC SMALL LETTER KOMI DE
{noformat}

These characters will be encoded as:
{noformat}
@05G1 - the code point U+0500, or literally, UPPER case of the code point U+0501
@05g1 - the code poing U+0501
{noformat}

h2. non-BMP characters with case folding.

As of Unicode version 16.0.0, casefolding data presents only in Plane0
(BMP) and Plane1 (U+10000..U+1FFFF).

Let's encode Plane1 letters with casefolding as follows:

{noformat}
[@][0-9a-f][0-9a-f][g-z][g-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[g-z] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case
- If it is in the range [G-Z], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire Plane1 range U+10000 to U+1FFFF.

If in the future we switch to a new Unicode version with more casefolding
mapping in the Plane1 range, we'll be able to encode all such characters.

Alexander Barkov made changes - 2024-09-10 13:14

Description

Identifier names can't contain characters outside of BMP, i.e they are restricted to utf8mb3
Here is a relevant part of Slack discussion on why it is so, and on possible fix

{noformat}
... discussion on character_set_system and why it is utf8mb3...
....
bar Oct 13th, 2021 at 4:23 PM
@wlad yes, it's hard-coded. I think the biggest problem is to implement table-name-to-file-name encoding for non-BMP characters. Should be doable but needs some time.
5 replies

wlad 3 months ago
so, a surrogate pair won't do? like, @d801@dc37

bar 3 months ago
for characters that do not have lower/upper variants, it will do.

bar 3 months ago
It will actually do for characters that have lower/upper variants as well.

bar 3 months ago
Thanks for the good idea.
{noformat}

h2. Table name to file name encoding extension

We need to extend the encoding to support:
- new case folding in the BMP range appeared since the first version of
the encoding.
- non-BMP characters in the range U+010000 to U+10FFFF without case folding
- non-BMP characters in the range U+010000 to U+10FFFF with case folding

h2. non-BMP Encoding without case folding

Let's encode non-BMP characters which do not have case folding as follows:
{noformat}
[@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v]
{noformat}
where:
{noformat}
@ - the encoded character marker
+ - the marker for non-BMP character without case folding
[0-9a-v] - the first digit (32 values)
[0-9a-v] - the second digit (32 values)
[0-9a-v] - the third digit (32 values)
[0-9a-v] - the fourth digit (32 values)
{noformat}
The total sequence length is 6 characters.

This encoding gives total 32*32*32*32 = 1048576 values
It covers exactly all non-BMP characters U+010000 to U+10FFFF.

Examples
{noformat}
@+0000 - U+010000 = 0x10000 + 0*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+1000 - U+018000 = 0x10000 + 1*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+aaaa - U+06294A = 0x10000 + 10*(32^3) + 10*(32^2) + 10*(32^1) + 10
@+vvvv - U+10FFFF 0 0x10000 + 31*(32^3) + 31*(32^2) + 31*(32^1) + 31
{noformat}

h2. BMP characters with new case folding mappings

The Unicode version 14.0.0 has more casefolding mappings in addition to those existed at the time when the first version of the table name to file name encoding was introduced.

Let's encode such characters with new casefolding as follows:
{noformat}
[@][0-9a-f][0-9a-f][g-z][0-9a-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[0-9a-f] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The code encoded sequence represents the Unicode code point of the lower case variant of a character.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case
- If it is in the range [G-Z], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire
BMP range U+0000 to U+FFFF.

If in the future we switch to a new Unicode version with more casefolding
mapping in the BMP range, we'll be able to encode all such characters.

The mapping beween the third digit and it hex value:
{noformat}
0123456789abcdef - the hex value
GHIJKLMNOPQRSTUV - the third digit, upper case
ghijklmnopqrstuv - the third digit, lower case
{noformat}
So for example, the hex value of 7 corresponds to
- the digit 'N' in case of a upper-case character
- the digit 'n' in case of a lower-case character.

For example, Unicode-14.0.0 has the following new casefolding mapping which
does not exist in the old file name encoding.

{noformat}
U+0180 LATIN SMALL LETTER B WITH STROKE
U+0243 LATIN CAPITAL LETTER B WITH STROKE
{noformat}

These characters will be encoded as:
{noformat}
@01o0 - the code point U+0180
@01O0 - the code point U+0243, or literally, UPPER case of the code point U+0180
{noformat}

Another example: Unicode-14.0.0 has the following new casefolding mapping:
{noformat}
U+0500 CYRILLIC CAPITAL LETTER KOMI DE
U+0501 CYRILLIC SMALL LETTER KOMI DE
{noformat}

These characters will be encoded as:
{noformat}
@05G1 - the code point U+0500, or literally, UPPER case of the code point U+0501
@05g1 - the code poing U+0501
{noformat}

h2. non-BMP characters with case folding.

As of Unicode version 16.0.0, casefolding data presents only in Plane0
(BMP) and Plane1 (U+10000..U+1FFFF).

Let's encode Plane1 letters with casefolding as follows:

{noformat}
[@][0-9a-f][0-9a-f][g-z][g-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[g-z] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case
- If it is in the range [G-Z], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire Plane1 range U+10000 to U+1FFFF.

If in the future we switch to a new Unicode version with more casefolding
mapping in the Plane1 range, we'll be able to encode all such characters.

Identifier names can't contain characters outside of BMP, i.e they are restricted to utf8mb3
Here is a relevant part of Slack discussion on why it is so, and on possible fix

{noformat}
... discussion on character_set_system and why it is utf8mb3...
....
bar Oct 13th, 2021 at 4:23 PM
@wlad yes, it's hard-coded. I think the biggest problem is to implement table-name-to-file-name encoding for non-BMP characters. Should be doable but needs some time.
5 replies

wlad 3 months ago
so, a surrogate pair won't do? like, @d801@dc37

bar 3 months ago
for characters that do not have lower/upper variants, it will do.

bar 3 months ago
It will actually do for characters that have lower/upper variants as well.

bar 3 months ago
Thanks for the good idea.
{noformat}

h1. Table name to file name encoding extension

We need to extend the encoding to support:
- new case folding in the BMP range appeared since the first version of
the encoding.
- non-BMP characters in the range U+010000 to U+10FFFF without case folding
- non-BMP characters in the range U+010000 to U+10FFFF with case folding

h1. non-BMP Encoding without case folding

Let's encode non-BMP characters which do not have case folding as follows:
{noformat}
[@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v]
{noformat}
where:
{noformat}
@ - the encoded character marker
+ - the marker for non-BMP character without case folding
[0-9a-v] - the first digit (32 values)
[0-9a-v] - the second digit (32 values)
[0-9a-v] - the third digit (32 values)
[0-9a-v] - the fourth digit (32 values)
{noformat}
The total sequence length is 6 characters.

This encoding gives total 32*32*32*32 = 1048576 values
It covers exactly all non-BMP characters U+010000 to U+10FFFF.

Examples
{noformat}
@+0000 - U+010000 = 0x10000 + 0*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+1000 - U+018000 = 0x10000 + 1*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+aaaa - U+06294A = 0x10000 + 10*(32^3) + 10*(32^2) + 10*(32^1) + 10
@+vvvv - U+10FFFF 0 0x10000 + 31*(32^3) + 31*(32^2) + 31*(32^1) + 31
{noformat}

h1. BMP characters with new case folding mappings

The Unicode version 14.0.0 has more casefolding mappings in addition to those existed at the time when the first version of the table name to file name encoding was introduced.

Let's encode such characters with new casefolding as follows:
{noformat}
[@][0-9a-f][0-9a-f][g-z][0-9a-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[0-9a-f] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The code encoded sequence represents the Unicode code point of the lower case variant of a character.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case
- If it is in the range [G-Z], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire
BMP range U+0000 to U+FFFF.

If in the future we switch to a new Unicode version with more casefolding
mapping in the BMP range, we'll be able to encode all such characters.

The mapping beween the third digit and it hex value:
{noformat}
0123456789abcdef - the hex value
GHIJKLMNOPQRSTUV - the third digit, upper case
ghijklmnopqrstuv - the third digit, lower case
{noformat}
So for example, the hex value of 7 corresponds to
- the digit 'N' in case of a upper-case character
- the digit 'n' in case of a lower-case character.

For example, Unicode-14.0.0 has the following new casefolding mapping which
does not exist in the old file name encoding.

{noformat}
U+0180 LATIN SMALL LETTER B WITH STROKE
U+0243 LATIN CAPITAL LETTER B WITH STROKE
{noformat}

These characters will be encoded as:
{noformat}
@01o0 - the code point U+0180
@01O0 - the code point U+0243, or literally, UPPER case of the code point U+0180
{noformat}

Another example: Unicode-14.0.0 has the following new casefolding mapping:
{noformat}
U+0500 CYRILLIC CAPITAL LETTER KOMI DE
U+0501 CYRILLIC SMALL LETTER KOMI DE
{noformat}

These characters will be encoded as:
{noformat}
@05G1 - the code point U+0500, or literally, UPPER case of the code point U+0501
@05g1 - the code poing U+0501
{noformat}

h1. non-BMP characters with case folding.

As of Unicode version 16.0.0, casefolding data presents only in Plane0
(BMP) and Plane1 (U+10000..U+1FFFF).

Let's encode Plane1 letters with casefolding as follows:

{noformat}
[@][0-9a-f][0-9a-f][g-z][g-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[g-z] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case
- If it is in the range [G-Z], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire Plane1 range U+10000 to U+1FFFF.

If in the future we switch to a new Unicode version with more casefolding
mapping in the Plane1 range, we'll be able to encode all such characters.

Alexander Barkov made changes - 2024-09-10 13:16

Description

Identifier names can't contain characters outside of BMP, i.e they are restricted to utf8mb3
Here is a relevant part of Slack discussion on why it is so, and on possible fix

{noformat}
... discussion on character_set_system and why it is utf8mb3...
....
bar Oct 13th, 2021 at 4:23 PM
@wlad yes, it's hard-coded. I think the biggest problem is to implement table-name-to-file-name encoding for non-BMP characters. Should be doable but needs some time.
5 replies

wlad 3 months ago
so, a surrogate pair won't do? like, @d801@dc37

bar 3 months ago
for characters that do not have lower/upper variants, it will do.

bar 3 months ago
It will actually do for characters that have lower/upper variants as well.

bar 3 months ago
Thanks for the good idea.
{noformat}

h1. Table name to file name encoding extension

We need to extend the encoding to support:
- new case folding in the BMP range appeared since the first version of
the encoding.
- non-BMP characters in the range U+010000 to U+10FFFF without case folding
- non-BMP characters in the range U+010000 to U+10FFFF with case folding

h1. non-BMP Encoding without case folding

Let's encode non-BMP characters which do not have case folding as follows:
{noformat}
[@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v]
{noformat}
where:
{noformat}
@ - the encoded character marker
+ - the marker for non-BMP character without case folding
[0-9a-v] - the first digit (32 values)
[0-9a-v] - the second digit (32 values)
[0-9a-v] - the third digit (32 values)
[0-9a-v] - the fourth digit (32 values)
{noformat}
The total sequence length is 6 characters.

This encoding gives total 32*32*32*32 = 1048576 values
It covers exactly all non-BMP characters U+010000 to U+10FFFF.

Examples
{noformat}
@+0000 - U+010000 = 0x10000 + 0*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+1000 - U+018000 = 0x10000 + 1*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+aaaa - U+06294A = 0x10000 + 10*(32^3) + 10*(32^2) + 10*(32^1) + 10
@+vvvv - U+10FFFF 0 0x10000 + 31*(32^3) + 31*(32^2) + 31*(32^1) + 31
{noformat}

h1. BMP characters with new case folding mappings

The Unicode version 14.0.0 has more casefolding mappings in addition to those existed at the time when the first version of the table name to file name encoding was introduced.

Let's encode such characters with new casefolding as follows:
{noformat}
[@][0-9a-f][0-9a-f][g-z][0-9a-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[0-9a-f] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The code encoded sequence represents the Unicode code point of the lower case variant of a character.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case
- If it is in the range [G-Z], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire
BMP range U+0000 to U+FFFF.

If in the future we switch to a new Unicode version with more casefolding
mapping in the BMP range, we'll be able to encode all such characters.

The mapping beween the third digit and it hex value:
{noformat}
0123456789abcdef - the hex value
GHIJKLMNOPQRSTUV - the third digit, upper case
ghijklmnopqrstuv - the third digit, lower case
{noformat}
So for example, the hex value of 7 corresponds to
- the digit 'N' in case of a upper-case character
- the digit 'n' in case of a lower-case character.

For example, Unicode-14.0.0 has the following new casefolding mapping which
does not exist in the old file name encoding.

{noformat}
U+0180 LATIN SMALL LETTER B WITH STROKE
U+0243 LATIN CAPITAL LETTER B WITH STROKE
{noformat}

These characters will be encoded as:
{noformat}
@01o0 - the code point U+0180
@01O0 - the code point U+0243, or literally, UPPER case of the code point U+0180
{noformat}

Another example: Unicode-14.0.0 has the following new casefolding mapping:
{noformat}
U+0500 CYRILLIC CAPITAL LETTER KOMI DE
U+0501 CYRILLIC SMALL LETTER KOMI DE
{noformat}

These characters will be encoded as:
{noformat}
@05G1 - the code point U+0500, or literally, UPPER case of the code point U+0501
@05g1 - the code poing U+0501
{noformat}

h1. non-BMP characters with case folding.

As of Unicode version 16.0.0, casefolding data presents only in Plane0
(BMP) and Plane1 (U+10000..U+1FFFF).

Let's encode Plane1 letters with casefolding as follows:

{noformat}
[@][0-9a-f][0-9a-f][g-z][g-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[g-z] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case
- If it is in the range [G-Z], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire Plane1 range U+10000 to U+1FFFF.

If in the future we switch to a new Unicode version with more casefolding
mapping in the Plane1 range, we'll be able to encode all such characters.

Identifier names can't contain characters outside of BMP, i.e they are restricted to utf8mb3
Here is a relevant part of Slack discussion on why it is so, and on possible fix

{noformat}
... discussion on character_set_system and why it is utf8mb3...
....
bar Oct 13th, 2021 at 4:23 PM
@wlad yes, it's hard-coded. I think the biggest problem is to implement table-name-to-file-name encoding for non-BMP characters. Should be doable but needs some time.
5 replies

wlad 3 months ago
so, a surrogate pair won't do? like, @d801@dc37

bar 3 months ago
for characters that do not have lower/upper variants, it will do.

bar 3 months ago
It will actually do for characters that have lower/upper variants as well.

bar 3 months ago
Thanks for the good idea.
{noformat}

h1. Table name to file name encoding extension

We need to extend the encoding to support:
- new case folding in the BMP range appeared between Unicode-3.0.0 (used in the first version of the encoding) and Unicode-14.0.0 (the current version in MariaDB).
- non-BMP characters in the range U+010000 to U+10FFFF without case folding
- non-BMP characters in the range U+010000 to U+10FFFF with case folding

h1. non-BMP Encoding without case folding

Let's encode non-BMP characters which do not have case folding as follows:
{noformat}
[@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v]
{noformat}
where:
{noformat}
@ - the encoded character marker
+ - the marker for non-BMP character without case folding
[0-9a-v] - the first digit (32 values)
[0-9a-v] - the second digit (32 values)
[0-9a-v] - the third digit (32 values)
[0-9a-v] - the fourth digit (32 values)
{noformat}
The total sequence length is 6 characters.

This encoding gives total 32*32*32*32 = 1048576 values
It covers exactly all non-BMP characters U+010000 to U+10FFFF.

Examples
{noformat}
@+0000 - U+010000 = 0x10000 + 0*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+1000 - U+018000 = 0x10000 + 1*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+aaaa - U+06294A = 0x10000 + 10*(32^3) + 10*(32^2) + 10*(32^1) + 10
@+vvvv - U+10FFFF 0 0x10000 + 31*(32^3) + 31*(32^2) + 31*(32^1) + 31
{noformat}

h1. BMP characters with new case folding mappings

The Unicode version 14.0.0 has more casefolding mappings in addition to those existed at the time when the first version of the table name to file name encoding was introduced.

Let's encode such characters with new casefolding as follows:
{noformat}
[@][0-9a-f][0-9a-f][g-z][0-9a-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[0-9a-f] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The code encoded sequence represents the Unicode code point of the lower case variant of a character.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case
- If it is in the range [G-Z], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire
BMP range U+0000 to U+FFFF.

If in the future we switch to a new Unicode version with more casefolding
mapping in the BMP range, we'll be able to encode all such characters.

The mapping beween the third digit and it hex value:
{noformat}
0123456789abcdef - the hex value
GHIJKLMNOPQRSTUV - the third digit, upper case
ghijklmnopqrstuv - the third digit, lower case
{noformat}
So for example, the hex value of 7 corresponds to
- the digit 'N' in case of a upper-case character
- the digit 'n' in case of a lower-case character.

For example, Unicode-14.0.0 has the following new casefolding mapping which
does not exist in the old file name encoding.

{noformat}
U+0180 LATIN SMALL LETTER B WITH STROKE
U+0243 LATIN CAPITAL LETTER B WITH STROKE
{noformat}

These characters will be encoded as:
{noformat}
@01o0 - the code point U+0180
@01O0 - the code point U+0243, or literally, UPPER case of the code point U+0180
{noformat}

Another example: Unicode-14.0.0 has the following new casefolding mapping:
{noformat}
U+0500 CYRILLIC CAPITAL LETTER KOMI DE
U+0501 CYRILLIC SMALL LETTER KOMI DE
{noformat}

These characters will be encoded as:
{noformat}
@05G1 - the code point U+0500, or literally, UPPER case of the code point U+0501
@05g1 - the code poing U+0501
{noformat}

h1. non-BMP characters with case folding.

As of Unicode version 16.0.0, casefolding data presents only in Plane0
(BMP) and Plane1 (U+10000..U+1FFFF).

Let's encode Plane1 letters with casefolding as follows:

{noformat}
[@][0-9a-f][0-9a-f][g-z][g-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[g-z] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case
- If it is in the range [G-Z], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire Plane1 range U+10000 to U+1FFFF.

If in the future we switch to a new Unicode version with more casefolding
mapping in the Plane1 range, we'll be able to encode all such characters.

Alexander Barkov made changes - 2024-09-10 13:17

Description

Identifier names can't contain characters outside of BMP, i.e they are restricted to utf8mb3
Here is a relevant part of Slack discussion on why it is so, and on possible fix

{noformat}
... discussion on character_set_system and why it is utf8mb3...
....
bar Oct 13th, 2021 at 4:23 PM
@wlad yes, it's hard-coded. I think the biggest problem is to implement table-name-to-file-name encoding for non-BMP characters. Should be doable but needs some time.
5 replies

wlad 3 months ago
so, a surrogate pair won't do? like, @d801@dc37

bar 3 months ago
for characters that do not have lower/upper variants, it will do.

bar 3 months ago
It will actually do for characters that have lower/upper variants as well.

bar 3 months ago
Thanks for the good idea.
{noformat}

h1. Table name to file name encoding extension

We need to extend the encoding to support:
- new case folding in the BMP range appeared between Unicode-3.0.0 (used in the first version of the encoding) and Unicode-14.0.0 (the current version in MariaDB).
- non-BMP characters in the range U+010000 to U+10FFFF without case folding
- non-BMP characters in the range U+010000 to U+10FFFF with case folding

h1. non-BMP Encoding without case folding

Let's encode non-BMP characters which do not have case folding as follows:
{noformat}
[@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v]
{noformat}
where:
{noformat}
@ - the encoded character marker
+ - the marker for non-BMP character without case folding
[0-9a-v] - the first digit (32 values)
[0-9a-v] - the second digit (32 values)
[0-9a-v] - the third digit (32 values)
[0-9a-v] - the fourth digit (32 values)
{noformat}
The total sequence length is 6 characters.

This encoding gives total 32*32*32*32 = 1048576 values
It covers exactly all non-BMP characters U+010000 to U+10FFFF.

Examples
{noformat}
@+0000 - U+010000 = 0x10000 + 0*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+1000 - U+018000 = 0x10000 + 1*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+aaaa - U+06294A = 0x10000 + 10*(32^3) + 10*(32^2) + 10*(32^1) + 10
@+vvvv - U+10FFFF 0 0x10000 + 31*(32^3) + 31*(32^2) + 31*(32^1) + 31
{noformat}

h1. BMP characters with new case folding mappings

The Unicode version 14.0.0 has more casefolding mappings in addition to those existed at the time when the first version of the table name to file name encoding was introduced.

Let's encode such characters with new casefolding as follows:
{noformat}
[@][0-9a-f][0-9a-f][g-z][0-9a-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[0-9a-f] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The code encoded sequence represents the Unicode code point of the lower case variant of a character.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case
- If it is in the range [G-Z], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire
BMP range U+0000 to U+FFFF.

If in the future we switch to a new Unicode version with more casefolding
mapping in the BMP range, we'll be able to encode all such characters.

The mapping beween the third digit and it hex value:
{noformat}
0123456789abcdef - the hex value
GHIJKLMNOPQRSTUV - the third digit, upper case
ghijklmnopqrstuv - the third digit, lower case
{noformat}
So for example, the hex value of 7 corresponds to
- the digit 'N' in case of a upper-case character
- the digit 'n' in case of a lower-case character.

For example, Unicode-14.0.0 has the following new casefolding mapping which
does not exist in the old file name encoding.

{noformat}
U+0180 LATIN SMALL LETTER B WITH STROKE
U+0243 LATIN CAPITAL LETTER B WITH STROKE
{noformat}

These characters will be encoded as:
{noformat}
@01o0 - the code point U+0180
@01O0 - the code point U+0243, or literally, UPPER case of the code point U+0180
{noformat}

Another example: Unicode-14.0.0 has the following new casefolding mapping:
{noformat}
U+0500 CYRILLIC CAPITAL LETTER KOMI DE
U+0501 CYRILLIC SMALL LETTER KOMI DE
{noformat}

These characters will be encoded as:
{noformat}
@05G1 - the code point U+0500, or literally, UPPER case of the code point U+0501
@05g1 - the code poing U+0501
{noformat}

h1. non-BMP characters with case folding.

As of Unicode version 16.0.0, casefolding data presents only in Plane0
(BMP) and Plane1 (U+10000..U+1FFFF).

Let's encode Plane1 letters with casefolding as follows:

{noformat}
[@][0-9a-f][0-9a-f][g-z][g-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[g-z] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case
- If it is in the range [G-Z], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire Plane1 range U+10000 to U+1FFFF.

If in the future we switch to a new Unicode version with more casefolding
mapping in the Plane1 range, we'll be able to encode all such characters.

Identifier names can't contain characters outside of BMP, i.e they are restricted to utf8mb3
Here is a relevant part of Slack discussion on why it is so, and on possible fix

{noformat}
... discussion on character_set_system and why it is utf8mb3...
....
bar Oct 13th, 2021 at 4:23 PM
@wlad yes, it's hard-coded. I think the biggest problem is to implement table-name-to-file-name encoding for non-BMP characters. Should be doable but needs some time.
5 replies

wlad 3 months ago
so, a surrogate pair won't do? like, @d801@dc37

bar 3 months ago
for characters that do not have lower/upper variants, it will do.

bar 3 months ago
It will actually do for characters that have lower/upper variants as well.

bar 3 months ago
Thanks for the good idea.
{noformat}

h1. Table name to file name encoding extension

We need to extend the encoding to support:
- new case folding in the BMP range appeared between Unicode-3.0.0 (used in the first version of the encoding) and Unicode-14.0.0 (the current version in MariaDB).
- non-BMP characters in the range U+010000 to U+10FFFF without case folding
- non-BMP characters in the range U+010000 to U+10FFFF with case folding

h1. non-BMP Encoding without case folding

Let's encode non-BMP characters which do not have case folding as follows:
{noformat}
[@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v]
{noformat}
where:
{noformat}
@ - the encoded character marker
+ - the marker for non-BMP character without case folding
[0-9a-v] - the first digit (32 values)
[0-9a-v] - the second digit (32 values)
[0-9a-v] - the third digit (32 values)
[0-9a-v] - the fourth digit (32 values)
{noformat}
The total sequence length is 6 characters.

This encoding gives total 32*32*32*32 = 1048576 values
It covers exactly all non-BMP characters U+010000 to U+10FFFF.

Examples
{noformat}
@+0000 - U+010000 = 0x10000 + 0*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+1000 - U+018000 = 0x10000 + 1*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+aaaa - U+06294A = 0x10000 + 10*(32^3) + 10*(32^2) + 10*(32^1) + 10
@+vvvv - U+10FFFF 0 0x10000 + 31*(32^3) + 31*(32^2) + 31*(32^1) + 31
{noformat}

h1. BMP characters with new case folding mappings

The Unicode version 14.0.0 has more casefolding mappings in addition to those existed Unicode-3.0.0 (used in the original version of the file name encoding).

Let's encode such characters with new casefolding as follows:
{noformat}
[@][0-9a-f][0-9a-f][g-z][0-9a-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[0-9a-f] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The code encoded sequence represents the Unicode code point of the lower case variant of a character.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case
- If it is in the range [G-Z], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire
BMP range U+0000 to U+FFFF.

If in the future we switch to a new Unicode version with more casefolding
mapping in the BMP range, we'll be able to encode all such characters.

The mapping beween the third digit and it hex value:
{noformat}
0123456789abcdef - the hex value
GHIJKLMNOPQRSTUV - the third digit, upper case
ghijklmnopqrstuv - the third digit, lower case
{noformat}
So for example, the hex value of 7 corresponds to
- the digit 'N' in case of a upper-case character
- the digit 'n' in case of a lower-case character.

For example, Unicode-14.0.0 has the following new casefolding mapping which
does not exist in the old file name encoding.

{noformat}
U+0180 LATIN SMALL LETTER B WITH STROKE
U+0243 LATIN CAPITAL LETTER B WITH STROKE
{noformat}

These characters will be encoded as:
{noformat}
@01o0 - the code point U+0180
@01O0 - the code point U+0243, or literally, UPPER case of the code point U+0180
{noformat}

Another example: Unicode-14.0.0 has the following new casefolding mapping:
{noformat}
U+0500 CYRILLIC CAPITAL LETTER KOMI DE
U+0501 CYRILLIC SMALL LETTER KOMI DE
{noformat}

These characters will be encoded as:
{noformat}
@05G1 - the code point U+0500, or literally, UPPER case of the code point U+0501
@05g1 - the code poing U+0501
{noformat}

h1. non-BMP characters with case folding.

As of Unicode version 16.0.0, casefolding data presents only in Plane0
(BMP) and Plane1 (U+10000..U+1FFFF).

Let's encode Plane1 letters with casefolding as follows:

{noformat}
[@][0-9a-f][0-9a-f][g-z][g-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[g-z] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case
- If it is in the range [G-Z], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire Plane1 range U+10000 to U+1FFFF.

If in the future we switch to a new Unicode version with more casefolding
mapping in the Plane1 range, we'll be able to encode all such characters.

Alexander Barkov made changes - 2024-09-10 13:20

Description

Identifier names can't contain characters outside of BMP, i.e they are restricted to utf8mb3
Here is a relevant part of Slack discussion on why it is so, and on possible fix

{noformat}
... discussion on character_set_system and why it is utf8mb3...
....
bar Oct 13th, 2021 at 4:23 PM
@wlad yes, it's hard-coded. I think the biggest problem is to implement table-name-to-file-name encoding for non-BMP characters. Should be doable but needs some time.
5 replies

wlad 3 months ago
so, a surrogate pair won't do? like, @d801@dc37

bar 3 months ago
for characters that do not have lower/upper variants, it will do.

bar 3 months ago
It will actually do for characters that have lower/upper variants as well.

bar 3 months ago
Thanks for the good idea.
{noformat}

h1. Table name to file name encoding extension

We need to extend the encoding to support:
- new case folding in the BMP range appeared between Unicode-3.0.0 (used in the first version of the encoding) and Unicode-14.0.0 (the current version in MariaDB).
- non-BMP characters in the range U+010000 to U+10FFFF without case folding
- non-BMP characters in the range U+010000 to U+10FFFF with case folding

h1. non-BMP Encoding without case folding

Let's encode non-BMP characters which do not have case folding as follows:
{noformat}
[@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v]
{noformat}
where:
{noformat}
@ - the encoded character marker
+ - the marker for non-BMP character without case folding
[0-9a-v] - the first digit (32 values)
[0-9a-v] - the second digit (32 values)
[0-9a-v] - the third digit (32 values)
[0-9a-v] - the fourth digit (32 values)
{noformat}
The total sequence length is 6 characters.

This encoding gives total 32*32*32*32 = 1048576 values
It covers exactly all non-BMP characters U+010000 to U+10FFFF.

Examples
{noformat}
@+0000 - U+010000 = 0x10000 + 0*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+1000 - U+018000 = 0x10000 + 1*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+aaaa - U+06294A = 0x10000 + 10*(32^3) + 10*(32^2) + 10*(32^1) + 10
@+vvvv - U+10FFFF 0 0x10000 + 31*(32^3) + 31*(32^2) + 31*(32^1) + 31
{noformat}

h1. BMP characters with new case folding mappings

The Unicode version 14.0.0 has more casefolding mappings in addition to those existed Unicode-3.0.0 (used in the original version of the file name encoding).

Let's encode such characters with new casefolding as follows:
{noformat}
[@][0-9a-f][0-9a-f][g-z][0-9a-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[0-9a-f] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The code encoded sequence represents the Unicode code point of the lower case variant of a character.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case
- If it is in the range [G-Z], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire
BMP range U+0000 to U+FFFF.

If in the future we switch to a new Unicode version with more casefolding
mapping in the BMP range, we'll be able to encode all such characters.

The mapping beween the third digit and it hex value:
{noformat}
0123456789abcdef - the hex value
GHIJKLMNOPQRSTUV - the third digit, upper case
ghijklmnopqrstuv - the third digit, lower case
{noformat}
So for example, the hex value of 7 corresponds to
- the digit 'N' in case of a upper-case character
- the digit 'n' in case of a lower-case character.

For example, Unicode-14.0.0 has the following new casefolding mapping which
does not exist in the old file name encoding.

{noformat}
U+0180 LATIN SMALL LETTER B WITH STROKE
U+0243 LATIN CAPITAL LETTER B WITH STROKE
{noformat}

These characters will be encoded as:
{noformat}
@01o0 - the code point U+0180
@01O0 - the code point U+0243, or literally, UPPER case of the code point U+0180
{noformat}

Another example: Unicode-14.0.0 has the following new casefolding mapping:
{noformat}
U+0500 CYRILLIC CAPITAL LETTER KOMI DE
U+0501 CYRILLIC SMALL LETTER KOMI DE
{noformat}

These characters will be encoded as:
{noformat}
@05G1 - the code point U+0500, or literally, UPPER case of the code point U+0501
@05g1 - the code poing U+0501
{noformat}

h1. non-BMP characters with case folding.

As of Unicode version 16.0.0, casefolding data presents only in Plane0
(BMP) and Plane1 (U+10000..U+1FFFF).

Let's encode Plane1 letters with casefolding as follows:

{noformat}
[@][0-9a-f][0-9a-f][g-z][g-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[g-z] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case
- If it is in the range [G-Z], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire Plane1 range U+10000 to U+1FFFF.

If in the future we switch to a new Unicode version with more casefolding
mapping in the Plane1 range, we'll be able to encode all such characters.

Identifier names can't contain characters outside of BMP, i.e they are restricted to utf8mb3
Here is a relevant part of Slack discussion on why it is so, and on possible fix

{noformat}
... discussion on character_set_system and why it is utf8mb3...
....
bar Oct 13th, 2021 at 4:23 PM
@wlad yes, it's hard-coded. I think the biggest problem is to implement table-name-to-file-name encoding for non-BMP characters. Should be doable but needs some time.
5 replies

wlad 3 months ago
so, a surrogate pair won't do? like, @d801@dc37

bar 3 months ago
for characters that do not have lower/upper variants, it will do.

bar 3 months ago
It will actually do for characters that have lower/upper variants as well.

bar 3 months ago
Thanks for the good idea.
{noformat}

h1. Table name to file name encoding extension

We need to extend the encoding to support:
- new case folding in the BMP range appeared between Unicode-3.0.0 (used in the first version of the encoding) and Unicode-14.0.0 (the current version in MariaDB).
- non-BMP characters in the range U+010000 to U+10FFFF without case folding
- non-BMP characters in the range U+010000 to U+10FFFF with case folding

h1. non-BMP Encoding without case folding

Let's encode non-BMP characters which do not have case folding as follows:
{noformat}
[@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v]
{noformat}
where:
{noformat}
@ - the encoded character marker
+ - the marker for non-BMP character without case folding
[0-9a-v] - the first digit (32 values)
[0-9a-v] - the second digit (32 values)
[0-9a-v] - the third digit (32 values)
[0-9a-v] - the fourth digit (32 values)
{noformat}
The total sequence length is 6 characters.

This encoding gives total 32*32*32*32 = 1048576 values
It covers exactly all non-BMP characters U+010000 to U+10FFFF.

Examples
{noformat}
@+0000 - U+010000 = 0x10000 + 0*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+1000 - U+018000 = 0x10000 + 1*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+aaaa - U+06294A = 0x10000 + 10*(32^3) + 10*(32^2) + 10*(32^1) + 10
@+vvvv - U+10FFFF 0 0x10000 + 31*(32^3) + 31*(32^2) + 31*(32^1) + 31
{noformat}

h1. BMP characters with new case folding mappings

The Unicode version 14.0.0 has more casefolding mappings in addition to those existed Unicode-3.0.0 (used in the original version of the file name encoding).

Let's encode such characters with new casefolding as follows:
{noformat}
[@][0-9a-f][0-9a-f][g-z][0-9a-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[0-9a-f] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The encoded sequence represents the Unicode code point of the lower case variant of a character.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case
- If it is in the range [G-Z], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire BMP range U+0000 to U+FFFF.

If in the future we switch to a new Unicode version with more casefolding mapping in the BMP range, we'll be able to encode all such characters.

The mapping between the third digit and it hex value:
{noformat}
0123456789abcdef - the hex value
GHIJKLMNOPQRSTUV - the third digit, upper case
ghijklmnopqrstuv - the third digit, lower case
{noformat}
So for example, the hex value of 7 corresponds to
- the digit 'N' in case of a upper-case character
- the digit 'n' in case of a lower-case character.

For example, Unicode-14.0.0 has the following new casefolding mapping which does not exist in the original file name encoding.

{noformat}
U+0180 LATIN SMALL LETTER B WITH STROKE
U+0243 LATIN CAPITAL LETTER B WITH STROKE
{noformat}

These characters will be encoded as:
{noformat}
@01o0 - the code point U+0180
@01O0 - the code point U+0243, or literally, UPPER case of the code point U+0180
{noformat}

Another example: Unicode-14.0.0 has the following new casefolding mapping:
{noformat}
U+0500 CYRILLIC CAPITAL LETTER KOMI DE
U+0501 CYRILLIC SMALL LETTER KOMI DE
{noformat}

These characters will be encoded as:
{noformat}
@05G1 - the code point U+0500, or literally, UPPER case of the code point U+0501
@05g1 - the code poing U+0501
{noformat}

h1. non-BMP characters with case folding.

As of Unicode version 16.0.0, casefolding data presents only in Plane0
(BMP) and Plane1 (U+10000..U+1FFFF).

Let's encode Plane1 letters with casefolding as follows:

{noformat}
[@][0-9a-f][0-9a-f][g-z][g-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[g-z] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case
- If it is in the range [G-Z], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire Plane1 range U+10000 to U+1FFFF.

If in the future we switch to a new Unicode version with more casefolding
mapping in the Plane1 range, we'll be able to encode all such characters.

Alexander Barkov made changes - 2024-09-10 13:28

Description

Identifier names can't contain characters outside of BMP, i.e they are restricted to utf8mb3
Here is a relevant part of Slack discussion on why it is so, and on possible fix

{noformat}
... discussion on character_set_system and why it is utf8mb3...
....
bar Oct 13th, 2021 at 4:23 PM
@wlad yes, it's hard-coded. I think the biggest problem is to implement table-name-to-file-name encoding for non-BMP characters. Should be doable but needs some time.
5 replies

wlad 3 months ago
so, a surrogate pair won't do? like, @d801@dc37

bar 3 months ago
for characters that do not have lower/upper variants, it will do.

bar 3 months ago
It will actually do for characters that have lower/upper variants as well.

bar 3 months ago
Thanks for the good idea.
{noformat}

h1. Table name to file name encoding extension

We need to extend the encoding to support:
- new case folding in the BMP range appeared between Unicode-3.0.0 (used in the first version of the encoding) and Unicode-14.0.0 (the current version in MariaDB).
- non-BMP characters in the range U+010000 to U+10FFFF without case folding
- non-BMP characters in the range U+010000 to U+10FFFF with case folding

h1. non-BMP Encoding without case folding

Let's encode non-BMP characters which do not have case folding as follows:
{noformat}
[@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v]
{noformat}
where:
{noformat}
@ - the encoded character marker
+ - the marker for non-BMP character without case folding
[0-9a-v] - the first digit (32 values)
[0-9a-v] - the second digit (32 values)
[0-9a-v] - the third digit (32 values)
[0-9a-v] - the fourth digit (32 values)
{noformat}
The total sequence length is 6 characters.

This encoding gives total 32*32*32*32 = 1048576 values
It covers exactly all non-BMP characters U+010000 to U+10FFFF.

Examples
{noformat}
@+0000 - U+010000 = 0x10000 + 0*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+1000 - U+018000 = 0x10000 + 1*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+aaaa - U+06294A = 0x10000 + 10*(32^3) + 10*(32^2) + 10*(32^1) + 10
@+vvvv - U+10FFFF 0 0x10000 + 31*(32^3) + 31*(32^2) + 31*(32^1) + 31
{noformat}

h1. BMP characters with new case folding mappings

The Unicode version 14.0.0 has more casefolding mappings in addition to those existed Unicode-3.0.0 (used in the original version of the file name encoding).

Let's encode such characters with new casefolding as follows:
{noformat}
[@][0-9a-f][0-9a-f][g-z][0-9a-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[0-9a-f] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The encoded sequence represents the Unicode code point of the lower case variant of a character.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case
- If it is in the range [G-Z], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire BMP range U+0000 to U+FFFF.

If in the future we switch to a new Unicode version with more casefolding mapping in the BMP range, we'll be able to encode all such characters.

The mapping between the third digit and it hex value:
{noformat}
0123456789abcdef - the hex value
GHIJKLMNOPQRSTUV - the third digit, upper case
ghijklmnopqrstuv - the third digit, lower case
{noformat}
So for example, the hex value of 7 corresponds to
- the digit 'N' in case of a upper-case character
- the digit 'n' in case of a lower-case character.

For example, Unicode-14.0.0 has the following new casefolding mapping which does not exist in the original file name encoding.

{noformat}
U+0180 LATIN SMALL LETTER B WITH STROKE
U+0243 LATIN CAPITAL LETTER B WITH STROKE
{noformat}

These characters will be encoded as:
{noformat}
@01o0 - the code point U+0180
@01O0 - the code point U+0243, or literally, UPPER case of the code point U+0180
{noformat}

Another example: Unicode-14.0.0 has the following new casefolding mapping:
{noformat}
U+0500 CYRILLIC CAPITAL LETTER KOMI DE
U+0501 CYRILLIC SMALL LETTER KOMI DE
{noformat}

These characters will be encoded as:
{noformat}
@05G1 - the code point U+0500, or literally, UPPER case of the code point U+0501
@05g1 - the code poing U+0501
{noformat}

h1. non-BMP characters with case folding.

As of Unicode version 16.0.0, casefolding data presents only in Plane0
(BMP) and Plane1 (U+10000..U+1FFFF).

Let's encode Plane1 letters with casefolding as follows:

{noformat}
[@][0-9a-f][0-9a-f][g-z][g-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[g-z] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case
- If it is in the range [G-Z], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire Plane1 range U+10000 to U+1FFFF.

If in the future we switch to a new Unicode version with more casefolding
mapping in the Plane1 range, we'll be able to encode all such characters.

Identifier names can't contain characters outside of BMP, i.e they are restricted to utf8mb3
Here is a relevant part of Slack discussion on why it is so, and on possible fix

{noformat}
... discussion on character_set_system and why it is utf8mb3...
....
bar Oct 13th, 2021 at 4:23 PM
@wlad yes, it's hard-coded. I think the biggest problem is to implement table-name-to-file-name encoding for non-BMP characters. Should be doable but needs some time.
5 replies

wlad 3 months ago
so, a surrogate pair won't do? like, @d801@dc37

bar 3 months ago
for characters that do not have lower/upper variants, it will do.

bar 3 months ago
It will actually do for characters that have lower/upper variants as well.

bar 3 months ago
Thanks for the good idea.
{noformat}

h1. Table name to file name encoding extension

We need to extend the encoding to support:
- new case folding in the BMP range appeared between Unicode-3.0.0 (used in the first version of the encoding) and Unicode-14.0.0 (the current version in MariaDB).
- non-BMP characters in the range U+010000 to U+10FFFF without case folding
- non-BMP characters in the range U+010000 to U+10FFFF with case folding

h1. non-BMP Encoding without case folding

Let's encode non-BMP characters which do not have case folding as follows:
{noformat}
[@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v]
{noformat}
where:
{noformat}
@ - the encoded character marker
+ - the marker for non-BMP character without case folding
[0-9a-v] - the first digit (32 values)
[0-9a-v] - the second digit (32 values)
[0-9a-v] - the third digit (32 values)
[0-9a-v] - the fourth digit (32 values)
{noformat}
The total sequence length is 6 characters.

This encoding gives total 32*32*32*32 = 1048576 values
It covers exactly all non-BMP characters U+010000 to U+10FFFF.

Examples
{noformat}
@+0000 - U+010000 = 0x10000 + 0*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+1000 - U+018000 = 0x10000 + 1*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+aaaa - U+06294A = 0x10000 + 10*(32^3) + 10*(32^2) + 10*(32^1) + 10
@+vvvv - U+10FFFF 0 0x10000 + 31*(32^3) + 31*(32^2) + 31*(32^1) + 31
{noformat}

h1. BMP characters with new case folding mappings

The Unicode version 14.0.0 has more casefolding mappings in addition to those existed Unicode-3.0.0 (used in the original version of the file name encoding).

Let's encode such characters with new casefolding as follows:
{noformat}
[@][0-9a-f][0-9a-f][g-z][0-9a-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[0-9a-f] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The encoded sequence represents the Unicode code point of the lower case variant of a character.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case
- If it is in the range [G-Z], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire BMP range U+0000 to U+FFFF.

If in the future we switch to a new Unicode version with more casefolding mapping in the BMP range, we'll be able to encode all such characters.

The mapping between the third digit and it hex value:
{noformat}
0123456789abcdef - the hex value
GHIJKLMNOPQRSTUV - the third digit, upper case
ghijklmnopqrstuv - the third digit, lower case
{noformat}
So for example, the hex value of 7 corresponds to
- the digit 'N' in case of a upper-case character
- the digit 'n' in case of a lower-case character.

For example, Unicode-14.0.0 has the following new casefolding mapping which does not exist in the original file name encoding.

{noformat}
U+0180 LATIN SMALL LETTER B WITH STROKE
U+0243 LATIN CAPITAL LETTER B WITH STROKE
{noformat}

These characters will be encoded as:
{noformat}
@01o0 - the code point U+0180
@01O0 - the code point U+0243, or literally, UPPER case of the code point U+0180
{noformat}

Another example: Unicode-14.0.0 has the following new casefolding mapping:
{noformat}
U+0500 CYRILLIC CAPITAL LETTER KOMI DE
U+0501 CYRILLIC SMALL LETTER KOMI DE
{noformat}

These characters will be encoded as:
{noformat}
@05G1 - the code point U+0500, or literally, UPPER case of the code point U+0501
@05g1 - the code poing U+0501
{noformat}

h1. non-BMP characters with case folding.

As of Unicode version 16.0.0, casefolding data presents only in:
- Plane-0 (BMP)
- Plane1 (U+10000..U+1FFFF).

Let's encode Plane-1 letters with casefolding as follows:

{noformat}
[@][0-9a-f][0-9a-f][g-z][g-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[g-z] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case
- If it is in the range [G-Z], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire Plane1 range U+10000 to U+1FFFF.

If in the future we switch to a new Unicode version (from 14.0.0) with more casefolding mapping in the Plane-1 range, we'll be able to encode all such characters.

Example. Unicode-14.0.0 has casefolding between the following characters:
{noformat}
U+10400 DESERET CAPITAL LETTER LONG I
U+10428 DESERET SMALL LETTER LONG I
{noformat}

They will be encoded as:
{noformat}
@04I8 - the code point U+10400, or literally UPPER(code point U+10428)
@04i8 - the code point U+10428
{noformat}

Alexander Barkov made changes - 2024-09-10 15:41

Description

Identifier names can't contain characters outside of BMP, i.e they are restricted to utf8mb3
Here is a relevant part of Slack discussion on why it is so, and on possible fix

{noformat}
... discussion on character_set_system and why it is utf8mb3...
....
bar Oct 13th, 2021 at 4:23 PM
@wlad yes, it's hard-coded. I think the biggest problem is to implement table-name-to-file-name encoding for non-BMP characters. Should be doable but needs some time.
5 replies

wlad 3 months ago
so, a surrogate pair won't do? like, @d801@dc37

bar 3 months ago
for characters that do not have lower/upper variants, it will do.

bar 3 months ago
It will actually do for characters that have lower/upper variants as well.

bar 3 months ago
Thanks for the good idea.
{noformat}

h1. Table name to file name encoding extension

We need to extend the encoding to support:
- new case folding in the BMP range appeared between Unicode-3.0.0 (used in the first version of the encoding) and Unicode-14.0.0 (the current version in MariaDB).
- non-BMP characters in the range U+010000 to U+10FFFF without case folding
- non-BMP characters in the range U+010000 to U+10FFFF with case folding

h1. non-BMP Encoding without case folding

Let's encode non-BMP characters which do not have case folding as follows:
{noformat}
[@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v]
{noformat}
where:
{noformat}
@ - the encoded character marker
+ - the marker for non-BMP character without case folding
[0-9a-v] - the first digit (32 values)
[0-9a-v] - the second digit (32 values)
[0-9a-v] - the third digit (32 values)
[0-9a-v] - the fourth digit (32 values)
{noformat}
The total sequence length is 6 characters.

This encoding gives total 32*32*32*32 = 1048576 values
It covers exactly all non-BMP characters U+010000 to U+10FFFF.

Examples
{noformat}
@+0000 - U+010000 = 0x10000 + 0*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+1000 - U+018000 = 0x10000 + 1*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+aaaa - U+06294A = 0x10000 + 10*(32^3) + 10*(32^2) + 10*(32^1) + 10
@+vvvv - U+10FFFF 0 0x10000 + 31*(32^3) + 31*(32^2) + 31*(32^1) + 31
{noformat}

h1. BMP characters with new case folding mappings

The Unicode version 14.0.0 has more casefolding mappings in addition to those existed Unicode-3.0.0 (used in the original version of the file name encoding).

Let's encode such characters with new casefolding as follows:
{noformat}
[@][0-9a-f][0-9a-f][g-z][0-9a-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[0-9a-f] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The encoded sequence represents the Unicode code point of the lower case variant of a character.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case
- If it is in the range [G-Z], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire BMP range U+0000 to U+FFFF.

If in the future we switch to a new Unicode version with more casefolding mapping in the BMP range, we'll be able to encode all such characters.

The mapping between the third digit and it hex value:
{noformat}
0123456789abcdef - the hex value
GHIJKLMNOPQRSTUV - the third digit, upper case
ghijklmnopqrstuv - the third digit, lower case
{noformat}
So for example, the hex value of 7 corresponds to
- the digit 'N' in case of a upper-case character
- the digit 'n' in case of a lower-case character.

For example, Unicode-14.0.0 has the following new casefolding mapping which does not exist in the original file name encoding.

{noformat}
U+0180 LATIN SMALL LETTER B WITH STROKE
U+0243 LATIN CAPITAL LETTER B WITH STROKE
{noformat}

These characters will be encoded as:
{noformat}
@01o0 - the code point U+0180
@01O0 - the code point U+0243, or literally, UPPER case of the code point U+0180
{noformat}

Another example: Unicode-14.0.0 has the following new casefolding mapping:
{noformat}
U+0500 CYRILLIC CAPITAL LETTER KOMI DE
U+0501 CYRILLIC SMALL LETTER KOMI DE
{noformat}

These characters will be encoded as:
{noformat}
@05G1 - the code point U+0500, or literally, UPPER case of the code point U+0501
@05g1 - the code poing U+0501
{noformat}

h1. non-BMP characters with case folding.

As of Unicode version 16.0.0, casefolding data presents only in:
- Plane-0 (BMP)
- Plane1 (U+10000..U+1FFFF).

Let's encode Plane-1 letters with casefolding as follows:

{noformat}
[@][0-9a-f][0-9a-f][g-z][g-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[g-z] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case
- If it is in the range [G-Z], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire Plane1 range U+10000 to U+1FFFF.

If in the future we switch to a new Unicode version (from 14.0.0) with more casefolding mapping in the Plane-1 range, we'll be able to encode all such characters.

Example. Unicode-14.0.0 has casefolding between the following characters:
{noformat}
U+10400 DESERET CAPITAL LETTER LONG I
U+10428 DESERET SMALL LETTER LONG I
{noformat}

They will be encoded as:
{noformat}
@04I8 - the code point U+10400, or literally UPPER(code point U+10428)
@04i8 - the code point U+10428
{noformat}

Identifier names can't contain characters outside of BMP, i.e they are restricted to utf8mb3
Here is a relevant part of Slack discussion on why it is so, and on possible fix

{noformat}
... discussion on character_set_system and why it is utf8mb3...
....
bar Oct 13th, 2021 at 4:23 PM
@wlad yes, it's hard-coded. I think the biggest problem is to implement table-name-to-file-name encoding for non-BMP characters. Should be doable but needs some time.
5 replies

wlad 3 months ago
so, a surrogate pair won't do? like, @d801@dc37

bar 3 months ago
for characters that do not have lower/upper variants, it will do.

bar 3 months ago
It will actually do for characters that have lower/upper variants as well.

bar 3 months ago
Thanks for the good idea.
{noformat}

h1. Table name to file name encoding extension

We need to extend the encoding to support:
- new case folding in the BMP range appeared between Unicode-3.0.0 (used in the first version of the encoding) and Unicode-14.0.0 (the current version in MariaDB).
- non-BMP characters in the range U+010000 to U+10FFFF without case folding
- non-BMP characters in the range U+010000 to U+10FFFF with case folding

h1. non-BMP Encoding without case folding

Let's encode non-BMP characters which do not have case folding as follows:
{noformat}
[@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v]
{noformat}
where:
{noformat}
@ - the encoded character marker
+ - the marker for non-BMP character without case folding
[0-9a-v] - the first digit (32 values)
[0-9a-v] - the second digit (32 values)
[0-9a-v] - the third digit (32 values)
[0-9a-v] - the fourth digit (32 values)
{noformat}
The total sequence length is 6 characters.

This encoding gives total 32*32*32*32 = 1048576 values
It covers exactly all non-BMP characters U+010000 to U+10FFFF.

Examples
{noformat}
@+0000 - U+010000 = 0x10000 + 0*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+1000 - U+018000 = 0x10000 + 1*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+aaaa - U+06294A = 0x10000 + 10*(32^3) + 10*(32^2) + 10*(32^1) + 10
@+vvvv - U+10FFFF 0 0x10000 + 31*(32^3) + 31*(32^2) + 31*(32^1) + 31
{noformat}

h1. BMP characters with new case folding mappings

The Unicode version 14.0.0 has more casefolding mappings in addition to those existed Unicode-3.0.0 (used in the original version of the file name encoding).

Let's encode such characters with new casefolding as follows:
{noformat}
[@][0-9a-f][0-9a-f][g-z][0-9a-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[0-9a-f] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The encoded sequence represents the Unicode code point of the lower case variant of a character.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case
- If it is in the range [G-Z], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire BMP range U+0000 to U+FFFF.

If in the future we switch to a new Unicode version with more casefolding mapping in the BMP range, we'll be able to encode all such characters.

The mapping between the third digit and it hex value:
{noformat}
0123456789abcdef - the hex value
GHIJKLMNOPQRSTUV - the third digit, upper case
ghijklmnopqrstuv - the third digit, lower case
{noformat}
So for example, the hex value of 7 corresponds to
- the digit 'N' in case of a upper-case character
- the digit 'n' in case of a lower-case character.

For example, Unicode-14.0.0 has the following new casefolding mapping which does not exist in the original file name encoding.

{noformat}
U+0180 LATIN SMALL LETTER B WITH STROKE
U+0243 LATIN CAPITAL LETTER B WITH STROKE
{noformat}

These characters will be encoded as:
{noformat}
@01o0 - the code point U+0180
@01O0 - the code point U+0243, or literally, UPPER case of the code point U+0180
{noformat}

Another example: Unicode-14.0.0 has the following new casefolding mapping:
{noformat}
U+0500 CYRILLIC CAPITAL LETTER KOMI DE
U+0501 CYRILLIC SMALL LETTER KOMI DE
{noformat}

These characters will be encoded as:
{noformat}
@05G1 - the code point U+0500, or literally, UPPER case of the code point U+0501
@05g1 - the code poing U+0501
{noformat}

h1. non-BMP characters with case folding.

As of Unicode version 14.0.0 (and even in 16.0.0), casefolding data presents only in:
- Plane-0 (BMP)
- Plane1 (U+10000..U+1FFFF).

Let's encode Plane-1 letters with casefolding as follows:

{noformat}
[@][0-9a-f][0-9a-f][g-z][g-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[g-z] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case
- If it is in the range [G-Z], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire Plane1 range U+10000 to U+1FFFF.

If in the future we switch to a new Unicode version (from 14.0.0) with more casefolding mapping in the Plane-1 range, we'll be able to encode all such characters.

Example. Unicode-14.0.0 has casefolding between the following characters:
{noformat}
U+10400 DESERET CAPITAL LETTER LONG I
U+10428 DESERET SMALL LETTER LONG I
{noformat}

They will be encoded as:
{noformat}
@04I8 - the code point U+10400, or literally UPPER(code point U+10428)
@04i8 - the code point U+10428
{noformat}

Alexander Barkov made changes - 2024-09-11 06:55

Description

Identifier names can't contain characters outside of BMP, i.e they are restricted to utf8mb3
Here is a relevant part of Slack discussion on why it is so, and on possible fix

{noformat}
... discussion on character_set_system and why it is utf8mb3...
....
bar Oct 13th, 2021 at 4:23 PM
@wlad yes, it's hard-coded. I think the biggest problem is to implement table-name-to-file-name encoding for non-BMP characters. Should be doable but needs some time.
5 replies

wlad 3 months ago
so, a surrogate pair won't do? like, @d801@dc37

bar 3 months ago
for characters that do not have lower/upper variants, it will do.

bar 3 months ago
It will actually do for characters that have lower/upper variants as well.

bar 3 months ago
Thanks for the good idea.
{noformat}

h1. Table name to file name encoding extension

We need to extend the encoding to support:
- new case folding in the BMP range appeared between Unicode-3.0.0 (used in the first version of the encoding) and Unicode-14.0.0 (the current version in MariaDB).
- non-BMP characters in the range U+010000 to U+10FFFF without case folding
- non-BMP characters in the range U+010000 to U+10FFFF with case folding

h1. non-BMP Encoding without case folding

Let's encode non-BMP characters which do not have case folding as follows:
{noformat}
[@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v]
{noformat}
where:
{noformat}
@ - the encoded character marker
+ - the marker for non-BMP character without case folding
[0-9a-v] - the first digit (32 values)
[0-9a-v] - the second digit (32 values)
[0-9a-v] - the third digit (32 values)
[0-9a-v] - the fourth digit (32 values)
{noformat}
The total sequence length is 6 characters.

This encoding gives total 32*32*32*32 = 1048576 values
It covers exactly all non-BMP characters U+010000 to U+10FFFF.

Examples
{noformat}
@+0000 - U+010000 = 0x10000 + 0*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+1000 - U+018000 = 0x10000 + 1*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+aaaa - U+06294A = 0x10000 + 10*(32^3) + 10*(32^2) + 10*(32^1) + 10
@+vvvv - U+10FFFF 0 0x10000 + 31*(32^3) + 31*(32^2) + 31*(32^1) + 31
{noformat}

h1. BMP characters with new case folding mappings

The Unicode version 14.0.0 has more casefolding mappings in addition to those existed Unicode-3.0.0 (used in the original version of the file name encoding).

Let's encode such characters with new casefolding as follows:
{noformat}
[@][0-9a-f][0-9a-f][g-z][0-9a-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[0-9a-f] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The encoded sequence represents the Unicode code point of the lower case variant of a character.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case
- If it is in the range [G-Z], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire BMP range U+0000 to U+FFFF.

If in the future we switch to a new Unicode version with more casefolding mapping in the BMP range, we'll be able to encode all such characters.

The mapping between the third digit and it hex value:
{noformat}
0123456789abcdef - the hex value
GHIJKLMNOPQRSTUV - the third digit, upper case
ghijklmnopqrstuv - the third digit, lower case
{noformat}
So for example, the hex value of 7 corresponds to
- the digit 'N' in case of a upper-case character
- the digit 'n' in case of a lower-case character.

For example, Unicode-14.0.0 has the following new casefolding mapping which does not exist in the original file name encoding.

{noformat}
U+0180 LATIN SMALL LETTER B WITH STROKE
U+0243 LATIN CAPITAL LETTER B WITH STROKE
{noformat}

These characters will be encoded as:
{noformat}
@01o0 - the code point U+0180
@01O0 - the code point U+0243, or literally, UPPER case of the code point U+0180
{noformat}

Another example: Unicode-14.0.0 has the following new casefolding mapping:
{noformat}
U+0500 CYRILLIC CAPITAL LETTER KOMI DE
U+0501 CYRILLIC SMALL LETTER KOMI DE
{noformat}

These characters will be encoded as:
{noformat}
@05G1 - the code point U+0500, or literally, UPPER case of the code point U+0501
@05g1 - the code poing U+0501
{noformat}

h1. non-BMP characters with case folding.

As of Unicode version 14.0.0 (and even in 16.0.0), casefolding data presents only in:
- Plane-0 (BMP)
- Plane1 (U+10000..U+1FFFF).

Let's encode Plane-1 letters with casefolding as follows:

{noformat}
[@][0-9a-f][0-9a-f][g-z][g-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[g-z] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case
- If it is in the range [G-Z], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire Plane1 range U+10000 to U+1FFFF.

If in the future we switch to a new Unicode version (from 14.0.0) with more casefolding mapping in the Plane-1 range, we'll be able to encode all such characters.

Example. Unicode-14.0.0 has casefolding between the following characters:
{noformat}
U+10400 DESERET CAPITAL LETTER LONG I
U+10428 DESERET SMALL LETTER LONG I
{noformat}

They will be encoded as:
{noformat}
@04I8 - the code point U+10400, or literally UPPER(code point U+10428)
@04i8 - the code point U+10428
{noformat}

Identifier names can't contain characters outside of BMP, i.e they are restricted to utf8mb3
Here is a relevant part of Slack discussion on why it is so, and on possible fix

{noformat}
... discussion on character_set_system and why it is utf8mb3...
....
bar Oct 13th, 2021 at 4:23 PM
@wlad yes, it's hard-coded. I think the biggest problem is to implement table-name-to-file-name encoding for non-BMP characters. Should be doable but needs some time.
5 replies

wlad 3 months ago
so, a surrogate pair won't do? like, @d801@dc37

bar 3 months ago
for characters that do not have lower/upper variants, it will do.

bar 3 months ago
It will actually do for characters that have lower/upper variants as well.

bar 3 months ago
Thanks for the good idea.
{noformat}

h1. Table name to file name encoding extension

We need to extend the encoding to support:
- new case folding in the BMP range appeared between Unicode-3.0.0 (used in the first version of the encoding) and Unicode-14.0.0 (the current version in MariaDB).
- non-BMP characters in the range U+010000 to U+10FFFF without case folding
- non-BMP characters in the range U+010000 to U+10FFFF with case folding

h1. non-BMP Encoding without case folding

Let's encode non-BMP characters which do not have case folding as follows:
{noformat}
[@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v]
{noformat}
where:
{noformat}
@ - the encoded character marker
+ - the marker for non-BMP character without case folding
[0-9a-v] - the first digit (32 values)
[0-9a-v] - the second digit (32 values)
[0-9a-v] - the third digit (32 values)
[0-9a-v] - the fourth digit (32 values)
{noformat}
The total sequence length is 6 characters.

This encoding gives total 32*32*32*32 = 1048576 values
It covers exactly all non-BMP characters U+010000 to U+10FFFF.

Examples
{noformat}
@+0000 - U+010000 = 0x10000 + 0*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+1000 - U+018000 = 0x10000 + 1*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+aaaa - U+06294A = 0x10000 + 10*(32^3) + 10*(32^2) + 10*(32^1) + 10
@+vvvv - U+10FFFF 0 0x10000 + 31*(32^3) + 31*(32^2) + 31*(32^1) + 31
{noformat}

h2. Summary of the encoding components

After adding the mentioned extensions, the encoding will consist of the components:

{noformat}
Pattern CodePoints Comment
------------------------------------ ---------- ----------------
[@][0..9][g..z] 200 BMP characters with 3.0.0 case folding
[@][g..z][0..9] 200 BMP characters with 3.0.0 case folding
[@][g..z][a..z] 520 BMP characters with 3.0.0 case folding
[@][@][a..z] 26 BMP characters with 3.0.0 case folding
[@][a..z][@] 26 BMP characters with 3.0.0 case folding
[@][a..f][g..z] 320 Unused
[@][0-9a-f][0-9a-f][0-9a-f][0-9a-f] 65536 BMP characters without case folding
[@][0-9a-f][0-9a-f][g-z][0-9a-f] 65536 BMP characters with 14.0.0 case folding
[@][0-9a-f][0-9a-f][g-z][g-z] 65536 non-BMP characters with case folding (Plane 1 only)
[@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v] 1048576 non-BMP characters without case folding
{noformat}

h1. BMP characters with new case folding mappings

The Unicode version 14.0.0 has more casefolding mappings in addition to those existed Unicode-3.0.0 (used in the original version of the file name encoding).

Let's encode such characters with new casefolding as follows:
{noformat}
[@][0-9a-f][0-9a-f][g-z][0-9a-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[0-9a-f] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The encoded sequence represents the Unicode code point of the lower case variant of a character.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case
- If it is in the range [G-Z], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire BMP range U+0000 to U+FFFF.

If in the future we switch to a new Unicode version with more casefolding mapping in the BMP range, we'll be able to encode all such characters.

The mapping between the third digit and it hex value:
{noformat}
0123456789abcdef - the hex value
GHIJKLMNOPQRSTUV - the third digit, upper case
ghijklmnopqrstuv - the third digit, lower case
{noformat}
So for example, the hex value of 7 corresponds to
- the digit 'N' in case of a upper-case character
- the digit 'n' in case of a lower-case character.

For example, Unicode-14.0.0 has the following new casefolding mapping which does not exist in the original file name encoding.

{noformat}
U+0180 LATIN SMALL LETTER B WITH STROKE
U+0243 LATIN CAPITAL LETTER B WITH STROKE
{noformat}

These characters will be encoded as:
{noformat}
@01o0 - the code point U+0180
@01O0 - the code point U+0243, or literally, UPPER case of the code point U+0180
{noformat}

Another example: Unicode-14.0.0 has the following new casefolding mapping:
{noformat}
U+0500 CYRILLIC CAPITAL LETTER KOMI DE
U+0501 CYRILLIC SMALL LETTER KOMI DE
{noformat}

These characters will be encoded as:
{noformat}
@05G1 - the code point U+0500, or literally, UPPER case of the code point U+0501
@05g1 - the code poing U+0501
{noformat}

h1. non-BMP characters with case folding.

As of Unicode version 14.0.0 (and even in 16.0.0), casefolding data presents only in:
- Plane-0 (BMP)
- Plane1 (U+10000..U+1FFFF).

Let's encode Plane-1 letters with casefolding as follows:

{noformat}
[@][0-9a-f][0-9a-f][g-z][g-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[g-z] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case
- If it is in the range [G-Z], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire Plane1 range U+10000 to U+1FFFF.

If in the future we switch to a new Unicode version (from 14.0.0) with more casefolding mapping in the Plane-1 range, we'll be able to encode all such characters.

Example. Unicode-14.0.0 has casefolding between the following characters:
{noformat}
U+10400 DESERET CAPITAL LETTER LONG I
U+10428 DESERET SMALL LETTER LONG I
{noformat}

They will be encoded as:
{noformat}
@04I8 - the code point U+10400, or literally UPPER(code point U+10428)
@04i8 - the code point U+10428
{noformat}

Alexander Barkov made changes - 2024-09-11 06:56

Description

Identifier names can't contain characters outside of BMP, i.e they are restricted to utf8mb3
Here is a relevant part of Slack discussion on why it is so, and on possible fix

{noformat}
... discussion on character_set_system and why it is utf8mb3...
....
bar Oct 13th, 2021 at 4:23 PM
@wlad yes, it's hard-coded. I think the biggest problem is to implement table-name-to-file-name encoding for non-BMP characters. Should be doable but needs some time.
5 replies

wlad 3 months ago
so, a surrogate pair won't do? like, @d801@dc37

bar 3 months ago
for characters that do not have lower/upper variants, it will do.

bar 3 months ago
It will actually do for characters that have lower/upper variants as well.

bar 3 months ago
Thanks for the good idea.
{noformat}

h1. Table name to file name encoding extension

We need to extend the encoding to support:
- new case folding in the BMP range appeared between Unicode-3.0.0 (used in the first version of the encoding) and Unicode-14.0.0 (the current version in MariaDB).
- non-BMP characters in the range U+010000 to U+10FFFF without case folding
- non-BMP characters in the range U+010000 to U+10FFFF with case folding

h1. non-BMP Encoding without case folding

Let's encode non-BMP characters which do not have case folding as follows:
{noformat}
[@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v]
{noformat}
where:
{noformat}
@ - the encoded character marker
+ - the marker for non-BMP character without case folding
[0-9a-v] - the first digit (32 values)
[0-9a-v] - the second digit (32 values)
[0-9a-v] - the third digit (32 values)
[0-9a-v] - the fourth digit (32 values)
{noformat}
The total sequence length is 6 characters.

This encoding gives total 32*32*32*32 = 1048576 values
It covers exactly all non-BMP characters U+010000 to U+10FFFF.

Examples
{noformat}
@+0000 - U+010000 = 0x10000 + 0*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+1000 - U+018000 = 0x10000 + 1*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+aaaa - U+06294A = 0x10000 + 10*(32^3) + 10*(32^2) + 10*(32^1) + 10
@+vvvv - U+10FFFF 0 0x10000 + 31*(32^3) + 31*(32^2) + 31*(32^1) + 31
{noformat}

h2. Summary of the encoding components

After adding the mentioned extensions, the encoding will consist of the components:

{noformat}
Pattern CodePoints Comment
------------------------------------ ---------- ----------------
[@][0..9][g..z] 200 BMP characters with 3.0.0 case folding
[@][g..z][0..9] 200 BMP characters with 3.0.0 case folding
[@][g..z][a..z] 520 BMP characters with 3.0.0 case folding
[@][@][a..z] 26 BMP characters with 3.0.0 case folding
[@][a..z][@] 26 BMP characters with 3.0.0 case folding
[@][a..f][g..z] 320 Unused
[@][0-9a-f][0-9a-f][0-9a-f][0-9a-f] 65536 BMP characters without case folding
[@][0-9a-f][0-9a-f][g-z][0-9a-f] 65536 BMP characters with 14.0.0 case folding
[@][0-9a-f][0-9a-f][g-z][g-z] 65536 non-BMP characters with case folding (Plane 1 only)
[@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v] 1048576 non-BMP characters without case folding
{noformat}

h1. BMP characters with new case folding mappings

The Unicode version 14.0.0 has more casefolding mappings in addition to those existed Unicode-3.0.0 (used in the original version of the file name encoding).

Let's encode such characters with new casefolding as follows:
{noformat}
[@][0-9a-f][0-9a-f][g-z][0-9a-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[0-9a-f] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The encoded sequence represents the Unicode code point of the lower case variant of a character.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case
- If it is in the range [G-Z], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire BMP range U+0000 to U+FFFF.

If in the future we switch to a new Unicode version with more casefolding mapping in the BMP range, we'll be able to encode all such characters.

The mapping between the third digit and it hex value:
{noformat}
0123456789abcdef - the hex value
GHIJKLMNOPQRSTUV - the third digit, upper case
ghijklmnopqrstuv - the third digit, lower case
{noformat}
So for example, the hex value of 7 corresponds to
- the digit 'N' in case of a upper-case character
- the digit 'n' in case of a lower-case character.

For example, Unicode-14.0.0 has the following new casefolding mapping which does not exist in the original file name encoding.

{noformat}
U+0180 LATIN SMALL LETTER B WITH STROKE
U+0243 LATIN CAPITAL LETTER B WITH STROKE
{noformat}

These characters will be encoded as:
{noformat}
@01o0 - the code point U+0180
@01O0 - the code point U+0243, or literally, UPPER case of the code point U+0180
{noformat}

Another example: Unicode-14.0.0 has the following new casefolding mapping:
{noformat}
U+0500 CYRILLIC CAPITAL LETTER KOMI DE
U+0501 CYRILLIC SMALL LETTER KOMI DE
{noformat}

These characters will be encoded as:
{noformat}
@05G1 - the code point U+0500, or literally, UPPER case of the code point U+0501
@05g1 - the code poing U+0501
{noformat}

h1. non-BMP characters with case folding.

As of Unicode version 14.0.0 (and even in 16.0.0), casefolding data presents only in:
- Plane-0 (BMP)
- Plane1 (U+10000..U+1FFFF).

Let's encode Plane-1 letters with casefolding as follows:

{noformat}
[@][0-9a-f][0-9a-f][g-z][g-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[g-z] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case
- If it is in the range [G-Z], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire Plane1 range U+10000 to U+1FFFF.

If in the future we switch to a new Unicode version (from 14.0.0) with more casefolding mapping in the Plane-1 range, we'll be able to encode all such characters.

Example. Unicode-14.0.0 has casefolding between the following characters:
{noformat}
U+10400 DESERET CAPITAL LETTER LONG I
U+10428 DESERET SMALL LETTER LONG I
{noformat}

They will be encoded as:
{noformat}
@04I8 - the code point U+10400, or literally UPPER(code point U+10428)
@04i8 - the code point U+10428
{noformat}

Identifier names can't contain characters outside of BMP, i.e they are restricted to utf8mb3
Here is a relevant part of Slack discussion on why it is so, and on possible fix

{noformat}
... discussion on character_set_system and why it is utf8mb3...
....
bar Oct 13th, 2021 at 4:23 PM
@wlad yes, it's hard-coded. I think the biggest problem is to implement table-name-to-file-name encoding for non-BMP characters. Should be doable but needs some time.
5 replies

wlad 3 months ago
so, a surrogate pair won't do? like, @d801@dc37

bar 3 months ago
for characters that do not have lower/upper variants, it will do.

bar 3 months ago
It will actually do for characters that have lower/upper variants as well.

bar 3 months ago
Thanks for the good idea.
{noformat}

h1. Table name to file name encoding extension

We need to extend the encoding to support:
- new case folding in the BMP range appeared between Unicode-3.0.0 (used in the first version of the encoding) and Unicode-14.0.0 (the current version in MariaDB).
- non-BMP characters in the range U+010000 to U+10FFFF without case folding
- non-BMP characters in the range U+010000 to U+10FFFF with case folding

h1. non-BMP Encoding without case folding

Let's encode non-BMP characters which do not have case folding as follows:
{noformat}
[@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v]
{noformat}
where:
{noformat}
@ - the encoded character marker
+ - the marker for non-BMP character without case folding
[0-9a-v] - the first digit (32 values)
[0-9a-v] - the second digit (32 values)
[0-9a-v] - the third digit (32 values)
[0-9a-v] - the fourth digit (32 values)
{noformat}
The total sequence length is 6 characters.

This encoding gives total 32*32*32*32 = 1048576 values
It covers exactly all non-BMP characters U+010000 to U+10FFFF.

Examples
{noformat}
@+0000 - U+010000 = 0x10000 + 0*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+1000 - U+018000 = 0x10000 + 1*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+aaaa - U+06294A = 0x10000 + 10*(32^3) + 10*(32^2) + 10*(32^1) + 10
@+vvvv - U+10FFFF 0 0x10000 + 31*(32^3) + 31*(32^2) + 31*(32^1) + 31
{noformat}

h1. BMP characters with new case folding mappings

The Unicode version 14.0.0 has more casefolding mappings in addition to those existed Unicode-3.0.0 (used in the original version of the file name encoding).

Let's encode such characters with new casefolding as follows:
{noformat}
[@][0-9a-f][0-9a-f][g-z][0-9a-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[0-9a-f] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The encoded sequence represents the Unicode code point of the lower case variant of a character.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case
- If it is in the range [G-Z], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire BMP range U+0000 to U+FFFF.

If in the future we switch to a new Unicode version with more casefolding mapping in the BMP range, we'll be able to encode all such characters.

The mapping between the third digit and it hex value:
{noformat}
0123456789abcdef - the hex value
GHIJKLMNOPQRSTUV - the third digit, upper case
ghijklmnopqrstuv - the third digit, lower case
{noformat}
So for example, the hex value of 7 corresponds to
- the digit 'N' in case of a upper-case character
- the digit 'n' in case of a lower-case character.

For example, Unicode-14.0.0 has the following new casefolding mapping which does not exist in the original file name encoding.

{noformat}
U+0180 LATIN SMALL LETTER B WITH STROKE
U+0243 LATIN CAPITAL LETTER B WITH STROKE
{noformat}

These characters will be encoded as:
{noformat}
@01o0 - the code point U+0180
@01O0 - the code point U+0243, or literally, UPPER case of the code point U+0180
{noformat}

Another example: Unicode-14.0.0 has the following new casefolding mapping:
{noformat}
U+0500 CYRILLIC CAPITAL LETTER KOMI DE
U+0501 CYRILLIC SMALL LETTER KOMI DE
{noformat}

These characters will be encoded as:
{noformat}
@05G1 - the code point U+0500, or literally, UPPER case of the code point U+0501
@05g1 - the code poing U+0501
{noformat}

h1. non-BMP characters with case folding.

As of Unicode version 14.0.0 (and even in 16.0.0), casefolding data presents only in:
- Plane-0 (BMP)
- Plane1 (U+10000..U+1FFFF).

Let's encode Plane-1 letters with casefolding as follows:

{noformat}
[@][0-9a-f][0-9a-f][g-z][g-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[g-z] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case
- If it is in the range [G-Z], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire Plane1 range U+10000 to U+1FFFF.

If in the future we switch to a new Unicode version (from 14.0.0) with more casefolding mapping in the Plane-1 range, we'll be able to encode all such characters.

Example. Unicode-14.0.0 has casefolding between the following characters:
{noformat}
U+10400 DESERET CAPITAL LETTER LONG I
U+10428 DESERET SMALL LETTER LONG I
{noformat}

They will be encoded as:
{noformat}
@04I8 - the code point U+10400, or literally UPPER(code point U+10428)
@04i8 - the code point U+10428
{noformat}

h1. Summary of the encoding components

After adding the mentioned extensions, the encoding will consist of the components:

{noformat}
Pattern CodePoints Comment
------------------------------------ ---------- ----------------
[@][0..9][g..z] 200 BMP characters with 3.0.0 case folding
[@][g..z][0..9] 200 BMP characters with 3.0.0 case folding
[@][g..z][a..z] 520 BMP characters with 3.0.0 case folding
[@][@][a..z] 26 BMP characters with 3.0.0 case folding
[@][a..z][@] 26 BMP characters with 3.0.0 case folding
[@][a..f][g..z] 320 Unused
[@][0-9a-f][0-9a-f][0-9a-f][0-9a-f] 65536 BMP characters without case folding
[@][0-9a-f][0-9a-f][g-z][0-9a-f] 65536 BMP characters with 14.0.0 case folding
[@][0-9a-f][0-9a-f][g-z][g-z] 65536 non-BMP characters with case folding (Plane 1 only)
[@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v] 1048576 non-BMP characters without case folding
{noformat}

Alexander Barkov made changes - 2024-09-11 07:11

Description

Identifier names can't contain characters outside of BMP, i.e they are restricted to utf8mb3
Here is a relevant part of Slack discussion on why it is so, and on possible fix

{noformat}
... discussion on character_set_system and why it is utf8mb3...
....
bar Oct 13th, 2021 at 4:23 PM
@wlad yes, it's hard-coded. I think the biggest problem is to implement table-name-to-file-name encoding for non-BMP characters. Should be doable but needs some time.
5 replies

wlad 3 months ago
so, a surrogate pair won't do? like, @d801@dc37

bar 3 months ago
for characters that do not have lower/upper variants, it will do.

bar 3 months ago
It will actually do for characters that have lower/upper variants as well.

bar 3 months ago
Thanks for the good idea.
{noformat}

h1. Table name to file name encoding extension

We need to extend the encoding to support:
- new case folding in the BMP range appeared between Unicode-3.0.0 (used in the first version of the encoding) and Unicode-14.0.0 (the current version in MariaDB).
- non-BMP characters in the range U+010000 to U+10FFFF without case folding
- non-BMP characters in the range U+010000 to U+10FFFF with case folding

h1. non-BMP Encoding without case folding

Let's encode non-BMP characters which do not have case folding as follows:
{noformat}
[@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v]
{noformat}
where:
{noformat}
@ - the encoded character marker
+ - the marker for non-BMP character without case folding
[0-9a-v] - the first digit (32 values)
[0-9a-v] - the second digit (32 values)
[0-9a-v] - the third digit (32 values)
[0-9a-v] - the fourth digit (32 values)
{noformat}
The total sequence length is 6 characters.

This encoding gives total 32*32*32*32 = 1048576 values
It covers exactly all non-BMP characters U+010000 to U+10FFFF.

Examples
{noformat}
@+0000 - U+010000 = 0x10000 + 0*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+1000 - U+018000 = 0x10000 + 1*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+aaaa - U+06294A = 0x10000 + 10*(32^3) + 10*(32^2) + 10*(32^1) + 10
@+vvvv - U+10FFFF 0 0x10000 + 31*(32^3) + 31*(32^2) + 31*(32^1) + 31
{noformat}

h1. BMP characters with new case folding mappings

The Unicode version 14.0.0 has more casefolding mappings in addition to those existed Unicode-3.0.0 (used in the original version of the file name encoding).

Let's encode such characters with new casefolding as follows:
{noformat}
[@][0-9a-f][0-9a-f][g-z][0-9a-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[0-9a-f] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The encoded sequence represents the Unicode code point of the lower case variant of a character.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case
- If it is in the range [G-Z], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire BMP range U+0000 to U+FFFF.

If in the future we switch to a new Unicode version with more casefolding mapping in the BMP range, we'll be able to encode all such characters.

The mapping between the third digit and it hex value:
{noformat}
0123456789abcdef - the hex value
GHIJKLMNOPQRSTUV - the third digit, upper case
ghijklmnopqrstuv - the third digit, lower case
{noformat}
So for example, the hex value of 7 corresponds to
- the digit 'N' in case of a upper-case character
- the digit 'n' in case of a lower-case character.

For example, Unicode-14.0.0 has the following new casefolding mapping which does not exist in the original file name encoding.

{noformat}
U+0180 LATIN SMALL LETTER B WITH STROKE
U+0243 LATIN CAPITAL LETTER B WITH STROKE
{noformat}

These characters will be encoded as:
{noformat}
@01o0 - the code point U+0180
@01O0 - the code point U+0243, or literally, UPPER case of the code point U+0180
{noformat}

Another example: Unicode-14.0.0 has the following new casefolding mapping:
{noformat}
U+0500 CYRILLIC CAPITAL LETTER KOMI DE
U+0501 CYRILLIC SMALL LETTER KOMI DE
{noformat}

These characters will be encoded as:
{noformat}
@05G1 - the code point U+0500, or literally, UPPER case of the code point U+0501
@05g1 - the code poing U+0501
{noformat}

h1. non-BMP characters with case folding.

As of Unicode version 14.0.0 (and even in 16.0.0), casefolding data presents only in:
- Plane-0 (BMP)
- Plane1 (U+10000..U+1FFFF).

Let's encode Plane-1 letters with casefolding as follows:

{noformat}
[@][0-9a-f][0-9a-f][g-z][g-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[g-z] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case
- If it is in the range [G-Z], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire Plane1 range U+10000 to U+1FFFF.

If in the future we switch to a new Unicode version (from 14.0.0) with more casefolding mapping in the Plane-1 range, we'll be able to encode all such characters.

Example. Unicode-14.0.0 has casefolding between the following characters:
{noformat}
U+10400 DESERET CAPITAL LETTER LONG I
U+10428 DESERET SMALL LETTER LONG I
{noformat}

They will be encoded as:
{noformat}
@04I8 - the code point U+10400, or literally UPPER(code point U+10428)
@04i8 - the code point U+10428
{noformat}

h1. Summary of the encoding components

After adding the mentioned extensions, the encoding will consist of the components:

{noformat}
Pattern CodePoints Comment
------------------------------------ ---------- ----------------
[@][0..9][g..z] 200 BMP characters with 3.0.0 case folding
[@][g..z][0..9] 200 BMP characters with 3.0.0 case folding
[@][g..z][a..z] 520 BMP characters with 3.0.0 case folding
[@][@][a..z] 26 BMP characters with 3.0.0 case folding
[@][a..z][@] 26 BMP characters with 3.0.0 case folding
[@][a..f][g..z] 320 Unused
[@][0-9a-f][0-9a-f][0-9a-f][0-9a-f] 65536 BMP characters without case folding
[@][0-9a-f][0-9a-f][g-z][0-9a-f] 65536 BMP characters with 14.0.0 case folding
[@][0-9a-f][0-9a-f][g-z][g-z] 65536 non-BMP characters with case folding (Plane 1 only)
[@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v] 1048576 non-BMP characters without case folding
{noformat}

Identifier names can't contain characters outside of BMP, i.e they are restricted to utf8mb3
Here is a relevant part of Slack discussion on why it is so, and on possible fix

{noformat}
... discussion on character_set_system and why it is utf8mb3...
....
bar Oct 13th, 2021 at 4:23 PM
@wlad yes, it's hard-coded. I think the biggest problem is to implement table-name-to-file-name encoding for non-BMP characters. Should be doable but needs some time.
5 replies

wlad 3 months ago
so, a surrogate pair won't do? like, @d801@dc37

bar 3 months ago
for characters that do not have lower/upper variants, it will do.

bar 3 months ago
It will actually do for characters that have lower/upper variants as well.

bar 3 months ago
Thanks for the good idea.
{noformat}

h1. Table name to file name encoding extension

We need to extend the encoding to support:
- new case folding in the BMP range appeared between Unicode-3.0.0 (used in the first version of the encoding) and Unicode-14.0.0 (the current version in MariaDB).
- non-BMP characters in the range U+010000 to U+10FFFF without case folding
- non-BMP characters in the range U+010000 to U+10FFFF with case folding

h1. non-BMP Encoding without case folding

Let's encode non-BMP characters which do not have case folding as follows:
{noformat}
[@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v]
{noformat}
where:
{noformat}
@ - the encoded character marker
+ - the marker for non-BMP character without case folding
[0-9a-v] - the first digit (32 values)
[0-9a-v] - the second digit (32 values)
[0-9a-v] - the third digit (32 values)
[0-9a-v] - the fourth digit (32 values)
{noformat}
The total sequence length is 6 characters.

This encoding gives total 32*32*32*32 = 1048576 values
It covers exactly all non-BMP characters U+010000 to U+10FFFF.

Examples
{noformat}
@+0000 - U+010000 = 0x10000 + 0*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+1000 - U+018000 = 0x10000 + 1*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+aaaa - U+06294A = 0x10000 + 10*(32^3) + 10*(32^2) + 10*(32^1) + 10
@+vvvv - U+10FFFF 0 0x10000 + 31*(32^3) + 31*(32^2) + 31*(32^1) + 31
{noformat}

h1. BMP characters with new case folding mappings

The Unicode version 14.0.0 has more casefolding mappings in addition to those existed Unicode-3.0.0 (used in the original version of the file name encoding).

Let's encode such characters with new casefolding as follows:
{noformat}
[@][0-9a-f][0-9a-f][g-z][0-9a-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[0-9a-f] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The encoded sequence represents the Unicode code point of the lower case variant of a character.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case
- If it is in the range [G-Z], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire BMP range U+0000 to U+FFFF.

If in the future we switch to a new Unicode version with more casefolding mapping in the BMP range, we'll be able to encode all such characters.

The mapping between the third digit and it hex value:
{noformat}
0123456789abcdef - the hex value
GHIJKLMNOPQRSTUV - the third digit, upper case
ghijklmnopqrstuv - the third digit, lower case
{noformat}
So for example, the hex value of 7 corresponds to
- the digit 'N' in case of a upper-case character
- the digit 'n' in case of a lower-case character.

For example, Unicode-14.0.0 has the following new casefolding mapping which does not exist in the original file name encoding.

{noformat}
U+0180 LATIN SMALL LETTER B WITH STROKE
U+0243 LATIN CAPITAL LETTER B WITH STROKE
{noformat}

These characters will be encoded as:
{noformat}
@01o0 - the code point U+0180
@01O0 - the code point U+0243, or literally, UPPER case of the code point U+0180
{noformat}

Another example: Unicode-14.0.0 has the following new casefolding mapping:
{noformat}
U+0500 CYRILLIC CAPITAL LETTER KOMI DE
U+0501 CYRILLIC SMALL LETTER KOMI DE
{noformat}

These characters will be encoded as:
{noformat}
@05G1 - the code point U+0500, or literally, UPPER case of the code point U+0501
@05g1 - the code poing U+0501
{noformat}

h1. non-BMP characters with case folding.

As of Unicode version 14.0.0 (and even in 16.0.0), casefolding data presents only in:
- Plane-0 (BMP)
- Plane1 (U+10000..U+1FFFF).

Let's encode Plane-1 letters with casefolding as follows:

{noformat}
[@][0-9a-f][0-9a-f][g-z][g-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[g-z] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case
- If it is in the range [G-Z], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire Plane1 range U+10000 to U+1FFFF.

If in the future we switch to a new Unicode version (from 14.0.0) with more casefolding mapping in the Plane-1 range, we'll be able to encode all such characters.

Example. Unicode-14.0.0 has casefolding between the following characters:
{noformat}
U+10400 DESERET CAPITAL LETTER LONG I
U+10428 DESERET SMALL LETTER LONG I
{noformat}

They will be encoded as:
{noformat}
@04I8 - the code point U+10400, or literally UPPER(code point U+10428)
@04i8 - the code point U+10428
{noformat}

h1. Summary of the encoding components

After adding the mentioned extensions, the encoding will consist of the components:

{noformat}
Pattern CodePoints Comment
------------------------------------ ----------------------- ----------------
[@][0..9][g..z] 10*20 = 200 BMP characters with 3.0.0 case folding
[@][g..z][0..9] 20*10 = 200 BMP characters with 3.0.0 case folding
[@][g..z][a..z] 20*26 = 520 BMP characters with 3.0.0 case folding
[@][@][a..z] 1*26 = 26 BMP characters with 3.0.0 case folding
[@][a..z][@] 1*26 = 26 BMP characters with 3.0.0 case folding
[@][a..f][g..z] 16*20 = 320 Unused
[@][0-9a-f][0-9a-f][0-9a-f][0-9a-f] 16*16*16*16 = 65536 BMP characters without case folding
[@][0-9a-f][0-9a-f][g-z][0-9a-f] 16*16*16*16 = 65536 BMP characters with 14.0.0 case folding
[@][0-9a-f][0-9a-f][g-z][g-z] 16*16*16*16 = 65536 non-BMP characters with case folding (Plane 1 only)
[@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v] 1*32*32*32*32 = 1048576 non-BMP characters without case folding
{noformat}

Alexander Barkov made changes - 2024-09-11 07:32

Description

Identifier names can't contain characters outside of BMP, i.e they are restricted to utf8mb3
Here is a relevant part of Slack discussion on why it is so, and on possible fix

{noformat}
... discussion on character_set_system and why it is utf8mb3...
....
bar Oct 13th, 2021 at 4:23 PM
@wlad yes, it's hard-coded. I think the biggest problem is to implement table-name-to-file-name encoding for non-BMP characters. Should be doable but needs some time.
5 replies

wlad 3 months ago
so, a surrogate pair won't do? like, @d801@dc37

bar 3 months ago
for characters that do not have lower/upper variants, it will do.

bar 3 months ago
It will actually do for characters that have lower/upper variants as well.

bar 3 months ago
Thanks for the good idea.
{noformat}

h1. Table name to file name encoding extension

We need to extend the encoding to support:
- new case folding in the BMP range appeared between Unicode-3.0.0 (used in the first version of the encoding) and Unicode-14.0.0 (the current version in MariaDB).
- non-BMP characters in the range U+010000 to U+10FFFF without case folding
- non-BMP characters in the range U+010000 to U+10FFFF with case folding

h1. non-BMP Encoding without case folding

Let's encode non-BMP characters which do not have case folding as follows:
{noformat}
[@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v]
{noformat}
where:
{noformat}
@ - the encoded character marker
+ - the marker for non-BMP character without case folding
[0-9a-v] - the first digit (32 values)
[0-9a-v] - the second digit (32 values)
[0-9a-v] - the third digit (32 values)
[0-9a-v] - the fourth digit (32 values)
{noformat}
The total sequence length is 6 characters.

This encoding gives total 32*32*32*32 = 1048576 values
It covers exactly all non-BMP characters U+010000 to U+10FFFF.

Examples
{noformat}
@+0000 - U+010000 = 0x10000 + 0*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+1000 - U+018000 = 0x10000 + 1*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+aaaa - U+06294A = 0x10000 + 10*(32^3) + 10*(32^2) + 10*(32^1) + 10
@+vvvv - U+10FFFF 0 0x10000 + 31*(32^3) + 31*(32^2) + 31*(32^1) + 31
{noformat}

h1. BMP characters with new case folding mappings

The Unicode version 14.0.0 has more casefolding mappings in addition to those existed Unicode-3.0.0 (used in the original version of the file name encoding).

Let's encode such characters with new casefolding as follows:
{noformat}
[@][0-9a-f][0-9a-f][g-z][0-9a-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[0-9a-f] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The encoded sequence represents the Unicode code point of the lower case variant of a character.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case
- If it is in the range [G-Z], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire BMP range U+0000 to U+FFFF.

If in the future we switch to a new Unicode version with more casefolding mapping in the BMP range, we'll be able to encode all such characters.

The mapping between the third digit and it hex value:
{noformat}
0123456789abcdef - the hex value
GHIJKLMNOPQRSTUV - the third digit, upper case
ghijklmnopqrstuv - the third digit, lower case
{noformat}
So for example, the hex value of 7 corresponds to
- the digit 'N' in case of a upper-case character
- the digit 'n' in case of a lower-case character.

For example, Unicode-14.0.0 has the following new casefolding mapping which does not exist in the original file name encoding.

{noformat}
U+0180 LATIN SMALL LETTER B WITH STROKE
U+0243 LATIN CAPITAL LETTER B WITH STROKE
{noformat}

These characters will be encoded as:
{noformat}
@01o0 - the code point U+0180
@01O0 - the code point U+0243, or literally, UPPER case of the code point U+0180
{noformat}

Another example: Unicode-14.0.0 has the following new casefolding mapping:
{noformat}
U+0500 CYRILLIC CAPITAL LETTER KOMI DE
U+0501 CYRILLIC SMALL LETTER KOMI DE
{noformat}

These characters will be encoded as:
{noformat}
@05G1 - the code point U+0500, or literally, UPPER case of the code point U+0501
@05g1 - the code poing U+0501
{noformat}

h1. non-BMP characters with case folding.

As of Unicode version 14.0.0 (and even in 16.0.0), casefolding data presents only in:
- Plane-0 (BMP)
- Plane1 (U+10000..U+1FFFF).

Let's encode Plane-1 letters with casefolding as follows:

{noformat}
[@][0-9a-f][0-9a-f][g-z][g-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[g-z] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case
- If it is in the range [G-Z], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire Plane1 range U+10000 to U+1FFFF.

If in the future we switch to a new Unicode version (from 14.0.0) with more casefolding mapping in the Plane-1 range, we'll be able to encode all such characters.

Example. Unicode-14.0.0 has casefolding between the following characters:
{noformat}
U+10400 DESERET CAPITAL LETTER LONG I
U+10428 DESERET SMALL LETTER LONG I
{noformat}

They will be encoded as:
{noformat}
@04I8 - the code point U+10400, or literally UPPER(code point U+10428)
@04i8 - the code point U+10428
{noformat}

h1. Summary of the encoding components

After adding the mentioned extensions, the encoding will consist of the components:

{noformat}
Pattern CodePoints Comment
------------------------------------ ----------------------- ----------------
[@][0..9][g..z] 10*20 = 200 BMP characters with 3.0.0 case folding
[@][g..z][0..9] 20*10 = 200 BMP characters with 3.0.0 case folding
[@][g..z][a..z] 20*26 = 520 BMP characters with 3.0.0 case folding
[@][@][a..z] 1*26 = 26 BMP characters with 3.0.0 case folding
[@][a..z][@] 1*26 = 26 BMP characters with 3.0.0 case folding
[@][a..f][g..z] 16*20 = 320 Unused
[@][0-9a-f][0-9a-f][0-9a-f][0-9a-f] 16*16*16*16 = 65536 BMP characters without case folding
[@][0-9a-f][0-9a-f][g-z][0-9a-f] 16*16*16*16 = 65536 BMP characters with 14.0.0 case folding
[@][0-9a-f][0-9a-f][g-z][g-z] 16*16*16*16 = 65536 non-BMP characters with case folding (Plane 1 only)
[@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v] 1*32*32*32*32 = 1048576 non-BMP characters without case folding
{noformat}

Identifier names can't contain characters outside of BMP, i.e they are restricted to utf8mb3
Here is a relevant part of Slack discussion on why it is so, and on possible fix

{noformat}
... discussion on character_set_system and why it is utf8mb3...
....
bar Oct 13th, 2021 at 4:23 PM
@wlad yes, it's hard-coded. I think the biggest problem is to implement table-name-to-file-name encoding for non-BMP characters. Should be doable but needs some time.
5 replies

wlad 3 months ago
so, a surrogate pair won't do? like, @d801@dc37

bar 3 months ago
for characters that do not have lower/upper variants, it will do.

bar 3 months ago
It will actually do for characters that have lower/upper variants as well.

bar 3 months ago
Thanks for the good idea.
{noformat}

h1. Table name to file name encoding extension

We need to extend the encoding to support:
- new case folding in the BMP range appeared between Unicode-3.0.0 (used in the first version of the encoding) and Unicode-14.0.0 (the current version in MariaDB).
- non-BMP characters in the range U+010000 to U+10FFFF without case folding
- non-BMP characters in the range U+010000 to U+10FFFF with case folding

h1. non-BMP Encoding without case folding

Let's encode non-BMP characters which do not have case folding as follows:
{noformat}
[@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v]
{noformat}
where:
{noformat}
@ - the encoded character marker
+ - the marker for non-BMP character without case folding
[0-9a-v] - the first digit (32 values)
[0-9a-v] - the second digit (32 values)
[0-9a-v] - the third digit (32 values)
[0-9a-v] - the fourth digit (32 values)
{noformat}
The total sequence length is 6 characters.

This encoding gives total 32*32*32*32 = 1048576 values
It covers exactly all non-BMP characters U+010000 to U+10FFFF.

Examples
{noformat}
@+0000 - U+010000 = 0x10000 + 0*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+1000 - U+018000 = 0x10000 + 1*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+aaaa - U+06294A = 0x10000 + 10*(32^3) + 10*(32^2) + 10*(32^1) + 10
@+vvvv - U+10FFFF 0 0x10000 + 31*(32^3) + 31*(32^2) + 31*(32^1) + 31
{noformat}

h1. BMP characters with new case folding mappings

The Unicode version 14.0.0 has more casefolding mappings in addition to those existed Unicode-3.0.0 (used in the original version of the file name encoding).

Let's encode such characters with new casefolding as follows:
{noformat}
[@][0-9a-f][0-9a-f][g-z][0-9a-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[0-9a-f] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The encoded sequence represents the Unicode code point of the lower case variant of a character.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case
- If it is in the range [G-Z], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire BMP range U+0000 to U+FFFF.

If in the future we switch to a new Unicode version with more casefolding mapping in the BMP range, we'll be able to encode all such characters.

The mapping between the third digit and it hex value:
{noformat}
0123456789abcdef - the hex value
GHIJKLMNOPQRSTUV - the third digit, upper case
ghijklmnopqrstuv - the third digit, lower case
{noformat}
So for example, the hex value of 7 corresponds to
- the digit 'N' in case of a upper-case character
- the digit 'n' in case of a lower-case character.

For example, Unicode-14.0.0 has the following new casefolding mapping which does not exist in the original file name encoding.

{noformat}
U+0180 LATIN SMALL LETTER B WITH STROKE
U+0243 LATIN CAPITAL LETTER B WITH STROKE
{noformat}

These characters will be encoded as:
{noformat}
@01o0 - the code point U+0180
@01O0 - the code point U+0243, or literally, UPPER case of the code point U+0180
{noformat}

Another example: Unicode-14.0.0 has the following new casefolding mapping:
{noformat}
U+0500 CYRILLIC CAPITAL LETTER KOMI DE
U+0501 CYRILLIC SMALL LETTER KOMI DE
{noformat}

These characters will be encoded as:
{noformat}
@05G1 - the code point U+0500, or literally, UPPER case of the code point U+0501
@05g1 - the code poing U+0501
{noformat}

h1. non-BMP characters with case folding.

As of Unicode version 14.0.0 (and even in 16.0.0), casefolding data presents only in:
- Plane-0 (BMP)
- Plane1 (U+10000..U+1FFFF).

Let's encode Plane-1 letters with casefolding as follows:

{noformat}
[@][0-9a-f][0-9a-f][g-z][g-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[g-z] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case
- If it is in the range [G-Z], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire Plane1 range U+10000 to U+1FFFF.

If in the future we switch to a new Unicode version (from 14.0.0) with more casefolding mapping in the Plane-1 range, we'll be able to encode all such characters.

Example. Unicode-14.0.0 has casefolding between the following characters:
{noformat}
U+10400 DESERET CAPITAL LETTER LONG I
U+10428 DESERET SMALL LETTER LONG I
{noformat}

They will be encoded as:
{noformat}
@04I8 - the code point U+10400, or literally UPPER(code point U+10428)
@04i8 - the code point U+10428
{noformat}

h1. Summary of the encoding components

After adding the mentioned extensions, the encoding will consist of the components:

{noformat}
Pattern CodePoints Comment
------------------------------------ ----------------------- ----------------
[@][0..9][g..z] 10*20 = 200 BMP characters with 3.0.0 case folding
[@][g..z][0..9] 20*10 = 200 BMP characters with 3.0.0 case folding
[@][g..z][a..z] 20*26 = 520 BMP characters with 3.0.0 case folding
[@][@][a..z] 1*26 = 26 BMP characters with 3.0.0 case folding
[@][a..z][@] 1*26 = 26 BMP characters with 3.0.0 case folding
[@][a..f][g..z] 16*20 = 320 Unused
[@][0-9a-f][0-9a-f][0-9a-f][0-9a-f] 16*16*16*16 = 65536 BMP characters without case folding
------------------------------------ ----------------------- ----------------
[@][0-9a-f][0-9a-f][g-z][0-9a-f] 16*16*16*16 = 65536 BMP characters with 14.0.0 case folding
[@][0-9a-f][0-9a-f][g-z][g-z] 16*16*16*16 = 65536 non-BMP characters with case folding (Plane 1 only)
[@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v] 1*32*32*32*32 = 1048576 non-BMP characters without case folding
{noformat}

Alexander Barkov made changes - 2024-09-11 07:48

Description

Identifier names can't contain characters outside of BMP, i.e they are restricted to utf8mb3
Here is a relevant part of Slack discussion on why it is so, and on possible fix

{noformat}
... discussion on character_set_system and why it is utf8mb3...
....
bar Oct 13th, 2021 at 4:23 PM
@wlad yes, it's hard-coded. I think the biggest problem is to implement table-name-to-file-name encoding for non-BMP characters. Should be doable but needs some time.
5 replies

wlad 3 months ago
so, a surrogate pair won't do? like, @d801@dc37

bar 3 months ago
for characters that do not have lower/upper variants, it will do.

bar 3 months ago
It will actually do for characters that have lower/upper variants as well.

bar 3 months ago
Thanks for the good idea.
{noformat}

h1. Table name to file name encoding extension

We need to extend the encoding to support:
- new case folding in the BMP range appeared between Unicode-3.0.0 (used in the first version of the encoding) and Unicode-14.0.0 (the current version in MariaDB).
- non-BMP characters in the range U+010000 to U+10FFFF without case folding
- non-BMP characters in the range U+010000 to U+10FFFF with case folding

h1. non-BMP Encoding without case folding

Let's encode non-BMP characters which do not have case folding as follows:
{noformat}
[@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v]
{noformat}
where:
{noformat}
@ - the encoded character marker
+ - the marker for non-BMP character without case folding
[0-9a-v] - the first digit (32 values)
[0-9a-v] - the second digit (32 values)
[0-9a-v] - the third digit (32 values)
[0-9a-v] - the fourth digit (32 values)
{noformat}
The total sequence length is 6 characters.

This encoding gives total 32*32*32*32 = 1048576 values
It covers exactly all non-BMP characters U+010000 to U+10FFFF.

Examples
{noformat}
@+0000 - U+010000 = 0x10000 + 0*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+1000 - U+018000 = 0x10000 + 1*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+aaaa - U+06294A = 0x10000 + 10*(32^3) + 10*(32^2) + 10*(32^1) + 10
@+vvvv - U+10FFFF 0 0x10000 + 31*(32^3) + 31*(32^2) + 31*(32^1) + 31
{noformat}

h1. BMP characters with new case folding mappings

The Unicode version 14.0.0 has more casefolding mappings in addition to those existed Unicode-3.0.0 (used in the original version of the file name encoding).

Let's encode such characters with new casefolding as follows:
{noformat}
[@][0-9a-f][0-9a-f][g-z][0-9a-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[0-9a-f] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The encoded sequence represents the Unicode code point of the lower case variant of a character.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case
- If it is in the range [G-Z], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire BMP range U+0000 to U+FFFF.

If in the future we switch to a new Unicode version with more casefolding mapping in the BMP range, we'll be able to encode all such characters.

The mapping between the third digit and it hex value:
{noformat}
0123456789abcdef - the hex value
GHIJKLMNOPQRSTUV - the third digit, upper case
ghijklmnopqrstuv - the third digit, lower case
{noformat}
So for example, the hex value of 7 corresponds to
- the digit 'N' in case of a upper-case character
- the digit 'n' in case of a lower-case character.

For example, Unicode-14.0.0 has the following new casefolding mapping which does not exist in the original file name encoding.

{noformat}
U+0180 LATIN SMALL LETTER B WITH STROKE
U+0243 LATIN CAPITAL LETTER B WITH STROKE
{noformat}

These characters will be encoded as:
{noformat}
@01o0 - the code point U+0180
@01O0 - the code point U+0243, or literally, UPPER case of the code point U+0180
{noformat}

Another example: Unicode-14.0.0 has the following new casefolding mapping:
{noformat}
U+0500 CYRILLIC CAPITAL LETTER KOMI DE
U+0501 CYRILLIC SMALL LETTER KOMI DE
{noformat}

These characters will be encoded as:
{noformat}
@05G1 - the code point U+0500, or literally, UPPER case of the code point U+0501
@05g1 - the code poing U+0501
{noformat}

h1. non-BMP characters with case folding.

As of Unicode version 14.0.0 (and even in 16.0.0), casefolding data presents only in:
- Plane-0 (BMP)
- Plane1 (U+10000..U+1FFFF).

Let's encode Plane-1 letters with casefolding as follows:

{noformat}
[@][0-9a-f][0-9a-f][g-z][g-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-z] - the third digit (16 values) - determines upper or lower case
[g-z] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The third digit [g-v] determines the case:
- If it is in the range [g-z], then the character is in the lower case
- If it is in the range [G-Z], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire Plane1 range U+10000 to U+1FFFF.

If in the future we switch to a new Unicode version (from 14.0.0) with more casefolding mapping in the Plane-1 range, we'll be able to encode all such characters.

Example. Unicode-14.0.0 has casefolding between the following characters:
{noformat}
U+10400 DESERET CAPITAL LETTER LONG I
U+10428 DESERET SMALL LETTER LONG I
{noformat}

They will be encoded as:
{noformat}
@04I8 - the code point U+10400, or literally UPPER(code point U+10428)
@04i8 - the code point U+10428
{noformat}

h1. Summary of the encoding components

After adding the mentioned extensions, the encoding will consist of the components:

{noformat}
Pattern CodePoints Comment
------------------------------------ ----------------------- ----------------
[@][0..9][g..z] 10*20 = 200 BMP characters with 3.0.0 case folding
[@][g..z][0..9] 20*10 = 200 BMP characters with 3.0.0 case folding
[@][g..z][a..z] 20*26 = 520 BMP characters with 3.0.0 case folding
[@][@][a..z] 1*26 = 26 BMP characters with 3.0.0 case folding
[@][a..z][@] 1*26 = 26 BMP characters with 3.0.0 case folding
[@][a..f][g..z] 16*20 = 320 Unused
[@][0-9a-f][0-9a-f][0-9a-f][0-9a-f] 16*16*16*16 = 65536 BMP characters without case folding
------------------------------------ ----------------------- ----------------
[@][0-9a-f][0-9a-f][g-z][0-9a-f] 16*16*16*16 = 65536 BMP characters with 14.0.0 case folding
[@][0-9a-f][0-9a-f][g-z][g-z] 16*16*16*16 = 65536 non-BMP characters with case folding (Plane 1 only)
[@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v] 1*32*32*32*32 = 1048576 non-BMP characters without case folding
{noformat}

Identifier names can't contain characters outside of BMP, i.e they are restricted to utf8mb3
Here is a relevant part of Slack discussion on why it is so, and on possible fix

{noformat}
... discussion on character_set_system and why it is utf8mb3...
....
bar Oct 13th, 2021 at 4:23 PM
@wlad yes, it's hard-coded. I think the biggest problem is to implement table-name-to-file-name encoding for non-BMP characters. Should be doable but needs some time.
5 replies

wlad 3 months ago
so, a surrogate pair won't do? like, @d801@dc37

bar 3 months ago
for characters that do not have lower/upper variants, it will do.

bar 3 months ago
It will actually do for characters that have lower/upper variants as well.

bar 3 months ago
Thanks for the good idea.
{noformat}

h1. Table name to file name encoding extension

We need to extend the encoding to support:
- new case folding in the BMP range appeared between Unicode-3.0.0 (used in the first version of the encoding) and Unicode-14.0.0 (the current version in MariaDB).
- non-BMP characters in the range U+010000 to U+10FFFF without case folding
- non-BMP characters in the range U+010000 to U+10FFFF with case folding

h1. non-BMP Encoding without case folding

Let's encode non-BMP characters which do not have case folding as follows:
{noformat}
[@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v]
{noformat}
where:
{noformat}
@ - the encoded character marker
+ - the marker for non-BMP character without case folding
[0-9a-v] - the first digit (32 values)
[0-9a-v] - the second digit (32 values)
[0-9a-v] - the third digit (32 values)
[0-9a-v] - the fourth digit (32 values)
{noformat}
The total sequence length is 6 characters.

This encoding gives total 32*32*32*32 = 1048576 values
It covers exactly all non-BMP characters U+010000 to U+10FFFF.

Examples
{noformat}
@+0000 - U+010000 = 0x10000 + 0*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+1000 - U+018000 = 0x10000 + 1*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+aaaa - U+06294A = 0x10000 + 10*(32^3) + 10*(32^2) + 10*(32^1) + 10
@+vvvv - U+10FFFF 0 0x10000 + 31*(32^3) + 31*(32^2) + 31*(32^1) + 31
{noformat}

h1. BMP characters with new case folding mappings

The Unicode version 14.0.0 has more casefolding mappings in addition to those existed Unicode-3.0.0 (used in the original version of the file name encoding).

Let's encode such characters with new casefolding as follows:
{noformat}
[@][0-9a-f][0-9a-f][g-v][0-9a-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-v] - the third digit (16 values) - determines upper or lower case
[0-9a-f] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The encoded sequence represents the Unicode code point of the lower case variant of a character.

The third digit [g-v] determines the case:
- If it is in the range [g-v], then the character is in the lower case
- If it is in the range [G-V], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire BMP range U+0000 to U+FFFF.

If in the future we switch to a new Unicode version with more casefolding mapping in the BMP range, we'll be able to encode all such characters.

The mapping between the third digit and it hex value:
{noformat}
0123456789abcdef - the hex value
GHIJKLMNOPQRSTUV - the third digit, upper case
ghijklmnopqrstuv - the third digit, lower case
{noformat}
So for example, the hex value of 7 corresponds to
- the digit 'N' in case of a upper-case character
- the digit 'n' in case of a lower-case character.

For example, Unicode-14.0.0 has the following new casefolding mapping which does not exist in the original file name encoding.

{noformat}
U+0180 LATIN SMALL LETTER B WITH STROKE
U+0243 LATIN CAPITAL LETTER B WITH STROKE
{noformat}

These characters will be encoded as:
{noformat}
@01o0 - the code point U+0180
@01O0 - the code point U+0243, or literally, UPPER case of the code point U+0180
{noformat}

Another example: Unicode-14.0.0 has the following new casefolding mapping:
{noformat}
U+0500 CYRILLIC CAPITAL LETTER KOMI DE
U+0501 CYRILLIC SMALL LETTER KOMI DE
{noformat}

These characters will be encoded as:
{noformat}
@05G1 - the code point U+0500, or literally, UPPER case of the code point U+0501
@05g1 - the code poing U+0501
{noformat}

h1. non-BMP characters with case folding.

As of Unicode version 14.0.0 (and even in 16.0.0), casefolding data presents only in:
- Plane-0 (BMP)
- Plane1 (U+10000..U+1FFFF).

Let's encode Plane-1 letters with casefolding as follows:

{noformat}
[@][0-9a-f][0-9a-f][g-v][g-v]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-v] - the third digit (16 values) - determines upper or lower case
[g-v] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The third digit [g-v] determines the case:
- If it is in the range [g-v], then the character is in the lower case
- If it is in the range [G-V], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire Plane1 range U+10000 to U+1FFFF.

If in the future we switch to a new Unicode version (from 14.0.0) with more casefolding mapping in the Plane-1 range, we'll be able to encode all such characters.

Example. Unicode-14.0.0 has casefolding between the following characters:
{noformat}
U+10400 DESERET CAPITAL LETTER LONG I
U+10428 DESERET SMALL LETTER LONG I
{noformat}

They will be encoded as:
{noformat}
@04I8 - the code point U+10400, or literally UPPER(code point U+10428)
@04i8 - the code point U+10428
{noformat}

h1. Summary of the encoding components

After adding the mentioned extensions, the encoding will consist of the components:

{noformat}
Pattern CodePoints Comment
------------------------------------ ----------------------- ----------------
[@][0..9][g..z] 10*20 = 200 BMP characters with 3.0.0 case folding
[@][g..z][0..9] 20*10 = 200 BMP characters with 3.0.0 case folding
[@][g..z][a..z] 20*26 = 520 BMP characters with 3.0.0 case folding
[@][@][a..z] 1*26 = 26 BMP characters with 3.0.0 case folding
[@][a..z][@] 1*26 = 26 BMP characters with 3.0.0 case folding
[@][a..f][g..z] 16*20 = 320 Unused
[@][0-9a-f][0-9a-f][0-9a-f][0-9a-f] 16*16*16*16 = 65536 BMP characters without case folding
------------------------------------ ----------------------- ----------------
[@][0-9a-f][0-9a-f][g-v][0-9a-f] 16*16*16*16 = 65536 BMP characters with 14.0.0 case folding
[@][0-9a-f][0-9a-f][g-v][g-z] 16*16*16*16 = 65536 non-BMP characters with case folding (Plane 1 only)
[@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v] 1*32*32*32*32 = 1048576 non-BMP characters without case folding
{noformat}

Alexander Barkov added a comment - 2024-09-11 08:06 - edited

Table name to file name encoding extension, proposal #1

non-BMP Encoding without case folding

Let's encode non-BMP characters which do not have case folding as follows:

[@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v]

where:

 @        - the encoded character marker

 +        - the marker for non-BMP character without case folding

 [0-9a-v] - the first digit  (32 values)

 [0-9a-v] - the second digit (32 values)

 [0-9a-v] - the third digit  (32 values)

 [0-9a-v] - the fourth digit (32 values)

The total sequence length is 6 characters.

This encoding gives total 32*32*32*32 = 1048576 values
It covers exactly all non-BMP characters U+010000 to U+10FFFF.

Examples

@+0000  - U+010000 = 0x10000 +   0*(32^3) +  0*(32^2) +  0*(32^1) +  0

@+1000  - U+018000 = 0x10000 +   1*(32^3) +  0*(32^2) +  0*(32^1) +  0

@+aaaa  - U+06294A = 0x10000 +  10*(32^3) + 10*(32^2) + 10*(32^1) + 10

@+vvvv  - U+10FFFF = 0x10000 +  31*(32^3) + 31*(32^2) + 31*(32^1) + 31

BMP characters with new case folding mappings

The Unicode version 14.0.0 has more casefolding mappings in addition to those existed Unicode-3.0.0 (used in the original version of the file name encoding).

Let's encode such characters with new casefolding as follows:

[@][0-9a-f][0-9a-f][g-v][0-9a-z]

where

 @        - the encoded character marker

 [0-9a-f] - the first digit  (16 values)

 [0-9a-f] - the second digit (16 values)

 [g-v]    - the third digit  (16 values) - determines upper or lower case

 [0-9a-f] - the fourth digit (16 values)

The total encoded sequence length is 5 characters.

The encoded sequence represents the Unicode code point of the lower case variant of a character.

The third digit [g-v] determines the case:

If it is in the range [g-v], then the character is in the lower case
If it is in the range [G-V], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire BMP range U+0000 to U+FFFF.

If in the future we switch to a new Unicode version with more casefolding mapping in the BMP range, we'll be able to encode all such characters.

The mapping between the third digit and it hex value:

0123456789abcdef - the hex value

GHIJKLMNOPQRSTUV - the third digit, upper case

ghijklmnopqrstuv - the third digit, lower case

So for example, the hex value of 7 corresponds to

the digit 'N' in case of a upper-case character
the digit 'n' in case of a lower-case character.

For example, Unicode-14.0.0 has the following new casefolding mapping which does not exist in the original file name encoding.

U+0180 LATIN SMALL LETTER B WITH STROKE

U+0243 LATIN CAPITAL LETTER B WITH STROKE

These characters will be encoded as:

@01o0 - the code point U+0180

@01O0 - the code point U+0243, or literally, UPPER case of the code point U+0180

Another example: Unicode-14.0.0 has the following new casefolding mapping:

U+0500 CYRILLIC CAPITAL LETTER KOMI DE

U+0501 CYRILLIC SMALL LETTER KOMI DE

These characters will be encoded as:

@05G1 - the code point U+0500, or literally, UPPER case of the code point U+0501

@05g1 - the code poing U+0501

non-BMP characters with case folding.

As of Unicode version 14.0.0 (and even in 16.0.0), casefolding data presents only in:

Plane-0 (BMP)
Plane1 (U+10000..U+1FFFF).

Let's encode Plane-1 letters with casefolding as follows:

[@][0-9a-f][0-9a-f][g-v][g-v]

where

 @        - the encoded character marker

 [0-9a-f] - the first digit  (16 values)

 [0-9a-f] - the second digit (16 values)

 [g-v]    - the third digit  (16 values) - determines upper or lower case

 [g-v]    - the fourth digit (16 values)

The total encoded sequence length is 5 characters.

The third digit [g-v] determines the case:

If it is in the range [g-v], then the character is in the lower case
If it is in the range [G-V], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire Plane1 range U+10000 to U+1FFFF.

If in the future we switch to a new Unicode version (from 14.0.0) with more casefolding mapping in the Plane-1 range, we'll be able to encode all such characters.

Example. Unicode-14.0.0 has casefolding between the following characters:

U+10400 DESERET CAPITAL LETTER LONG I

U+10428 DESERET SMALL LETTER LONG I

They will be encoded as:

@04I8 - the code point U+10400, or literally UPPER(code point U+10428)

@04i8 - the code point U+10428

Summary of the encoding components

After adding the mentioned extensions, the encoding will consist of the components:

Pattern                                 CodePoints               Comment

------------------------------------    -----------------------  ----------------

[@][0..9][g..z]                         10*20         = 200      BMP characters with 3.0.0 case folding

[@][g..z][0..9]                         20*10         = 200      BMP characters with 3.0.0 case folding

[@][g..z][a..z]                         20*26         = 520      BMP characters with 3.0.0 case folding

[@][@][a..z]                            1*26          = 26       BMP characters with 3.0.0 case folding

[@][a..z][@]                            1*26          = 26       BMP characters with 3.0.0 case folding

[@][a..f][g..z]                         16*20         = 320      Unused

[@][0-9a-f][0-9a-f][0-9a-f][0-9a-f]     16*16*16*16   = 65536    BMP characters without case folding

------------------------------------    -----------------------  ----------------

[@][0-9a-f][0-9a-f][g-v][0-9a-f]        16*16*16*16   = 65536    BMP characters with 14.0.0 case folding

[@][0-9a-f][0-9a-f][g-v][g-z]           16*16*16*16   = 65536    non-BMP characters with case folding (Plane 1 only)

[@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v]  1*32*32*32*32 = 1048576  non-BMP characters without case folding

Alexander Barkov added a comment - 2024-09-11 08:06 - edited Table name to file name encoding extension, proposal #1 non-BMP Encoding without case folding Let's encode non-BMP characters which do not have case folding as follows: [@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v] where: @ - the encoded character marker + - the marker for non-BMP character without case folding [0-9a-v] - the first digit (32 values) [0-9a-v] - the second digit (32 values) [0-9a-v] - the third digit (32 values) [0-9a-v] - the fourth digit (32 values) The total sequence length is 6 characters. This encoding gives total 32*32*32*32 = 1048576 values It covers exactly all non-BMP characters U+010000 to U+10FFFF. Examples @+0000 - U+010000 = 0x10000 + 0*(32^3) + 0*(32^2) + 0*(32^1) + 0 @+1000 - U+018000 = 0x10000 + 1*(32^3) + 0*(32^2) + 0*(32^1) + 0 @+aaaa - U+06294A = 0x10000 + 10*(32^3) + 10*(32^2) + 10*(32^1) + 10 @+vvvv - U+10FFFF = 0x10000 + 31*(32^3) + 31*(32^2) + 31*(32^1) + 31 BMP characters with new case folding mappings The Unicode version 14.0.0 has more casefolding mappings in addition to those existed Unicode-3.0.0 (used in the original version of the file name encoding). Let's encode such characters with new casefolding as follows: [@][0-9a-f][0-9a-f][g-v][0-9a-z] where @ - the encoded character marker [0-9a-f] - the first digit (16 values) [0-9a-f] - the second digit (16 values) [g-v] - the third digit (16 values) - determines upper or lower case [0-9a-f] - the fourth digit (16 values) The total encoded sequence length is 5 characters. The encoded sequence represents the Unicode code point of the lower case variant of a character. The third digit [g-v] determines the case: If it is in the range [g-v] , then the character is in the lower case If it is in the range [G-V] , then the character is in the upper case This encoding gives 16*16*16*16=65536 values, which covers the entire BMP range U+0000 to U+FFFF. If in the future we switch to a new Unicode version with more casefolding mapping in the BMP range, we'll be able to encode all such characters. The mapping between the third digit and it hex value: 0123456789abcdef - the hex value GHIJKLMNOPQRSTUV - the third digit, upper case ghijklmnopqrstuv - the third digit, lower case So for example, the hex value of 7 corresponds to the digit 'N' in case of a upper-case character the digit 'n' in case of a lower-case character. For example, Unicode-14.0.0 has the following new casefolding mapping which does not exist in the original file name encoding. U+0180 LATIN SMALL LETTER B WITH STROKE U+0243 LATIN CAPITAL LETTER B WITH STROKE These characters will be encoded as: @01o0 - the code point U+0180 @01O0 - the code point U+0243, or literally, UPPER case of the code point U+0180 Another example: Unicode-14.0.0 has the following new casefolding mapping: U+0500 CYRILLIC CAPITAL LETTER KOMI DE U+0501 CYRILLIC SMALL LETTER KOMI DE These characters will be encoded as: @05G1 - the code point U+0500, or literally, UPPER case of the code point U+0501 @05g1 - the code poing U+0501 non-BMP characters with case folding. As of Unicode version 14.0.0 (and even in 16.0.0), casefolding data presents only in: Plane-0 (BMP) Plane1 (U+10000..U+1FFFF). Let's encode Plane-1 letters with casefolding as follows: [@][0-9a-f][0-9a-f][g-v][g-v] where @ - the encoded character marker [0-9a-f] - the first digit (16 values) [0-9a-f] - the second digit (16 values) [g-v] - the third digit (16 values) - determines upper or lower case [g-v] - the fourth digit (16 values) The total encoded sequence length is 5 characters. The third digit [g-v] determines the case: If it is in the range [g-v] , then the character is in the lower case If it is in the range [G-V] , then the character is in the upper case This encoding gives 16*16*16*16=65536 values, which covers the entire Plane1 range U+10000 to U+1FFFF. If in the future we switch to a new Unicode version (from 14.0.0) with more casefolding mapping in the Plane-1 range, we'll be able to encode all such characters. Example. Unicode-14.0.0 has casefolding between the following characters: U+10400 DESERET CAPITAL LETTER LONG I U+10428 DESERET SMALL LETTER LONG I They will be encoded as: @04I8 - the code point U+10400, or literally UPPER(code point U+10428) @04i8 - the code point U+10428 Summary of the encoding components After adding the mentioned extensions, the encoding will consist of the components: Pattern CodePoints Comment ------------------------------------ ----------------------- ---------------- [@][0..9][g..z] 10*20 = 200 BMP characters with 3.0.0 case folding [@][g..z][0..9] 20*10 = 200 BMP characters with 3.0.0 case folding [@][g..z][a..z] 20*26 = 520 BMP characters with 3.0.0 case folding [@][@][a..z] 1*26 = 26 BMP characters with 3.0.0 case folding [@][a..z][@] 1*26 = 26 BMP characters with 3.0.0 case folding [@][a..f][g..z] 16*20 = 320 Unused [@][0-9a-f][0-9a-f][0-9a-f][0-9a-f] 16*16*16*16 = 65536 BMP characters without case folding ------------------------------------ ----------------------- ---------------- [@][0-9a-f][0-9a-f][g-v][0-9a-f] 16*16*16*16 = 65536 BMP characters with 14.0.0 case folding [@][0-9a-f][0-9a-f][g-v][g-z] 16*16*16*16 = 65536 non-BMP characters with case folding (Plane 1 only) [@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v] 1*32*32*32*32 = 1048576 non-BMP characters without case folding

Alexander Barkov made changes - 2024-09-11 08:07

Description

Identifier names can't contain characters outside of BMP, i.e they are restricted to utf8mb3
Here is a relevant part of Slack discussion on why it is so, and on possible fix

{noformat}
... discussion on character_set_system and why it is utf8mb3...
....
bar Oct 13th, 2021 at 4:23 PM
@wlad yes, it's hard-coded. I think the biggest problem is to implement table-name-to-file-name encoding for non-BMP characters. Should be doable but needs some time.
5 replies

wlad 3 months ago
so, a surrogate pair won't do? like, @d801@dc37

bar 3 months ago
for characters that do not have lower/upper variants, it will do.

bar 3 months ago
It will actually do for characters that have lower/upper variants as well.

bar 3 months ago
Thanks for the good idea.
{noformat}

h1. Table name to file name encoding extension

We need to extend the encoding to support:
- new case folding in the BMP range appeared between Unicode-3.0.0 (used in the first version of the encoding) and Unicode-14.0.0 (the current version in MariaDB).
- non-BMP characters in the range U+010000 to U+10FFFF without case folding
- non-BMP characters in the range U+010000 to U+10FFFF with case folding

h1. non-BMP Encoding without case folding

Let's encode non-BMP characters which do not have case folding as follows:
{noformat}
[@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v]
{noformat}
where:
{noformat}
@ - the encoded character marker
+ - the marker for non-BMP character without case folding
[0-9a-v] - the first digit (32 values)
[0-9a-v] - the second digit (32 values)
[0-9a-v] - the third digit (32 values)
[0-9a-v] - the fourth digit (32 values)
{noformat}
The total sequence length is 6 characters.

This encoding gives total 32*32*32*32 = 1048576 values
It covers exactly all non-BMP characters U+010000 to U+10FFFF.

Examples
{noformat}
@+0000 - U+010000 = 0x10000 + 0*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+1000 - U+018000 = 0x10000 + 1*(32^3) + 0*(32^2) + 0*(32^1) + 0
@+aaaa - U+06294A = 0x10000 + 10*(32^3) + 10*(32^2) + 10*(32^1) + 10
@+vvvv - U+10FFFF 0 0x10000 + 31*(32^3) + 31*(32^2) + 31*(32^1) + 31
{noformat}

h1. BMP characters with new case folding mappings

The Unicode version 14.0.0 has more casefolding mappings in addition to those existed Unicode-3.0.0 (used in the original version of the file name encoding).

Let's encode such characters with new casefolding as follows:
{noformat}
[@][0-9a-f][0-9a-f][g-v][0-9a-z]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-v] - the third digit (16 values) - determines upper or lower case
[0-9a-f] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The encoded sequence represents the Unicode code point of the lower case variant of a character.

The third digit [g-v] determines the case:
- If it is in the range [g-v], then the character is in the lower case
- If it is in the range [G-V], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire BMP range U+0000 to U+FFFF.

If in the future we switch to a new Unicode version with more casefolding mapping in the BMP range, we'll be able to encode all such characters.

The mapping between the third digit and it hex value:
{noformat}
0123456789abcdef - the hex value
GHIJKLMNOPQRSTUV - the third digit, upper case
ghijklmnopqrstuv - the third digit, lower case
{noformat}
So for example, the hex value of 7 corresponds to
- the digit 'N' in case of a upper-case character
- the digit 'n' in case of a lower-case character.

For example, Unicode-14.0.0 has the following new casefolding mapping which does not exist in the original file name encoding.

{noformat}
U+0180 LATIN SMALL LETTER B WITH STROKE
U+0243 LATIN CAPITAL LETTER B WITH STROKE
{noformat}

These characters will be encoded as:
{noformat}
@01o0 - the code point U+0180
@01O0 - the code point U+0243, or literally, UPPER case of the code point U+0180
{noformat}

Another example: Unicode-14.0.0 has the following new casefolding mapping:
{noformat}
U+0500 CYRILLIC CAPITAL LETTER KOMI DE
U+0501 CYRILLIC SMALL LETTER KOMI DE
{noformat}

These characters will be encoded as:
{noformat}
@05G1 - the code point U+0500, or literally, UPPER case of the code point U+0501
@05g1 - the code poing U+0501
{noformat}

h1. non-BMP characters with case folding.

As of Unicode version 14.0.0 (and even in 16.0.0), casefolding data presents only in:
- Plane-0 (BMP)
- Plane1 (U+10000..U+1FFFF).

Let's encode Plane-1 letters with casefolding as follows:

{noformat}
[@][0-9a-f][0-9a-f][g-v][g-v]
{noformat}
where
{noformat}
@ - the encoded character marker
[0-9a-f] - the first digit (16 values)
[0-9a-f] - the second digit (16 values)
[g-v] - the third digit (16 values) - determines upper or lower case
[g-v] - the fourth digit (16 values)
{noformat}
The total encoded sequence length is 5 characters.

The third digit [g-v] determines the case:
- If it is in the range [g-v], then the character is in the lower case
- If it is in the range [G-V], then the character is in the upper case

This encoding gives 16*16*16*16=65536 values, which covers the entire Plane1 range U+10000 to U+1FFFF.

If in the future we switch to a new Unicode version (from 14.0.0) with more casefolding mapping in the Plane-1 range, we'll be able to encode all such characters.

Example. Unicode-14.0.0 has casefolding between the following characters:
{noformat}
U+10400 DESERET CAPITAL LETTER LONG I
U+10428 DESERET SMALL LETTER LONG I
{noformat}

They will be encoded as:
{noformat}
@04I8 - the code point U+10400, or literally UPPER(code point U+10428)
@04i8 - the code point U+10428
{noformat}

h1. Summary of the encoding components

After adding the mentioned extensions, the encoding will consist of the components:

{noformat}
Pattern CodePoints Comment
------------------------------------ ----------------------- ----------------
[@][0..9][g..z] 10*20 = 200 BMP characters with 3.0.0 case folding
[@][g..z][0..9] 20*10 = 200 BMP characters with 3.0.0 case folding
[@][g..z][a..z] 20*26 = 520 BMP characters with 3.0.0 case folding
[@][@][a..z] 1*26 = 26 BMP characters with 3.0.0 case folding
[@][a..z][@] 1*26 = 26 BMP characters with 3.0.0 case folding
[@][a..f][g..z] 16*20 = 320 Unused
[@][0-9a-f][0-9a-f][0-9a-f][0-9a-f] 16*16*16*16 = 65536 BMP characters without case folding
------------------------------------ ----------------------- ----------------
[@][0-9a-f][0-9a-f][g-v][0-9a-f] 16*16*16*16 = 65536 BMP characters with 14.0.0 case folding
[@][0-9a-f][0-9a-f][g-v][g-z] 16*16*16*16 = 65536 non-BMP characters with case folding (Plane 1 only)
[@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v] 1*32*32*32*32 = 1048576 non-BMP characters without case folding
{noformat}

Identifier names can't contain characters outside of BMP, i.e they are restricted to utf8mb3
Here is a relevant part of Slack discussion on why it is so, and on possible fix

{noformat}
... discussion on character_set_system and why it is utf8mb3...
....
bar Oct 13th, 2021 at 4:23 PM
@wlad yes, it's hard-coded. I think the biggest problem is to implement table-name-to-file-name encoding for non-BMP characters. Should be doable but needs some time.
5 replies

wlad 3 months ago
so, a surrogate pair won't do? like, @d801@dc37

bar 3 months ago
for characters that do not have lower/upper variants, it will do.

bar 3 months ago
It will actually do for characters that have lower/upper variants as well.

bar 3 months ago
Thanks for the good idea.
{noformat}

Alexander Barkov added a comment - 2024-09-11 08:08 - edited

Table name to file name encoding extension, proposal #2

The old encoding has an unused range:

[@][a..f][g..z] 6*20= 120 combinations

The idea is to reuse this unused range for the new extensions.

BMP characters with new 14.0.0 casefolding

Let's encode characters with new casefolding as follows:

[@][a-b][g-v][0-9a-v][0-9a-v]

where

@         - the encoded character marker

[a-b]     - the first digit  (4 values)

[g-v]     - the second digit (16 values) - determines upper or lower case

[0-9a-v]  - the third digit  (32 values)

[0-9a-v]  - the fourth digit (32 values)

The total encoded sequence length is 5 characters.

The encoded sequence represents the Unicode code point of the lower case variant of a character.

The second digit [g-v] determines the case:

If it is in the range [g-v], then the character is in the lower case
If it is in the range [G-V], then the character is in the upper case

This encoding gives 4*16*32*32=65536 values, which covers the entire BMP range U+0000 to U+FFFF.

Examples:

@ah81   - U+501 = 0*16*32*32 + 1*32*32 + 8*32 + 1

@aH81   - U+500 = 0*16*32*32 + 1*32*32 + 8*32 + 1, or literally UPPER(U+501)

Non-BMP characters with case folding

Let's encode non-BMP characters with casefolding as follows:

[@][c-f][g-v][0-9a-v][0-9a-v][0-9a-f]

where

@         - the encoded character marker

[c-f]     - the first digit  (4 values)

[g-v]     - the second digit (16 values) - determines upper or lower case

[0-9a-v]  - the third digit  (32 values)

[0-9a-v]  - the fourth digit (32 values)

[0-9a-f]  - the fivth digit  (16 values)

The total encoded sequence length is 6 characters.

The encoded sequence represents the Unicode code point of the lower case variant of a character.

The second digit [g-v] determines the case:

If it is in the range [g-v], then the character is in the lower case
If it is in the range [G-v], then the character is in the upper case

This encoding gives 4*16*32*32*16=1048576 values, which exactly covers the entire
non-BMP range U+010000 to U+10FFFF.

Non-BMP characters without folding

Let's encode non-BMP characters without casefolding as follows:

[@][c-f][g-v][0-9a-v][0-9a-v][g-v]

where

@         - the encoded character marker

[c-f]     - the first digit  (4 values)

[g-v]     - the second digit (16 values)

[0-9a-v]  - the third digit  (32 values)

[0-9a-v]  - the fourth digit (32 values)

[g-v]     - the fifth digit  (16 values)

The total encoded sequence length is 6 characters.

This encoding gives 4*16*32*32*16=1048576 values, which exactly covers the entire
non-BMP range U+010000 to U+10FFFF.

Examples

@cg00g   - U+010000 = 0x10000 + 0*16*32*32*16 +  0*32*32*16 +  0*32*16 +  0*16 +  0

@dg00g   - U+050000 = 0x10000 + 1*16*32*32*16 +  0*32*32*16 +  0*32*16 +  0*16 +  0

@eg00g   - U+090000 = 0x10000 + 2*16*32*32*16 +  0*32*32*16 +  0*32*16 +  0*16 +  0

@fg00g   - U+0D0000 = 0x10000 + 3*16*32*32*16 +  0*32*32*16 +  0*32*16 +  0*16 +  0

@fvvvv   - U+10FFFF = 0x10000 + 3*16*32*32*16 + 15*32*32*16 + 31*32*16 + 31*16 + 15

Summary

With this proposal the full summary of the encoding components will look as follows:

Pattern                                CodePoints               Comment

-------------------------------------  -----------------------  -------------------

[@][0..9][g..z]                        10*20         = 200      BMP characters with 3.0.0 case folding

[@][g..z][0..9]                        20*10         = 200      BMP characters with 3.0.0 case folding

[@][g..z][a..z]                        20*26         = 520      BMP characters with 3.0.0 case folding

[@][@][a..z]                           1*26          = 26       BMP characters with 3.0.0 case folding

[@][a..z][@]                           1*26          = 26       BMP characters with 3.0.0 case folding

[@][0-9a-f][0-9a-f][0-9a-f][0-9a-f]    16*16*16*16   = 65536    BMP characters without case folding

------------------------------------   -----------------------  ----------------

[@][a-b][g-v][0-9a-v][0-9a-v]          4*16*32*32    = 65536    BMP with new folding

[@][c-f][g-v][0-9a-v][0-9a-v][0-9a-f]  4*16*32*32*16 = 1048576  non-BMP with folding

[@][c-f][g-v][0-9a-v][0-9a-v][g-v]     4*16*32*32*16 = 1048576  non-BMP without folding

The advantages of this proposal:

"non-BMP with folding" covers all non-BMP characters in the range U+010000..U+10FFFF.
Does not introduce new characters into the alphabet

Alexander Barkov added a comment - 2024-09-11 08:08 - edited Table name to file name encoding extension, proposal #2 The old encoding has an unused range: [@][a..f][g..z] 6*20= 120 combinations The idea is to reuse this unused range for the new extensions. BMP characters with new 14.0.0 casefolding Let's encode characters with new casefolding as follows: [@][a-b][g-v][0-9a-v][0-9a-v] where @ - the encoded character marker [a-b] - the first digit (4 values) [g-v] - the second digit (16 values) - determines upper or lower case [0-9a-v] - the third digit (32 values) [0-9a-v] - the fourth digit (32 values) The total encoded sequence length is 5 characters. The encoded sequence represents the Unicode code point of the lower case variant of a character. The second digit [g-v] determines the case: If it is in the range [g-v] , then the character is in the lower case If it is in the range [G-V] , then the character is in the upper case This encoding gives 4*16*32*32=65536 values, which covers the entire BMP range U+0000 to U+FFFF. Examples: @ah81 - U+501 = 0*16*32*32 + 1*32*32 + 8*32 + 1 @aH81 - U+500 = 0*16*32*32 + 1*32*32 + 8*32 + 1, or literally UPPER(U+501) Non-BMP characters with case folding Let's encode non-BMP characters with casefolding as follows: [@][c-f][g-v][0-9a-v][0-9a-v][0-9a-f] where @ - the encoded character marker [c-f] - the first digit (4 values) [g-v] - the second digit (16 values) - determines upper or lower case [0-9a-v] - the third digit (32 values) [0-9a-v] - the fourth digit (32 values) [0-9a-f] - the fivth digit (16 values) The total encoded sequence length is 6 characters. The encoded sequence represents the Unicode code point of the lower case variant of a character. The second digit [g-v] determines the case: If it is in the range [g-v] , then the character is in the lower case If it is in the range [G-v] , then the character is in the upper case This encoding gives 4*16*32*32*16=1048576 values, which exactly covers the entire non-BMP range U+010000 to U+10FFFF. Non-BMP characters without folding Let's encode non-BMP characters without casefolding as follows: [@][c-f][g-v][0-9a-v][0-9a-v][g-v] where @ - the encoded character marker [c-f] - the first digit (4 values) [g-v] - the second digit (16 values) [0-9a-v] - the third digit (32 values) [0-9a-v] - the fourth digit (32 values) [g-v] - the fifth digit (16 values) The total encoded sequence length is 6 characters. This encoding gives 4*16*32*32*16=1048576 values, which exactly covers the entire non-BMP range U+010000 to U+10FFFF. Examples @cg00g - U+010000 = 0x10000 + 0*16*32*32*16 + 0*32*32*16 + 0*32*16 + 0*16 + 0 @dg00g - U+050000 = 0x10000 + 1*16*32*32*16 + 0*32*32*16 + 0*32*16 + 0*16 + 0 @eg00g - U+090000 = 0x10000 + 2*16*32*32*16 + 0*32*32*16 + 0*32*16 + 0*16 + 0 @fg00g - U+0D0000 = 0x10000 + 3*16*32*32*16 + 0*32*32*16 + 0*32*16 + 0*16 + 0 @fvvvv - U+10FFFF = 0x10000 + 3*16*32*32*16 + 15*32*32*16 + 31*32*16 + 31*16 + 15 Summary With this proposal the full summary of the encoding components will look as follows: Pattern CodePoints Comment ------------------------------------- ----------------------- ------------------- [@][0..9][g..z] 10*20 = 200 BMP characters with 3.0.0 case folding [@][g..z][0..9] 20*10 = 200 BMP characters with 3.0.0 case folding [@][g..z][a..z] 20*26 = 520 BMP characters with 3.0.0 case folding [@][@][a..z] 1*26 = 26 BMP characters with 3.0.0 case folding [@][a..z][@] 1*26 = 26 BMP characters with 3.0.0 case folding [@][0-9a-f][0-9a-f][0-9a-f][0-9a-f] 16*16*16*16 = 65536 BMP characters without case folding ------------------------------------ ----------------------- ---------------- [@][a-b][g-v][0-9a-v][0-9a-v] 4*16*32*32 = 65536 BMP with new folding [@][c-f][g-v][0-9a-v][0-9a-v][0-9a-f] 4*16*32*32*16 = 1048576 non-BMP with folding [@][c-f][g-v][0-9a-v][0-9a-v][g-v] 4*16*32*32*16 = 1048576 non-BMP without folding The advantages of this proposal: "non-BMP with folding" covers all non-BMP characters in the range U+010000..U+10FFFF. Does not introduce new characters into the alphabet

Alexander Barkov made changes - 2024-09-11 08:14

Description

Identifier names can't contain characters outside of BMP, i.e they are restricted to utf8mb3
Here is a relevant part of Slack discussion on why it is so, and on possible fix

{noformat}
... discussion on character_set_system and why it is utf8mb3...
....
bar Oct 13th, 2021 at 4:23 PM
@wlad yes, it's hard-coded. I think the biggest problem is to implement table-name-to-file-name encoding for non-BMP characters. Should be doable but needs some time.
5 replies

wlad 3 months ago
so, a surrogate pair won't do? like, @d801@dc37

bar 3 months ago
for characters that do not have lower/upper variants, it will do.

bar 3 months ago
It will actually do for characters that have lower/upper variants as well.

bar 3 months ago
Thanks for the good idea.
{noformat}

Identifier names can't contain characters outside of BMP, i.e they are restricted to utf8mb3
Here is a relevant part of Slack discussion on why it is so, and on possible fix

{noformat}
... discussion on character_set_system and why it is utf8mb3...
....
bar Oct 13th, 2021 at 4:23 PM
@wlad yes, it's hard-coded. I think the biggest problem is to implement table-name-to-file-name encoding for non-BMP characters. Should be doable but needs some time.
5 replies

wlad 3 months ago
so, a surrogate pair won't do? like, @d801@dc37

bar 3 months ago
for characters that do not have lower/upper variants, it will do.

bar 3 months ago
It will actually do for characters that have lower/upper variants as well.

bar 3 months ago
Thanks for the good idea.
{noformat}

h1. Table name to file name extensions overview

We need to extend the encoding to support:

- new case folding in the BMP range appeared between Unicode-3.0.0 (used in the first version of the encoding) and Unicode-14.0.0 (the current version in MariaDB).
- non-BMP characters in the range U+010000 to U+10FFFF without case folding
- non-BMP characters in the range U+010000 to U+10FFFF with case folding

Various proposals go in separate comments below.

Alexander Barkov made changes - 2024-09-13 10:00

Description

Identifier names can't contain characters outside of BMP, i.e they are restricted to utf8mb3
Here is a relevant part of Slack discussion on why it is so, and on possible fix

{noformat}
... discussion on character_set_system and why it is utf8mb3...
....
bar Oct 13th, 2021 at 4:23 PM
@wlad yes, it's hard-coded. I think the biggest problem is to implement table-name-to-file-name encoding for non-BMP characters. Should be doable but needs some time.
5 replies

wlad 3 months ago
so, a surrogate pair won't do? like, @d801@dc37

bar 3 months ago
for characters that do not have lower/upper variants, it will do.

bar 3 months ago
It will actually do for characters that have lower/upper variants as well.

bar 3 months ago
Thanks for the good idea.
{noformat}

h1. Table name to file name extensions overview

We need to extend the encoding to support:

- new case folding in the BMP range appeared between Unicode-3.0.0 (used in the first version of the encoding) and Unicode-14.0.0 (the current version in MariaDB).
- non-BMP characters in the range U+010000 to U+10FFFF without case folding
- non-BMP characters in the range U+010000 to U+10FFFF with case folding

Various proposals go in separate comments below.

Identifier names can't contain characters outside of BMP, i.e they are restricted to utf8mb3
Here is a relevant part of Slack discussion on why it is so, and on possible fix

{noformat}
... discussion on character_set_system and why it is utf8mb3...
....
bar Oct 13th, 2021 at 4:23 PM
@wlad yes, it's hard-coded. I think the biggest problem is to implement table-name-to-file-name encoding for non-BMP characters. Should be doable but needs some time.
5 replies

wlad 3 months ago
so, a surrogate pair won't do? like, @d801@dc37

bar 3 months ago
for characters that do not have lower/upper variants, it will do.

bar 3 months ago
It will actually do for characters that have lower/upper variants as well.

bar 3 months ago
Thanks for the good idea.
{noformat}

h1. Table name to file name extensions overview

We need to extend the encoding to support:

- new case folding in the BMP range appeared between Unicode-3.0.0 (used in the first version of the encoding) and Unicode-14.0.0 (the current version in MariaDB).
- non-BMP characters in the range U+010000 to U+10FFFF without case folding
- non-BMP characters in the range U+010000 to U+10FFFF with case folding

Various proposals go in separate comments below.

h1. Unicode planes allowed in identifiers
As of version 14.0.0 (and 16.0.0) Unicode plane assignment looks as follow:

{noformat}
PlaneN Code Range Name
------ ------------ --------------------------------------
0 0000-FFFF Basic Multilingual Plane
1 10000-1FFFF Supplementary Multilingual Plane
2 20000-2FFFF Supplementary Ideographic Plane
3 30000-3FFFF Tertiary Ideographic Plane
4-13 unassigned
14 E0000-EFFFF Supplementary Special-purpose Plane
15-16 F0000-10FFFF Supplementary Private Use Area planes
{noformat}

It is an open question whether we should support unassigned planes in identifiers (and in table file name encoding), or should limit to assigned planes only.

Alexander Barkov made changes - 2024-09-13 10:06

Description

Identifier names can't contain characters outside of BMP, i.e they are restricted to utf8mb3
Here is a relevant part of Slack discussion on why it is so, and on possible fix

{noformat}
... discussion on character_set_system and why it is utf8mb3...
....
bar Oct 13th, 2021 at 4:23 PM
@wlad yes, it's hard-coded. I think the biggest problem is to implement table-name-to-file-name encoding for non-BMP characters. Should be doable but needs some time.
5 replies

wlad 3 months ago
so, a surrogate pair won't do? like, @d801@dc37

bar 3 months ago
for characters that do not have lower/upper variants, it will do.

bar 3 months ago
It will actually do for characters that have lower/upper variants as well.

bar 3 months ago
Thanks for the good idea.
{noformat}

h1. Table name to file name extensions overview

We need to extend the encoding to support:

- new case folding in the BMP range appeared between Unicode-3.0.0 (used in the first version of the encoding) and Unicode-14.0.0 (the current version in MariaDB).
- non-BMP characters in the range U+010000 to U+10FFFF without case folding
- non-BMP characters in the range U+010000 to U+10FFFF with case folding

Various proposals go in separate comments below.

h1. Unicode planes allowed in identifiers
As of version 14.0.0 (and 16.0.0) Unicode plane assignment looks as follow:

{noformat}
PlaneN Code Range Name
------ ------------ --------------------------------------
0 0000-FFFF Basic Multilingual Plane
1 10000-1FFFF Supplementary Multilingual Plane
2 20000-2FFFF Supplementary Ideographic Plane
3 30000-3FFFF Tertiary Ideographic Plane
4-13 unassigned
14 E0000-EFFFF Supplementary Special-purpose Plane
15-16 F0000-10FFFF Supplementary Private Use Area planes
{noformat}

It is an open question whether we should support unassigned planes in identifiers (and in table file name encoding), or should limit to assigned planes only.

Identifier names can't contain characters outside of BMP, i.e they are restricted to utf8mb3
Here is a relevant part of Slack discussion on why it is so, and on possible fix

{noformat}
... discussion on character_set_system and why it is utf8mb3...
....
bar Oct 13th, 2021 at 4:23 PM
@wlad yes, it's hard-coded. I think the biggest problem is to implement table-name-to-file-name encoding for non-BMP characters. Should be doable but needs some time.
5 replies

wlad 3 months ago
so, a surrogate pair won't do? like, @d801@dc37

bar 3 months ago
for characters that do not have lower/upper variants, it will do.

bar 3 months ago
It will actually do for characters that have lower/upper variants as well.

bar 3 months ago
Thanks for the good idea.
{noformat}

h1. Table name to file name extensions overview

We need to extend the encoding to support:

- new case folding in the BMP range appeared between Unicode-3.0.0 (used in the first version of the encoding) and Unicode-14.0.0 (the current version in MariaDB).
- non-BMP characters in the range U+010000 to U+10FFFF without case folding
- non-BMP characters in the range U+010000 to U+10FFFF with case folding

Various proposals go in separate comments below.

h1. Unicode planes allowed in identifiers
As of version 14.0.0 (and 16.0.0) Unicode plane assignment looks as follow:

{noformat}
PlaneN Code Range Abbr Name
------ ------------ ---- --------------------------------------
0 0000-FFFF BMP Basic Multilingual Plane
1 10000-1FFFF SMP Supplementary Multilingual Plane
2 20000-2FFFF SIP Supplementary Ideographic Plane
3 30000-3FFFF TIP Tertiary Ideographic Plane
4-13 40000-DFFFF --- unassigned
14 E0000-EFFFF SSP Supplementary Special-purpose Plane
15-16 F0000-10FFFF SPUA-A/B Supplementary Private Use Area planes
{noformat}

It is an open question whether we should support unassigned planes in identifiers (and in table file name encoding), or should limit to assigned planes only.

Alexander Barkov made changes - 2024-09-16 10:16

Description

Identifier names can't contain characters outside of BMP, i.e they are restricted to utf8mb3
Here is a relevant part of Slack discussion on why it is so, and on possible fix

{noformat}
... discussion on character_set_system and why it is utf8mb3...
....
bar Oct 13th, 2021 at 4:23 PM
@wlad yes, it's hard-coded. I think the biggest problem is to implement table-name-to-file-name encoding for non-BMP characters. Should be doable but needs some time.
5 replies

wlad 3 months ago
so, a surrogate pair won't do? like, @d801@dc37

bar 3 months ago
for characters that do not have lower/upper variants, it will do.

bar 3 months ago
It will actually do for characters that have lower/upper variants as well.

bar 3 months ago
Thanks for the good idea.
{noformat}

h1. Table name to file name extensions overview

We need to extend the encoding to support:

- new case folding in the BMP range appeared between Unicode-3.0.0 (used in the first version of the encoding) and Unicode-14.0.0 (the current version in MariaDB).
- non-BMP characters in the range U+010000 to U+10FFFF without case folding
- non-BMP characters in the range U+010000 to U+10FFFF with case folding

Various proposals go in separate comments below.

h1. Unicode planes allowed in identifiers
As of version 14.0.0 (and 16.0.0) Unicode plane assignment looks as follow:

{noformat}
PlaneN Code Range Abbr Name
------ ------------ ---- --------------------------------------
0 0000-FFFF BMP Basic Multilingual Plane
1 10000-1FFFF SMP Supplementary Multilingual Plane
2 20000-2FFFF SIP Supplementary Ideographic Plane
3 30000-3FFFF TIP Tertiary Ideographic Plane
4-13 40000-DFFFF --- unassigned
14 E0000-EFFFF SSP Supplementary Special-purpose Plane
15-16 F0000-10FFFF SPUA-A/B Supplementary Private Use Area planes
{noformat}

It is an open question whether we should support unassigned planes in identifiers (and in table file name encoding), or should limit to assigned planes only.

Identifier names can't contain characters outside of BMP, i.e they are restricted to utf8mb3
Here is a relevant part of Slack discussion on why it is so, and on possible fix

{noformat}
... discussion on character_set_system and why it is utf8mb3...
....
bar Oct 13th, 2021 at 4:23 PM
@wlad yes, it's hard-coded. I think the biggest problem is to implement table-name-to-file-name encoding for non-BMP characters. Should be doable but needs some time.
5 replies

wlad 3 months ago
so, a surrogate pair won't do? like, @d801@dc37

bar 3 months ago
for characters that do not have lower/upper variants, it will do.

bar 3 months ago
It will actually do for characters that have lower/upper variants as well.

bar 3 months ago
Thanks for the good idea.
{noformat}

h1. Table name to file name extensions overview

We need to extend the encoding to support:

- new case folding in the BMP range appeared between Unicode-3.0.0 (used in the first version of the encoding) and Unicode-14.0.0 (the current version in MariaDB).
- non-BMP characters in the range U+010000 to U+10FFFF without case folding
- non-BMP characters in the range U+010000 to U+10FFFF with case folding

Various proposals go in separate comments below.

h1. Unicode planes allowed in identifiers
As of version 14.0.0 (and 16.0.0) Unicode plane assignment looks as follow:

{noformat}
PlaneN Code Range Abbr Name
------ ------------ ---- --------------------------------------
0 0000-FFFF BMP Basic Multilingual Plane
1 10000-1FFFF SMP Supplementary Multilingual Plane
2 20000-2FFFF SIP Supplementary Ideographic Plane
3 30000-3FFFF TIP Tertiary Ideographic Plane
4-13 40000-DFFFF --- unassigned
14 E0000-EFFFF SSP Supplementary Special-purpose Plane
15-16 F0000-10FFFF SPUA-A/B Supplementary Private Use Area planes
{noformat}

It is an open question whether we should support unassigned planes in identifiers (and in table file name encoding), or should limit to assigned planes only.

h1. Characters with unsafe casefolding

Since the version 3.0.0, Unicode added casefolding rules for a few characters which is not round trip safe: UPPER(ch)<>UPPER(LOWER(ch))

These characters can be extracted using the following script:
{code:sql}
CREATE OR REPLACE VIEW v1 AS
SELECT
  seq,
  char(seq using utf32) collate utf32_uca1400_ai_ci AS ch
FROM seq_1_to_1114111;

SELECT
  ch,
  hex(ch) AS cu,
  upper(ch) AS u,
  hex(upper(ch)) AS uc,
  upper(lower(ch)) u2,
  hex(upper(lower(ch))) AS u2c
FROM v1
WHERE upper(ch) collate utf32_bin<>upper(lower(ch)) collate utf32_bin;
{code}

{noformat}
+------+----------+------+----------+------+----------+
| ch | cu | u | uc | u2 | u2c |
+------+----------+------+----------+------+----------+
| İ | 00000130 | İ | 00000130 | I | 00000049 | LATIN CAPITAL LETTER I WITH DOT ABOVE
| ϴ | 000003F4 | ϴ | 000003F4 | Θ | 00000398 | GREEK CAPITAL THETA SYMBOL
| ẞ | 00001E9E | ẞ | 00001E9E | ß | 000000DF | LATIN CAPITAL LETTER SHARP S
| Ω | 00002126 | Ω | 00002126 | Ω | 000003A9 | OHM SIGN
| K | 0000212A | K | 0000212A | K | 0000004B | KELVIN SIGN
| Å | 0000212B | Å | 0000212B | Å | 000000C5 | ANGSTROM SIGN
+------+----------+------+----------+------+----------+
{noformat}

Let's consider this pair an example:
- UPPER(U+2126 OHM SIGN) = U+2126 OHM SIGN
- UPPER(LOWER(U+2126 OHM SIGN)) = U+03A9 GREEK CAPITAL LETTER OMEGA

There are two options how to encode these characters
- As not having case folding. It will preserve the exact character OHM SIGN. But OHM SIGN and GREEK SMALL LETTER OMEGA will be two distinct characters even on a case insensitive file system.
- As having case folding. In this case OHM SIGN will be replaced GREEK CAPITAL LETTER OMEGA. It will equal to GREEK SMALL LETTER OMEGA on a case insensitive file system.

Alexander Barkov made changes - 2024-09-16 10:17

Description

Identifier names can't contain characters outside of BMP, i.e they are restricted to utf8mb3
Here is a relevant part of Slack discussion on why it is so, and on possible fix

{noformat}
... discussion on character_set_system and why it is utf8mb3...
....
bar Oct 13th, 2021 at 4:23 PM
@wlad yes, it's hard-coded. I think the biggest problem is to implement table-name-to-file-name encoding for non-BMP characters. Should be doable but needs some time.
5 replies

wlad 3 months ago
so, a surrogate pair won't do? like, @d801@dc37

bar 3 months ago
for characters that do not have lower/upper variants, it will do.

bar 3 months ago
It will actually do for characters that have lower/upper variants as well.

bar 3 months ago
Thanks for the good idea.
{noformat}

h1. Table name to file name extensions overview

We need to extend the encoding to support:

- new case folding in the BMP range appeared between Unicode-3.0.0 (used in the first version of the encoding) and Unicode-14.0.0 (the current version in MariaDB).
- non-BMP characters in the range U+010000 to U+10FFFF without case folding
- non-BMP characters in the range U+010000 to U+10FFFF with case folding

Various proposals go in separate comments below.

h1. Unicode planes allowed in identifiers
As of version 14.0.0 (and 16.0.0) Unicode plane assignment looks as follow:

{noformat}
PlaneN Code Range Abbr Name
------ ------------ ---- --------------------------------------
0 0000-FFFF BMP Basic Multilingual Plane
1 10000-1FFFF SMP Supplementary Multilingual Plane
2 20000-2FFFF SIP Supplementary Ideographic Plane
3 30000-3FFFF TIP Tertiary Ideographic Plane
4-13 40000-DFFFF --- unassigned
14 E0000-EFFFF SSP Supplementary Special-purpose Plane
15-16 F0000-10FFFF SPUA-A/B Supplementary Private Use Area planes
{noformat}

It is an open question whether we should support unassigned planes in identifiers (and in table file name encoding), or should limit to assigned planes only.

h1. Characters with unsafe casefolding

Since the version 3.0.0, Unicode added casefolding rules for a few characters which is not round trip safe: UPPER(ch)<>UPPER(LOWER(ch))

These characters can be extracted using the following script:
{code:sql}
CREATE OR REPLACE VIEW v1 AS
SELECT
  seq,
  char(seq using utf32) collate utf32_uca1400_ai_ci AS ch
FROM seq_1_to_1114111;

SELECT
  ch,
  hex(ch) AS cu,
  upper(ch) AS u,
  hex(upper(ch)) AS uc,
  upper(lower(ch)) u2,
  hex(upper(lower(ch))) AS u2c
FROM v1
WHERE upper(ch) collate utf32_bin<>upper(lower(ch)) collate utf32_bin;
{code}

{noformat}
+------+----------+------+----------+------+----------+
| ch | cu | u | uc | u2 | u2c |
+------+----------+------+----------+------+----------+
| İ | 00000130 | İ | 00000130 | I | 00000049 | LATIN CAPITAL LETTER I WITH DOT ABOVE
| ϴ | 000003F4 | ϴ | 000003F4 | Θ | 00000398 | GREEK CAPITAL THETA SYMBOL
| ẞ | 00001E9E | ẞ | 00001E9E | ß | 000000DF | LATIN CAPITAL LETTER SHARP S
| Ω | 00002126 | Ω | 00002126 | Ω | 000003A9 | OHM SIGN
| K | 0000212A | K | 0000212A | K | 0000004B | KELVIN SIGN
| Å | 0000212B | Å | 0000212B | Å | 000000C5 | ANGSTROM SIGN
+------+----------+------+----------+------+----------+
{noformat}

Let's consider this pair an example:
- UPPER(U+2126 OHM SIGN) = U+2126 OHM SIGN
- UPPER(LOWER(U+2126 OHM SIGN)) = U+03A9 GREEK CAPITAL LETTER OMEGA

There are two options how to encode these characters
- As not having case folding. It will preserve the exact character OHM SIGN. But OHM SIGN and GREEK SMALL LETTER OMEGA will be two distinct characters even on a case insensitive file system.
- As having case folding. In this case OHM SIGN will be replaced GREEK CAPITAL LETTER OMEGA. It will equal to GREEK SMALL LETTER OMEGA on a case insensitive file system.

Identifier names can't contain characters outside of BMP, i.e they are restricted to utf8mb3
Here is a relevant part of Slack discussion on why it is so, and on possible fix

{noformat}
... discussion on character_set_system and why it is utf8mb3...
....
bar Oct 13th, 2021 at 4:23 PM
@wlad yes, it's hard-coded. I think the biggest problem is to implement table-name-to-file-name encoding for non-BMP characters. Should be doable but needs some time.
5 replies

wlad 3 months ago
so, a surrogate pair won't do? like, @d801@dc37

bar 3 months ago
for characters that do not have lower/upper variants, it will do.

bar 3 months ago
It will actually do for characters that have lower/upper variants as well.

bar 3 months ago
Thanks for the good idea.
{noformat}

h1. Table name to file name extensions overview

We need to extend the encoding to support:

- new case folding in the BMP range appeared between Unicode-3.0.0 (used in the first version of the encoding) and Unicode-14.0.0 (the current version in MariaDB).
- non-BMP characters in the range U+010000 to U+10FFFF without case folding
- non-BMP characters in the range U+010000 to U+10FFFF with case folding

Various proposals go in separate comments below.

h1. Unicode planes allowed in identifiers
As of version 14.0.0 (and 16.0.0) Unicode plane assignment looks as follow:

{noformat}
PlaneN Code Range Abbr Name
------ ------------ ---- --------------------------------------
0 0000-FFFF BMP Basic Multilingual Plane
1 10000-1FFFF SMP Supplementary Multilingual Plane
2 20000-2FFFF SIP Supplementary Ideographic Plane
3 30000-3FFFF TIP Tertiary Ideographic Plane
4-13 40000-DFFFF --- unassigned
14 E0000-EFFFF SSP Supplementary Special-purpose Plane
15-16 F0000-10FFFF SPUA-A/B Supplementary Private Use Area planes
{noformat}

It is an open question whether we should support unassigned planes in identifiers (and in table file name encoding), or should limit to assigned planes only.

h1. Characters with unsafe casefolding

Since the version 3.0.0, Unicode added casefolding rules for a few characters which is not round trip safe: UPPER(ch)<>UPPER(LOWER(ch))

These characters can be extracted using the following script:
{code:sql}
CREATE OR REPLACE VIEW v1 AS
SELECT
  seq,
  char(seq using utf32) collate utf32_uca1400_ai_ci AS ch
FROM seq_1_to_1114111;

SELECT
  ch,
  hex(ch) AS cu,
  upper(ch) AS u,
  hex(upper(ch)) AS uc,
  upper(lower(ch)) u2,
  hex(upper(lower(ch))) AS u2c
FROM v1
WHERE upper(ch) collate utf32_bin<>upper(lower(ch)) collate utf32_bin;
{code}

{noformat}
+------+----------+------+----------+------+----------+
| ch | cu | u | uc | u2 | u2c |
+------+----------+------+----------+------+----------+
| İ | 00000130 | İ | 00000130 | I | 00000049 | LATIN CAPITAL LETTER I WITH DOT ABOVE
| ϴ | 000003F4 | ϴ | 000003F4 | Θ | 00000398 | GREEK CAPITAL THETA SYMBOL
| ẞ | 00001E9E | ẞ | 00001E9E | ß | 000000DF | LATIN CAPITAL LETTER SHARP S
| Ω | 00002126 | Ω | 00002126 | Ω | 000003A9 | OHM SIGN
| K | 0000212A | K | 0000212A | K | 0000004B | KELVIN SIGN
| Å | 0000212B | Å | 0000212B | Å | 000000C5 | ANGSTROM SIGN
+------+----------+------+----------+------+----------+
{noformat}

Let's consider this pair as an example:
- UPPER(U+2126 OHM SIGN) = U+2126 OHM SIGN
- UPPER(LOWER(U+2126 OHM SIGN)) = U+03A9 GREEK CAPITAL LETTER OMEGA

There are two options how to encode these characters
- As not having case folding. It will preserve the exact character OHM SIGN. But OHM SIGN and GREEK SMALL LETTER OMEGA will be two distinct characters even on a case insensitive file system.
- As having case folding. In this case OHM SIGN will be replaced GREEK CAPITAL LETTER OMEGA. It will equal to GREEK SMALL LETTER OMEGA on a case insensitive file system.

Sergei Golubchik made changes - 2024-09-24 13:53

Fix Version/s		11.8 [ 29921 ]
Fix Version/s	11.7 [ 29815 ]

Alexander Barkov added a comment - 2024-10-25 09:44

Upgrade issues.

Suppose two BMP characters U+AAAA and U+BBBB:

where not case variants of the same character in the old encoding
but become case variants of the same characters in Unicode-14.0.0

then mariadb-upgrade should not touch tables with such characters and display them as '#mdb1107#....', so the user can rename them manually.

Alexander Barkov added a comment - 2024-10-25 09:44 Upgrade issues. Suppose two BMP characters U+AAAA and U+BBBB: where not case variants of the same character in the old encoding but become case variants of the same characters in Unicode-14.0.0 then mariadb-upgrade should not touch tables with such characters and display them as '#mdb1107#....', so the user can rename them manually.

Alexander Barkov made changes - 2024-10-25 10:12

Link

This issue is blocked by MDEV-35255 [ MDEV-35255 ]

Ralf Gebhardt made changes - 2024-11-19 18:54

Priority

Critical [ 2 ]

Minor [ 4 ]

Ralf Gebhardt made changes - 2024-11-19 18:54

Fix Version/s

11.8 [ 29921 ]

Ralf Gebhardt added a comment - 2024-11-19 18:59

bar, in today's team lead call we decided to remove this change from the roadmap.

Being able to use utf8mb4 in identifiers is a low use case and value compared to the possible drawbacks.

Ralf Gebhardt added a comment - 2024-11-19 18:59 bar , in today's team lead call we decided to remove this change from the roadmap. Being able to use utf8mb4 in identifiers is a low use case and value compared to the possible drawbacks.

MariaDB Server

Allow full utf8mb4 for identifiers

Details

Description

Table name to file name extensions overview

Unicode planes allowed in identifiers

Characters with unsafe casefolding

Attachments

Attachments

Issue Links

Activity

Table name to file name encoding extension, proposal #1

non-BMP Encoding without case folding

BMP characters with new case folding mappings

non-BMP characters with case folding.

Summary of the encoding components

Table name to file name encoding extension, proposal #2

BMP characters with new 14.0.0 casefolding

Non-BMP characters with case folding

Non-BMP characters without folding

Summary

Upgrade issues.

People

Dates

Git Integration