Details

    • New Feature
    • Status: Stalled
    • Minor
    • Resolution: Unresolved
    • None
    • Character Sets
    • None

    Description

      Identifier names can't contain characters outside of the BMP, i.e. they are restricted to utf8mb3.
      Here is a relevant part of the Slack discussion on why this is so, and on a possible fix:

      ... discussion on character_set_system  and why it is utf8mb3...
      ....
      bar Oct 13th, 2021 at 4:23 PM
      @wlad yes, it's hard-coded. I think the biggest problem is to implement table-name-to-file-name encoding for non-BMP characters. Should be doable but needs some time.
      5 replies
       
      wlad  3 months ago
      so, a surrogate pair won't do? like, @d801@dc37
       
      bar  3 months ago
      for characters that do not have lower/upper variants, it will do.
       
      bar  3 months ago
      It will actually do for characters that have lower/upper variants as well.
       
      bar  3 months ago
      Thanks for the good idea.
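
The surrogate-pair idea from the thread can be sketched as follows. This is only an illustration of standard UTF-16 surrogate arithmetic; the helper name `surrogate_file_name` and the exact `@xxxx@xxxx` spelling are assumptions based on the `@d801@dc37` example quoted above, not an agreed design.

```python
def to_surrogate_pair(cp: int) -> tuple:
    """Split a non-BMP code point into UTF-16 high and low surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("expected a code point in U+10000..U+10FFFF")
    offset = cp - 0x10000            # 20-bit value
    high = 0xD800 + (offset >> 10)   # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits
    return high, low


def surrogate_file_name(cp: int) -> str:
    """Hypothetical helper: spell the pair as two '@xxxx' units."""
    high, low = to_surrogate_pair(cp)
    return f"@{high:04x}@{low:04x}"


print(surrogate_file_name(0x10437))  # @d801@dc37
```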
      

      Table name to file name extensions overview

      We need to extend the encoding to support:

      • new case folding mappings in the BMP range that appeared between Unicode-3.0.0 (used in the first version of the encoding) and Unicode-14.0.0 (the current version in MariaDB).
      • non-BMP characters in the range U+010000 to U+10FFFF without case folding
      • non-BMP characters in the range U+010000 to U+10FFFF with case folding

      Various proposals go in separate comments below.

      Unicode planes allowed in identifiers

      As of version 14.0.0 (and 16.0.0), the Unicode plane assignment looks as follows:

      PlaneN  Code Range    Abbr      Name
      ------  ------------  ----      --------------------------------------
      0       0000-FFFF     BMP       Basic Multilingual Plane
      1       10000-1FFFF   SMP       Supplementary Multilingual Plane
      2       20000-2FFFF   SIP       Supplementary Ideographic Plane
      3       30000-3FFFF   TIP       Tertiary Ideographic Plane
      4-13    40000-DFFFF   ---       unassigned
      14      E0000-EFFFF   SSP       Supplementary Special-purpose Plane
      15-16   F0000-10FFFF  SPUA-A/B  Supplementary Private Use Area planes
      

      It is an open question whether we should support unassigned planes in identifiers (and in table file name encoding), or should limit to assigned planes only.
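
As a small illustration of the plane table, the plane number of a code point is simply its value shifted right by 16 bits. The names `plane_of` and `plane_abbr` are hypothetical helpers; the ranges are copied from the table above.

```python
# (first plane, last plane, abbreviation), as listed in the table above
PLANES = [
    (0, 0, "BMP"),
    (1, 1, "SMP"),
    (2, 2, "SIP"),
    (3, 3, "TIP"),
    (4, 13, "unassigned"),
    (14, 14, "SSP"),
    (15, 16, "SPUA-A/B"),
]


def plane_of(cp: int) -> int:
    """Return the Unicode plane number (0..16) of a code point."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    return cp >> 16


def plane_abbr(cp: int) -> str:
    """Return the plane abbreviation from the table above."""
    plane = plane_of(cp)
    for first, last, abbr in PLANES:
        if first <= plane <= last:
            return abbr
    raise AssertionError("unreachable: every plane 0..16 is listed")


print(plane_of(0x10428), plane_abbr(0x10428))  # 1 SMP
```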

      Characters with unsafe casefolding

      Since version 3.0.0, Unicode has added casefolding rules for a few characters that are not round-trip safe, i.e. UPPER(ch)<>UPPER(LOWER(ch)).

      These characters can be extracted using the following script:

      CREATE OR REPLACE VIEW v1 AS
      SELECT
        seq,
        char(seq using utf32) collate utf32_uca1400_ai_ci AS ch
      FROM seq_1_to_1114111;
       
      SELECT
        ch,
        hex(ch) AS cu,
        upper(ch) AS u,
        hex(upper(ch)) AS uc,
        upper(lower(ch)) u2,
        hex(upper(lower(ch))) AS u2c
      FROM v1
      WHERE upper(ch) collate utf32_bin<>upper(lower(ch)) collate utf32_bin;
      

      +------+----------+------+----------+------+----------+
      | ch   | cu       | u    | uc       | u2   | u2c      |
      +------+----------+------+----------+------+----------+
      | İ    | 00000130 | İ    | 00000130 | I    | 00000049 | LATIN CAPITAL LETTER I WITH DOT ABOVE
      | ϴ    | 000003F4 | ϴ    | 000003F4 | Θ    | 00000398 | GREEK CAPITAL THETA SYMBOL
      | ẞ    | 00001E9E | ẞ    | 00001E9E | ß    | 000000DF | LATIN CAPITAL LETTER SHARP S
      | Ω    | 00002126 | Ω    | 00002126 | Ω    | 000003A9 | OHM SIGN
      | K    | 0000212A | K    | 0000212A | K    | 0000004B | KELVIN SIGN
      | Å    | 0000212B | Å    | 0000212B | Å    | 000000C5 | ANGSTROM SIGN
      +------+----------+------+----------+------+----------+
      

      Let's consider this pair as an example:

      • UPPER(U+2126 OHM SIGN) = U+2126 OHM SIGN
      • UPPER(LOWER(U+2126 OHM SIGN)) = U+03A9 GREEK CAPITAL LETTER OMEGA

      There are two options for encoding these characters:

      • As not having case folding. This preserves the exact character OHM SIGN. But OHM SIGN and GREEK SMALL LETTER OMEGA will be two distinct characters even on a case insensitive file system.
      • As having case folding. In this case OHM SIGN will be replaced with GREEK CAPITAL LETTER OMEGA. It will be equal to GREEK SMALL LETTER OMEGA on a case insensitive file system.
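
The round-trip property can also be checked outside the server. The sketch below uses Python's built-in `str.upper()` and `str.lower()`, which apply full case mappings from the Unicode database bundled with the interpreter, so the exact set of affected characters may differ from what the SQL query above returns in MariaDB. The function name is an assumption for illustration.

```python
def upper_is_round_trip_safe(ch: str) -> bool:
    """True if UPPER(ch) equals UPPER(LOWER(ch)) for this character."""
    return ch.upper() == ch.lower().upper()


ohm = "\u2126"  # OHM SIGN
print(upper_is_round_trip_safe(ohm))        # False
print(f"U+{ord(ohm.upper()):04X}")          # U+2126
print(f"U+{ord(ohm.lower().upper()):04X}")  # U+03A9
```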

          Activity

            bar Alexander Barkov added a comment - - edited

            Table name to file name encoding extension, proposal #1

            non-BMP Encoding without case folding

            Let's encode non-BMP characters which do not have case folding as follows:

            [@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v]
            

            where:

             @        - the encoded character marker
             +        - the marker for non-BMP character without case folding
             [0-9a-v] - the first digit  (32 values)
             [0-9a-v] - the second digit (32 values)
             [0-9a-v] - the third digit  (32 values)
             [0-9a-v] - the fourth digit (32 values)
            

            The total sequence length is 6 characters.

            This encoding gives a total of 32*32*32*32 = 1048576 values.
            It covers exactly all non-BMP characters U+010000 to U+10FFFF.

            Examples

            @+0000  - U+010000 = 0x10000 +   0*(32^3) +  0*(32^2) +  0*(32^1) +  0
            @+1000  - U+018000 = 0x10000 +   1*(32^3) +  0*(32^2) +  0*(32^1) +  0
            @+aaaa  - U+06294A = 0x10000 +  10*(32^3) + 10*(32^2) + 10*(32^1) + 10
            @+vvvv  - U+10FFFF = 0x10000 +  31*(32^3) + 31*(32^2) + 31*(32^1) + 31
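
A minimal sketch of this base-32 form, assuming the digit alphabet `0-9a-v` maps to the values 0..31 in order (which is what the examples above imply). The function names are hypothetical.

```python
BASE32 = "0123456789abcdefghijklmnopqrstuv"  # digit values 0..31


def encode_nonbmp_nofold(cp: int) -> str:
    """Encode a non-BMP code point as '@+' followed by four base-32 digits."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("expected a code point in U+10000..U+10FFFF")
    n = cp - 0x10000  # 20-bit offset, 5 bits per digit
    digits = [BASE32[(n >> shift) & 0x1F] for shift in (15, 10, 5, 0)]
    return "@+" + "".join(digits)


def decode_nonbmp_nofold(seq: str) -> int:
    """Inverse of encode_nonbmp_nofold."""
    if len(seq) != 6 or not seq.startswith("@+"):
        raise ValueError("expected '@+' followed by four digits")
    n = 0
    for c in seq[2:]:
        n = n * 32 + BASE32.index(c)  # raises ValueError on a bad digit
    return 0x10000 + n


for cp in (0x10000, 0x18000, 0x6294A, 0x10FFFF):
    print(encode_nonbmp_nofold(cp), f"U+{cp:06X}")
```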
            

            BMP characters with new case folding mappings

            Unicode version 14.0.0 has more casefolding mappings in addition to those that existed in Unicode-3.0.0 (used in the original version of the file name encoding).

            Let's encode such characters with new casefolding as follows:

            [@][0-9a-f][0-9a-f][g-v][0-9a-f]
            

            where

             @        - the encoded character marker
             [0-9a-f] - the first digit  (16 values)
             [0-9a-f] - the second digit (16 values)
             [g-v]    - the third digit  (16 values) - determines upper or lower case
             [0-9a-f] - the fourth digit (16 values)
            

            The total encoded sequence length is 5 characters.

            The encoded sequence represents the Unicode code point of the lower case variant of a character.

            The third digit [g-v] determines the case:

            • If it is in the range [g-v], then the character is in the lower case
            • If it is in the range [G-V], then the character is in the upper case

            This encoding gives 16*16*16*16=65536 values, which covers the entire BMP range U+0000 to U+FFFF.

            If in the future we switch to a new Unicode version with more casefolding mappings in the BMP range, we'll be able to encode all such characters.

            The mapping between the third digit and its hex value:

            0123456789abcdef - the hex value
            GHIJKLMNOPQRSTUV - the third digit, upper case
            ghijklmnopqrstuv - the third digit, lower case
            

            So for example, the hex value of 7 corresponds to

            • the digit 'N' in case of an upper-case character
            • the digit 'n' in case of a lower-case character.

            For example, Unicode-14.0.0 has the following new casefolding mapping which does not exist in the original file name encoding.

            U+0180 LATIN SMALL LETTER B WITH STROKE
            U+0243 LATIN CAPITAL LETTER B WITH STROKE
            

            These characters will be encoded as:

            @01o0 - the code point U+0180
            @01O0 - the code point U+0243, or literally, UPPER case of the code point U+0180
            

            Another example: Unicode-14.0.0 has the following new casefolding mapping:

            U+0500 CYRILLIC CAPITAL LETTER KOMI DE
            U+0501 CYRILLIC SMALL LETTER KOMI DE
            

            These characters will be encoded as:

            @05G1 - the code point U+0500, or literally, UPPER case of the code point U+0501
            @05g1 - the code point U+0501
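
The rule above can be sketched as a small function: write the lower-case code point as four hex digits, then replace the third digit with its `g-v` equivalent, upper-cased when the character being encoded is the upper-case variant. The function name and the `is_upper` parameter are assumptions for illustration.

```python
CASE_DIGITS = "ghijklmnopqrstuv"  # hex values 0..f mapped onto g..v


def encode_bmp_folded(lower_cp: int, is_upper: bool) -> str:
    """Encode a BMP character by the code point of its lower-case variant.

    lower_cp is the code point of the lower-case variant; is_upper says
    whether the character being encoded is the upper-case one.
    """
    if not 0 <= lower_cp <= 0xFFFF:
        raise ValueError("expected a BMP code point")
    h = f"{lower_cp:04x}"
    third = CASE_DIGITS[int(h[2], 16)]
    if is_upper:
        third = third.upper()  # [G-V] marks the upper-case variant
    return "@" + h[0] + h[1] + third + h[3]


print(encode_bmp_folded(0x0180, False))  # @01o0
print(encode_bmp_folded(0x0180, True))   # @01O0
print(encode_bmp_folded(0x0501, True))   # @05G1
print(encode_bmp_folded(0x0501, False))  # @05g1
```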
            

            non-BMP characters with case folding.

            As of Unicode version 14.0.0 (and even in 16.0.0), casefolding data is present only in:

            • Plane-0 (BMP)
            • Plane-1 (U+10000..U+1FFFF).

            Let's encode Plane-1 letters with casefolding as follows:

            [@][0-9a-f][0-9a-f][g-v][g-v]
            

            where

             @        - the encoded character marker
             [0-9a-f] - the first digit  (16 values)
             [0-9a-f] - the second digit (16 values)
             [g-v]    - the third digit  (16 values) - determines upper or lower case
             [g-v]    - the fourth digit (16 values)
            

            The total encoded sequence length is 5 characters.

            The third digit [g-v] determines the case:

            • If it is in the range [g-v], then the character is in the lower case
            • If it is in the range [G-V], then the character is in the upper case

            This encoding gives 16*16*16*16=65536 values, which covers the entire Plane-1 range U+10000 to U+1FFFF.

            If in the future we switch to a new Unicode version (from 14.0.0) with more casefolding mappings in the Plane-1 range, we'll be able to encode all such characters.

            Example. Unicode-14.0.0 has casefolding between the following characters:

            U+10400 DESERET CAPITAL LETTER LONG I
            U+10428 DESERET SMALL LETTER LONG I
            

            They will be encoded as:

            @04Io - the code point U+10400, or literally UPPER(code point U+10428)
            @04io - the code point U+10428
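
A sketch of the Plane-1 form, taking the pattern [@][0-9a-f][0-9a-f][g-v][g-v] literally: the low 16 bits of the lower-case code point are written as four hex digits, and both the third and the fourth digit are mapped onto `g-v`. Applying the third-digit mapping table to the fourth digit as well, and keeping the fourth digit in lower case, are assumptions; the function name is hypothetical.

```python
CASE_DIGITS = "ghijklmnopqrstuv"  # hex values 0..f mapped onto g..v


def encode_plane1_folded(lower_cp: int, is_upper: bool) -> str:
    """Encode a Plane-1 character by the code point of its lower-case variant."""
    if not 0x10000 <= lower_cp <= 0x1FFFF:
        raise ValueError("expected a Plane-1 code point")
    h = f"{lower_cp - 0x10000:04x}"      # low 16 bits as four hex digits
    third = CASE_DIGITS[int(h[2], 16)]
    fourth = CASE_DIGITS[int(h[3], 16)]  # assumption: same g..v mapping
    if is_upper:
        third = third.upper()            # only the third digit carries the case
    return "@" + h[0] + h[1] + third + fourth


print(encode_plane1_folded(0x10428, False))  # @04io
print(encode_plane1_folded(0x10428, True))   # @04Io
```

With the fourth digit in `g-v`, the sequence cannot be confused with the BMP form, whose fourth digit is a plain hex digit.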
            

            Summary of the encoding components

            After adding the mentioned extensions, the encoding will consist of the components:

            Pattern                                 CodePoints               Comment
            ------------------------------------    -----------------------  ----------------
            [@][0..9][g..z]                         10*20         = 200      BMP characters with 3.0.0 case folding
            [@][g..z][0..9]                         20*10         = 200      BMP characters with 3.0.0 case folding
            [@][g..z][a..z]                         20*26         = 520      BMP characters with 3.0.0 case folding
            [@][@][a..z]                            1*26          = 26       BMP characters with 3.0.0 case folding
            [@][a..z][@]                            1*26          = 26       BMP characters with 3.0.0 case folding
            [@][a..f][g..z]                         6*20          = 120      Unused
            [@][0-9a-f][0-9a-f][0-9a-f][0-9a-f]     16*16*16*16   = 65536    BMP characters without case folding
            ------------------------------------    -----------------------  ----------------
            [@][0-9a-f][0-9a-f][g-v][0-9a-f]        16*16*16*16   = 65536    BMP characters with 14.0.0 case folding
            [@][0-9a-f][0-9a-f][g-v][g-v]           16*16*16*16   = 65536    non-BMP characters with case folding (Plane 1 only)
            [@][+][0-9a-v][0-9a-v][0-9a-v][0-9a-v]  1*32*32*32*32 = 1048576  non-BMP characters without case folding
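
To illustrate that the four-digit components can be told apart, here is a sketch that classifies an encoded sequence with regular expressions. It covers only the four-digit forms and the `@+` form, uses the per-section pattern definitions (fourth digit of the Plane-1 form taken as `[g-v]`), and accepts `[G-V]` in the third digit for upper-case variants. The labels and the `classify` helper are assumptions for illustration.

```python
import re

PATTERNS = [
    ("BMP without case folding",
     re.compile(r"@[0-9a-f]{4}")),
    ("BMP with 14.0.0 case folding",
     re.compile(r"@[0-9a-f]{2}[g-vG-V][0-9a-f]")),
    ("Plane 1 with case folding",
     re.compile(r"@[0-9a-f]{2}[g-vG-V][g-v]")),
    ("non-BMP without case folding",
     re.compile(r"@\+[0-9a-v]{4}")),
]


def classify(seq: str) -> list:
    """Return the names of all components whose pattern matches seq."""
    return [name for name, pattern in PATTERNS if pattern.fullmatch(seq)]


for seq in ("@0501", "@05G1", "@04io", "@+vvvv"):
    print(seq, classify(seq))
```

Each sample matches exactly one component, which is the property the encoding needs for unambiguous decoding.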
            

            bar Alexander Barkov added a comment - - edited

            Table name to file name encoding extension, proposal #2

            The old encoding has an unused range:

            [@][a..f][g..z] 6*20= 120 combinations
            

            The idea is to reuse this unused range for the new extensions.

            BMP characters with new 14.0.0 casefolding

            Let's encode characters with new casefolding as follows:

            [@][a-b][g-v][0-9a-v][0-9a-v]
            

            where

            @         - the encoded character marker
            [a-b]     - the first digit  (4 values)
            [g-v]     - the second digit (16 values) - determines upper or lower case 
            [0-9a-v]  - the third digit  (32 values)
            [0-9a-v]  - the fourth digit (32 values)
            

            The total encoded sequence length is 5 characters.

            The encoded sequence represents the Unicode code point of the lower case variant of a character.

            The second digit [g-v] determines the case:

            • If it is in the range [g-v], then the character is in the lower case
            • If it is in the range [G-V], then the character is in the upper case

            This encoding gives 4*16*32*32=65536 values, which covers the entire BMP range U+0000 to U+FFFF.

            Examples:

            @ah81   - U+501 = 0*16*32*32 + 1*32*32 + 8*32 + 1
            @aH81   - U+500 = 0*16*32*32 + 1*32*32 + 8*32 + 1, or literally UPPER(U+501)
            

            Non-BMP characters with case folding

            Let's encode non-BMP characters with casefolding as follows:

            [@][c-f][g-v][0-9a-v][0-9a-v][0-9a-f]
            

            where

            @         - the encoded character marker
            [c-f]     - the first digit  (4 values)
            [g-v]     - the second digit (16 values) - determines upper or lower case 
            [0-9a-v]  - the third digit  (32 values)
            [0-9a-v]  - the fourth digit (32 values)
            [0-9a-f]  - the fifth digit  (16 values)
            

            The total encoded sequence length is 6 characters.

            The encoded sequence represents the Unicode code point of the lower case variant of a character.

            The second digit [g-v] determines the case:

            • If it is in the range [g-v], then the character is in the lower case
            • If it is in the range [G-V], then the character is in the upper case

            This encoding gives 4*16*32*32*16=1048576 values, which exactly covers the entire
            non-BMP range U+010000 to U+10FFFF.

            Non-BMP characters without folding

            Let's encode non-BMP characters without casefolding as follows:

            [@][c-f][g-v][0-9a-v][0-9a-v][g-v]
            

            where

            @         - the encoded character marker
            [c-f]     - the first digit  (4 values)
            [g-v]     - the second digit (16 values)
            [0-9a-v]  - the third digit  (32 values)
            [0-9a-v]  - the fourth digit (32 values)
            [g-v]     - the fifth digit  (16 values)
            

            The total encoded sequence length is 6 characters.

            This encoding gives 4*16*32*32*16=1048576 values, which exactly covers the entire
            non-BMP range U+010000 to U+10FFFF.

            Examples

            @cg00g   - U+010000 = 0x10000 + 0*16*32*32*16 +  0*32*32*16 +  0*32*16 +  0*16 +  0
            @dg00g   - U+050000 = 0x10000 + 1*16*32*32*16 +  0*32*32*16 +  0*32*16 +  0*16 +  0
            @eg00g   - U+090000 = 0x10000 + 2*16*32*32*16 +  0*32*32*16 +  0*32*16 +  0*16 +  0
            @fg00g   - U+0D0000 = 0x10000 + 3*16*32*32*16 +  0*32*32*16 +  0*32*16 +  0*16 +  0
            @fvvvv   - U+10FFFF = 0x10000 + 3*16*32*32*16 + 15*32*32*16 + 31*32*16 + 31*16 + 15
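
The arithmetic in these examples corresponds to splitting the 20-bit offset from U+10000 into fields of 2, 4, 5, 5 and 4 bits. A minimal sketch, assuming the digit alphabets map to values in order (`c-f` to 0..3, `g-v` to 0..15, `0-9a-v` to 0..31); the function name is hypothetical.

```python
FIRST = "cdef"                                # 4 values
G_TO_V = "ghijklmnopqrstuv"                   # 16 values
BASE32 = "0123456789abcdefghijklmnopqrstuv"   # 32 values


def encode_p2_nonbmp_nofold(cp: int) -> str:
    """Encode a non-BMP code point in the five-digit form without folding."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("expected a code point in U+10000..U+10FFFF")
    n = cp - 0x10000  # 20-bit offset
    return (
        "@"
        + FIRST[n >> 18]            # weight 16*32*32*16
        + G_TO_V[(n >> 14) & 0xF]   # weight 32*32*16
        + BASE32[(n >> 9) & 0x1F]   # weight 32*16
        + BASE32[(n >> 4) & 0x1F]   # weight 16
        + G_TO_V[n & 0xF]           # weight 1
    )


for cp in (0x10000, 0x50000, 0x90000, 0xD0000, 0x10FFFF):
    print(encode_p2_nonbmp_nofold(cp), f"U+{cp:06X}")
```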
            

            Summary

            With this proposal the full summary of the encoding components will look as follows:

            Pattern                                CodePoints               Comment
            -------------------------------------  -----------------------  -------------------
            [@][0..9][g..z]                        10*20         = 200      BMP characters with 3.0.0 case folding
            [@][g..z][0..9]                        20*10         = 200      BMP characters with 3.0.0 case folding
            [@][g..z][a..z]                        20*26         = 520      BMP characters with 3.0.0 case folding
            [@][@][a..z]                           1*26          = 26       BMP characters with 3.0.0 case folding
            [@][a..z][@]                           1*26          = 26       BMP characters with 3.0.0 case folding
            [@][0-9a-f][0-9a-f][0-9a-f][0-9a-f]    16*16*16*16   = 65536    BMP characters without case folding
            ------------------------------------   -----------------------  ----------------
            [@][a-b][g-v][0-9a-v][0-9a-v]          4*16*32*32    = 65536    BMP with new folding
            [@][c-f][g-v][0-9a-v][0-9a-v][0-9a-f]  4*16*32*32*16 = 1048576  non-BMP with folding
            [@][c-f][g-v][0-9a-v][0-9a-v][g-v]     4*16*32*32*16 = 1048576  non-BMP without folding
            

            The advantages of this proposal:

            • "non-BMP with folding" covers all non-BMP characters in the range U+010000..U+10FFFF.
            • Does not introduce new characters into the alphabet

            Upgrade issues.

            Suppose two BMP characters U+AAAA and U+BBBB:

            • were not case variants of the same character in the old encoding
            • but became case variants of the same character in Unicode-14.0.0

            then mariadb-upgrade should not touch tables with such characters and display them as '#mdb1107#....', so the user can rename them manually.

            ralf.gebhardt Ralf Gebhardt added a comment -

            bar, in today's team lead call we decided to remove this change from the roadmap.

            Being able to use utf8mb4 in identifiers has a narrow use case and low value compared to the possible drawbacks.


            People

              bar Alexander Barkov
              wlad Vladislav Vaintroub
