Details
- Task
- Status: Closed
- Blocker
- Resolution: Fixed
- None
Description
Currently MariaDB has two utf8 character sets:
- utf8, which can store 1- to 3-byte characters and implements the Unicode BMP range U+0000..U+FFFF. This character set is also available under the name "utf8mb3".
- utf8mb4, which can store 1- to 4-byte characters and implements the full Unicode standard range U+0000..U+10FFFF.
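The difference between the two ranges can be checked at the encoding level. The following is a minimal sketch in Python (not MariaDB code); the helper names utf8_len and fits_utf8mb3 are made up for this illustration.

```python
# Illustration only: how many bytes UTF-8 needs per character, and whether
# the character falls into the BMP range that the 3-byte character set covers.

def utf8_len(ch: str) -> int:
    """Number of bytes needed to encode one character in UTF-8."""
    return len(ch.encode("utf-8"))

def fits_utf8mb3(ch: str) -> bool:
    """True if the code point is in the BMP (U+0000..U+FFFF)."""
    return ord(ch) <= 0xFFFF

samples = {
    "A": "U+0041",
    "\u00e9": "U+00E9",
    "\u20ac": "U+20AC",
    "\U0001F600": "U+1F600",
}

for ch, code_point in samples.items():
    print(f"{code_point}: {utf8_len(ch)} byte(s), fits utf8mb3: {fits_utf8mb3(ch)}")
```

Only the last sample lies outside the BMP, so it is the only one that needs the 4-byte form.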
In the long term, we want the name utf8 to mean the full-featured UTF-8.
We'll do a few preparatory steps:
1. Change the main name of the 3-byte character set from "utf8" to "utf8mb3" and make "utf8" an alias for "utf8mb3". This will change all SHOW and INFORMATION_SCHEMA output to display utf8mb3 instead of utf8, and will change mysqldump to dump utf8mb3 instead of just utf8.
2. Add a new server option, say --utf8-is-utf8mb3, which will be true by default, but the DBA will be able to change it to false and thus make "utf8" mean "utf8mb4".
3. A few releases later we'll change --utf8-is-utf8mb3 to be "false" by default.
Or
2. Do not add any new server options, and
3. Add a new old_mode value for reverting utf8 to utf8mb3 once utf8 means utf8mb4 by default.
4. (Optional) Make utf8 mean utf8mb4 already in 10.6, and make the default value of old_mode revert this in 10.6.
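Both variants above come down to a switch that decides what the alias "utf8" resolves to. The sketch below models that in Python as an illustration only. The function resolve_charset and its utf8_is_utf8mb3 parameter are hypothetical names mirroring the option proposed above; this is not server code.

```python
# Hypothetical model of character set name resolution; not MariaDB source code.

def resolve_charset(name: str, utf8_is_utf8mb3: bool = True) -> str:
    """Map a user-supplied character set name to its main name.

    utf8_is_utf8mb3 stands in for the proposed server option (or the
    equivalent old_mode value): True keeps the 3-byte meaning of "utf8".
    """
    name = name.lower()
    if name == "utf8":
        return "utf8mb3" if utf8_is_utf8mb3 else "utf8mb4"
    # "utf8mb3", "utf8mb4" and all other names are returned unchanged.
    return name

print(resolve_charset("utf8"))                         # utf8mb3
print(resolve_charset("utf8", utf8_is_utf8mb3=False))  # utf8mb4
print(resolve_charset("utf8mb4"))                      # utf8mb4
```

With this model, SHOW output and dumps would always carry an explicit main name (utf8mb3 or utf8mb4), so their meaning would not depend on the switch.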
Or
Do not add any new server options and implement charset aliases via the SQL standard statement:
CREATE CHARACTER SET <character set name> [ AS ] <character set source> [ <collate clause> ]
<character set source> ::= GET <character set specification>
<character set specification> ::=
    <standard character set name>
  | <implementation-defined character set name>
  | <user-defined character set name>
Alternative solution
Originally, there were two reasons to have two utf8 implementations:
- The CHAR column needs less space in the case of utf8mb3. However, InnoDB can store CHAR in a packed format, so the space needed is the same for utf8mb3 and utf8mb4 on the same data. Other engines could probably use the same trick to save space: store CHAR in a packed format with trailing spaces removed.
- Before 10.5, filesort was faster for utf8mb3 than for utf8mb4, because utf8mb3 needs to reserve fewer bytes per weight. Now, with Varun's improvements in filesort (e.g. MDEV-21580), the sort buffer can store the original string instead of its weight array, so filesort should be equally fast for utf8mb3 and utf8mb4 on equal data sets.
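The space argument in the first point can be made concrete with a little arithmetic. The sketch below is a simplified Python model, not an engine implementation: fixed_width_bytes and packed_bytes are hypothetical helpers, and any length prefix a packed format would need is ignored.

```python
# Simplified model of CHAR storage cost; not how any engine is implemented.

def fixed_width_bytes(n_chars: int, max_bytes_per_char: int) -> int:
    """Bytes reserved for CHAR(n_chars) when each character gets the maximum width."""
    return n_chars * max_bytes_per_char

def packed_bytes(value: str) -> int:
    """Bytes needed when trailing spaces are removed and only the actual
    UTF-8 encoding of the remaining characters is stored."""
    return len(value.rstrip(" ").encode("utf-8"))

value = "abc".ljust(10)  # a CHAR(10) value, padded with spaces to full length

print(fixed_width_bytes(10, 3))  # 3-byte character set reservation: 30
print(fixed_width_bytes(10, 4))  # 4-byte character set reservation: 40
print(packed_bytes(value))       # packed size, same for both character sets: 3
```

For BMP-only data the packed size does not depend on which of the two character sets the column is declared with, which is the point made above.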
So we could have just one "utf8", with the following aliases:
- utf8mb4 is just a simple alias for the "new utf8"
- utf8mb3 is also an alias for the "new utf8", but with an automatic constraint added
After the upgrade, SHOW output for old tables with the 3-byte utf8 could look roughly like this:
CREATE TABLE t1
(
  a VARCHAR(10) CHARACTER SET utf8 CHECK(is_bmp_only(a))
);
where is_bmp_only() is a new built-in function that tests whether a string contains only Basic Multilingual Plane characters, returning:
- TRUE if the string contains only BMP characters U+0000..U+FFFF, which fit into 3-byte utf8 sequences
- FALSE if the string has characters outside of the BMP, i.e. U+10000..U+10FFFF, which require 4 bytes in utf8 encoding.
The exact API for the constraint function may be different, e.g. it could test for an arbitrary Unicode character range (not only BMP vs non-BMP). This could be useful for other purposes as well.
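The intended semantics of such a check can be sketched outside the server. The Python below is an illustration under assumptions: is_bmp_only is the function name proposed above, in_code_point_range is a hypothetical generalisation to an arbitrary range, and SQL NULL handling is not modelled.

```python
# Sketch of the proposed check's semantics; not a MariaDB built-in.

def in_code_point_range(s: str, low: int, high: int) -> bool:
    """True if every character of s has a code point in [low, high]."""
    return all(low <= ord(ch) <= high for ch in s)

def is_bmp_only(s: str) -> bool:
    """True if s contains only BMP characters (U+0000..U+FFFF)."""
    return in_code_point_range(s, 0x0000, 0xFFFF)

print(is_bmp_only("price: 10 \u20ac"))      # True: the euro sign is in the BMP
print(is_bmp_only("smile \U0001F600"))      # False: U+1F600 needs 4 bytes
print(in_code_point_range("abc", 0, 0x7F))  # True: ASCII-only check
```

Expressing the BMP test as a special case of a range test keeps the door open for the more general API mentioned above.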
Open questions:
- It's not clear how to handle the database and the table level clause CHARACTER SET utf8mb3:
CREATE TABLE t1
(
a VARCHAR(10)
) CHARACTER SET utf8mb3;
CREATE TABLE t2
(
a VARCHAR(10)
) CHARACTER SET utf8mb4;
The table-level CHARACTER SET for "t1" could probably add the constraint automatically to all columns that would have been implicitly created as utf8mb3.
- TODO: add upgrade details
- TODO: add replication details
Issue Links
- blocks
  - MDEV-7128 Configuring charsets or collations as utf8 yields surprising result and leads to data loss (Closed)
- causes
  - MDEV-25924 Client shows `utf8mb3` csname replace warning message while logging into server (Closed)
  - MDEV-26105 MariaDB 10.6 cannot be used from C# client applications (Closed)
  - MDEV-26163 after 10.6 upgrade problems connecting to pipo db (Closed)
  - MDEV-26165 Failed to upgrade from 10.4 to 10.6 (Closed)
  - MDEV-26605 Creating table with primary key constraint name fails when using C# connector (Open)
  - MDEV-26607 Information schema not accessable in C# using MySql connector (Open)
  - MDEV-26863 MariaDB 10.6.4 & Roundcubemail (Open)
  - MDEV-27814 Mariadb_Upgrade_Wizard fails from 10.5 to 10.6 (Open)
  - MDEV-27819 func_2.xxx_charset skipped after renaming utf8 to utf8mb3 (Closed)
  - MDEV-30086 Character set 'utf8' is not a compiled character set and is not specified in the '/usr/share/mysql/charsets/Index.xml' file (In Review)
- is blocked by
  - MDEV-19897 Rename source code variable names from utf8 to utf8mb3 (Closed)
  - MDEV-21581 Helper functions and methods for CHARSET_INFO (Closed)
- relates to
  - MDEV-19123 Change default charset from latin1 to utf8mb4 (Closed)
  - MDEV-30086 Character set 'utf8' is not a compiled character set and is not specified in the '/usr/share/mysql/charsets/Index.xml' file (In Review)
  - MDEV-8765 mysqldump silently corrupts 4-byte UTF-8 data (Closed)
  - MDEV-17662 Default to UTF8 (Closed)
  - MDEV-22217 Make OS character sets "utf8" and "utf-8" map to MariaDB character set "utf8mb4" (In Testing)