[MDEV-8334] Rename utf8 to utf8mb3 Created: 2015-06-18  Updated: 2023-11-22  Resolved: 2021-05-19

Status: Closed
Project: MariaDB Server
Component/s: Character Sets
Fix Version/s: 10.6.1

Type: Task Priority: Blocker
Reporter: Alexander Barkov Assignee: Oleksandr Byelkin
Resolution: Fixed Votes: 8
Labels: None

Issue Links:
Blocks
blocks MDEV-7128 Configuring charsets or collations as... Closed
is blocked by MDEV-19897 Rename source code variable names fro... Closed
is blocked by MDEV-21581 Helper functions and methods for CHAR... Closed
Problem/Incident
causes MDEV-25924 Client shows `utf8mb3` csname replace... Closed
causes MDEV-26105 MariaDB 10.6 cannot be used from C# c... Closed
causes MDEV-26163 after 10.6 upgrade problems connectin... Closed
causes MDEV-26165 Failed to upgrade from 10.4 to 10.6 Closed
causes MDEV-26605 Creating table with primary key const... Open
causes MDEV-26607 Information schema not accessable in ... Open
causes MDEV-26863 MariaDB 10.6.4 & Roundcubemail Open
causes MDEV-27814 Mariadb_Upgrade_Wizard fails from 10.... Open
causes MDEV-27819 func_2.xxx_charset skipped after rena... Closed
Relates
relates to MDEV-19123 Change default charset from latin1 to... Open
relates to MDEV-30086 Character set 'utf8' is not a compile... Closed
relates to MDEV-8765 mysqldump silently corrupts 4-byte UT... Closed
relates to MDEV-17662 Default to UTF8 Closed
relates to MDEV-22217 Make OS character sets "utf8" and "ut... Open
Sub-Tasks:
Key         Summary                                   Type            Status  Assignee
MDEV-25706  Dokument "Rename utf8 to utf8mb3" - M...  Technical task  Closed  Ian Gilfillan

 Description   

Currently MariaDB has two utf8 character sets:

  • utf8 that can store 1 to 3 byte characters and implements Unicode BMP range U+0000..U+FFFF
    This character set is also available under name "utf8mb3"
  • utf8mb4 that can store 1 to 4 byte characters and implements the full Unicode standard range U+0000..U+10FFFF.
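
As a rough illustration of the two ranges, the following Python sketch (not MariaDB code) counts how many bytes UTF-8 needs for a few sample characters and checks whether each one lies in the BMP. The helper names utf8_byte_length and fits_in_utf8mb3 and the sample characters are chosen here for illustration only.

```python
def utf8_byte_length(ch: str) -> int:
    """Return the number of bytes UTF-8 uses to encode one character."""
    return len(ch.encode("utf-8"))


def fits_in_utf8mb3(ch: str) -> bool:
    """True if the character is in the BMP (U+0000..U+FFFF), i.e. needs at most 3 bytes."""
    return ord(ch) <= 0xFFFF


# A, e with acute accent, euro sign, and an emoji outside the BMP
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
for ch in samples:
    print(f"U+{ord(ch):04X}: {utf8_byte_length(ch)} byte(s), "
          f"fits in utf8mb3: {fits_in_utf8mb3(ch)}")
```

Only the last sample needs a 4-byte sequence, which is exactly the case the 3-byte character set cannot store.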

In the long term we want the name utf8 to mean the full-featured UTF-8.
We'll do a few preparatory steps:

1. Change the main name of the 3-byte character set from "utf8" to "utf8mb3" and make "utf8" an alias for "utf8mb3". This will change all SHOW and INFORMATION_SCHEMA output to display utf8mb3 instead of utf8, as well as change mysqldump to dump utf8mb3 instead of just utf8.

2. Add a new server option, say --utf8-is-utf8mb3, which will be true by default, but the DBA will be able to change it to false and thus make "utf8" mean "utf8mb4".

3. A few releases later we'll change --utf8-is-utf8mb3 to be "false" by default.
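
The effect of steps 1 to 3 on name resolution can be modelled roughly as follows. This is a Python sketch, not server code; resolve_charset_name and its utf8_is_utf8mb3 parameter are hypothetical names that mirror the proposed --utf8-is-utf8mb3 option.

```python
def resolve_charset_name(name: str, utf8_is_utf8mb3: bool = True) -> str:
    """Map a user-supplied character set name to the name that would be displayed.

    Hypothetical model of the proposal: "utf8" is only an alias, and the
    option decides whether it points at utf8mb3 (the default) or utf8mb4.
    """
    normalized = name.lower()
    if normalized == "utf8":
        return "utf8mb3" if utf8_is_utf8mb3 else "utf8mb4"
    return normalized


print(resolve_charset_name("utf8"))                         # alias for utf8mb3 by default
print(resolve_charset_name("utf8", utf8_is_utf8mb3=False))  # DBA switched the option off
print(resolve_charset_name("utf8mb4"))                      # explicit names are unaffected
```

Step 3 would then amount to flipping the default value of the parameter, while explicitly named utf8mb3 and utf8mb4 objects keep their meaning.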

Or

2. do not add any new server options and
3. add a new old_mode value for reverting utf8 to utf8mb3 when the default will mean utf8mb4
4. (optionally) make utf8 mean utf8mb4 already in 10.6, and make the default value of old_mode revert this in 10.6

Or

Do not add any new server options and implement charset aliases via the SQL standard statement:

CREATE CHARACTER SET <character set name> [ AS ] <character set source> [ <collate clause> ]
<character set source> ::= GET <character set specification>
<character set specification> ::=
    <standard character set name>
  | <implementation-defined character set name>
  | <user-defined character set name>

Alternative solution

Originally, there were two reasons to have two utf8 implementations:

  • The CHAR column needs less space in case of utf8mb3. InnoDB can store CHAR in a packed format, so the space needed is the same for utf8mb3 and utf8mb4 on the same data. Other engines could probably use the same trick to save space: store CHAR in a packed format with trailing spaces removed.
  • Before 10.5, filesort was faster for utf8mb3 than for utf8mb4, because utf8mb3 needs to reserve fewer bytes for one weight. Now, with Varun's improvements in filesort (e.g. MDEV-21580; the sort buffer can now store the original string instead of its weight array), filesort should be equally fast for utf8mb3 and utf8mb4 on equal data sets.
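
The first point can be made concrete with a simplified model. The Python sketch below is an illustration under stated assumptions, not InnoDB's actual row format: it ignores any length prefix and compares the space a fixed-width CHAR(n) would reserve at the maximum bytes per character with the space a packed representation (trailing spaces removed) needs for the same value. The function names are hypothetical.

```python
def fixed_width_bytes(n_chars: int, max_bytes_per_char: int) -> int:
    """Bytes reserved when every character slot gets the maximum width."""
    return n_chars * max_bytes_per_char


def packed_bytes(value: str) -> int:
    """Bytes needed when trailing spaces are stripped and only real data is stored."""
    return len(value.rstrip(" ").encode("utf-8"))


value = "abc".ljust(10)  # a CHAR(10) value padded with spaces

print(fixed_width_bytes(10, 3))  # unpacked, 3-byte character set
print(fixed_width_bytes(10, 4))  # unpacked, 4-byte character set
print(packed_bytes(value))       # packed: depends only on the data, not on the charset
```

In this model the packed size is identical for both character sets whenever the data itself contains only BMP characters.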

So we could have just one "utf8", with the following aliases:

  • utf8mb4 is just a simple alias for the "new utf8"
  • utf8mb3 is also an alias for the "new utf8", but with an automatic constraint added

After the upgrade, SHOW output for old tables with the 3-byte utf8 could look roughly like this:

CREATE TABLE t1
(
  a VARCHAR(10) CHARACTER SET utf8 CHECK(is_bmp_only(a))
);

where is_bmp_only() is a new built-in function that tests whether a string contains only Basic Multilingual Plane characters, returning:

  • TRUE if a string contains only BMP characters U+0000..U+FFFF, fitting into 3-byte utf8 sequences
  • FALSE if the string has characters outside of the BMP, i.e. U+10000..U+10FFFF, which therefore require 4 bytes in utf8 encoding.
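
The intended semantics of the proposed function can be sketched in Python. is_bmp_only() does not exist as a built-in; the version below is only a model of the behaviour described above.

```python
def is_bmp_only(s: str) -> bool:
    """Return True if every character of s is in the BMP (U+0000..U+FFFF)."""
    return all(ord(ch) <= 0xFFFF for ch in s)


print(is_bmp_only("na\u00efve \u20ac"))  # only BMP characters
print(is_bmp_only("smile \U0001F600"))   # contains U+1F600, which is outside the BMP
```

Note that in this model an empty string trivially passes the check.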

The exact API for the constraint function may be different, e.g. it could test for an arbitrary Unicode character range (not only BMP vs non-BMP). This could be useful for other purposes as well.
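
A range-based variant might look like the following Python sketch. The name in_code_point_range and its parameters are assumptions made for illustration; the issue does not define such an API.

```python
def in_code_point_range(s: str, low: int, high: int) -> bool:
    """Return True if every character of s has a code point in [low, high]."""
    return all(low <= ord(ch) <= high for ch in s)


# The BMP test is one special case of the general range test ...
print(in_code_point_range("\u20ac100", 0x0000, 0xFFFF))
# ... and an ASCII-only test is another.
print(in_code_point_range("\u20ac100", 0x00, 0x7F))
```

A single parameterised function of this kind would cover the BMP constraint as well as other restrictions, such as ASCII-only columns.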

Open questions:

  • It's not clear how to handle the database and the table level clause CHARACTER SET utf8mb3:

    CREATE TABLE t1
    (
      a VARCHAR(10)
    ) CHARACTER SET utf8mb3;
     
    CREATE TABLE t2
    (
      a VARCHAR(10)
    ) CHARACTER SET utf8mb4;
    

    The table level CHARACTER SET for "t1" could probably automatically add the constraint into all columns that would have been implicitly created as utf8mb3.

  • TODO: add upgrade details
  • TODO: add replication details
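
One possible reading of the first open question, sketched in Python: a column that inherits (or declares) utf8mb3 gets the constraint, any other column does not. This is a hypothetical model of the idea, not a decided design; column_ddl and its parameters are invented for illustration.

```python
from typing import Optional


def column_ddl(name: str, sql_type: str,
               column_charset: Optional[str], table_charset: str) -> str:
    """Render one column under the hypothetical single-"utf8" scheme."""
    # A column without its own CHARACTER SET inherits the table-level one.
    effective = (column_charset or table_charset).lower()
    ddl = f"{name} {sql_type} CHARACTER SET utf8"
    if effective == "utf8mb3":
        # The 3-byte set becomes "utf8 plus a BMP-only constraint".
        ddl += f" CHECK(is_bmp_only({name}))"
    return ddl


print(column_ddl("a", "VARCHAR(10)", None, "utf8mb3"))  # t1: constraint added
print(column_ddl("a", "VARCHAR(10)", None, "utf8mb4"))  # t2: no constraint
```

The sketch leaves open what SHOW CREATE TABLE should print for the table-level clause itself, which is the part the question above is really about.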


 Comments   
Comment by Alexander Barkov [ 2018-11-26 ]

ralf.gebhardt@mariadb.com, this is a good idea. We can keep this MDEV as a "super task", and have three individual tasks, one for each step.

Comment by Todd Michael [ 2019-05-17 ]

This might both conflict with and agree with the long-term usage envisaged for MySQL ... :

https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-sets.html

-------------------------------------------------------------------------------
Note
The utf8mb3 character set is deprecated and will be removed in a future MySQL release. Please use utf8mb4 instead. Although utf8 is currently an alias for utf8mb3, at some point utf8 will become a reference to utf8mb4. To avoid ambiguity about the meaning of utf8, consider specifying utf8mb4 explicitly for character set references instead of utf8.
-------------------------------------------------------------------------------

Comment by Nuno [ 2019-07-02 ]

Hello,

Is this going to fix/improve the fact that, many times, inserting 150 characters into a VARCHAR(200) column returns a truncation error?

This happens when using utf8 or utf8mb4.

Thank you.

Comment by Alexander Barkov [ 2019-10-30 ]

julien.fritsch, yes, I want to finish it before beta.

Comment by Rick James [ 2020-01-31 ]

I fear that upgrades will fail, and that downgrades will be problematic. Think about these issues when changing the meaning of utf8, even if it is to the equivalent utf8mb3. Also be aware that doing something different than Oracle will lead to a lot of grief when people try to move from (or to) MySQL.

Comment by Martin Häcker [ 2020-08-25 ]

I'm getting lots of change mail from this bug, but I don't see any changes. Is there a script running amok here perhaps?

Comment by Julien Fritsch [ 2020-08-25 ]

dwt, you are getting all those emails from this task because bar is working on it and is updating the description. If you don't want to get them, you can stop watching it.

Comment by Oleksandr Byelkin [ 2020-11-17 ]

The plan is:

  1. rename utf8 -> utf8mb3
  2. make utf8 alias of utf8mb4
  3. make old_mode=UTF8_IS_UTF8MB3 where utf8 is an alias of utf8mb3
  4. make UTF8_IS_UTF8MB3 the default setting

Comment by Nuno [ 2020-11-17 ]

Guys,
This looks great.

I just want to ask,
if we activate UTF8_IS_UTF8MB3, the existing tables will be unaffected, right? Just the definition will start saying "utf8mb3" rather than "utf8", is that it?

Comment by Sergei Golubchik [ 2020-11-17 ]

Correct.
With a minor detail that you don't need to activate UTF8_IS_UTF8MB3, it'll be active by default in 10.6

Comment by Rucha Deodhar [ 2021-04-17 ]

PR for mariadb-connector-c: https://github.com/mariadb-corporation/mariadb-connector-c/pull/169
serg patch for server after latest review:
https://github.com/MariaDB/server/commit/3072ba1b7cafc97a4df0909885d8c7cb30121e35

Comment by Marko Mäkelä [ 2021-04-21 ]

The Connector/C part has apparently been applied. I merged it to 10.6 and adjusted tests/mysql_client_test.c accordingly.

Comment by Oleksandr Byelkin [ 2021-04-22 ]

OK to push

Comment by Todd Michael [ 2021-04-29 ]

See new documentation on OLD_MODE for more info:

https://mariadb.com/kb/en/old-mode/

Comment by Sergei Golubchik [ 2021-05-11 ]

commit 3072ba1b7ca is ok to push, thanks!

Comment by Martin Häcker [ 2021-05-19 ]

As the guy who triggered all of this with a bug report many years ago - after all this time - I just want to say thank you for the work you guys put in to make this happen. Stopping the confusion of utf8 (utf8mb3) with utf8mb4 in MariaDB is a huge thing and still something I have to fight all the time because people just miss it.

This will help a lot!

Thanks!

Comment by Roel Van de Paar [ 2021-06-15 ]

See MDEV-8334

Generated at Thu Feb 08 07:26:22 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.