[MDEV-8334] Rename utf8 to utf8mb3 - Jira

Details

Type: Task
Status: Closed
Priority: Blocker
Resolution: Fixed
Fix Version/s: 10.6.1
Component/s: Character Sets
Labels: None

Description

Currently MariaDB has two utf8 character sets:

utf8 that can store 1 to 3 byte characters and implements Unicode BMP range U+0000..U+FFFF
This character set is also available under the name "utf8mb3".

utf8mb4 that can store 1 to 4 byte characters and implements the full Unicode standard range U+0000..U+10FFFF.
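The byte counts above can be checked with a short Python sketch. Python is used only for illustration; the sample code points and the helper names `utf8_length` and `fits_utf8mb3` are arbitrary choices, not part of MariaDB. It encodes a few code points as UTF-8 and reports how many bytes each needs.

```python
# Illustration only: UTF-8 byte length per code point.
# The sample code points are arbitrary; the last one lies outside the BMP.
SAMPLES = [0x0041, 0x00E9, 0x20AC, 0xFFFD, 0x1F600]


def utf8_length(code_point: int) -> int:
    """Return the number of bytes UTF-8 needs for one code point."""
    return len(chr(code_point).encode("utf-8"))


def fits_utf8mb3(code_point: int) -> bool:
    """True if the code point needs at most 3 bytes, i.e. it is in the BMP."""
    return utf8_length(code_point) <= 3


for cp in SAMPLES:
    print(f"U+{cp:04X}: {utf8_length(cp)} byte(s), fits utf8mb3: {fits_utf8mb3(cp)}")
```

Every code point up to U+FFFF needs at most 3 bytes, and every code point from U+10000 to U+10FFFF needs exactly 4.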

In the long term we want the name utf8 to mean the full-featured UTF-8.
We'll do a few preparatory steps:

1. Change the main name of the 3-byte character set from "utf8" to "utf8mb3" and make "utf8" an alias for "utf8mb3". This will change all SHOW and INFORMATION_SCHEMA output to display utf8mb3 instead of utf8, and will change mysqldump to dump utf8mb3 instead of just utf8.

2. Add a new server option, say --utf8-is-utf8mb3, which will be true by default, but the DBA will be able to change it to false and thus make "utf8" mean "utf8mb4".

3. A few releases later we'll change --utf8-is-utf8mb3 to be "false" by default.
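The alias behaviour described in steps 1 to 3 can be sketched as a small name-resolution function. This is a Python illustration under stated assumptions, not the server's implementation: `resolve_charset_name` and its `utf8_is_utf8mb3` parameter are hypothetical names that mirror the proposed --utf8-is-utf8mb3 option.

```python
def resolve_charset_name(name: str, utf8_is_utf8mb3: bool = True) -> str:
    """Map a user-supplied character set name to its canonical name.

    Hypothetical helper: "utf8" is treated as an alias whose target
    depends on the proposed --utf8-is-utf8mb3 setting.
    """
    normalized = name.lower()
    if normalized == "utf8":
        return "utf8mb3" if utf8_is_utf8mb3 else "utf8mb4"
    # All other names, including utf8mb3 and utf8mb4, are unaffected.
    return normalized


# Step 2: the option defaults to true, so "utf8" still means the 3-byte set.
print(resolve_charset_name("utf8"))
# Step 3 (or a DBA override): "utf8" now means the full 4-byte set.
print(resolve_charset_name("utf8", utf8_is_utf8mb3=False))
```

Because only the alias is redirected, objects that were created explicitly as utf8mb3 or utf8mb4 keep their meaning regardless of the setting.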

Alternative for steps 2 and 3:

2. Do not add any new server options.
3. Add a new old_mode value for reverting utf8 to utf8mb3 once the default means utf8mb4.
4. (Optionally) Make utf8 mean utf8mb4 already in 10.6, and make the default value of old_mode revert this in 10.6.

Do not add any new server options and implement charset aliases via the SQL standard statement:

CREATE CHARACTER SET <character set name> [ AS ] <character set source> [ <collate clause> ]

<character set source> ::= GET <character set specification>

<character set specification> ::=
    <standard character set name>
  | <implementation-defined character set name>
  | <user-defined character set name>

Alternative solution

Originally, there were two reasons to have two utf8 implementations:

1. A CHAR column needs less space with utf8mb3. However, InnoDB can store CHAR in a packed format, so the space needed is the same for utf8mb3 and utf8mb4 on the same data. Other engines could probably use the same trick to save space: store CHAR in a packed format with trailing spaces removed.
2. Before 10.5, filesort was faster for utf8mb3 than for utf8mb4, because utf8mb3 needs to reserve fewer bytes for one weight. With Varun's filesort improvements (e.g. MDEV-21580), the sort buffer can now store the original string instead of its weight array, so filesort should be equally fast for utf8mb3 and utf8mb4 on equal data sets.

So we could have just one "utf8", with the following aliases:

utf8mb4 is just a simple alias for the "new utf8"
utf8mb3 is also an alias for the "new utf8", but with an automatic constraint added

After the upgrade, SHOW output for old tables with the 3-byte utf8 could look roughly like this:

CREATE TABLE t1
(
  a VARCHAR(10) CHARACTER SET utf8 CHECK(is_bmp_only(a))
);

where is_bmp_only() is a new built-in function that tests whether a string contains only Basic Multilingual Plane characters and returns:

TRUE if a string contains only BMP characters U+0000..U+FFFF, fitting into 3-byte utf8 sequences
FALSE if the string has characters outside of the BMP, i.e. U+10000..U+10FFFF, which require 4 bytes in utf8 encoding.
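The intended semantics of the proposed function can be sketched in Python. This is an illustration only: is_bmp_only() does not exist in the server, and details such as NULL handling are not specified in this ticket, so treating the empty string as TRUE is an assumption.

```python
def is_bmp_only(text: str) -> bool:
    """Return True if every character of text lies in U+0000..U+FFFF.

    Sketch of the proposed SQL function's semantics; an empty string is
    assumed to satisfy the check.
    """
    return all(ord(ch) <= 0xFFFF for ch in text)


# A BMP-only value would pass the CHECK constraint ...
print(is_bmp_only("caf\u00e9 \u20ac"))
# ... while a value containing a supplementary character (U+1F600) would not.
print(is_bmp_only("hello \U0001F600"))
```

A column carrying such a constraint would reject exactly the values that the old 3-byte utf8 could not store.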

The exact API for the constraint function may be different, e.g. it could test for an arbitrary Unicode character range (not only BMP vs non-BMP). This could be useful for other purposes as well.
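A more general variant of the check might take the range as arguments. The following Python sketch assumes a hypothetical function named `is_in_code_point_range` with inclusive bounds; neither the name nor the signature comes from this ticket.

```python
def is_in_code_point_range(text: str, first: int, last: int) -> bool:
    """Return True if every character of text lies in first..last (inclusive).

    Hypothetical generalization of the BMP-only check to any code point range.
    """
    return all(first <= ord(ch) <= last for ch in text)


# The BMP-only test is the special case U+0000..U+FFFF.
print(is_in_code_point_range("abc", 0x0000, 0xFFFF))
# The same function could restrict a column to 7-bit ASCII, for example.
print(is_in_code_point_range("caf\u00e9", 0x0000, 0x007F))
```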

Open questions:

It's not clear how to handle the database and the table level clause CHARACTER SET utf8mb3:

CREATE TABLE t1
(
  a VARCHAR(10)
) CHARACTER SET utf8mb3;

CREATE TABLE t2
(
  a VARCHAR(10)
) CHARACTER SET utf8mb4;

The table-level CHARACTER SET for "t1" could probably add the constraint automatically to all columns that would have been implicitly created as utf8mb3.

TODO: add upgrade details
TODO: add replication details

Issue Links

blocks

MDEV-7128 Configuring charsets or collations as utf8 yields surprising result and leads to data loss (Closed)

causes

MDEV-25924 Client shows `utf8mb3` csname replace warning message while logging into server (Closed)
MDEV-26105 MariaDB 10.6 cannot be used from C# client applications (Closed)
MDEV-26163 after 10.6 upgrade problems connecting to pipo db (Closed)
MDEV-26165 Failed to upgrade from 10.4 to 10.6 (Closed)
MDEV-26605 Creating table with primary key constraint name fails when using C# connector (Open)
MDEV-26607 Information schema not accessable in C# using MySql connector (Open)
MDEV-26863 MariaDB 10.6.4 & Roundcubemail (Open)
MDEV-27814 Mariadb_Upgrade_Wizard fails from 10.5 to 10.6 (Open)
MDEV-27819 func_2.xxx_charset skipped after renaming utf8 to utf8mb3 (Closed)

is blocked by

MDEV-19897 Rename source code variable names from utf8 to utf8mb3 (Closed)
MDEV-21581 Helper functions and methods for CHARSET_INFO (Closed)

relates to

MDEV-19123 Change default charset from latin1 to utf8mb4 (Closed)
MDEV-30086 Character set 'utf8' is not a compiled character set and is not specified in the '/usr/share/mysql/charsets/Index.xml' file (Closed)
MDEV-8765 mysqldump silently corrupts 4-byte UTF-8 data (Closed)
MDEV-17662 Default to UTF8 (Closed)
MDEV-22217 Make OS character sets "utf8" and "utf-8" map to MariaDB character set "utf8mb4" (Open)

Sub-Tasks

Dokument "Rename utf8 to utf8mb3" - MDEV-8334

Closed

Ian Gilfillan

People

Assignee: Oleksandr Byelkin
Reporter: Alexander Barkov
Votes: 8
Watchers: 21

Dates

Created: 2015-06-18 12:04
Updated: 2024-07-09 11:54
Resolved: 2021-05-19 05:41
