[MDEV-19123] Change default charset from latin1 to utf8mb4 Created: 2019-04-01  Updated: 2023-12-22

Status: Open
Project: MariaDB Server
Component/s: Character Sets
Fix Version/s: 11.5

Type: New Feature Priority: Critical
Reporter: Diego Dupin Assignee: Alexander Barkov
Resolution: Unresolved Votes: 9
Labels: None

Issue Links:
Blocks
blocks MDEV-30041 don't set utf8_is_utf8mb3 by default ... Open
is blocked by MDEV-22981 Bad "default-character-set" option in... Closed
is blocked by MDEV-25829 Change default collation to utf8mb4_u... In Review
is blocked by MDEV-27009 Add UCA-14.0.0 collations Closed
is blocked by MDEV-29446 Change SHOW CREATE TABLE to display d... Closed
is blocked by MDEV-30556 UPPER() returns an empty string for U... Closed
is blocked by MDEV-30577 Case folding for uca1400 collations i... Closed
is blocked by MDEV-30661 UPPER() returns an empty string for U... Closed
Duplicate
is duplicated by MDEV-17662 Default to UTF8 Closed
Relates
relates to MDEV-7128 Configuring charsets or collations as... Closed
relates to MDEV-8334 Rename utf8 to utf8mb3 Closed
relates to MDEV-8872 Performance regressions with utf8mb4 ... Closed
relates to MDEV-27490 Allow full utf8mb4 for identifiers Stalled
relates to MDEV-29414 Map utf8 OS locales to utf8mb4 Open

 Description   

The goal of this task is to change the default global variables to the 4-byte UTF-8 character set, meaning:

  • character_set_client: from utf8 to utf8mb4
  • character_set_database : from latin1 to utf8mb4
  • character_set_server : from latin1 to utf8mb4
  • character_set_results: from utf8 to utf8mb4
  • character_set_connection: from utf8 to utf8mb4
  • collation_database: from latin1_swedish_ci to utf8mb4_general_ci
  • collation_server: from latin1_swedish_ci to utf8mb4_general_ci

MySQL changed this default in 8.0.1.

There are some questions which should be discussed before/while working on this task:

  • Should we change the default collation for utf8mb4 from utf8mb4_general_ci to uca1400_ai_ci? The problem is that utf8mb4_general_ci handles non-BMP characters very badly: it considers all non-BMP characters equal to each other. See MDEV-25829
  • Should we reassign the UTF8 Linux locale from utf8mb3 to utf8mb4 in the client, or to whatever the server side uses as the alias for "utf8"? See MDEV-29414
  • Should we change system_charset_info from utf8mb3 to utf8mb4 and allow non-BMP characters in identifiers?
    • If so, table name to file name encoding should be extended to support non-BMP characters. See MDEV-27490
    • system charset cannot be utf8mb4 until we fix the collation as above
  • Should we change numerous INFORMATION_SCHEMA columns from utf8mb3 to utf8mb4?
    • they should be in the system_charset_info, as they store identifiers
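
To illustrate the collation concern in the first question above, here is a minimal Python sketch. It is a simplified model, not the server's implementation: it assumes that utf8mb4_general_ci assigns every non-BMP character the single weight of U+FFFD, and it uses Python's `str.upper()` as a crude stand-in for the real per-character weight tables. The function names are hypothetical.

```python
# Simplified model of a "general_ci"-style comparison (an assumption,
# not the actual MariaDB code).
REPLACEMENT_WEIGHT = 0xFFFD  # assumed shared weight for all non-BMP characters


def general_ci_weight(ch: str) -> int:
    """Return one sort weight for a single character."""
    cp = ord(ch)
    if cp > 0xFFFF:
        # Non-BMP (supplementary) character: every one gets the same weight.
        return REPLACEMENT_WEIGHT
    upper = ch.upper()
    # Crude case folding; fall back to the code point when upper-casing
    # expands to more than one character (e.g. the German sharp s).
    return ord(upper) if len(upper) == 1 else cp


def equal_general_ci(a: str, b: str) -> bool:
    """Compare two strings by their per-character weights."""
    return [general_ci_weight(c) for c in a] == [general_ci_weight(c) for c in b]


if __name__ == "__main__":
    pile_of_poo = "\U0001F4A9"
    grinning_face = "\U0001F600"
    print(equal_general_ci("a", "A"))                    # case-insensitive match
    print(equal_general_ci(pile_of_poo, grinning_face))  # distinct emoji compare equal
```

Under this model, a unique index on such a column would treat two different emoji as duplicates, which is the behaviour a UCA-based default collation is meant to avoid.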


 Comments   
Comment by Dario Seidl [ 2019-07-02 ]

Please consider making this change. utf8mb4 is really the most sensible default nowadays. As pointed out, MySQL 8 also made the switch.

Comment by Otto Kekäläinen [ 2020-06-25 ]

In 10.5 we switched to utf8mb4 by default for new databases in https://github.com/MariaDB/server/commit/7c2079f600bacbd4d24762159550b3d40ad856c1 but that was then reverted in https://github.com/MariaDB/server/commit/039cb6f6bfaaeafeb87e6d10c88be2cac87654e7

Comment by Sergei Golubchik [ 2020-06-25 ]

No, it wasn't reverted. Only the client charset was reverted, and that did not affect how the data is stored.

Comment by Otto Kekäläinen [ 2021-10-31 ]

Is this still relevant? In MariaDB 10.6 the default charset was already changed to utf8mb3, which solves most of the issues people had, the same issues that led many to switch to utf8mb4 earlier?

https://mariadb.com/kb/en/unicode/
> From MariaDB 10.6, utf8 is by default an alias for utf8mb3

Comment by Sergei Golubchik [ 2021-11-01 ]

No, the default hasn't been changed, iirc. The meaning of "utf8" was.
Also, Debian and SUSE used to set character-set-server and collation-server to utf8.

This task is about making the change upstream.

Comment by Otto Kekäläinen [ 2021-11-02 ]

Roger that, full UTF-8 (=utf8mb4) is indeed needed to support emojis (e.g. U+1F4A9 PILE OF POO 💩) and other characters in the full UTF-8 spec.

For followers of this Jira, the post https://mathiasbynens.be/notes/mysql-utf8mb4 is a good explanation of the topic (though it does not mention utf8mb3).
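
As a quick illustration of why a 3-byte charset cannot hold such characters, the following Python sketch checks the UTF-8 encoded length of each character. The helper name `fits_utf8mb3` is hypothetical, and the check only models the byte-length limit, not any server-side validation.

```python
def fits_utf8mb3(text: str) -> bool:
    """Return True if every character encodes to at most 3 bytes in UTF-8."""
    return all(len(ch.encode("utf-8")) <= 3 for ch in text)


if __name__ == "__main__":
    pile_of_poo = "\U0001F4A9"
    print(len(pile_of_poo.encode("utf-8")))  # number of UTF-8 bytes for the emoji
    print(fits_utf8mb3("Kekäläinen"))        # BMP-only text
    print(fits_utf8mb3(pile_of_poo))         # non-BMP character
```

Any character above U+FFFF needs four bytes in UTF-8, so it falls outside what utf8mb3 can store.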

Comment by Dario Seidl [ 2021-11-03 ]

> There are some questions which should be discussed before/while working on this task:
>
> Should we introduce a new collation (as a replacement for utf8mb4_general_ci) and make it default for utf8mb4? The problem is that utf8mb4_general_ci is very bad for non-BMP characters - it considers all non-BMP characters as equal to each other.

I agree, it should really not be `utf8mb4_general_ci`, that one sacrifices correctness for performance and that shouldn't be the default. There's `utf8mb4_unicode_ci` which correctly implements Unicode sorting, and the newer `utf8mb4_unicode_520_ci` with updated weight keys. MySQL also has the even newer `utf8mb4_0900_ai_ci` which doesn't exist in MariaDB yet, I think.

Generated at Thu Feb 08 08:49:13 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.