Details
-
New Feature
-
Status: Closed (View Workflow)
-
Critical
-
Resolution: Fixed
Description
Under terms of this task will change the default collation for Unicode character sets (utf8mb3, utf8mb4, ucs2, utf16, utf32) from xxx_general_ci to uca1400_ai_ci.
We won't change the server default character set from latin1 to utf8mb4 yet. It will be done separately under terms of MDEV-19123.
Therefore, we narrowed the original scope of this MDEV, which was about both:
- Changing the default collation for Unicode character sets from xxx_general_ci to a modern Unicode collation, with proper support for SMP characters (including emoji)
- Changing the server default character set from latin1 to utf8mb4
to just #1.
Under terms of this task we won't change the collation of INFORMATION_SCHEMA columns. They'll remain utf8mb3_general_ci. They'll probably also change eventually, but not now. Just one big change at a time!
See also the section "The original task description from the reporter" below.
Implementation details
There were two different implementation ways to change the default collation:
- Change the default of the server system variable @@character_set_collations, recently added to 11.2 by
MDEV-30164, to:"utf8mb3=uca1400_ai_ci,utf8mb4=uca1400_ai_ci,ucs2=uca1400_ai_ci,utf16=uca1400_ai_ci,utf32=uca1400_ai_ci";
- Change the hard-coded defaults inside the collation library (which resides in the /strings sub-directory of MariaDB sources)
We'll go #1.
The hard-coded defaults inside the collation library will remain xxx_general_ci for all Unicode character sets, for the following reasons:
- It's easier to do this way, because uca1400_ai_ci needs some initialization at the server startup, while xxx_general_ci works just out of the box. At the very early startup stage (before the collation library initialization), the server needs a functional CHARSET_INFO pointer to handle utf8 data.
- Hard-coding uca1400_ai_ci won't be good for all users anyway. Some users might prefer uca1400_as_ci (for accent sensitive comparison), or prefer xxx_general_cl for backward compatibility.
Changing collation derivation for CAST(expr AS CHAR), CONVERT(expr USING cs), user variables
Changing the default collation will cause "Illegal mix of collations" errors in two important scenarios:
- Comparing a column with a non-default collation to a result of CAST/CONVERT
- Comparing a column with a non-default collation to a string user variable
This is especially important for queries to INFORMATION_SCHEMA tables, whose columns use utf8mb3_general_ci:
SET character_set_collations='utf8mb3=uca1400_ai_ci' /*Emulate MDEV-25829 */; |
SET NAMES utf8mb3; |
SELECT * FROM INFORMATION_SCHEMA.COLUMNS WHERE COLUMN_NAME=CONCAT('a',CAST(1 AS CHAR CHARACTER SET utf8)); |
ERROR 1267 (HY000): Illegal mix of collations (utf8mb3_general_ci,IMPLICIT) and (utf8mb3_uca1400_ai_ci,IMPLICIT) for operation '='
|
SET character_set_collations='utf8mb3=uca1400_ai_ci' /*Emulate MDEV-25829 */; |
SET NAMES utf8mb3; |
SET @a='test'; |
SELECT * FROM INFORMATION_SCHEMA.COLUMNS WHERE COLUMN_NAME=@a; |
ERROR 1267 (HY000): Illegal mix of collations (utf8mb3_general_ci,IMPLICIT) and (utf8mb3_uca1400_ai_ci,IMPLICIT) for operation '='
|
To avoid these conflicts let's change collation derivations as follows:
- for CAST/CONVERT: from IMPLICIT to a new derivation level CAST between SYSCONST and COERCIBLE
- for string user variables: from IMPLICIT to COERCIBLE
So after the change, derivations for various expressions look as follows:
|
DERIVATION_IGNORABLE= 7, // Explicit NULL |
|
DERIVATION_NUMERIC= 6, // Numbers in string context, |
// Numeric user variables |
// CAST(numeric_expr AS CHAR) |
|
DERIVATION_COERCIBLE= 5, // Literals, string user variables |
|
DERIVATION_CAST= 4, // CAST(string_expr AS CHAR), |
// CONVERT(string_expr USING cs) |
|
DERIVATION_SYSCONST= 3, // utf8 metadata functions, e.g. user(), database() |
DERIVATION_IMPLICIT= 2, // Table columns, SP variables |
DERIVATION_NONE= 1, // A mix (e.g. CONCAT) of two differrent collations |
DERIVATION_EXPLICIT= 0 // An explicit COLLATE clause |
The original task description from the reporter
Since MySQL 8.0, they changed the default collation to utf8mb4_0900_ai_ci:
MySQL includes character set support that enables you to store data using a variety of character sets and perform comparisons according to a variety of collations. The default MySQL server character set and collation are utf8mb4 and utf8mb4_0900_ai_ci, but you can specify character sets at the server, database, table, column, and string literal levels.
https://dev.mysql.com/doc/refman/8.0/en/charset.html
Frankly, if anything, this change came way too late. I've personally seen hundreds of development hours be wasted on MySQL databases created with wrong (default) collations and only through bug reports ('my characters get mangled') was it found out and fixed. Over. And over. Again. Because every dev that did 'create table whatever (...)' would forget to set the collation and first (back in the day) get a Swedish (why??) collation and then by mistake switch to the (crippled/broken) utf8, thinking it solved it... And then finally got wise and changed to utf8mb4, just to have the whole story start over on the next table someone created.
I love MySQL, but man this part was a royal mess-up and it cost the world hundreds of millions if not billions of wasted dev hours. I mean this hit everyone running a MySQL server. And now with those emoji I mean I don't even dare to guess at the number of $ wasted.
So imagine my surprise when I read
In MariaDB, the default character set is latin1, and the default collation is latin1_swedish_ci
https://mariadb.com/kb/en/setting-character-sets-and-collations/
OMG, please say it isn't so!
Please! Please change this. Like, NOW. This is costing sooooo much dev hours. Soooo much billions of dollars wasted! Devs are stupid ok? They don't understand character sets and collations ok? Never have, never will. This whole Unicode thing has been a personal interest for me for over a decade and I still don't fully grasp it. But one thing I do know for sure and that is: `latin1_swedish_ci` makes NO SENSE whatsoever, to anyone. This is not a good setting. For no one. Not even for Swedish people. Because it fits only 255 characters! I mean really? Do databases still get created for which this default actually makes any sense?
Please fix it. Please! For love of the world. For peace. To end poverty. Out of good citizenship. Because it's just a few lines in the default server config. For the lulz. For whatever reason, but just do it. Please!
Attachments
Issue Links
- blocks
-
MDEV-19123 Change default charset from latin1 to utf8mb4
- Closed
-
MDEV-30041 don't set utf8_is_utf8mb3 by default in the old-mode
- Open
- causes
-
MDEV-33696 main.dyncol and ctype_unicode_casefold_bmp.inc in --view
- Closed
-
MDEV-33698 tests in the jp suite fail
- Closed
-
MDEV-33701 upgrades from 11.4 to 11.5 don't work
- Closed
-
MDEV-35041 Simple comparison causes "Illegal mix of collations" even with default server settings
- Closed
- is blocked by
-
MDEV-27265 Improve contraction performance in UCA collations
- Closed
-
MDEV-27266 Improve UCA collation performance for utf8mb3 and utf8mb4
- Closed
-
MDEV-30164 System variable for default collations
- Closed
-
MDEV-30556 UPPER() returns an empty string for U+0251 in Unicode-5.2.0+ collations for utf8
- Closed
-
MDEV-30577 Case folding for uca1400 collations is not up to date
- Closed
-
MDEV-30661 UPPER() returns an empty string for U+0251 in uca1400 collations for utf8
- Closed
-
MDEV-33788 HEX(COLUMN_CREATE(.. AS CHAR ...)) fails with --view-protocol
- Closed
- relates to
-
MDEV-27490 Allow full utf8mb4 for identifiers
- Stalled