[MDEV-25829] Change default collation to utf8mb4_uca1400_ai_ci Created: 2021-05-31 Updated: 2023-12-22 |
|
| Status: | In Review |
| Project: | MariaDB Server |
| Component/s: | Character Sets |
| Fix Version/s: | 11.5 |
| Type: | New Feature | Priority: | Critical |
| Reporter: | Stijn de Witt | Assignee: | Sergei Golubchik |
| Resolution: | Unresolved | Votes: | 1 |
| Labels: | None | ||
| Issue Links: |
|
||||||||||||||||||||||||||||||||||||||||||||
| Description |
|
Since MySQL 8.0, they changed the default collation to `utf8mb4_0900_ai_ci`: > MySQL includes character set support that enables you to store data using a variety of character sets and perform comparisons according to a variety of collations. The default MySQL server character set and collation are utf8mb4 and utf8mb4_0900_ai_ci, but you can specify character sets at the server, database, table, column, and string literal levels. https://dev.mysql.com/doc/refman/8.0/en/charset.html Frankly, if anything, this change came way too late. I've personally seen hundreds of development hours be wasted on MySQL databases created with wrong (default) collations and only through bug reports ('my characters get mangled') was it found out and fixed. Over. And over. Again. Because every dev that did 'create table whatever (...)' would forget to set the collation and first (back in the day) get a Swedish (why??) collation and then by mistake switch to the (crippled/broken) utf8, thinking it solved it... And then finally got wise and changed to utf8mb4, just to have the whole story start over on the next table someone created. I love MySQL, but man this part was a royal mess-up and it cost the world hundreds of millions if not billions of wasted dev hours. I mean this hit everyone running a MySQL server. And now with those emoji I mean I don't even dare to guess at the number of $ wasted. So imagine my surprise when I read > In MariaDB, the default character set is latin1, and the default collation is latin1_swedish_ci https://mariadb.com/kb/en/setting-character-sets-and-collations/ OMG, please say it isn't so! Please! Please change this. Like, NOW. This is costing sooooo much dev hours. Soooo much billions of dollars wasted! Devs are stupid ok? They don't understand character sets and collations ok? Never have, never will. This whole Unicode thing has been a personal interest for me for over a decade and I still don't fully grasp it. But one thing I do know for sure and that is: `latin1_swedish_ci` makes NO SENSE whatsoever, to anyone. This is not a good setting. For no one. Not even for Swedish people. Because it fits only 255 characters! I mean really? Do databases still get created for which this default actually makes any sense? Please fix it. Please! For love of the world. For peace. To end poverty. Out of good citizenship. Because it's just a few lines in the default server config. For the lulz. For whatever reason, but just do it. Please! Changing collation derivation for CAST(expr AS CHAR), CONVERT(expr USING cs), user variablesChanging the default collation will cause "Illegal mix of collations" errors in two important scenarios:
This is especially important for queries to INFORMATION_SCHEMA tables, whose column use utf8mb3_general_ci:
To avoid these conflicts let's change collation derivations as follows:
So after the change, derivations for various expressions look as follows:
|
| Comments |
| Comment by Sergei Golubchik [ 2021-06-05 ] | |
|
The default depends on the distribution you take MariaDB from. In binary tarballs and rpms from mariadb.org the default is, indeed, latin1_swedish_ci. Changing it is a rather drastic incompatible change that will break existing applications. Given how few complains we ever got about it, we might better take some time thinking about how to do this transition in a least disruptive way possible. Instead of flipping the switch now. In general, we agree it has to be done. It's not 1995 anymore. The question only is when to do it and how. | |
| Comment by cybernet2u [ 2022-05-02 ] | |
|
any news ? | |
| Comment by Stijn de Witt [ 2022-05-15 ] | |
|
> In rpm packages from SUSE, as far as I know, the default is utf8. Compiled in. A default of `utf8` might actually be worse, since people are getting the 3-byte broken MySQL version of UTF-8 and they will not understand why most characters are stored correctly but some get mangled. It takes deep understanding of the history of both Unicode and MySQL to understand that in MySQL, `utf8` means something different than the rest of the world means by it. > In all deb packages the default is utf8mb4_general_ci, set it my.cnf That's a sane choice. > In binary tarballs and rpms from mariadb.org the default is, indeed, latin1_swedish_ci. Changing it is a rather drastic incompatible change that will break existing applications. I have heard this argument before and I find it rather weak. It might break existing swedish applications. That never need their text content to be read in any other part of the world. And for which 255 characters suffice... For applications from any other place in the world, or that need more than 255 characters, this default makes no sense whatsoever. The idea that there might be many applications for which a swedish 255 character set is ok is imho laughable. Either apps stick to ASCII, in which case basically all encodings are ok for it, or they need support for more characters and then basically there is only one suitable encoding and that is utf8mb4. All others are broken out of the box. In the meantime, this default continues to break all new applications. The backward compatibility argument is moot There is no compatibility with these legacy encodings. They only work for a small subset of machines in a particular region of the world and they break for all other regions. They are a left over of days gone by when people simply had no internet and transferred files to each other using floppy disks. With these legacy encodings, people from France cannot read files from Germany. People from England cannot read files from Sweden etc. There simply are ZERO applications in today's world for which latin1 makes sense. And there are ZERO aplications for which utf8mb4 is not the best and in fact ONLY encoding that makes sense. All other encodings will break some characters. Only utf8mb4 works under all circumstances. > we might better take some time thinking about how to do this transition With all due respect... "the hardest thing about Unicode is not Unicode itself, but all the legacy encodings used in other software and files creeping into your project and breaking things." I wrote this on may 5th, 2010. over 12 years ago! And it's still true. Please switch the default. Millions of devs will be better off for it. The way MySQL (and now by inheritance MariaDB) handles Unicode is so problematic, I wrote a post to warn devs just about that in 2015. 7 years ago! > In general, we agree it has to be done. It's not 1995 anymore. The question only is when to do it and how. I'm glad you agree. And sorry for my rant. But frankly I have been watching the encoding drama unfold literally for well over a decade now and I've heard the backward compatibility argument so many times. So I get triggered by it I suggest you have a look at what MySQL did. You just make a new major release and splatter this thing all over the release notes. Possibly you check whether an explicit encoding is set on import of data and print a warning if not. But in my experience there are only two scenarios with MySQL:
There are no apps using latin1 that are not broken. That is a blanket statement I know, but I challenge you to show me such an app. The nature of these legacy encodings is that they are incompatible out of the box. And someone that did not address their encoding issues still after all these years does not deserve to be helped / assisted by maintaining this broken default. New devs using MariaDB in their first project deserve to be helped. | |
| Comment by Stijn de Witt [ 2022-05-15 ] | |
|
I want to add some more. Sorry about bothering you but maybe it is ok because you might learn something? There basically are 3 major problems that in MariaDB compound together to create a real drama that costs the world billions of dollars: 1: Very few developers understand character encodings So what happens is this:
These are some of the most expensive changes that are possible. And every app that is made has to go through this process if the dev was not an encoding expert. These issues involve schema migrations, data migrations and the devs to read up a lot and do a lot of research. They demand of these devs that they become Unicode and MariaDB experts just to be able to understand what is happening. And every single dev has to go through this again and again. I have issues on my current project (logged by me) saying 'we have to change the encoding for these tables' and then a long list of tables. In many occasions, the server defaults can not be easily changed by the dev (managed databases). So we just have to tell every dev: 'if you create a table, set the encoding'. All because of this one default setting that is wrong. People don't understand this. And that's ok. From your comments on how changing the default might break applications, I realize that you don't really understand this as well. And that's also ok. No one should be forced to learn about this stuff. There are better things we can do with our time. But maybe you can trust me and my experience and just change the default and be done with it. I promise you it will not break any applications that were not already broken and it will prevent the drama I described above for all new applications. And it will allow devs to not understand encodings and focus on their business logic instead of requiring them to become encoding/MariaDB experts. The world will be better off for it. | |
| Comment by Sergei Golubchik [ 2022-06-07 ] | |
|
as a status update — this issue is related or depends on a bunch of other issues, notably:
this is work in progress at the moment | |
| Comment by Stijn de Witt [ 2022-10-27 ] | |
|
If I read this correctly, you guys have fixed it! That deserves a 'Hurray!!' and congratulations to all involved. | |
| Comment by Alexander Barkov [ 2023-08-22 ] | |
|
StijnDeWitt, your original guess in 2022 was not correct. This task is still open. It's about changing the hard coded default to utf8mb4_uca1400_ai_ci. We need to make sure:
| |
| Comment by Alexander Barkov [ 2023-08-22 ] | |
|
The easiest solution to implement this task would be to change the default value for GLOBAL.character_set_collations from empty to:
If we go this way, we should make sure that the implementation of @@character_set_collations does not have any performance problems with a non-empty value. | |
| Comment by cybernet2u [ 2023-11-15 ] | |
|
I would like to thank developers for listening | |
| Comment by Alexander Barkov [ 2023-11-20 ] | |
|
Hello serg, Can you please review patches for this MDEV? There are three patches. They are in this branch: https://github.com/MariaDB/server/commits/bb-11.4-bar-MDEV-25829 |