[MDEV-25829] Change default Unicode collation to uca1400_ai_ci - Jira

Stijn de Witt created issue - 2021-05-31 20:25

Sergei Golubchik added a comment - 2021-06-05 16:30

The default depends on the distribution you take MariaDB from.
In rpm packages from SUSE, as far as I know, the default is utf8. Compiled in.
In all deb packages the default is utf8mb4_general_ci, set in my.cnf.

In binary tarballs and rpms from mariadb.org the default is, indeed, latin1_swedish_ci. Changing it is a rather drastic incompatible change that will break existing applications. Given how few complaints we have ever received about it, we might better take some time thinking about how to do this transition in the least disruptive way possible, instead of flipping the switch now.

In general, we agree it has to be done. It's not 1995 anymore. The question only is when to do it and how.

Sergei Golubchik made changes - 2021-12-06 21:35

Field	Original Value	New Value
Workflow	MariaDB v3 [ 122357 ]	MariaDB v4 [ 142899 ]

cybernet2u added a comment - 2022-05-02 10:09

any news ?

Stijn de Witt added a comment - 2022-05-15 08:47 - edited

> In rpm packages from SUSE, as far as I know, the default is utf8. Compiled in.

A default of `utf8` might actually be worse, since people are getting the 3-byte broken MySQL version of UTF-8 and they will not understand why most characters are stored correctly but some get mangled. It takes deep understanding of the history of both Unicode and MySQL to understand that in MySQL, `utf8` means something different than the rest of the world means by it.
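The "most characters are stored correctly but some get mangled" behaviour can be illustrated with a minimal Python sketch. It assumes only that the 3-byte `utf8` (utf8mb3) cannot hold any character whose UTF-8 encoding is longer than 3 bytes; the helper name `fits_in_utf8mb3` is hypothetical and not part of MariaDB.

```python
def fits_in_utf8mb3(text: str) -> bool:
    """Return True if every character needs at most 3 bytes in UTF-8.

    Hypothetical helper: it models the 3-byte limit of utf8mb3,
    not MariaDB's actual storage code.
    """
    return all(len(ch.encode("utf-8")) <= 3 for ch in text)


# Accented Latin letters need 2 bytes each, so they fit.
print(fits_in_utf8mb3("naïve café"))    # True
# The emoji U+1F44D lies outside the Basic Multilingual Plane and needs 4 bytes.
print(fits_in_utf8mb3("thumbs up 👍"))  # False
```

A check like this could be run over existing text before choosing a character set, to see whether any stored values already require utf8mb4.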

> In all deb packages the default is utf8mb4_general_ci, set in my.cnf

That's a sane choice.

> In binary tarballs and rpms from mariadb.org the default is, indeed, latin1_swedish_ci. Changing it is a rather drastic incompatible change that will break existing applications.

I have heard this argument before and I find it rather weak. It might break existing Swedish applications that never need their text content to be read in any other part of the world, and for which 255 characters suffice...

For applications from any other place in the world, or that need more than 255 characters, this default makes no sense whatsoever. The idea that there might be many applications for which a Swedish 255-character set is ok is imho laughable. Either apps stick to ASCII, in which case basically all encodings are ok for them, or they need support for more characters, and then basically there is only one suitable encoding and that is utf8mb4. All others are broken out of the box.

In the meantime, this default continues to break all new applications. The backward compatibility argument is moot. There is no compatibility with these legacy encodings. They only work for a small subset of machines in a particular region of the world and they break for all other regions. They are a leftover of days gone by when people simply had no internet and transferred files to each other using floppy disks. With these legacy encodings, people from France cannot read files from Germany. People from England cannot read files from Sweden, etc. There simply are ZERO applications in today's world for which latin1 makes sense. And there are ZERO applications for which utf8mb4 is not the best and in fact ONLY encoding that makes sense. All other encodings will break some characters. Only utf8mb4 works under all circumstances.
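A small Python sketch of why single-byte legacy encodings do not travel: the same byte means different characters under different code pages, and anything outside the 256-slot table cannot be stored at all. Note the assumption: Python's `latin-1` codec is ISO-8859-1, which may differ in detail from MariaDB's `latin1`, so this illustrates the principle rather than the server's exact behaviour. The helper name `can_store_in_latin1` is hypothetical.

```python
# One byte, two readings: which character it is depends on the code page.
raw = bytes([0xE5])
as_latin1 = raw.decode("latin-1")  # Western European reading: 'å'
as_cp1251 = raw.decode("cp1251")   # Cyrillic reading: 'е' (U+0435)
print(as_latin1, as_cp1251)


def can_store_in_latin1(text: str) -> bool:
    """Hypothetical helper: True if the text fits Python's latin-1 codec."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


print(can_store_in_latin1("smörgåsbord"))  # True
print(can_store_in_latin1("Łódź"))         # False: 'Ł' and 'ź' have no latin-1 slot
```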

> we might better take some time thinking about how to do this transition

With all due respect...

"the hardest thing about Unicode is not Unicode itself, but all the legacy encodings used in other software and files creeping into your project and breaking things."

I wrote this on May 5th, 2010, over 12 years ago! And it's still true. Please switch the default. Millions of devs will be better off for it.

The way MySQL (and now by inheritance MariaDB) handles Unicode is so problematic, I wrote a post to warn devs just about that in 2015. 7 years ago!

> In general, we agree it has to be done. It's not 1995 anymore. The question only is when to do it and how.

I'm glad you agree. And sorry for my rant. But frankly I have been watching the encoding drama unfold literally for well over a decade now and I've heard the backward compatibility argument so many times. So I get triggered by it.

I suggest you have a look at what MySQL did. You just make a new major release and splatter this thing all over the release notes. Possibly you check whether an explicit encoding is set on import of data and print a warning if not. But in my experience there are only two scenarios with MySQL:

- People set the encoding explicitly -> they are not affected
- People forget to set the encoding -> they get latin1 -> their app is broken until they fix it

There are no apps using latin1 that are not broken. That is a blanket statement, I know, but I challenge you to show me such an app. The nature of these legacy encodings is that they are incompatible out of the box. And someone who still has not addressed their encoding issues after all these years does not deserve to be helped / assisted by maintaining this broken default. New devs using MariaDB in their first project deserve to be helped.

Stijn de Witt added a comment - 2022-05-15 09:37

I want to add some more. Sorry about bothering you but maybe it is ok because you might learn something?

There basically are 3 major problems that in MariaDB compound together to create a real drama that costs the world billions of dollars:

1. Very few developers understand character encodings
2. The default in one of the world's most popular data storage platforms is broken
3. The `utf8` encoding in that platform is also broken!

So what happens is this:

1. A developer with little understanding of encodings installs MariaDB and connects his app to it
2. He delivers the app and soon customers complain about mangled characters
3. He discovers that the default encoding is latin1_swedish and reads up on Unicode
4. He switches the encoding to `utf8`, oblivious to the fact that it is ALSO BROKEN!
5. He spends valuable time trying to figure out how to fix the existing data with mangled text
6. He delivers the fix to the customer
7. For a while, all seems well
8. Reports start to come in about emoji being mangled
9. The dev reads up on the Basic Multilingual Plane and the history of MySQL/MariaDB
10. To his dismay, the dev realizes that `utf8` in MariaDB is something different altogether from what he assumed it was
11. The dev finally changes the encoding to `utf8mb4`
12. A new attempt at fixing the broken data is made
13. The second fix is delivered to the customer
14. Finally the app works!
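The "mangled characters" and the later data fix in the sequence above can be sketched in Python. This assumes one common cause of mangling: UTF-8 bytes being interpreted as latin1 somewhere between the app and the table. Real cases may involve other paths (double conversion, truncation at the first 4-byte character), so the functions `mangle` and `repair` are illustrative names only, not a general repair tool.

```python
def mangle(text: str) -> str:
    """Simulate UTF-8 bytes being read back as if they were latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair(mangled: str) -> str:
    """Undo the misinterpretation, assuming nothing else touched the data."""
    return mangled.encode("latin-1").decode("utf-8")


broken = mangle("café")
print(broken)          # cafÃ©
print(repair(broken))  # café
```

The repair only works while the original bytes are still intact; once a conversion has replaced unrepresentable characters with `?`, the information is gone.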

These are some of the most expensive changes possible. And every app has to go through this process if the dev was not an encoding expert. These issues involve schema migrations and data migrations, and they require the devs to read up a lot and do a lot of research. They demand of these devs that they become Unicode and MariaDB experts just to be able to understand what is happening. And every single dev has to go through this again and again. I have issues on my current project (logged by me) saying 'we have to change the encoding for these tables', followed by a long list of tables. On many occasions, the server defaults cannot easily be changed by the dev (managed databases). So we just have to tell every dev: 'if you create a table, set the encoding'. All because of this one default setting that is wrong.
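The rule 'if you create a table, set the encoding' could be checked mechanically. The following is a rough Python sketch, not a SQL parser: it assumes statements end with `;`, that table names are simple identifiers, and that any `CHARSET` or `CHARACTER SET` clause inside the statement counts as explicit. The function name `tables_without_charset` and the sample schema are hypothetical.

```python
import re
from typing import List

# Heuristic patterns only; a real schema linter would use a proper SQL parser.
CREATE_TABLE = re.compile(
    r"CREATE\s+TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?`?(\w+)`?(.*?);",
    re.IGNORECASE | re.DOTALL,
)
CHARSET = re.compile(r"\b(?:CHARACTER\s+SET|CHARSET)\s*=?\s*\w+", re.IGNORECASE)


def tables_without_charset(sql: str) -> List[str]:
    """Return names of tables whose CREATE TABLE has no charset clause."""
    return [
        match.group(1)
        for match in CREATE_TABLE.finditer(sql)
        if not CHARSET.search(match.group(2))
    ]


schema = """
CREATE TABLE users (id INT, name VARCHAR(100));
CREATE TABLE posts (id INT, body TEXT) DEFAULT CHARSET=utf8mb4;
"""
print(tables_without_charset(schema))  # ['users']
```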

People don't understand this. And that's ok. From your comments on how changing the default might break applications, I realize that you don't really understand this either. And that's also ok. No one should be forced to learn about this stuff. There are better things we can do with our time. But maybe you can trust me and my experience and just change the default and be done with it. I promise you it will not break any applications that were not already broken, and it will prevent the drama I described above for all new applications. And it will allow devs to not understand encodings and focus on their business logic instead of requiring them to become encoding/MariaDB experts. The world will be better off for it.

Sergei Golubchik made changes - 2022-06-07 18:38

Link

This issue is blocked by ~~MDEV-19123~~ [ ~~MDEV-19123~~ ]

Sergei Golubchik added a comment - 2022-06-07 18:42 - edited

as a status update: this issue is related to or depends on a bunch of other issues, notably:

~~MDEV-19123~~ Change default charset to utf8mb4
~~MDEV-27266~~ Improve UCA collation performance for utf8mb3 and utf8mb4
MDEV-27490 Allow full utf8mb4 for identifiers
~~MDEV-27009~~ Add UCA-14.0.0 collations

this is work in progress at the moment

Alexander Barkov made changes - 2022-09-07 10:27

Link

This issue is blocked by ~~MDEV-19123~~ [ ~~MDEV-19123~~ ]

Alexander Barkov made changes - 2022-09-07 10:27

Link

This issue blocks ~~MDEV-19123~~ [ ~~MDEV-19123~~ ]

Ralf Gebhardt made changes - 2022-09-07 10:30

Priority

Major [ 3 ]

Critical [ 2 ]

Ralf Gebhardt made changes - 2022-10-27 12:34

Assignee

Alexander Barkov [ bar ]

Julien Fritsch made changes - 2022-10-27 12:54

Fix Version/s

10.12 [ 28320 ]

Stijn de Witt added a comment - 2022-10-27 14:05

If I read this correctly, you guys have fixed it!

That deserves a 'Hurray!!' and congratulations to all involved.
Thank you for making the world a better place!

Sergei Golubchik made changes - 2022-11-01 13:50

Issue Type

Bug [ 1 ]

Task [ 3 ]

Julien Fritsch made changes - 2022-11-15 16:59

Link

This issue relates to MDEV-27490 [ MDEV-27490 ]

Sergei Golubchik made changes - 2022-11-19 22:02

Link

This issue blocks MDEV-30041 [ MDEV-30041 ]

Alexander Barkov made changes - 2022-12-06 12:25

Link

This issue relates to ~~MDEV-30164~~ [ ~~MDEV-30164~~ ]

Alexander Barkov made changes - 2023-01-27 06:45

Link

This issue is blocked by ~~MDEV-30164~~ [ ~~MDEV-30164~~ ]

Alexander Barkov made changes - 2023-01-27 06:45

Link

This issue relates to ~~MDEV-30164~~ [ ~~MDEV-30164~~ ]

Alexander Barkov made changes - 2023-02-03 07:27

Link

This issue is blocked by ~~MDEV-30556~~ [ ~~MDEV-30556~~ ]

Alexander Barkov made changes - 2023-02-06 06:11

Link

This issue relates to ~~MDEV-30577~~ [ ~~MDEV-30577~~ ]

Alexander Barkov made changes - 2023-02-16 07:36

Link

This issue is blocked by ~~MDEV-30661~~ [ ~~MDEV-30661~~ ]

Julien Fritsch made changes - 2023-05-04 09:16

Fix Version/s		11.2 [ 28603 ]
Fix Version/s	11.0 [ 28320 ]

Ralf Gebhardt made changes - 2023-07-25 19:47

Fix Version/s		11.3 [ 28565 ]
Fix Version/s	11.2 [ 28603 ]

Sergei Golubchik made changes - 2023-08-18 09:40

Summary

Change default collation to utf8mb4_0900_ai_ci

Change default collation to utf8mb4_1400_ai_ci

Alexander Barkov added a comment - 2023-08-22 08:50 - edited

StijnDeWitt, your original guess in 2022 was not correct.
Changing default collations for character sets is possible only starting from ~~MDEV-30164~~ (11.2.1).
So, from that version on, one can set the default collation for utf8mb4 to, say, utf8mb4_uca1400_ai_ci.

This task is still open. It's about changing the hard coded default to utf8mb4_uca1400_ai_ci.

We need to make sure that:

- all possible optimizations are done for utf8mb4_uca1400_ai_ci, so that it does not degrade much in performance compared to utf8mb4_general_ci
- defaults are changed for all Unicode character sets at the same time
- uca1400 collations are added for utf16be
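For readers unfamiliar with the collation name: the `_ai_ci` suffix stands for accent-insensitive, case-insensitive comparison. The Python sketch below only illustrates that idea by stripping combining marks and case-folding; it is an assumption-laden approximation and not the UCA 14.0.0 algorithm that the uca1400 collations implement. The function name `ai_ci_key` is hypothetical.

```python
import unicodedata


def ai_ci_key(text: str) -> str:
    """Approximate an accent- and case-insensitive comparison key.

    Illustration only: real UCA collations use weight tables, not this shortcut.
    """
    decomposed = unicodedata.normalize("NFD", text)
    # Drop combining marks (accents) left over after decomposition.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


print(ai_ci_key("Éclair") == ai_ci_key("eclair"))  # True
print(ai_ci_key("RÉSUMÉ") == ai_ci_key("resume"))  # True
print(ai_ci_key("a") == ai_ci_key("b"))            # False
```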

Alexander Barkov added a comment - 2023-08-22 08:53 - edited

The easiest solution to implement this task would be to change the default value for GLOBAL.character_set_collations from empty to:

utf8mb3=utf8mb3_uca1400_ai_ci,ucs2=ucs2_uca1400_ai_ci,utf8mb4=utf8mb4_uca1400_ai_ci,utf16=utf16_uca1400_ai_ci,utf32=utf32_uca1400_ai_ci

If we go this way, we should make sure that the implementation of @@character_set_collations does not have any performance problems with a non-empty value.
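The proposed value follows a regular `charset=charset_collation` pattern, so it can be generated and parsed rather than typed by hand. A minimal Python sketch, assuming only the comma-separated `name=value` format shown above; the function names and the `suffix` parameter are hypothetical and do not correspond to any MariaDB API.

```python
from typing import Dict, List

# Character sets taken from the value proposed in the comment above.
UNICODE_CHARSETS = ["utf8mb3", "ucs2", "utf8mb4", "utf16", "utf32"]


def build_character_set_collations(
    charsets: List[str], suffix: str = "uca1400_ai_ci"
) -> str:
    """Build a 'charset=charset_suffix' list joined by commas."""
    return ",".join(f"{cs}={cs}_{suffix}" for cs in charsets)


def parse_character_set_collations(value: str) -> Dict[str, str]:
    """Split the value back into a charset -> collation mapping."""
    if not value:
        return {}  # the current default is the empty string
    return dict(pair.split("=", 1) for pair in value.split(","))


value = build_character_set_collations(UNICODE_CHARSETS)
print(value)
print(parse_character_set_collations(value)["utf8mb4"])  # utf8mb4_uca1400_ai_ci
```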

Sergei Golubchik made changes - 2023-09-17 17:53

Fix Version/s		11.4 [ 29301 ]
Fix Version/s	11.3 [ 28565 ]

Alexander Barkov made changes - 2023-10-02 04:26

Link

This issue relates to ~~MDEV-27266~~ [ ~~MDEV-27266~~ ]

Alexander Barkov made changes - 2023-10-02 04:26

Link

This issue relates to ~~MDEV-27265~~ [ ~~MDEV-27265~~ ]