[MDEV-7128] Configuring charsets or collations as utf8 yields surprising result and leads to data loss - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Major
Resolution: Duplicate
Affects Version/s: 10.0(EOL)
Fix Version/s: N/A
Component/s: Character Sets
Labels:
None

Description

Configuring databases and collations to be utf8 and utf8_unicode_ci respectively leads to the very surprising result that actually utf8 is NOT used, but instead a custom variant of utf8 which is incompatible with the full range of unicode characters.

Which of course leads to hard to debug problems as one doesn't even suspect that such a problem exists.

The problem is that the name utf-8 is reused to mean something that it is not according to the relevant standards rfc, unicode consortium

Instead one has to workaround the problem by a) actually knowing about and b) configuring something like this:

-- snip --

[mysqld]

# switch to 4 byte utf-8 as default

# See: https://mathiasbynens.be/notes/mysql-utf8mb4

init_connect  = "SET NAMES utf8mb4"

collation_server = utf8mb4_unicode_ci

character_set_server = utf8mb4

[client]

default-character-set = utf8mb4

[mysql]

default-character-set = utf8mb4

-- snap --

What do I expect as a user: If I configure mariadb to use utf8, it expect it to do so, and not some custom variant subset of unicode that automatically discards data my users entered as utf8 trusting it to be handled correctly.

Options to handle this:
a) Just actually use unicode if unicode is specified - as the utf8mb3 encoding is fully binary compatible with utf8mb4 (actual utf8) this would just work. People who need utf8mb3 can configure their system and likely know what it is anyway if the actually require it.
b) If the name of the encoding cannot be just redefined, a lengthy deprecation cycle can be started right now. Deprecate all the utf8 encodings which don't specify if they should be utf8mb3 or utf8mb4 and emit warnings for the next several years (I'd think something like 4 or whatever floats your boat) to fully specify the actual wanted encoding. Then after that make the utf8 name invalid for the same amount to get it out of configurations. Only then (now we're 8 years into the future) can the name utf8 be reenabled as a new encoding that actually is utf8 and doesn't surprise users.

Well a) and b) are certainly some extremes, so I suspect that you will choose something in between, but right now the situation is very bad as nobody who configures utf8 or utf8-ci actually suspects that they're not getting what they want, so just switching utf8 to mean utf8mb4 will - I suspect - do the right thing for a lot of people. So I would like to suggest choosing an option that is closer to a) than to b).

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

screenshot-1.png
24 kB
2018-02-14 13:55

Issue Links

duplicates

MDEV-30041 don't set utf8_is_utf8mb3 by default in the old-mode

Open

is blocked by

MDEV-8334 Rename utf8 to utf8mb3

Closed

relates to

MDEV-7649 wrong result when comparing utf8 column with an invalid literal

Closed

MDEV-8036 Fix all collations to compare broken bytes as "greater than any non-broken character"

Closed

MDEV-19123 Change default charset from latin1 to utf8mb4

Closed

MDEV-8765 mysqldump silently corrupts 4-byte UTF-8 data

Closed

(1 relates to)

Activity

Ascending order - Click to sort in descending order

View 17 older comments

Martin Häcker added a comment - 2021-11-10 13:42

@wlad Well I hear your concerns, and extensive testing is always good. But cleaning up old mistakes at some point is what has to be done if you want to live in a brighter future at some point.

The trick is to do it in a way that doesn't immediately break everyones stuff. The recommendation that entering utf8 will actually read and render it as utf8mb3 will at lest allow people to find out that something funky is going on. Adding a switch `--utf8-actually-means-utf8` with a default of NO is a good next step to allow people to step into a sane future where only one sysadmin has to understand this footgun. Flipping this default can then happen as far in the future as needed, and can generate ample warnings in the meantime.

Those are sane steps that allow for a sane transition - and I believe they allow for ample opportunity for compatibility with MySQL.

I have to admin that I wonder a bit though why you are looking at MySQL as upstream. Clearly for quite some time now and - as seen by developments like this - MariaDB is the upstream MySQL will have to catch up to.

Martin Häcker added a comment - 2021-11-10 13:42 @wlad Well I hear your concerns, and extensive testing is always good. But cleaning up old mistakes at some point is what has to be done if you want to live in a brighter future at some point. The trick is to do it in a way that doesn't immediately break everyones stuff. The recommendation that entering utf8 will actually read and render it as utf8mb3 will at lest allow people to find out that something funky is going on. Adding a switch `--utf8-actually-means-utf8` with a default of NO is a good next step to allow people to step into a sane future where only one sysadmin has to understand this footgun. Flipping this default can then happen as far in the future as needed, and can generate ample warnings in the meantime. Those are sane steps that allow for a sane transition - and I believe they allow for ample opportunity for compatibility with MySQL. I have to admin that I wonder a bit though why you are looking at MySQL as upstream. Clearly for quite some time now and - as seen by developments like this - MariaDB is the upstream MySQL will have to catch up to.

Vladislav Vaintroub added a comment - 2021-11-10 16:05 - edited

I'm not looking up at MySQL as upstream. MariaDB is more than mysqld (or mariadbd), and C client, and more than all our 4-5 client drivers altogether. It is maybe 20 or something language drivers, not written by us, and then ORM frameworks , not written by us, and then killer-applications like Wordpress, and JIRA, not written by us.

What do we actually test? our Jira maybe, last time I've heard. Atlassian does not test their producs with us, and does not support MariaDB officially, neither the DB, nor the JDBC driver, even after literally hundreds of people asked. I presume, most of people who write language drivers, do not test it with MariaDB, and do "MySQL first". They would not to add another database to their test matrix. ORM, like Hibernate does not test with MariaDB, last time I've heard. Oracle MySQL developers, who produce popular drivers, JDBC, .NET , Python, ODBC, and some are also PHP committers, of course do "MySQL first", and are not likely to test with us ever.

It is therefore us who should be responsible for not breaking stuff, that works with MySQL. We're not a binary drop-in replacement anymore, and this is fine. But if we screw drivers, or ORMs, or killer-applications, I think it is as bad as one can imagine. MariaDB does not have its own, isolated ecosystem. Not even Oracle MySQL has, even if they have a larger headcount. Thus have to be extra nice to 3rd party developers, who ensure that the server can be accessed from any language, and other developers, who integrate us into ORMs, and yet other people who write "killer apps".

Vladislav Vaintroub added a comment - 2021-11-10 16:05 - edited I'm not looking up at MySQL as upstream. MariaDB is more than mysqld (or mariadbd), and C client, and more than all our 4-5 client drivers altogether. It is maybe 20 or something language drivers, not written by us, and then ORM frameworks , not written by us, and then killer-applications like Wordpress, and JIRA, not written by us. What do we actually test? our Jira maybe, last time I've heard. Atlassian does not test their producs with us, and does not support MariaDB officially, neither the DB, nor the JDBC driver, even after literally hundreds of people asked. I presume, most of people who write language drivers, do not test it with MariaDB, and do "MySQL first". They would not to add another database to their test matrix. ORM, like Hibernate does not test with MariaDB, last time I've heard. Oracle MySQL developers, who produce popular drivers, JDBC, .NET , Python, ODBC, and some are also PHP committers, of course do "MySQL first", and are not likely to test with us ever. It is therefore us who should be responsible for not breaking stuff, that works with MySQL. We're not a binary drop-in replacement anymore, and this is fine. But if we screw drivers, or ORMs, or killer-applications, I think it is as bad as one can imagine. MariaDB does not have its own, isolated ecosystem. Not even Oracle MySQL has, even if they have a larger headcount. Thus have to be extra nice to 3rd party developers, who ensure that the server can be accessed from any language, and other developers, who integrate us into ORMs, and yet other people who write "killer apps".

Martin Häcker added a comment - 2021-11-11 07:53

@wlad I think your heart is in the right spot. Maybe the only thing I would just like to ask you to also consider, is that the users that are bitten by their misunderstanding of what utf8 really means are also all the users of the database and treating them extra nice is just as much a concern as the developers of the hugely important drivers, frameworks and apps that build on top of databases.

That being said: Rest assured that in my social circle development has ben MariaDB first for many years now. So that exists too.

Regarding the open source projects that don't run their test harnesses with MariaDB: I am not a Hibernate user, but having persistent Bug-Reports that they should add a test harness with MariaDB seems like a good idea and like something that is missing? (Sorry, as I am not a Hibernate Developer I may have searched in the wrong place for such bug reports, but I couldn't find anything). Do you have a tracking issue here that coordinates adding test harnesses to open source projects and directs developers who would like to help that effort to those places / issues?

Martin Häcker added a comment - 2021-11-11 07:53 @wlad I think your heart is in the right spot. Maybe the only thing I would just like to ask you to also consider, is that the users that are bitten by their misunderstanding of what utf8 really means are also all the users of the database and treating them extra nice is just as much a concern as the developers of the hugely important drivers, frameworks and apps that build on top of databases. That being said: Rest assured that in my social circle development has ben MariaDB first for many years now. So that exists too. Regarding the open source projects that don't run their test harnesses with MariaDB: I am not a Hibernate user, but having persistent Bug-Reports that they should add a test harness with MariaDB seems like a good idea and like something that is missing? (Sorry, as I am not a Hibernate Developer I may have searched in the wrong place for such bug reports, but I couldn't find anything). Do you have a tracking issue here that coordinates adding test harnesses to open source projects and directs developers who would like to help that effort to those places / issues?

Vladislav Vaintroub added a comment - 2021-11-11 10:11 - edited

There is nobody who coordinates pushing OSS projects into testing MariaDB direction, as there is nobody within Oracle MySQL, either. I was wrong about Hibernate btw, after checking they do test now with MariaDB
https://github.com/hibernate/hibernate-orm/runs/4141189548?check_suite_focus=true. But it would be naïve to think, that everyone in OSS, including Oracle/MySQL shall now start testing our unreleased stuff.

The burden of this testing is clearly on us. And we should get a lot more sensitive about any visible changes (i.e whatever we send to clients over the network). We, that is, the server developers, and the server QA. The server developers should, I believe be guided by Robustness principle , which tells to be conservative in what you send, and the QA should ensure that this principle holds.

This happened repeatedly here in the past, that new releases broke protocol. An example would be ~~MDEV-24121~~, where we broke simultaneously PHP, Oracle/MySQL Python, Oracle/MySQL JDBC, and our very own ODBC.
Sometimes, we do admit mistakes, fix stuff(even when we think, there is a bug in the all connectors above), and issue re-releases, like it happened with ~~MDEV-24121~~.

Sometimes we don't, like it happened now with 10.6. We now sit there, waiting for mercy of .NET developers working for Oracle/MySQL, to fix their software to anticipate MariaDB server's change in behavior. We now show "utf8mb3" in SHOW COLLATION, which they are not prepared to handle, and we declare that would be their bug, which I'd think does not make sense. For us, it was very important to have utf8mb3 in the output of SHOW collation, worthy breaking the above mentioned robustness principle, a great achievement, exemplified by "blocker" status of ~~MDEV-8334~~.

There is MDEV-15148, which I created back in 2018, hoping for our QA to correctly prioritize. This did no happen, unfortunately, to this day, on whatever reasons.

And unless we test, we should stop doing stuff like ~~MDEV-8334~~. I do not know anyone, who would need that change. It is cosmetics, which hurt, and it was many hours of essentially wasted developer's time, which could be put into doing something more productive than breaking unknown number of dependencies.

Vladislav Vaintroub added a comment - 2021-11-11 10:11 - edited There is nobody who coordinates pushing OSS projects into testing MariaDB direction, as there is nobody within Oracle MySQL, either. I was wrong about Hibernate btw, after checking they do test now with MariaDB https://github.com/hibernate/hibernate-orm/runs/4141189548?check_suite_focus=true . But it would be naïve to think, that everyone in OSS, including Oracle/MySQL shall now start testing our unreleased stuff. The burden of this testing is clearly on us. And we should get a lot more sensitive about any visible changes (i.e whatever we send to clients over the network). We, that is, the server developers, and the server QA. The server developers should, I believe be guided by Robustness principle , which tells to be conservative in what you send, and the QA should ensure that this principle holds. This happened repeatedly here in the past, that new releases broke protocol. An example would be MDEV-24121 , where we broke simultaneously PHP, Oracle/MySQL Python, Oracle/MySQL JDBC, and our very own ODBC. Sometimes, we do admit mistakes, fix stuff(even when we think, there is a bug in the all connectors above), and issue re-releases, like it happened with MDEV-24121 . Sometimes we don't, like it happened now with 10.6. We now sit there, waiting for mercy of .NET developers working for Oracle/MySQL, to fix their software to anticipate MariaDB server's change in behavior. We now show "utf8mb3" in SHOW COLLATION, which they are not prepared to handle, and we declare that would be their bug, which I'd think does not make sense. For us, it was very important to have utf8mb3 in the output of SHOW collation, worthy breaking the above mentioned robustness principle, a great achievement, exemplified by "blocker" status of MDEV-8334 . There is MDEV-15148 , which I created back in 2018, hoping for our QA to correctly prioritize. This did no happen, unfortunately, to this day, on whatever reasons. And unless we test, we should stop doing stuff like MDEV-8334 . I do not know anyone, who would need that change. It is cosmetics, which hurt, and it was many hours of essentially wasted developer's time, which could be put into doing something more productive than breaking unknown number of dependencies.

Sergei Golubchik added a comment - 2022-11-19 21:56 - edited

This (the original problem from the bug description) will be solved in MDEV-30041

Sergei Golubchik added a comment - 2022-11-19 21:56 - edited This (the original problem from the bug description) will be solved in MDEV-30041

MariaDB Server

Configuring charsets or collations as utf8 yields surprising result and leads to data loss

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates

Git Integration