[MDEV-7128] Configuring charsets or collations as utf8 yields surprising result and leads to data loss Created: 2014-11-18  Updated: 2022-11-19  Resolved: 2022-11-19

Status: Closed
Project: MariaDB Server
Component/s: Character Sets
Affects Version/s: 10.0
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Martin Häcker Assignee: Alexander Barkov
Resolution: Duplicate Votes: 0
Labels: None

Attachments: PNG File screenshot-1.png    
Issue Links:
Blocks
is blocked by MDEV-8334 Rename utf8 to utf8mb3 Closed
Duplicate
duplicates MDEV-30041 don't set utf8_is_utf8mb3 by default ... Open
Relates
relates to MDEV-7649 wrong result when comparing utf8 colu... Closed
relates to MDEV-8036 Fix all collations to compare broken ... Closed
relates to MDEV-19123 Change default charset from latin1 to... Open
relates to MDEV-8765 mysqldump silently corrupts 4-byte UT... Closed

 Description   

Configuring databases and collations to be utf8 and utf8_unicode_ci respectively leads to the very surprising result that actually utf8 is NOT used, but instead a custom variant of utf8 which is incompatible with the full range of unicode characters.

Which of course leads to hard to debug problems as one doesn't even suspect that such a problem exists.

The problem is that the name utf-8 is reused to mean something that it is not according to the relevant standards rfc, unicode consortium

Instead one has to workaround the problem by a) actually knowing about and b) configuring something like this:

-- snip --
[mysqld]
# switch to 4 byte utf-8 as default
# See: https://mathiasbynens.be/notes/mysql-utf8mb4
init_connect  = "SET NAMES utf8mb4"
collation_server = utf8mb4_unicode_ci
character_set_server = utf8mb4
 
[client]
default-character-set = utf8mb4
 
[mysql]
default-character-set = utf8mb4
-- snap --

What do I expect as a user: If I configure mariadb to use utf8, it expect it to do so, and not some custom variant subset of unicode that automatically discards data my users entered as utf8 trusting it to be handled correctly.

Options to handle this:
a) Just actually use unicode if unicode is specified - as the utf8mb3 encoding is fully binary compatible with utf8mb4 (actual utf8) this would just work. People who need utf8mb3 can configure their system and likely know what it is anyway if the actually require it.
b) If the name of the encoding cannot be just redefined, a lengthy deprecation cycle can be started right now. Deprecate all the utf8 encodings which don't specify if they should be utf8mb3 or utf8mb4 and emit warnings for the next several years (I'd think something like 4 or whatever floats your boat) to fully specify the actual wanted encoding. Then after that make the utf8 name invalid for the same amount to get it out of configurations. Only then (now we're 8 years into the future) can the name utf8 be reenabled as a new encoding that actually is utf8 and doesn't surprise users.

Well a) and b) are certainly some extremes, so I suspect that you will choose something in between, but right now the situation is very bad as nobody who configures utf8 or utf8-ci actually suspects that they're not getting what they want, so just switching utf8 to mean utf8mb4 will - I suspect - do the right thing for a lot of people. So I would like to suggest choosing an option that is closer to a) than to b).



 Comments   
Comment by Robert Buchholz [ 2014-11-18 ]

I fully agree: The fact that MySQL had a broken utf8 implementation way back when and now we are able to use a proper utf8 store does not mean we have to stick to using an incomplete default with a wrong name forever.

Comment by Alexander Barkov [ 2014-11-18 ]

This is a reasonable request. But I'm afraid we cannot do anything with this unless
Oracle makes the suggested change first. We'll keep maintaining backward compatibility.
Consider posting this feature request on the MySQL site.

Comment by Alexander Barkov [ 2014-11-18 ]

The only thing that can be done now is to make the things
like mysqldump, SHOW CREATE TABLE, etc, print "utf8mb3" instead of "utf8".

Both MySQL and MariadDB understand character set name "utf8mb3" as an alias for "utf8",
and collation names "utf8mb3_xxx_ci" as an alias for "utf8_xxx_ci".

So this should be safe, and should work as a start of the lengthy deprecation cycle.

Comment by Alexander Barkov [ 2014-11-18 ]

Btw, which non-BMP characters do you need for your application?

Comment by Martin Häcker [ 2014-11-18 ]

non BMP: We noticed this first with Smileys like: �� and ������ - but I would like to say that the real problem is that there is an automatic mismatch between every system that interfaces to the database for persistence and trusts it to accept utf8 because it was configured to accept utf8.

Comment by Robert Buchholz [ 2014-11-18 ]

While changing utf8 to utf8mb3 in mysqldump is certainly correct and make the current situation more explicit because users could actually see their export/reimport is utf8m3. However, it will also mean that these users have changed to an explicit utf8m3 after an export/import cycle and will never see any deprecation warning if it came so far.

Comment by Martin Häcker [ 2014-12-09 ]

Re: bug report with oracle

I have to say that I'm not going to deal with oracle, thats simply something that I refuse to have the time in my life for. On the other hand, I assume that you guys (mariadb developers) have a very good connection to them.

So why do you not bring this up on the shared mailing list that you have somewhere?

Comment by Martin Häcker [ 2015-04-28 ]

I love it how inconsistent behavior like this so often also have security implications. See this for example: https://cedricvb.be/post/wordpress-stored-xss-vulnerability-4-1-2/

Since MySQL and MariaDB treat utf-8 as something meaning three byte encoding instead of four byte, many people have this configured wrong, but assume that it means the same thing as when they say encode as 'utf-8' in the fronted.

And that leads to subtle security bugs. Here's hope that this raises the importance of this issue.

Comment by Alexander Barkov [ 2015-05-05 ]

The problem mentioned in https://cedricvb.be/post/wordpress-stored-xss-vulnerability-4-1-2/
is fixed in 5.5.43. See MDEV-7649 for details.

There will be more fixes in 10.x under terms of MDEV-8036.

Comment by Alexander Barkov [ 2015-06-18 ]

See also MDEV-8334 Rename utf8 to utf8mb3

Comment by Martin Häcker [ 2015-06-18 ]

To stay in the style of this bug: ��

Comment by Robert Buchholz [ 2018-02-14 ]

Seems like these characters broke somewhere in Jira over the years. I remember seeing actual emojis in this comment way back when.

Comment by Martin Häcker [ 2018-02-15 ]

Are you guys by any chance running on a MariaDB backend?

:-P Sorry, couldn't resist...

Comment by Sergei Golubchik [ 2018-08-15 ]

Yes, of course: https://mariadb.org/eating-dog-food-running-jira-mariadb/

Comment by Vladislav Vaintroub [ 2021-11-10 ]

I think we have to stick with the wrong at least as long as MySQL sticks to the wrong name. As such, I believe it is a good sounding on paper, yet unreasonable request. The wrong decision of using UTF8 for something it is not was done decades ago, so by now everyone knows that utf8 on MySQL, and by extension MariaDB, is 3 byte long. If someone disbelieves it, it is quite easy to google, and look up in the Wikipedia.

The series of renames will do more damage. Even the first step, renaming utf8 to utf8mb3 had the implication of breaking a popular connector, thus I think we best leave it at that, and patiently wait until upstream, that is until MySQL changes the meaning of utf8. If we don't wait, we'll break more 3rd party software than we can ever imagine. Especially given that our idea of testing does not extend beyond testing current MariaDB against our own current Connector/C.

Comment by Martin Häcker [ 2021-11-10 ]

@Vladislav Vaintroub: I would like to strongly disagree. Almost nobody I know who works with mysql knows this issue - and even all my coworkers (who regularly work with mysql every day) forget the issue and their understanding of it soon after I retell them the problem.

This issue is a big hunking footgun that has bitten us (and many others) repeatedly over the years. That is the real reason why this should be changed. The fact that you find so many google results for this is just more indication that this has bitten other developers even more often.

In light of this, I think that the changes the developers proposed are quite sane and allow a quite safe migration.

Comment by Vladislav Vaintroub [ 2021-11-10 ]

dwt, how do you know it is safe? It is caused quite a few compatibility problems already, when it did not even start.
But since you're not using Connector/NET, you do not know anything about that. I don't think we can make a small group of people happy, at the expense of larger community whose workflow is broken. We can't unilaterally change things that break a third party connectors. And when we change things that have a slightest possibility to affect them, we need to test extensively, which we don't. MariaDB does not exist in vacuum, there is a large ecosystem of drivers and applications, some of which are developed by others, including our competitors. We can't do everything by ourselves. Such is life.

Do you think the situation would improve, if we declare that utf8mb4 is called utf8. It won't. That same connector, that is now broken, will "work", but won't be able to read utf8 data outside of BMP, that it stored previously. Also, you would need to explain to your colleagues, that everything that they read on the web, is irrelevant, for MariaDB, utf8 is utf8, but for MySQL it is not, I don't think this clears the confusion.

Anyway, when choosing compatibility over beauty, compatibility wins. "In Oracle Database, the UTF8 character set uses CESU-8 encoding, and is deprecated. The AL32UTF8 character set uses standards-compliant UTF-8 encoding, and is preferred." (from Wikipedia)

So, I propose to sit and watch what MySQL is doing. If they change the meaning of utf8, we'll have to do the same. Otherwise, we should stick to what we have now. My opinion So -1 on this idea, from me.

Comment by Martin Häcker [ 2021-11-10 ]

@wlad Well I hear your concerns, and extensive testing is always good. But cleaning up old mistakes at some point is what has to be done if you want to live in a brighter future at some point.

The trick is to do it in a way that doesn't immediately break everyones stuff. The recommendation that entering utf8 will actually read and render it as utf8mb3 will at lest allow people to find out that something funky is going on. Adding a switch `--utf8-actually-means-utf8` with a default of NO is a good next step to allow people to step into a sane future where only one sysadmin has to understand this footgun. Flipping this default can then happen as far in the future as needed, and can generate ample warnings in the meantime.

Those are sane steps that allow for a sane transition - and I believe they allow for ample opportunity for compatibility with MySQL.

I have to admin that I wonder a bit though why you are looking at MySQL as upstream. Clearly for quite some time now and - as seen by developments like this - MariaDB is the upstream MySQL will have to catch up to.

Comment by Vladislav Vaintroub [ 2021-11-10 ]

I'm not looking up at MySQL as upstream. MariaDB is more than mysqld (or mariadbd), and C client, and more than all our 4-5 client drivers altogether. It is maybe 20 or something language drivers, not written by us, and then ORM frameworks , not written by us, and then killer-applications like Wordpress, and JIRA, not written by us.

What do we actually test? our Jira maybe, last time I've heard. Atlassian does not test their producs with us, and does not support MariaDB officially, neither the DB, nor the JDBC driver, even after literally hundreds of people asked. I presume, most of people who write language drivers, do not test it with MariaDB, and do "MySQL first". They would not to add another database to their test matrix. ORM, like Hibernate does not test with MariaDB, last time I've heard. Oracle MySQL developers, who produce popular drivers, JDBC, .NET , Python, ODBC, and some are also PHP committers, of course do "MySQL first", and are not likely to test with us ever.

It is therefore us who should be responsible for not breaking stuff, that works with MySQL. We're not a binary drop-in replacement anymore, and this is fine. But if we screw drivers, or ORMs, or killer-applications, I think it is as bad as one can imagine. MariaDB does not have its own, isolated ecosystem. Not even Oracle MySQL has, even if they have a larger headcount. Thus have to be extra nice to 3rd party developers, who ensure that the server can be accessed from any language, and other developers, who integrate us into ORMs, and yet other people who write "killer apps".

Comment by Martin Häcker [ 2021-11-11 ]

@wlad I think your heart is in the right spot. Maybe the only thing I would just like to ask you to also consider, is that the users that are bitten by their misunderstanding of what utf8 really means are also all the users of the database and treating them extra nice is just as much a concern as the developers of the hugely important drivers, frameworks and apps that build on top of databases.

That being said: Rest assured that in my social circle development has ben MariaDB first for many years now. So that exists too.

Regarding the open source projects that don't run their test harnesses with MariaDB: I am not a Hibernate user, but having persistent Bug-Reports that they should add a test harness with MariaDB seems like a good idea and like something that is missing? (Sorry, as I am not a Hibernate Developer I may have searched in the wrong place for such bug reports, but I couldn't find anything). Do you have a tracking issue here that coordinates adding test harnesses to open source projects and directs developers who would like to help that effort to those places / issues?

Comment by Vladislav Vaintroub [ 2021-11-11 ]

There is nobody who coordinates pushing OSS projects into testing MariaDB direction, as there is nobody within Oracle MySQL, either. I was wrong about Hibernate btw, after checking they do test now with MariaDB
https://github.com/hibernate/hibernate-orm/runs/4141189548?check_suite_focus=true. But it would be naïve to think, that everyone in OSS, including Oracle/MySQL shall now start testing our unreleased stuff.

The burden of this testing is clearly on us. And we should get a lot more sensitive about any visible changes (i.e whatever we send to clients over the network). We, that is, the server developers, and the server QA. The server developers should, I believe be guided by Robustness principle , which tells to be conservative in what you send, and the QA should ensure that this principle holds.

This happened repeatedly here in the past, that new releases broke protocol. An example would be MDEV-24121, where we broke simultaneously PHP, Oracle/MySQL Python, Oracle/MySQL JDBC, and our very own ODBC.
Sometimes, we do admit mistakes, fix stuff(even when we think, there is a bug in the all connectors above), and issue re-releases, like it happened with MDEV-24121.

Sometimes we don't, like it happened now with 10.6. We now sit there, waiting for mercy of .NET developers working for Oracle/MySQL, to fix their software to anticipate MariaDB server's change in behavior. We now show "utf8mb3" in SHOW COLLATION, which they are not prepared to handle, and we declare that would be their bug, which I'd think does not make sense. For us, it was very important to have utf8mb3 in the output of SHOW collation, worthy breaking the above mentioned robustness principle, a great achievement, exemplified by "blocker" status of MDEV-8334.

There is MDEV-15148, which I created back in 2018, hoping for our QA to correctly prioritize. This did no happen, unfortunately, to this day, on whatever reasons.

And unless we test, we should stop doing stuff like MDEV-8334. I do not know anyone, who would need that change. It is cosmetics, which hurt, and it was many hours of essentially wasted developer's time, which could be put into doing something more productive than breaking unknown number of dependencies.

Comment by Sergei Golubchik [ 2022-11-19 ]

This (the original problem from the bug description) will be solved in MDEV-30041

Generated at Thu Feb 08 07:17:14 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.