[MDEV-7128] Configuring charsets or collations as utf8 yields surprising result and leads to data loss - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Major
Resolution: Duplicate
Affects Version/s: 10.0(EOL)
Fix Version/s: N/A
Component/s: Character Sets
Labels: None

Description

Configuring databases and collations to be utf8 and utf8_unicode_ci, respectively, leads to a very surprising result: UTF-8 is NOT actually used. Instead, the server uses a custom variant of UTF-8 (utf8mb3, at most three bytes per character) that cannot represent the full range of Unicode characters.

This of course leads to hard-to-debug problems, since one doesn't even suspect that such a problem exists.

The problem is that the name utf8 is reused to mean something that is not UTF-8 according to the relevant standards (the RFC and the Unicode Consortium).
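To make the difference concrete, here is a minimal Python sketch (not MariaDB code) of which characters fall outside the three-byte variant. It assumes that utf8mb3 can hold only code points up to U+FFFF, which is what a three-byte limit implies; the helper names `utf8_len`, `fits_in_utf8mb3` and `first_non_bmp_index` are hypothetical and exist only for this illustration.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes one character occupies in standard UTF-8."""
    return len(ch.encode("utf-8"))


def fits_in_utf8mb3(text: str) -> bool:
    """True if every character is in the Basic Multilingual Plane.

    Assumption: a 3-byte-per-character limit means only code points
    up to U+FFFF can be stored.
    """
    return all(ord(ch) <= 0xFFFF for ch in text)


def first_non_bmp_index(text: str):
    """Index of the first character needing 4 bytes, or None if there is none."""
    for i, ch in enumerate(text):
        if ord(ch) > 0xFFFF:
            return i
    return None


if __name__ == "__main__":
    # "\u20ac" is the euro sign (3 bytes), "\U0001F600" is an emoji (4 bytes).
    for sample in ["H\u00e4cker", "\u20ac", "ok \U0001F600 then"]:
        print(repr(sample), fits_in_utf8mb3(sample), first_non_bmp_index(sample))
```

Running a check like this over incoming text before it is written to a utf8 (utf8mb3) column would at least make the problem visible instead of silent.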

To work around the problem, one has to a) actually know about it and b) configure something like this:

-- snip --

[mysqld]
# switch to 4 byte utf-8 as default
# See: https://mathiasbynens.be/notes/mysql-utf8mb4
init_connect  = "SET NAMES utf8mb4"
collation_server = utf8mb4_unicode_ci
character_set_server = utf8mb4

[client]
default-character-set = utf8mb4

[mysql]
default-character-set = utf8mb4

-- snap --
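Since a single missed setting silently brings the three-byte variant back, it can help to check a configuration file mechanically. The following is a rough sketch using Python's standard `configparser`, which happens to parse simple option files like the one above; it is not an official tool. The function name `find_charset_problems`, the `EXPECTED` table and the embedded `SAMPLE_CNF` text are assumptions made for this example, and real option files with directives such as `!includedir` would need extra handling.

```python
import configparser

# Copy of the option file from this report, used as sample input.
SAMPLE_CNF = """
[mysqld]
# switch to 4 byte utf-8 as default
init_connect  = "SET NAMES utf8mb4"
collation_server = utf8mb4_unicode_ci
character_set_server = utf8mb4

[client]
default-character-set = utf8mb4

[mysql]
default-character-set = utf8mb4
"""

# Hypothetical table of settings this sketch insists on.
EXPECTED = {
    ("mysqld", "character_set_server"): "utf8mb4",
    ("mysqld", "collation_server"): "utf8mb4_unicode_ci",
    ("client", "default_character_set"): "utf8mb4",
    ("mysql", "default_character_set"): "utf8mb4",
}


def find_charset_problems(cnf_text: str) -> list:
    """Return a list of human-readable mismatches (empty if all is well)."""
    parser = configparser.ConfigParser(
        allow_no_value=True, interpolation=None, strict=False
    )
    # Assumption: '-' and '_' are treated as equivalent in option names,
    # so normalise both spellings to '_'.
    parser.optionxform = lambda name: name.lower().replace("-", "_")
    parser.read_string(cnf_text)

    problems = []
    for (section, key), wanted in EXPECTED.items():
        actual = parser.get(section, key, fallback=None)
        if actual != wanted:
            problems.append(f"[{section}] {key} = {actual!r}, expected {wanted!r}")
    return problems


if __name__ == "__main__":
    for line in find_charset_problems(SAMPLE_CNF) or ["all charset settings OK"]:
        print(line)
```

The check only looks at the file contents; what the running server actually uses still has to be confirmed on a live connection.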

What I expect as a user: if I configure MariaDB to use utf8, I expect it to do so, and not to use some custom subset of Unicode that automatically discards data my users entered as UTF-8, trusting it to be handled correctly.

Options to handle this:
a) Just actually use Unicode if Unicode is specified. As utf8mb3 is a subset of utf8mb4 (actual UTF-8) and its data is byte-for-byte compatible, this would just work. People who need utf8mb3 can configure their system accordingly, and likely know what it is anyway if they actually require it.
b) If the name of the encoding cannot simply be redefined, a lengthy deprecation cycle can be started right now. Deprecate all utf8 encoding names that don't specify whether they mean utf8mb3 or utf8mb4, and emit warnings for the next several years (I'd think something like 4, or whatever floats your boat) asking users to fully specify the encoding they actually want. After that, make the utf8 name invalid for the same amount of time to get it out of configurations. Only then (now we're 8 years into the future) can the name utf8 be re-enabled as a new encoding that actually is UTF-8 and doesn't surprise users.

Well, a) and b) are certainly extremes, so I suspect that you will choose something in between. But right now the situation is very bad, as nobody who configures utf8 or utf8_unicode_ci suspects that they're not getting what they want. Just switching utf8 to mean utf8mb4 will, I suspect, do the right thing for a lot of people, so I would like to suggest choosing an option that is closer to a) than to b).

Attachments

screenshot-1.png
2018-02-14 13:55
24 kB
Robert Buchholz

Issue Links

duplicates

- MDEV-30041 don't set utf8_is_utf8mb3 by default in the old-mode (Open)

is blocked by

- MDEV-8334 Rename utf8 to utf8mb3 (Closed)

relates to

- MDEV-7649 wrong result when comparing utf8 column with an invalid literal (Closed)
- MDEV-8036 Fix all collations to compare broken bytes as "greater than any non-broken character" (Closed)
- MDEV-19123 Change default charset from latin1 to utf8mb4 (Closed)
- MDEV-8765 mysqldump silently corrupts 4-byte UTF-8 data (Closed)

People

Assignee: Alexander Barkov

Reporter: Martin Häcker

Votes: 0

Watchers: 8

Dates

Created: 2014-11-18 13:57

Updated: 2022-11-19 22:03

Resolved: 2022-11-19 22:03