[MDEV-8765] mysqldump silently corrupts 4-byte UTF-8 data Created: 2015-09-07  Updated: 2018-10-12  Resolved: 2018-10-12

Status: Closed
Project: MariaDB Server
Component/s: Character Sets, Scripts & Clients
Affects Version/s: 10.0.20
Fix Version/s: 10.3.11, 10.4.0

Type: Bug Priority: Critical
Reporter: Daniël van Eeden Assignee: Alexander Barkov
Resolution: Fixed Votes: 0
Labels: beginner-friendly, contribution, foundation, upstream-fixed

Issue Links:
Relates
relates to MDEV-7128 Configuring charsets or collations as... Closed
relates to MDEV-8334 Rename utf8 to utf8mb3 Closed

 Description   

Bug for Oracle MySQL: https://bugs.mysql.com/bug.php?id=71746

But this also affects MariaDB 10.0:

[dvaneeden@dve-mac msb_ma10_0_20]$ ./my sqldump --skip-extended-insert unicodedata | grep DOLPHIN
INSERT INTO `ucd` VALUES ('1F42C','?','DOLPHIN','So','0','ON','','','','','N','','','','','');
[dvaneeden@dve-mac msb_ma10_0_20]$ ./my sqldump --skip-extended-insert --default-character-set=utf8mb4 unicodedata | grep DOLPHIN
INSERT INTO `ucd` VALUES ('1F42C','🐬','DOLPHIN','So','0','ON','','','','','N','','','','','');
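
For illustration only (not part of the original report): the `?` in the first dump is consistent with U+1F42C DOLPHIN needing four bytes in UTF-8, one more than a 3-byte character set can hold. The Python sketch below models that limit. The helper names are hypothetical, and replacing with `?` is an assumption based on the output shown above, not a description of the server's actual conversion code.

```python
# Illustrative model only: treats any character whose UTF-8 encoding is
# longer than three bytes as unrepresentable and substitutes '?'.
DOLPHIN = "\U0001F42C"  # U+1F42C DOLPHIN, the character from the report


def utf8_len(ch: str) -> int:
    """Number of bytes the character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def to_3byte_charset(text: str) -> str:
    """Hypothetical helper: mimic conversion to a 3-byte-max UTF-8 charset."""
    return "".join(ch if utf8_len(ch) <= 3 else "?" for ch in text)


if __name__ == "__main__":
    print(DOLPHIN.encode("utf-8").hex())       # f09f90ac
    print(utf8_len(DOLPHIN))                   # 4
    print(to_3byte_charset("a\u20ac" + DOLPHIN))  # 3-byte euro sign survives, dolphin does not
```

Characters of up to three bytes (such as the euro sign) pass through unchanged, which may be why this kind of corruption goes unnoticed until supplementary-plane characters appear in the data.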



 Comments   
Comment by Daniel Black [ 2018-01-01 ]

upstream fixed as per ebaff9fffc958030a57d8ea7f1f2d527cac1df64

MariaDB needs to change the MYSQL_UNIVERSAL_CLIENT_CHARSET #define in include/my_global.h to utf8mb4.
mysqldump is the only place this is used.

Really trivial fix to prevent backup corruption, even if utf8mb4 isn't the default.

Comment by Sergey Vojtovich [ 2018-01-23 ]

Raised priority as there's a pull request now.

Comment by Rutuja Surve (Inactive) [ 2018-04-03 ]

@Sergey, could you add the link to the pull request here?

Comment by Sergey Vojtovich [ 2018-04-03 ]

Rutuja, there's a link on the right side under the "Development" section:
https://github.com/MariaDB/server/pull/547

Comment by Teodor Mircea Ionita (Inactive) [ 2018-04-20 ]

Hi, I confirm this on both 5.5 and 10.3 using the UTF dataset available at:
https://github.com/dveeden/mysqlunicodedata

The fix in the associated PR#547 does fix the dump issue: instead of garbage '?', mysqldump exports the proper UTF symbols after patching, without the need for an explicit --default-character-set.

As for the actual fix in the PR, at least the mysqldump* tests need adjusting; however, I can't speak to the overall implications of switching MYSQL_UNIVERSAL_CLIENT_CHARSET to utf8mb4 for the entire suite. Someone better suited should evaluate that. Thank you!

Comment by Alexander Barkov [ 2018-10-12 ]

This issue is critical for the JSON data type, which is an alias to longtext CHARACTER SET utf8mb4 COLLATE utf8mb4_bin.
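
As a rough illustration of why JSON columns are exposed: JSON text may contain any Unicode character literally, including ones above U+FFFF, which take four bytes in UTF-8. The Python sketch below uses a hypothetical helper (not part of MariaDB or the patch) to list such characters in a serialized document.

```python
import json


def chars_needing_4_bytes(text: str) -> list:
    """Return characters outside the BMP (code point > U+FFFF); each takes 4 bytes in UTF-8."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


if __name__ == "__main__":
    # ensure_ascii=False keeps the dolphin as a literal character instead of \u escapes.
    doc = json.dumps({"name": "dolphin", "symbol": "\U0001F42C"}, ensure_ascii=False)
    print(doc)
    print([f"U+{ord(ch):04X}" for ch in chars_needing_4_bytes(doc)])
```

A document for which this list is non-empty is one that a dump limited to 3-byte characters could not carry intact.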

Generated at Thu Feb 08 07:29:36 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.