[MDEV-11777] REGEXP_REPLACE converts utf8mb4 supplementary characters to '?' Created: 2017-01-11 Updated: 2024-01-09 Resolved: 2024-01-09 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Character Sets |
| Affects Version/s: | 10.1, 10.2, 10.3, 10.4, 10.5, 10.6 |
| Fix Version/s: | 10.6.17, 10.11.7, 11.0.5, 11.1.4, 11.2.3 |
| Type: | Bug | Priority: | Major |
| Reporter: | Pim Broekhof | Assignee: | Sergei Golubchik |
| Resolution: | Fixed | Votes: | 3 |
| Labels: | None | ||
| Issue Links: |
|
||||||||
| Description |
|
The REGEXP_REPLACE function converts supplementary characters (4 byte utf8 encoding) into "?" when the charset is utf8mb4. Due to JIRA not allowing supplementary utf8 characters in the issue description, I have used the CAST/UNHEX of the smiley character, but the actual value should be _utf8mb4'<emoji>'.
Expected output: "<emoji>" |
| Comments |
| Comment by Simone [ 2018-10-12 ] | |||||||||||||||||||
|
This issue means you can't use REGEXP_REPLACE if you're working with utf8mb4 and emojis. It's disheartening that it still hasn't been resolved after more than a year. | |||||||||||||||||||
| Comment by sysdljr [ 2019-04-04 ] | |||||||||||||||||||
|
we also encounter same problem, a table charset utf8mb4 and contain emoj , after execute regexp_replace , emoj changed to ? , later restore the backup for normal data. | |||||||||||||||||||
| Comment by Alice Sherepa [ 2021-10-26 ] | |||||||||||||||||||
|
| |||||||||||||||||||
| Comment by 方觉 (Fang Jue) [ 2023-12-14 ] | |||||||||||||||||||
|
I came across this bug and was surprised to find it here filed nearly 7 years ago. I did some research of MariaDB's source code. In sql/item_cmpfunc.h and item_cmpfunc.cc, m_library_charset is initialized to hard-coded my_charset_utf8mb3_general_ci by default. There is no way to configure other charset and only my_charset_bin can override it. In Item_func_regexp_replace::fix_length_and_dec function of sql/item_strfunc.cc, the regexp engine is initialized with collation.collation. Although I cannot understand this function, it seems promising to specify a binary charset. After some testing, I found out a workaround for this bug under some circumstances:
First convert the source string to BINARY. After calling REGEXP_REPLACE, convert the result back to the desired charset. This works pretty well if your regexp only involves plain old ASCII characters (My specific use case is to filter out formatting code from rich text messages, which only matches ASCII characters). It DOES NOT work as expected if characters > U+007F are used in the regexp or the regexp relies on character counts.
|