[MDEV-11777] REGEXP_REPLACE converts utf8mb4 supplementary characters to '?' - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Major
Resolution: Fixed
Affects Version/s: 10.1(EOL), 10.2(EOL), 10.3(EOL), 10.4(EOL), 10.5, 10.6
Fix Version/s: 10.6.17, 10.11.7, 11.0.5, 11.1.4, 11.2.3
Component/s: Character Sets
Labels:
None

Description

The REGEXP_REPLACE function converts supplementary characters (4 byte utf8 encoding) into "?" when the charset is utf8mb4.

Due to JIRA not allowing supplementary utf8 characters in the issue description, I have used the CAST/UNHEX of the smiley character, but the actual value should be _utf8mb4'<emoji>'.

SELECT REGEXP_REPLACE(CAST(UNHEX('F09F9881') AS CHAR CHARACTER SET 'utf8mb4'), _utf8mb4'a', _utf8mb4'b') AS Text;

Expected output: "<emoji>"
Actual output: "?"

Attachments

Issue Links

relates to

MDEV-32904 smiley emoji (F09F9883) valid in utf8 but not utf8mb4

Closed

MDEV-34012 Regular expression queries containing 4-byte multicharacters result in an error.

Closed

Activity

Ascending order - Click to sort in descending order

Simone added a comment - 2018-10-12 14:11 - edited

This issue means you can't use REGEXP_REPLACE if you're working with utf8mb4 and emojis. It's disheartening that it still hasn't been resolved after more than a year.
Still an issue in 10.3.10

Simone added a comment - 2018-10-12 14:11 - edited This issue means you can't use REGEXP_REPLACE if you're working with utf8mb4 and emojis. It's disheartening that it still hasn't been resolved after more than a year. Still an issue in 10.3.10

sysdljr added a comment - 2019-04-04 08:59

we also encounter same problem, a table charset utf8mb4 and contain emoj , after execute regexp_replace , emoj changed to ? , later restore the backup for normal data.

sysdljr added a comment - 2019-04-04 08:59 we also encounter same problem, a table charset utf8mb4 and contain emoj , after execute regexp_replace , emoj changed to ? , later restore the backup for normal data.

Alice Sherepa added a comment - 2021-10-26 14:39

MariaDB [test]> SELECT REGEXP_REPLACE('�11�',1,2);

+----------------------------+

| REGEXP_REPLACE('?11?',1,2) |

+----------------------------+

| ?22?                       |

+----------------------------+

1 row in set (0.000 sec)

MariaDB [test]> SELECT REGEXP_REPLACE('�11�','�',2);

ERROR 1139 (42000): Regex error 'quantifier does not follow a repeatable item at offset 0'

MariaDB [test]> SELECT REGEXP_REPLACE('�11�',UNHEX('F09F98BA'),2);

+--------------------------------------------+

| REGEXP_REPLACE('?11?',UNHEX('F09F98BA'),2) |

+--------------------------------------------+

| �112                                        |

+--------------------------------------------+

1 row in set (0.000 sec)

Alice Sherepa added a comment - 2021-10-26 14:39 MariaDB [test]> SELECT REGEXP_REPLACE('�11�',1,2); +----------------------------+ | REGEXP_REPLACE('?11?',1,2) | +----------------------------+ | ?22? | +----------------------------+ 1 row in set (0.000 sec) MariaDB [test]> SELECT REGEXP_REPLACE('�11�','�',2); ERROR 1139 (42000): Regex error 'quantifier does not follow a repeatable item at offset 0' MariaDB [test]> SELECT REGEXP_REPLACE('�11�',UNHEX('F09F98BA'),2); +--------------------------------------------+ | REGEXP_REPLACE('?11?',UNHEX('F09F98BA'),2) | +--------------------------------------------+ | �112 | +--------------------------------------------+ 1 row in set (0.000 sec)

方觉 (Fang Jue) added a comment - 2023-12-14 06:56

I came across this bug and was surprised to find it here filed nearly 7 years ago. I did some research of MariaDB's source code. In sql/item_cmpfunc.h and item_cmpfunc.cc, m_library_charset is initialized to hard-coded my_charset_utf8mb3_general_ci by default. There is no way to configure other charset and only my_charset_bin can override it. In Item_func_regexp_replace::fix_length_and_dec function of sql/item_strfunc.cc, the regexp engine is initialized with collation.collation. Although I cannot understand this function, it seems promising to specify a binary charset. After some testing, I found out a workaround for this bug under some circumstances:

SELECT CONVERT(REGEXP_REPLACE(BINARY '😊', '', '') USING utf8mb4);

First convert the source string to BINARY. After calling REGEXP_REPLACE, convert the result back to the desired charset. This works pretty well if your regexp only involves plain old ASCII characters (My specific use case is to filter out formatting code from rich text messages, which only matches ASCII characters). It DOES NOT work as expected if characters > U+007F are used in the regexp or the regexp relies on character counts.

MariaDB [test]> SELECT CONVERT(REGEXP_REPLACE(BINARY '😊', '', '') USING utf8mb4);

+-----------------------------------------------------------+

| CONVERT(REGEXP_REPLACE(BINARY '?', '', '') USING utf8mb4) |

+-----------------------------------------------------------+

| 😊                                                         |

+-----------------------------------------------------------+

1 row in set (0.000 sec)

方觉 (Fang Jue) added a comment - 2023-12-14 06:56 I came across this bug and was surprised to find it here filed nearly 7 years ago. I did some research of MariaDB's source code. In sql/item_cmpfunc.h and item_cmpfunc.cc, m_library_charset is initialized to hard-coded my_charset_utf8mb3_general_ci by default. There is no way to configure other charset and only my_charset_bin can override it. In Item_func_regexp_replace::fix_length_and_dec function of sql/item_strfunc.cc, the regexp engine is initialized with collation.collation. Although I cannot understand this function, it seems promising to specify a binary charset. After some testing, I found out a workaround for this bug under some circumstances: SELECT CONVERT (REGEXP_REPLACE( BINARY '😊' , '' , '' ) USING utf8mb4); First convert the source string to BINARY. After calling REGEXP_REPLACE, convert the result back to the desired charset. This works pretty well if your regexp only involves plain old ASCII characters (My specific use case is to filter out formatting code from rich text messages, which only matches ASCII characters). It DOES NOT work as expected if characters > U+007F are used in the regexp or the regexp relies on character counts. MariaDB [test]> SELECT CONVERT(REGEXP_REPLACE(BINARY '😊', '', '') USING utf8mb4); +-----------------------------------------------------------+ | CONVERT(REGEXP_REPLACE(BINARY '?', '', '') USING utf8mb4) | +-----------------------------------------------------------+ | 😊 | +-----------------------------------------------------------+ 1 row in set (0.000 sec)

People

Assignee:: Sergei Golubchik

Reporter:: Pim Broekhof

Votes:: 3 Vote for this issue

Watchers:: 10 Start watching this issue

Dates

Created:: 2017-01-11 10:59

Updated:: 2024-04-27 10:36

Resolved:: 2024-01-09 20:15

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server