[MDEV-11777] REGEXP_REPLACE converts utf8mb4 supplementary characters to '?' Created: 2017-01-11  Updated: 2024-01-09  Resolved: 2024-01-09

Status: Closed
Project: MariaDB Server
Component/s: Character Sets
Affects Version/s: 10.1, 10.2, 10.3, 10.4, 10.5, 10.6
Fix Version/s: 10.6.17, 10.11.7, 11.0.5, 11.1.4, 11.2.3

Type: Bug Priority: Major
Reporter: Pim Broekhof Assignee: Sergei Golubchik
Resolution: Fixed Votes: 3
Labels: None

Issue Links:
Relates
relates to MDEV-32904 smiley emoji (F09F9883) valid in utf8... Closed

 Description   

The REGEXP_REPLACE function converts supplementary characters (4 byte utf8 encoding) into "?" when the charset is utf8mb4.

Due to JIRA not allowing supplementary utf8 characters in the issue description, I have used the CAST/UNHEX of the smiley character, but the actual value should be _utf8mb4'<emoji>'.

SELECT REGEXP_REPLACE(CAST(UNHEX('F09F9881') AS CHAR CHARACTER SET 'utf8mb4'), _utf8mb4'a', _utf8mb4'b') AS Text;

Expected output: "<emoji>"
Actual output: "?"



 Comments   
Comment by Simone [ 2018-10-12 ]

This issue means you can't use REGEXP_REPLACE if you're working with utf8mb4 and emojis. It's disheartening that it still hasn't been resolved after more than a year.
Still an issue in 10.3.10

Comment by sysdljr [ 2019-04-04 ]

we also encounter same problem, a table charset utf8mb4 and contain emoj , after execute regexp_replace , emoj changed to ? , later restore the backup for normal data.

Comment by Alice Sherepa [ 2021-10-26 ]

MariaDB [test]> SELECT REGEXP_REPLACE('�11�',1,2);
+----------------------------+
| REGEXP_REPLACE('?11?',1,2) |
+----------------------------+
| ?22?                       |
+----------------------------+
1 row in set (0.000 sec)
 
MariaDB [test]> SELECT REGEXP_REPLACE('�11�','�',2);
ERROR 1139 (42000): Regex error 'quantifier does not follow a repeatable item at offset 0'
 
MariaDB [test]> SELECT REGEXP_REPLACE('�11�',UNHEX('F09F98BA'),2);
+--------------------------------------------+
| REGEXP_REPLACE('?11?',UNHEX('F09F98BA'),2) |
+--------------------------------------------+
| �112                                        |
+--------------------------------------------+
1 row in set (0.000 sec)

Comment by 方觉 (Fang Jue) [ 2023-12-14 ]

I came across this bug and was surprised to find it here filed nearly 7 years ago. I did some research of MariaDB's source code. In sql/item_cmpfunc.h and item_cmpfunc.cc, m_library_charset is initialized to hard-coded my_charset_utf8mb3_general_ci by default. There is no way to configure other charset and only my_charset_bin can override it. In Item_func_regexp_replace::fix_length_and_dec function of sql/item_strfunc.cc, the regexp engine is initialized with collation.collation. Although I cannot understand this function, it seems promising to specify a binary charset. After some testing, I found out a workaround for this bug under some circumstances:

SELECT CONVERT(REGEXP_REPLACE(BINARY '��', '', '') USING utf8mb4);

First convert the source string to BINARY. After calling REGEXP_REPLACE, convert the result back to the desired charset. This works pretty well if your regexp only involves plain old ASCII characters (My specific use case is to filter out formatting code from rich text messages, which only matches ASCII characters). It DOES NOT work as expected if characters > U+007F are used in the regexp or the regexp relies on character counts.

MariaDB [test]> SELECT CONVERT(REGEXP_REPLACE(BINARY '��', '', '') USING utf8mb4);
+-----------------------------------------------------------+
| CONVERT(REGEXP_REPLACE(BINARY '?', '', '') USING utf8mb4) |
+-----------------------------------------------------------+
| ��                                                         |
+-----------------------------------------------------------+
1 row in set (0.000 sec)

Generated at Thu Feb 08 07:52:34 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.