Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-11777

REGEXP_REPLACE converts utf8mb4 supplementary characters to '?'

Details

    Description

      The REGEXP_REPLACE function converts supplementary characters (4 byte utf8 encoding) into "?" when the charset is utf8mb4.

      Due to JIRA not allowing supplementary utf8 characters in the issue description, I have used the CAST/UNHEX of the smiley character, but the actual value should be _utf8mb4'<emoji>'.

      SELECT REGEXP_REPLACE(CAST(UNHEX('F09F9881') AS CHAR CHARACTER SET 'utf8mb4'), _utf8mb4'a', _utf8mb4'b') AS Text;
      

      Expected output: "<emoji>"
      Actual output: "?"

      Attachments

        Issue Links

          Activity

            simo94 Simone added a comment - - edited

            This issue means you can't use REGEXP_REPLACE if you're working with utf8mb4 and emojis. It's disheartening that it still hasn't been resolved after more than a year.
            Still an issue in 10.3.10

            simo94 Simone added a comment - - edited This issue means you can't use REGEXP_REPLACE if you're working with utf8mb4 and emojis. It's disheartening that it still hasn't been resolved after more than a year. Still an issue in 10.3.10
            sysdljr sysdljr added a comment -

            we also encounter same problem, a table charset utf8mb4 and contain emoj , after execute regexp_replace , emoj changed to ? , later restore the backup for normal data.

            sysdljr sysdljr added a comment - we also encounter same problem, a table charset utf8mb4 and contain emoj , after execute regexp_replace , emoj changed to ? , later restore the backup for normal data.
            alice Alice Sherepa added a comment -

            MariaDB [test]> SELECT REGEXP_REPLACE('�11�',1,2);
            +----------------------------+
            | REGEXP_REPLACE('?11?',1,2) |
            +----------------------------+
            | ?22?                       |
            +----------------------------+
            1 row in set (0.000 sec)
             
            MariaDB [test]> SELECT REGEXP_REPLACE('�11�','�',2);
            ERROR 1139 (42000): Regex error 'quantifier does not follow a repeatable item at offset 0'
             
            MariaDB [test]> SELECT REGEXP_REPLACE('�11�',UNHEX('F09F98BA'),2);
            +--------------------------------------------+
            | REGEXP_REPLACE('?11?',UNHEX('F09F98BA'),2) |
            +--------------------------------------------+
            | �112                                        |
            +--------------------------------------------+
            1 row in set (0.000 sec)
            
            

            alice Alice Sherepa added a comment - MariaDB [test]> SELECT REGEXP_REPLACE('�11�',1,2); +----------------------------+ | REGEXP_REPLACE('?11?',1,2) | +----------------------------+ | ?22? | +----------------------------+ 1 row in set (0.000 sec)   MariaDB [test]> SELECT REGEXP_REPLACE('�11�','�',2); ERROR 1139 (42000): Regex error 'quantifier does not follow a repeatable item at offset 0'   MariaDB [test]> SELECT REGEXP_REPLACE('�11�',UNHEX('F09F98BA'),2); +--------------------------------------------+ | REGEXP_REPLACE('?11?',UNHEX('F09F98BA'),2) | +--------------------------------------------+ | �112 | +--------------------------------------------+ 1 row in set (0.000 sec)

            I came across this bug and was surprised to find it here filed nearly 7 years ago. I did some research of MariaDB's source code. In sql/item_cmpfunc.h and item_cmpfunc.cc, m_library_charset is initialized to hard-coded my_charset_utf8mb3_general_ci by default. There is no way to configure other charset and only my_charset_bin can override it. In Item_func_regexp_replace::fix_length_and_dec function of sql/item_strfunc.cc, the regexp engine is initialized with collation.collation. Although I cannot understand this function, it seems promising to specify a binary charset. After some testing, I found out a workaround for this bug under some circumstances:

            SELECT CONVERT(REGEXP_REPLACE(BINARY '😊', '', '') USING utf8mb4);
            

            First convert the source string to BINARY. After calling REGEXP_REPLACE, convert the result back to the desired charset. This works pretty well if your regexp only involves plain old ASCII characters (My specific use case is to filter out formatting code from rich text messages, which only matches ASCII characters). It DOES NOT work as expected if characters > U+007F are used in the regexp or the regexp relies on character counts.

            MariaDB [test]> SELECT CONVERT(REGEXP_REPLACE(BINARY '😊', '', '') USING utf8mb4);
            +-----------------------------------------------------------+
            | CONVERT(REGEXP_REPLACE(BINARY '?', '', '') USING utf8mb4) |
            +-----------------------------------------------------------+
            | 😊                                                         |
            +-----------------------------------------------------------+
            1 row in set (0.000 sec)
            

            fj 方觉 (Fang Jue) added a comment - I came across this bug and was surprised to find it here filed nearly 7 years ago. I did some research of MariaDB's source code. In sql/item_cmpfunc.h and item_cmpfunc.cc, m_library_charset is initialized to hard-coded my_charset_utf8mb3_general_ci by default. There is no way to configure other charset and only my_charset_bin can override it. In Item_func_regexp_replace::fix_length_and_dec function of sql/item_strfunc.cc, the regexp engine is initialized with collation.collation. Although I cannot understand this function, it seems promising to specify a binary charset. After some testing, I found out a workaround for this bug under some circumstances: SELECT CONVERT (REGEXP_REPLACE( BINARY '😊' , '' , '' ) USING utf8mb4); First convert the source string to BINARY. After calling REGEXP_REPLACE, convert the result back to the desired charset. This works pretty well if your regexp only involves plain old ASCII characters (My specific use case is to filter out formatting code from rich text messages, which only matches ASCII characters). It DOES NOT work as expected if characters > U+007F are used in the regexp or the regexp relies on character counts. MariaDB [test]> SELECT CONVERT(REGEXP_REPLACE(BINARY '😊', '', '') USING utf8mb4); +-----------------------------------------------------------+ | CONVERT(REGEXP_REPLACE(BINARY '?', '', '') USING utf8mb4) | +-----------------------------------------------------------+ | 😊 | +-----------------------------------------------------------+ 1 row in set (0.000 sec)

            People

              serg Sergei Golubchik
              pimbroekhof Pim Broekhof
              Votes:
              3 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.