[MDEV-32904] smiley emoji (F09F9883) valid in utf8 but not utf8mb4 Created: 2023-11-28  Updated: 2023-12-02  Resolved: 2023-11-29

Status: Closed
Project: MariaDB Server
Component/s: Character Sets
Affects Version/s: 10.2.44, 10.4.32, 10.6.16
Fix Version/s: N/A

Type: Bug
Priority: Major
Reporter: Daniel Black
Assignee: Alexander Barkov
Resolution: Not a Bug
Votes: 0
Labels: None

Issue Links:
Relates
relates to MDEV-27490 Allow full utf8mb4 for identifiers Stalled
relates to MDEV-11777 REGEXP_REPLACE converts utf8mb4 suppl... Closed

 Description   

Our smiley emoji turns into a question mark under utf8mb4 but is fine under utf8 (utf8mb3).

MariaDB [(none)]> select hex('😃'),@@character_set_client;
+----------+------------------------+
| hex('?') | @@character_set_client |
+----------+------------------------+
| 3F       | utf8mb4                |
+----------+------------------------+
1 row in set (0.000 sec)
 
MariaDB [(none)]> set character_set_client='utf8';
Query OK, 0 rows affected (0.000 sec)
 
MariaDB [(none)]> select hex('😃'),@@character_set_client;
+-------------------------+------------------------+
| hex('\xF0\x9F\x98\x83') | @@character_set_client |
+-------------------------+------------------------+
| F09F9883                | utf8                   |
+-------------------------+------------------------+



 Comments   
Comment by Alexander Barkov [ 2023-11-29 ]

Can you please paste the output of:

SHOW VARIABLES LIKE 'character_set%';

I think there's something wrong with @@character_set_connection.

Comment by Alexander Barkov [ 2023-11-29 ]

It's better to use SET NAMES utf8mb4 instead of setting @@character_set_{client|connection|results} directly and separately from each other.

MariaDB [test]> SET NAMES utf8mb4; SELECT hex('😃'),@@character_set_client,@@character_set_connection;
Query OK, 0 rows affected (0.000 sec)
 
+----------+------------------------+----------------------------+
| hex('?') | @@character_set_client | @@character_set_connection |
+----------+------------------------+----------------------------+
| F09F9883 | utf8mb4                | utf8mb4                    |
+----------+------------------------+----------------------------+

Looks like this issue should be closed as Not a Bug.

Comment by Daniel Black [ 2023-11-29 ]

MariaDB [(none)]> \s
--------------
client/mariadb  Ver 15.1 Distrib 10.4.33-MariaDB, for Linux (x86_64) using  EditLine wrapper
 
Connection id:		8
Current database:	
Current user:		dan@localhost
SSL:			Not in use
Current pager:		stdout
Using outfile:		''
Using delimiter:	;
Server:			MariaDB
Server version:		10.4.33-MariaDB Source distribution
Protocol version:	10
Connection:		Localhost via UNIX socket
Server characterset:	latin1
Db     characterset:	latin1
Client characterset:	utf8
Conn.  characterset:	utf8
UNIX socket:		/tmp/build-mariadb-server-10.4.sock
Uptime:			6 sec
 
Threads: 6  Questions: 6  Slow queries: 0  Opens: 17  Flush tables: 1  Open tables: 10  Queries per second avg: 1.000
--------------
 
MariaDB [(none)]> show variables like 'character_set%';
+--------------------------+------------------------------+
| Variable_name            | Value                        |
+--------------------------+------------------------------+
| character_set_client     | utf8                         |
| character_set_connection | utf8                         |
| character_set_database   | latin1                       |
| character_set_filesystem | binary                       |
| character_set_results    | utf8                         |
| character_set_server     | latin1                       |
| character_set_system     | utf8                         |
| character_sets_dir       | /app/mariadb/share/charsets/ |
+--------------------------+------------------------------+
8 rows in set (0.002 sec)
 
MariaDB [(none)]> select hex('😃');
+-------------------------+
| hex('\xF0\x9F\x98\x83') |
+-------------------------+
| F09F9883                |
+-------------------------+
1 row in set (0.000 sec)
 
MariaDB [(none)]> set @@character_set_connection=utf8mb4;
Query OK, 0 rows affected (0.000 sec)
 
MariaDB [(none)]> select hex('😃');
+-------------------------+
| hex('\xF0\x9F\x98\x83') |
+-------------------------+
| 3F3F3F3F                |
+-------------------------+
1 row in set (0.000 sec)
 
MariaDB [(none)]> set names utf8mb4;
Query OK, 0 rows affected (0.000 sec)
 
MariaDB [(none)]> select hex('😃');
+----------+
| hex('?') |
+----------+
| F09F9883 |
+----------+
1 row in set (0.000 sec)
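
The three results above (F09F9883, 3F3F3F3F, and F09F9883 again), together with the single 3F in the description, are consistent with the server converting the literal from character_set_client to character_set_connection and substituting '?' for anything it cannot parse or represent. The Python sketch below is only a toy model of that idea, not MariaDB's conversion code. The names `convert_literal`, `sequence_length`, and `MAX_BYTES` are hypothetical, and the pass-through when both character sets match is an assumption taken from the observed output.

```python
# Toy model only: NOT MariaDB's actual conversion code.
# "utf8" here means the 3-byte utf8mb3 character set.
MAX_BYTES = {"utf8": 3, "utf8mb4": 4}


def sequence_length(lead: int) -> int:
    """Length announced by a UTF-8 lead byte, or 0 if it cannot start a sequence."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # continuation byte or otherwise invalid lead byte


def convert_literal(raw: bytes, client: str, connection: str) -> bytes:
    """Model converting a string literal from the client to the connection charset."""
    if client == connection:
        return raw  # assumption: no conversion, bytes pass through unchanged
    out = bytearray()
    i = 0
    while i < len(raw):
        n = sequence_length(raw[i])
        chunk = raw[i:i + n]
        # The client charset must be able to parse the sequence at all.
        valid = 0 < n <= MAX_BYTES[client] and len(chunk) == n
        if valid:
            try:
                chunk.decode("utf-8")
            except UnicodeDecodeError:
                valid = False
        if not valid:
            out += b"?"  # unparseable byte: one '?' per byte
            i += 1
        elif n > MAX_BYTES[connection]:
            out += b"?"  # parsed, but not representable: one '?' per character
            i += n
        else:
            out += chunk
            i += n
    return bytes(out)


smiley = bytes.fromhex("F09F9883")
for client, connection in [("utf8mb4", "utf8"), ("utf8", "utf8mb4"),
                           ("utf8", "utf8"), ("utf8mb4", "utf8mb4")]:
    print(client, connection, convert_literal(smiley, client, connection).hex().upper())
```

In this model a utf8mb4 client with a utf8 connection yields one '?' (3F), because the character is parsed whole and then cannot be represented. A utf8 client with a utf8mb4 connection yields four (3F3F3F3F), because each of the four bytes is rejected on its own.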

Comment by Alexander Barkov [ 2023-11-29 ]

I can see no bugs. Works with utf8mb4 as expected.

Closed as Not a Bug.

Comment by Daniel Black [ 2023-11-30 ]

So not even a warning is generated for the truncation?

The column title under SET NAMES utf8mb4 is "hex('?')".

With character_set_results=utf8mb4, the column title is displayed in hex-escaped form, as below:

MariaDB [(none)]> set @@character_set_results=utf8mb4;
Query OK, 0 rows affected (0.000 sec)
 
MariaDB [(none)]> SELECT VARIABLE_NAME, SESSION_VALUE    FROM INFORMATION_SCHEMA.SYSTEM_VARIABLES WHERE    VARIABLE_NAME LIKE 'character_set_c%' OR    VARIABLE_NAME LIKE 'character_set_re%' OR    VARIABLE_NAME LIKE 'collation_c%';
+--------------------------+-----------------+
| VARIABLE_NAME            | SESSION_VALUE   |
+--------------------------+-----------------+
| CHARACTER_SET_CONNECTION | utf8            |
| CHARACTER_SET_RESULTS    | utf8mb4         |
| CHARACTER_SET_CLIENT     | utf8            |
| COLLATION_CONNECTION     | utf8_general_ci |
+--------------------------+-----------------+
4 rows in set (0.002 sec)
 
MariaDB [(none)]> select hex('😃');
+-------------------------+
| hex('\xF0\x9F\x98\x83') |
+-------------------------+
| F09F9883                |
+-------------------------+
1 row in set (0.000 sec)

Comment by Alexander Barkov [ 2023-12-02 ]

The warning is not needed. The truncation does not happen in the data. Have a look at the output from my earlier comment:

+----------+------------------------+----------------------------+
| hex('?') | @@character_set_client | @@character_set_connection |
+----------+------------------------+----------------------------+
| F09F9883 | utf8mb4                | utf8mb4                    |
+----------+------------------------+----------------------------+

It correctly returns the 4-byte UTF-8 character 0xF09F9883.

But the truncation does happen in the column title, as identifiers do not support supplementary characters yet. This will be fixed by MDEV-27490.
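
As a quick check of the byte values discussed above, this Python sketch decodes F09F9883 and shows that it is a single code point outside the Basic Multilingual Plane. That is why it needs four bytes in UTF-8 and cannot fit a character set limited to three bytes per character. The variable names are illustrative only.

```python
raw = bytes.fromhex("F09F9883")      # the bytes returned by HEX() in the thread
char = raw.decode("utf-8")           # decodes to exactly one character
code_point = ord(char)
is_supplementary = code_point > 0xFFFF  # beyond the Basic Multilingual Plane

print(f"U+{code_point:04X}", len(raw), is_supplementary)
# prints: U+1F603 4 True
```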
