[MDEV-8949] COLUMN_CREATE unicode name breakage Created: 2015-10-15  Updated: 2017-11-14  Resolved: 2017-11-14

Status: Closed
Project: MariaDB Server
Component/s: Character Sets, Dynamic Columns
Affects Version/s: 10.0.21
Fix Version/s: 10.0.34, 10.1.29, 10.2.11, 10.3.3

Type: Bug Priority: Major
Reporter: Adam Johnson Assignee: Oleksandr Byelkin
Resolution: Fixed Votes: 0
Labels: None
Environment:

Seen on both OS X 10.9 and Ubuntu 14.04


Sprint: 10.2.11

 Description   

Possibly related to MDEV-461.

When calling column_create with the connection character set set to utf8
(SET NAMES utf8), one can successfully encode an emoji that takes 4 bytes
in UTF-8:

adamj@localhost [1]> set names utf8;
adamj@localhost [2]> select column_json(column_create('��', 1));
+---------------------------------------+
| column_json(column_create('��', 1))     |
+---------------------------------------+
| {"��":1}                                |
+---------------------------------------+

However, if the connection is set to utf8mb4, this counter-intuitively
fails and the emoji is replaced by '?':

adamj@localhost [3]> set names utf8mb4;
adamj@localhost [4]> select column_json(column_create('��', 1));
+------------------------------------+
| column_json(column_create('?', 1)) |
+------------------------------------+
| {"?":1}                            |
+------------------------------------+
 
adamj@localhost [5]> select column_list(column_create('��', 1));
+------------------------------------+
| column_list(column_create('?', 1)) |
+------------------------------------+
| `?`                                |
+------------------------------------+

Other unicode characters work fine though:

adamj@localhost [6]> select column_list(column_create('❤', 1));
+--------------------------------------+
| column_list(column_create('❤', 1))   |
+--------------------------------------+
| `❤`                                  |
+--------------------------------------+



 Comments   
Comment by Elena Stepanova [ 2015-10-20 ]

Reproducible as described.

There is also a simpler example which does not involve dynamic columns but might have the same root cause (or not?):

MySQL [test]> set names utf8;
Query OK, 0 rows affected (0.00 sec)
 
MySQL [test]> select '��';
+------+
| ��     |
+------+
| ��     |
+------+
1 row in set (0.00 sec)
 
MySQL [test]> set names utf8mb4;
Query OK, 0 rows affected (0.00 sec)
 
MySQL [test]> select '��';
+------+
| ?    |
+------+
| ��     |
+------+
1 row in set (0.00 sec)

Note the difference in the column name.
Reproducible on MySQL as well.

Assigning to bar, who should be able to shed some light on it.

Comment by Adam Johnson [ 2015-10-21 ]

Seems related - emojis becoming ? on utf8mb4 with mysqldump: MDEV-8765

Comment by Alexander Barkov [ 2017-10-07 ]

The problem happens because the column_json-related code in Item_func_dyncol_json and in mysys/ma_dyncol.c uses &my_charset_utf8_general_ci, which supports only Unicode characters in the BMP range U+0000..U+FFFF. Emoji are outside this range. Perhaps it should be fixed to use &my_charset_utf8mb4_general_ci instead, but I'm not sure.
Reassigning to Sanja.
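
As an illustration of the BMP limit described in this comment, here is a minimal Python sketch (not part of the original report, and not MariaDB code). It decodes the byte sequence 0xF09F988E used in the repro script in this issue and compares it with the heart character from the description. The helper name utf8_info is hypothetical and exists only for this example.

```python
# Illustration only: checks whether a character lies in the BMP and how many
# bytes it needs in UTF-8. A charset limited to the BMP (U+0000..U+FFFF)
# cannot represent characters that fail this check.

def utf8_info(ch):
    """Return (code point, UTF-8 byte length, True if inside the BMP)."""
    cp = ord(ch)
    return cp, len(ch.encode("utf-8")), cp <= 0xFFFF

# Byte sequence taken from the repro script in this issue.
emoji = bytes.fromhex("F09F988E").decode("utf-8")
# The heart character that works in the original description.
heart = "\u2764"

for ch in (emoji, heart):
    cp, nbytes, in_bmp = utf8_info(ch)
    print(f"U+{cp:04X}: {nbytes} bytes in UTF-8, in BMP: {in_bmp}")
```

The emoji needs 4 bytes and lies outside the BMP, while the heart needs 3 bytes and lies inside it. This is consistent with the observation that only the emoji name is turned into '?'.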

Comment by Alexander Barkov [ 2017-10-07 ]

It seems Jira does not support non-BMP characters.
This is a modified version of the script demonstrating the same problem:

SET NAMES utf8mb4;
SELECT COLUMN_JSON(COLUMN_CREATE(_utf8mb4 0xF09F988E, 1));

Comment by Oleksandr Byelkin [ 2017-11-13 ]

revision-id: e22c33e3f014ffc4d7c08d6830f710c19f1aff90 (mariadb-10.0.33-17-ge22c33e3f01)
parent(s): c0e10f375ad619d825ef7c21232cf5946bdf5be7
author: Oleksandr Byelkin
committer: Oleksandr Byelkin
timestamp: 2017-11-13 16:30:02 +0100
message:

MDEV-8949: COLUMN_CREATE unicode name breakage

Use utf-mb4 if it is possible.

Comment by Oleksandr Byelkin [ 2017-11-13 ]

github tree: bb-10.0-MDEV-8949

Comment by Alexander Barkov [ 2017-11-14 ]

There is one more character set related problem with COLUMN_LIST() and COLUMN_GET():

DROP TABLE IF EXISTS t1;
CREATE TABLE t1 AS SELECT
  COLUMN_LIST(COLUMN_CREATE('a',1)),
  COLUMN_JSON(COLUMN_CREATE('b',1));
SHOW CREATE TABLE t1;

+-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table                                                                                                                                                                        |
+-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| t1    | CREATE TABLE `t1` (
  `COLUMN_LIST(COLUMN_CREATE('a',1))` longblob DEFAULT NULL,
  `COLUMN_JSON(COLUMN_CREATE('b',1))` longblob DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
+-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Note that these functions create longblob columns.
The expected result would be longtext CHARACTER SET utf8mb4.

Comment by Oleksandr Byelkin [ 2017-11-14 ]

revision-id: 2913f615f050f356f7be178e5d91650b86b33e4e (mariadb-10.0.33-17-g2913f615f05)
parent(s): c0e10f375ad619d825ef7c21232cf5946bdf5be7
author: Oleksandr Byelkin
committer: Oleksandr Byelkin
timestamp: 2017-11-14 10:49:46 +0100
message:

MDEV-8949: COLUMN_CREATE unicode name breakage

Use utf-mb4 if it is possible.

Comment by Alexander Barkov [ 2017-11-14 ]

This patch is OK to push:

revision-id: 2913f615f050f356f7be178e5d91650b86b33e4e (mariadb-10.0.33-17-g2913f615f05)
parent(s): c0e10f375ad619d825ef7c21232cf5946bdf5be7
author: Oleksandr Byelkin
committer: Oleksandr Byelkin
timestamp: 2017-11-14 10:49:46 +0100
message:
