MariaDB Server: MDEV-8949

COLUMN_CREATE unicode name breakage

Details

    • 10.2.11

    Description

      Possibly related to MDEV-461.

      When calling column_create with names set to utf8, one can successfully
      encode an emoji that takes 4 bytes in utf-8:

      adamj@localhost [1]> set names utf8;
      adamj@localhost [2]> select column_json(column_create('��', 1));
      +---------------------------------------+
      | column_json(column_create('��', 1))     |
      +---------------------------------------+
      | {"��":1}                                |
      +---------------------------------------+

      However, if the connection is set to utf8mb4, the same call counter-intuitively
      fails, and the emoji in the name comes back as ?:

      adamj@localhost [3]> set names utf8mb4;
      adamj@localhost [4]> select column_json(column_create('��', 1));
      +------------------------------------+
      | column_json(column_create('?', 1)) |
      +------------------------------------+
      | {"?":1}                            |
      +------------------------------------+
       
      adamj@localhost [5]> select column_list(column_create('��', 1));
      +------------------------------------+
      | column_list(column_create('?', 1)) |
      +------------------------------------+
      | `?`                                |
      +------------------------------------+

      Other Unicode characters work fine, though:

      adamj@localhost [6]> select column_list(column_create('❤', 1));
      +--------------------------------------+
      | column_list(column_create('❤', 1))   |
      +--------------------------------------+
      | `❤`                                  |
      +--------------------------------------+

        Activity

          adamchainz Adam Johnson created issue -
          elenst Elena Stepanova made changes -
          Status Open [ 1 ] Confirmed [ 10101 ]
          elenst Elena Stepanova added a comment - - edited

          Reproducible as described.

          There is also a simpler example which does not involve dynamic columns but might have the same root cause (or not?):

          MySQL [test]> set names utf8;
          Query OK, 0 rows affected (0.00 sec)
           
          MySQL [test]> select '��';
          +------+
          | ��     |
          +------+
          | ��     |
          +------+
          1 row in set (0.00 sec)
           
          MySQL [test]> set names utf8mb4;
          Query OK, 0 rows affected (0.00 sec)
           
          MySQL [test]> select '��';
          +------+
          | ?    |
          +------+
          | ��     |
          +------+
          1 row in set (0.00 sec)

          Note the difference in the column name.
          Reproducible on MySQL as well.

          Assigning to bar who should be able to shed some light on it.

          adamchainz Adam Johnson added a comment -

          Seems related - emojis becoming ? on utf8mb4 with mysqldump: MDEV-8765

          serg Sergei Golubchik made changes -
          Assignee Alexander Barkov [ bar ]
          serg Sergei Golubchik made changes -
          Fix Version/s 10.0 [ 16000 ]
          Fix Version/s 10.1 [ 16100 ]
          Fix Version/s 10.2 [ 14601 ]
          bar Alexander Barkov made changes -
          Assignee Alexander Barkov [ bar ]
          bar Alexander Barkov made changes -
          Assignee Oleksandr Byelkin [ sanja ]

          bar Alexander Barkov added a comment -

          The problem happens because the column_json-related code in Item_func_dyncol_json and in mysys/ma_dyncol.c uses &my_charset_utf8_general_ci, which supports only Unicode characters in the BMP range U+0000..U+FFFF. Emoji are outside this range. Perhaps it should be fixed to use &my_charset_utf8mb4_general_ci instead, but I'm not sure.
          Reassigning to Sanja.
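
A note on the range mentioned in the comment above: the following is a minimal Python sketch, not server code. It assumes U+1F60E (UTF-8 bytes F0 9F 98 8E) as a stand-in for the emoji in the report, since the original character did not survive on this page. The helper name describe is hypothetical. It reports whether a character lies in the BMP and how many UTF-8 bytes it needs.

```python
def describe(ch: str) -> dict:
    """Report code point, UTF-8 byte length and BMP membership of one character."""
    cp = ord(ch)
    return {
        "code_point": f"U+{cp:04X}",
        "utf8_bytes": len(ch.encode("utf-8")),
        "in_bmp": cp <= 0xFFFF,  # BMP is U+0000..U+FFFF
    }

# Assumed stand-in emoji (outside the BMP) and the heart from the description.
emoji = describe("\U0001F60E")
heart = describe("\u2764")

print(emoji)
print(heart)
```

The heart needs three bytes and lies inside the BMP, so it fits the utf8 character set named in the comment. The stand-in emoji needs four bytes and lies outside the BMP, so it needs utf8mb4.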
          bar Alexander Barkov added a comment - - edited

          It seems Jira does not support non-BMP characters.
          This is a modified version of the script demonstrating the same problem:

          SET NAMES utf8mb4;
          SELECT COLUMN_JSON(COLUMN_CREATE(_utf8mb4 0xF09F988E, 1));
          

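
To see which character the hex literal in the script above denotes, the bytes can be decoded outside the server. The Python sketch below is illustrative only. bmp_only is a hypothetical helper that imitates the observed ? substitution for characters beyond U+FFFF. It is not how MariaDB implements the conversion.

```python
# Bytes taken from the _utf8mb4 0xF09F988E literal in the script above.
raw = bytes.fromhex("F09F988E")
text = raw.decode("utf-8")  # decodes to a single character
code_point = ord(text)

def bmp_only(s: str) -> str:
    """Hypothetical helper: replace every character beyond U+FFFF with '?'."""
    return "".join(c if ord(c) <= 0xFFFF else "?" for c in s)

print(f"U+{code_point:04X}: {len(raw)} bytes, {len(text)} character")
print(bmp_only(text))                  # the emoji is replaced
print(bmp_only("\u2764") == "\u2764")  # a BMP character passes through unchanged
```

Under these assumptions the literal decodes to one character outside the BMP. That matches the symptom in the report: the emoji turns into ? while the heart is preserved.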
          sanja Oleksandr Byelkin made changes -
          Status Confirmed [ 10101 ] In Progress [ 3 ]
          sanja Oleksandr Byelkin made changes -
          Status In Progress [ 3 ] Stalled [ 10000 ]
          sanja Oleksandr Byelkin made changes -
          Sprint 10.2.11 [ 203 ]
          sanja Oleksandr Byelkin made changes -
          Status Stalled [ 10000 ] In Progress [ 3 ]

          sanja Oleksandr Byelkin added a comment -

          revision-id: e22c33e3f014ffc4d7c08d6830f710c19f1aff90 (mariadb-10.0.33-17-ge22c33e3f01)
          parent(s): c0e10f375ad619d825ef7c21232cf5946bdf5be7
          author: Oleksandr Byelkin
          committer: Oleksandr Byelkin
          timestamp: 2017-11-13 16:30:02 +0100
          message:

          MDEV-8949: COLUMN_CREATE unicode name breakage

          Use utf-mb4 if it is possible.
          sanja Oleksandr Byelkin made changes -
          Status In Progress [ 3 ] Stalled [ 10000 ]

          sanja Oleksandr Byelkin added a comment -

          github tree: bb-10.0-MDEV-8949
          sanja Oleksandr Byelkin made changes -
          Assignee Oleksandr Byelkin [ sanja ] Alexander Barkov [ bar ]
          Status Stalled [ 10000 ] In Review [ 10002 ]

          bar Alexander Barkov added a comment -

          There is one more character set related problem with COLUMN_LIST() and COLUMN_GET():

          DROP TABLE IF EXISTS t1;
          CREATE TABLE t1 AS SELECT
            COLUMN_LIST(COLUMN_CREATE('a',1)),
            COLUMN_JSON(COLUMN_CREATE('b',1));
          SHOW CREATE TABLE t1;

          +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
          | Table | Create Table                                                                                                                                                                        |
          +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
          | t1    | CREATE TABLE `t1` (
            `COLUMN_LIST(COLUMN_CREATE('a',1))` longblob DEFAULT NULL,
            `COLUMN_JSON(COLUMN_CREATE('b',1))` longblob DEFAULT NULL
          ) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
          +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

          Notice that these functions create longblob columns.
          The expected result would be longtext CHARACTER SET utf8mb4.

          sanja Oleksandr Byelkin added a comment -

          revision-id: 2913f615f050f356f7be178e5d91650b86b33e4e (mariadb-10.0.33-17-g2913f615f05)
          parent(s): c0e10f375ad619d825ef7c21232cf5946bdf5be7
          author: Oleksandr Byelkin
          committer: Oleksandr Byelkin
          timestamp: 2017-11-14 10:49:46 +0100
          message:

          MDEV-8949: COLUMN_CREATE unicode name breakage

          Use utf-mb4 if it is possible.

          bar Alexander Barkov added a comment -

          This patch is OK to push:

          revision-id: 2913f615f050f356f7be178e5d91650b86b33e4e (mariadb-10.0.33-17-g2913f615f05)
          parent(s): c0e10f375ad619d825ef7c21232cf5946bdf5be7
          author: Oleksandr Byelkin
          committer: Oleksandr Byelkin
          timestamp: 2017-11-14 10:49:46 +0100
          message:
          bar Alexander Barkov made changes -
          Assignee Alexander Barkov [ bar ] Oleksandr Byelkin [ sanja ]
          Status In Review [ 10002 ] Stalled [ 10000 ]
          sanja Oleksandr Byelkin made changes -
          Fix Version/s 10.0.34 [ 22613 ]
          Fix Version/s 10.1.29 [ 22636 ]
          Fix Version/s 10.2.11 [ 22634 ]
          Fix Version/s 10.3.3 [ 22644 ]
          Fix Version/s 10.2 [ 14601 ]
          Fix Version/s 10.0 [ 16000 ]
          Fix Version/s 10.1 [ 16100 ]
          Resolution Fixed [ 1 ]
          Status Stalled [ 10000 ] Closed [ 6 ]
          serg Sergei Golubchik made changes -
          Workflow MariaDB v3 [ 72124 ] MariaDB v4 [ 149714 ]

          People

            Assignee: sanja Oleksandr Byelkin
            Reporter: adamchainz Adam Johnson