MariaDB Server / MDEV-28078

Garbage on multiple equal ENUMs with tricky character sets

Details

    Description

      I create a table with two similar ENUM columns, both using CHARACTER SET utf32:

      DROP TABLE IF EXISTS t1;
      CREATE TABLE t1 (
        c1 ENUM ('a','b') CHARACTER SET utf32 DEFAULT 'a',
        c2 ENUM ('a','b') CHARACTER SET utf32 DEFAULT 'a' 
      );
      SHOW CREATE TABLE t1;
      

      +-------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
      | Table | Create Table                                                                                                                                                                |
      +-------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
      | t1    | CREATE TABLE `t1` (
        `c1` enum('??','??') CHARACTER SET utf32 DEFAULT '??',
        `c2` enum('??','??') CHARACTER SET utf32 DEFAULT '??'
      ) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
      +-------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
      

      Notice that SHOW CREATE TABLE returns garbage instead of the ENUM values.

      The problem happens in this piece of the code in table.cc:

          if (interval_nr && charset->mbminlen > 1)
          {
            /* Unescape UCS2 intervals from HEX notation */
            TYPELIB *interval= share->intervals + interval_nr - 1;
            unhex_type2(interval);
      

      As the two TYPELIBs are equal, only one copy of this TYPELIB is stored in the FRM file, but unhex_type2() is called on it twice, once for each column.
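
      The effect of decoding a shared value list twice can be reproduced with a small stand-alone model. This is only a sketch: `Typelib`, `hex_digit`, `unhex_in_place` and the two `first_value_*` helpers are hypothetical names invented for illustration, not the server's types or functions, and the once-only guard at the end is one conceivable mitigation, not necessarily the change that was committed for this issue.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Simplified stand-in for a TYPELIB: just the list of value strings.
struct Typelib {
    std::vector<std::string> values;
};

// Map one hex digit to its numeric value. Non-hex characters map to 0,
// which is an assumption of this sketch, not the server's behaviour.
static int hex_digit(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return 0;
}

// Decode every value from HEX notation to raw bytes, in place.
// Each call halves the length of every value.
static void unhex_in_place(Typelib &t) {
    for (std::string &v : t.values) {
        assert(v.size() % 2 == 0);
        std::string out;
        for (std::size_t i = 0; i + 1 < v.size(); i += 2)
            out.push_back(static_cast<char>(hex_digit(v[i]) * 16 +
                                            hex_digit(v[i + 1])));
        v = out;
    }
}

// Models the reported behaviour: one shared list, decoded once per column.
std::string first_value_after(int columns_sharing_typelib) {
    Typelib shared;
    shared.values = {"00000061", "00000062"};  // HEX form of utf32 'a','b'
    for (int i = 0; i < columns_sharing_typelib; i++)
        unhex_in_place(shared);
    return shared.values[0];
}

// Hypothetical guard: remember that the shared list was already decoded.
std::string first_value_with_guard(int columns_sharing_typelib) {
    Typelib shared;
    shared.values = {"00000061", "00000062"};
    bool decoded = false;
    for (int i = 0; i < columns_sharing_typelib; i++) {
        if (!decoded) {
            unhex_in_place(shared);
            decoded = true;
        }
    }
    return shared.values[0];
}
```

      With one column the first value decodes to the four utf32 bytes of 'a'. With two columns sharing the list, the second pass halves the length again, so the value is no longer a whole utf32 character.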

      Note that TYPELIBs for tricky character sets like utf32 are stored in HEX notation. So the same problem is repeatable with a latin1 ENUM column whose values are equal to the HEX representations of the utf32 ENUM column's values:

      DROP TABLE IF EXISTS t1;
      CREATE TABLE t1 (
        c1 ENUM ('00000061','00000062') DEFAULT '00000061' COLLATE latin1_bin,
        c2 ENUM ('a','b') DEFAULT 'a' COLLATE utf32_general_ci
      );
      SHOW CREATE TABLE t1;
      

      +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
      | Table | Create Table                                                                                                                                                                                                |
      +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
      | t1    | CREATE TABLE `t1` (
        `c1` enum('\0\0\0a','\0\0\0b') CHARACTER SET latin1 COLLATE latin1_bin DEFAULT '\0\0\0a',
        `c2` enum('a','b') CHARACTER SET utf32 DEFAULT 'a'
      ) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
      +-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
      

      As in the previous example, only one copy of the TYPELIB is stored in the FRM file (because the two value lists are binary equal).
      unhex_type2() is called on this TYPELIB to unescape the utf32 column's values, but the latin1 column points to the same TYPELIB, so its values get unescaped as well.
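
      The sharing between a plain and a HEX-escaped value list can be modelled as a deduplication step keyed on the stored bytes alone. The sketch below is an assumption-laden illustration: `StoredInterval`, `dedup_by_bytes`, `dedup_by_bytes_and_escaping` and `example_columns` are hypothetical names, and keying on the escaping flag is just one way the collision could be avoided, not a description of the server's FRM writer or of the committed fix.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical description of a column's value list as written to the file.
struct StoredInterval {
    std::vector<std::string> stored;  // strings exactly as stored
    bool hex_escaped;                 // true when the charset has mbminlen > 1
};

// Deduplicate on stored bytes only. Returns, per column, the index of the
// interval slot that the column refers to.
std::vector<std::size_t> dedup_by_bytes(const std::vector<StoredInterval> &cols) {
    std::map<std::vector<std::string>, std::size_t> seen;
    std::vector<std::size_t> slot;
    for (const StoredInterval &c : cols) {
        // emplace keeps the existing entry when the key is already present
        auto it = seen.emplace(c.stored, seen.size()).first;
        slot.push_back(it->second);
    }
    assert(slot.size() == cols.size());
    return slot;
}

// Deduplicate on (hex_escaped, stored bytes), so an escaped list and a
// plain list never end up in the same slot.
std::vector<std::size_t> dedup_by_bytes_and_escaping(
        const std::vector<StoredInterval> &cols) {
    std::map<std::pair<bool, std::vector<std::string>>, std::size_t> seen;
    std::vector<std::size_t> slot;
    for (const StoredInterval &c : cols) {
        auto it = seen.emplace(std::make_pair(c.hex_escaped, c.stored),
                               seen.size()).first;
        slot.push_back(it->second);
    }
    assert(slot.size() == cols.size());
    return slot;
}

// The two columns from the example above, in their stored form.
std::vector<StoredInterval> example_columns() {
    StoredInterval c1;  // latin1 values, stored verbatim
    c1.stored = {"00000061", "00000062"};
    c1.hex_escaped = false;
    StoredInterval c2;  // utf32 'a','b', stored in HEX notation
    c2.stored = {"00000061", "00000062"};
    c2.hex_escaped = true;
    return {c1, c2};
}
```

      In this model, `dedup_by_bytes` assigns both columns the same slot, which is the situation that lets the unescaping of c2 rewrite the values of c1. The second variant keeps them in separate slots.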

          Activity

            bar Alexander Barkov created issue -
            bar Alexander Barkov made changes -
            Field Original Value New Value
            bar Alexander Barkov made changes -
            Priority: Major [ 3 ] -> Critical [ 2 ]
            bar Alexander Barkov made changes -
            Status: Open [ 1 ] -> In Progress [ 3 ]
            bar Alexander Barkov made changes -
            Assignee: Alexander Barkov [ bar ] -> Alexey Botchkov [ holyfoot ]
            Status: In Progress [ 3 ] -> In Review [ 10002 ]
            holyfoot Alexey Botchkov made changes -
            Assignee: Alexey Botchkov [ holyfoot ] -> Alexander Barkov [ bar ]
            Status: In Review [ 10002 ] -> Stalled [ 10000 ]
            serg Sergei Golubchik made changes -
            Fix Version/s 10.3 [ 22126 ]
            Fix Version/s 10.4 [ 22408 ]
            Fix Version/s 10.5 [ 23123 ]
            Fix Version/s 10.6 [ 24028 ]
            Fix Version/s 10.7 [ 24805 ]
            bar Alexander Barkov made changes -
            issue.field.resolutiondate 2022-03-18 05:11:08.0 2022-03-18 05:11:08.22
            bar Alexander Barkov made changes -
            Fix Version/s 10.2.44 [ 27514 ]
            Fix Version/s 10.3.35 [ 27512 ]
            Fix Version/s 10.4.25 [ 27510 ]
            Fix Version/s 10.5.16 [ 27508 ]
            Fix Version/s 10.6.8 [ 27506 ]
            Fix Version/s 10.7.4 [ 27504 ]
            Fix Version/s 10.8.3 [ 27502 ]
            Fix Version/s 10.2 [ 14601 ]
            Fix Version/s 10.3 [ 22126 ]
            Fix Version/s 10.4 [ 22408 ]
            Fix Version/s 10.5 [ 23123 ]
            Fix Version/s 10.6 [ 24028 ]
            Fix Version/s 10.7 [ 24805 ]
            Resolution Fixed [ 1 ]
            Status Stalled [ 10000 ] Closed [ 6 ]

            People

              bar Alexander Barkov
              bar Alexander Barkov