Uploaded image for project: 'MariaDB ColumnStore'
  1. MariaDB ColumnStore
  2. MCOL-2000

varchar specified sizing is not in characters

    XMLWordPrintable

Details

    • Task
    • Status: Closed (View Workflow)
    • Major
    • Resolution: Fixed
    • 1.1.5, 1.1.6
    • 5.5.1
    • DDLProc
    • None

    Description

      ColumnStore's string length handling is byte based and not character based. To store a 20 double-byte characters, you would need a 40-byte column. Also, functions such as length() handle multiple-byte character strings incorrectly. This violates SQL standards and needs to be corrected.

      For example: Inserting value "caractère accentué" in a VARCHAR(18) column, value is truncated.

      This causes problems because in MariaDB a VARCHAR(18) UTF8 column holds 18 characters, so it is an issue doing conversions and imports.

      Preamble

      The following sections describe

      • CHAR/VARCHAR/CLOB behaviour in the SQL Standard
      • How CHAR/VARCHAR/TEXT work in MariaDB with other engines (when ColumnStore is not involved)
      • How CHAR/VARCHAR/TEXT work with ColumnStore tables
      • What should be fixed with ColumnStore tables
      • How to test the proposed changes

      SQL standard

      The SQL standard allows to specify the size of CHAR, VARCHAR, CLOB data types either in characters or in octets. This is the relevant SQL standard grammar:

      <character string type> ::=
        CHAR [ <left paren> <character length> <right paren> ]
      | VARCHAR <left paren> <character length> <right paren>
      | CLOB [ <left paren> <character large object length> <right paren> ]
      ...
       
      <character length>              ::= <length>              [ <char length units> ]
       
      <character large object length> ::= <large object length> [ <char length units> ]
       
      <char length units> ::= CHARACTERS | OCTETS
      

      If <char length units> is not specified, then CHARACTERS is implicit.

      The difference between CHARACTERS and OCTETS does not matter for single-byte character sets, but it is important when we deal with multi-byte characters set such as utf8, utf8mb4, big5, etc.

      MariaDB and CHAR/VARCHAR

      MariaDB does not allow to specify <char length units> yet, but it follows the standard in the fact that it implicitly uses CHARACTERS as units.

      When working with InnoDB/MyISAM/HEAP, MariaDB works as follows with CHAR/VARCHAR:

      These columns:

      CHAR(10) CHARACTER SET utf8mb4
      VARCHAR(10) CHARACTER SET utf8mb4
      

      can store up to 10 characters. So it reserves space for 4*10=40 bytes to guarantee that it can store 10 characters of the longest length, where the longest length for a utf8mb4 character is 4.

      On INSERT to a CHAR/VARCHAR column, if the value is longer than the column width, MariaDB takes the leftmost 10 characters:

      • In case if the value consists of pure ASCII data it takes 10 bytes
      • In case if the value consists entirely of 2-byte characters, it takes 20 bytes
      • In case if the value consists entirely of 3-byte characters, it takes 30 bytes
      • In case if the value consists entirely of 4-byte characters, it will take 40 bytes
      • In case of a mixture of characters of a different octet length, the number of bytes depends on the exact data

      MariaDB and TEXT variants

      MariaDB does not support the CLOB data type, but it provides TEXT and its variants instead. The MariaDB *TEXT data types correspond to the following SQL standard data types:

      MariaDB data type  SQL Standard data type
      -----------------  ----------------------
      TINYTEXT           CLOB(255 OCTETS)
      TEXT               CLOB(65535 OCTETS)
      MEDIUMTEXT         CLOB(16777215 OCTETS)
      LONGTEXT           CLOB(4294967295 OCTETS)
      

      Notice, unlike CHAR/VARCHAR, TEXT variants limit in bytes rather than in characters.

      On insert to a *TEXT column, MariaDB takes the leftmost N bytes of the string (i.e. 255 bytes in case of TINYTEXT).

      Additionally, it makes sure that a multi-byte character does not break apart: if the cut point hits the middle of a multi-byte character, MariaDB moves towards the beginning of the string and skips all bytes of the current character.

      Note, this appeared to be broken in the latest MariaDB versions, see MDEV-24335 for details: MariaDB unexpectedly adds extra question marks at the end, but still MariaDB does not write incomplete multi-byte sequences into a column.

      MariaDB and TEXT(N)

      TEXT(N) is not a separate data type in MariaDB. It maps to one of the *TEXT variant, depending on the length specified, then forgets the length.
      Before mapping, N is multiplied to mbmaxlen of the column character set.

      In case of a single-byte character set, the mapping from TEXT(N) to a TEXT variant is simple:

      User specified data type             Actual created data type
      -----------------------------        ----------------------
      TEXT(255)   CHARACTER SET latin1     TINYTEXT
      TEXT(256)   CHARACTER SET latin1     TEXT
      TEXT(65535) CHARACTER SET latin1     TEXT
      TEXT(65536) CHARACTER SET latin1     MEDIUMTEXT
      TEXT(16777215) CHARACTER SET latin1  MEDIUMTEXT
      TEXT(16777216) CHARACTER SET latin1  LONGTEXT
      

      In case of a multi-byte character set, the mapping from TEXT(N) to a TEXT variant takes into account mbmaxlen of the character set. Using utf8mb4 (whose mbmaxlen is 4) as an example, the mapping looks as follows:

      User specified data type             Actual created data type Notes
      -----------------------------        ----------------------   -----
      TEXT(63) CHARACTER SET utf8mb4       TINYTEXT                 63*4=252
      TEXT(64) CHARACTER SET utf8mb4       TEXT                     64*4=256
      TEXT(16383) CHARACTER SET utf8mb4    TEXT                     16383*4=65532
      TEXT(16384) CHARACTER SET utf8mb4    MEDIUMTEXT               16384*4=65536
      TEXT(4194303) CHARACTER SET utf8mb4  MEDIUMTEXT               4194303*4=16777212
      TEXT(4194304) CHARACTER SET utf8mb4  MEDIUMTEXT               4194304*4=16777216
      

      MariaDB+ColumnStore current behavior

      ColumnStore currently works as follows:

      • It specifies CHAR/VARCHAR limits in OCTETS, which is neither SQL standard, nor MariaDB compatible.
      • It limits TINYTEXT, TEXT, MEDIUMTEXT, LONGTEXT in OCTETS, which is good (MariaDB compatible behaviour)
      • However, in TEXT(N), it does not multiply N to mbmaxlen before choosing a proper covering *TEXT variant (MariaDB incompatible behaviour)
      • It does not prevent multi-byte sequences from being broken apart, neither in CHAR, nor in VARCHAR, nor in *TEXT (MariaDB incompatible behaviour).

      SET sql_mode='';
      SET NAMES utf8;
      DROP TABLE IF EXISTS t1;
      CREATE TABLE t1 (a VARCHAR(3) CHARACTER SET utf8mb4) ENGINE=ColumnStore;
      INSERT INTO t1 VALUES (REPEAT('a',4));
      INSERT INTO t1 VALUES (REPEAT(_utf8 0xD184,4)); -- CYRILLIC SMALL LETTER F
      SHOW WARNINGS;
      

      +---------+------+-----------------------------------------------------+
      | Level   | Code | Message                                             |
      +---------+------+-----------------------------------------------------+
      | Warning | 1265 | Data truncated for column 'a' at row 1              |
      | Warning | 1264 | CAL0001: IDB-2025: Data truncated for column 'a'    |
      +---------+------+-----------------------------------------------------+
      

      SELECT a, HEX(a), OCTET_LENGTH(a), CHAR_LENGTH(a) FROM t1;
      

      +------+--------+-----------------+----------------+
      | a    | HEX(a) | OCTET_LENGTH(a) | CHAR_LENGTH(a) |
      +------+--------+-----------------+----------------+
      | aaa  | 616161 |               3 |              3 |
      | ф?   | D184D1 |               3 |              2 |
      +------+--------+-----------------+----------------+
      

      The above result looks wrong in the second record:

      • The value was cut on 3 bytes, instead of 3 characters.
      • The second cyrillic character 0xD184 was broken apart: only the multi-byte head 0xD1 was written to the column.

      Note, if we change ENGINE=ColumnStore to ENGINE=MyISAM (or InnoDB), the result of the SELECT looks like this:

      +--------+--------------+-----------------+----------------+
      | a      | HEX(a)       | OCTET_LENGTH(a) | CHAR_LENGTH(a) |
      +--------+--------------+-----------------+----------------+
      | aaa    | 616161       |               3 |              3 |
      | ффф    | D184D184D184 |               6 |              3 |
      +--------+--------------+-----------------+----------------+
      

      What should be done by this task

      Under terms of this task we'll fix ColumnStore tables as follows:

      • Make CHAR/VARCHAR columns limit the data in terms of characters rather than bytes
      • Make sure CHAR/VARCHAR columns do not break multi-byte sequences apart
      • Make sure TEXT variants still limit data in terms of bytes
      • Make sure TEXT variants do not break multi-byte sequences apart
      • Make sure TEXT(N) maps to a proper TEXT variant depending on the character set's mbmaxlen

      Testing

      After these changes, ColumnStore should generally work like other MariaDB engines for:

      • CHAR
      • VARCHAR
      • TEXT and its variants
      • TEXT(N)

      Exception: We won't reproduce the wrong behaviour described in MDEV-24335: no redundat trailing question marks will be inserted into *TEXT columns on truncation.

      Testing: DDL

      These statements should be tested:

      • CREATE TABLE
      • ALTER TABLE (with ADD, MODIFY, CHANGE)
      • CREATE TABLE .. LIKE ..
      • ALTER TABLE .. ENGINE=Columnstore
      • CREATE TABLE .. AS ..

      in combination with CHARACTER SET clause and in combination with testing for all other topics:

      • Testing: Character set inheritance
      • Testing: data truncation
      • Testing: CHAR/VARCHAR/TEXT columns and metadata views
        (see other "Testing:" topics below)

      Testing: Character set inheritance

      We should make sure that columns that do not have own explicit CHARACTER SET clause derive CHARACTER SET from the table level as explained above.
      Note, if the table does not have its own CHARACTER SET clause, then CHARACTER SET is derived from the database level, as shown in "SHOW CREATE TABLE".

      Testing: data truncation

      We'll make sure that:

      • CHAR/VARCHAR columns now apply the CHARACTER limit rather than OCTET limit.
      • *TEXT columns still apply the OCTET limit (according to the exact *TEXT variant limit).

      Let's use characters of different lengths for test data:

      'a'                 - U+0061 LATIN SMALL LETTER A          (1 byte in utf8mb4)
      _utf8mb4 0xD184     - U+0444 CYRILLIC SMALL LETTER EF      (2 bytes in utf8mb4)
      _utf8mb4 0xE1B482   - U+1D02 LATIN SMALL LETTER TURNED AE  (3 byte in utf8mb4)
      _utf8mb4 0xF09F988E - U+1F60E SMILING FACE WITH SUNGLASSES (4 bytes in utf8mb4)
      

      The idea is to generate various combinations of these characters and insert them into columns. All kind of sequences should be tested: shorter, equal, and longer than columns storage capability.

      After inserting, we'll test these characteristics of the data which was actually inserted:

      SELECT a, HEX(a), OCTET_LENGTH(a), CHAR_LENGTH(a) FROM t1;
      

      Testing: CHAR/VARCHAR/TEXT columns and metadata views

      We should make sure that INFORMATION_SCHEMA.COLUMNS and INFORMATION_SCHEMA.COLUMNSTORE_COLUMNS provide equal data type information.

      This query gets information from the ColumnStore system catalog:

      SELECT
        COLUMN_NAME, DATA_TYPE, COLUMN_LENGTH
      FROM
        INFORMATION_SCHEMA.COLUMNSTORE_COLUMNS
      WHERE
        TABLE_SCHEMA='test' AND TABLE_NAME='t1';
      

      This query gets information from the MariaDB system catalog ("frm" files):

      SELECT
        COLUMN_NAME, DATA_TYPE, CHARACTER_OCTET_LENGTH
      FROM
        INFORMATION_SCHEMA.COLUMNS
      WHERE
        TABLE_SCHEMA='test' AND TABLE_NAME='t1';
      

      Example:

      DROP TABLE IF EXISTS t1;
      CREATE TABLE t1 (a CHAR(8) CHARACTER SET utf8) ENGINE=ColumnStore;
      SELECT
        COLUMN_NAME, DATA_TYPE, COLUMN_LENGTH
      FROM
        INFORMATION_SCHEMA.COLUMNSTORE_COLUMNS
      WHERE
        TABLE_SCHEMA='test' AND TABLE_NAME='t1';
      SELECT
        COLUMN_NAME, DATA_TYPE, CHARACTER_OCTET_LENGTH
      FROM
        INFORMATION_SCHEMA.COLUMNS
      WHERE
        TABLE_SCHEMA='test' AND TABLE_NAME='t1';
      

      should produce the following result:

      +-------------+-----------+---------------+
      | COLUMN_NAME | DATA_TYPE | COLUMN_LENGTH |
      +-------------+-----------+---------------+
      | a           | char      |            24 |
      +-------------+-----------+---------------+
      

      +-------------+-----------+------------------------+
      | COLUMN_NAME | DATA_TYPE | CHARACTER_OCTET_LENGTH |
      +-------------+-----------+------------------------+
      | a           | char      |                     24 |
      +-------------+-----------+------------------------+
      

      Notice, the two SELECT queries are expected to provide the same output. Before this task, the COLUMN_LENGTH value was 8.

      We'll do similar tests for CHAR, VARCHAR, TEXT for various length and character set combinations.

      Attachments

        Issue Links

          Activity

            People

              dleeyh Daniel Lee (Inactive)
              juan.vera Juan
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.