[MCOL-2000] varchar specified sizing is not in characters Created: 2018-12-06  Updated: 2021-06-21  Resolved: 2020-12-04

Status: Closed
Project: MariaDB ColumnStore
Component/s: DDLProc
Affects Version/s: 1.1.5, 1.1.6
Fix Version/s: 5.5.1

Type: Task Priority: Major
Reporter: Juan Assignee: Daniel Lee (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Blocks
is blocked by MDEV-21504 Collation: Create shared library for ... Closed
Problem/Incident
causes MCOL-4103 Columnstore LDI doesn't support chars... Closed
causes MCOL-4434 LDI multiple-byte table without speci... Open
is caused by MCOL-4434 LDI multiple-byte table without speci... Open
Relates
relates to MCOL-2221 reverse function not work properly wi... Closed
relates to MCOL-4496 INSERT..SELECT does not work for utf8... Closed
relates to MCOL-4412 Create Table with longblob column Int... Closed
relates to MCOL-4691 Major Regression: Selects with aggreg... Closed
Epic Link: ColumnStore Compatibility Improvements

 Description   

ColumnStore's string length handling is byte based and not character based. To store a 20 double-byte characters, you would need a 40-byte column. Also, functions such as length() handle multiple-byte character strings incorrectly. This violates SQL standards and needs to be corrected.

For example: Inserting value "caractère accentué" in a VARCHAR(18) column, value is truncated.

This causes problems because in MariaDB a VARCHAR(18) UTF8 column holds 18 characters, so it is an issue doing conversions and imports.

Preamble

The following sections describe

  • CHAR/VARCHAR/CLOB behaviour in the SQL Standard
  • How CHAR/VARCHAR/TEXT work in MariaDB with other engines (when ColumnStore is not involved)
  • How CHAR/VARCHAR/TEXT work with ColumnStore tables
  • What should be fixed with ColumnStore tables
  • How to test the proposed changes

SQL standard

The SQL standard allows to specify the size of CHAR, VARCHAR, CLOB data types either in characters or in octets. This is the relevant SQL standard grammar:

<character string type> ::=
  CHAR [ <left paren> <character length> <right paren> ]
| VARCHAR <left paren> <character length> <right paren>
| CLOB [ <left paren> <character large object length> <right paren> ]
...
 
<character length>              ::= <length>              [ <char length units> ]
 
<character large object length> ::= <large object length> [ <char length units> ]
 
<char length units> ::= CHARACTERS | OCTETS

If <char length units> is not specified, then CHARACTERS is implicit.

The difference between CHARACTERS and OCTETS does not matter for single-byte character sets, but it is important when we deal with multi-byte characters set such as utf8, utf8mb4, big5, etc.

MariaDB and CHAR/VARCHAR

MariaDB does not allow to specify <char length units> yet, but it follows the standard in the fact that it implicitly uses CHARACTERS as units.

When working with InnoDB/MyISAM/HEAP, MariaDB works as follows with CHAR/VARCHAR:

These columns:

CHAR(10) CHARACTER SET utf8mb4
VARCHAR(10) CHARACTER SET utf8mb4

can store up to 10 characters. So it reserves space for 4*10=40 bytes to guarantee that it can store 10 characters of the longest length, where the longest length for a utf8mb4 character is 4.

On INSERT to a CHAR/VARCHAR column, if the value is longer than the column width, MariaDB takes the leftmost 10 characters:

  • In case if the value consists of pure ASCII data it takes 10 bytes
  • In case if the value consists entirely of 2-byte characters, it takes 20 bytes
  • In case if the value consists entirely of 3-byte characters, it takes 30 bytes
  • In case if the value consists entirely of 4-byte characters, it will take 40 bytes
  • In case of a mixture of characters of a different octet length, the number of bytes depends on the exact data

MariaDB and TEXT variants

MariaDB does not support the CLOB data type, but it provides TEXT and its variants instead. The MariaDB *TEXT data types correspond to the following SQL standard data types:

MariaDB data type  SQL Standard data type
-----------------  ----------------------
TINYTEXT           CLOB(255 OCTETS)
TEXT               CLOB(65535 OCTETS)
MEDIUMTEXT         CLOB(16777215 OCTETS)
LONGTEXT           CLOB(4294967295 OCTETS)

Notice, unlike CHAR/VARCHAR, TEXT variants limit in bytes rather than in characters.

On insert to a *TEXT column, MariaDB takes the leftmost N bytes of the string (i.e. 255 bytes in case of TINYTEXT).

Additionally, it makes sure that a multi-byte character does not break apart: if the cut point hits the middle of a multi-byte character, MariaDB moves towards the beginning of the string and skips all bytes of the current character.

Note, this appeared to be broken in the latest MariaDB versions, see MDEV-24335 for details: MariaDB unexpectedly adds extra question marks at the end, but still MariaDB does not write incomplete multi-byte sequences into a column.

MariaDB and TEXT(N)

TEXT(N) is not a separate data type in MariaDB. It maps to one of the *TEXT variant, depending on the length specified, then forgets the length.
Before mapping, N is multiplied to mbmaxlen of the column character set.

In case of a single-byte character set, the mapping from TEXT(N) to a TEXT variant is simple:

User specified data type             Actual created data type
-----------------------------        ----------------------
TEXT(255)   CHARACTER SET latin1     TINYTEXT
TEXT(256)   CHARACTER SET latin1     TEXT
TEXT(65535) CHARACTER SET latin1     TEXT
TEXT(65536) CHARACTER SET latin1     MEDIUMTEXT
TEXT(16777215) CHARACTER SET latin1  MEDIUMTEXT
TEXT(16777216) CHARACTER SET latin1  LONGTEXT

In case of a multi-byte character set, the mapping from TEXT(N) to a TEXT variant takes into account mbmaxlen of the character set. Using utf8mb4 (whose mbmaxlen is 4) as an example, the mapping looks as follows:

User specified data type             Actual created data type Notes
-----------------------------        ----------------------   -----
TEXT(63) CHARACTER SET utf8mb4       TINYTEXT                 63*4=252
TEXT(64) CHARACTER SET utf8mb4       TEXT                     64*4=256
TEXT(16383) CHARACTER SET utf8mb4    TEXT                     16383*4=65532
TEXT(16384) CHARACTER SET utf8mb4    MEDIUMTEXT               16384*4=65536
TEXT(4194303) CHARACTER SET utf8mb4  MEDIUMTEXT               4194303*4=16777212
TEXT(4194304) CHARACTER SET utf8mb4  MEDIUMTEXT               4194304*4=16777216

MariaDB+ColumnStore current behavior

ColumnStore currently works as follows:

  • It specifies CHAR/VARCHAR limits in OCTETS, which is neither SQL standard, nor MariaDB compatible.
  • It limits TINYTEXT, TEXT, MEDIUMTEXT, LONGTEXT in OCTETS, which is good (MariaDB compatible behaviour)
  • However, in TEXT(N), it does not multiply N to mbmaxlen before choosing a proper covering *TEXT variant (MariaDB incompatible behaviour)
  • It does not prevent multi-byte sequences from being broken apart, neither in CHAR, nor in VARCHAR, nor in *TEXT (MariaDB incompatible behaviour).

SET sql_mode='';
SET NAMES utf8;
DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (a VARCHAR(3) CHARACTER SET utf8mb4) ENGINE=ColumnStore;
INSERT INTO t1 VALUES (REPEAT('a',4));
INSERT INTO t1 VALUES (REPEAT(_utf8 0xD184,4)); -- CYRILLIC SMALL LETTER F
SHOW WARNINGS;

+---------+------+-----------------------------------------------------+
| Level   | Code | Message                                             |
+---------+------+-----------------------------------------------------+
| Warning | 1265 | Data truncated for column 'a' at row 1              |
| Warning | 1264 | CAL0001: IDB-2025: Data truncated for column 'a'    |
+---------+------+-----------------------------------------------------+

SELECT a, HEX(a), OCTET_LENGTH(a), CHAR_LENGTH(a) FROM t1;

+------+--------+-----------------+----------------+
| a    | HEX(a) | OCTET_LENGTH(a) | CHAR_LENGTH(a) |
+------+--------+-----------------+----------------+
| aaa  | 616161 |               3 |              3 |
| ф?   | D184D1 |               3 |              2 |
+------+--------+-----------------+----------------+

The above result looks wrong in the second record:

  • The value was cut on 3 bytes, instead of 3 characters.
  • The second cyrillic character 0xD184 was broken apart: only the multi-byte head 0xD1 was written to the column.

Note, if we change ENGINE=ColumnStore to ENGINE=MyISAM (or InnoDB), the result of the SELECT looks like this:

+--------+--------------+-----------------+----------------+
| a      | HEX(a)       | OCTET_LENGTH(a) | CHAR_LENGTH(a) |
+--------+--------------+-----------------+----------------+
| aaa    | 616161       |               3 |              3 |
| ффф    | D184D184D184 |               6 |              3 |
+--------+--------------+-----------------+----------------+

What should be done by this task

Under terms of this task we'll fix ColumnStore tables as follows:

  • Make CHAR/VARCHAR columns limit the data in terms of characters rather than bytes
  • Make sure CHAR/VARCHAR columns do not break multi-byte sequences apart
  • Make sure TEXT variants still limit data in terms of bytes
  • Make sure TEXT variants do not break multi-byte sequences apart
  • Make sure TEXT(N) maps to a proper TEXT variant depending on the character set's mbmaxlen

Testing

After these changes, ColumnStore should generally work like other MariaDB engines for:

  • CHAR
  • VARCHAR
  • TEXT and its variants
  • TEXT(N)

Exception: We won't reproduce the wrong behaviour described in MDEV-24335: no redundat trailing question marks will be inserted into *TEXT columns on truncation.

Testing: DDL

These statements should be tested:

  • CREATE TABLE
  • ALTER TABLE (with ADD, MODIFY, CHANGE)
  • CREATE TABLE .. LIKE ..
  • ALTER TABLE .. ENGINE=Columnstore
  • CREATE TABLE .. AS ..

in combination with CHARACTER SET clause and in combination with testing for all other topics:

  • Testing: Character set inheritance
  • Testing: data truncation
  • Testing: CHAR/VARCHAR/TEXT columns and metadata views
    (see other "Testing:" topics below)

Testing: Character set inheritance

We should make sure that columns that do not have own explicit CHARACTER SET clause derive CHARACTER SET from the table level as explained above.
Note, if the table does not have its own CHARACTER SET clause, then CHARACTER SET is derived from the database level, as shown in "SHOW CREATE TABLE".

Testing: data truncation

We'll make sure that:

  • CHAR/VARCHAR columns now apply the CHARACTER limit rather than OCTET limit.
  • *TEXT columns still apply the OCTET limit (according to the exact *TEXT variant limit).

Let's use characters of different lengths for test data:

'a'                 - U+0061 LATIN SMALL LETTER A          (1 byte in utf8mb4)
_utf8mb4 0xD184     - U+0444 CYRILLIC SMALL LETTER EF      (2 bytes in utf8mb4)
_utf8mb4 0xE1B482   - U+1D02 LATIN SMALL LETTER TURNED AE  (3 byte in utf8mb4)
_utf8mb4 0xF09F988E - U+1F60E SMILING FACE WITH SUNGLASSES (4 bytes in utf8mb4)

The idea is to generate various combinations of these characters and insert them into columns. All kind of sequences should be tested: shorter, equal, and longer than columns storage capability.

After inserting, we'll test these characteristics of the data which was actually inserted:

SELECT a, HEX(a), OCTET_LENGTH(a), CHAR_LENGTH(a) FROM t1;

Testing: CHAR/VARCHAR/TEXT columns and metadata views

We should make sure that INFORMATION_SCHEMA.COLUMNS and INFORMATION_SCHEMA.COLUMNSTORE_COLUMNS provide equal data type information.

This query gets information from the ColumnStore system catalog:

SELECT
  COLUMN_NAME, DATA_TYPE, COLUMN_LENGTH
FROM
  INFORMATION_SCHEMA.COLUMNSTORE_COLUMNS
WHERE
  TABLE_SCHEMA='test' AND TABLE_NAME='t1';

This query gets information from the MariaDB system catalog ("frm" files):

SELECT
  COLUMN_NAME, DATA_TYPE, CHARACTER_OCTET_LENGTH
FROM
  INFORMATION_SCHEMA.COLUMNS
WHERE
  TABLE_SCHEMA='test' AND TABLE_NAME='t1';

Example:

DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (a CHAR(8) CHARACTER SET utf8) ENGINE=ColumnStore;
SELECT
  COLUMN_NAME, DATA_TYPE, COLUMN_LENGTH
FROM
  INFORMATION_SCHEMA.COLUMNSTORE_COLUMNS
WHERE
  TABLE_SCHEMA='test' AND TABLE_NAME='t1';
SELECT
  COLUMN_NAME, DATA_TYPE, CHARACTER_OCTET_LENGTH
FROM
  INFORMATION_SCHEMA.COLUMNS
WHERE
  TABLE_SCHEMA='test' AND TABLE_NAME='t1';

should produce the following result:

+-------------+-----------+---------------+
| COLUMN_NAME | DATA_TYPE | COLUMN_LENGTH |
+-------------+-----------+---------------+
| a           | char      |            24 |
+-------------+-----------+---------------+

+-------------+-----------+------------------------+
| COLUMN_NAME | DATA_TYPE | CHARACTER_OCTET_LENGTH |
+-------------+-----------+------------------------+
| a           | char      |                     24 |
+-------------+-----------+------------------------+

Notice, the two SELECT queries are expected to provide the same output. Before this task, the COLUMN_LENGTH value was 8.

We'll do similar tests for CHAR, VARCHAR, TEXT for various length and character set combinations.



 Comments   
Comment by David Hall (Inactive) [ 2020-12-02 ]

Functions, such length() have been modified to handle multibyte sequences correctly.

Comment by Daniel Lee (Inactive) [ 2020-12-02 ]

1. What was the fix for this ticket?
2. Is there any information relating to this fix?
3. What area of ColumnStore is impacted by this fix?

Such low level change is not a simple change and it has serious system wide impact. We need to have impact analysis and design documents for this ticket.

Comment by Daniel Lee (Inactive) [ 2020-12-03 ]

Build tested: 5.5.1 (Drone 1251)

Found 3 issues so far

Issue #1

Create table with TEXT columnstore, both latin1 and utf8 tables

MariaDB [mytest]> CREATE TABLE ttext7latin1 (c1 TEXT(7)) ENGINE=columnstore CHARSET=latin1;
ERROR 1815 (HY000): Internal error: CAL0009: (3)Create table failed due to varbinary and blob length may not be less than 8

MariaDB [mytest]> CREATE TABLE ttext7utf8 (c1 TEXT(7)) ENGINE=columnstore CHARSET=utf8;
ERROR 1815 (HY000): Internal error: CAL0009: (3)Create table failed due to varbinary and blob length may not be less than 8

Issue #2

Value saturation for TEXT column, for latin1 table

Inserted 8 characters and returned 7

MariaDB [mytest]> CREATE TABLE ttext8latin1 (c1 TEXT(8)) ENGINE=columnstore CHARSET=utf8;
Query OK, 0 rows affected (0.120 sec)

MariaDB [mytest]> INSERT INTO ttext8latin1 values('aaaaaaaa');
Query OK, 1 row affected (0.135 sec)

MariaDB [mytest]> SELECT * FROM ttext8latin1;
----------

c1

----------

aaaaaaa

----------
1 row in set (0.071 sec)

#3

String length check and saturation for TEXT, utf8 table
For utf8 strings, insert has issues with length check and saturation

The following error is incorrect.
ERROR 1264 (22003): CAL0001: IDB-2025: Data truncated for column 'c1'

If data is too long (since the insert was rejected), it should return:
"ERROR 1406 (22001): Data too long for column 'c1' at row 1"

MariaDB [mytest]> set sql_mode='STRICT_TRANS_TABLES,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION';
Query OK, 0 rows affected (0.000 sec)

MariaDB [mytest]> CREATE TABLE ttext8utf8 (c1 TEXT(8)) ENGINE=columnstore CHARSET=utf8;
Query OK, 0 rows affected (0.129 sec)

MariaDB [mytest]> INSERT INTO ttext8utf8 values('aaaaaaaa');
Query OK, 1 row affected (0.199 sec)

MariaDB [mytest]> INSERT INTO ttext8utf8 values('美国福斯新闻旗下');
ERROR 1264 (22003): CAL0001: IDB-2025: Data truncated for column 'c1'
MariaDB [mytest]> set sql_mode='';
Query OK, 0 rows affected (0.000 sec)

MariaDB [mytest]> INSERT INTO ttext8utf8 values('美国福斯新闻旗下');
Query OK, 1 row affected, 1 warning (0.052 sec)

MariaDB [mytest]> SHOW WARNINGS;
-----------------------------------------------------------------

Level Code Message

-----------------------------------------------------------------

Warning 1264 CAL0001: IDB-2025: Data truncated for column 'c1'

-----------------------------------------------------------------
1 row in set (0.000 sec)

MariaDB [mytest]> SELECT * FROM ttext8utf8;
----------

c1

----------

aaaaaaaa
美国?

----------
2 rows in set (0.084 sec)

Comment by Daniel Lee (Inactive) [ 2020-12-03 ]

Reopen per last test result

Comment by Roman [ 2020-12-04 ]

Very unfortunate that you tested w/o the latest changes.
The tests work with the latest code in develop-1.5

MariaDB [test]> CREATE TABLE ttext7latin1 (c1 TEXT(7)) ENGINE=columnstore CHARSET=latin1;
Query OK, 0 rows affected (0.556 sec)
 
MariaDB [test]> CREATE TABLE ttext7utf8 (c1 TEXT(7)) ENGINE=columnstore CHARSET=utf8;
Query OK, 0 rows affected (0.387 sec)
 
MariaDB [test]> CREATE TABLE ttext8latin1 (c1 TEXT(8)) ENGINE=columnstore CHARSET=utf8;
Query OK, 0 rows affected (0.372 sec)
 
MariaDB [test]>  INSERT INTO ttext8latin1 values('aaaaaaaa');
Query OK, 1 row affected (0.200 sec)
 
MariaDB [test]> SELECT * FROM ttext8latin1;
+----------+
| c1       |
+----------+
| aaaaaaaa |
+----------+
1 row in set (0.066 sec)
 
MariaDB [test]> SELECT length(c1) FROM ttext8latin1;
+------------+
| length(c1) |
+------------+
|          8 |
+------------+
1 row in set (0.011 sec)
 
MariaDB [test]> show status like 'columnstore%';
+-------------------------+-----------+
| Variable_name           | Value     |
+-------------------------+-----------+
| Columnstore_commit_hash | ef78ec927 |
| Columnstore_version     | 5.5.1     |
+-------------------------+-----------+
2 rows in set (0.001 sec)

Comment by Daniel Lee (Inactive) [ 2020-12-04 ]

Build tested: 5.5.1 (Drone 1265)

The latest build yesterday did not have the latest change. Retested this latest build and the reported issues above have been fixed.

Testing continues.

Comment by Daniel Lee (Inactive) [ 2020-12-04 ]

Build tested: 5.5.1 (Drone 1265)

Tested on the following:

Data types: CHAR, VARCHAR, TEXT, TEXT, TINYTEXT, MEDKUMTEXT, LONGTEXT
INSERT, UPDATE, DELETE, queries,
joins and string comparison
functions
cpimport
LDI, with and without batch insert on

There is an issue on LDI on multi-byte character sets, such as utf8
If running LDI without specify character set, garbage would be loaded

MariaDB [mytest]> load data infile '/tmp/t.txt' into table mcol2000utf8 columns terminated by "|";
Query OK, 3 rows affected, 18 warnings (0.406 sec)
Records: 3 Deleted: 0 Skipped: 0 Warnings: 18

MariaDB [mytest]> select * from mcol2000utf8;

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

1 ç¾ ç¾Ž 美国福æ ç¾ ç¾Ž 美国福æ 美国福斯新闻旗下 美国福斯新闻旗下位于纽约 美国福斯新闻旗下位于纽约的第5 美国福斯新闻旗下位于纽约的第5频道报导, 美国福斯新闻旗下位于纽约的第5频道报导,美国福斯新
2 ç¾ ç¾Ž 美国福æ ç¾ ç¾Ž 美国福æ 美国福斯新闻旗下 美国福斯新闻旗下位于纽约 美国福斯新闻旗下位于纽约的第5 美国福斯新闻旗下位于纽约的第5频道报导, 美国福斯新闻旗下位于纽约的第5频道报导,美国福斯新
3 ç¾ ç¾Ž 美国福æ ç¾ ç¾Ž 美国福æ 美国福斯新闻旗下 美国福斯新闻旗下位于纽约 美国福斯新闻旗下位于纽约的第5 美国福斯新闻旗下位于纽约的第5频道报导, 美国福斯新闻旗下位于纽约的第5频道报导,美国福斯新

The work around is to specify character set.

load data infile '/tmp/t.txt' into table mcol2000utf8 character set utf8 columns terminated by "|";

The same LDI test worked on InnoDB table without specifying character set.

Comment by Roman [ 2020-12-04 ]

We need to compare the behavior against Innodb.

Comment by Daniel Lee (Inactive) [ 2020-12-04 ]

The identified LDI issue is being tracked by:

https://jira.mariadb.org/browse/MCOL-4434

Comment by Juan [ 2020-12-04 ]

Hi drrtuy

The problem is not present in MariaDB 10.5.4:

MariaDB [test]> insert into simple( testchar) values('123456789012345678');
Query OK, 1 row affected (0.032 sec)
 
MariaDB [test]> insert into simple( testchar) values('caractère accentué');
Query OK, 1 row affected (0.028 sec)
 
MariaDB [test]> select id, testchar from simple;
+----+----------------------+
| id | testchar             |
+----+----------------------+
|  1 | 123456789012345678   |
|  2 | caractère accentué   |
+----+----------------------+
2 rows in set (0.000 sec)
 
MariaDB [test]> show create table simple\G
*************************** 1. row ***************************
       Table: simple
Create Table: CREATE TABLE `simple` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `modified` timestamp NOT NULL DEFAULT current_timestamp() ON UPDATE current_timestamp(),
  `val` int(10) unsigned DEFAULT NULL,
  `blah` varchar(511) DEFAULT NULL,
  `testchar` varchar(18) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `modified` (`modified`),
  KEY `blah100` (`blah`(100))
) ENGINE=InnoDB AUTO_INCREMENT=3 DEFAULT CHARSET=utf8
1 row in set (0.000 sec)

Generated at Thu Feb 08 02:32:58 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.