[MCOL-2000] varchar specified sizing is not in characters Created: 2018-12-06 Updated: 2021-06-21 Resolved: 2020-12-04 |
|
| Status: | Closed |
| Project: | MariaDB ColumnStore |
| Component/s: | DDLProc |
| Affects Version/s: | 1.1.5, 1.1.6 |
| Fix Version/s: | 5.5.1 |
| Type: | Task | Priority: | Major |
| Reporter: | Juan | Assignee: | Daniel Lee (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||||||||||||||||||||||||||||||||||||||
| Epic Link: | ColumnStore Compatibility Improvements | ||||||||||||||||||||||||||||||||||||||||||||
| Description |
|
ColumnStore's string length handling is byte based and not character based. To store a 20 double-byte characters, you would need a 40-byte column. Also, functions such as length() handle multiple-byte character strings incorrectly. This violates SQL standards and needs to be corrected. For example: Inserting value "caractère accentué" in a VARCHAR(18) column, value is truncated. This causes problems because in MariaDB a VARCHAR(18) UTF8 column holds 18 characters, so it is an issue doing conversions and imports. PreambleThe following sections describe
SQL standardThe SQL standard allows to specify the size of CHAR, VARCHAR, CLOB data types either in characters or in octets. This is the relevant SQL standard grammar:
If <char length units> is not specified, then CHARACTERS is implicit. The difference between CHARACTERS and OCTETS does not matter for single-byte character sets, but it is important when we deal with multi-byte characters set such as utf8, utf8mb4, big5, etc. MariaDB and CHAR/VARCHARMariaDB does not allow to specify <char length units> yet, but it follows the standard in the fact that it implicitly uses CHARACTERS as units. When working with InnoDB/MyISAM/HEAP, MariaDB works as follows with CHAR/VARCHAR: These columns:
can store up to 10 characters. So it reserves space for 4*10=40 bytes to guarantee that it can store 10 characters of the longest length, where the longest length for a utf8mb4 character is 4. On INSERT to a CHAR/VARCHAR column, if the value is longer than the column width, MariaDB takes the leftmost 10 characters:
MariaDB and TEXT variantsMariaDB does not support the CLOB data type, but it provides TEXT and its variants instead. The MariaDB *TEXT data types correspond to the following SQL standard data types:
Notice, unlike CHAR/VARCHAR, TEXT variants limit in bytes rather than in characters. On insert to a *TEXT column, MariaDB takes the leftmost N bytes of the string (i.e. 255 bytes in case of TINYTEXT). Additionally, it makes sure that a multi-byte character does not break apart: if the cut point hits the middle of a multi-byte character, MariaDB moves towards the beginning of the string and skips all bytes of the current character.
MariaDB and TEXT(N)TEXT(N) is not a separate data type in MariaDB. It maps to one of the *TEXT variant, depending on the length specified, then forgets the length. In case of a single-byte character set, the mapping from TEXT(N) to a TEXT variant is simple:
In case of a multi-byte character set, the mapping from TEXT(N) to a TEXT variant takes into account mbmaxlen of the character set. Using utf8mb4 (whose mbmaxlen is 4) as an example, the mapping looks as follows:
MariaDB+ColumnStore current behaviorColumnStore currently works as follows:
The above result looks wrong in the second record:
Note, if we change ENGINE=ColumnStore to ENGINE=MyISAM (or InnoDB), the result of the SELECT looks like this:
What should be done by this taskUnder terms of this task we'll fix ColumnStore tables as follows:
TestingAfter these changes, ColumnStore should generally work like other MariaDB engines for:
Exception: We won't reproduce the wrong behaviour described in Testing: DDLThese statements should be tested:
in combination with CHARACTER SET clause and in combination with testing for all other topics:
Testing: Character set inheritanceWe should make sure that columns that do not have own explicit CHARACTER SET clause derive CHARACTER SET from the table level as explained above. Testing: data truncationWe'll make sure that:
Let's use characters of different lengths for test data:
The idea is to generate various combinations of these characters and insert them into columns. All kind of sequences should be tested: shorter, equal, and longer than columns storage capability. After inserting, we'll test these characteristics of the data which was actually inserted:
Testing: CHAR/VARCHAR/TEXT columns and metadata viewsWe should make sure that INFORMATION_SCHEMA.COLUMNS and INFORMATION_SCHEMA.COLUMNSTORE_COLUMNS provide equal data type information. This query gets information from the ColumnStore system catalog:
This query gets information from the MariaDB system catalog ("frm" files):
Example:
should produce the following result:
Notice, the two SELECT queries are expected to provide the same output. Before this task, the COLUMN_LENGTH value was 8. We'll do similar tests for CHAR, VARCHAR, TEXT for various length and character set combinations. |
| Comments |
| Comment by David Hall (Inactive) [ 2020-12-02 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Functions, such length() have been modified to handle multibyte sequences correctly. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Daniel Lee (Inactive) [ 2020-12-02 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
1. What was the fix for this ticket? Such low level change is not a simple change and it has serious system wide impact. We need to have impact analysis and design documents for this ticket. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Daniel Lee (Inactive) [ 2020-12-03 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Build tested: 5.5.1 (Drone 1251) Found 3 issues so far Issue #1 Create table with TEXT MariaDB [mytest]> CREATE TABLE ttext7latin1 (c1 TEXT(7)) ENGINE=columnstore CHARSET=latin1; MariaDB [mytest]> CREATE TABLE ttext7utf8 (c1 TEXT(7)) ENGINE=columnstore CHARSET=utf8; Issue #2 Value saturation for TEXT Inserted 8 characters and returned 7 MariaDB [mytest]> CREATE TABLE ttext8latin1 (c1 TEXT(8)) ENGINE=columnstore CHARSET=utf8; MariaDB [mytest]> INSERT INTO ttext8latin1 values('aaaaaaaa'); MariaDB [mytest]> SELECT * FROM ttext8latin1;
----------
---------- #3 String length check and saturation for TEXT The following error is incorrect. If data is too long (since the insert was rejected), it should return: MariaDB [mytest]> set sql_mode='STRICT_TRANS_TABLES,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION'; MariaDB [mytest]> CREATE TABLE ttext8utf8 (c1 TEXT(8)) ENGINE=columnstore CHARSET=utf8; MariaDB [mytest]> INSERT INTO ttext8utf8 values('aaaaaaaa'); MariaDB [mytest]> INSERT INTO ttext8utf8 values('美国福斯新闻旗下'); MariaDB [mytest]> INSERT INTO ttext8utf8 values('美国福斯新闻旗下'); MariaDB [mytest]> SHOW WARNINGS;
--------
-------- MariaDB [mytest]> SELECT * FROM ttext8utf8;
----------
---------- | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Daniel Lee (Inactive) [ 2020-12-03 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Reopen per last test result | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Roman [ 2020-12-04 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Very unfortunate that you tested w/o the latest changes.
| ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Daniel Lee (Inactive) [ 2020-12-04 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Build tested: 5.5.1 (Drone 1265) The latest build yesterday did not have the latest change. Retested this latest build and the reported issues above have been fixed. Testing continues. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Daniel Lee (Inactive) [ 2020-12-04 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Build tested: 5.5.1 (Drone 1265) Tested on the following: Data types: CHAR, VARCHAR, TEXT, TEXT There is an issue on LDI on multi-byte character sets, such as utf8 MariaDB [mytest]> load data infile '/tmp/t.txt' into table mcol2000utf8 columns terminated by "|"; MariaDB [mytest]> select * from mcol2000utf8; -----
-----
The work around is to specify character set. load data infile '/tmp/t.txt' into table mcol2000utf8 character set utf8 columns terminated by "|"; The same LDI test worked on InnoDB table without specifying character set. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Roman [ 2020-12-04 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
We need to compare the behavior against Innodb. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Daniel Lee (Inactive) [ 2020-12-04 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
The identified LDI issue is being tracked by: | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Juan [ 2020-12-04 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Hi drrtuy The problem is not present in MariaDB 10.5.4:
|