ColumnStore's string length handling is byte based and not character based. To store a 20 double-byte characters, you would need a 40-byte column. Also, functions such as length() handle multiple-byte character strings incorrectly. This violates SQL standards and needs to be corrected.
For example: Inserting value "caractère accentué" in a VARCHAR(18) column, value is truncated.
This causes problems because in MariaDB a VARCHAR(18) UTF8 column holds 18 characters, so it is an issue doing conversions and imports.
The following sections describe
- CHAR/VARCHAR/CLOB behaviour in the SQL Standard
- How CHAR/VARCHAR/TEXT work in MariaDB with other engines (when ColumnStore is not involved)
- How CHAR/VARCHAR/TEXT work with ColumnStore tables
- What should be fixed with ColumnStore tables
- How to test the proposed changes
The SQL standard allows to specify the size of CHAR, VARCHAR, CLOB data types either in characters or in octets. This is the relevant SQL standard grammar:
If <char length units> is not specified, then CHARACTERS is implicit.
The difference between CHARACTERS and OCTETS does not matter for single-byte character sets, but it is important when we deal with multi-byte characters set such as utf8, utf8mb4, big5, etc.
MariaDB does not allow to specify <char length units> yet, but it follows the standard in the fact that it implicitly uses CHARACTERS as units.
When working with InnoDB/MyISAM/HEAP, MariaDB works as follows with CHAR/VARCHAR:
can store up to 10 characters. So it reserves space for 4*10=40 bytes to guarantee that it can store 10 characters of the longest length, where the longest length for a utf8mb4 character is 4.
On INSERT to a CHAR/VARCHAR column, if the value is longer than the column width, MariaDB takes the leftmost 10 characters:
- In case if the value consists of pure ASCII data it takes 10 bytes
- In case if the value consists entirely of 2-byte characters, it takes 20 bytes
- In case if the value consists entirely of 3-byte characters, it takes 30 bytes
- In case if the value consists entirely of 4-byte characters, it will take 40 bytes
- In case of a mixture of characters of a different octet length, the number of bytes depends on the exact data
MariaDB does not support the CLOB data type, but it provides TEXT and its variants instead. The MariaDB *TEXT data types correspond to the following SQL standard data types:
Notice, unlike CHAR/VARCHAR, TEXT variants limit in bytes rather than in characters.
On insert to a *TEXT column, MariaDB takes the leftmost N bytes of the string (i.e. 255 bytes in case of TINYTEXT).
Additionally, it makes sure that a multi-byte character does not break apart: if the cut point hits the middle of a multi-byte character, MariaDB moves towards the beginning of the string and skips all bytes of the current character.
Note, this appeared to be broken in the latest MariaDB versions, see MDEV-24335 for details: MariaDB unexpectedly adds extra question marks at the end, but still MariaDB does not write incomplete multi-byte sequences into a column.
TEXT(N) is not a separate data type in MariaDB. It maps to one of the *TEXT variant, depending on the length specified, then forgets the length.
Before mapping, N is multiplied to mbmaxlen of the column character set.
In case of a single-byte character set, the mapping from TEXT(N) to a TEXT variant is simple:
In case of a multi-byte character set, the mapping from TEXT(N) to a TEXT variant takes into account mbmaxlen of the character set. Using utf8mb4 (whose mbmaxlen is 4) as an example, the mapping looks as follows:
ColumnStore currently works as follows:
- It specifies CHAR/VARCHAR limits in OCTETS, which is neither SQL standard, nor MariaDB compatible.
- It limits TINYTEXT, TEXT, MEDIUMTEXT, LONGTEXT in OCTETS, which is good (MariaDB compatible behaviour)
- However, in TEXT(N), it does not multiply N to mbmaxlen before choosing a proper covering *TEXT variant (MariaDB incompatible behaviour)
- It does not prevent multi-byte sequences from being broken apart, neither in CHAR, nor in VARCHAR, nor in *TEXT (MariaDB incompatible behaviour).
The above result looks wrong in the second record:
- The value was cut on 3 bytes, instead of 3 characters.
- The second cyrillic character 0xD184 was broken apart: only the multi-byte head 0xD1 was written to the column.
Note, if we change ENGINE=ColumnStore to ENGINE=MyISAM (or InnoDB), the result of the SELECT looks like this:
Under terms of this task we'll fix ColumnStore tables as follows:
- Make CHAR/VARCHAR columns limit the data in terms of characters rather than bytes
- Make sure CHAR/VARCHAR columns do not break multi-byte sequences apart
- Make sure TEXT variants still limit data in terms of bytes
- Make sure TEXT variants do not break multi-byte sequences apart
- Make sure TEXT(N) maps to a proper TEXT variant depending on the character set's mbmaxlen
After these changes, ColumnStore should generally work like other MariaDB engines for:
- TEXT and its variants
Exception: We won't reproduce the wrong behaviour described in MDEV-24335: no redundat trailing question marks will be inserted into *TEXT columns on truncation.
These statements should be tested:
- CREATE TABLE
- ALTER TABLE (with ADD, MODIFY, CHANGE)
- CREATE TABLE .. LIKE ..
- ALTER TABLE .. ENGINE=Columnstore
- CREATE TABLE .. AS ..
in combination with CHARACTER SET clause and in combination with testing for all other topics:
- Testing: Character set inheritance
- Testing: data truncation
- Testing: CHAR/VARCHAR/TEXT columns and metadata views
(see other "Testing:" topics below)
We should make sure that columns that do not have own explicit CHARACTER SET clause derive CHARACTER SET from the table level as explained above.
Note, if the table does not have its own CHARACTER SET clause, then CHARACTER SET is derived from the database level, as shown in "SHOW CREATE TABLE".
We'll make sure that:
- CHAR/VARCHAR columns now apply the CHARACTER limit rather than OCTET limit.
- *TEXT columns still apply the OCTET limit (according to the exact *TEXT variant limit).
Let's use characters of different lengths for test data:
The idea is to generate various combinations of these characters and insert them into columns. All kind of sequences should be tested: shorter, equal, and longer than columns storage capability.
After inserting, we'll test these characteristics of the data which was actually inserted:
We should make sure that INFORMATION_SCHEMA.COLUMNS and INFORMATION_SCHEMA.COLUMNSTORE_COLUMNS provide equal data type information.
This query gets information from the ColumnStore system catalog:
This query gets information from the MariaDB system catalog ("frm" files):
should produce the following result:
Notice, the two SELECT queries are expected to provide the same output. Before this task, the COLUMN_LENGTH value was 8.
We'll do similar tests for CHAR, VARCHAR, TEXT for various length and character set combinations.