Details
-
Task
-
Status: Open (View Workflow)
-
Major
-
Resolution: Unresolved
-
1.2.0, 1.4.0, 5.4.1, 5.5.1
-
None
Description
I run this SQL script:
DROP TABLE IF EXISTS t1; |
CREATE TABLE t1 (a CHAR(4)) ENGINE=ColumnStore; |
INSERT INTO t1 VALUES ('a'); |
SELECT * FROM t1 WHERE a='a'; |
I set a break-point in colCompare() in primitives/linux-port/column.cpp on this code:
utils::ConstString s1 = {reinterpret_cast<const char*>(&val1), 8}; |
utils::ConstString s2 = {reinterpret_cast<const char*>(&val2), 8}; |
return colCompareStr(typeHolder, COP, s1.rtrimZero(), s2.rtrimZero()); |
and test these variables:
(gdb) x /4b s1.mStr
|
0x7f6310fedcf8: 97 0 0 0
|
(gdb) x /4b s2.mStr
|
0x7f6310fedcf0: 97 0 0 0
|
it demonstrates that the values are padded with zero bytes 0x00 rather than spaces 0x20.
Standard compatibility
This is not according to the SQL standard, which assumes the values are padded with spaces. The section "Store assignment" says:
Let T be the TARGET and let V be the VALUE in an application of the General Rules of this Subclause. .
...
If the declared type of T is fixed-length character string with length in characters L and the length in characters M of V is less than L, then the first M characters of T are set to V and the last L–M characters of T are set to <space>s.
MariaDB collation library compatibility
MariaDB collation library also expects the values to be padded with spaces rather than zeros. The above code has to trim trailing zeros. Otherwise the collation library would compare the values in a wrong way.
This hack addresses many use cases. But not all.
0x00 is a normal data in MariaDB, but is not in ColumnStore
0x00 in MariaDB is a normal character which can be a part of the data.
DROP TABLE IF EXISTS t1; |
CREATE TABLE t1 (a CHAR(4)) ENGINE=InnoDB; |
INSERT INTO t1 VALUES ('a\0\0\0'); |
SELECT HEX(a) FROM t1; |
+----------+
|
| HEX(a) |
|
+----------+
|
| 61000000 |
|
+----------+
|
DROP TABLE IF EXISTS t1; |
CREATE TABLE t1 (a CHAR(4)) ENGINE=ColumnStore; |
INSERT INTO t1 VALUES ('a\0\0\0'); |
SELECT HEX(a) FROM t1; |
+--------+
|
| HEX(a) |
|
+--------+
|
| 61 |
|
+--------+
|
Notice InnoDB preserved the zero bytes, while ColumnStore removed them.
UTF16 compatibility (a possible future development direction)
UTF16 is essential for East Asian scripts, because many East Asian characters fit into two bytes in UTF16 encoding, while the same characters require 3 bytes in UTF8 representation. Users can get a 33.3% performance gain by simply choosing UTF16 instead of UTF8 to store data.
In UTF16, zero bytes are absolutely required. We need to get rid of the assumption that 0x00 means "end of the string".
The order of characters less than space is different in MariaDB vs ColumnStore
DROP TABLE IF EXISTS t1; |
CREATE TABLE t1 (a CHAR(4) CHARACTER SET latin1 COLLATE latin1_nopad_bin) ENGINE=InnoDB; |
INSERT INTO t1 VALUES ('a'),('a\t'); |
SELECT HEX(a) FROM t1 ORDER BY a; |
+--------+
|
| HEX(a) |
|
+--------+
|
| 6109 |
|
| 61 |
|
+--------+
|
DROP TABLE IF EXISTS t1; |
CREATE TABLE t1 (a CHAR(4) CHARACTER SET latin1 COLLATE latin1_nopad_bin) ENGINE=ColumnStore; |
INSERT INTO t1 VALUES ('a'),('a\t'); |
SELECT HEX(a) FROM t1 ORDER BY a; |
+--------+
|
| HEX(a) |
|
+--------+
|
| 61 |
|
| 6109 |
|
+--------+
|
Performance
The code in colCompare() has to trim trailing zero bytes before passing the values to MariaDB collation library for comparison. This takes some CPU, useless in most cases, because in the majority cases a difference between two strings is found in the very beginning of the two strings. So the trailing spaces are even not compared.
Proposal
In order to make ColumnStore compatible with MariaDB with respect of zero bytes and trailing spaces, let's change the way how we pad short CHAR values: from zero byte padding to space padding.