[MCOL-4568] Change short CHAR to pad values with spaces rather than zeros Created: 2021-03-02  Updated: 2023-07-01

Status: Open
Project: MariaDB ColumnStore
Component/s: PrimProc
Affects Version/s: 1.2.0, 1.4.0, 5.4.1, 5.5.1
Fix Version/s: 23.10

Type: Task Priority: Major
Reporter: Alexander Barkov Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Blocks
blocks MCOL-4534 MariaDB collation library: improve co... Stalled
Relates
relates to MCOL-4403 Trailing spaces not behaving as InnoDB Closed
Epic Link: ColumnStore Compatibility Improvements

 Description   

I run this SQL script:

DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (a CHAR(4)) ENGINE=ColumnStore;
INSERT INTO t1 VALUES ('a');
SELECT * FROM t1 WHERE a='a';

I set a break-point in colCompare() in primitives/linux-port/column.cpp on this code:

utils::ConstString s1 = {reinterpret_cast<const char*>(&val1), 8};
utils::ConstString s2 = {reinterpret_cast<const char*>(&val2), 8};
return colCompareStr(typeHolder, COP, s1.rtrimZero(), s2.rtrimZero());   

and test these variables:

(gdb) x /4b s1.mStr
0x7f6310fedcf8:	97	0	0	0
(gdb) x /4b s2.mStr
0x7f6310fedcf0:	97	0	0	0

it demonstrates that the values are padded with zero bytes 0x00 rather than spaces 0x20.

Standard compatibility

This is not according to the SQL standard, which assumes the values are padded with spaces. The section "Store assignment" says:

Let T be the TARGET and let V be the VALUE in an application of the General Rules of this Subclause. .
...
If the declared type of T is fixed-length character string with length in characters L and the length in characters M of V is less than L, then the first M characters of T are set to V and the last L–M characters of T are set to <space>s.

MariaDB collation library compatibility

MariaDB collation library also expects the values to be padded with spaces rather than zeros. The above code has to trim trailing zeros. Otherwise the collation library would compare the values in a wrong way.

This hack addresses many use cases. But not all.

0x00 is a normal data in MariaDB, but is not in ColumnStore

0x00 in MariaDB is a normal character which can be a part of the data.

DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (a CHAR(4)) ENGINE=InnoDB;
INSERT INTO t1 VALUES ('a\0\0\0');
SELECT HEX(a) FROM t1;

+----------+
| HEX(a)   |
+----------+
| 61000000 |
+----------+

DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (a CHAR(4)) ENGINE=ColumnStore;
INSERT INTO t1 VALUES ('a\0\0\0');
SELECT HEX(a) FROM t1;

+--------+
| HEX(a) |
+--------+
| 61     |
+--------+

Notice InnoDB preserved the zero bytes, while ColumnStore removed them.

UTF16 compatibility (a possible future development direction)

UTF16 is essential for East Asian scripts, because many East Asian characters fit into two bytes in UTF16 encoding, while the same characters require 3 bytes in UTF8 representation. Users can get a 33.3% performance gain by simply choosing UTF16 instead of UTF8 to store data.

In UTF16, zero bytes are absolutely required. We need to get rid of the assumption that 0x00 means "end of the string".

The order of characters less than space is different in MariaDB vs ColumnStore

DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (a CHAR(4) CHARACTER SET latin1 COLLATE latin1_nopad_bin) ENGINE=InnoDB;
INSERT INTO t1 VALUES ('a'),('a\t');
SELECT HEX(a) FROM t1 ORDER BY a;

+--------+
| HEX(a) |
+--------+
| 6109   |
| 61     |
+--------+

DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (a CHAR(4) CHARACTER SET latin1 COLLATE latin1_nopad_bin) ENGINE=ColumnStore;
INSERT INTO t1 VALUES ('a'),('a\t');
SELECT HEX(a) FROM t1 ORDER BY a;

+--------+
| HEX(a) |
+--------+
| 61     |
| 6109   |
+--------+

Performance

The code in colCompare() has to trim trailing zero bytes before passing the values to MariaDB collation library for comparison. This takes some CPU, useless in most cases, because in the majority cases a difference between two strings is found in the very beginning of the two strings. So the trailing spaces are even not compared.

Proposal

In order to make ColumnStore compatible with MariaDB with respect of zero bytes and trailing spaces, let's change the way how we pad short CHAR values: from zero byte padding to space padding.


Generated at Thu Feb 08 02:51:19 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.