Uploaded image for project: 'MariaDB ColumnStore'
  1. MariaDB ColumnStore
  2. MCOL-4568

Change short CHAR to pad values with spaces rather than zeros

    XMLWordPrintable

Details

    • Task
    • Status: Open (View Workflow)
    • Major
    • Resolution: Unresolved
    • 1.2.0, 1.4.0, 5.4.1, 5.5.1
    • 23.10
    • PrimProc
    • None

    Description

      I run this SQL script:

      DROP TABLE IF EXISTS t1;
      CREATE TABLE t1 (a CHAR(4)) ENGINE=ColumnStore;
      INSERT INTO t1 VALUES ('a');
      SELECT * FROM t1 WHERE a='a';
      

      I set a break-point in colCompare() in primitives/linux-port/column.cpp on this code:

      utils::ConstString s1 = {reinterpret_cast<const char*>(&val1), 8};
      utils::ConstString s2 = {reinterpret_cast<const char*>(&val2), 8};
      return colCompareStr(typeHolder, COP, s1.rtrimZero(), s2.rtrimZero());   
      

      and test these variables:

      (gdb) x /4b s1.mStr
      0x7f6310fedcf8:	97	0	0	0
      (gdb) x /4b s2.mStr
      0x7f6310fedcf0:	97	0	0	0
      

      it demonstrates that the values are padded with zero bytes 0x00 rather than spaces 0x20.

      Standard compatibility

      This is not according to the SQL standard, which assumes the values are padded with spaces. The section "Store assignment" says:

      Let T be the TARGET and let V be the VALUE in an application of the General Rules of this Subclause. .
      ...
      If the declared type of T is fixed-length character string with length in characters L and the length in characters M of V is less than L, then the first M characters of T are set to V and the last L–M characters of T are set to <space>s.

      MariaDB collation library compatibility

      MariaDB collation library also expects the values to be padded with spaces rather than zeros. The above code has to trim trailing zeros. Otherwise the collation library would compare the values in a wrong way.

      This hack addresses many use cases. But not all.

      0x00 is a normal data in MariaDB, but is not in ColumnStore

      0x00 in MariaDB is a normal character which can be a part of the data.

      DROP TABLE IF EXISTS t1;
      CREATE TABLE t1 (a CHAR(4)) ENGINE=InnoDB;
      INSERT INTO t1 VALUES ('a\0\0\0');
      SELECT HEX(a) FROM t1;
      

      +----------+
      | HEX(a)   |
      +----------+
      | 61000000 |
      +----------+
      

      DROP TABLE IF EXISTS t1;
      CREATE TABLE t1 (a CHAR(4)) ENGINE=ColumnStore;
      INSERT INTO t1 VALUES ('a\0\0\0');
      SELECT HEX(a) FROM t1;
      

      +--------+
      | HEX(a) |
      +--------+
      | 61     |
      +--------+
      

      Notice InnoDB preserved the zero bytes, while ColumnStore removed them.

      UTF16 compatibility (a possible future development direction)

      UTF16 is essential for East Asian scripts, because many East Asian characters fit into two bytes in UTF16 encoding, while the same characters require 3 bytes in UTF8 representation. Users can get a 33.3% performance gain by simply choosing UTF16 instead of UTF8 to store data.

      In UTF16, zero bytes are absolutely required. We need to get rid of the assumption that 0x00 means "end of the string".

      The order of characters less than space is different in MariaDB vs ColumnStore

      DROP TABLE IF EXISTS t1;
      CREATE TABLE t1 (a CHAR(4) CHARACTER SET latin1 COLLATE latin1_nopad_bin) ENGINE=InnoDB;
      INSERT INTO t1 VALUES ('a'),('a\t');
      SELECT HEX(a) FROM t1 ORDER BY a;
      

      +--------+
      | HEX(a) |
      +--------+
      | 6109   |
      | 61     |
      +--------+
      

      DROP TABLE IF EXISTS t1;
      CREATE TABLE t1 (a CHAR(4) CHARACTER SET latin1 COLLATE latin1_nopad_bin) ENGINE=ColumnStore;
      INSERT INTO t1 VALUES ('a'),('a\t');
      SELECT HEX(a) FROM t1 ORDER BY a;
      

      +--------+
      | HEX(a) |
      +--------+
      | 61     |
      | 6109   |
      +--------+
      

      Performance

      The code in colCompare() has to trim trailing zero bytes before passing the values to MariaDB collation library for comparison. This takes some CPU, useless in most cases, because in the majority cases a difference between two strings is found in the very beginning of the two strings. So the trailing spaces are even not compared.

      Proposal

      In order to make ColumnStore compatible with MariaDB with respect of zero bytes and trailing spaces, let's change the way how we pad short CHAR values: from zero byte padding to space padding.

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              bar Alexander Barkov
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.