[MCOL-4791] Fix ColumnCommand fudged data type format to clearly identify CHAR vs VARCHAR Created: 2021-07-02  Updated: 2023-07-01

Status: Stalled
Project: MariaDB ColumnStore
Component/s: ExeMgr, PrimProc
Affects Version/s: 6.1.1
Fix Version/s: 23.10

Type: Task Priority: Major
Reporter: Alexander Barkov Assignee: Roman
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Blocks
blocks MCOL-4691 Major Regression: Selects with aggreg... Closed
is blocked by MCOL-4823 WHERE varchar_col<char_col returns a ... Closed

 Description   

Under terms of MCOL-4691 we're going to replace the 0-terminated representation of short VARCHAR columns in the RowGroup format with:

  • One byte length
  • Followed by the actual string data

This will remove a lot of strnlen() calls used e.g. in the row aggregation code.
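
A minimal sketch of the two layouts, for illustration only. The helper names and the fixed-width slot model are assumptions and not actual RowGroup code; the old-format reader uses std::memchr to do the same scan that strnlen() performs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical helpers for illustration; these are not ColumnStore functions.

// Old layout: string data padded with 0 bytes. The length has to be
// rediscovered on every read by scanning for the first 0 byte.
inline std::string readZeroTerminated(const uint8_t* slot, size_t slotWidth)
{
  const void* zero = std::memchr(slot, 0, slotWidth);
  const size_t len =
      zero ? static_cast<size_t>(static_cast<const uint8_t*>(zero) - slot) : slotWidth;
  return std::string(reinterpret_cast<const char*>(slot), len);
}

// New layout: [1-byte length][string data]. Returns false if the value
// does not fit into the slot together with its length byte.
inline bool writeLengthPrefixed(uint8_t* slot, size_t slotWidth, const std::string& s)
{
  if (s.size() > 255 || s.size() + 1 > slotWidth)
    return false;
  slot[0] = static_cast<uint8_t>(s.size());
  std::memcpy(slot + 1, s.data(), s.size());
  return true;
}

// Reading the new layout needs no scan: the length is the first byte.
inline std::string readLengthPrefixed(const uint8_t* slot)
{
  return std::string(reinterpret_cast<const char*>(slot + 1), slot[0]);
}
```

With the length stored up front, a reader such as the aggregation code can take the length from the first byte instead of scanning the slot.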

To make the format change easier, we need to clearly distinguish CHAR from VARCHAR on the PrimProc side.

Currently this is not possible, because ExeMgr sends the data type in ColumnCommand in a "fudged" format, as follows:

ExeMgr Real Type PrimProc Fudged Type  PrimProc isDict
---------------- --------------------  ---------------
VARCHAR(1)       VARCHAR(2)            false
VARCHAR(2)       VARCHAR(4)            false
VARCHAR(3)       VARCHAR(4)            false
VARCHAR(4)       CHAR(8)               false
VARCHAR(5)       CHAR(8)               false
VARCHAR(6)       CHAR(8)               false
VARCHAR(7)       CHAR(8)               false
VARCHAR(8)       VARCHAR(8)            true
VARCHAR(9)       VARCHAR(8)            true
VARCHAR(255)     VARCHAR(8)            true
VARCHAR(8000)    VARCHAR(8)            true
 
CHAR(1)          CHAR(1)               false
CHAR(2)          CHAR(2)               false
CHAR(3)          CHAR(4)               false
CHAR(4)          CHAR(4)               false
CHAR(5)          CHAR(8)               false
CHAR(6)          CHAR(8)               false
CHAR(7)          CHAR(8)               false
CHAR(8)          CHAR(8)               false
CHAR(9)          VARCHAR(8)            true
CHAR(255)        VARCHAR(8)            true

The current notation uses VARCHAR(8) to mean "a CHAR or VARCHAR dictionary column", no matter what the original data type is (CHAR or VARCHAR).
Additionally, VARCHAR(4)..VARCHAR(7) are tweaked when sent: PrimProc sees them as CHAR(8).
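
The current mapping can be reproduced by a small function, sketched below. The type and function names are hypothetical, and the "VARCHAR needs one extra byte, then round up to a power of two" rule is only an assumption that happens to match the table; it is not taken from the ColumnStore sources. The isDict test mirrors the VARCHAR(8) check mentioned in this issue.

```cpp
#include <cassert>

// Hypothetical types for illustration; not the real ColumnCommand structures.
enum class StrType { Char, Varchar };

struct FudgedType
{
  StrType type;
  unsigned width;
};

inline unsigned roundUpPow2(unsigned n)
{
  unsigned w = 1;
  while (w < n)
    w *= 2;
  return w;
}

// Current ExeMgr -> PrimProc mapping, as listed in the table above.
inline FudgedType currentFudge(StrType real, unsigned n)
{
  // Assumption: a short VARCHAR occupies n + 1 bytes, a CHAR occupies n bytes.
  const unsigned bytes = (real == StrType::Varchar) ? n + 1 : n;
  if (bytes > 8)
    return {StrType::Varchar, 8};  // dictionary column, CHAR or VARCHAR alike
  const unsigned w = roundUpPow2(bytes);
  if (real == StrType::Varchar && w == 8)
    return {StrType::Char, 8};  // the VARCHAR(4)..VARCHAR(7) tweak
  return {real, w};
}

// Current PrimProc-side detection: VARCHAR(8) means "dictionary column".
inline bool currentIsDict(const FudgedType& t)
{
  return t.type == StrType::Varchar && t.width == 8;
}
```

For example, currentFudge(StrType::Char, 255) and currentFudge(StrType::Varchar, 255) return the same value, so PrimProc cannot recover the original type.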

Under terms of this task we'll change the code as follows:

  • PrimProc will see the exact ExeMgr side data type: true CHAR or true VARCHAR.
  • isDict will be serialized and deserialized (currently it's detected on the PrimProc side by testing the data type against VARCHAR(8)).
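
A rough sketch of what carrying isDict on the wire could look like. The message struct, the byte layout, and the use of std::vector as a buffer are all assumptions made for illustration; the real ColumnCommand serialization is not shown in this issue and will differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical wire representation; not the real ColumnCommand format.
enum class WireStrType : uint8_t { Char = 0, Varchar = 1 };

struct WireColType
{
  WireStrType type;
  uint32_t width;
  bool isDict;  // sent explicitly instead of being inferred from VARCHAR(8)
};

// Layout: 1 byte type, 4 bytes width (little-endian), 1 byte isDict.
inline void serialize(const WireColType& t, std::vector<uint8_t>& out)
{
  out.push_back(static_cast<uint8_t>(t.type));
  for (int shift = 0; shift < 32; shift += 8)
    out.push_back(static_cast<uint8_t>((t.width >> shift) & 0xFF));
  out.push_back(t.isDict ? 1 : 0);
}

// Returns false on a short buffer or an unknown type code.
inline bool deserialize(const std::vector<uint8_t>& in, WireColType& t)
{
  if (in.size() < 6 || in[0] > 1)
    return false;
  t.type = static_cast<WireStrType>(in[0]);
  t.width = 0;
  for (int i = 0; i < 4; ++i)
    t.width |= static_cast<uint32_t>(in[1 + i]) << (8 * i);
  t.isDict = in[5] != 0;
  return true;
}
```

With the flag sent explicitly, a dictionary CHAR column can arrive as CHAR(8) with isDict set, which the VARCHAR(8) test alone could not express.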

The new fudged data type mapping will look as follows:

ExeMgr Real Type PrimProc Fudged Type  PrimProc isDict
---------------- --------------------  ---------------
VARCHAR(1)       VARCHAR(2)            false
VARCHAR(2)       VARCHAR(4)            false
VARCHAR(3)       VARCHAR(4)            false
VARCHAR(4)       VARCHAR(8)            false
VARCHAR(5)       VARCHAR(8)            false
VARCHAR(6)       VARCHAR(8)            false
VARCHAR(7)       VARCHAR(8)            false
VARCHAR(8)       VARCHAR(8)            true
VARCHAR(9)       VARCHAR(8)            true
VARCHAR(255)     VARCHAR(8)            true
VARCHAR(8000)    VARCHAR(8)            true
 
CHAR(1)          CHAR(1)               false
CHAR(2)          CHAR(2)               false
CHAR(3)          CHAR(4)               false
CHAR(4)          CHAR(4)               false
CHAR(5)          CHAR(8)               false
CHAR(6)          CHAR(8)               false
CHAR(7)          CHAR(8)               false
CHAR(8)          CHAR(8)               false
CHAR(9)          CHAR(8)               true
CHAR(255)        CHAR(8)               true
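
The new mapping can be sketched the same way. As before, the names are hypothetical and the "n + 1 bytes for VARCHAR, round up to a power of two, cap at 8" rule is an assumption chosen only because it reproduces the table above. The difference from the current behaviour is that the type is never changed and isDict is carried as its own field.

```cpp
#include <cassert>

// Hypothetical types for illustration; not the real ColumnCommand structures.
enum class StrType { Char, Varchar };

struct NewFudgedType
{
  StrType type;    // always the real ExeMgr-side type
  unsigned width;  // still fudged: 1, 2, 4 or 8
  bool isDict;     // explicit flag, no longer inferred from VARCHAR(8)
};

inline unsigned roundUpPow2(unsigned n)
{
  unsigned w = 1;
  while (w < n)
    w *= 2;
  return w;
}

// Proposed ExeMgr -> PrimProc mapping, as listed in the table above.
inline NewFudgedType newFudge(StrType real, unsigned n)
{
  // Assumption: a short VARCHAR occupies n + 1 bytes, a CHAR occupies n bytes.
  const unsigned bytes = (real == StrType::Varchar) ? n + 1 : n;
  if (bytes > 8)
    return {real, 8, true};  // dictionary column keeps its real type
  return {real, roundUpPow2(bytes), false};
}
```

Here newFudge(StrType::Varchar, 7) and newFudge(StrType::Varchar, 8) both report VARCHAR(8) and differ only in isDict, which is why the flag has to be sent rather than detected.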



 Comments   
Comment by Roman [ 2021-07-13 ]

The patch breaks dict columns filtering, e.g.

{format}
select count(*) from lineitem where l_comment > l_shipinstruct;{format}