Joins involving string data types are implemented using a so-called Typeless join.
The idea is the following:
- 1. ExeMgr iterates over the rows in the small side RowGroup and packs them into a "Typeless" representation (i.e. the column values are written into a single byte array).
- 2. ExeMgr sends Typeless row representations over the network to PrimProc.
- 3. PrimProc receives the small side Typeless row representations and feeds them into a hash table.
- 4. PrimProc iterates through the large side RowGroup, converts every row into its Typeless representation, and looks that representation up in the hash table. A large side row is included in the join result set if it is found in the hash table populated by the small side rows.
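The four steps above can be sketched roughly as follows. This is a simplified model, not the actual ExeMgr/PrimProc code: `Row`, `Typeless`, `packTypeless`, `buildSmallSide` and `probeLargeSide` are hypothetical names, the length-prefixed byte layout is an assumption, and the network transfer of step 2 is omitted.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-ins: a Row is the list of join key column values,
// a Typeless value is the same data packed into one byte array.
using Row = std::vector<std::string>;
using Typeless = std::string;

// Step 1: pack the key columns into a single byte array.
// Assumed layout: 4-byte length (host byte order) followed by the bytes.
Typeless packTypeless(const Row& row)
{
  Typeless out;
  for (const auto& col : row)
  {
    assert(col.size() <= UINT32_MAX);
    const uint32_t len = static_cast<uint32_t>(col.size());
    out.append(reinterpret_cast<const char*>(&len), sizeof(len));
    out.append(col);
  }
  return out;
}

// Steps 1-3: the small side is packed and loaded into a hash table.
std::unordered_set<Typeless> buildSmallSide(const std::vector<Row>& smallSide)
{
  std::unordered_set<Typeless> table;
  for (const auto& row : smallSide)
    table.insert(packTypeless(row));
  return table;
}

// Step 4: every large side row is converted to Typeless before the lookup.
// This per-row conversion is the cost the ticket wants to remove.
std::vector<Row> probeLargeSide(const std::unordered_set<Typeless>& table,
                                const std::vector<Row>& largeSide)
{
  std::vector<Row> result;
  for (const auto& row : largeSide)
  {
    if (table.count(packTypeless(row)) != 0)
      result.push_back(row);
  }
  return result;
}
```

Note that `probeLargeSide` allocates and fills a temporary byte array for each large side row only to hash and compare it once.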
The 4th step is a problem. There is no need to convert the large side from RowGroup/Row format into Typeless format: the conversion is extra work for every large side row, and the Row representation can be used directly.
The underlying hash and comparison routines should be extended to understand both Typeless and Row formats, so that PrimProc can:
- Calculate the hash of the large side row directly on its Row representation
- Compare the small side Typeless representation directly to the large side Row representation
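A minimal sketch of the two routines listed above, under stated assumptions: `Row`, `Typeless`, `hashRow`, `equalTypelessRow` and `TypelessHashTable` are hypothetical names rather than existing PrimProc code, the packed layout is an assumed 4-byte length prefix per column, FNV-1a is used only because it can be fed incrementally, and comparison is bytewise (real string joins may also need collation handling). The key property is that hashing a Row must yield the same value as hashing its packed form, so the lookup lands in the same bucket without building a temporary byte array.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <utility>
#include <vector>

using Row = std::vector<std::string>;  // hypothetical: join key column values
using Typeless = std::string;          // hypothetical: packed byte array

constexpr uint64_t kFnvOffset = 14695981039346656037ULL;
constexpr uint64_t kFnvPrime = 1099511628211ULL;

// Incremental FNV-1a: feeding bytes in pieces gives the same result
// as feeding the concatenated buffer.
inline uint64_t fnv1a(uint64_t h, const char* data, size_t len)
{
  for (size_t i = 0; i < len; ++i)
  {
    h ^= static_cast<unsigned char>(data[i]);
    h *= kFnvPrime;
  }
  return h;
}

// Assumed packed layout: 4-byte length (host order) + bytes, per column.
Typeless packTypeless(const Row& row)
{
  Typeless out;
  for (const auto& col : row)
  {
    assert(col.size() <= UINT32_MAX);
    const uint32_t len = static_cast<uint32_t>(col.size());
    out.append(reinterpret_cast<const char*>(&len), sizeof(len));
    out.append(col);
  }
  return out;
}

uint64_t hashTypeless(const Typeless& t)
{
  return fnv1a(kFnvOffset, t.data(), t.size());
}

// Hash a Row directly, visiting the same bytes the packed form would contain.
uint64_t hashRow(const Row& row)
{
  uint64_t h = kFnvOffset;
  for (const auto& col : row)
  {
    const uint32_t len = static_cast<uint32_t>(col.size());
    h = fnv1a(h, reinterpret_cast<const char*>(&len), sizeof(len));
    h = fnv1a(h, col.data(), col.size());
  }
  return h;
}

// Compare a packed small side key with an unpacked large side Row.
bool equalTypelessRow(const Typeless& t, const Row& row)
{
  size_t pos = 0;
  for (const auto& col : row)
  {
    uint32_t len = 0;
    if (pos + sizeof(len) > t.size())
      return false;
    std::memcpy(&len, t.data() + pos, sizeof(len));
    pos += sizeof(len);
    if (len != col.size() || pos + len > t.size())
      return false;
    if (std::memcmp(t.data() + pos, col.data(), len) != 0)
      return false;
    pos += len;
  }
  return pos == t.size();
}

// Tiny chained hash table: stores Typeless keys, probes with Rows.
class TypelessHashTable
{
 public:
  explicit TypelessHashTable(size_t buckets = 64) : buckets_(buckets)
  {
    assert(buckets > 0);
  }
  void insert(Typeless t)
  {
    auto& bucket = buckets_[hashTypeless(t) % buckets_.size()];
    bucket.push_back(std::move(t));
  }
  bool containsRow(const Row& row) const
  {
    const auto& bucket = buckets_[hashRow(row) % buckets_.size()];
    for (const auto& t : bucket)
    {
      if (equalTypelessRow(t, row))
        return true;
    }
    return false;
  }

 private:
  std::vector<std::vector<Typeless>> buckets_;
};
```

In this model the small side is still inserted in packed form, exactly as it arrives from ExeMgr, while `containsRow` probes with the large side Row and never calls `packTypeless`.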
This change will be made on the PrimProc side. ExeMgr most likely won't change its behaviour in any way.