The in-memory representation of records consists of two parts:
- columnar data interface class RowGroup
- data-agnostic storage class RGData
RGData has a boost::shared_array<uint8_t> rowData member that is a 2D matrix in which each row represents one record in a group of records.
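To make the row-wise layout concrete, here is a minimal sketch. It assumes fixed-width columns and no header bytes; `RowMajorBuffer` and all of its members are hypothetical names used for illustration, not the actual RGData code, and `std::vector` stands in for `boost::shared_array`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

// Hypothetical stand-in for a row-major buffer such as RGData::rowData.
// Assumption: every column has a fixed byte width and there is no header.
struct RowMajorBuffer
{
  std::vector<uint32_t> colWidths;   // byte width of each column
  std::vector<uint32_t> colOffsets;  // byte offset of each column inside a record
  uint32_t rowSize = 0;              // bytes per record
  size_t rowCount = 0;
  std::vector<uint8_t> rowData;      // rowCount * rowSize bytes, records back to back

  RowMajorBuffer(std::vector<uint32_t> widths, size_t rows)
   : colWidths(std::move(widths)), rowCount(rows)
  {
    for (uint32_t w : colWidths)
    {
      colOffsets.push_back(rowSize);
      rowSize += w;
    }
    rowData.resize(rowCount * rowSize);
  }

  // Address of one value: consecutive values of a column are rowSize bytes apart.
  uint8_t* cell(size_t row, size_t col)
  {
    return rowData.data() + row * rowSize + colOffsets[col];
  }

  void setInt64(size_t row, size_t col, int64_t v)
  {
    assert(colWidths[col] == sizeof(v));
    std::memcpy(cell(row, col), &v, sizeof(v));
  }

  // Collecting a single column requires a strided walk over every record.
  std::vector<int64_t> gatherInt64Column(size_t col)
  {
    assert(colWidths[col] == sizeof(int64_t));
    std::vector<int64_t> out(rowCount);
    for (size_t r = 0; r < rowCount; ++r)
      std::memcpy(&out[r], cell(r, col), sizeof(int64_t));
    return out;
  }
};
```

In this sketch the values of one column are always `rowSize` bytes apart, so reading a single column touches the whole buffer.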
This layout is ineffective for vectorized processing, which in most cases needs the values of a single column to occupy contiguous memory. A columnar layout would also reduce the number of copy operations made in both EM/PP, e.g. the scanning/filtering code from primitives/linux-port could fill in a columnar buffer that is later handed over to an RGData instance to store it in the list. (It is worth noting that the row-wise layout might be an advantage for some SQL operators that need certain values to be in cache, e.g. GROUP BY, JOIN.)
The suggested change is to replace RGData::rowData with a std::vector<boost::shared_array<uint8_t>> columnData. This forces significant changes in the RowGroup interface class, which provides the get/set methods used to access the record data. Operations that were previously trivial will become more complex, e.g. RowGroup::copyRow(), Row::equals.
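The extra complexity can be illustrated with a minimal sketch of a columnar buffer. It assumes fixed-width columns; `ColumnarBuffer`, `copyRow` and `rowsEqual` below are hypothetical free-standing illustrations, not the existing RowGroup::copyRow() or Row::equals, and `std::vector<uint8_t>` stands in for `boost::shared_array<uint8_t>`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

// Hypothetical columnar buffer: one contiguous byte array per column.
// Assumption: every column has a fixed byte width.
struct ColumnarBuffer
{
  std::vector<uint32_t> colWidths;               // byte width of each column
  std::vector<std::vector<uint8_t>> columnData;  // columnData[c] holds rowCount * colWidths[c] bytes
  size_t rowCount;

  ColumnarBuffer(std::vector<uint32_t> widths, size_t rows)
   : colWidths(std::move(widths)), rowCount(rows)
  {
    for (uint32_t w : colWidths)
      columnData.emplace_back(rows * w);  // zero-initialized column
  }

  // Consecutive values of a column are adjacent in memory.
  uint8_t* cell(size_t row, size_t col)
  {
    return columnData[col].data() + row * colWidths[col];
  }
  const uint8_t* cell(size_t row, size_t col) const
  {
    return columnData[col].data() + row * colWidths[col];
  }
};

// With a row-major buffer this could be a single memcpy of the record;
// here it becomes one small copy per column.
void copyRow(const ColumnarBuffer& src, size_t srcRow, ColumnarBuffer& dst, size_t dstRow)
{
  assert(src.colWidths == dst.colWidths);
  for (size_t c = 0; c < src.colWidths.size(); ++c)
    std::memcpy(dst.cell(dstRow, c), src.cell(srcRow, c), src.colWidths[c]);
}

// Byte-wise equality also has to visit every column separately.
bool rowsEqual(const ColumnarBuffer& a, size_t rowA, const ColumnarBuffer& b, size_t rowB)
{
  assert(a.colWidths == b.colWidths);
  for (size_t c = 0; c < a.colWidths.size(); ++c)
    if (std::memcmp(a.cell(rowA, c), b.cell(rowB, c), a.colWidths[c]) != 0)
      return false;
  return true;
}
```

Under these assumptions a column scan becomes a walk over one contiguous array, while every whole-row operation turns into a loop over columns.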
The change affects the class Row::Pointer, which is a uint8_t* into RGData::rowData plus optional StringStore and UserStore pointers. It is widely used as a key in some distinct maps in sorting (dbcon/joblist/limitedorderby.cpp), aggregation (dbcon/joblist/groupconcat.cpp, dbcon/joblist/tupleaggregatestep.cpp), window functions (utils/windowfunction/*), and joins (utils/joiner/tuplejoiner.cpp). This might be the biggest design challenge of the change.
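One possible direction, shown here only as a sketch, is to key such maps on a (columns, row index) handle instead of a raw byte pointer. Everything below is hypothetical: `Columns`, `RowRef`, `RowRefHash`, `RowRefEq` and `countDistinctRows` are illustrative names, columns are assumed to be fixed-width, keys are assumed to be compared by row contents, and StringStore/UserStore are left out entirely.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <string_view>
#include <unordered_set>
#include <vector>

// Hypothetical columnar storage: data[c] holds the bytes of column c.
struct Columns
{
  std::vector<uint32_t> widths;            // byte width of each column
  std::vector<std::vector<uint8_t>> data;  // one contiguous array per column
};

// Hypothetical replacement for a raw row pointer: which storage, which row.
struct RowRef
{
  const Columns* cols;
  size_t row;
};

// Hashes the row contents column by column and combines the results.
struct RowRefHash
{
  size_t operator()(const RowRef& r) const
  {
    size_t h = 0;
    for (size_t c = 0; c < r.cols->widths.size(); ++c)
    {
      const uint32_t w = r.cols->widths[c];
      const char* p = reinterpret_cast<const char*>(r.cols->data[c].data() + r.row * w);
      h ^= std::hash<std::string_view>{}(std::string_view(p, w)) + 0x9e3779b9 + (h << 6) + (h >> 2);
    }
    return h;
  }
};

// Compares two rows byte-wise; assumes both sides share the same column widths.
struct RowRefEq
{
  bool operator()(const RowRef& a, const RowRef& b) const
  {
    for (size_t c = 0; c < a.cols->widths.size(); ++c)
    {
      const uint32_t w = a.cols->widths[c];
      if (std::memcmp(a.cols->data[c].data() + a.row * w, b.cols->data[c].data() + b.row * w, w) != 0)
        return false;
    }
    return true;
  }
};

// Example use: count distinct rows with the handle as the set key.
size_t countDistinctRows(const Columns& cols, size_t rowCount)
{
  std::unordered_set<RowRef, RowRefHash, RowRefEq> seen;
  for (size_t r = 0; r < rowCount; ++r)
    seen.insert(RowRef{&cols, r});
  return seen.size();
}
```

The handle is larger than a single pointer and every hash or comparison touches one array per column, which is part of why this piece of the change is expected to be difficult.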
The change must be hidden behind the existing RowGroup interface, so that by the end of this issue there are few changes to the code that uses RowGroup or RGData. Some are inevitable, though.