Details
Task
Status: Closed
Minor
Resolution: Fixed
Description
There are some optimizations that could improve performance at no cost:
- Replace const char* markers with uint32_t offsets. This halves the memory needed for the markers, and it is safe because the markers are never at an offset greater than what fits into a 32-bit integer.
- Use the popcount instruction to preallocate the space that new markers need and use a pointer into the data to store them. This avoids the repeated std::vector::push_back() calls inside the loop that converts the bits into offsets.
- Remove use of static __m256i variables inside functions. The compiler does not optimize these away and they end up being initialized at runtime. Replacing the code to use a constexpr std::array enables the compiler to generate the arrays at compile time. A quick prototype shows that at least with GCC, it is generated as a vmovdqa ymm7, YMMWORD PTR .LC3[rip] instruction and the guard variable is not created.
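The first two ideas can be sketched together as follows. This is only an illustration, assuming GCC or Clang (for the __builtin_popcount and __builtin_ctz builtins) and a 32-bit bitmask per block; append_markers and its parameters are hypothetical names, not the existing code:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: stores the offset of every set bit in `bitmask`.
// `base` is the offset of the block's first byte within the SQL string.
inline void append_markers(std::vector<uint32_t>& markers, uint32_t bitmask, uint32_t base)
{
    const std::size_t old_size = markers.size();
    // popcount tells us up front how many markers this block produces.
    const std::size_t count = static_cast<std::size_t>(__builtin_popcount(bitmask));

    markers.resize(old_size + count);           // one size adjustment per block
    uint32_t* out = markers.data() + old_size;  // write through a raw pointer

    while (bitmask)
    {
        // Index of the lowest set bit is the byte position inside the block.
        *out++ = base + static_cast<uint32_t>(__builtin_ctz(bitmask));
        bitmask &= bitmask - 1;                 // clear the lowest set bit
    }
}
```

Note that std::vector::resize() still value-initializes the new elements, so a raw preallocated buffer would avoid even that work.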
There are also some micro-optimizations that could get rid of a few instructions.
- The code that checks whether the rightmost character of the previous block was an identifier character uses a logical OR instead of a bitwise OR. This seems to introduce an additional test instruction instead of just an or instruction:
bool rightmost_is_ident_char = pDigs[SIMD_BYTES - 1] || (ident_bitmask & 0x80000000);
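A possible bitwise form is sketched below for comparison. The function names are hypothetical, last_dig stands in for pDigs[SIMD_BYTES - 1] and is assumed here to be a byte that is zero or non-zero; whether the compiler actually drops the extra instruction would need to be checked in the generated assembly:

```cpp
#include <cstdint>

// Current form: || short-circuits, so the left operand is tested on its own.
inline bool rightmost_logical(uint8_t last_dig, uint32_t ident_bitmask)
{
    return last_dig || (ident_bitmask & 0x80000000);
}

// Sketch of the bitwise form: the top bit is shifted down to bit 0 and
// combined with the flag using |, then compared against zero once.
inline bool rightmost_bitwise(uint8_t last_dig, uint32_t ident_bitmask)
{
    return (static_cast<uint32_t>(last_dig) | (ident_bitmask >> 31)) != 0;
}
```

Both functions return the same value for every input; only the evaluation strategy differs.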
In addition to these, there is a possibility of a more costly optimization where we would pre-allocate space for all markers at the start of the marker creation. The downside of this is that it must assume the worst-case scenario where every character in the SQL string ends up generating a marker. This needs 4 times the size of the SQL string in bytes of memory, which without other optimizations is a theoretical maximum of 128MB per thread (64MB with pointers converted to uint32_t offsets). A smaller buffer could be allocated instead, on the assumption that the worst case is never reached, with the canonicalization aborted if it ever is.
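A rough sketch of such a fixed-size buffer is shown below. MarkerBuffer, worst_case_bytes and the chosen capacity are hypothetical and only illustrate the "preallocate once, abort on overflow" idea with uint32_t offsets:

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>

// Worst case with uint32_t offsets: one marker per character of the SQL string.
constexpr std::size_t worst_case_bytes(std::size_t sql_len)
{
    return sql_len * sizeof(uint32_t);
}

// Hypothetical fixed-capacity marker storage, allocated once up front.
class MarkerBuffer
{
public:
    explicit MarkerBuffer(std::size_t capacity)
        : m_data(new uint32_t[capacity])
        , m_capacity(capacity)
    {
    }

    // Returns false when the buffer is full; the caller would then
    // abort the canonicalization instead of growing the buffer.
    bool push(uint32_t offset)
    {
        if (m_size == m_capacity)
        {
            return false;
        }
        m_data[m_size++] = offset;
        return true;
    }

    void clear() { m_size = 0; }                 // reuse for the next statement
    std::size_t size() const { return m_size; }
    const uint32_t* data() const { return m_data.get(); }

private:
    std::unique_ptr<uint32_t[]> m_data;
    std::size_t m_capacity;
    std::size_t m_size = 0;
};
```

A capacity smaller than the worst case trades memory for the chance of aborting on marker-dense statements.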
This would also allow the bitmask-to-marker conversion to be unrolled, but given the relative rarity of markers being generated (about 1.6 bits in a bitmask of 256 bits), this might not be ideal. If short literals are used a lot (e.g. WHERE val IN (1, 2, 3, 4, 5, 6, 7)), this might be worth doing.