[MXS-4590] Minor SIMD canonicalization optimizations Created: 2023-04-20  Updated: 2023-05-25  Resolved: 2023-05-25

Status: Closed
Project: MariaDB MaxScale
Component/s: Core
Affects Version/s: None
Fix Version/s: 23.08.0

Type: Task Priority: Minor
Reporter: markus makela Assignee: markus makela
Resolution: Fixed Votes: 0
Labels: None


 Description   

There are some optimizations that can potentially improve the performance at no cost.

  1. Replace const char* markers with uint32_t. Cuts the amount of memory needed for the markers in half as we know the markers are never at an offset greater than what can fit into a 32-bit integer.
  2. Use the popcount instruction to preallocate the space that new markers need and use a pointer into the data to store them. This avoids the repeated std::vector::push_back() calls inside the loop that converts the bits into offsets.
  3. Remove the use of static __m256i variables inside functions. The compiler does not optimize these away, so they end up being initialized at runtime behind a guard variable. Changing the code to use a constexpr std::array enables the compiler to generate the arrays at compile time. A quick prototype shows that, at least with GCC, the load is generated as a vmovdqa ymm7, YMMWORD PTR .LC3[rip] instruction and the guard variable is not created.
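
A minimal sketch of how items 1 and 2 could fit together, assuming a 32-bit bitmask per 32-byte block and the GCC/Clang builtins __builtin_popcount and __builtin_ctz. The function name append_markers and its parameters are hypothetical and are not taken from the MaxScale source:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: appends the offset of every set bit in `bitmask`
// to `markers`. `block_offset` is the offset of the block's first byte
// within the SQL string. Markers are uint32_t offsets, not const char*.
inline void append_markers(std::vector<uint32_t>& markers,
                           uint32_t bitmask,
                           uint32_t block_offset)
{
    // popcount gives the exact number of markers this block produces,
    // so the vector is grown once instead of once per marker.
    const size_t n = static_cast<size_t>(__builtin_popcount(bitmask));
    const size_t old_size = markers.size();
    markers.resize(old_size + n);

    // Write through a raw pointer into the newly reserved tail.
    uint32_t* out = markers.data() + old_size;

    while (bitmask)
    {
        // The index of the lowest set bit is the byte position in the block.
        *out++ = block_offset + static_cast<uint32_t>(__builtin_ctz(bitmask));
        bitmask &= bitmask - 1;     // clear the lowest set bit
    }

    assert(out == markers.data() + markers.size());
}
```

Note that std::vector::resize() zero-fills the new elements before they are overwritten, so this sketch trades the per-marker push_back() for one resize per block rather than removing all overhead.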

There are also some micro-optimizations that could get rid of a few instructions.

  1. The code that checks whether the rightmost character of the previous block was an identifier character uses a logical OR instead of a bitwise OR. This seems to introduce an additional test instruction instead of just an or instruction:

    bool rightmost_is_ident_char = pDigs[SIMD_BYTES - 1] || (ident_bitmask & 0x80000000);
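
For comparison, a sketch of the two forms side by side. It assumes pDigs points to an array of bytes and that SIMD_BYTES is 32; both are guesses based on the line above, and the function names are hypothetical:

```cpp
#include <cassert>
#include <cstdint>

constexpr int SIMD_BYTES = 32;  // assumed block size

// Logical OR, as in the line above: short-circuit evaluation.
inline bool rightmost_logical(const unsigned char* pDigs, uint32_t ident_bitmask)
{
    return pDigs[SIMD_BYTES - 1] || (ident_bitmask & 0x80000000);
}

// Bitwise OR: both operands are always evaluated and combined with a
// single OR, followed by one comparison against zero.
inline bool rightmost_bitwise(const unsigned char* pDigs, uint32_t ident_bitmask)
{
    return (pDigs[SIMD_BYTES - 1] | (ident_bitmask >> 31)) != 0;
}

// Debug-only check that the two forms agree for the given input.
inline bool rightmost_checked(const unsigned char* pDigs, uint32_t ident_bitmask)
{
    const bool result = rightmost_bitwise(pDigs, ident_bitmask);
    assert(result == rightmost_logical(pDigs, ident_bitmask));
    return result;
}
```

Whether the bitwise form actually saves an instruction depends on the compiler and optimization level, so the generated assembly should be compared before relying on it.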

In addition to these, there is a possibility of a more costly optimization where we would pre-allocate space for all markers at the start of the marker creation. The downside of this is that it must assume the worst-case scenario where every character in the SQL string ends up generating a marker. This needs memory equal to 4 times the size of the SQL string in bytes, which without other optimizations is a theoretical maximum of 128MB per thread (64MB with pointers converted to uint32_t offsets). This could of course be bounded by assuming that the worst case is never reached and aborting the canonicalization if it ever is.
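
One way the bounded variant could look, as a sketch only: a fixed-capacity buffer whose append reports failure so that the caller can abort the canonicalization. The class MarkerBuffer and its interface are hypothetical, and the capacity policy is left to the caller:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

// Hypothetical fixed-capacity marker buffer, allocated once up front.
class MarkerBuffer
{
public:
    explicit MarkerBuffer(size_t capacity)
        : m_data(new uint32_t[capacity])
        , m_capacity(capacity)
    {
    }

    void clear() { m_size = 0; }
    size_t size() const { return m_size; }
    const uint32_t* data() const { return m_data.get(); }

    // Appends the offsets of the set bits in `bitmask`. Returns false,
    // without writing anything, if the markers would not fit; the caller
    // is then expected to abort the canonicalization.
    bool append(uint32_t bitmask, uint32_t block_offset)
    {
        const size_t n = static_cast<size_t>(__builtin_popcount(bitmask));
        assert(m_size <= m_capacity);

        if (n > m_capacity - m_size)
        {
            return false;
        }

        uint32_t* out = m_data.get() + m_size;

        while (bitmask)
        {
            *out++ = block_offset + static_cast<uint32_t>(__builtin_ctz(bitmask));
            bitmask &= bitmask - 1;     // clear the lowest set bit
        }

        m_size += n;
        return true;
    }

private:
    std::unique_ptr<uint32_t[]> m_data;
    size_t m_capacity;
    size_t m_size = 0;
};
```

With this shape the buffer never reallocates during marker creation, and a capacity smaller than the true worst case only costs an aborted canonicalization for pathological inputs.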

This would also allow the bitmask-to-marker conversion to be unrolled, but given the relative rarity of markers being generated (about 1.6 bits in a bitmask of 256 bits), this might not be ideal. If short literals are used a lot (e.g. WHERE val IN (1, 2, 3, 4, 5, 6, 7)), this might be worth doing.
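
A sketch of what an unrolled, branch-free conversion might look like, assuming a 32-bit bitmask and an output buffer with at least 32 free slots. The function name bits_to_offsets is hypothetical:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical branch-free conversion. Every iteration stores a candidate
// offset, but the output pointer only advances when the bit is set, so
// unset bits are overwritten by the next store. `out` must have room for
// 32 entries because stores happen regardless of the bit values.
inline uint32_t* bits_to_offsets(uint32_t bitmask, uint32_t block_offset, uint32_t* out)
{
    assert(out != nullptr);

    // Fixed trip count with no data-dependent branch, which leaves the
    // compiler free to unroll the loop fully.
    for (uint32_t i = 0; i < 32; ++i)
    {
        *out = block_offset + i;
        out += (bitmask >> i) & 1u;
    }

    return out;     // one past the last marker written
}
```

This always performs 32 stores per bitmask, which is why it only pays off when markers are dense; with few bits set, a loop over the set bits does far less work.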

