[MXS-4590] Minor SIMD canonicalization optimizations - Jira

Details

Type: Task
Status: Closed (View Workflow)
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 23.08.0
Component/s: Core
Labels: None

Description

There are some optimizations that can potentially improve the performance at no cost.

Replace const char* markers with uint32_t offsets. This cuts the memory needed for the markers in half, as we know the markers are never at an offset greater than what fits into a 32-bit integer.
Use the popcount instruction to preallocate the space that new markers need and use a pointer into the data to store them. This avoids the repeated std::vector::push_back() calls inside the loop that converts the bits into offsets.
Remove the use of static __m256i variables inside functions. The compiler does not optimize these away and they end up being initialized at runtime. Rewriting the code to use a constexpr std::array enables the compiler to generate the arrays at compile time. A quick prototype shows that, at least with GCC, the load is generated as a vmovdqa ymm7, YMMWORD PTR .LC3[rip] instruction and the guard variable is not created.
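The first two items can be illustrated together. This is only a minimal sketch under assumptions: the function name append_markers, its parameters and the 32-bit bitmask width are hypothetical and not taken from the actual code, and __builtin_popcount and __builtin_ctz are GCC/Clang builtins rather than portable C++.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not the real MaxScale function): stores the offset of
// every set bit in `bitmask` as a uint32_t marker. `base` is the offset of the
// first byte of the block that the bitmask describes.
inline void append_markers(std::vector<uint32_t>& markers, uint32_t bitmask, uint32_t base)
{
    const std::size_t old_size = markers.size();

    // A single popcount gives the number of markers this block produces, so
    // the vector is grown once instead of once per push_back().
    const std::size_t count = static_cast<std::size_t>(__builtin_popcount(bitmask));
    markers.resize(old_size + count);

    // Write through a raw pointer into the newly reserved tail.
    uint32_t* out = markers.data() + old_size;

    while (bitmask != 0)
    {
        // Index of the lowest set bit, converted into an offset.
        *out++ = base + static_cast<uint32_t>(__builtin_ctz(bitmask));
        // Clear the lowest set bit.
        bitmask &= bitmask - 1;
    }
}
```

Note that resize() value-initializes the new elements, so this sketch trades the per-element capacity checks of push_back() for one zero-fill of the tail.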

There are also some micro-optimizations that could get rid of a few instructions.

The code that checks if the rightmost character of the previous block was an identifier character uses a logical OR instead of a bitwise OR. This seems to introduce an additional test instruction instead of just an or instruction:
bool rightmost_is_ident_char = pDigs[SIMD_BYTES - 1] || (ident_bitmask & 0x80000000);
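A sketch of the two variants side by side, assuming that pDigs points to a block of SIMD_BYTES bytes and that SIMD_BYTES is 32. The function names and the uint8_t element type are assumptions made for illustration. Both forms return the same value; whether the bitwise form actually saves an instruction depends on the compiler and the surrounding code.

```cpp
#include <cstdint>

// Assumed block width: 32 bytes for a 256-bit register.
constexpr int SIMD_BYTES = 32;

// Logical OR, as in the quoted line: the right-hand side is only evaluated
// when the left-hand side is zero (short-circuit evaluation).
inline bool rightmost_is_ident_logical(const uint8_t* pDigs, uint32_t ident_bitmask)
{
    return pDigs[SIMD_BYTES - 1] || (ident_bitmask & 0x80000000);
}

// Bitwise OR: both operands are always evaluated and combined, and the result
// is compared against zero once.
inline bool rightmost_is_ident_bitwise(const uint8_t* pDigs, uint32_t ident_bitmask)
{
    return (pDigs[SIMD_BYTES - 1] | (ident_bitmask & 0x80000000)) != 0;
}
```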

In addition to these, there is a possibility of a more costly optimization where we would pre-allocate space for all markers at the start of the marker creation. The downside of this is that it must assume the worst-case scenario where every character in the SQL string ends up generating a marker. This needs 4 times the size of the SQL string in bytes of memory, which without other optimizations is a theoretical maximum of 128MB per thread (64MB with pointers converted to uint32_t offsets). This could of course be optimized by assuming that the worst case is never reached and aborting the canonicalization if it ever is.
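The "assume the worst case is never reached and abort if it is" variant could look roughly like the following. This is a sketch only: the MarkerBuffer class, its methods and the way the capacity is chosen are hypothetical, and the GCC/Clang builtins are used for the bit operations.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>

// Hypothetical fixed-capacity marker storage that is allocated once up front.
class MarkerBuffer
{
public:
    explicit MarkerBuffer(std::size_t capacity)
        : m_data(new uint32_t[capacity])
        , m_capacity(capacity)
    {
    }

    // Stores the offsets of all set bits in `bitmask`, relative to `base`.
    // Returns false without storing anything if they would not fit, in which
    // case the caller is expected to abort the canonicalization.
    bool append(uint32_t bitmask, uint32_t base)
    {
        const std::size_t count = static_cast<std::size_t>(__builtin_popcount(bitmask));

        if (m_size + count > m_capacity)
        {
            return false;
        }

        uint32_t* out = m_data.get() + m_size;

        while (bitmask != 0)
        {
            *out++ = base + static_cast<uint32_t>(__builtin_ctz(bitmask));
            bitmask &= bitmask - 1;     // Clear the lowest set bit
        }

        m_size += count;
        return true;
    }

    std::size_t size() const { return m_size; }
    const uint32_t* data() const { return m_data.get(); }
    void clear() { m_size = 0; }    // Reuse the same allocation for the next query

private:
    std::unique_ptr<uint32_t[]> m_data;
    std::size_t m_capacity;
    std::size_t m_size = 0;
};
```

With a capacity smaller than the worst case, a query that generates too many markers makes append() return false instead of growing the buffer.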

This would also allow the bitmask-to-marker conversion to be unrolled, but given the relative rarity of markers being generated (about 1.6 bits in a bitmask of 256 bits), this might not be ideal. If short literals are used a lot (e.g. WHERE val IN (1, 2, 3, 4, 5, 6, 7)), this might be worth doing.

People

Assignee: markus makela

Reporter: markus makela

Votes: 0

Watchers: 1

Dates

Created: 2023-04-20 11:19

Updated: 2023-05-25 07:33

Resolved: 2023-05-25 07:33
