  1. MariaDB MaxScale
  2. MXS-4590

Minor SIMD canonicalization optimizations

Details

    • Task
    • Status: Closed
    • Minor
    • Resolution: Fixed
    • None
    • 23.08.0
    • Core
    • None

    Description

      A few optimizations could improve canonicalization performance at essentially no cost.

      1. Replace const char* markers with uint32_t offsets. With 64-bit pointers this halves the memory needed for the markers, and it is safe because a marker is never at an offset greater than what fits into a 32-bit integer.
      2. Use the popcount instruction to preallocate the space that new markers need and use a pointer into the data to store them. This avoids the repeated std::vector::push_back() calls inside the loop that converts the bits into offsets.
      3. Remove use of static __m256i variables inside functions. The compiler does not optimize these away and they end up being initialized at runtime. Replacing the code to use a constexpr std::array enables the compiler to generate the arrays at compile time. A quick prototype shows that at least with GCC, it is generated as a vmovdqa ymm7, YMMWORD PTR .LC3[rip] instruction and the guard variable is not created.
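The first two items can be sketched together. This is only an illustration, not the MaxScale implementation: `append_markers`, its parameters and the use of `std::vector<uint32_t>` as marker storage are hypothetical, and `__builtin_popcount` / `__builtin_ctz` are GCC/Clang builtins chosen for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not MaxScale code: appends the offset of every set bit
// in `bitmask` to `markers`. Bit i corresponds to the byte at `block_offset + i`.
inline void append_markers(std::vector<uint32_t>& markers, uint32_t bitmask, uint32_t block_offset)
{
    // GCC/Clang builtin; can compile to a popcnt instruction if the target supports it.
    const std::size_t n_new = static_cast<std::size_t>(__builtin_popcount(bitmask));
    const std::size_t old_size = markers.size();

    // Grow once per block instead of calling push_back() once per set bit.
    markers.resize(old_size + n_new);
    uint32_t* out = markers.data() + old_size;

    while (bitmask)
    {
        // Index of the lowest set bit, stored as a 32-bit offset rather than a const char*.
        *out++ = block_offset + static_cast<uint32_t>(__builtin_ctz(bitmask));
        bitmask &= bitmask - 1;     // clear the lowest set bit
    }

    // Exactly popcount(bitmask) offsets must have been written.
    assert(out == markers.data() + markers.size());
}
```

Note that `resize()` value-initializes the new elements before they are overwritten; whether that cost matters here would need to be measured.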

      There are also some micro-optimizations that could get rid of a few instructions.

      1. The code that checks whether the rightmost character of the previous block was an identifier character uses a logical OR instead of a bitwise OR. This seems to introduce an additional test instruction instead of just an or instruction:

        bool rightmost_is_ident_char = pDigs[SIMD_BYTES - 1] || (ident_bitmask & 0x80000000);
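As a rough sketch of the difference, the two variants below return the same result for all inputs, but only the logical form requires short-circuit evaluation. The function names, the `uint8_t` element type of `pDigs` and the value of `SIMD_BYTES` are assumptions made for the example; the instructions actually generated depend on the compiler and its flags.

```cpp
#include <cstdint>

constexpr int SIMD_BYTES = 32;  // assumed: one 256-bit block

// Logical OR: the right-hand side is evaluated only if the left-hand side is false.
inline bool rightmost_logical(const uint8_t* pDigs, uint32_t ident_bitmask)
{
    return pDigs[SIMD_BYTES - 1] || (ident_bitmask & 0x80000000);
}

// Bitwise OR: both operands are combined first, followed by a single nonzero check.
inline bool rightmost_bitwise(const uint8_t* pDigs, uint32_t ident_bitmask)
{
    return (pDigs[SIMD_BYTES - 1] | (ident_bitmask & 0x80000000)) != 0;
}
```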

      In addition to these, there is a possibility of a more costly optimization where we would pre-allocate space for all markers at the start of the marker creation. The downside of this is that it must assume the worst-case scenario where every character in the SQL string ends up generating a marker. With uint32_t offsets this needs 4 times the size of the SQL string in bytes of memory (8 times with 64-bit pointers), which is a theoretical maximum of 128MB per thread without the other optimizations and 64MB with pointers converted to uint32_t offsets. This could of course be optimized by assuming that the worst case is never reached and aborting the canonicalization if it ever is.
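The worst-case figures can be checked with a small compile-time calculation. The 16 MiB maximum SQL length is an assumption inferred from the 128MB and 64MB figures above (the description does not state it), and `worst_case_bytes` is a hypothetical helper.

```cpp
#include <cstddef>

constexpr std::size_t MiB = 1024 * 1024;

// Assumption: the SQL string is at most 16 MiB, the limit implied by the figures in the text.
constexpr std::size_t MAX_SQL_LEN = 16 * MiB;

// Worst case: every byte of the SQL string produces one marker.
constexpr std::size_t worst_case_bytes(std::size_t sql_len, std::size_t marker_size)
{
    return sql_len * marker_size;
}

// 8-byte markers (const char* on a 64-bit platform).
static_assert(worst_case_bytes(MAX_SQL_LEN, 8) == 128 * MiB, "pointer markers");
// 4-byte markers (uint32_t offsets).
static_assert(worst_case_bytes(MAX_SQL_LEN, 4) == 64 * MiB, "offset markers");
```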

      This would also allow the bitmask-to-marker conversion to be unrolled, but given the relative rarity of markers being generated (roughly 1.6 bits in a bitmask of 256 bits), this might not be ideal. If short literals are used a lot (e.g. WHERE val IN (1, 2, 3, 4, 5, 6, 7)), this might be worth doing.
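One possible shape for such an unrolled conversion is a fixed-trip-count loop that stores unconditionally and advances the output pointer only when the bit is set. This is a sketch of a common branchless technique, not code from MaxScale: `bits_to_offsets_fixed` is a hypothetical name, and it handles one 32-bit mask at a time as an assumption. It relies on the pre-allocated buffer, since every iteration writes to `out` whether or not the bit is set.

```cpp
#include <cstdint>

// Hypothetical helper: converts a 32-bit mask into offsets with a fixed trip count,
// which makes the loop easy for the compiler to unroll.
// `out` must have room for 32 entries because every iteration stores unconditionally.
// Returns a pointer one past the last offset that was kept.
inline uint32_t* bits_to_offsets_fixed(uint32_t bitmask, uint32_t block_offset, uint32_t* out)
{
    for (uint32_t i = 0; i < 32; ++i)
    {
        *out = block_offset + i;        // always store the candidate offset...
        out += (bitmask >> i) & 1u;     // ...but keep it only if bit i is set
    }
    return out;
}
```

With sparse masks this does 32 stores to keep one or two offsets, which is the trade-off the description points out; it would only pay off when most blocks contain many markers.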

          People

            markus makela