Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-27266

Improve UCA collation performance for utf8mb3 and utf8mb4




      Recently in MDEV-26572 we significantly improved performance of some simple multi-byte collations on the ASCII range. The idea of the improvement was to handle multiple ASCII characters (4 or 8) at the same time.

      Similar style of improvement can be done for UCA collations for utf8mb3 and utf8mb4.

      It's hard to handle 4 or 8 bytes at the same time, because UCA is much more complex than simple collations improved in MDEV-26572. However, it's possible to handle at least 2 bytes at the same time. It will improve performance for:

      • 2-byte sequences representing two consequent ASCII characters
      • 2-byte sequences representing a single 2-byte character (such as accented Latin letters, Greek, Cyrillic, Armenian, Hebrew, Arabic).

      Performance improvement, level 1:

      For every bytes pair [00..FF][00..FF] which:

      • a. consists of two ASCII characters or makes a well-formed two-byte character
      • b. whose total weight string fits into 4 weights (concatenated weight string in case of two ASCII characters, or a single weight string in case of a two-byte character)
      • c. whose weight is context independent (i.e. does not depend on contractions or previous context pairs)

      let's store weights in a new separate array of 64K elements of a new data type MY_UCA_2BYTES_ITEM, defined as follows:

      #define MY_UCA_2BYTES_MAX_WEIGHT_SIZE (4+1) /* Including 0 terminator */
      typedef struct my_uca_2bytes_item_t
        uint16 weight[MY_UCA_2BYTES_MAX_WEIGHT_SIZE];

      so during scanner_next() we can scan two bytes at a time. Byte pairs that do not match the conditions a-c should be marked in this array as not applicable for optimization, so they can be scanned as before.

      Performance improvement, level 2:

      For every byte pair which is applicable for optimization in #1, and which produces only one or two weights, let's store weights in one more array of 64K elements of a new data type MY_UCA_WEIGHT2, defined as follows:

      typedef struct my_uca_weight2_t
        uint16 weight[2];
      } MY_UCA_WEIGHT2;

      So in the beginning of strnncoll*() we can skip equal prefixes using an even more efficient loop. This loop will consume two bytes at a time. The loop will scan while the two bytes on both sides produce weight strings of equal length (i.e. one weight on both sides, or two weights on both sides).

      This will allow to compare efficiently:

      • Context independent sequences consisting of two ASCII characters
      • Context independent 2-byte characters
      • Contractions consisting of two ASCII characters, e.g. Czech "ch".
      • Some tricky cases: "ss" vs "SHARP S" ("ss" produces two weights, 0xC39F also produces two weights)

      Other Unicode character sets

      Under terms of this patch we'll improve only utf8mb3 and utf8mb4. Other Unicode character sets (ucs2, utf16le, utf16, utf32) can also reuse the same optimization, however this will need some additional code tweaks. Let's do it later under terms of a separate task later.


          Issue Links



              bar Alexander Barkov
              bar Alexander Barkov
              1 Vote for this issue
              6 Start watching this issue



                  Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.