  1. MariaDB Server
  2. MDEV-27266

Improve UCA collation performance for utf8mb3 and utf8mb4

Details

    Description

      Recently, in MDEV-26572, we significantly improved the performance of some simple multi-byte collations on the ASCII range. The idea was to handle multiple ASCII characters (4 or 8) at a time.

      A similar improvement can be made for the UCA collations for utf8mb3 and utf8mb4.

      It's hard to handle 4 or 8 bytes at a time, because UCA is much more complex than the simple collations improved in MDEV-26572. However, it's possible to handle at least 2 bytes at a time. This will improve performance for:

      • 2-byte sequences representing two consecutive ASCII characters
      • 2-byte sequences representing a single 2-byte character (such as accented Latin letters, Greek, Cyrillic, Armenian, Hebrew, Arabic).
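
The two kinds of byte pairs listed above can be told apart with a simple byte-level check. The following is a minimal sketch, not code from the patch: the names pair_kind_t and toy_classify_pair are hypothetical, and the only rule assumed is the standard UTF-8 one (a 2-byte character is a lead byte C2..DF followed by a continuation byte 80..BF).

```c
/* Hypothetical classification of a byte pair; all names are illustrative. */
typedef enum
{
  PAIR_OTHER= 0,        /* neither case: scan one character at a time */
  PAIR_TWO_ASCII,       /* two consecutive ASCII characters */
  PAIR_ONE_2BYTE_CHAR   /* one well-formed 2-byte UTF-8 character */
} pair_kind_t;

static pair_kind_t toy_classify_pair(unsigned char b0, unsigned char b1)
{
  if (b0 < 0x80 && b1 < 0x80)
    return PAIR_TWO_ASCII;
  /* 2-byte UTF-8: lead byte C2..DF, then a continuation byte 80..BF */
  if (b0 >= 0xC2 && b0 <= 0xDF && b1 >= 0x80 && b1 <= 0xBF)
    return PAIR_ONE_2BYTE_CHAR;
  /* Includes the first two bytes of 3-byte and 4-byte characters */
  return PAIR_OTHER;
}
```

With this sketch, the pair ('c', 'h') is classified as two ASCII characters, the pair (0xC3, 0x9F) as one 2-byte character, and the first two bytes of any 3-byte or 4-byte character as PAIR_OTHER.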

      Performance improvement, level 1:

      For every byte pair [00..FF][00..FF] which:

      • a. consists of two ASCII characters or makes a well-formed two-byte character
      • b. has a total weight string that fits into 4 weights (the concatenated weight string in the case of two ASCII characters, or a single weight string in the case of a two-byte character)
      • c. has a context-independent weight (i.e. one that does not depend on contractions or previous-context pairs)

      let's store the weights in a new separate array of 64K elements of a new data type MY_UCA_2BYTES_ITEM, defined as follows:

      #define MY_UCA_2BYTES_MAX_WEIGHT_SIZE (4+1) /* Including 0 terminator */
       
      typedef struct my_uca_2bytes_item_t
      {
        uint16 weight[MY_UCA_2BYTES_MAX_WEIGHT_SIZE];
      } MY_UCA_2BYTES_ITEM;

      so that scanner_next() can scan two bytes at a time. Byte pairs that do not meet conditions a-c should be marked in this array as not applicable for the optimization, so that they are scanned as before.
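
A scanner built on such a table might look like the sketch below. It reuses the MY_UCA_2BYTES_ITEM definition given above; everything else is an assumption made for illustration: the toy_ names, the made-up weights (real ones come from the UCA tables), the one-byte slow path, and the convention that weight[0] == 0 marks a pair as not applicable (the description does not say how such pairs are marked).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint16_t uint16;   /* stand-in for the server's own uint16 typedef */

#define MY_UCA_2BYTES_MAX_WEIGHT_SIZE (4+1) /* Including 0 terminator */

typedef struct my_uca_2bytes_item_t
{
  uint16 weight[MY_UCA_2BYTES_MAX_WEIGHT_SIZE];
} MY_UCA_2BYTES_ITEM;

/* 64K entries indexed by (byte0 << 8) | byte1.
   Assumption of this sketch: weight[0] == 0 means "not applicable". */
static MY_UCA_2BYTES_ITEM toy_table[0x10000];

/* Made-up weight of a single byte, for illustration only */
static uint16 toy_weight(unsigned char ch)
{
  return (uint16) (0x1000 + ch);
}

/* Fill the table for pairs of lowercase ASCII letters only */
static void toy_table_init(void)
{
  unsigned a, b;
  memset(toy_table, 0, sizeof(toy_table));
  for (a= 'a'; a <= 'z'; a++)
    for (b= 'a'; b <= 'z'; b++)
    {
      MY_UCA_2BYTES_ITEM *item= &toy_table[(a << 8) | b];
      item->weight[0]= toy_weight((unsigned char) a);
      item->weight[1]= toy_weight((unsigned char) b);
    }
}

/* Write the weights of s[0..len) into dst, return how many were written.
   *fallbacks counts the bytes that had to take the slow path. */
static size_t toy_scan(const unsigned char *s, size_t len,
                       uint16 *dst, size_t dst_size, size_t *fallbacks)
{
  size_t n= 0, i= 0;
  *fallbacks= 0;
  while (i < len)
  {
    if (i + 1 < len)
    {
      const MY_UCA_2BYTES_ITEM *item=
        &toy_table[((unsigned) s[i] << 8) | s[i + 1]];
      if (item->weight[0])            /* fast path: two bytes at once */
      {
        const uint16 *w;
        for (w= item->weight; *w && n < dst_size; w++)
          dst[n++]= *w;
        i+= 2;
        continue;
      }
    }
    /* Slow path stand-in: a real scanner would decode one character here */
    if (n < dst_size)
      dst[n++]= toy_weight(s[i]);
    (*fallbacks)++;
    i++;
  }
  return n;
}
```

After toy_table_init(), scanning "abcd" goes through the fast path twice and never falls back, while a string containing a byte outside a..z drops to the slow path for the affected positions.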

      Performance improvement, level 2:

      For every byte pair that qualifies for the level 1 optimization and produces only one or two weights, let's store the weights in one more array of 64K elements of a new data type MY_UCA_WEIGHT2, defined as follows:

      typedef struct my_uca_weight2_t
      {
        uint16 weight[2];
      } MY_UCA_WEIGHT2;

      So at the beginning of strnncoll*() we can skip equal prefixes using an even more efficient loop. This loop consumes two bytes at a time and keeps scanning while the two bytes on both sides produce weight strings of equal length (i.e. one weight on both sides, or two weights on both sides).

      This allows efficient comparison of:

      • Context-independent sequences consisting of two ASCII characters
      • Context-independent 2-byte characters
      • Contractions consisting of two ASCII characters, e.g. Czech "ch"
      • Some tricky cases: "ss" vs "SHARP S" ("ss" produces two weights, and 0xC39F also produces two weights)
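
The prefix-skipping loop can be sketched roughly as follows. This is an illustration under stated assumptions rather than the patch itself: the toy_ names and all weights are made up, weight[0] == 0 is assumed to mean "not applicable", and weight[1] == 0 is assumed to mean "one weight only". Pairs ending in 'c' are left not applicable here because 'c' could start the contraction "ch"; how the real code handles that is not described above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint16_t uint16;   /* stand-in for the server's own uint16 typedef */

typedef struct my_uca_weight2_t
{
  uint16 weight[2];
} MY_UCA_WEIGHT2;

/* 64K entries indexed by (byte0 << 8) | byte1 */
static MY_UCA_WEIGHT2 toy_weight2[0x10000];

static void toy_weight2_init(void)
{
  unsigned a, b;
  memset(toy_weight2, 0, sizeof(toy_weight2));
  for (a= 'a'; a <= 'z'; a++)
    for (b= 'a'; b <= 'z'; b++)
    {
      if (b == 'c')
        continue;                      /* may start "ch": context dependent */
      toy_weight2[(a << 8) | b].weight[0]= (uint16) (0x1000 + a);
      toy_weight2[(a << 8) | b].weight[1]= (uint16) (0x1000 + b);
    }
  /* Contraction "ch": one made-up weight */
  toy_weight2[('c' << 8) | 'h'].weight[0]= 0x2000;
  toy_weight2[('c' << 8) | 'h'].weight[1]= 0;
  /* U+00DF SHARP S (bytes C3 9F): the same two weights as "ss" */
  toy_weight2[0xC39F].weight[0]= (uint16) (0x1000 + 's');
  toy_weight2[0xC39F].weight[1]= (uint16) (0x1000 + 's');
}

/* Return the number of leading bytes that can be skipped on both sides.
   Comparing weight[1] as well also checks that the lengths match, because
   a one-weight entry has weight[1] == 0 and a two-weight entry does not. */
static size_t toy_skip_equal_prefix(const unsigned char *a, size_t alen,
                                    const unsigned char *b, size_t blen)
{
  size_t i= 0;
  while (i + 2 <= alen && i + 2 <= blen)
  {
    const MY_UCA_WEIGHT2 *wa= &toy_weight2[((unsigned) a[i] << 8) | a[i + 1]];
    const MY_UCA_WEIGHT2 *wb= &toy_weight2[((unsigned) b[i] << 8) | b[i + 1]];
    if (!wa->weight[0] || !wb->weight[0])
      break;                           /* not applicable: use the general path */
    if (wa->weight[0] != wb->weight[0] || wa->weight[1] != wb->weight[1])
      break;                           /* weights differ: stop skipping */
    i+= 2;
  }
  return i;
}
```

In this toy setup 'ssss' and 'ßß' are skipped in full (4 bytes on each side), since every pair produces the same two weights. The caller would continue with the general comparison from the returned offset.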

      Other Unicode character sets

      In this patch we'll improve only utf8mb3 and utf8mb4. Other Unicode character sets (ucs2, utf16le, utf16, utf32) can also reuse the same optimization, but this will need some additional code tweaks. Let's do that later in a separate task.

          Activity

            bar Alexander Barkov added a comment - - edited

            Benchmarking

            JIRA does not allow characters outside the BMP,
            so in the text below this character:

            U+1F44D THUMBS UP SIGN (_utf8 F09F918D)

            was replaced with Z.

            utf8mb4_general_ci (not changed - just for reference)

            -- Warming up
            SET NAMES utf8mb4 COLLATE utf8mb4_general_ci;
            DO BENCHMARK(10000000,strcmp('xxxx','xxxx'));
             
            -- Benchmarking
            SET NAMES utf8mb4 COLLATE utf8mb4_general_ci;
            DO BENCHMARK(10000000,strcmp('aaaa','aaaa'));
            DO BENCHMARK(10000000,strcmp('aaaaaaaa','aaaaaaaa'));
            DO BENCHMARK(10000000,strcmp('яяяя','яяяя'));
            DO BENCHMARK(10000000,strcmp('ắắắắ','ắắắắ'));
            DO BENCHMARK(10000000,strcmp('ZZZZ','ZZZZ'));
            

            MariaDB [test]> DO BENCHMARK(10000000,strcmp('aaaa','aaaa'));
            Query OK, 0 rows affected (0.107 sec)
             
            MariaDB [test]> DO BENCHMARK(10000000,strcmp('aaaaaaaa','aaaaaaaa'));
            Query OK, 0 rows affected (0.109 sec)
             
            MariaDB [test]> SET NAMES utf8mb4 COLLATE utf8mb4_general_ci;
            Query OK, 0 rows affected (0.000 sec)
             
            MariaDB [test]> DO BENCHMARK(10000000,strcmp('aaaa','aaaa'));
            Query OK, 0 rows affected (0.103 sec)
             
            MariaDB [test]> DO BENCHMARK(10000000,strcmp('яяяя','яяяя'));
            Query OK, 0 rows affected (0.192 sec)
             
            MariaDB [test]> DO BENCHMARK(10000000,strcmp('ắắắắ','ắắắắ'));
            Query OK, 0 rows affected (0.266 sec)
             
            MariaDB [test]> DO BENCHMARK(10000000,strcmp('ZZZZ','ZZZZ'));
            Query OK, 0 rows affected (0.260 sec)
            

            Summary

            A         B          Time   Comment
            ----      -----      ----   --------
            aaaa      aaaa       0.103  ASCII
            aaaaaaaa  aaaaaaaa   0.109  ASCII
            яяяя      яяяя       0.192  2-byte Cyrillic
            ắắắắ      ắắắắ       0.266  3-byte Vietnamese
            ZZZZ      ZZZZ       0.260  4-byte Emoji (see comment above)
            

            Old utf8mb4_unicode_ci (before the patch)

            -- Warming up
            SET NAMES utf8mb4 COLLATE utf8mb4_general_ci;
            DO BENCHMARK(10000000,strcmp('xxxx','xxxx'));
             
            -- Benchmarking
            SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci;
            DO BENCHMARK(10000000,strcmp('aaaa','aaaa'));
            DO BENCHMARK(10000000,strcmp('aaaaaaaa','aaaaaaaa'));
            DO BENCHMARK(10000000,strcmp('яяяя','яяяя'));
            DO BENCHMARK(10000000,strcmp('ssss','ßß'));
            DO BENCHMARK(10000000,strcmp('ắắắắ','ắắắắ'));
            DO BENCHMARK(10000000,strcmp('ZZZZ','ZZZZ'));
            

            MariaDB [test]> DO BENCHMARK(10000000,strcmp('aaaa','aaaa'));
            Query OK, 0 rows affected (0.259 sec)
             
            MariaDB [test]> DO BENCHMARK(10000000,strcmp('aaaaaaaa','aaaaaaaa'));
            Query OK, 0 rows affected (0.405 sec)
             
            MariaDB [test]> DO BENCHMARK(10000000,strcmp('яяяя','яяяя'));
            Query OK, 0 rows affected (0.384 sec)
             
            MariaDB [test]> DO BENCHMARK(10000000,strcmp('ssss','ßß'));
            Query OK, 0 rows affected (0.265 sec)
             
            MariaDB [test]> DO BENCHMARK(10000000,strcmp('ắắắắ','ắắắắ'));
            Query OK, 0 rows affected (0.413 sec)
             
            MariaDB [test]> DO BENCHMARK(10000000,strcmp('ZZZZ','ZZZZ'));
            Query OK, 0 rows affected (0.417 sec)
            

            Summary

            A         B          Time    % of general_ci      Comment
            ----      -----      ----    -------------------  -------
            aaaa      aaaa       0.259   251   (259/103*100)  ASCII
            aaaaaaaa  aaaaaaaa   0.405   371   (405/109*100)  ASCII
            яяяя      яяяя       0.384   200   (384/192*100)  2-byte Cyrillic
            ssss      ßß         0.265   N/A                  ASCII vs 2-byte Latin with expansion
            ắắắắ      ắắắắ       0.413   155   (413/266*100)  3-byte Vietnamese
            ZZZZ      ZZZZ       0.417   160   (417/260*100)  4-byte Emoji (see comment above)
            

            New utf8mb4_unicode_ci (after the patch)

            -- Warming up
            SET NAMES utf8mb4 COLLATE utf8mb4_general_ci;
            DO BENCHMARK(10000000,strcmp('xxxx','xxxx'));
             
            -- Benchmarking
            SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci;
            DO BENCHMARK(10000000,strcmp('aaaa','aaaa'));
            DO BENCHMARK(10000000,strcmp('aaaaaaaa','aaaaaaaa'));
            DO BENCHMARK(10000000,strcmp('яяяя','яяяя'));
            DO BENCHMARK(10000000,strcmp('ssss','ßß'));
            DO BENCHMARK(10000000,strcmp('ắắắắ','ắắắắ'));
            DO BENCHMARK(10000000,strcmp('ZZZZ','ZZZZ'));
            

            MariaDB [test]> DO BENCHMARK(10000000,strcmp('aaaa','aaaa'));
            Query OK, 0 rows affected (0.160 sec)
             
            MariaDB [test]> DO BENCHMARK(10000000,strcmp('aaaaaaaa','aaaaaaaa'));
            Query OK, 0 rows affected (0.181 sec)
             
            MariaDB [test]> DO BENCHMARK(10000000,strcmp('яяяя','яяяя'));
            Query OK, 0 rows affected (0.177 sec)
             
            MariaDB [test]> DO BENCHMARK(10000000,strcmp('ssss','ßß'));
            Query OK, 0 rows affected (0.156 sec)
             
            MariaDB [test]> DO BENCHMARK(10000000,strcmp('ắắắắ','ắắắắ'));
            Query OK, 0 rows affected (0.433 sec)
             
            MariaDB [test]> DO BENCHMARK(10000000,strcmp('ZZZZ','ZZZZ'));
            Query OK, 0 rows affected (0.476 sec)
            

            Summary

            A         B         Time   % of utf8mb4_general_ci  Comment
            ----      ----      ----   -----------------------  -------
            aaaa      aaaa      0.160  155 (160/103*100)        ASCII
            aaaaaaaa  aaaaaaaa  0.181  166 (181/109*100)        ASCII
            яяяя      яяяя      0.177  92  (177/192*100)        2-byte Cyrillic
            ssss      ßß        0.156  N/A                      ASCII vs 2-byte Latin with expansion
            ắắắắ      ắắắắ      0.433  163 (433/266*100)        3-byte Vietnamese
            ZZZZ      ZZZZ      0.476  183 (476/260*100)        4-byte Emoji (see comment above)
            

            Full summary

            utf8mb4_general_ci - old utf8mb4_unicode_ci - new utf8mb4_unicode_ci

            A         B         % New/Old  OldTime  % Old/general_ci   NewTime  % New/general_ci   Comment
            ----      -----     ---------  -------  -----------------  -------  -----------------  -------
            aaaa      aaaa             62  0.259    251 (259/103*100)  0.160    155 (160/103*100)  ASCII
            aaaaaaaa  aaaaaaaa         45  0.405    371 (405/109*100)  0.181    166 (181/109*100)  ASCII
            яяяя      яяяя             46  0.384    200 (384/192*100)  0.177    92  (177/192*100)  2-byte Cyrillic
            ssss      ßß               59  0.265    N/A                0.156    N/A                ASCII vs 2-byte Latin with expansion
            ắắắắ      ắắắắ            105  0.413    155 (413/266*100)  0.433    163 (433/266*100)  3-byte Vietnamese
            ZZZZ      ZZZZ            114  0.417    160 (417/260*100)  0.476    183 (476/260*100)  4-byte Emoji (see comment above)
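
The percentage columns are plain ratios of the measured times. A small helper for recomputing them is sketched below; the name toy_pct is hypothetical, times are passed in milliseconds, and rounding to the nearest integer is an assumption, since the comment does not state a rounding rule.

```c
#include <stdio.h>

/* Hypothetical helper: a_ms / b_ms as a percentage, rounded to the
   nearest integer. b_ms must be greater than zero. */
static unsigned toy_pct(unsigned a_ms, unsigned b_ms)
{
  return (a_ms * 100 + b_ms / 2) / b_ms;
}

/* Print the three derived columns for one benchmark row */
static void toy_print_row(const char *label, unsigned general_ms,
                          unsigned old_ms, unsigned new_ms)
{
  printf("%-20s New/Old=%u%%  Old/general_ci=%u%%  New/general_ci=%u%%\n",
         label,
         toy_pct(new_ms, old_ms),
         toy_pct(old_ms, general_ms),
         toy_pct(new_ms, general_ms));
}
```

For example, toy_pct(160, 259) gives 62, the New/Old figure for the 4-byte ASCII row.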
            

            Observations:

            Performance significantly improved:

            • On 4-byte ASCII strings the new implementation takes 62% of the old time (155% of utf8mb4_general_ci)
            • On 8-byte ASCII strings the new implementation takes 45% of the old time (166% of utf8mb4_general_ci)
            • On 4-character (8-byte) Cyrillic strings the new implementation takes 46% of the old time (92% of utf8mb4_general_ci, i.e. 177/192)
            • On 'ssss' vs 'ßß' the new implementation takes 59% of the old time

            Performance slightly degraded:

            • On 4-character (4*3=12 byte) Vietnamese strings the new implementation takes 105% of the old time (and 163% of utf8mb4_general_ci)
            • On 4-character (4*4=16 byte) Emoji strings the new implementation takes 114% of the old time (and 183% of utf8mb4_general_ci)

            The slight slowdown on 3-byte and 4-byte characters is expected: the scanner now first tries the optimized path, fails, and then falls back to the old non-optimized path.


            serg Sergei Golubchik added a comment -

            It's in this branch: preview-10.10-uca14.

            lstartseva Lena Startseva added a comment -

            Environment:

            Linux  5.13.0-52-generic #59-Ubuntu SMP  x86_64 x86_64 x86_64 GNU/Linux
            memory         64GiB System Memory
            processor      11th Gen Intel(R) Core(TM) i7-11850H @ 2.50GHz

            Summary

            A         B         general_ci  utf8mb4_unicode_ci (old)  utf8mb4_unicode_ci (new)  % New/Old  % Old/general_ci  % New/general_ci
            ----      ----      ----------  ------------------------  ------------------------  ---------  ----------------  ----------------
            aaaa      aaaa      0.363       0.933                     0.545                     58.4       257.0             150.1
            aaaaaaaa  aaaaaaaa  0.396       1.452                     0.651                     44.8       366.7             164.4
            яяяя      яяяя      0.702       1.278                     0.652                     51.0       182.1             92.9
            ssss      ßß        0.487       0.949                     0.545                     57.4       194.9             111.9
            ắắắắ      ắắắắ      0.885       1.361                     1.685                     123.8      153.8             190.4
                                0.665       1.132                     1.529                     135.0      170.2             229.9

            On my laptop the results are a little less optimistic, but in line with expectations.

            lstartseva Lena Startseva added a comment -

            Ok to push

            rjasdfiii Rick James added a comment -

            As for "other" charsets (ucs2, etc), I suggest that can be very low on the priority list. I have not heard of anyone creating a table with such. Importing is an unrelated topic; I would encourage converting to utf8mb4 during importation, thereby avoiding the need for collation speedups.

            People

              bar Alexander Barkov
              bar Alexander Barkov