Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-27265

Improve contraction performance in UCA collations

Details

    Description

      Currently, when searching for a contraction, the code in the function my_uca_contraction_weight() simply iterates through all items in a MY_CONTRACTIONS array.
      So the performance depends on where the searched contraction resides in the list:

      • it is fast for contractions which are in the very beginning of the list
      • it is slow for contractions which are in the end of the list

      Some collations can have many contractions. For example, utf8mb4_thai_520_w2 has 231 contractions. Searching for contractions that are in the end of the list can be really slow.

      In the future versions of UCA collations we're going to enable built-in DUCET contractions. It has near 900 built-in contractions. It's important to have this problem fixed.

      Benchmarking utf8mb4_thai_520_w2

      Let's check contractions from the array thai_contractions in ctype-uca.c.

      This is a relevant snapshop from ctype-uca.c:

        /* thai_contraction[0] */
        { /* <THAI CHARACTER SARA E, THAI CHARACTER KO KAI> */
          { 0x0E40, 0x0E01, 0 },
          { 0x1F70, 0x1FAA, 0 },
          FALSE
        },
       
         ...
       
        /* thai_contraction[7] */
        { /* <THAI CHARACTER SARA O, THAI CHARACTER KHO KHAI> */
          { 0x0E42, 0x0E02, 0 },
          { 0x1F71, 0x1FAC, 0 },
          FALSE
        },
       
        ...
       
        /* thai_contraction[229] */
        { /* <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER HO NOKHUK> */
          { 0x0E44, 0x0E2E, 0 },
          { 0x1F9D, 0x1FAE, 0 },
          FALSE
        },

      Now let's check the pefromance of the above three contractions:

      • The very first one
      • The contraction on the 7th position in the array
      • The contraction on the 229th position (the previous before the very last one)

      The SQL code below intentionally adds digits 1 and 2 in the end of the left and right strings, to make the two strings primarily different, to avoid the comparison function going to the secondary level.

      SET NAMES utf8mb4 COLLATE utf8mb4_thai_520_w2;
      DO BENCHMARK(10000000,strcmp('เก1','เก2'));
      Query OK, 0 rows affected (0.439 sec)

      SET NAMES utf8mb4 COLLATE utf8mb4_thai_520_w2;
      DO BENCHMARK(10000000,strcmp('โข1','โข2'));
      Query OK, 0 rows affected (0.940 sec)

      SET NAMES utf8mb4 COLLATE utf8mb4_thai_520_w2;
      DO BENCHMARK(10000000,strcmp('ไฮ1','ไฮ2'));
      Query OK, 0 rows affected (15.805 sec)

      Summary:

      Ord Contraction Time
      --- ----------- ----
      0   เก           0.439
      7   โข           0.940
      229 ไฮ           15.805

      Notice:

      • The contraction on the 7th position is twice slower than the very first one.
      • The contraction on the 229th position is 36 times slower than the very fitst one!

      Benchmarking utf8mb4_croatian_ci

      This collation has 9th contractions. Let's check their performance in the order of their appearance in the tailoring:

      SET NAMES utf8mb4 COLLATE utf8mb4_croatian_ci;
      DO BENCHMARK(10000000,strcmp('dž1','dž2'));
      DO BENCHMARK(10000000,strcmp('Dž1','Dž2'));
      DO BENCHMARK(10000000,strcmp('DŽ1','DŽ2'));
      DO BENCHMARK(10000000,strcmp('lj1','lj2'));
      DO BENCHMARK(10000000,strcmp('Lj1','Lj2'));
      DO BENCHMARK(10000000,strcmp('LJ1','LJ2'));
      DO BENCHMARK(10000000,strcmp('nj1','nj2'));
      DO BENCHMARK(10000000,strcmp('Nj1','Nj2'));
      DO BENCHMARK(10000000,strcmp('NJ1','NJ2'));

      This is the output:

      MariaDB [test]> DO BENCHMARK(10000000,strcmp('dž1','dž2'));
      Query OK, 0 rows affected (0.397 sec)
       
      MariaDB [test]> DO BENCHMARK(10000000,strcmp('Dž1','Dž2'));
      Query OK, 0 rows affected (0.452 sec)
       
      MariaDB [test]> DO BENCHMARK(10000000,strcmp('DŽ1','DŽ2'));
      Query OK, 0 rows affected (0.500 sec)
       
      MariaDB [test]> DO BENCHMARK(10000000,strcmp('lj1','lj2'));
      Query OK, 0 rows affected (0.535 sec)
       
      MariaDB [test]> DO BENCHMARK(10000000,strcmp('Lj1','Lj2'));
      Query OK, 0 rows affected (0.593 sec)
       
      MariaDB [test]> DO BENCHMARK(10000000,strcmp('LJ1','LJ2'));
      Query OK, 0 rows affected (0.647 sec)
       
      MariaDB [test]> DO BENCHMARK(10000000,strcmp('nj1','nj2'));
      Query OK, 0 rows affected (0.700 sec)
       
      MariaDB [test]> DO BENCHMARK(10000000,strcmp('Nj1','Nj2'));
      Query OK, 0 rows affected (0.753 sec)
       
      MariaDB [test]> DO BENCHMARK(10000000,strcmp('NJ1','NJ2'));
      Query OK, 0 rows affected (0.811 sec)

      The summary:

      Ord  Contraction Time
      ---  ----------- ----
      0    dž          0.397
      1    Dž          0.452
      2    DŽ          0.500
      3    lj          0.535
      4    Lj          0.593
      5    LJ          0.647
      6    nj          0.700
      7    Nj          0.753
      8    NJ          0.811

      Proposal

      The code should be fixed to use a hash table for contractions instead of the simple list iteration.

      Attachments

        Issue Links

          Activity

            bar Alexander Barkov added a comment - - edited

            Benchmarking

            Scripts

            -- Warning up
            SET NAMES utf8mb4 COLLATE utf8mb4_general_ci;
            DO BENCHMARK(10000000,strcmp('xxxx','xxxx'));
             
            -- Benchmarking
            SET NAMES utf8mb4 COLLATE utf8mb4_thai_520_w2;
            DO BENCHMARK(10000000,strcmp('เก1','เก2'));
            DO BENCHMARK(10000000,strcmp('โข1','โข2'));
            DO BENCHMARK(10000000,strcmp('ไฮ1','ไฮ2'));
            

            -- Warning up
            SET NAMES utf8mb4 COLLATE utf8mb4_general_ci;
            DO BENCHMARK(10000000,strcmp('xxxx','xxxx'));
             
            -- Benchmarking
            SET NAMES utf8mb4 COLLATE utf8mb4_croatian_ci;
            DO BENCHMARK(10000000,strcmp('dž','dž'));
            DO BENCHMARK(10000000,strcmp('Dž','Dž'));
            DO BENCHMARK(10000000,strcmp('DŽ','DŽ'));
            DO BENCHMARK(10000000,strcmp('lj','lj'));
            DO BENCHMARK(10000000,strcmp('Lj','Lj'));
            DO BENCHMARK(10000000,strcmp('LJ','LJ'));
            DO BENCHMARK(10000000,strcmp('nj','nj'));
            DO BENCHMARK(10000000,strcmp('Nj','Nj'));
            DO BENCHMARK(10000000,strcmp('NJ','NJ'));
            

            Old (before the patch)

            MariaDB [test]> SET NAMES utf8mb4 COLLATE utf8mb4_thai_520_w2;
            Query OK, 0 rows affected (0.001 sec)
             
            MariaDB [test]> DO BENCHMARK(10000000,strcmp('เก1','เก2'));
            Query OK, 0 rows affected (0.432 sec)
             
            MariaDB [test]> DO BENCHMARK(10000000,strcmp('โข1','โข2'));
            Query OK, 0 rows affected (0.884 sec)
             
            MariaDB [test]> DO BENCHMARK(10000000,strcmp('ไฮ1','ไฮ2'));
            Query OK, 0 rows affected (15.566 sec)
            

            MariaDB [test]> SET NAMES utf8mb4 COLLATE utf8mb4_croatian_ci;
            Query OK, 0 rows affected (0.001 sec)
             
            MariaDB [test]> DO BENCHMARK(10000000,strcmp('dž','dž'));
            Query OK, 0 rows affected (0.366 sec)
             
            MariaDB [test]> DO BENCHMARK(10000000,strcmp('Dž','Dž'));
            Query OK, 0 rows affected (0.425 sec)
             
            MariaDB [test]> DO BENCHMARK(10000000,strcmp('DŽ','DŽ'));
            Query OK, 0 rows affected (0.472 sec)
             
            MariaDB [test]> DO BENCHMARK(10000000,strcmp('lj','lj'));
            Query OK, 0 rows affected (0.509 sec)
             
            MariaDB [test]> DO BENCHMARK(10000000,strcmp('Lj','Lj'));
            Query OK, 0 rows affected (0.561 sec)
             
            MariaDB [test]> DO BENCHMARK(10000000,strcmp('LJ','LJ'));
            Query OK, 0 rows affected (0.622 sec)
             
            MariaDB [test]> DO BENCHMARK(10000000,strcmp('nj','nj'));
            Query OK, 0 rows affected (0.673 sec)
             
            MariaDB [test]> DO BENCHMARK(10000000,strcmp('Nj','Nj'));
            Query OK, 0 rows affected (0.727 sec)
             
            MariaDB [test]> DO BENCHMARK(10000000,strcmp('NJ','NJ'));
            Query OK, 0 rows affected (0.786 sec)
            

            New (after the patch)

            MariaDB [test]> SET NAMES utf8mb4 COLLATE utf8mb4_thai_520_w2;
            Query OK, 0 rows affected (0.001 sec)
             
            MariaDB [test]> DO BENCHMARK(10000000,strcmp('เก1','เก2'));
            Query OK, 0 rows affected (0.531 sec)
             
            MariaDB [test]> DO BENCHMARK(10000000,strcmp('โข1','โข2'));
            Query OK, 0 rows affected (0.530 sec)
             
            MariaDB [test]> DO BENCHMARK(10000000,strcmp('ไฮ1','ไฮ2'));
            Query OK, 0 rows affected (0.531 sec)
            

            MariaDB [test]> SET NAMES utf8mb4 COLLATE utf8mb4_croatian_ci;
            Query OK, 0 rows affected (0.001 sec)
             
            MariaDB [test]> DO BENCHMARK(10000000,strcmp('dž','dž'));
            Query OK, 0 rows affected (0.503 sec)
             
            MariaDB [test]> DO BENCHMARK(10000000,strcmp('Dž','Dž'));
            Query OK, 0 rows affected (0.507 sec)
             
            MariaDB [test]> DO BENCHMARK(10000000,strcmp('DŽ','DŽ'));
            Query OK, 0 rows affected (0.501 sec)
             
            MariaDB [test]> DO BENCHMARK(10000000,strcmp('lj','lj'));
            Query OK, 0 rows affected (0.154 sec)
             
            MariaDB [test]> DO BENCHMARK(10000000,strcmp('Lj','Lj'));
            Query OK, 0 rows affected (0.155 sec)
             
            MariaDB [test]> DO BENCHMARK(10000000,strcmp('LJ','LJ'));
            Query OK, 0 rows affected (0.152 sec)
             
            MariaDB [test]> DO BENCHMARK(10000000,strcmp('nj','nj'));
            Query OK, 0 rows affected (0.152 sec)
             
            MariaDB [test]> DO BENCHMARK(10000000,strcmp('Nj','Nj'));
            Query OK, 0 rows affected (0.153 sec)
             
            MariaDB [test]> DO BENCHMARK(10000000,strcmp('NJ','NJ'));
            Query OK, 0 rows affected (0.151 sec)
            

            Summary

            utf8mb4_thai_520_w2

            Ord  A    B    Old time  New Time  % New/Old
            ---  ---  ---  --------  --------  ---------
            0    เก1   เก2     0.432     0.531       122
            7    โข1   โข2     0.884     0.530        59
            229  ไฮ1   ไฮ2    15.566     0.531         3
            

            utf8mb4_croatian_ci

            Ord A   B     Old time New time  % New/Old
            --- --- ---   -------- --------  ---------
            0   dž  dž       0.366    0.503        137
            1   Dž  Dž       0.425    0.507        119
            2   DŽ  DŽ       0.472    0.501        106
            3   lj  lj       0.509    0.154         30
            4   Lj  Lj       0.561    0.155         27
            5   LJ  LJ       0.622    0.152         24
            6   nj  nj       0.673    0.152         22
            7   Nj  Nj       0.727    0.153         21
            8   NJ  NJ       0.786    0.151         19
            

            Observation:

            • After the change, the contraction search time does not depend on the contraction ordinal position in the tailoring (it only depends on "ASCII or not ASCII")
            • Searching for the very last (but one) contraction in utf8mb4_thai_520_w2 now takes 3% of the old search time (30 times faster!)
            • Searching for the very last contraction in utf8mb4_croatian_ci now takes 19% of the old search time (5 times faster)
            bar Alexander Barkov added a comment - - edited Benchmarking Scripts -- Warning up SET NAMES utf8mb4 COLLATE utf8mb4_general_ci; DO BENCHMARK(10000000,strcmp( 'xxxx' , 'xxxx' ));   -- Benchmarking SET NAMES utf8mb4 COLLATE utf8mb4_thai_520_w2; DO BENCHMARK(10000000,strcmp( 'เก1' , 'เก2' )); DO BENCHMARK(10000000,strcmp( 'โข1' , 'โข2' )); DO BENCHMARK(10000000,strcmp( 'ไฮ1' , 'ไฮ2' )); -- Warning up SET NAMES utf8mb4 COLLATE utf8mb4_general_ci; DO BENCHMARK(10000000,strcmp( 'xxxx' , 'xxxx' ));   -- Benchmarking SET NAMES utf8mb4 COLLATE utf8mb4_croatian_ci; DO BENCHMARK(10000000,strcmp( 'dž' , 'dž' )); DO BENCHMARK(10000000,strcmp( 'Dž' , 'Dž' )); DO BENCHMARK(10000000,strcmp( 'DŽ' , 'DŽ' )); DO BENCHMARK(10000000,strcmp( 'lj' , 'lj' )); DO BENCHMARK(10000000,strcmp( 'Lj' , 'Lj' )); DO BENCHMARK(10000000,strcmp( 'LJ' , 'LJ' )); DO BENCHMARK(10000000,strcmp( 'nj' , 'nj' )); DO BENCHMARK(10000000,strcmp( 'Nj' , 'Nj' )); DO BENCHMARK(10000000,strcmp( 'NJ' , 'NJ' )); Old (before the patch) MariaDB [test]> SET NAMES utf8mb4 COLLATE utf8mb4_thai_520_w2; Query OK, 0 rows affected (0.001 sec)   MariaDB [test]> DO BENCHMARK(10000000,strcmp('เก1','เก2')); Query OK, 0 rows affected (0.432 sec)   MariaDB [test]> DO BENCHMARK(10000000,strcmp('โข1','โข2')); Query OK, 0 rows affected (0.884 sec)   MariaDB [test]> DO BENCHMARK(10000000,strcmp('ไฮ1','ไฮ2')); Query OK, 0 rows affected (15.566 sec) MariaDB [test]> SET NAMES utf8mb4 COLLATE utf8mb4_croatian_ci; Query OK, 0 rows affected (0.001 sec)   MariaDB [test]> DO BENCHMARK(10000000,strcmp('dž','dž')); Query OK, 0 rows affected (0.366 sec)   MariaDB [test]> DO BENCHMARK(10000000,strcmp('Dž','Dž')); Query OK, 0 rows affected (0.425 sec)   MariaDB [test]> DO BENCHMARK(10000000,strcmp('DŽ','DŽ')); Query OK, 0 rows affected (0.472 sec)   MariaDB [test]> DO BENCHMARK(10000000,strcmp('lj','lj')); Query OK, 0 rows affected (0.509 sec)   MariaDB [test]> DO BENCHMARK(10000000,strcmp('Lj','Lj')); Query OK, 0 rows affected (0.561 sec)   MariaDB [test]> DO BENCHMARK(10000000,strcmp('LJ','LJ')); Query OK, 0 rows affected (0.622 sec)   MariaDB [test]> DO BENCHMARK(10000000,strcmp('nj','nj')); Query OK, 0 rows affected (0.673 sec)   MariaDB [test]> DO BENCHMARK(10000000,strcmp('Nj','Nj')); Query OK, 0 rows affected (0.727 sec)   MariaDB [test]> DO BENCHMARK(10000000,strcmp('NJ','NJ')); Query OK, 0 rows affected (0.786 sec) New (after the patch) MariaDB [test]> SET NAMES utf8mb4 COLLATE utf8mb4_thai_520_w2; Query OK, 0 rows affected (0.001 sec)   MariaDB [test]> DO BENCHMARK(10000000,strcmp('เก1','เก2')); Query OK, 0 rows affected (0.531 sec)   MariaDB [test]> DO BENCHMARK(10000000,strcmp('โข1','โข2')); Query OK, 0 rows affected (0.530 sec)   MariaDB [test]> DO BENCHMARK(10000000,strcmp('ไฮ1','ไฮ2')); Query OK, 0 rows affected (0.531 sec) MariaDB [test]> SET NAMES utf8mb4 COLLATE utf8mb4_croatian_ci; Query OK, 0 rows affected (0.001 sec)   MariaDB [test]> DO BENCHMARK(10000000,strcmp('dž','dž')); Query OK, 0 rows affected (0.503 sec)   MariaDB [test]> DO BENCHMARK(10000000,strcmp('Dž','Dž')); Query OK, 0 rows affected (0.507 sec)   MariaDB [test]> DO BENCHMARK(10000000,strcmp('DŽ','DŽ')); Query OK, 0 rows affected (0.501 sec)   MariaDB [test]> DO BENCHMARK(10000000,strcmp('lj','lj')); Query OK, 0 rows affected (0.154 sec)   MariaDB [test]> DO BENCHMARK(10000000,strcmp('Lj','Lj')); Query OK, 0 rows affected (0.155 sec)   MariaDB [test]> DO BENCHMARK(10000000,strcmp('LJ','LJ')); Query OK, 0 rows affected (0.152 sec)   MariaDB [test]> DO BENCHMARK(10000000,strcmp('nj','nj')); Query OK, 0 rows affected (0.152 sec)   MariaDB [test]> DO BENCHMARK(10000000,strcmp('Nj','Nj')); Query OK, 0 rows affected (0.153 sec)   MariaDB [test]> DO BENCHMARK(10000000,strcmp('NJ','NJ')); Query OK, 0 rows affected (0.151 sec) Summary utf8mb4_thai_520_w2 Ord A B Old time New Time % New/Old --- --- --- -------- -------- --------- 0 เก1 เก2 0.432 0.531 122 7 โข1 โข2 0.884 0.530 59 229 ไฮ1 ไฮ2 15.566 0.531 3 utf8mb4_croatian_ci Ord A B Old time New time % New/Old --- --- --- -------- -------- --------- 0 dž dž 0.366 0.503 137 1 Dž Dž 0.425 0.507 119 2 DŽ DŽ 0.472 0.501 106 3 lj lj 0.509 0.154 30 4 Lj Lj 0.561 0.155 27 5 LJ LJ 0.622 0.152 24 6 nj nj 0.673 0.152 22 7 Nj Nj 0.727 0.153 21 8 NJ NJ 0.786 0.151 19 Observation: After the change, the contraction search time does not depend on the contraction ordinal position in the tailoring (it only depends on "ASCII or not ASCII") Searching for the very last (but one) contraction in utf8mb4_thai_520_w2 now takes 3% of the old search time (30 times faster!) Searching for the very last contraction in utf8mb4_croatian_ci now takes 19% of the old search time (5 times faster)

            It's in this branch: preview-10.10-uca14.

            serg Sergei Golubchik added a comment - It's in this branch: preview-10.10-uca14 .

            Environment:

            Linux  5.13.0-52-generic #59-Ubuntu SMP  x86_64 x86_64 x86_64 GNU/Linux
            memory         64GiB System Memory
            processor      11th Gen Intel(R) Core(TM) i7-11850H @ 2.50GHz
            

            Summary
            utf8mb4_thai_520_w2

            Ord A B Old time New Time % New/Old
            0 เก1 เก2 1.437 1.537 107
            7 โข1 โข2 2.138 1.517 71
            229 ไฮ1 ไฮ2 28.365 1.506 5

            utf8mb4_croatian_ci

            Ord A B Old time New Time % New/Old
            0 1.131 1.363 120
            1 1.220 1.318 108
            2 1.337 1.321 99
            3 lj lj 1.407 0.558 40
            4 Lj Lj 1.499 0.558 37
            5 LJ LJ 1.598 0.562 35
            6 nj nj 1.719 0.561 33
            7 Nj Nj 1.808 0.559 31
            8 NJ NJ 1.906 0.561 29

            On my laptop searching for a contraction has showed a slightly worse result, but the improvements are still observable.

            I also checked new collation uca1400_a(i\s)_c(i\s). For thai is everything is good:

            Ord A B ai_ci as_ci ai_cs as_cs
            0 เก1 เก2 1.452 1.450 1.450 1.454
            7 โข1 โข2 1.442 1.450 1.447 1.442
            229 ไฮ1 ไฮ2 1.428 1.433 1.432 1.435

            but for uca1400_croatian_a(i\s)_c(i\s) the time increses when accent sensitive or case sensitive is used:

            Ord A B ai_ci as_ci ai_cs as_cs
            0 1.216 2.144 2.144 3.193
            8 NJ NJ 0.511 2.090 2.070 3.107

            This doesn't look like a defect of the current task, but it might be improved in the future.

            lstartseva Lena Startseva added a comment - Environment: Linux 5.13.0-52-generic #59-Ubuntu SMP x86_64 x86_64 x86_64 GNU/Linux memory 64GiB System Memory processor 11th Gen Intel(R) Core(TM) i7-11850H @ 2.50GHz Summary utf8mb4_thai_520_w2 Ord A B Old time New Time % New/Old 0 เก1 เก2 1.437 1.537 107 7 โข1 โข2 2.138 1.517 71 229 ไฮ1 ไฮ2 28.365 1.506 5 utf8mb4_croatian_ci Ord A B Old time New Time % New/Old 0 dž dž 1.131 1.363 120 1 Dž Dž 1.220 1.318 108 2 DŽ DŽ 1.337 1.321 99 3 lj lj 1.407 0.558 40 4 Lj Lj 1.499 0.558 37 5 LJ LJ 1.598 0.562 35 6 nj nj 1.719 0.561 33 7 Nj Nj 1.808 0.559 31 8 NJ NJ 1.906 0.561 29 On my laptop searching for a contraction has showed a slightly worse result, but the improvements are still observable. I also checked new collation uca1400_a(i\s)_c(i\s) . For thai is everything is good: Ord A B ai_ci as_ci ai_cs as_cs 0 เก1 เก2 1.452 1.450 1.450 1.454 7 โข1 โข2 1.442 1.450 1.447 1.442 229 ไฮ1 ไฮ2 1.428 1.433 1.432 1.435 but for uca1400_croatian_a(i\s)_c(i\s) the time increses when accent sensitive or case sensitive is used: Ord A B ai_ci as_ci ai_cs as_cs 0 dž dž 1.216 2.144 2.144 3.193 8 NJ NJ 0.511 2.090 2.070 3.107 This doesn't look like a defect of the current task, but it might be improved in the future.

            Ok to push

            lstartseva Lena Startseva added a comment - Ok to push

            People

              bar Alexander Barkov
              bar Alexander Barkov
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.