Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-27265

Improve contraction performance in UCA collations

    XMLWordPrintable

    Details

      Description

      Currently, when searching for a contraction, the code in the function my_uca_contraction_weight() simply iterates through all items in a MY_CONTRACTIONS array.
      So the performance depends on where the searched contraction resides in the list:

      • it is fast for contractions which are in the very beginning of the list
      • it is slow for contractions which are in the end of the list

      Some collations can have many contractions. For example, utf8mb4_thai_520_w2 has 231 contractions. Searching for contractions that are in the end of the list can be really slow.

      In the future versions of UCA collations we're going to enable built-in DUCET contractions. It has near 900 built-in contractions. It's important to have this problem fixed.

      Benchmarking utf8mb4_thai_520_w2

      Let's check contractions from the array thai_contractions in ctype-uca.c.

      This is a relevant snapshop from ctype-uca.c:

        /* thai_contraction[0] */
        { /* <THAI CHARACTER SARA E, THAI CHARACTER KO KAI> */
          { 0x0E40, 0x0E01, 0 },
          { 0x1F70, 0x1FAA, 0 },
          FALSE
        },
       
         ...
       
        /* thai_contraction[7] */
        { /* <THAI CHARACTER SARA O, THAI CHARACTER KHO KHAI> */
          { 0x0E42, 0x0E02, 0 },
          { 0x1F71, 0x1FAC, 0 },
          FALSE
        },
       
        ...
       
        /* thai_contraction[229] */
        { /* <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER HO NOKHUK> */
          { 0x0E44, 0x0E2E, 0 },
          { 0x1F9D, 0x1FAE, 0 },
          FALSE
        },
      

      Now let's check the pefromance of the above three contractions:

      • The very first one
      • The contraction on the 7th position in the array
      • The contraction on the 229th position (the previous before the very last one)

      The SQL code below intentionally adds digits 1 and 2 in the end of the left and right strings, to make the two strings primarily different, to avoid the comparison function going to the secondary level.

      SET NAMES utf8mb4 COLLATE utf8mb4_thai_520_w2;
      DO BENCHMARK(10000000,strcmp('เก1','เก2'));
      Query OK, 0 rows affected (0.439 sec)
      

      SET NAMES utf8mb4 COLLATE utf8mb4_thai_520_w2;
      DO BENCHMARK(10000000,strcmp('โข1','โข2'));
      Query OK, 0 rows affected (0.940 sec)
      

      SET NAMES utf8mb4 COLLATE utf8mb4_thai_520_w2;
      DO BENCHMARK(10000000,strcmp('ไฮ1','ไฮ2'));
      Query OK, 0 rows affected (15.805 sec)
      

      Summary:

      Ord Contraction Time
      --- ----------- ----
      0   เก           0.439
      7   โข           0.940
      229 ไฮ           15.805
      

      Notice:

      • The contraction on the 7th position is twice slower than the very first one.
      • The contraction on the 229th position is 36 times slower than the very fitst one!

      Benchmarking utf8mb4_croatian_ci

      This collation has 9th contractions. Let's check their performance in the order of their appearance in the tailoring:

      SET NAMES utf8mb4 COLLATE utf8mb4_croatian_ci;
      DO BENCHMARK(10000000,strcmp('dž1','dž2'));
      DO BENCHMARK(10000000,strcmp('Dž1','Dž2'));
      DO BENCHMARK(10000000,strcmp('DŽ1','DŽ2'));
      DO BENCHMARK(10000000,strcmp('lj1','lj2'));
      DO BENCHMARK(10000000,strcmp('Lj1','Lj2'));
      DO BENCHMARK(10000000,strcmp('LJ1','LJ2'));
      DO BENCHMARK(10000000,strcmp('nj1','nj2'));
      DO BENCHMARK(10000000,strcmp('Nj1','Nj2'));
      DO BENCHMARK(10000000,strcmp('NJ1','NJ2'));
      

      This is the output:

      MariaDB [test]> DO BENCHMARK(10000000,strcmp('dž1','dž2'));
      Query OK, 0 rows affected (0.397 sec)
       
      MariaDB [test]> DO BENCHMARK(10000000,strcmp('Dž1','Dž2'));
      Query OK, 0 rows affected (0.452 sec)
       
      MariaDB [test]> DO BENCHMARK(10000000,strcmp('DŽ1','DŽ2'));
      Query OK, 0 rows affected (0.500 sec)
       
      MariaDB [test]> DO BENCHMARK(10000000,strcmp('lj1','lj2'));
      Query OK, 0 rows affected (0.535 sec)
       
      MariaDB [test]> DO BENCHMARK(10000000,strcmp('Lj1','Lj2'));
      Query OK, 0 rows affected (0.593 sec)
       
      MariaDB [test]> DO BENCHMARK(10000000,strcmp('LJ1','LJ2'));
      Query OK, 0 rows affected (0.647 sec)
       
      MariaDB [test]> DO BENCHMARK(10000000,strcmp('nj1','nj2'));
      Query OK, 0 rows affected (0.700 sec)
       
      MariaDB [test]> DO BENCHMARK(10000000,strcmp('Nj1','Nj2'));
      Query OK, 0 rows affected (0.753 sec)
       
      MariaDB [test]> DO BENCHMARK(10000000,strcmp('NJ1','NJ2'));
      Query OK, 0 rows affected (0.811 sec)
      

      The summary:

      Ord  Contraction Time
      ---  ----------- ----
      0    dž          0.397
      1    Dž          0.452
      2    DŽ          0.500
      3    lj          0.535
      4    Lj          0.593
      5    LJ          0.647
      6    nj          0.700
      7    Nj          0.753
      8    NJ          0.811
      

      Proposal

      The code should be fixed to use a hash table for contractions instead of the simple list iteration.

        Attachments

          Issue Links

            Activity

              People

              Assignee:
              lstartseva Lena Startseva
              Reporter:
              bar Alexander Barkov
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

                Dates

                Created:
                Updated:

                  Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.