[MDEV-17474] Change Unicode collation implementation from "handler" to "inline" style Created: 2018-10-16  Updated: 2018-10-19  Resolved: 2018-10-18

Status: Closed
Project: MariaDB Server
Component/s: Character Sets
Fix Version/s: 10.4.0

Type: Task
Priority: Major
Reporter: Alexander Barkov
Assignee: Alexander Barkov
Resolution: Fixed
Votes: 0
Labels: None

Issue Links:
Blocks
blocks MDEV-16413 test performance of distinct range qu... Closed
Relates
relates to MDEV-17502 Change Unicode xxx_general_ci and xxx... Closed

 Description   

axel found a performance bottleneck in the Unicode collation implementation (MDEV-16413).

Under this task, we'll reorganize the code implementing Unicode collations (such as utf8_unicode_ci) into a new style that replaces virtual function calls with inlined code:

The old style

  • There is one copy of every collation function (strnncoll(), strnncollsp(), hash_sort(), strnxfrm()), e.g.:

    static int my_strnncollsp_uca(CHARSET_INFO *cs, 
                                  my_uca_scanner_handler *scanner_handler,
                                  const uchar *s, size_t slen,
                                  const uchar *t, size_t tlen)
    

  • Character set-specific routines are passed in scanner_handler
  • scanner_handler->next() is called virtually (i.e. via a pointer to a function)
  • scanner_handler->next() itself calls cs->cset->mb_wc() virtually
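
To make the cost of the old style concrete, here is a minimal sketch of handler-style dispatch. All names (demo_charset, demo_scanner_handler, demo_strnncoll and so on) are hypothetical stand-ins, not the real MariaDB structures, and the "weight" is just a case-folded ASCII code. The point is only that every character costs two calls through function pointers, which the compiler generally cannot inline.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the real structures. */
typedef struct demo_charset demo_charset;
typedef struct demo_scanner demo_scanner;

/* Decodes one character; returns bytes consumed, 0 at end or on error. */
typedef int (*demo_mb_wc_fn)(unsigned long *wc,
                             const unsigned char *s, const unsigned char *e);

struct demo_charset { demo_mb_wc_fn mb_wc; };

struct demo_scanner {
  const demo_charset *cs;
  const unsigned char *pos, *end;
};

typedef struct {
  int (*next)(demo_scanner *scanner);  /* next weight, or -1 at end */
} demo_scanner_handler;

/* Toy decoder: accepts 7-bit ASCII only. */
static int demo_mb_wc_ascii(unsigned long *wc,
                            const unsigned char *s, const unsigned char *e)
{
  if (s >= e || *s >= 0x80) return 0;
  *wc = *s;
  return 1;
}

static int demo_scanner_next(demo_scanner *scanner)
{
  unsigned long wc;
  /* Indirect call #2: the decoder is reached through cs->mb_wc. */
  int len = scanner->cs->mb_wc(&wc, scanner->pos, scanner->end);
  if (len <= 0) return -1;
  scanner->pos += len;
  /* Toy weight: fold ASCII lower case to upper case. */
  return (wc >= 'a' && wc <= 'z') ? (int) (wc - 'a' + 'A') : (int) wc;
}

/* One shared comparison function for all character sets. */
static int demo_strnncoll(const demo_charset *cs,
                          const demo_scanner_handler *handler,
                          const unsigned char *s, size_t slen,
                          const unsigned char *t, size_t tlen)
{
  demo_scanner a = { cs, s, s + slen };
  demo_scanner b = { cs, t, t + tlen };
  for (;;) {
    int wa = handler->next(&a);  /* indirect call #1 */
    int wb = handler->next(&b);
    if (wa != wb) return wa < wb ? -1 : 1;
    if (wa < 0) return 0;        /* both strings ended together */
  }
}

static const demo_charset demo_cs = { demo_mb_wc_ascii };
static const demo_scanner_handler demo_handler = { demo_scanner_next };
```

A caller would pass &demo_cs and &demo_handler together with the two byte strings; the result is negative, zero or positive in the usual strcmp() sense.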

The new style

  • There are multiple implementations of the functions, one function per character set.
  • There is a shared file ctype-uca.ic, which is included multiple times, once per character set.
  • Character set-specific information is passed in macros:

    #include "ctype-utf8.h"
    #define MY_FUNCTION_NAME(x)   my_uca_ ## x ## _utf8mb3
    #define MY_MB_WC(scanner, wc, beg, end) (my_mb_wc_utf8mb3_quick(wc, beg, end))
    #define MY_LIKE_RANGE my_like_range_mb
    #include "ctype-uca.ic"
    

  • There are inline my_mb_wc_CSNAME_quick() implementations in new header files: ctype-utf8.h, ctype-ucs2.h, ctype-utf16.h, ctype-utf32.h
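
The new style can be sketched in a single file as follows. This is an illustration under stated assumptions, not the actual ctype-uca.ic: the real code includes a shared file several times, whereas this sketch uses a function-generating macro (DEMO_DEFINE_STRNNCOLL, a hypothetical name) so that it stays self-contained. The decoders and the weight function are toy stand-ins as well. The effect is similar: each character set gets its own copy of the comparison function, and the decoder call inside it is a direct call to a static inline function.

```c
#include <stddef.h>

/* Toy inline decoders, hypothetical stand-ins for my_mb_wc_CSNAME_quick(). */
static inline int demo_mb_wc_latin1(unsigned long *wc,
                                    const unsigned char *s,
                                    const unsigned char *e)
{
  if (s >= e) return 0;
  *wc = *s;
  return 1;
}

static inline int demo_mb_wc_ucs2(unsigned long *wc,
                                  const unsigned char *s,
                                  const unsigned char *e)
{
  if (e - s < 2) return 0;
  *wc = ((unsigned long) s[0] << 8) | s[1];  /* big-endian 16-bit unit */
  return 2;
}

/* Toy weight: fold ASCII lower case to upper case. */
static inline int demo_weight(unsigned long wc)
{
  return (wc >= 'a' && wc <= 'z') ? (int) (wc - 'a' + 'A') : (int) wc;
}

/* Plays the role of the shared template: expands to one comparison
   function per character set, with the decoder named at compile time. */
#define DEMO_DEFINE_STRNNCOLL(FUNC_NAME, MB_WC)                     \
static int FUNC_NAME(const unsigned char *s, size_t slen,           \
                     const unsigned char *t, size_t tlen)           \
{                                                                   \
  const unsigned char *se = s + slen, *te = t + tlen;               \
  for (;;) {                                                        \
    unsigned long a = 0, b = 0;                                     \
    int alen = MB_WC(&a, s, se);   /* direct call, inlinable */     \
    int blen = MB_WC(&b, t, te);                                    \
    int wa = alen > 0 ? demo_weight(a) : -1;                        \
    int wb = blen > 0 ? demo_weight(b) : -1;                        \
    if (wa != wb) return wa < wb ? -1 : 1;                          \
    if (wa < 0) return 0;          /* both strings ended */         \
    s += alen;                                                      \
    t += blen;                                                      \
  }                                                                 \
}

/* One instantiation per character set. */
DEMO_DEFINE_STRNNCOLL(demo_strnncoll_latin1, demo_mb_wc_latin1)
DEMO_DEFINE_STRNNCOLL(demo_strnncoll_ucs2, demo_mb_wc_ucs2)
```

The trade-off is the one described in this task: two instantiations mean two copies of the loop in the binary, in exchange for no indirect calls inside it.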

The old version generated a smaller amount of executable code, but was slower.
The new version will generate more code, but will be much faster: there will be no virtual function calls at all. All calls inside the new functions will be either inlined or at least resolved statically.

Part#2: additional changes:

  • Add fast paths to handle ASCII characters
  • Add dedicated MY_COLLATION_HANDLERs for collations with no contractions (for the utf8 and utf8mb4 character sets). The choice between the full-featured handler and the "no contraction" handler should be made at collation initialization time.
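
Both Part#2 ideas can be sketched together. The code below is a toy model under loud assumptions: demo_collation, demo_collation_handler and demo_collation_init are hypothetical names, the weights are invented (twice the case-folded code point), the only contraction is a Spanish-style "ch", and the slow path handles two-byte UTF-8 sequences only. It shows an ASCII fast path that skips multi-byte decoding, and a handler that is picked once at initialization so the comparison loop never tests for contractions at run time unless the collation has them.

```c
#include <stddef.h>

typedef struct {
  int (*strnncoll)(const unsigned char *s, size_t slen,
                   const unsigned char *t, size_t tlen);
} demo_collation_handler;

typedef struct {
  int has_contractions;
  const demo_collation_handler *handler;  /* chosen once, at init time */
} demo_collation;

static inline int demo_fold(unsigned long wc)
{
  return (wc >= 'a' && wc <= 'z') ? (int) (wc - 'a' + 'A') : (int) wc;
}

/* Returns the next weight and advances *pos; -1 at end of input.
   Malformed input is also treated as end of input in this sketch. */
static inline int demo_next_weight(const unsigned char **pos,
                                   const unsigned char *end,
                                   int contractions)
{
  const unsigned char *s = *pos;
  if (s >= end) return -1;
  if (s[0] < 0x80) {                       /* fast path: ASCII */
    if (contractions && end - s >= 2 &&
        (s[0] | 0x20) == 'c' && (s[1] | 0x20) == 'h') {
      *pos = s + 2;
      return 2 * 'C' + 1;                  /* "ch" sorts between c and d */
    }
    *pos = s + 1;
    return 2 * demo_fold(s[0]);
  }
  /* Slow path: two-byte UTF-8 sequences only, for brevity. */
  if (s[0] >= 0xC2 && s[0] <= 0xDF && end - s >= 2 &&
      (s[1] & 0xC0) == 0x80) {
    *pos = s + 2;
    return 2 * (int) (((unsigned long) (s[0] & 0x1F) << 6) | (s[1] & 0x3F));
  }
  return -1;
}

static inline int demo_compare(const unsigned char *s, size_t slen,
                               const unsigned char *t, size_t tlen,
                               int contractions)
{
  const unsigned char *se = s + slen, *te = t + tlen;
  for (;;) {
    int wa = demo_next_weight(&s, se, contractions);
    int wb = demo_next_weight(&t, te, contractions);
    if (wa != wb) return wa < wb ? -1 : 1;
    if (wa < 0) return 0;
  }
}

/* Two thin entry points; the constant flag lets a compiler drop the
   contraction check from the "no contraction" copy. */
static int demo_strnncoll_full(const unsigned char *s, size_t slen,
                               const unsigned char *t, size_t tlen)
{ return demo_compare(s, slen, t, tlen, 1); }

static int demo_strnncoll_nocontr(const unsigned char *s, size_t slen,
                                  const unsigned char *t, size_t tlen)
{ return demo_compare(s, slen, t, tlen, 0); }

static const demo_collation_handler demo_handler_full = { demo_strnncoll_full };
static const demo_collation_handler demo_handler_nocontr = { demo_strnncoll_nocontr };

/* The handler choice happens here, not inside the comparison loop. */
static void demo_collation_init(demo_collation *coll, int has_contractions)
{
  coll->has_contractions = has_contractions;
  coll->handler = has_contractions ? &demo_handler_full
                                   : &demo_handler_nocontr;
}
```

With this toy data, a collation initialized without contractions orders "ch" before "cz", while one initialized with contractions orders "ch" after "cz", because "ch" is then a single collating element.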


 Comments   
Comment by Alexander Barkov [ 2018-10-17 ]

Performance comparison (after the main change and part#2):

utf8_unicode_ci

SET NAMES utf8 COLLATE utf8_unicode_ci;
SELECT BENCHMARK(5000000,'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' < 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbb') AS a;

MySQL-5.7     3.33 sec
MySQL-8.0     2.31 sec
MariaDB-10.4  2.79 sec  -- before the change
MariaDB-10.4  1.27 sec  -- after the change

utf8_german2_ci

SET NAMES utf8 COLLATE utf8_german2_ci;
SELECT BENCHMARK(5000000,'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' < 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbb') AS a;

MySQL-5.7     3.21 sec
MySQL-8.0     2.25 sec
MariaDB-10.4  2.79 sec  -- before the change
MariaDB-10.4  1.27 sec  -- after the change

utf8_spanish2_ci

SET NAMES utf8 COLLATE utf8_spanish2_ci;
SELECT BENCHMARK(5000000,'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' < 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbb') AS a;

MySQL-5.7     4.62 sec
MySQL-8.0     2.55 sec
MariaDB-10.4  3.45 sec  -- before the change
MariaDB-10.4  1.78 sec  -- after the change

utf8_thai_520_w2 (difference on the primary level)

SET NAMES utf8 COLLATE utf8_thai_520_w2;
SELECT BENCHMARK(5000000,'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' <
                         'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab') AS a;

MariaDB-10.4  3.78 sec  -- before the change
MariaDB-10.4  2.33 sec  -- after the change

utf8_thai_520_w2 (equality on the primary level)

SET NAMES utf8 COLLATE utf8_thai_520_w2;
SELECT BENCHMARK(5000000,'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' <
                         'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa') AS a;

MariaDB-10.4  7.28 sec  -- before the change
MariaDB-10.4  4.71 sec  -- after the change

utf8mb4_unicode_ci

SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci;
SELECT BENCHMARK(5000000,'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' < 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbb') AS a;

MySQL-5.7     3.20 sec
MySQL-8.0     1.59 sec
MariaDB-10.4  2.79 sec  -- before the change
MariaDB-10.4  1.25 sec  -- after the change

utf8mb4_german2_ci

SET NAMES utf8mb4 COLLATE utf8mb4_german2_ci;
SELECT BENCHMARK(5000000,'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' < 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbb') AS a;

MySQL-5.7     3.19 sec
MySQL-8.0     1.58 sec
MariaDB-10.4  2.80 sec  -- before the change
MariaDB-10.4  1.29 sec  -- after the change

utf8mb4_spanish2_ci

SET NAMES utf8mb4 COLLATE utf8mb4_spanish2_ci;
SELECT BENCHMARK(5000000,'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' < 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbb') AS a;

MySQL-5.7     4.41 sec
MySQL-8.0     1.68 sec
MariaDB-10.4  3.45 sec  -- before the change
MariaDB-10.4  1.76 sec  -- after the change

utf8mb4_thai_520_w2 (difference on the primary level)

SET NAMES utf8mb4 COLLATE utf8mb4_thai_520_w2;
SELECT BENCHMARK(5000000,'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' <
                         'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab') AS a;

MariaDB-10.4  3.49 sec  -- before the change
MariaDB-10.4  2.31 sec  -- after the change

utf8mb4_thai_520_w2 (equality on the primary level)

SET NAMES utf8mb4 COLLATE utf8mb4_thai_520_w2;
SELECT BENCHMARK(5000000,'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' <
                         'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa') AS a;

MariaDB-10.4  7.00 sec  -- before the change
MariaDB-10.4  4.61 sec  -- after the change

Generated at Thu Feb 08 08:36:44 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.