[MDEV-17502] Change Unicode xxx_general_ci and xxx_bin collation implementation to "inline" style
Created: 2018-10-19  Updated: 2018-10-30  Resolved: 2018-10-19

Status: Closed
Project: MariaDB Server
Component/s: Character Sets
Fix Version/s: 10.4.0

Type: Task
Priority: Major
Reporter: Alexander Barkov
Assignee: Alexander Barkov
Resolution: Fixed
Votes: 0
Labels: None

Issue Links:
Blocks
blocks MDEV-16413 test performance of distinct range qu... Closed
Relates
relates to MDEV-17474 Change Unicode collation implementati... Closed

 Description   

This task is similar to MDEV-17474, but for general_ci and _bin collations.

The current implementation, my_strnxfrm_unicode_internal(), has some bottlenecks:

  • it decodes every character through a cs->cset->mb_wc() virtual call
  • it accesses cs->state and cs->caseinfo
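The first bottleneck can be illustrated with a minimal sketch. All names below (demo_charset, demo_handler, demo_mb_wc_ascii and so on) are hypothetical, simplified stand-ins and are not MariaDB's real CHARSET_INFO structures. The sketch only contrasts a per-character call through a function pointer with a direct call to a static inline decoder, which the compiler is free to inline.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for CHARSET_INFO and its handler table. */
typedef unsigned long demo_wc_t;
struct demo_charset;
typedef int (*demo_mb_wc_fn)(const struct demo_charset *cs, demo_wc_t *pwc,
                             const unsigned char *s, const unsigned char *e);
struct demo_handler { demo_mb_wc_fn mb_wc; };
struct demo_charset { const struct demo_handler *cset; };

/* Toy decoder: accepts only 1-byte (ASCII) characters.
   Returns the number of bytes consumed, or 0 at end of input or on a
   byte it does not handle. */
static inline int demo_mb_wc_ascii(demo_wc_t *pwc, const unsigned char *s,
                                   const unsigned char *e)
{
  if (s >= e || *s >= 0x80)
    return 0;
  *pwc = *s;
  return 1;
}

/* Adapter with the "virtual" signature, reachable only via the handler table. */
static int demo_mb_wc_ascii_virtual(const struct demo_charset *cs,
                                    demo_wc_t *pwc, const unsigned char *s,
                                    const unsigned char *e)
{
  (void) cs;
  return demo_mb_wc_ascii(pwc, s, e);
}

static const struct demo_handler demo_ascii_handler = { demo_mb_wc_ascii_virtual };
const struct demo_charset demo_ascii_cs = { &demo_ascii_handler };

/* Generic loop: two pointer dereferences and an indirect call per character. */
size_t demo_numchars_virtual(const struct demo_charset *cs,
                             const unsigned char *s, const unsigned char *e)
{
  size_t n = 0;
  demo_wc_t wc;
  int len;
  assert(s <= e);
  while ((len = cs->cset->mb_wc(cs, &wc, s, e)) > 0)
  {
    s += len;
    n++;
  }
  return n;
}

/* Specialized loop: the decoder is named directly, so it can be inlined. */
size_t demo_numchars_direct(const unsigned char *s, const unsigned char *e)
{
  size_t n = 0;
  demo_wc_t wc;
  int len;
  assert(s <= e);
  while ((len = demo_mb_wc_ascii(&wc, s, e)) > 0)
  {
    s += len;
    n++;
  }
  return n;
}
```

Both loops return the same count for the same input; the difference is only in how the decoder is reached, which is the cost this task aims to remove.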

We'll change the code by adding new strnxfrm-family function templates to strings/strcoll.ic.
The functions my_strnxfrm_unicode_internal(), my_strnxfrm_unicode(), and my_strnxfrm_unicode_nopad() will move from strings/ctype-utf8.c into these function templates in strings/strcoll.ic.

Every collation will include strings/strcoll.ic and pass its specific parameters, such as the mb_wc() function and the related UNICASE data.

Additionally, we'll add fast paths for ASCII data.

After these changes, the template instantiation (e.g. for utf8_general_ci) will look like this:

#define MY_FUNCTION_NAME(x)      my_ ## x ## _utf8_general_ci
#define DEFINE_STRNXFRM_UNICODE
#define DEFINE_STRNXFRM_UNICODE_NOPAD
#define MY_MB_WC(cs, pwc, s, e)  my_mb_wc_utf8mb3_quick(pwc, s, e)
#define OPTIMIZE_ASCII           1
#define UNICASE_MAXCHAR          MY_UNICASE_INFO_DEFAULT_MAXCHAR
#define UNICASE_PAGE0            my_unicase_default_page00
#define UNICASE_PAGES            my_unicase_default_pages
...
#include "strcoll.ic"

The template included in this example will:

  • call my_mb_wc_utf8mb3_quick() directly (inlined, or at least statically bound), instead of through a virtual call.
  • use MY_UNICASE_INFO_DEFAULT_MAXCHAR, my_unicase_default_page00, and my_unicase_default_pages directly, without dereferencing members of CHARSET_INFO.
  • enable the fast path for ASCII data.
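
The macro-parameterized style described above can be sketched in a single file. This is only an illustration under stated assumptions: the decoder demo_mb_wc_utf8(), the folding helper demo_sort_weight(), the MY_SORT_WEIGHT macro, and the generated function name are hypothetical, the weights are a toy (only ASCII a-z fold to A-Z), and padding is ignored. It does not reproduce the real contents of strings/strcoll.ic.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long demo_wc_t;

/* Hypothetical UTF-8 decoder limited to 1- and 2-byte sequences.
   Returns bytes consumed, or 0 at end of input or on unsupported input. */
static inline int demo_mb_wc_utf8(demo_wc_t *pwc, const unsigned char *s,
                                  const unsigned char *e)
{
  if (s >= e)
    return 0;
  if (s[0] < 0x80)
  {
    *pwc = s[0];
    return 1;
  }
  if (s[0] >= 0xC2 && s[0] <= 0xDF && s + 1 < e && (s[1] & 0xC0) == 0x80)
  {
    *pwc = ((demo_wc_t) (s[0] & 0x1F) << 6) | (demo_wc_t) (s[1] & 0x3F);
    return 2;
  }
  return 0;
}

/* Toy case folding (an assumption, not real UNICASE data):
   only ASCII a-z fold to A-Z, everything else maps to itself. */
static inline demo_wc_t demo_sort_weight(demo_wc_t wc)
{
  return (wc >= 'a' && wc <= 'z') ? wc - ('a' - 'A') : wc;
}

/* Per-collation parameters, in the spirit of the instantiation shown above. */
#define MY_FUNCTION_NAME(x)   demo_ ## x ## _toy_general_ci
#define MY_MB_WC(pwc, s, e)   demo_mb_wc_utf8(pwc, s, e)
#define MY_SORT_WEIGHT(wc)    demo_sort_weight(wc)
#define OPTIMIZE_ASCII        1

/* "Template" body: in the real layout this part would live in the included
   file. Writes a 2-byte big-endian weight per character and returns the
   number of bytes written. */
size_t MY_FUNCTION_NAME(strnxfrm)(unsigned char *dst, size_t dstlen,
                                  const unsigned char *src, size_t srclen)
{
  unsigned char *d = dst, *de = dst + dstlen;
  const unsigned char *s = src, *se = src + srclen;
  assert(dst != NULL && src != NULL);
  while (s < se && de - d >= 2)
  {
    demo_wc_t wc;
#if OPTIMIZE_ASCII
    if (*s < 0x80)
    {
      /* Fast path: a single ASCII byte needs no decoder call. */
      wc = MY_SORT_WEIGHT((demo_wc_t) *s);
      s++;
    }
    else
#endif
    {
      int len = MY_MB_WC(&wc, s, se);
      if (len <= 0)
        break;                      /* stop at invalid or unsupported input */
      s += len;
      wc = MY_SORT_WEIGHT(wc);
    }
    *d++ = (unsigned char) (wc >> 8);
    *d++ = (unsigned char) (wc & 0xFF);
  }
  return (size_t) (d - dst);
}
```

With these definitions the generated function is named demo_strnxfrm_toy_general_ci(). Because the decoder and the folding helper are named through macros, the compiler sees direct calls and no CHARSET_INFO member is dereferenced inside the loop.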


 Comments   
Comment by Alexander Barkov [ 2018-10-19 ]

Performance statistics:

Short range searches with ORDER BY

DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (pk SERIAL, field CHAR(120) CHARACTER SET utf8 COLLATE utf8_general_ci);
INSERT INTO t1 (field) VALUES ('a'),('b'),('c'),('d');
INSERT t1 (field)
WITH  RECURSIVE int_seq AS (
  SELECT 1 AS val
  UNION ALL
  SELECT val + 1
  FROM int_seq
  WHERE val < 1000
) SELECT 'a' FROM int_seq;
 
DROP PROCEDURE IF EXISTS p1;
DELIMITER $$
CREATE PROCEDURE p1()
BEGIN
  DECLARE a INT DEFAULT 100000;
  WHILE (a > 0)
  DO
    SELECT DISTINCT field INTO @a FROM t1 WHERE pk BETWEEN 1 AND 11 ORDER BY field LIMIT 1;
    SET a=a-1;
  END WHILE;
END;
$$
DELIMITER ;
CALL p1;

  • 7.99 sec - MySQL-8.0
  • 7.83 sec - MariaDB-10.4 before MDEV-17502
  • 7.59 sec - MariaDB-10.4 after MDEV-17502

Micro benchmark for WEIGHT_STRING() for utf8_general_ci

SET NAMES utf8 COLLATE utf8_general_ci;
SET @a=CONCAT('a', REPEAT(' ',359));
SELECT BENCHMARK(500000, WEIGHT_STRING(@a,1024,960,128));

  • 0.74 sec - MySQL-8.0
  • 0.84 sec - MariaDB-10.4 before MDEV-17502
  • 0.41 sec - MariaDB-10.4 after MDEV-17502

Micro benchmark for WEIGHT_STRING() for utf8_bin

SET NAMES utf8 COLLATE utf8_bin;
SET @a=CONCAT('a', REPEAT(' ',359));
SELECT BENCHMARK(500000, WEIGHT_STRING(@a,1024,960,128));

  • 0.55 sec - MySQL-8.0
  • 0.60 sec - MariaDB-10.4 before MDEV-17502
  • 0.37 sec - MariaDB-10.4 after MDEV-17502