Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-4425

REGEXP enhancements

    XMLWordPrintable

Details

    • Task
    • Status: Closed (View Workflow)
    • Major
    • Resolution: Fixed
    • 10.0.5
    • None

    Description

      MariaDB has the RLIKE predicate which allows to specify a
      regular expression pattern for a complex search, e.g.:

       SELECT * FROM t1 WHERE col1 RLIKE 'pattern';

      RLIKE uses the Henry Spencer's implementation of regular expressions,
      which is aimed at conformance with POSIX 1003.2.
      This library has a number of limitations:

      • it works byte-wise thus supports only 8-bit character sets
      • it does not work well with accent insensitive MariaDB collations
      • it does not have features that present in the modern regex
        libraries such as look-ahead, look-behind, non-greedy modifier,
        recursion, etc.

      The goal of this task is to replace the old regex library
      with some modern Unicode-aware library.

      Current implementation pointers

      The current regexp library is bundled with the MariaDB sources
      and resides in the regex/ sub-directory.
      The RLIKE predicate is implemented with help of the class Item_func_regex
      which is defined in the files sql/item_cmpfunc.h and
      sql/item_cmpfunc.cc

      Possible library candidates

      The list of the possible candidates is a subject to extending.
      Currently we consider the following libraries:

      • mb_regex (from PHP)
      • TRE
      • RE2
      • pcre
      • ICU regex
      • oniguruma

      One of the goals of the task is to choose the library which
      suites better our needs, taking into account that it will
      need some coding to integrate with MariaDB character sets and collations.

      The implementer should provide feature and performance comparison
      and proves that the library of the choice suites our needs better
      than the other candidates.

      Library License requirements

      The library should be distributed under some permissive open source license,
      such as LGPL, BSD, MIT, etc. GPL is not desirable.

      Library portability requirements

      The library should be a thread-safe C or C++ library that compiles
      on all operating systems supported by MariaDB
      (e.g. Linux, Windows, *BSD, Mac, Solaris, etc).

      Library technical requirements

      After replacing the regexp library, the RLIKE predicate should be able to:

      1. Support modern regexp features.

      • look-ahead
      • look-behind
      • non-greedy modifiers
      • recursion would be nice
      • Unicode features [10] would be nice:
        • Unicode code point specification
        • Unicode Character Properties
        • Unicode Scripts

      2. Work with all MariaDB character sets

      • 8bit character sets (e.g. latin1, cp1250, koi8r, etc)
      • Unicode character sets: utf8, utf16, utf16le, utf32
      • Chinese-Japanese-Korean (CJK) multi-byte character sets: sjis, cp932, ujis, eucjpms, gbk, gb2312, euckr

      Preferably, the library should be able to handle all character sets natively.
      Another way would be to convert non-Unicode data to Unicode and then pass to
      the library. For example, if a library supports utf16le but does not support
      cp932, we can convert cp932 to utf16le, and then pass utf16le data to the
      library.

      3. Follow the case sensitivity rules defined by the MariaDB collations.

      That is, take into account case sensitivity of the collation of the operation:

      SELECT 'o' RLIKE 'O' COLLATE utf8_general_ci;   -> true
      SELECT 'o' RLIKE 'O' COLLATE utf8_bin;          -> false

      It was decided to use PCRE as a new library

      The REGEXP_LIKE function (deferred)

      This functions won't be implemented under terms of this MDEV. See the explanation in the end of this section, or just skip this section
      Currently MariaDB supports regexp based
      comparison in the form of the RLIKE predicate:

        SELECT a RLIKE b;

      which can only accept two arguments.

      This task should introduce a new function REGEXP_LIKE with
      the optional third parameter:

      REGEXP_LIKE(text, pattern [, mode])

      If the mode parameter is not specified, then the behavior of REGEXP_LIKE
      should be similar to RLIKE, so these two expressions bring the same result:

        SELECT 'text' RLIKE 'pattern';
        SELECT REGEXP_LIKE('text', 'pattern');

      The mode argument of REGEXP_LIKE will be a string literal
      that allows to change the default matching behavior.
      It will understand one or more letter flags as follows:

      n force dot (.) to match a newline character (by default dot should not match newline).
      m treat the source string as multiple lines, so ^ and $ match the beginning and the end of any line inside text. The default behavior should be to match ^ and $ only at the beginning and the end of the entire text value.
      x ignore white-spaces (by default white-spaces should match themselves).
      i force case insensitive match (even if the collation is case sensitive)
      c force case sensitive match (even if the collation is case insensitive)

      Examples:

      SELECT REGEXP_LIKE('a\nb', 'a.b');           -> false
      SELECT REGEXP_LIKE('a\nb', 'a.b', 'n');      -> true
      SELECT REGEXP_LIKE('a\nb', '^a$');           -> false
      SELECT REGEXP_LIKE('a\nb', '^a$', 'm');      -> true
      SELECT REGEXP_LIKE('a b',' ab');             -> false
      SELECT REGEXP_LIKE('a b',' ab','x');         -> true
      SELECT REGEXP_LIKE('a b\nc d', '^cd$', 'mx') -> true

      Note 1: Some other letter flags may be added as well, depending on features
      provided by the library of the choice. This will be discussed after
      the decision of which library to use is made.

      Note 2: mode will be a string literal (i.e. a constant).
      Expressions will not be supported. The parser will return
      a syntax error on attempt to use non-literal in the third parameter
      to REGEXP_LIKE:

        SELECT REGEXP_LIKE(column1, column2, 'mx') FROM t1;    -> OK

      but

        SELECT REGEXP_LIKE(column1, column2, column3) FROM t1; ->
        ERROR: You have an error in your SQL syntax at 'column3) FROM t1'

      Note 3: If some unknown letter is given in the 'mode' parameter to
      REGEXP_LIKE, then an error should be returned:

        SELECT REGEXP_LIKE(column1, column2, 'U') FROM t1;  ->
        ERROR: Incorrect argument to REGEXP_LIKE

      Note

      It was decided not to implement the REGEXP_LIKE function,
      because PCRE supports flags in the pattern syntax. There is
      no a need for an extra argument that would contain flags.

      For example:

      mysql> SELECT 'a' RLIKE 'A' AS case_insensitive, 'a' RLIKE '(?-i)A' AS case_sensitive;
      +------------------+----------------+
      | case_insensitive | case_sensitive |
      +------------------+----------------+
      |                1 |              0 |
      +------------------+----------------+
      1 row in set (0.00 sec)

      The REGEXP_REPLACE function

      Additionally, a much requested REGEXP_REPLACE function should be implemented
      under terms of this task, with the following arguments:

      REGEXP_REPLACE(text, pattern, replace [,position [,occurrence [,mode]])

      where:

      text the search value (string)
      pattern the regular expression (string)
      replace the replacement (string), with sub-expressions in the form of \n, where n is a number from 1 to 9.
      position a positive integer indicating the position in text where to start search from. The default value will be 1 (the first character).
      occurrence a non-negative number indicating the occurrence to replace: 0 to replace all occurrences, or a positive number n to replace the n-th occurrence.
      mode a string literal that changes the default match behavior, with exactly the same set of flag letters that REGEXP_LIKE has.

      The first version will be implemented only with the first three parameters "text", "pattern" and "replace" and won't include the optional parameters "position", "occurrence" and "mode", and it will replace all occurrences of the searched pattern, starting from the beginning of the text

      Examples:

      SELECT REGEXP_REPLACE('James Bond', '(.*) (.*)', '\\2, \\1'); -> 'Bond, James'

      The REGEXP_INSTR function

      The REGEXP_REPLACE function will be implemented,
      with the following arguments:

      REGEXP_INST(text, pattern)

      It will return the position of the regexp "pattern" inside "text".
      The positions will start with 1.

      Examples:

      mysql> SELECT REGEXP_INSTR('abc2xyz','[0-9]');
      +---------------------------------+
      | REGEXP_INSTR('abc2xyz','[0-9]') |
      +---------------------------------+
      |                               4 |
      +---------------------------------+
      1 row in set (0.00 sec)

      The REGEXP_SUBSTR function

      The REGEXP_SUBSTR function will be be implemented,
      with the following arguments:

      REGEXP_SUBSTR(text, pattern)

      It will return the part of "text" that matches "pattern".

      Examples:

      mysql> SELECT REGEXP_SUBSTR('abc123xyz','[0-9]+');
      +-------------------------------------+
      | REGEXP_SUBSTR('abc123xyz','[0-9]+') |
      +-------------------------------------+
      | 123                                 |
      +-------------------------------------+
      1 row in set (0.00 sec)

      Links

      1. The description of the RLIKE predicate in the MySQL manual:
        http://dev.mysql.com/doc/refman/5.6/en/regexp.html
      2. pcre:
        http://www.pcre.org/\\http://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expressions
      3. Oniguruma:
        http://www.geocities.jp/kosako3/oniguruma/\\http://en.wikipedia.org/wiki/Oniguruma
      4. Regular expression UDFs for MySQL
        http://sourceforge.net/projects/udf-regexp/ https://launchpad.net/mysql-udf-regexp
      5. TRE:
        http://laurikari.net/tre/\\http://en.wikipedia.org/wiki/TRE_%28computing%29
      6. RE2:
        http://code.google.com/p/re2/
      7. ICU regexp:
        http://userguide.icu-project.org/strings/regexp
      8. Comparison of a number of regexp implementations:
        http://en.wikipedia.org/wiki/Comparison_of_regular_expression_engines
      9. Benchmark of Regex Libraries
        http://lh3lh3.users.sourceforge.net/reb.shtml
      10. Unicode features in regexp:
        http://www.regular-expressions.info/unicode.html
      11. Performance comparison of regular expression engines:
        http://sljit.sourceforge.net/regex_perf.html
      12. Regular Expression Performance Comparison:
        http://www.boost.org/doc/libs/1_41_0/libs/regex/doc/gcc-performance.html

      Attachments

        Activity

          People

            bar Alexander Barkov
            bar Alexander Barkov
            Votes:
            1 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.