[MDEV-4425] REGEXP enhancements - Jira

Details

Type: Task
Status: Closed (View Workflow)
Priority: Major
Resolution: Fixed
Fix Version/s: 10.0.5
Component/s: None
Labels:
- gsoc13

Description

MariaDB has the RLIKE predicate which allows to specify a
regular expression pattern for a complex search, e.g.:

 SELECT * FROM t1 WHERE col1 RLIKE 'pattern';

RLIKE uses the Henry Spencer's implementation of regular expressions,
which is aimed at conformance with POSIX 1003.2.
This library has a number of limitations:

it works byte-wise thus supports only 8-bit character sets
it does not work well with accent insensitive MariaDB collations
it does not have features that present in the modern regex
libraries such as look-ahead, look-behind, non-greedy modifier,
recursion, etc.

The goal of this task is to replace the old regex library
with some modern Unicode-aware library.

Current implementation pointers

The current regexp library is bundled with the MariaDB sources
and resides in the regex/ sub-directory.
The RLIKE predicate is implemented with help of the class Item_func_regex
which is defined in the files sql/item_cmpfunc.h and
sql/item_cmpfunc.cc

Possible library candidates

The list of the possible candidates is a subject to extending.
Currently we consider the following libraries:

mb_regex (from PHP)
TRE
RE2
pcre
ICU regex
oniguruma

One of the goals of the task is to choose the library which
suites better our needs, taking into account that it will
need some coding to integrate with MariaDB character sets and collations.

The implementer should provide feature and performance comparison
and proves that the library of the choice suites our needs better
than the other candidates.

Library License requirements

The library should be distributed under some permissive open source license,
such as LGPL, BSD, MIT, etc. GPL is not desirable.

Library portability requirements

The library should be a thread-safe C or C++ library that compiles
on all operating systems supported by MariaDB
(e.g. Linux, Windows, *BSD, Mac, Solaris, etc).

Library technical requirements

After replacing the regexp library, the RLIKE predicate should be able to:

1. Support modern regexp features.

look-ahead
look-behind
non-greedy modifiers
recursion would be nice
Unicode features [10] would be nice:
- Unicode code point specification
- Unicode Character Properties
- Unicode Scripts

2. Work with all MariaDB character sets

8bit character sets (e.g. latin1, cp1250, koi8r, etc)
Unicode character sets: utf8, utf16, utf16le, utf32
Chinese-Japanese-Korean (CJK) multi-byte character sets: sjis, cp932, ujis, eucjpms, gbk, gb2312, euckr

Preferably, the library should be able to handle all character sets natively.
Another way would be to convert non-Unicode data to Unicode and then pass to
the library. For example, if a library supports utf16le but does not support
cp932, we can convert cp932 to utf16le, and then pass utf16le data to the
library.

3. Follow the case sensitivity rules defined by the MariaDB collations.

That is, take into account case sensitivity of the collation of the operation:

SELECT 'o' RLIKE 'O' COLLATE utf8_general_ci;   -> true

SELECT 'o' RLIKE 'O' COLLATE utf8_bin;          -> false

It was decided to use PCRE as a new library

The REGEXP_LIKE function (deferred)

This functions won't be implemented under terms of this MDEV. See the explanation in the end of this section, or just skip this section
Currently MariaDB supports regexp based
comparison in the form of the RLIKE predicate:

  SELECT a RLIKE b;

which can only accept two arguments.

This task should introduce a new function REGEXP_LIKE with
the optional third parameter:

REGEXP_LIKE(text, pattern [, mode])

If the mode parameter is not specified, then the behavior of REGEXP_LIKE
should be similar to RLIKE, so these two expressions bring the same result:

  SELECT 'text' RLIKE 'pattern';

  SELECT REGEXP_LIKE('text', 'pattern');

The mode argument of REGEXP_LIKE will be a string literal
that allows to change the default matching behavior.
It will understand one or more letter flags as follows:

`n`	force dot (`.`) to match a newline character (by default dot should not match newline).
`m`	treat the source string as multiple lines, so `^` and `$` match the beginning and the end of any line inside text. The default behavior should be to match `^` and `$` only at the beginning and the end of the entire text value.
`x`	ignore white-spaces (by default white-spaces should match themselves).
`i`	force case insensitive match (even if the collation is case sensitive)
`c`	force case sensitive match (even if the collation is case insensitive)

Examples:

SELECT REGEXP_LIKE('a\nb', 'a.b');           -> false

SELECT REGEXP_LIKE('a\nb', 'a.b', 'n');      -> true

SELECT REGEXP_LIKE('a\nb', '^a$');           -> false

SELECT REGEXP_LIKE('a\nb', '^a$', 'm');      -> true

SELECT REGEXP_LIKE('a b',' ab');             -> false

SELECT REGEXP_LIKE('a b',' ab','x');         -> true

SELECT REGEXP_LIKE('a b\nc d', '^cd$', 'mx') -> true

Note 1: Some other letter flags may be added as well, depending on features
provided by the library of the choice. This will be discussed after
the decision of which library to use is made.

Note 2: mode will be a string literal (i.e. a constant).
Expressions will not be supported. The parser will return
a syntax error on attempt to use non-literal in the third parameter
to REGEXP_LIKE:

  SELECT REGEXP_LIKE(column1, column2, 'mx') FROM t1;    -> OK

but

  SELECT REGEXP_LIKE(column1, column2, column3) FROM t1; ->

  ERROR: You have an error in your SQL syntax at 'column3) FROM t1'

Note 3: If some unknown letter is given in the 'mode' parameter to
REGEXP_LIKE, then an error should be returned:

  SELECT REGEXP_LIKE(column1, column2, 'U') FROM t1;  ->

  ERROR: Incorrect argument to REGEXP_LIKE

Note

It was decided not to implement the REGEXP_LIKE function,
because PCRE supports flags in the pattern syntax. There is
no a need for an extra argument that would contain flags.

For example:

mysql> SELECT 'a' RLIKE 'A' AS case_insensitive, 'a' RLIKE '(?-i)A' AS case_sensitive;

+------------------+----------------+

| case_insensitive | case_sensitive |

+------------------+----------------+

|                1 |              0 |

+------------------+----------------+

1 row in set (0.00 sec)

The REGEXP_REPLACE function

Additionally, a much requested REGEXP_REPLACE function should be implemented
under terms of this task, with the following arguments:

REGEXP_REPLACE(text, pattern, replace [,position [,occurrence [,mode]])

where:

`text`	the search value (string)
`pattern`	the regular expression (string)
`replace`	the replacement (string), with sub-expressions in the form of `\n`, where `n` is a number from 1 to 9.
`position`	a positive integer indicating the position in text where to start search from. The default value will be 1 (the first character).
`occurrence`	a non-negative number indicating the occurrence to replace: 0 to replace all occurrences, or a positive number n to replace the n-th occurrence.
`mode`	a string literal that changes the default match behavior, with exactly the same set of flag letters that REGEXP_LIKE has.

The first version will be implemented only with the first three parameters "text", "pattern" and "replace" and won't include the optional parameters "position", "occurrence" and "mode", and it will replace all occurrences of the searched pattern, starting from the beginning of the text

Examples:

SELECT REGEXP_REPLACE('James Bond', '(.*) (.*)', '\\2, \\1'); -> 'Bond, James'

The REGEXP_INSTR function

The REGEXP_REPLACE function will be implemented,
with the following arguments:

REGEXP_INST(text, pattern)

It will return the position of the regexp "pattern" inside "text".
The positions will start with 1.

Examples:

mysql> SELECT REGEXP_INSTR('abc2xyz','[0-9]');

+---------------------------------+

| REGEXP_INSTR('abc2xyz','[0-9]') |

+---------------------------------+

|                               4 |

+---------------------------------+

1 row in set (0.00 sec)

The REGEXP_SUBSTR function

The REGEXP_SUBSTR function will be be implemented,
with the following arguments:

REGEXP_SUBSTR(text, pattern)

It will return the part of "text" that matches "pattern".

Examples:

mysql> SELECT REGEXP_SUBSTR('abc123xyz','[0-9]+');

+-------------------------------------+

| REGEXP_SUBSTR('abc123xyz','[0-9]+') |

+-------------------------------------+

| 123                                 |

+-------------------------------------+

1 row in set (0.00 sec)

MariaDB Server

REGEXP enhancements

Details

Description

Current implementation pointers

Possible library candidates

Library License requirements

Library portability requirements

Library technical requirements

1. Support modern regexp features.

2. Work with all MariaDB character sets

3. Follow the case sensitivity rules defined by the MariaDB collations.

The REGEXP_LIKE function (deferred)

Examples:

The REGEXP_REPLACE function

Examples:

The REGEXP_INSTR function

Examples:

The REGEXP_SUBSTR function

Examples:

Links

Attachments

Activity

People

Dates

Git Integration