Details
-
Task
-
Status: Closed (View Workflow)
-
Major
-
Resolution: Fixed
-
None
Description
MariaDB has the RLIKE predicate which allows to specify a
regular expression pattern for a complex search, e.g.:
SELECT * FROM t1 WHERE col1 RLIKE 'pattern';
|
RLIKE uses the Henry Spencer's implementation of regular expressions,
which is aimed at conformance with POSIX 1003.2.
This library has a number of limitations:
- it works byte-wise thus supports only 8-bit character sets
- it does not work well with accent insensitive MariaDB collations
- it does not have features that present in the modern regex
libraries such as look-ahead, look-behind, non-greedy modifier,
recursion, etc.
The goal of this task is to replace the old regex library
with some modern Unicode-aware library.
Current implementation pointers
The current regexp library is bundled with the MariaDB sources
and resides in the regex/ sub-directory.
The RLIKE predicate is implemented with help of the class Item_func_regex
which is defined in the files sql/item_cmpfunc.h and
sql/item_cmpfunc.cc
Possible library candidates
The list of the possible candidates is a subject to extending.
Currently we consider the following libraries:
- mb_regex (from PHP)
- TRE
- RE2
- pcre
- ICU regex
- oniguruma
One of the goals of the task is to choose the library which
suites better our needs, taking into account that it will
need some coding to integrate with MariaDB character sets and collations.
The implementer should provide feature and performance comparison
and proves that the library of the choice suites our needs better
than the other candidates.
Library License requirements
The library should be distributed under some permissive open source license,
such as LGPL, BSD, MIT, etc. GPL is not desirable.
Library portability requirements
The library should be a thread-safe C or C++ library that compiles
on all operating systems supported by MariaDB
(e.g. Linux, Windows, *BSD, Mac, Solaris, etc).
Library technical requirements
After replacing the regexp library, the RLIKE predicate should be able to:
1. Support modern regexp features.
- look-ahead
- look-behind
- non-greedy modifiers
- recursion would be nice
- Unicode features [10] would be nice:
- Unicode code point specification
- Unicode Character Properties
- Unicode Scripts
2. Work with all MariaDB character sets
- 8bit character sets (e.g. latin1, cp1250, koi8r, etc)
- Unicode character sets: utf8, utf16, utf16le, utf32
- Chinese-Japanese-Korean (CJK) multi-byte character sets: sjis, cp932, ujis, eucjpms, gbk, gb2312, euckr
Preferably, the library should be able to handle all character sets natively.
Another way would be to convert non-Unicode data to Unicode and then pass to
the library. For example, if a library supports utf16le but does not support
cp932, we can convert cp932 to utf16le, and then pass utf16le data to the
library.
3. Follow the case sensitivity rules defined by the MariaDB collations.
That is, take into account case sensitivity of the collation of the operation:
SELECT 'o' RLIKE 'O' COLLATE utf8_general_ci; -> true
|
SELECT 'o' RLIKE 'O' COLLATE utf8_bin; -> false
|
It was decided to use PCRE as a new library
The REGEXP_LIKE function (deferred)
This functions won't be implemented under terms of this MDEV. See the explanation in the end of this section, or just skip this section
Currently MariaDB supports regexp based
comparison in the form of the RLIKE predicate:
SELECT a RLIKE b;
|
which can only accept two arguments.
This task should introduce a new function REGEXP_LIKE with
the optional third parameter:
REGEXP_LIKE(text, pattern [, mode])
|
If the mode parameter is not specified, then the behavior of REGEXP_LIKE
should be similar to RLIKE, so these two expressions bring the same result:
SELECT 'text' RLIKE 'pattern';
|
SELECT REGEXP_LIKE('text', 'pattern');
|
The mode argument of REGEXP_LIKE will be a string literal
that allows to change the default matching behavior.
It will understand one or more letter flags as follows:
n | force dot (.) to match a newline character (by default dot should not match newline). |
m | treat the source string as multiple lines, so ^ and $ match the beginning and the end of any line inside text. The default behavior should be to match ^ and $ only at the beginning and the end of the entire text value. |
x | ignore white-spaces (by default white-spaces should match themselves). |
i | force case insensitive match (even if the collation is case sensitive) |
c | force case sensitive match (even if the collation is case insensitive) |
Examples:
SELECT REGEXP_LIKE('a\nb', 'a.b'); -> false
|
SELECT REGEXP_LIKE('a\nb', 'a.b', 'n'); -> true
|
SELECT REGEXP_LIKE('a\nb', '^a$'); -> false
|
SELECT REGEXP_LIKE('a\nb', '^a$', 'm'); -> true
|
SELECT REGEXP_LIKE('a b',' ab'); -> false
|
SELECT REGEXP_LIKE('a b',' ab','x'); -> true
|
SELECT REGEXP_LIKE('a b\nc d', '^cd$', 'mx') -> true
|
Note 1: Some other letter flags may be added as well, depending on features
provided by the library of the choice. This will be discussed after
the decision of which library to use is made.
Note 2: mode will be a string literal (i.e. a constant).
Expressions will not be supported. The parser will return
a syntax error on attempt to use non-literal in the third parameter
to REGEXP_LIKE:
SELECT REGEXP_LIKE(column1, column2, 'mx') FROM t1; -> OK
|
but
SELECT REGEXP_LIKE(column1, column2, column3) FROM t1; ->
|
ERROR: You have an error in your SQL syntax at 'column3) FROM t1'
|
Note 3: If some unknown letter is given in the 'mode' parameter to
REGEXP_LIKE, then an error should be returned:
SELECT REGEXP_LIKE(column1, column2, 'U') FROM t1; ->
|
ERROR: Incorrect argument to REGEXP_LIKE
|
It was decided not to implement the REGEXP_LIKE function,
because PCRE supports flags in the pattern syntax. There is
no a need for an extra argument that would contain flags.
For example:
mysql> SELECT 'a' RLIKE 'A' AS case_insensitive, 'a' RLIKE '(?-i)A' AS case_sensitive;
|
+------------------+----------------+
|
| case_insensitive | case_sensitive |
|
+------------------+----------------+
|
| 1 | 0 |
|
+------------------+----------------+
|
1 row in set (0.00 sec)
|
The REGEXP_REPLACE function
Additionally, a much requested REGEXP_REPLACE function should be implemented
under terms of this task, with the following arguments:
REGEXP_REPLACE(text, pattern, replace [,position [,occurrence [,mode]])
|
where:
text | the search value (string) |
pattern | the regular expression (string) |
replace | the replacement (string), with sub-expressions in the form of \n, where n is a number from 1 to 9. |
position | a positive integer indicating the position in text where to start search from. The default value will be 1 (the first character). |
occurrence | a non-negative number indicating the occurrence to replace: 0 to replace all occurrences, or a positive number n to replace the n-th occurrence. |
mode | a string literal that changes the default match behavior, with exactly the same set of flag letters that REGEXP_LIKE has. |
The first version will be implemented only with the first three parameters "text", "pattern" and "replace" and won't include the optional parameters "position", "occurrence" and "mode", and it will replace all occurrences of the searched pattern, starting from the beginning of the text
Examples:
SELECT REGEXP_REPLACE('James Bond', '(.*) (.*)', '\\2, \\1'); -> 'Bond, James'
|
The REGEXP_INSTR function
The REGEXP_REPLACE function will be implemented,
with the following arguments:
REGEXP_INST(text, pattern)
|
It will return the position of the regexp "pattern" inside "text".
The positions will start with 1.
Examples:
mysql> SELECT REGEXP_INSTR('abc2xyz','[0-9]');
|
+---------------------------------+
|
| REGEXP_INSTR('abc2xyz','[0-9]') |
|
+---------------------------------+
|
| 4 |
|
+---------------------------------+
|
1 row in set (0.00 sec)
|
The REGEXP_SUBSTR function
The REGEXP_SUBSTR function will be be implemented,
with the following arguments:
REGEXP_SUBSTR(text, pattern)
|
It will return the part of "text" that matches "pattern".
Examples:
mysql> SELECT REGEXP_SUBSTR('abc123xyz','[0-9]+');
|
+-------------------------------------+
|
| REGEXP_SUBSTR('abc123xyz','[0-9]+') |
|
+-------------------------------------+
|
| 123 |
|
+-------------------------------------+
|
1 row in set (0.00 sec)
|
Links
- The description of the RLIKE predicate in the MySQL manual:
http://dev.mysql.com/doc/refman/5.6/en/regexp.html - pcre:
http://www.pcre.org/\\http://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expressions - Oniguruma:
http://www.geocities.jp/kosako3/oniguruma/\\http://en.wikipedia.org/wiki/Oniguruma - Regular expression UDFs for MySQL
http://sourceforge.net/projects/udf-regexp/ https://launchpad.net/mysql-udf-regexp - TRE:
http://laurikari.net/tre/\\http://en.wikipedia.org/wiki/TRE_%28computing%29 - RE2:
http://code.google.com/p/re2/ - ICU regexp:
http://userguide.icu-project.org/strings/regexp - Comparison of a number of regexp implementations:
http://en.wikipedia.org/wiki/Comparison_of_regular_expression_engines - Benchmark of Regex Libraries
http://lh3lh3.users.sourceforge.net/reb.shtml - Unicode features in regexp:
http://www.regular-expressions.info/unicode.html - Performance comparison of regular expression engines:
http://sljit.sourceforge.net/regex_perf.html - Regular Expression Performance Comparison:
http://www.boost.org/doc/libs/1_41_0/libs/regex/doc/gcc-performance.html