[MCOL-3771] Benchmark RE2 for replacing regex Created: 2020-02-10  Updated: 2023-10-25  Resolved: 2023-10-25

Status: Closed
Project: MariaDB ColumnStore
Component/s: PrimProc
Affects Version/s: 1.4.3
Fix Version/s: Icebox

Type: Task Priority: Minor
Reporter: David Hall (Inactive) Assignee: Sergey Zefirov
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by MCOL-3778 Replace glibc with google's re2 for r... Closed

 Description   

We have discovered that there is significant performance differences between various regex implementations. This is important during LIKE processing. During a search as to why, we discovered that Google has nice lib we can use – RE2 (https://github.com/google/re2). We need to benchmark this to see if it's worth adopting.



 Comments   
Comment by Sergey Zefirov [ 2020-10-14 ]

Snowflake docs: https://docs.snowflake.com/en/sql-reference/functions-regexp.html#backreferences

Quote from current state: "Snowflake does not support backreferences in regular expression patterns (known as “squares” in formal language theory); however, backreferences are supported in the replacement string of the REGEXP_REPLACE function."

Redshift documentation does not mention anything beyond POSIX regular expressions in pattern matching: https://docs.amazonaws.cn/en_us/redshift/latest/dg/pattern-matching-conditions-posix.html

This means that for some parts of expressions a more restricted version can be used.

Comment by Sergey Zefirov [ 2020-10-15 ]

http://www.stringology.org/cgi-bin/getfile.cgi?t=pdf&c=-&y=2017&n=04 - regular expression matching for expressions with backreferences is NP-complete for every conceivable alphabet.

That makes this class of grammars to be equal to context-sensitive grammars of any kind (two-level/Vijngaarten grammars, direct CS grammars, etc).

PS
Edited link to a file.

Comment by Sergey Zefirov [ 2020-10-19 ]

There are two goals: benchmark against change to re2 instead of boost::regex and (if re2 proves worth it) add reference to CMakeLists as external project.

Right now working on use of re2 as external dependency.

Comment by Sergey Zefirov [ 2020-10-23 ]

The structure of code required some refactoring, which required 1) factoring out operations into a class and 2) changing code to use these operations.

The (1) is done and (2) is done for primitive processor part of storage engine.

There is code (mostly in execplan part) that also uses regexes and also #ifdefs on POSIX_REGEX. The goal for today to make this code to use new regexes.

Then I'll add re2 as a external project and then I will benchmark.

Comment by Sergey Zefirov [ 2020-10-26 ]

Adding libre2 as external project now.

Comment by Sergey Zefirov [ 2020-10-26 ]

Added, mostly successfully.

As I noted above, there are two cases of regex use: RLIKE operator does not require substitutions and does not support back references and REGEXP_REPLACE which supports backreferences and substitutions.

This means we need to keep POSIX/Boost regexes and add re2 alongside.

Comment by Sergey Zefirov [ 2020-10-27 ]

I am investigating possible difference in features - it looks like MariaDB with columnstore does not support REGEXP_REPLACE functionality.

I cannot find where or even whether columnstore implements REGEXP_REPLACE.

Comment by Sergey Zefirov [ 2020-10-27 ]

Columnstore does not support REGEXP_REPLACE.

I can safely drop fallback to POSIX/Boost regexes then.

Comment by Sergey Zefirov [ 2020-10-28 ]

Right now debugging and fixing build problems.

There are two: 1) libre2 wants to be installed at /usr prefix somewhere and 2) there are .so libraries in columnstore that need to be linked with libre2.

Comment by Sergey Zefirov [ 2020-10-28 ]

Solved both.

Going to test.

Comment by Sergey Zefirov [ 2020-10-28 ]

I will make a PR to columnstore so that CI can run the build and tell me what is wrong.

Comment by Sergey Zefirov [ 2020-10-28 ]

Relevant PR: https://github.com/mariadb-corporation/mariadb-columnstore-engine/pull/1552

Comment by Sergey Zefirov [ 2020-10-29 ]

"smoke" part of the build test fails because it cannot find libre2.so. Which we build and copy into library dir.

Weird.

Comment by Sergey Zefirov [ 2020-10-30 ]

Work on the critique by drrtuy:

  1. remove work on compiler warnings
  2. perform squash
  3. rebase over HEAD develop

The work on compiler warnings has been undone.

Comment by Sergey Zefirov [ 2020-11-04 ]

Preproduced installation problems (missing libre2.so) on a clean machine in GC (centos8).

mdb-server]$ git remote get-url origin
https://github.com/mariadb-SergeyZefirov/server.git

I am clearly building on my repository with the columnstore with re2 regexes. Going to investigate that.

Comment by Sergey Zefirov [ 2020-11-05 ]

It turned out that juggling repositories, branches and submodules can be tricky.

The problem in previous comment turned out to be exactly that: after switching to branch I did not updated submodules and got installation error due to absence of code to install libre2.so (but code to build it was present).

Yesterday I succeed in building the columnstore engine and successfully executed "create table cs1(i bigint) engine=columnstore;".

Now I am about to run a set of queries to check out re2 performance.

Comment by Sergey Zefirov [ 2020-11-05 ]

And I should note that OpenSUSE build builds everything and tests pass.

Comment by Sergey Zefirov [ 2020-11-06 ]

Preliminary tests on google clound compute-intensive instances c16

Setup

We use regression tests (data) as a base. We use test000 in mariadb-columnstore-regression-test/mysql/queries/nightly/alltest/ to setup tpch1 database with the lineitem table (sudo ./go.sh --tests=test000.sh).

The table contains l_comment column which is generated from some repeated and shuffled text.

After completion of run of "./go.sh --tests=test000.sh" I run "INSERT into lineitem SELECT * FROM lineitem;" statement three times, so that total count of rows is slightly above 48M.

Results

POSIX or RE2 Query Result Time (s)
POSIX SELECT COUNT( * ) FROM lineitem; 48009720 0.223
POSIX SELECT COUNT( * ) FROM lineitem; 48009720 0.121
POSIX SELECT COUNT( * ) FROM lineitem; 48009720 0.114
POSIX SELECT COUNT( * ) FROM lineitem; 48009720 0.112
POSIX SELECT COUNT( * ) FROM lineitem; 48009720 0.113
re2 SELECT COUNT( * ) FROM lineitem; 48009720 0.221
re2 SELECT COUNT( * ) FROM lineitem; 48009720 0.126
re2 SELECT COUNT( * ) FROM lineitem; 48009720 0.119
re2 SELECT COUNT( * ) FROM lineitem; 48009720 0.132
re2 SELECT COUNT( * ) FROM lineitem; 48009720 0.135
POSIX SELECT COUNT( * ) FROM lineitem WHERE l_comment LIKE '%quickly%'; 3077840 8.275
POSIX SELECT COUNT( * ) FROM lineitem WHERE l_comment LIKE '%quickly%'; 3077840 3.639
POSIX SELECT COUNT( * ) FROM lineitem WHERE l_comment LIKE '%quickly%'; 3077840 3.612
POSIX SELECT COUNT( * ) FROM lineitem WHERE l_comment LIKE '%quickly%'; 3077840 3.651
POSIX SELECT COUNT( * ) FROM lineitem WHERE l_comment LIKE '%quickly%'; 3077840 3.664
re2 SELECT COUNT( * ) FROM lineitem WHERE l_comment LIKE '%quickly%'; 3077840 8.098
re2 SELECT COUNT( * ) FROM lineitem WHERE l_comment LIKE '%quickly%'; 3077840 3.141
re2 SELECT COUNT( * ) FROM lineitem WHERE l_comment LIKE '%quickly%'; 3077840 3.151
re2 SELECT COUNT( * ) FROM lineitem WHERE l_comment LIKE '%quickly%'; 3077840 3.183
re2 SELECT COUNT( * ) FROM lineitem WHERE l_comment LIKE '%quickly%'; 3077840 3.171
POSIX SELECT COUNT( * ) FROM lineitem WHERE l_comment LIKE '%quickly%ironic%'; 243576 3,433
POSIX SELECT COUNT( * ) FROM lineitem WHERE l_comment LIKE '%quickly%ironic%'; 243576 3.407
POSIX SELECT COUNT( * ) FROM lineitem WHERE l_comment LIKE '%quickly%ironic%'; 243576 3.428
POSIX SELECT COUNT( * ) FROM lineitem WHERE l_comment LIKE '%quickly%ironic%'; 243576 3.429
POSIX SELECT COUNT( * ) FROM lineitem WHERE l_comment LIKE '%quickly%ironic%'; 243576 3.380
re2 SELECT COUNT( * ) FROM lineitem WHERE l_comment LIKE '%quickly%ironic%'; 243576 3.109
re2 SELECT COUNT( * ) FROM lineitem WHERE l_comment LIKE '%quickly%ironic%'; 243576 3.132
re2 SELECT COUNT( * ) FROM lineitem WHERE l_comment LIKE '%quickly%ironic%'; 243576 3.150
re2 SELECT COUNT( * ) FROM lineitem WHERE l_comment LIKE '%quickly%ironic%'; 243576 3.149
re2 SELECT COUNT( * ) FROM lineitem WHERE l_comment LIKE '%quickly%ironic%'; 243576 3.097
Comment by Sergey Zefirov [ 2020-11-09 ]

With the "SELECT COUNT ( * ) FROM lineitem WHERE l_comment RLIKE 'quickly';" performance of RE2 version is several times worse than of POSIX regexes: 85 seconds (that was result for single copy of lineitem) for RE2 and 14 seconds for POSIX.

Comment by Sergey Zefirov [ 2020-11-09 ]

It's the difference between LIKE and RLIKE (with RE2):

MariaDB [tpch1]> SELECT COUNT(*) FROM lineitem WHERE l_comment RLIKE 'quickly';
+----------+
| COUNT(*) |
+----------+
|  3077840 |
+----------+
1 row in set (7 min 0.416 sec)
 
MariaDB [tpch1]> SELECT COUNT(*) FROM lineitem WHERE l_comment LIKE '%quickly%';
+----------+
| COUNT(*) |
+----------+
|  3077840 |
+----------+
1 row in set (3.107 sec)
 
MariaDB [tpch1]> 

Comment by Sergey Zefirov [ 2020-11-09 ]

What is interesting here is that LIKE uses regexes inside - SQL patterns are turned into regular expressions under the hood and then executed.

Comment by Sergey Zefirov [ 2020-11-09 ]

Now I try to do same with POSIX regexes. Exactly the same: compile anew, run test000.sh, copy lineitem three times and execute queries above.

Comment by Sergey Zefirov [ 2020-11-09 ]

The difference is much less pronounced in POSIX case:

MariaDB [tpch1]> SELECT COUNT( * ) FROM lineitem WHERE l_comment RLIKE 'quickly';
+------------+
| COUNT( * ) |
+------------+
|    3077840 |
+------------+
1 row in set (13.813 sec)
 
MariaDB [tpch1]> SELECT COUNT( * ) FROM lineitem WHERE l_comment LIKE '%quickly%';
+------------+
| COUNT( * ) |
+------------+
|    3077840 |
+------------+
1 row in set (3.602 sec)

But it is still there.

Comment by Sergey Zefirov [ 2020-11-09 ]

I did some peeking into execution plan (using RPM packages from build 1063).

Here's filters for LIKE operator:

Nov 09 15:28:26 sergueyz-devel-2 env[1966]: >>Filters
Nov 09 15:28:26 sergueyz-devel-2 env[1966]: SimpleFilter(indexflag=0 joinFlag= 0 card= 0)
Nov 09 15:28:26 sergueyz-devel-2 env[1966]:   SimpleColumn `tpch1`.`lineitem`.`l_comment`
Nov 09 15:28:26 sergueyz-devel-2 env[1966]:   s/t/c/v/o/ct/TA/CA/RA/#/card/join/source/engine/colPos: tpch1/lineitem/l_comment//4110/varchar/lineitem//0/-1/0/0/0/ColumnStore/-1
Nov 09 15:28:26 sergueyz-devel-2 env[1966]:   Operator: like fOp=10 opType=12  ConstantColumn: %quickly% intVal=0 uintVal=0(l) resultType=varchar

And here's filters for RLIKE:

Nov 09 15:29:13 sergueyz-devel-2 env[1966]: >>Filters
Nov 09 15:29:13 sergueyz-devel-2 env[1966]: FunctionColumn: regexp
Nov 09 15:29:13 sergueyz-devel-2 env[1966]: expressionId=0
Nov 09 15:29:13 sergueyz-devel-2 env[1966]: joinInfo=0 returnAll=0 sequence#=-1
Nov 09 15:29:13 sergueyz-devel-2 env[1966]: resultType=bigint|8
Nov 09 15:29:13 sergueyz-devel-2 env[1966]: operationType=bigint
Nov 09 15:29:13 sergueyz-devel-2 env[1966]: function parm:
Nov 09 15:29:13 sergueyz-devel-2 env[1966]: SimpleColumn `tpch1`.`lineitem`.`l_comment`
Nov 09 15:29:13 sergueyz-devel-2 env[1966]:   s/t/c/v/o/ct/TA/CA/RA/#/card/join/source/engine/colPos: tpch1/lineitem/l_comment//4110/varchar/lineitem//0/-1/0/0/0/ColumnStore/-1
Nov 09 15:29:13 sergueyz-devel-2 env[1966]: ConstantColumn: quickly intVal=8175556574285953280 uintVal=8175556574285953280(l) resultType=varchar

It applies function regexp (in funcexp.cpp/h) to a constant column in case of RLIKE (translated into REGEXP internally). Source code examination shows that regexp applcation (getBool) recompiles regular expression for every application.

In case of LIKE there's special constant column where regular expression is cached. In this case we see cool difference between execution speed of different engines.

Comment by Sergey Zefirov [ 2020-11-10 ]

An obvious fix would be to not recompile regexes on every call.

So I added "virtual bool isConstant()

{ return false; }

" virtual method to tree node and made ConstantColumn to return true. Then Func_regexp class would use that flag to compile pattern only once.

I measured timing for all these changes with POSIX regular expression library on and found performance regression, about 40% loss.

As I do not have time to debug what is causing that, I will stop at that point.

Comment by Sergey Zefirov [ 2020-11-10 ]

Current situation:

  1. libre2 by Google offers speed benefits for LIKE operator,
  2. all other uses (RLIKE, REGEXP and so on) suffer substantial performance loss at the moment,
  3. possible mitigation: fall back to POSIX for RLIKE cases.
Comment by Sergey Zefirov [ 2020-11-12 ]

The nodes in expression tree are, effectively, singletons. They are constructed using getFunctor(string& name) method of FuncExp class. That method returns an already created instance of a function implementation object (instance of a class) and if you use several calls to the same function (as in "REGEXP(a,'a') AND REGEXP(a,'b')"), the object for both of them will be the same. The parameters will change, although.

Comment by Sergey Zefirov [ 2020-11-13 ]

I've optimized the construction of regular expression automata.

Here are results for RE2 (48M rows in "lineitem" table, warmed up DB):

MariaDB [tpch1]> select count(*) from lineitem where l_comment like "%quickly%";
+----------+
| count(*) |
+----------+
|  3077840 |
+----------+
1 row in set (3.125 sec)
 
MariaDB [tpch1]> select count(*) from lineitem where l_comment rlike "quickly";
+----------+
| count(*) |
+----------+
|  3077840 |
+----------+
1 row in set (13.737 sec)

Much better than 8 minutes, but still no cigar.

I will have results for POSIX regexes shortly.

Comment by Sergey Zefirov [ 2020-11-13 ]

And results for POSIX regular expressions:

MariaDB [tpch1]> select count(*) from lineitem where l_comment like "%quickly%";
+----------+
| count(*) |
+----------+
|  3077840 |
+----------+
1 row in set (3.630 sec)
 
MariaDB [tpch1]> select count(*) from lineitem where l_comment rlike "quickly";
+----------+
| count(*) |
+----------+
|  3077840 |
+----------+
1 row in set (13.788 sec)

It looks like there's no difference in performance of RLIKE operator between re2 and POSIX and some performance benefits in LIKE operator execution.

Comment by Sergey Zefirov [ 2020-11-13 ]

There's absence of error checking in the implementation of Func_regexp. The old code and new code are same in this regard.

Generated at Thu Feb 08 02:45:19 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.