[MDEV-12403] REGEXP: performance can be very very poor Created: 2017-03-30  Updated: 2017-04-02  Resolved: 2017-03-30

Status: Closed
Project: MariaDB Server
Component/s: Server
Affects Version/s: 10.1.22
Fix Version/s: N/A

Type: Bug Priority: Minor
Reporter: Pascal HOARAU Assignee: Sergei Golubchik
Resolution: Won't Fix Votes: 0
Labels: None
Environment:

centos 7, CPU E5-1630 v3 @ 3.70GHz, 64 GB RAM, RAID1 SSD



 Description   

Horrible performances when using REGEXP

 
SELECT count(*) FROM `jbm_clients`.`michelin`;
+----------+
| count(*) |
+----------+
|    13042 |
+----------+
1 row in set (0.01 sec)
 
MariaDB [jbm]> SELECT distinct `ip` FROM `jbm_clients`.`michelin` WHERE `return_path` like '%amazonses%' and `return_path` REGEXP '(?:\.|@|-)*.amazonses.com(\.[a-z]+)?$' ;
+-------------+
| ip          |
+-------------+
| 54.240.4.10 |
| 54.240.4.11 |
| 54.240.4.12 |
| 54.240.4.7  |
| 54.240.4.8  |
| 54.240.4.9  |
+-------------+
6 rows in set (0.03 sec)
 
MariaDB [jbm]> SELECT distinct `ip` FROM `jbm_clients`.`michelin` WHERE `return_path` REGEXP '(?:\.|@|-)*.amazonses.com(\.[a-z]+)?$' ;
+-------------+
| ip          |
+-------------+
| 54.240.4.10 |
| 54.240.4.11 |
| 54.240.4.12 |
| 54.240.4.7  |
| 54.240.4.8  |
| 54.240.4.9  |
+-------------+
6 rows in set (3 min 11.02 sec)



 Comments   
Comment by Sergei Golubchik [ 2017-03-30 ]

Well, sure. Regular expressions can be slow. There are lots of articles in the internet and many books about how to write fast regexes. Also, you can first pre-filter your rows with a fast LIKE — just as you did! LIKE on a constant substring uses Boyer–Moore algorithm and should, probably, be faster than anything you can do with a regex.

Comment by Sergei Golubchik [ 2017-03-30 ]

Technically, MariaDB could look inside a regex and automatically apply LIKE optimization. But it generally treats regexes as black boxes, and directly invokes pcre for them. If pcre would have an API to parse a regex and extract substrings that must always be present, we could've use it for optimizing regexes.

Comment by Pascal HOARAU [ 2017-03-30 ]

more than 3mn for a powerful server and a table containing 13 000 rows is a real bug for me !

Comment by Daniel Black [ 2017-03-31 ]

I suggest you use a https://mariadb.com/kb/en/mariadb/virtual-computed-columns/ to extract (using REGEX_REPLACE) the amazonses.com part of the from the return_path column as a new column and index this. Then change the query to lookup this value in the generated column will use the new index.

Generated at Thu Feb 08 07:57:26 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.