[MDEV-8637] sysvar to control pcre match limit Created: 2015-08-17  Updated: 2020-08-25  Resolved: 2017-10-06

Status: Closed
Project: MariaDB Server
Component/s: OTHER
Fix Version/s: N/A

Type: Task Priority: Major
Reporter: Geoff Montee (Inactive) Assignee: Alexander Barkov
Resolution: Won't Fix Votes: 1
Labels: upstream-not-affected


 Description   

Under some conditions, the regexp function will raise a warning about some internal error:

MariaDB [(none)]> SELECT 'mail@abcdefghijklmnopqr. com' REGEXP '@([[:alnum:]]*|[-]*|[.]*)*[.](com|net|us|tv|vg|qs|org|cc|tc|info|ws|ms|biz|bz|net|edu|es|fr|ru|nl|it|de|am|cl)';
+--------------------------------------------------------------------------------------------------------------------------------------------------------+
| 'mail@abcdefghijklmnopqr. com' REGEXP '@([[:alnum:]]*|[-]*|[.]*)*[.](com|net|us|tv|vg|qs|org|cc|tc|info|ws|ms|biz|bz|net|edu|es|fr|ru|nl|it|de|am|cl)' |
+--------------------------------------------------------------------------------------------------------------------------------------------------------+
|                                                                                                                                                      0 |
+--------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set, 1 warning (0.18 sec)
 
MariaDB [(none)]> SHOW WARNINGS;
+---------+------+--------------------------------------------------------+
| Level   | Code | Message                                                |
+---------+------+--------------------------------------------------------+
| Warning | 1139 | Got error 'pcre_exec: Internal error (-8)' from regexp |
+---------+------+--------------------------------------------------------+
1 row in set (0.00 sec)

It might be related to string length. Changing the length of the string makes the message go away:

MariaDB [(none)]> SELECT 'mail@abcdefghijklmnopq. com' REGEXP '@([[:alnum:]]*|[-]*|[.]*)*[.](com|net|us|tv|vg|qs|org|cc|tc|info|ws|ms|biz|bz|net|edu|es|fr|ru|nl|it|de|am|cl)';
+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| 'mail@abcdefghijklmnopq. com' REGEXP '@([[:alnum:]]*|[-]*|[.]*)*[.](com|net|us|tv|vg|qs|org|cc|tc|info|ws|ms|biz|bz|net|edu|es|fr|ru|nl|it|de|am|cl)' |
+-------------------------------------------------------------------------------------------------------------------------------------------------------+
|                                                                                                                                                     0 |
+-------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.14 sec)



 Comments   
Comment by Sergei Golubchik [ 2015-08-17 ]

Just FYI:
The error is

#define PCRE_ERROR_MATCHLIMIT       (-8)

and it means

man pcreapi

       Internally,  pcre_exec() uses a function called match(), which it calls
       repeatedly (sometimes recursively). The limit  set  by  match_limit  is
       imposed  on the number of times this function is called during a match,
       which has the effect of limiting the amount of  backtracking  that  can
       take place. For patterns that are not anchored, the count restarts from
       zero for each position in the subject string.
...
       The  default  value  for  the  limit can be set when PCRE is built; the
       default default is 10 million, which handles all but the  most  extreme
       cases.  You  can  override  the  default by suppling pcre_exec() with a
       pcre_extra    block    in    which    match_limit    is    set,     and
       PCRE_EXTRA_MATCH_LIMIT  is  set  in  the  flags  field. If the limit is
       exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.

Comment by Geoff Montee (Inactive) [ 2015-10-12 ]

Thanks for the response!

It looks like the server sets match_limit to 10 million at compile time:

https://github.com/MariaDB/server/blob/0d54cb12eb77c4c8c90e9833ec3df47c70c0ade7/pcre/config-cmake.h.in#L50

And it gets set here:

https://github.com/MariaDB/server/blob/0d54cb12eb77c4c8c90e9833ec3df47c70c0ade7/pcre/pcrecpp.cc#L517

Do you think this should stay as-is?

Maybe we could implement a run-time setting that allows PCRE's match_limit to be set higher for users who want to use really complicated regular expressions?

Comment by Alexander Barkov [ 2015-10-12 ]

MySQL-5.7.8 is not affected.

Comment by Alexander Barkov [ 2015-10-12 ]

A workaround would be to rewrite the regular expression in a little bit more efficient way,
to join three "*" repetition operators into a single one, like this:

MariaDB [test]> SELECT 'mail@abcdefghijklmnopqr. com' REGEXP '@([[:alnum:]]|[-]|[.])*[.](com|net|us|tv|vg|qs|org|cc|tc|info|ws|ms|biz|bz|net|edu|es|fr|ru|nl|it|de|am|cl)';
+-----------------------------------------------------------------------------------------------------------------------------------------------------+
| 'mail@abcdefghijklmnopqr. com' REGEXP '@([[:alnum:]]|[-]|[.])*[.](com|net|us|tv|vg|qs|org|cc|tc|info|ws|ms|biz|bz|net|edu|es|fr|ru|nl|it|de|am|cl)' |
+-----------------------------------------------------------------------------------------------------------------------------------------------------+
|                                                                                                                                                   0 |
+-----------------------------------------------------------------------------------------------------------------------------------------------------+

It seems to be handled by pcre without problems this way.

Comment by Geoff Montee (Inactive) [ 2015-10-12 ]

It looks like MySQL 5.7 still uses Henry Spencer's implementation of regular expressions, rather than PCRE, if the manual is correct, so it probably doesn't call pcre_exec().

https://dev.mysql.com/doc/refman/5.7/en/regexp.html

Comment by Alexander Barkov [ 2015-10-12 ]

Geoff, right, I think MySQL still uses the old library.
I just checked if MySQL fails as we should try not to be worse than MySQL
So although there is a workaround with a modified regexp, we should probably still think about a way to pass a bigger match_limit. It could be a session variable, perhaps.
Say:

SET @@pcre_match_limit=10*1000*1000;
SELECT 'subject' REGEXP 'complex-pattern';

Comment by Geoff Montee (Inactive) [ 2015-10-12 ]

Ah, I see. A session variable sounds like a good fix to me. Thanks!

Comment by Claudio Nanni [ 2015-10-13 ]

Hi,

Controlling the variable from MariaDB sounds good.
Anyway we should be aware that when using regular expressions to match columns of a table the performance decreases rapidly with the number of compared rows.
In my tests comparing a regexp that needs more than 10 Millions combinations(or whatever "internal error (-8)' from regexp" means) against a 1000 records table can take up to 5 minutes.
The best case I can imagine is when due to other conditions(ideally a PK match) the number of records is reduced to one or very few.
Different is of course if the regexp is very simple.

Comment by Alexander Barkov [ 2017-10-06 ]

Sergei thinks we should not add this variable. With a too much complex regular expression, performance becomes too bad anyway.
One should consider rewriting regular expressions in a more efficient way, or use some other text processing functions.

Generated at Thu Feb 08 07:28:39 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.