[MDEV-8102] REGEXP function fails to match hex values when expression is stored as a variable Created: 2015-05-05  Updated: 2015-05-14  Resolved: 2015-05-14

Status: Closed
Project: MariaDB Server
Component/s: OTHER
Affects Version/s: 10.0.17
Fix Version/s: 10.0.20

Type: Bug Priority: Major
Reporter: Mark Punak Assignee: Alexander Barkov
Resolution: Not a Bug Votes: 0
Labels: function, regexp, verified
Environment:

CentOS 7



 Description   

The regexp function works differently when evaluating a string stored in a variable than it does when evaluating a string literal.

Example:

select 
      0xE001A90213021002,
      @regCheck:= '\\xE0\\x01',
      '\\xE0\\x01'= @regCheck,
      0xE001A90213021002 REGEXP @regCheck,
      0xE001A90213021002 REGEXP '\\xE0\\x01';

The select statement above successfully matches via the regular expression only when the string literal is used. The evaluation of the variable fails to match, even though Maria is saying the string literal and the variable have matching values.

The problem appears to be related to the use of the hex values in the regular expression, as the following statement evaluates without issue:

select 
	@regCheck :='D'
	@testvar:='Dude',
        @testvar regexp @regCheck,
        @testvar regexp 'D';

I tested this against MySQL 5.5, where the evaluation fails for both the string literal and the variable (completely non-functional in all situations), so I decided to report the bug here instead.



 Comments   
Comment by Alexander Barkov [ 2015-05-07 ]

A smaller test reproducing the problem:

SET NAMES latin1;
SELECT
  @regCheck:= '\\xE0\\x01',
  '\\xE0\\x01'= @regCheck,
  0xE001 REGEXP @regCheck,
  0xE001 REGEXP '\\xE0\\x01'\G

The result is:

  @regCheck:= '\\xE0\\x01': \xE0\x01
   '\\xE0\\x01'= @regCheck: 1
   0xE001 REGEXP @regCheck: 0
0xE001 REGEXP '\\xE0\\x01': 1

Comment by Alexander Barkov [ 2015-05-07 ]

The difference happens because user variables have IMPLICIT collation derivation and therefore have precedence over literals (which have COERCIBLE collation derivation).

So this script:

SET NAMES latin1;
SET @regCheck= '\\xE0\\x01';
SELECT 0xE001 REGEXP @regCheck;

chooses latin1_swedish_ci as an effective collation for the operation, because a user variable is stronger than a literal.

Then, RLIKE converts the operands to utf8 (which is internal character set for PCRE when comparing text strings)

  • the value of the user variable gets successfully converted from latin1 to utf8
  • the value of 0xE001 is not converted (because it is a binary string rather than a text string), so RLIKE tries to re-interpret it as an utf8 string as is, which fails because 0xE001 is a bad utf8 sequence.
    As a result, RLIKE returns "false".

With a string literal given in the pattern it works differently:

SET NAMES latin1;
SELECT 0xE001 REGEXP '\\xE0\\x01';

Both operands have COERCIBLE derivation, so it chooses "binary" as a common collation for the two operands, because "binary" has precedence over a text string in case if derivations of two operands are the same.
Then, it calls PCRE routines in binary mode rather than text mode (i.e. without any utf8 association or conversion).

The difference between a literal and a user variable is usually very subtle, but it apparently gets quite sensitive in case of RLIKE.

Comment by Alexander Barkov [ 2015-05-07 ]

A number of workarounds are possible:

# 1. This makes the pattern to be a binary string:
SET NAMES latin1;
SET @regCheck= X'E001';
SELECT 0xE001 REGEXP @regCheck;
 
# 2. This also makes the pattern to be a binary string, using a different syntax:
SET NAMES latin1;
SET @regCheck= _binary '\\xE0\\x01';
SELECT 0xE001 REGEXP @regCheck;
 
# 3. This makes derivation of the subject string stronger (IMLICIT instead of COERCIBLE)
SET NAMES latin1;
SET @regCheck= '\\xE0\\x01';
SELECT CAST(0xE001 AS BINARY) REGEXP @regCheck;

Comment by Alexander Barkov [ 2015-05-14 ]

We won't do any fixes that change the behaviour.
We have just made regex functions display errors returned from pcre_exec() as MariaDB warnings.

SET NAMES latin1;
SET @regCheck= '\\xE0\\x01';
SELECT 0xE001 REGEXP @regCheck;
0xE001 REGEXP @regCheck
0
Warnings:
Warning        1139    Got error 'pcre_exec: Invalid utf8 byte sequence in the subject string' from regexp

Generated at Thu Feb 08 07:24:39 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.