Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-8102

REGEXP function fails to match hex values when expression is stored as a variable

Details

    Description

      The regexp function works differently when evaluating a string stored in a variable than it does when evaluating a string literal.

      Example:

      select 
            0xE001A90213021002,
            @regCheck:= '\\xE0\\x01',
            '\\xE0\\x01'= @regCheck,
            0xE001A90213021002 REGEXP @regCheck,
            0xE001A90213021002 REGEXP '\\xE0\\x01';

      The select statement above successfully matches via the regular expression only when the string literal is used. The evaluation of the variable fails to match, even though Maria is saying the string literal and the variable have matching values.

      The problem appears to be related to the use of the hex values in the regular expression, as the following statement evaluates without issue:

      select 
      	@regCheck :='D'
      	@testvar:='Dude',
              @testvar regexp @regCheck,
              @testvar regexp 'D';

      I tested this against MySQL 5.5, where the evaluation fails for both the string literal and the variable (completely non-functional in all situations), so I decided to report the bug here instead.

      Attachments

        Activity

          bar Alexander Barkov added a comment - - edited

          A smaller test reproducing the problem:

          SET NAMES latin1;
          SELECT
            @regCheck:= '\\xE0\\x01',
            '\\xE0\\x01'= @regCheck,
            0xE001 REGEXP @regCheck,
            0xE001 REGEXP '\\xE0\\x01'\G

          The result is:

            @regCheck:= '\\xE0\\x01': \xE0\x01
             '\\xE0\\x01'= @regCheck: 1
             0xE001 REGEXP @regCheck: 0
          0xE001 REGEXP '\\xE0\\x01': 1

          bar Alexander Barkov added a comment - - edited A smaller test reproducing the problem: SET NAMES latin1; SELECT @regCheck:= '\\xE0\\x01', '\\xE0\\x01'= @regCheck, 0xE001 REGEXP @regCheck, 0xE001 REGEXP '\\xE0\\x01'\G The result is: @regCheck:= '\\xE0\\x01': \xE0\x01 '\\xE0\\x01'= @regCheck: 1 0xE001 REGEXP @regCheck: 0 0xE001 REGEXP '\\xE0\\x01': 1

          The difference happens because user variables have IMPLICIT collation derivation and therefore have precedence over literals (which have COERCIBLE collation derivation).

          So this script:

          SET NAMES latin1;
          SET @regCheck= '\\xE0\\x01';
          SELECT 0xE001 REGEXP @regCheck;

          chooses latin1_swedish_ci as an effective collation for the operation, because a user variable is stronger than a literal.

          Then, RLIKE converts the operands to utf8 (which is internal character set for PCRE when comparing text strings)

          • the value of the user variable gets successfully converted from latin1 to utf8
          • the value of 0xE001 is not converted (because it is a binary string rather than a text string), so RLIKE tries to re-interpret it as an utf8 string as is, which fails because 0xE001 is a bad utf8 sequence.
            As a result, RLIKE returns "false".

          With a string literal given in the pattern it works differently:

          SET NAMES latin1;
          SELECT 0xE001 REGEXP '\\xE0\\x01';

          Both operands have COERCIBLE derivation, so it chooses "binary" as a common collation for the two operands, because "binary" has precedence over a text string in case if derivations of two operands are the same.
          Then, it calls PCRE routines in binary mode rather than text mode (i.e. without any utf8 association or conversion).

          The difference between a literal and a user variable is usually very subtle, but it apparently gets quite sensitive in case of RLIKE.

          bar Alexander Barkov added a comment - The difference happens because user variables have IMPLICIT collation derivation and therefore have precedence over literals (which have COERCIBLE collation derivation). So this script: SET NAMES latin1; SET @regCheck= '\\xE0\\x01'; SELECT 0xE001 REGEXP @regCheck; chooses latin1_swedish_ci as an effective collation for the operation, because a user variable is stronger than a literal. Then, RLIKE converts the operands to utf8 (which is internal character set for PCRE when comparing text strings) the value of the user variable gets successfully converted from latin1 to utf8 the value of 0xE001 is not converted (because it is a binary string rather than a text string), so RLIKE tries to re-interpret it as an utf8 string as is, which fails because 0xE001 is a bad utf8 sequence. As a result, RLIKE returns "false". With a string literal given in the pattern it works differently: SET NAMES latin1; SELECT 0xE001 REGEXP '\\xE0\\x01'; Both operands have COERCIBLE derivation, so it chooses "binary" as a common collation for the two operands, because "binary" has precedence over a text string in case if derivations of two operands are the same. Then, it calls PCRE routines in binary mode rather than text mode (i.e. without any utf8 association or conversion). The difference between a literal and a user variable is usually very subtle, but it apparently gets quite sensitive in case of RLIKE.

          A number of workarounds are possible:

          # 1. This makes the pattern to be a binary string:
          SET NAMES latin1;
          SET @regCheck= X'E001';
          SELECT 0xE001 REGEXP @regCheck;
           
          # 2. This also makes the pattern to be a binary string, using a different syntax:
          SET NAMES latin1;
          SET @regCheck= _binary '\\xE0\\x01';
          SELECT 0xE001 REGEXP @regCheck;
           
          # 3. This makes derivation of the subject string stronger (IMLICIT instead of COERCIBLE)
          SET NAMES latin1;
          SET @regCheck= '\\xE0\\x01';
          SELECT CAST(0xE001 AS BINARY) REGEXP @regCheck;

          bar Alexander Barkov added a comment - A number of workarounds are possible: # 1. This makes the pattern to be a binary string: SET NAMES latin1; SET @regCheck= X'E001'; SELECT 0xE001 REGEXP @regCheck;   # 2. This also makes the pattern to be a binary string, using a different syntax: SET NAMES latin1; SET @regCheck= _binary '\\xE0\\x01'; SELECT 0xE001 REGEXP @regCheck;   # 3. This makes derivation of the subject string stronger (IMLICIT instead of COERCIBLE) SET NAMES latin1; SET @regCheck= '\\xE0\\x01'; SELECT CAST(0xE001 AS BINARY) REGEXP @regCheck;

          We won't do any fixes that change the behaviour.
          We have just made regex functions display errors returned from pcre_exec() as MariaDB warnings.

          SET NAMES latin1;
          SET @regCheck= '\\xE0\\x01';
          SELECT 0xE001 REGEXP @regCheck;
          0xE001 REGEXP @regCheck
          0
          Warnings:
          Warning        1139    Got error 'pcre_exec: Invalid utf8 byte sequence in the subject string' from regexp

          bar Alexander Barkov added a comment - We won't do any fixes that change the behaviour. We have just made regex functions display errors returned from pcre_exec() as MariaDB warnings. SET NAMES latin1; SET @regCheck= '\\xE0\\x01'; SELECT 0xE001 REGEXP @regCheck; 0xE001 REGEXP @regCheck 0 Warnings: Warning 1139 Got error 'pcre_exec: Invalid utf8 byte sequence in the subject string' from regexp

          People

            bar Alexander Barkov
            MarkP Mark Punak
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.