Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-26338

BINARY data differences after SELECT INTO OUTFILE / LOAD DATA INFILE

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Closed (View Workflow)
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 10.2.40, 10.3.31, 10.4.21, 10.5.12, 10.6.4
    • Fix Version/s: N/A
    • Component/s: N/A
    • Labels:

      Description

      When having a BINARY or VARBINARY column and trying to export contents with SELECT INTO OUTFILE, then reading back with LOAD DATA INFILE, data can become mangled depending on character set / encoding.

      E.g.:

      DROP TABLE IF EXISTS t1;
      DROP TABLE IF EXISTS t2;
       
      CREATE TABLE t1(
        b1 VARBINARY(16),
        i1 INT
      ) DEFAULT CHARSET=utf8;
       
      INSERT INTO t1 VALUES(UNHEX("00"), 23);
      INSERT INTO t1 VALUES(UNHEX("E0"), 42);
       
      SELECT * FROM t1 INTO OUTFILE 'data.txt' CHARACTER SET  utf8;
       
      CREATE TABLE t2 LIKE t1;
       
      LOAD DATA INFILE 'data.txt'
           INTO TABLE t2
           CHARACTER SET utf8;
       
      SHOW WARNINGS;
       
      SELECT LENGTH(b1), HEX(b1), i1 FROM t1;
      SELECT LENGTH(b1), HEX(b1), i1 FROM t2;
      

      I would expect both SELECT statements to return the same result, but instead I'm getting:

      MariaDB [test]> SELECT LENGTH(b1), HEX(b1), i1 FROM t1;
      +------------+---------+------+
      | LENGTH(b1) | HEX(b1) | i1   |
      +------------+---------+------+
      |          1 | 00      |   23 |
      |          1 | E0      |   42 |
      +------------+---------+------+
      2 rows in set (0.000 sec)
      

      On the 2nd table, the one filled via LOAD DATA, I would assume to see the same result, seeing a byte value of E0 in the HEX(b1) column. But instead I'm seeing an empty b1 column in MariaDB versions up to 10.0, and a value of 3F ('?') starting with MariaDB 10.1.

      Result on MariaDB 5.5 and 10.0:

      MariaDB [test]> SELECT LENGTH(b1), HEX(b1), i1 FROM t2;
      +------------+---------+------+
      | LENGTH(b1) | HEX(b1) | i1   |
      +------------+---------+------+
      |          1 | 00      |   23 |
      |          0 |            |   42 |
      +------------+---------+------+
      2 rows in set (0.000 sec)
      

      Result on MariaDB 10.1 and beyond:

      MariaDB [test]> SELECT LENGTH(b1), HEX(b1), i1 FROM t2;
      +------------+---------+------+
      | LENGTH(b1) | HEX(b1) | i1   |
      +------------+---------+------+
      |          1 | 00      |   23 |
      |          1 | 3F      |   42 |
      +------------+---------+------+
      2 rows in set (0.000 sec)
      

      The problem here is that the E0 byte is interpreted as the beginning of an UTF8 encoded multi byte sequence, but with the b1 column being of type VARBINARY no UTF8 sequence handling should be applied, the E0 byte should be interpreted verbatim instead.

        Attachments

          Issue Links

            Activity

              People

              Assignee:
              Unassigned Unassigned
              Reporter:
              hholzgra Hartmut Holzgraefe
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved:

                  Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.