[MDEV-26338] BINARY data differences after SELECT INTO OUTFILE / LOAD DATA INFILE Created: 2021-08-10  Updated: 2021-11-01  Resolved: 2021-10-25

Status: Closed
Project: MariaDB Server
Component/s: N/A
Affects Version/s: 10.2.40, 10.3.31, 10.4.21, 10.5.12, 10.6.4
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Hartmut Holzgraefe Assignee: Unassigned
Resolution: Won't Fix Votes: 0
Labels: need_feedback

Issue Links:
Relates
relates to MDEV-23471 LOAD DATA: partial utf8 Sequence in b... Closed

 Description   

When having a BINARY or VARBINARY column and trying to export contents with SELECT INTO OUTFILE, then reading back with LOAD DATA INFILE, data can become mangled depending on character set / encoding.

E.g.:

DROP TABLE IF EXISTS t1;
DROP TABLE IF EXISTS t2;
 
CREATE TABLE t1(
  b1 VARBINARY(16),
  i1 INT
) DEFAULT CHARSET=utf8;
 
INSERT INTO t1 VALUES(UNHEX("00"), 23);
INSERT INTO t1 VALUES(UNHEX("E0"), 42);
 
SELECT * FROM t1 INTO OUTFILE 'data.txt' CHARACTER SET  utf8;
 
CREATE TABLE t2 LIKE t1;
 
LOAD DATA INFILE 'data.txt'
     INTO TABLE t2
     CHARACTER SET utf8;
 
SHOW WARNINGS;
 
SELECT LENGTH(b1), HEX(b1), i1 FROM t1;
SELECT LENGTH(b1), HEX(b1), i1 FROM t2;

I would expect both SELECT statements to return the same result, but instead I'm getting:

MariaDB [test]> SELECT LENGTH(b1), HEX(b1), i1 FROM t1;
+------------+---------+------+
| LENGTH(b1) | HEX(b1) | i1   |
+------------+---------+------+
|          1 | 00      |   23 |
|          1 | E0      |   42 |
+------------+---------+------+
2 rows in set (0.000 sec)

On the 2nd table, the one filled via LOAD DATA, I would assume to see the same result, seeing a byte value of E0 in the HEX(b1) column. But instead I'm seeing an empty b1 column in MariaDB versions up to 10.0, and a value of 3F ('?') starting with MariaDB 10.1.

Result on MariaDB 5.5 and 10.0:

MariaDB [test]> SELECT LENGTH(b1), HEX(b1), i1 FROM t2;
+------------+---------+------+
| LENGTH(b1) | HEX(b1) | i1   |
+------------+---------+------+
|          1 | 00      |   23 |
|          0 |            |   42 |
+------------+---------+------+
2 rows in set (0.000 sec)

Result on MariaDB 10.1 and beyond:

MariaDB [test]> SELECT LENGTH(b1), HEX(b1), i1 FROM t2;
+------------+---------+------+
| LENGTH(b1) | HEX(b1) | i1   |
+------------+---------+------+
|          1 | 00      |   23 |
|          1 | 3F      |   42 |
+------------+---------+------+
2 rows in set (0.000 sec)

The problem here is that the E0 byte is interpreted as the beginning of an UTF8 encoded multi byte sequence, but with the b1 column being of type VARBINARY no UTF8 sequence handling should be applied, the E0 byte should be interpreted verbatim instead.



 Comments   
Comment by Elena Stepanova [ 2021-08-18 ]

Isn't it expected though?

When you do SELECT .. INTO OUTFILE .. CHARACTER SET utf8, you explicitly enforce conversion into utf8 upon writing, even though b1 is not convertible. The warning is thrown then already, and 3F is written into the outfile.

MariaDB [test]> SELECT * FROM t1 INTO OUTFILE 'data.txt' CHARACTER SET  utf8;
Query OK, 2 rows affected, 2 warnings (0.001 sec)
 
MariaDB [test]> show warnings;
<...>
| Warning | 1366 | Data truncated for column 'b1' at row 2                                                                                                                               |
2 rows in set (0.000 sec)

Or, in other words,

MariaDB [test]> SELECT HEX(CONVERT(UNHEX("E0") USING utf8));
+--------------------------------------+
| HEX(CONVERT(UNHEX("E0") USING utf8)) |
+--------------------------------------+
| 3F                                   |
+--------------------------------------+
1 row in set, 1 warning (0.000 sec)
 
MariaDB [test]> show warnings;
+---------+------+---------------------------------------+
| Level   | Code | Message                               |
+---------+------+---------------------------------------+
| Warning | 1300 | Invalid utf8 character string: '\xE0' |
+---------+------+---------------------------------------+
1 row in set (0.000 sec)

Generated at Thu Feb 08 09:44:33 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.