[MDEV-13361] Regression: LOAD DATA INFILE utf8 enconding issues Created: 2017-07-20  Updated: 2019-04-22  Resolved: 2019-04-22

Status: Closed
Project: MariaDB Server
Component/s: Character Sets
Affects Version/s: 10.0, 10.1, 10.2
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Martin Koegler Assignee: Alexander Barkov
Resolution: Not a Bug Votes: 0
Labels: None

Attachments: File mysql-binary-load-test.patch    
Issue Links:
Relates
relates to MDEV-12240 LOAD DATA INFILE binary blobs failing... Confirmed

 Description   

Propably the fix of MDEV-11079 introduced another regression:

The attached testcase yields to the following error on mariadb 10.0.31:

@@ -5,5 +5,5 @@
 x      HEX(y)
 ÄÖÜ    000102030405060708090A0B0C0D0E0F101112131415161718191A1B1C1D1E1F202122232425262728292A2B2C2D2E2F303132333435363738393A3B3C3D3E3F404142434445464748494A4B4C4D4E4F505152535455565758595A5B5C5D5E5F606162636465666768696A6B6C6D6E6F707172737475767778797A7B7C7D7E7F808182838485868788898A8B8C8D8E8F909192939495969798999A9B9C9D9E9FA0A1A2A3A4A5A6A7A8A9AAABACADAEAFB0B1B2B3B4B5B6B7B8B9BABBBCBDBEBFC0C1C2C3C4C5C6C7C8C9CACBCCCDCECFD0D1D2D3D4D5D6D7D8D9DADBDCDDDEDFE0E1E2E3E4E5E6E7E8E9EAEBECEDEEEFF0F1F2F3F4F5F6F7F8F9FAFBFCFDFEFF
 äöü    000102030405060708090A0B0C0D0E0F101112131415161718191A1B1C1D1E1F202122232425262728292A2B2C2D2E2F303132333435363738393A3B3C3D3E3F404142434445464748494A4B4C4D4E4F505152535455565758595A5B5C5D5E5F606162636465666768696A6B6C6D6E6F707172737475767778797A7B7C7D7E7F808182838485868788898A8B8C8D8E8F909192939495969798999A9B9C9D9E9FA0A1A2A3A4A5A6A7A8A9AAABACADAEAFB0B1B2B3B4B5B6B7B8B9BABBBCBDBEBFC0C1C2C3C4C5C6C7C8C9CACBCCCDCECFD0D1D2D3D4D5D6D7D8D9DADBDCDDDEDFE0E1E2E3E4E5E6E7E8E9EAEBECEDEEEFF0F1F2F3F4F5F6F7F8F9FAFBFCFDFEFF
-ÄÖÜ    8FE95C725427
+ÄÖÜ    8FE95C0D5427
 drop table t1;
 
mysqltest: Result content mismatch

It works on mysql 5.6.17.



 Comments   
Comment by Elena Stepanova [ 2017-07-20 ]

Thanks for the report and test case.

Comment by Martin Koegler [ 2017-07-21 ]

Its still the same reason as in MDEV-11079:
The loading code tries to treat the binary data also as UTF-8 - thereby ignoring any escape sequeces.

The parser is basically:

  • get next character
  • if its an escape character, handle escape sequence
  • otherwise determine multi-byte character length from the character and try copy as much characters [ignoring any escape sequence]

This breaks as soon as binary data is read using the UTF-8 characterset. Binary data will always contain non-utf8 conformant byte sequences. The start byte of a 2 byte UTF8 symbol may even be followed by a escape sequence.
Escape sequences starting in the middle of a multi-byte character are causing this bug.

The patch in MDEV-11079 is still valid for this bug. It switches the parser character set to a binary (single byte) while reading blob fields.

Comment by Martin Koegler [ 2017-07-21 ]

--- sql/sql_load.cc.old 2017-07-20 19:11:17.038746970 +0200
+++ sql/sql_load.cc     2017-07-21 15:09:38.039931561 +0200
@@ -165,7 +165,7 @@
            String &field_term,String &line_start,String &line_term,
            String &enclosed,int escape,bool get_it_from_net, bool is_fifo);
   ~READ_INFO();
-  int read_field();
+  int read_field(CHARSET_INFO *field_charset);
   int read_fixed_length(void);
   int next_line(void);
   char unescape(char chr);
@@ -1116,7 +1116,7 @@
       uchar *pos;
       Item_field *real_item;
 
-      if (read_info.read_field())
+      if (read_info.read_field(item->real_item()->collation.collation))
        break;
 
       /* If this line is to be skipped we don't want to fill field or var */
@@ -1651,10 +1651,13 @@
   must make sure to use escapes properly.
 */
 
-int READ_INFO::read_field()
+int READ_INFO::read_field(CHARSET_INFO *field_charset)
 {
   int chr,found_enclosed_char;
   uchar *to,*new_buffer;
+  CHARSET_INFO *read_charset = this->read_charset;
+  if (field_charset == &my_charset_bin)
+      read_charset = &my_charset_bin;
 
   found_null=0;
   if (found_end_of_line)

Comment by Alexander Barkov [ 2019-04-22 ]

MariaDB assumes that the entire file has the same encoding, because it is specified during 'SELECT .. INTO OUTFILE .. CHARACTER SET ...' on per file (not per column) level. Determining character set per column at LOAD DATA type would break symmetry.

To avoid escape/unescape problems on a mixture of character and binary data one should use 'CHARACTER SET binary' both in 'SELECT INTO OUTFILE' and 'LOAD DATA INFILE' queries.

This test demonstrates that a mixture of utf8 and binary data gets restored without problems:

let $MYSQLD_DATADIR= `SELECT @@datadir`;
 
SET NAMES utf8;
 
CREATE TABLE t1 (
  x VARCHAR(10) CHARACTER SET UTF8 NOT NULL,
  y MEDIUMBLOB NOT NULL
);
INSERT INTO t1 VALUES ('ÄÖÜ', X'8FE95C725427');
 
SELECT *
  INTO OUTFILE 't1.txt' CHARACTER SET BINARY
  FIELDS TERMINATED BY ';' ENCLOSED BY '"' ESCAPED BY '\\'
  LINES TERMINATED BY '\n'
FROM t1;
 
CREATE TABLE t2 LIKE t1;
 
LOAD DATA INFILE 't1.txt' INTO TABLE t2 CHARSET BINARY
  FIELDS TERMINATED BY ';' ENCLOSED BY '"' ESCAPED BY '\\'
  LINES TERMINATED BY '\n';
SELECT 't1' AS tab, HEX(x), HEX(y) FROM t1
UNION ALL
SELECT 't2', HEX(x), HEX(y) FROM t2;
 
DROP TABLE t1, t2;
 
--remove_file $MYSQLD_DATADIR/test/t1.txt

The output is:

SET NAMES utf8;
CREATE TABLE t1 (
x VARCHAR(10) CHARACTER SET UTF8 NOT NULL,
y MEDIUMBLOB NOT NULL
);
INSERT INTO t1 VALUES ('ÄÖÜ', X'8FE95C725427');
SELECT *
INTO OUTFILE 't1.txt' CHARACTER SET BINARY
FIELDS TERMINATED BY ';' ENCLOSED BY '"' ESCAPED BY '\\'
  LINES TERMINATED BY '\n'
FROM t1;
CREATE TABLE t2 LIKE t1;
LOAD DATA INFILE 't1.txt' INTO TABLE t2 CHARSET BINARY
FIELDS TERMINATED BY ';' ENCLOSED BY '"' ESCAPED BY '\\'
  LINES TERMINATED BY '\n';
SELECT 't1' AS tab, HEX(x), HEX(y) FROM t1
UNION ALL
SELECT 't2', HEX(x), HEX(y) FROM t2;
tab	HEX(x)	HEX(y)
t1	C384C396C39C	8FE95C725427
t2	C384C396C39C	8FE95C725427
DROP TABLE t1, t2;

Generated at Thu Feb 08 08:05:00 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.