[MDEV-11079] Regression: LOAD DATA INFILE lost BLOB support using utf8 load files Created: 2016-10-18  Updated: 2017-03-14  Resolved: 2017-01-09

Status: Closed
Project: MariaDB Server
Component/s: Character Sets
Affects Version/s: 5.5.51, 10.0.27, 5.5, 10.0, 10.1
Fix Version/s: 5.5.55, 10.0.29, 10.1.21

Type: Bug Priority: Major
Reporter: Martin Koegler Assignee: Sergei Golubchik
Resolution: Fixed Votes: 0
Labels: contribution, regression, upstream

Issue Links:
Relates
relates to MDEV-12240 LOAD DATA INFILE binary blobs failing... Confirmed
Sprint: 5.5.54

 Description   

https://github.com/mysql/mysql-server/commit/774e6ffd0897dd763763b69e15028c1fbd44c4e7 changed the way, load data infile parses the data.

The commit starts validating the whole load file using the file character set. BLOBs have always been copied 1:1 (no character set translations - only escape sequences are processed). If the load file is using UTF-8, blob columns can never be encoded in UTF-8, as binary data can contain character sequences, which are invalid UTF-8 and there is no charset conversion.

Starting with this commit, LOAD DATA rejects non-UTF8 sequences in blob fields.

Create a test file:

$ hexdump -C test
00000000  22 25 aa ab ac 22 0a                              |"%ª«¬".|

CREATE TABLE `x` ( `y` mediumblob NOT NULL) ENGINE=MyISAM DEFAULT CHARSET=latin1 COLLATE=latin1_general_ci;
load data local infile 'test'  into table x charset utf8 FIELDS TERMINATED BY ';' ENCLOSED BY '"' ESCAPED BY '\\' LINES TERMINATED BY '\n';

This will lead to:

ERROR 1300 (HY000): Invalid utf8 character string: '"%'

Older MariaDB 10.0.X releases can load this file.



 Comments   
Comment by Martin Koegler [ 2016-10-18 ]

--- sql/sql_load.cc.orig        2016-10-18 17:45:32.156615718 +0200
+++ sql/sql_load.cc     2016-10-18 17:49:19.990569542 +0200
@@ -90,7 +90,7 @@
            String &field_term,String &line_start,String &line_term,
            String &enclosed,int escape,bool get_it_from_net, bool is_fifo);
   ~READ_INFO();
-  int read_field();
+  int read_field(CHARSET_INFO *field_charset);
   int read_fixed_length(void);
   int next_line(void);
   char unescape(char chr);
@@ -1040,7 +1040,7 @@
       uchar *pos;
       Item *real_item;
 
-      if (read_info.read_field())
+      if (read_info.read_field(item->real_item()->collation.collation))
        break;
 
       /* If this line is to be skipped we don't want to fill field or var */
@@ -1527,10 +1527,13 @@
 }
 
 
-int READ_INFO::read_field()
+int READ_INFO::read_field(CHARSET_INFO *field_charset)
 {
   int chr,found_enclosed_char;
   uchar *to,*new_buffer;
+  CHARSET_INFO *read_charset = this->read_charset;
+  if (field_charset == &my_charset_bin)
+    read_charset = &my_charset_bin;
 
   found_null=0;
   if (found_end_of_line)

Comment by Elena Stepanova [ 2016-10-19 ]

Thanks for the report. In server trees, it's this revision:

commit 9f7288e2e0179db478d20c74f57b5c7d6c95f793 4a3f1c1f104cbfeb6d31ee02788589151b131eca
Author: Thayumanavar S <thayumanavar.x.sachithanantha@oracle.com>
Date:   Mon Jun 20 11:35:43 2016 +0530
 
    BUG#23080148 - BACKPORT BUG 14653594 AND BUG 20683959 TO
                   MYSQL-5.5
    
    The bug asks for a backport of bug#1463594 and bug#20682959. This
    is required because of the fact that if replication is enabled, master
    transaction can commit whereas slave can't commit due to not exact
    'enviroment'. This manifestation is seen in bug#22024200.

koeglermar,
Thanks for the report and test case.
Did you also file a bug report to MySQL?

Comment by Martin Koegler [ 2016-10-28 ]

I filled no upstream bug, because I have not tested with MySQL.

PS: JIRA didn't send a email notifications, so I have not noticed the new comment earlier.

Comment by Sergey Vojtovich [ 2016-12-20 ]

This change was reverted in last MySQL 5.5 release. Most probably we'll do the same.

Comment by Martin Koegler [ 2016-12-27 ]

What about newer version? MySQL 5.7 does not contain a revert.

Comment by Alexander Barkov [ 2017-01-09 ]

In the current 5.5.55 tree it seems to work fine.
I create the file:

$ printf "\x22\x25\xaa\xab\xac\x22\x0a" >/tmp/test.txt

and run this SQL script:

DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (a mediumblob NOT NULL) ENGINE=MyISAM DEFAULT CHARSET=latin1 COLLATE=latin1_general_ci;
LOAD DATA LOCAL INFILE '/tmp/test.txt' INTO TABLE t1 CHARSET utf8 FIELDS TERMINATED BY ';' ENCLOSED BY '"' ESCAPED BY '\\' LINES TERMINATED BY '\n';
SELECT HEX(a) FROM t1;

It loads the data without errors and returns this result as expected:

+----------+
| HEX(a)   |
+----------+
| 25AAABAC |
+----------+

10.2 also works fine.

Comment by Alexander Barkov [ 2017-01-09 ]

Added a test case into 5.5.

Comment by Bill Nokes [ 2017-03-14 ]

Hi,

I believe that MariaDB is still treating UTF8 binary data incorrectly while using load infile. I have a test case that seems to introduce an extra escape char into saved binary blob data when using UTF8. Please see linked Jira - MDEV-12240.

Kind regards,
Bill

Generated at Thu Feb 08 07:47:09 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.