Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-11079

Regression: LOAD DATA INFILE lost BLOB support using utf8 load files

Details

    • 5.5.54

    Description

      https://github.com/mysql/mysql-server/commit/774e6ffd0897dd763763b69e15028c1fbd44c4e7 changed the way, load data infile parses the data.

      The commit starts validating the whole load file using the file character set. BLOBs have always been copied 1:1 (no character set translations - only escape sequences are processed). If the load file is using UTF-8, blob columns can never be encoded in UTF-8, as binary data can contain character sequences, which are invalid UTF-8 and there is no charset conversion.

      Starting with this commit, LOAD DATA rejects non-UTF8 sequences in blob fields.

      Create a test file:

      $ hexdump -C test
      00000000  22 25 aa ab ac 22 0a                              |"%ª«¬".|
      

      CREATE TABLE `x` ( `y` mediumblob NOT NULL) ENGINE=MyISAM DEFAULT CHARSET=latin1 COLLATE=latin1_general_ci;
      load data local infile 'test'  into table x charset utf8 FIELDS TERMINATED BY ';' ENCLOSED BY '"' ESCAPED BY '\\' LINES TERMINATED BY '\n';
      

      This will lead to:

      ERROR 1300 (HY000): Invalid utf8 character string: '"%'
      

      Older MariaDB 10.0.X releases can load this file.

      Attachments

        Issue Links

          Activity

            --- sql/sql_load.cc.orig        2016-10-18 17:45:32.156615718 +0200
            +++ sql/sql_load.cc     2016-10-18 17:49:19.990569542 +0200
            @@ -90,7 +90,7 @@
                        String &field_term,String &line_start,String &line_term,
                        String &enclosed,int escape,bool get_it_from_net, bool is_fifo);
               ~READ_INFO();
            -  int read_field();
            +  int read_field(CHARSET_INFO *field_charset);
               int read_fixed_length(void);
               int next_line(void);
               char unescape(char chr);
            @@ -1040,7 +1040,7 @@
                   uchar *pos;
                   Item *real_item;
             
            -      if (read_info.read_field())
            +      if (read_info.read_field(item->real_item()->collation.collation))
                    break;
             
                   /* If this line is to be skipped we don't want to fill field or var */
            @@ -1527,10 +1527,13 @@
             }
             
             
            -int READ_INFO::read_field()
            +int READ_INFO::read_field(CHARSET_INFO *field_charset)
             {
               int chr,found_enclosed_char;
               uchar *to,*new_buffer;
            +  CHARSET_INFO *read_charset = this->read_charset;
            +  if (field_charset == &my_charset_bin)
            +    read_charset = &my_charset_bin;
             
               found_null=0;
               if (found_end_of_line)
            

            koeglermar Martin Koegler added a comment - --- sql/sql_load.cc.orig 2016-10-18 17:45:32.156615718 +0200 +++ sql/sql_load.cc 2016-10-18 17:49:19.990569542 +0200 @@ -90,7 +90,7 @@ String &field_term,String &line_start,String &line_term, String &enclosed,int escape,bool get_it_from_net, bool is_fifo); ~READ_INFO(); - int read_field(); + int read_field(CHARSET_INFO *field_charset); int read_fixed_length(void); int next_line(void); char unescape(char chr); @@ -1040,7 +1040,7 @@ uchar *pos; Item *real_item;   - if (read_info.read_field()) + if (read_info.read_field(item->real_item()->collation.collation)) break;   /* If this line is to be skipped we don't want to fill field or var */ @@ -1527,10 +1527,13 @@ }     -int READ_INFO::read_field() +int READ_INFO::read_field(CHARSET_INFO *field_charset) { int chr,found_enclosed_char; uchar *to,*new_buffer; + CHARSET_INFO *read_charset = this->read_charset; + if (field_charset == &my_charset_bin) + read_charset = &my_charset_bin;   found_null=0; if (found_end_of_line)

            Thanks for the report. In server trees, it's this revision:

            commit 9f7288e2e0179db478d20c74f57b5c7d6c95f793 4a3f1c1f104cbfeb6d31ee02788589151b131eca
            Author: Thayumanavar S <thayumanavar.x.sachithanantha@oracle.com>
            Date:   Mon Jun 20 11:35:43 2016 +0530
             
                BUG#23080148 - BACKPORT BUG 14653594 AND BUG 20683959 TO
                               MYSQL-5.5
                
                The bug asks for a backport of bug#1463594 and bug#20682959. This
                is required because of the fact that if replication is enabled, master
                transaction can commit whereas slave can't commit due to not exact
                'enviroment'. This manifestation is seen in bug#22024200.
            

            koeglermar,
            Thanks for the report and test case.
            Did you also file a bug report to MySQL?

            elenst Elena Stepanova added a comment - Thanks for the report. In server trees, it's this revision: commit 9f7288e2e0179db478d20c74f57b5c7d6c95f793 4a3f1c1f104cbfeb6d31ee02788589151b131eca Author: Thayumanavar S <thayumanavar.x.sachithanantha@oracle.com> Date: Mon Jun 20 11:35:43 2016 +0530   BUG#23080148 - BACKPORT BUG 14653594 AND BUG 20683959 TO MYSQL-5.5 The bug asks for a backport of bug#1463594 and bug#20682959. This is required because of the fact that if replication is enabled, master transaction can commit whereas slave can't commit due to not exact 'enviroment'. This manifestation is seen in bug#22024200. koeglermar , Thanks for the report and test case. Did you also file a bug report to MySQL?
            koeglermar Martin Koegler added a comment - - edited

            I filled no upstream bug, because I have not tested with MySQL.

            PS: JIRA didn't send a email notifications, so I have not noticed the new comment earlier.

            koeglermar Martin Koegler added a comment - - edited I filled no upstream bug, because I have not tested with MySQL. PS: JIRA didn't send a email notifications, so I have not noticed the new comment earlier.

            This change was reverted in last MySQL 5.5 release. Most probably we'll do the same.

            svoj Sergey Vojtovich added a comment - This change was reverted in last MySQL 5.5 release. Most probably we'll do the same.

            What about newer version? MySQL 5.7 does not contain a revert.

            koeglermar Martin Koegler added a comment - What about newer version? MySQL 5.7 does not contain a revert.
            bar Alexander Barkov added a comment - - edited

            In the current 5.5.55 tree it seems to work fine.
            I create the file:

            $ printf "\x22\x25\xaa\xab\xac\x22\x0a" >/tmp/test.txt
            

            and run this SQL script:

            DROP TABLE IF EXISTS t1;
            CREATE TABLE t1 (a mediumblob NOT NULL) ENGINE=MyISAM DEFAULT CHARSET=latin1 COLLATE=latin1_general_ci;
            LOAD DATA LOCAL INFILE '/tmp/test.txt' INTO TABLE t1 CHARSET utf8 FIELDS TERMINATED BY ';' ENCLOSED BY '"' ESCAPED BY '\\' LINES TERMINATED BY '\n';
            SELECT HEX(a) FROM t1;
            

            It loads the data without errors and returns this result as expected:

            +----------+
            | HEX(a)   |
            +----------+
            | 25AAABAC |
            +----------+
            

            10.2 also works fine.

            bar Alexander Barkov added a comment - - edited In the current 5.5.55 tree it seems to work fine. I create the file: $ printf "\x22\x25\xaa\xab\xac\x22\x0a" >/tmp/test.txt and run this SQL script: DROP TABLE IF EXISTS t1; CREATE TABLE t1 (a mediumblob NOT NULL ) ENGINE=MyISAM DEFAULT CHARSET=latin1 COLLATE =latin1_general_ci; LOAD DATA LOCAL INFILE '/tmp/test.txt' INTO TABLE t1 CHARSET utf8 FIELDS TERMINATED BY ';' ENCLOSED BY '"' ESCAPED BY '\\' LINES TERMINATED BY '\n' ; SELECT HEX(a) FROM t1; It loads the data without errors and returns this result as expected: +----------+ | HEX(a) | +----------+ | 25AAABAC | +----------+ 10.2 also works fine.

            Added a test case into 5.5.

            bar Alexander Barkov added a comment - Added a test case into 5.5.
            bnokes Bill Nokes added a comment -

            Hi,

            I believe that MariaDB is still treating UTF8 binary data incorrectly while using load infile. I have a test case that seems to introduce an extra escape char into saved binary blob data when using UTF8. Please see linked Jira - MDEV-12240.

            Kind regards,
            Bill

            bnokes Bill Nokes added a comment - Hi, I believe that MariaDB is still treating UTF8 binary data incorrectly while using load infile. I have a test case that seems to introduce an extra escape char into saved binary blob data when using UTF8. Please see linked Jira - MDEV-12240 . Kind regards, Bill

            People

              serg Sergei Golubchik
              koeglermar Martin Koegler
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.