MariaDB Server / MDEV-9811

LOAD DATA INFILE does not work well with gbk in some cases

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 5.5(EOL), 10.0(EOL), 10.1(EOL), 10.2(EOL)
    • Fix Version/s: 10.2.0
    • Component/s: Character Sets
    • Labels: None

    Description

      I have an empty table:

      DROP TABLE IF EXISTS t1;
      CREATE TABLE t1 (a VARCHAR(10) CHARACTER SET gbk);
      

      and an external file with this GBK text data:

      printf "\xB0\x40\x61\xB0\x41\x40\xB0\x42\x40" >/tmp/test.txt
      

      Note, the file can be checked with:

      SELECT HEX(LOAD_FILE('/tmp/test.txt'));
      

      +---------------------------------+
      | HEX(LOAD_FILE('/tmp/test.txt')) |
      +---------------------------------+
      | B04061B04140B04240              |
      +---------------------------------+
      

      The file consists of:

      [B040] - a GBK double-byte character
      [61]   - ASCII 'a'
      [B041] - a GBK double-byte character
      [40]   - ASCII '@'
      [B042] - a GBK double-byte character
      [40]   - ASCII '@'
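
The byte breakdown above can be double-checked outside the server. The following is a minimal sketch that assumes Python's built-in "gbk" codec agrees with the server's gbk character set for these particular bytes; the variable names are illustrative only.

```python
# Bytes from the report, written as hex for readability.
data = bytes.fromhex("B04061B04140B04240")

# Decode with Python's built-in "gbk" codec. Assumption: it maps these
# bytes the same way the server's gbk character set does.
text = data.decode("gbk")

# Print each decoded character with the bytes it came from, so that
# double-byte characters and single-byte ASCII are easy to tell apart.
for ch in text:
    raw = ch.encode("gbk")
    print(raw.hex().upper(), len(raw), repr(ch))
```

Each double-byte character is reported with length 2 and each ASCII character with length 1, which makes it visible that the first 0x40 in the file is a trail byte and not a standalone '@'.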
      

      Now I want to treat the '@' characters as line separators and load the file into the table:

      LOAD DATA INFILE '/tmp/test.txt' INTO TABLE t1 CHARACTER SET gbk LINES TERMINATED BY '@';
      SELECT HEX(a),a FROM t1;
      

      +------------+---------+
      | HEX(a)     | a       |
      +------------+---------+
      | B04061B041 | 癅a癆   |
      | B042       | 癇      |
      +------------+---------+
      

      It correctly recognized two lines.

      Now I want to skip the first line and reload the data:

      DELETE FROM t1;
      LOAD DATA INFILE '/tmp/test.txt' INTO TABLE t1 CHARACTER SET gbk LINES TERMINATED BY '@' IGNORE 1 LINES;
      SELECT HEX(a),a FROM t1;
      

      +--------+------+
      | HEX(a) | a    |
      +--------+------+
      | 61B041 | a癆  |
      | B042   | 癇   |
      +--------+------+
      

      The result is wrong: it still returns two rows.
      While skipping the first line, the 0x40 byte that is the second byte of the first double-byte character 0xB040 was erroneously interpreted as a line terminator, so only the bytes B0 40 were skipped.
      The expected result is one row:

      +--------+------+
      | HEX(a) | a    |
      +--------+------+
      | B042   | 癇   |
      +--------+------+
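
The difference between the wrong and the expected result can be reproduced with a small model of line splitting. This is only a sketch under stated assumptions, not the server's code: `split_bytewise` and `split_gbk_aware` are hypothetical helpers, and the lead/trail byte ranges are the usual GBK ranges (lead 0x81-0xFE, trail 0x40-0xFE except 0x7F).

```python
def is_gbk_lead(b):
    # Usual GBK lead-byte range.
    return 0x81 <= b <= 0xFE


def is_gbk_trail(b):
    # Usual GBK trail-byte range; 0x7F is excluded.
    return 0x40 <= b <= 0xFE and b != 0x7F


def split_bytewise(data, term):
    """Naive split: every byte equal to `term` ends a line."""
    lines, start = [], 0
    for i, b in enumerate(data):
        if b == term:
            lines.append(data[start:i])
            start = i + 1
    if start < len(data):
        lines.append(data[start:])
    return lines


def split_gbk_aware(data, term):
    """Split on `term`, but never inside a double-byte GBK character."""
    lines, start, i = [], 0, 0
    while i < len(data):
        b = data[i]
        if is_gbk_lead(b) and i + 1 < len(data) and is_gbk_trail(data[i + 1]):
            i += 2  # step over the whole double-byte character
            continue
        if b == term:
            lines.append(data[start:i])
            start = i + 1
        i += 1
    if start < len(data):
        lines.append(data[start:])
    return lines


DATA = bytes.fromhex("B04061B04140B04240")
AT = 0x40  # '@', the line terminator used in the report

# Drop the first line from each result, modelling IGNORE 1 LINES.
buggy = [line.hex().upper() for line in split_bytewise(DATA, AT)[1:]]
expected = [line.hex().upper() for line in split_gbk_aware(DATA, AT)[1:]]
print(buggy)
print(expected)
```

Under these assumptions the byte-wise splitter leaves the rows 61B041 and B042 after skipping one line, matching the wrong result above, while the character-aware splitter leaves only B042.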
      

            People

              Assignee: Alexander Barkov (bar)
              Reporter: Alexander Barkov (bar)