  1. MariaDB Server
  2. MDEV-9811

LOAD DATA INFILE does not work well with gbk in some cases

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 5.5, 10.0, 10.1, 10.2
    • Fix Version/s: 10.2.0
    • Component/s: Character Sets
    • Labels:
      None

      Description

      I have an empty table:

      DROP TABLE IF EXISTS t1;
      CREATE TABLE t1 (a VARCHAR(10) CHARACTER SET gbk);
      

      and an external file with this GBK text data:

      printf "\xB0\x40\x61\xB0\x41\x40\xB0\x42\x40" >/tmp/test.txt
      

      The file contents can be checked with:

      SELECT HEX(LOAD_FILE('/tmp/test.txt'));
      

      +---------------------------------+
      | HEX(LOAD_FILE('/tmp/test.txt')) |
      +---------------------------------+
      | B04061B04140B04240              |
      +---------------------------------+
      

      The file consists of:

      [B040] - a GBK double-byte character
      [61]   - ASCII 'a'
      [B041] - a GBK double-byte character
      [40]   - ASCII '@'
      [B042] - a GBK double-byte character
      [40]   - ASCII '@'
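
      The byte breakdown above can be reproduced with a minimal Python sketch. It assumes the simplified rule that any byte in the range 0x81..0xFE starts a double-byte GBK character; the helper name split_gbk_chars is hypothetical and is not part of MariaDB.

```python
def split_gbk_chars(data):
    """Split a byte string into GBK characters (1 or 2 bytes each)."""
    chars = []
    i = 0
    while i < len(data):
        lead = data[i]
        # Assumption: a byte in 0x81..0xFE is the lead byte of a
        # double-byte GBK character, so it consumes the next byte too.
        if 0x81 <= lead <= 0xFE and i + 1 < len(data):
            chars.append(data[i:i + 2])
            i += 2
        else:
            # Anything else is treated as a single-byte (ASCII) character.
            chars.append(data[i:i + 1])
            i += 1
    return chars


data = bytes.fromhex("B04061B04140B04240")
print([c.hex().upper() for c in split_gbk_chars(data)])
# → ['B040', '61', 'B041', '40', 'B042', '40']
```

      Note that the first 0x40 byte never appears as a character of its own: it is consumed as the trailing byte of 0xB040.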
      

      Now I want to treat the '@' characters as line separators and load the file into the table:

      LOAD DATA INFILE '/tmp/test.txt' INTO TABLE t1 CHARACTER SET gbk LINES TERMINATED BY '@';
      SELECT HEX(a),a FROM t1;
      

      +------------+---------+
      | HEX(a)     | a       |
      +------------+---------+
      | B04061B041 | 癅a癆   |
      | B042       | 癇      |
      +------------+---------+
      

      It correctly recognized two lines.

      Now I want to skip the first line and reload the data:

      DELETE FROM t1;
      LOAD DATA INFILE '/tmp/test.txt' INTO TABLE t1 CHARACTER SET gbk LINES TERMINATED BY '@' IGNORE 1 LINES;
      SELECT HEX(a),a FROM t1;
      

      +--------+------+
      | HEX(a) | a    |
      +--------+------+
      | 61B041 | a癆  |
      | B042   | 癇   |
      +--------+------+
      

      The result is wrong: it still returns two rows.
      While skipping the first line, the 0x40 byte that is the second byte of the first double-byte character 0xB040 was erroneously interpreted as a line terminator.
      The expected result is one row:

      +--------+------+
      | HEX(a) | a    |
      +--------+------+
      | B042   | 癇   |
      +--------+------+
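
      The difference between the observed and the expected result can be modelled with a short Python sketch. This is only an illustration of the symptom, not MariaDB's actual code: it assumes that the IGNORE step searches for the terminator byte by byte while the row-loading step is character-aware, and it uses the simplified rule that a byte in 0x81..0xFE starts a double-byte GBK character. All function names are hypothetical.

```python
TERMINATOR = b"@"


def is_gbk_lead(byte):
    # Assumption: simplified GBK lead-byte range.
    return 0x81 <= byte <= 0xFE


def split_lines_gbk(data):
    """Character-aware split: only a standalone '@' ends a line."""
    lines = []
    start = 0
    i = 0
    while i < len(data):
        if is_gbk_lead(data[i]) and i + 1 < len(data):
            i += 2  # skip both bytes of a double-byte character
        elif data[i:i + 1] == TERMINATOR:
            lines.append(data[start:i])
            i += 1
            start = i
        else:
            i += 1
    if start < len(data):
        lines.append(data[start:])  # trailing line without a terminator
    return lines


def load_ignore_bytewise(data, ignore):
    """Model of the reported behavior: skip lines by raw byte search."""
    for _ in range(ignore):
        pos = data.find(TERMINATOR)
        if pos < 0:
            return []
        data = data[pos + 1:]
    return split_lines_gbk(data)


def load_ignore_charwise(data, ignore):
    """Model of the expected behavior: skip whole character-aware lines."""
    return split_lines_gbk(data)[ignore:]


data = bytes.fromhex("B04061B04140B04240")
print([r.hex().upper() for r in load_ignore_bytewise(data, 1)])
# → ['61B041', 'B042']
print([r.hex().upper() for r in load_ignore_charwise(data, 1)])
# → ['B042']
```

      The byte-wise skip stops inside 0xB040 and leaves 0x61 as the start of the next row, which matches the wrong output above; the character-aware skip yields the single expected row.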
      


              People

              • Assignee:
                bar Alexander Barkov
                Reporter:
                bar Alexander Barkov
