[MDEV-9811] LOAD DATA INFILE does not work well with gbk in some cases Created: 2016-03-28  Updated: 2016-03-31  Resolved: 2016-03-31

Status: Closed
Project: MariaDB Server
Component/s: Character Sets
Affects Version/s: 5.5, 10.0, 10.1, 10.2
Fix Version/s: 10.2.0

Type: Bug Priority: Major
Reporter: Alexander Barkov Assignee: Alexander Barkov
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Blocks
blocks MDEV-6353 my_ismbchar() and my_mbcharlen() refa... Closed

 Description   

I have an empty table:

DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (a VARCHAR(10) CHARACTER SET gbk);

and an external file with this GBK text data:

printf "\xB0\x40\x61\xB0\x41\x40\xB0\x42\x40" >/tmp/test.txt

The file contents can be verified with:

SELECT HEX(LOAD_FILE('/tmp/test.txt'));

+---------------------------------+
| HEX(LOAD_FILE('/tmp/test.txt')) |
+---------------------------------+
| B04061B04140B04240              |
+---------------------------------+

The file consists of:

[B040] - a GBK double-byte character
[61]   - ASCII 'a'
[B041] - a GBK double-byte character
[40]   - ASCII '@'
[B042] - a GBK double-byte character
[40]   - ASCII '@'
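
The breakdown above can be reproduced with a minimal Python sketch. It assumes the usual GBK byte ranges (lead byte 0x81-0xFE, trail byte 0x40-0x7E or 0x80-0xFE); the helper name gbk_units is hypothetical and is not part of MariaDB.

```python
def gbk_units(data):
    """Split a byte string into GBK code units (1 or 2 bytes each).

    Assumed ranges: lead byte 0x81-0xFE, trail byte 0x40-0x7E or 0x80-0xFE.
    """
    units = []
    i = 0
    while i < len(data):
        lead = data[i]
        has_next = i + 1 < len(data)
        if (0x81 <= lead <= 0xFE and has_next
                and (0x40 <= data[i + 1] <= 0x7E or 0x80 <= data[i + 1] <= 0xFE)):
            # Lead byte followed by a valid trail byte: one double-byte character.
            units.append(data[i:i + 2])
            i += 2
        else:
            # Single-byte (ASCII) character, or an unpaired byte kept as-is.
            units.append(data[i:i + 1])
            i += 1
    return units


data = b"\xB0\x40\x61\xB0\x41\x40\xB0\x42\x40"
print([u.hex().upper() for u in gbk_units(data)])
# ['B040', '61', 'B041', '40', 'B042', '40']
```

Note that the byte 0x40 occurs three times in the file, but only the last two occurrences are standalone '@' characters; the first one is the trail byte of 0xB040.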

Now I want to treat the '@' characters as line separators and load the file into the table:

LOAD DATA INFILE '/tmp/test.txt' INTO TABLE t1 CHARACTER SET gbk LINES TERMINATED BY '@';
SELECT HEX(a),a FROM t1;

+------------+---------+
| HEX(a)     | a       |
+------------+---------+
| B04061B041 | 癅a癆   |
| B042       | 癇      |
+------------+---------+

It correctly recognized two lines.

Now I want to skip the first line and reload the data:

DELETE FROM t1;
LOAD DATA INFILE '/tmp/test.txt' INTO TABLE t1 CHARACTER SET gbk LINES TERMINATED BY '@' IGNORE 1 LINES;
SELECT HEX(a),a FROM t1;

+--------+------+
| HEX(a) | a    |
+--------+------+
| 61B041 | a癆  |
| B042   | 癇   |
+--------+------+

The result is wrong: it still returns two rows.
The 0x40 byte, which is the second byte of the first double-byte character 0xB040, was erroneously interpreted as a line terminator while skipping the ignored line.
The expected result is a single row:

+--------+------+
| HEX(a) | a    |
+--------+------+
| B042   | 癇   |
+--------+------+
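
The difference between the observed and the expected result can be modeled with a short Python sketch. This is only an illustration of the behavior described in this report, not the server's actual code: the helpers is_gbk_pair and split_lines are hypothetical, and the GBK byte ranges (lead 0x81-0xFE, trail 0x40-0x7E or 0x80-0xFE) are assumed.

```python
def is_gbk_pair(data, i):
    """True if a GBK double-byte character starts at data[i] (assumed ranges)."""
    if i + 1 >= len(data):
        return False
    lead, trail = data[i], data[i + 1]
    return 0x81 <= lead <= 0xFE and (0x40 <= trail <= 0x7E or 0x80 <= trail <= 0xFE)


def split_lines(data, term, multibyte_aware):
    """Split data on the single-byte terminator term.

    With multibyte_aware=False every byte equal to term ends a line,
    even when it is the trail byte of a double-byte character.
    """
    lines = []
    current = bytearray()
    i = 0
    while i < len(data):
        if multibyte_aware and is_gbk_pair(data, i):
            current += data[i:i + 2]   # consume the whole character
            i += 2
        elif data[i] == term:
            lines.append(bytes(current))
            current = bytearray()
            i += 1
        else:
            current.append(data[i])
            i += 1
    if current:
        lines.append(bytes(current))
    return lines


data = b"\xB0\x40\x61\xB0\x41\x40\xB0\x42\x40"
at = ord("@")

# Byte-wise scan, then IGNORE 1 LINES: matches the wrong result above.
print([l.hex().upper() for l in split_lines(data, at, False)[1:]])
# ['61B041', 'B042']

# Character-aware scan, then IGNORE 1 LINES: matches the expected result.
print([l.hex().upper() for l in split_lines(data, at, True)[1:]])
# ['B042']
```

In this model the byte-wise scan treats the lone 0xB0 as the first "line", so skipping it leaves 61B041 and B042, which is exactly the wrong output shown above.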

