[MDEV-9842] LOAD DATA INFILE does not work well with a TEXT column when using sjis Created: 2016-03-31  Updated: 2016-04-01  Resolved: 2016-04-01

Status: Closed
Project: MariaDB Server
Component/s: Character Sets
Affects Version/s: 5.5, 10.0, 10.1
Fix Version/s: 10.2.0

Type: Bug Priority: Major
Reporter: Alexander Barkov Assignee: Alexander Barkov
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Blocks
blocks MDEV-6353 my_ismbchar() and my_mbcharlen() refa... Closed

 Description   

I create a text file:

mysql --default-character-set=sjis --skip-column-names --exec="SELECT CONCAT('x', REPEAT(_sjis 0x835C, 200))" >/tmp/test.txt

It consists of one LATIN SMALL LETTER X followed by 200 KATAKANA LETTER SO characters (each encoded as 0x835C in SJIS).
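For readers without a server at hand, the expected file contents can be approximated as below. This is only a sketch: it assumes the client writes the value followed by a single newline, and the names SO and data are illustrative. It also shows why this character is a useful test case: its trailing byte 0x5C is the same as ASCII backslash, which is the default escape character for LOAD DATA.

```python
# Build the expected file contents directly, without the mysql client.
# Assumption: the client writes the value followed by a single newline.
SO = bytes.fromhex("835C")            # KATAKANA LETTER SO in SJIS
data = b"x" + SO * 200 + b"\n"

# The trailing byte of the character is 0x5C, i.e. ASCII backslash.
trail_is_backslash = SO[1:] == b"\\"

# Decoding as Shift_JIS gives 'x', then 200 x KATAKANA LETTER SO, then newline.
text = data.decode("shift_jis")

print(len(data), trail_is_backslash, text[:4])
```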

Now I create a table and load the file into it:

DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (a TEXT CHARACTER SET sjis);
LOAD DATA INFILE '/tmp/test.txt' INTO TABLE t1 CHARACTER SET sjis;
SHOW WARNINGS;
SELECT a FROM t1;

It returns the following warning:

+---------+------+------------------------------------------------------------+
| Level   | Code | Message                                                    |
+---------+------+------------------------------------------------------------+
| Warning | 1366 | Incorrect string value: '\x83\x0A' for column 'a' at row 1 |
+---------+------+------------------------------------------------------------+

and this result (the row is very wide; its end is far to the right):

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| a                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| xソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソャャャャャャャャャャャャャャャャャャャャャャャャャャャャャャャャャャャャ?
                                                                                                                                                                    |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Both the warning and the result are wrong:

  • The import starts correctly, but at some offset KATAKANA LETTER SO turns into a different character.
  • The trailing question mark should not be there.

If I now change the column type from TEXT to VARCHAR(1000), it loads the file without problems and without warnings:

DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (a VARCHAR(1000) CHARACTER SET sjis);
LOAD DATA INFILE '/tmp/test.txt' INTO TABLE t1 CHARACTER SET sjis;
SELECT a FROM t1;

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| a                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| xソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソソ                                                                                                                                                                                                         |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+



 Comments   
Comment by Alexander Barkov [ 2016-03-31 ]

The problem happens in this piece of the code:

#ifdef USE_MB
      if (my_mbcharlen(read_charset, chr) > 1 &&
          to + my_mbcharlen(read_charset, chr) <= end_of_buff)
      {

When there is only one byte left in the allocated buffer, the next multi-byte character is erroneously scanned byte-by-byte instead of two bytes at a time.
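
The effect of that condition can be illustrated with a small Python model. This is a simplified sketch, not the server code: scan_field, sjis_mbcharlen, the 256-byte initial buffer, and the "grow the buffer first" variant are all assumptions made for illustration, and the variant is not necessarily how the actual fix works. In the model, once a lead byte 0x83 is stored alone at the buffer boundary, every following 0x5C is treated as an escape character, so the remaining data collapses into runs of 0x83 and ends in 0x83 0x0A, the same bytes that appear in the warning.

```python
ESCAPE = 0x5C    # default LOAD DATA escape byte (backslash)
NEWLINE = 0x0A   # default line terminator

def sjis_mbcharlen(byte):
    """Simplified SJIS rule: lead bytes start a two-byte character."""
    return 2 if (0x81 <= byte <= 0x9F or 0xE0 <= byte <= 0xFC) else 1

def scan_field(data, buff_size, buggy):
    """Copy one field into a buffer that grows in buff_size steps (a model)."""
    out = bytearray()
    end_of_buff = buff_size
    i = 0
    while i < len(data):
        byte = data[i]
        mblen = sjis_mbcharlen(byte)
        if mblen > 1:
            fits = len(out) + mblen <= end_of_buff
            if not fits and not buggy:
                end_of_buff += buff_size   # hypothetical: grow, then copy whole
                fits = True
            if fits:
                out += data[i:i + mblen]   # copy the character as one unit
                i += mblen
                continue
            # buggy path: fall through and treat the lead byte as single-byte
        i += 1
        if byte == ESCAPE and i < len(data):
            byte = data[i]                 # escaped byte is taken literally
            i += 1
        elif byte == NEWLINE:
            break                          # end of the line
        if len(out) >= end_of_buff:
            end_of_buff += buff_size
        out.append(byte)
    return bytes(out)

data = b"x" + b"\x83\x5c" * 200 + b"\n"
broken = scan_field(data, 256, buggy=True)
clean = scan_field(data, 256, buggy=False)
roomy = scan_field(data, 4096, buggy=True)   # boundary never reached
```

In this model, broken starts with the correct characters, switches to 0x8383 (KATAKANA LETTER SMALL YA in SJIS) after the boundary, and ends with the invalid pair 0x83 0x0A, while clean holds all 200 characters. The roomy call suggests one plausible reason VARCHAR(1000) is unaffected, assuming the larger column leads to a larger initial buffer: the boundary is never hit inside the data.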
