[MDEV-23471] LOAD DATA: partial utf8 Sequence in binary data may "swallow" field separator Created: 2020-08-13  Updated: 2021-11-01  Resolved: 2021-09-20

Status: Closed
Project: MariaDB Server
Component/s: Character Sets
Affects Version/s: 10.2.33, 10.3.24, 10.4.14, 10.5.5
Fix Version/s: N/A

Type: Bug Priority: Critical
Reporter: Hartmut Holzgraefe Assignee: Rucha Deodhar
Resolution: Cannot Reproduce Votes: 1
Labels: need_feedback

Issue Links:
Relates
relates to MDEV-26338 BINARY data differences after SELECT... Closed

 Description   

LOAD DATA has special treatment for binary columns like BLOB or VARBINARY so that these are excluded from character set conversions regardless of the CHARACTER SET setting for LOAD DATA.

Starting with MariaDB 10.2 though, when using CHARACTER SET UTF8 though, the column delimiter detection can be fooled by a byte with its highest bit set near the end of a BLOB or VARBINARY column, treating that as the beginning of an UTF sequence, and interpreting the actual delimiter character as the 2nd, 3rd (or in case of utf8mb4 4th) byte in the sequence, so skipping over it and ending up with too little columns due to the missed delimiter, and an error message like:

ERROR 1261 (01000) at line 18: Row 1 doesn't contain data for all columns

How to reproduce:

CREATE DATABASE IF NOT EXISTS test;
USE test;
 
DROP TABLE IF EXISTS t1;
 
CREATE TABLE t1(
  b1 VARBINARY(16),
  i1 INT
) DEFAULT CHARSET=utf8;
 
INSERT INTO t1 VALUES(HEX("00"), 23);
INSERT INTO t1 VALUES(HEX("E0"), 42);
 
SELECT * FROM t1 INTO OUTFILE 'data.txt';
 
TRUNCATE TABLE t1;
 
LOAD DATA INFILE 'data.txt'
     INTO TABLE t1
     CHARACTER SET utf8
     FIELDS TERMINATED BY ';' ;
 
SELECT HEX(b1), i1 FROM t1;

Expected result:

+---------+------+
| HEX(b1) | i1   |
+---------+------+
| 00      |   23 |
| E0      |   42 |
+---------+------+

Actual result:

ERROR 1261 (01000): Row 2 doesn't contain data for all columns
 
Empty set (0.001 sec)



 Comments   
Comment by Elena Stepanova [ 2020-08-18 ]

hholzgra,

It looks like your test case throws an error because SELECT .. INTO OUTFILE uses the default field delimiter, while LOAD DATA wants a semicolon. There is no semicolon in the file, hence the error.

Comment by Hartmut Holzgraefe [ 2020-08-18 ]

No, it would fail on row 1 already in that case, and not only on row 2.

Remove the 2nd INSERT and the test will run fine, it is really the data in the 2nd row that breaks it

Comment by Hartmut Holzgraefe [ 2020-08-18 ]

And it really only starts with MariaDB 10.2, full test with both INSERT statements works on 10.1

Comment by Elena Stepanova [ 2020-08-18 ]

hholzgra,

Yes, it does fail on row 1 (see output 1), removal of the 2nd INSERT doesn't change anything (see output 2), and it only works on 10.1 because 10.1 runs with a non-strict SQL mode by default, so it produces warnings instead of errors (see output 3).

I do believe that there is a problem, but I think you've pasted a wrong test case, please double-check. Or maybe something else is different – configuration, environment.

output 1, 10.5

MariaDB [test]> DROP TABLE IF EXISTS t1;
Query OK, 0 rows affected (0.040 sec)
 
MariaDB [test]>  
MariaDB [test]> CREATE TABLE t1(
    ->   b1 VARBINARY(16),
    ->   i1 INT
    -> ) DEFAULT CHARSET=utf8;
Query OK, 0 rows affected (0.128 sec)
 
MariaDB [test]>  
MariaDB [test]> INSERT INTO t1 VALUES(HEX("00"), 23);
Query OK, 1 row affected (0.005 sec)
 
MariaDB [test]> INSERT INTO t1 VALUES(HEX("E0"), 42);
Query OK, 1 row affected (0.006 sec)
 
MariaDB [test]>  
MariaDB [test]> SELECT * FROM t1 INTO OUTFILE 'data.txt';
Query OK, 2 rows affected, 1 warning (0.001 sec)
 
MariaDB [test]>  
MariaDB [test]> TRUNCATE TABLE t1;
Query OK, 0 rows affected (0.136 sec)
 
MariaDB [test]>  
MariaDB [test]> LOAD DATA INFILE 'data.txt'
    ->      INTO TABLE t1
    ->      CHARACTER SET utf8
    ->      FIELDS TERMINATED BY ';' ;
ERROR 1261 (01000): Row 1 doesn't contain data for all columns

output 2, 10.5

MariaDB [test]> DROP TABLE IF EXISTS t1;
Query OK, 0 rows affected (0.058 sec)
 
MariaDB [test]>  
MariaDB [test]> CREATE TABLE t1(
    ->   b1 VARBINARY(16),
    ->   i1 INT
    -> ) DEFAULT CHARSET=utf8;
Query OK, 0 rows affected (0.129 sec)
 
MariaDB [test]>  
MariaDB [test]> INSERT INTO t1 VALUES(HEX("00"), 23);
Query OK, 1 row affected (0.005 sec)
 
MariaDB [test]>  
MariaDB [test]> SELECT * FROM t1 INTO OUTFILE 'data.txt';
Query OK, 1 row affected, 1 warning (0.001 sec)
 
MariaDB [test]>  
MariaDB [test]> TRUNCATE TABLE t1;
Query OK, 0 rows affected (0.136 sec)
 
MariaDB [test]>  
MariaDB [test]> LOAD DATA INFILE 'data.txt'
    ->      INTO TABLE t1
    ->      CHARACTER SET utf8
    ->      FIELDS TERMINATED BY ';' ;
ERROR 1261 (01000): Row 1 doesn't contain data for all columns

output 3, 10.1

MariaDB [test]> DROP TABLE IF EXISTS t1;
Query OK, 0 rows affected, 1 warning (0.004 sec)
 
MariaDB [test]>  
MariaDB [test]> CREATE TABLE t1(
    ->   b1 VARBINARY(16),
    ->   i1 INT
    -> ) DEFAULT CHARSET=utf8;
Query OK, 0 rows affected (0.607 sec)
 
MariaDB [test]>  
MariaDB [test]> INSERT INTO t1 VALUES(HEX("00"), 23);
Query OK, 1 row affected (0.075 sec)
 
MariaDB [test]> INSERT INTO t1 VALUES(HEX("E0"), 42);
Query OK, 1 row affected (0.039 sec)
 
MariaDB [test]>  
MariaDB [test]> SELECT * FROM t1 INTO OUTFILE 'data.txt';
Query OK, 2 rows affected (0.001 sec)
 
MariaDB [test]>  
MariaDB [test]> TRUNCATE TABLE t1;
Query OK, 0 rows affected (0.340 sec)
 
MariaDB [test]>  
MariaDB [test]> LOAD DATA INFILE 'data.txt'
    ->      INTO TABLE t1
    ->      CHARACTER SET utf8
    ->      FIELDS TERMINATED BY ';' ;
Query OK, 2 rows affected, 2 warnings (0.040 sec)    
Records: 2  Deleted: 0  Skipped: 0  Warnings: 2
 
MariaDB [test]> show warnings;
+---------+------+--------------------------------------------+
| Level   | Code | Message                                    |
+---------+------+--------------------------------------------+
| Warning | 1261 | Row 1 doesn't contain data for all columns |
| Warning | 1261 | Row 2 doesn't contain data for all columns |
+---------+------+--------------------------------------------+
2 rows in set (0.000 sec)

Comment by Hartmut Holzgraefe [ 2020-08-18 ]

Fresh 10.1.22 installation on Ubuntu Xenial, using the mariadb.com community package repository, no changes to the default configuration.

Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 2
Server version: 10.1.22-MariaDB-1~xenial mariadb.org binary distribution
 
Copyright (c) 2000, 2016, Oracle, MariaDB Corporation Ab and others.
 
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
 
MariaDB [(none)]> CREATE DATABASE IF NOT EXISTS test;
Query OK, 0 rows affected, 1 warning (0.00 sec)
 
MariaDB [(none)]> USE test;
Database changed
 
MariaDB [test]> DROP TABLE IF EXISTS t1;
Query OK, 0 rows affected, 1 warning (0.01 sec)
 
MariaDB [test]> CREATE TABLE t1(
    -> 
    ->   b1 VARBINARY(16),
    -> 
    ->   i1 INT
    -> 
    -> ) DEFAULT CHARSET=utf8;
Query OK, 0 rows affected (0.00 sec)
 
MariaDB [test]> INSERT INTO t1 VALUES(HEX("00"), 23);
Query OK, 1 row affected (0.00 sec)
 
MariaDB [test]> INSERT INTO t1 VALUES(HEX("E0"), 42);
Query OK, 1 row affected (0.00 sec)
 
MariaDB [test]> SELECT * FROM t1 INTO OUTFILE 'data.txt';
Query OK, 2 rows affected (0.00 sec)
 
MariaDB [test]> TRUNCATE TABLE t1;
Query OK, 0 rows affected (0.01 sec)
 
MariaDB [test]> LOAD DATA INFILE 'data.txt'
    -> 
    ->      INTO TABLE t1
    -> 
    ->      CHARACTER SET utf8
    -> 
    ->      FIELDS TERMINATED BY ';' ;
Query OK, 2 rows affected, 2 warnings (0.00 sec)     
Records: 2  Deleted: 0  Skipped: 0  Warnings: 2
 
MariaDB [test]> SELECT HEX(b1), i1 FROM t1;
+----------------+------+
| HEX(b1)        | i1   |
+----------------+------+
| 33303330093233 | NULL |
| 34353330093432 | NULL |
+----------------+------+
2 rows in set (0.00 sec)

Comment by Hartmut Holzgraefe [ 2020-08-18 ]

Wait, NULL? I'm sure there were proper 23 and 42 in that result set when i checked earlier today ...

Comment by Hartmut Holzgraefe [ 2020-08-18 ]

This is very weird, especially as it clearly says "Row 2 doesn't contain data for all columns" in the error message in my original description. Now trying to figure out what changed. I rebuilt all my test VMs over the week, but there should not have been any change related to this ...

Comment by Hartmut Holzgraefe [ 2020-08-18 ]

It was indeed the FIELDS TERMINATED BY ';' after all. While minimizing the test I had removed that from both the SELECT INTO OUTFILE and LOAD DATA INFILE, but somehow pasted a wrong version into the bug report. So the following now works on fresh 10.1.22, after removing the TERMINATED BY, and adding an extra SHOW WARNINGS after LOAD DATA just in case ...

CREATE DATABASE IF NOT EXISTS test;
 
USE test;
 
DROP TABLE IF EXISTS t1;
 
CREATE TABLE t1(
  b1 VARBINARY(16),
  i1 INT
) DEFAULT CHARSET=utf8;
 
INSERT INTO t1 VALUES(HEX("00"), 23);
INSERT INTO t1 VALUES(HEX("E0"), 42);
 
SELECT * FROM t1 INTO OUTFILE 'data.txt';
 
TRUNCATE TABLE t1;
 
LOAD DATA INFILE 'data.txt'
     INTO TABLE t1
     CHARACTER SET utf8;
 
SHOW WARNINGS;
 
SELECT HEX(b1), i1 FROM t1;

10.1.22 result:

MariaDB [test]> CREATE DATABASE IF NOT EXISTS test;
Query OK, 0 rows affected, 1 warning (0.00 sec)
 
MariaDB [test]> USE test;
Database changed
 
MariaDB [test]> DROP TABLE IF EXISTS t1;
Query OK, 0 rows affected (0.02 sec)
 
MariaDB [test]> CREATE TABLE t1(
    ->   b1 VARBINARY(16),
    ->   i1 INT
    -> ) DEFAULT CHARSET=utf8;
Query OK, 0 rows affected (0.05 sec)
 
MariaDB [test]> INSERT INTO t1 VALUES(HEX("00"), 23);
Query OK, 1 row affected (0.01 sec)
 
MariaDB [test]> INSERT INTO t1 VALUES(HEX("E0"), 42);
Query OK, 1 row affected (0.01 sec)
 
MariaDB [test]> SELECT * FROM t1 INTO OUTFILE 'data.txt';
 
MariaDB [test]> TRUNCATE TABLE t1;
Query OK, 0 rows affected (0.09 sec)
 
MariaDB [test]> 
MariaDB [test]> LOAD DATA INFILE 'data.txt'
    ->      INTO TABLE t1
    ->      CHARACTER SET utf8;
Query OK, 2 rows affected (0.00 sec)                 
Records: 2  Deleted: 0  Skipped: 0  Warnings: 0
 
MariaDB [test]> SHOW WARNINGS;
Empty set (0.00 sec)
 
MariaDB [test]> SELECT HEX(b1), i1 FROM t1;
+----------+------+
| HEX(b1)  | i1   |
+----------+------+
| 33303330 |   23 |
| 34353330 |   42 |
+----------+------+
2 rows in set (0.00 sec)

Comment by Hartmut Holzgraefe [ 2020-08-18 ]

So it works on 10.1.22 now, but also on 10.4.14 and 10.5.5 ... WT...?

I'm not making up the ERROR 1261 (01000): Row 2 doesn't contain data for all columns, this was a literal copy&paste of the test result I got ...

Still trying to figure out what's different now ...

Comment by Elena Stepanova [ 2020-08-18 ]

bar, could you please take a look to see if it's a regression in 10.x or the previously wrong behavior was fixed?

You only need the previous comment by hholzgra, it contains all the info and the test case demonstrating the difference between 5.5 and 10.x. The description and previous comments can be skipped.

Comment by Hartmut Holzgraefe [ 2021-02-24 ]

Any news on this?

It is a regression that is now affecting all non-EOL release versions ...

This is also not really related to character sets, but to the way the special "Ignore character sets and encodigs" handling for BINARY/BLOB data in LOAD DATA works.

Comment by Hartmut Holzgraefe [ 2021-02-25 ]

Not sure what was wrong with me the day I filed this.

A) the test case above is wrong, and actually even in two places. The test should actually use UNHEX() not HEX, and not change the default column delimiter on LOAD DATA.

B) The behavior already changed between 10.0.28 and 10.0.29, not

This one actually works:

DROP TABLE IF EXISTS t1;
  
CREATE TABLE t1(
  b1 VARBINARY(16),
  i1 INT
) DEFAULT CHARSET=utf8;
 
INSERT INTO t1 VALUES(UNHEX("00"), 23);
INSERT INTO t1 VALUES(UNHEX("E0"), 42);
 
SELECT * FROM t1 INTO OUTFILE 'data.txt';
 
TRUNCATE TABLE t1;
 
LOAD DATA INFILE 'data.txt'
     INTO TABLE t1
     CHARACTER SET utf8;
 
SHOW WARNINGS;
 
SELECT LENGTH(b1), HEX(b1), i1 FROM t1;

Expected result (up to MariaDB 10.0.28):

SELECT LENGTH(b1), HEX(b1), i1 FROM t1;
+------------+---------+------+
| LENGTH(b1) | HEX(b1) | i1   |
+------------+---------+------+
|          1 | 00      |   23 |
|          1 | E0      |   42 |
+------------+---------+------+

Wrong result starting with 10.0.29:

SELECT LENGTH(b1), HEX(b1), i1 FROM t1;
+------------+----------+------+
| LENGTH(b1) | HEX(b1)  | i1   |
+------------+----------+------+
|          1 | 00       |   23 |
|          4 | E0093432 | NULL |
+------------+----------+------+

Comment by Hartmut Holzgraefe [ 2021-02-25 ]

Problem started in 10.0.29 with fix for https://jira.mariadb.org/browse/MDEV-11343 / https://github.com/MariaDB/server/commit/dd0ff30278cd7b24776ccf36a9c0d9171a569750

Comment by Hartmut Holzgraefe [ 2021-02-25 ]

Another error in my original test, it should have used "CHARACTER SET utf8" for both the INTO OUTFILE and LOAD DATA.

With that things become different once more, as 10.0 would dump using latin1 by default.

With that fixed the test now becomes

DROP TABLE IF EXISTS t1;
 
CREATE TABLE t1(
  b1 VARBINARY(16),
  i1 INT
) DEFAULT CHARSET=utf8;
 
INSERT INTO t1 VALUES(UNHEX("00"), 23);
INSERT INTO t1 VALUES(UNHEX("E0"), 42);
 
SELECT * FROM t1 INTO OUTFILE 'data.txt' CHARACTER SET  utf8;
 
TRUNCATE TABLE t1;
 
LOAD DATA INFILE 'data.txt'
     INTO TABLE t1
     CHARACTER SET utf8;
 
SHOW WARNINGS;
 
SELECT LENGTH(b1), HEX(b1), i1 FROM t1;

And the result becomes this for all 10.0.x releases (before and after 10.0.29):

SELECT LENGTH(b1), HEX(b1), i1 FROM t1;
 
+------------+---------+------+
| LENGTH(b1) | HEX(b1) | i1   |
+------------+---------+------+
|          1 | 00      |   23 |
|          0 |         |   42 |
+------------+---------+------+

Beginning with 10.1 it becomes a little better:

SELECT LENGTH(b1), HEX(b1), i1 FROM t1;
 
+------------+---------+------+
| LENGTH(b1) | HEX(b1) | i1   |
+------------+---------+------+
|          1 | 00      |   23 |
|          1 | 3F      |   42 |
+------------+---------+------+

So instead of the 0xE0 bytes we're now getting a question mark (ASCII 0x3F). This is the same all the way up to latest 10.5

Comment by Hartmut Holzgraefe [ 2021-08-10 ]

I have now split out the test case from my comment on February 26th to a separate report MDEV-26388.

I will continue to look over the rest of my report here, and will either split further valid problems out into additional new MDEVs, or suggest closing this one as the rest of what I originally reported may just have been due to mixed character sets between export and import ...

Generated at Thu Feb 08 09:22:39 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.