[MCOL-4931] Make cpimport charset aware Created: 2021-11-22  Updated: 2023-09-22  Resolved: 2023-09-07

Status: Closed
Project: MariaDB ColumnStore
Component/s: cpimport
Affects Version/s: 5.5.1, 6.1.1
Fix Version/s: 23.10.0

Type: Bug Priority: Major
Reporter: Yakov Kushnirsky Assignee: Gagan Goel (Inactive)
Resolution: Fixed Votes: 0
Labels: rm_invalid_data
Environment:

CentOS; Amazon EC2


Attachments: Text File flights.txt     File repro.tsv     File repro_cpimp.tsv     File repro_ldif.tsv     File reproduction.bash    
Issue Links:
Blocks
blocks MCOL-4484 "Alter table modify column oldname ne... Closed
is blocked by MCOL-5005 Add charset number to system catalog Closed
Relates
relates to MCOL-5563 Investigate different MTR test output... Closed
Sprint: 2021-16, 2021-17, 2022-22, 2022-23, 2023-4, 2023-5, 2023-6, 2023-7, 2023-8, 2023-10
Assigned for Review: Roman Roman
Assigned for Testing: Daniel Lee Daniel Lee (Inactive)

 Description   

Rewording

Cpimport and LDIF of the same file doesn't have the same result. Cpimport appears to not truncate strings

cpimport test flights /tmp/flights.txt -m1 -s '\t'
versus
mariadb test -e "LOAD DATA INFILE '/tmp/flights.txt' IGNORE INTO TABLE flights2  FIELDS TERMINATED BY '\t';"

Expected:
When using cpimport - Strings longer than 255 are truncated to fit varchar(255) just like LDIF does

Actual:
cpimport does not truncate strings even when the column is defined as varchar(255), unlike LDIF

Reproduction:
Follow the commands/steps in reproduction.bash after scp of flights.txt to /tmp/ directory

-----------------------------
it seems that cpimport could multiply some characters (up to number of charset bytes) when loading data into varchar column(s).
For example, in the original case, data loaded from .tsv file into varchar(255) as

cpimport test flights_repro flights_repro.txt -m1 -e1 -s '\t' -n1

resulted in the following output (charset=utf8mb3), which does not look right:
select id, lengthb(notes),char_length(notes) from flights_repro;
---------------------------------------

id lengthb(notes) char_length(notes)

---------------------------------------

3 765 765
5199 765 765
7275 765 765

...

If the same data were loaded via LDIF as

LOAD DATA INFILE '/tmp/flights2.txt' INTO TABLE flights2_cs FIELDS TERMINATED BY '\t';

then result looks correct:
select id, lengthb(notes),char_length(notes) from flights_repro;
---------------------------------------

id lengthb(notes) char_length(notes)

---------------------------------------

3 255 255
5199 255 255
7275 255 255

...

An attempted simplified repro is the following:

repro.tsv produced as (in same way/options as in the original case)

mysql -Ns -B -D test --execute="select id,notes from flights_biu where id =3" > repro.txt

use test;
CREATE TABLE `repro` (
`id` int(11) NOT NULL,
`notes` varchar(255) DEFAULT NULL
) ENGINE=Columnstore DEFAULT CHARSET=utf8mb3
;
LOAD DATA INFILE '/tmp/repro.tsv' INTO TABLE repro FIELDS TERMINATED BY '\t';
...
Query OK, 1 row affected (1.186 sec)
Records: 1 Deleted: 0 Skipped: 0 Warnings: 0
...

select id, lengthb(notes),char_length(notes) from repro;
------------------------------------

id lengthb(notes) char_length(notes)

------------------------------------

3 255 255

------------------------------------

\q

mysql -Ns -B -D test --execute="select id,notes from repro" > repro_ldif.tsv

truncate table repro;

cpimport test repro repro.tsv -m1 -e1 -s '\t' -n1
...
2021-11-22 16:19:25 (4607) INFO : Running distributed import (mode 1) on all PMs...
2021-11-22 16:19:25 (4607) INFO : For table test.repro: 1 rows processed and 1 rows inserted.
2021-11-22 16:19:25 (4607) INFO : Bulk load completed, total run time : 0.192545 seconds
...

select id, lengthb(notes),char_length(notes) from repro;
------------------------------------

id lengthb(notes) char_length(notes)

------------------------------------

3 259 259

------------------------------------
1 row in set (0.037 sec)
\q

mysql -Ns -B -D test --execute="select id,notes from flights_biu where id =3" > repro_cpimp.tsv

SELECT and comparison of dumps produced by cpimp and ldif shows that cpimport loads 2 extra '
' at the beginning of line. While LDIF loads data correctly, without prepending.

I not sure whether options are wrong or is there a problem with cpimport ?



 Comments   
Comment by alexey vorovich (Inactive) [ 2022-03-24 ]

toddstoffel pls review with YK

Comment by Gagan Goel (Inactive) [ 2023-08-21 ]

For QA:

Here is a simplified test case to reproduce the issue. In the below, /tmp/utf8_test.txt contains the following text:

"König-\\n\\n-Straße"

MariaDB [test]> drop table if exists t1;
Query OK, 0 rows affected (0.315 sec)
 
MariaDB [test]> create table t1 (a varchar(15))engine=columnstore default charset=utf8mb3;
Query OK, 0 rows affected (0.272 sec)
 
MariaDB [test]> LOAD DATA INFILE '/tmp/utf8_test.txt' IGNORE INTO TABLE t1 charset utf8mb3 fields enclosed by '"';
Query OK, 1 row affected, 1 warning (1.365 sec)      
Records: 1  Deleted: 0  Skipped: 0  Warnings: 1

Verify that LDI correctly loads and truncates the multi-byte string:

MariaDB [test]> select * from t1;
+------------------+
| a                |
+------------------+
| König-\n\n-Stra  |
+------------------+
1 row in set (0.160 sec)
 
MariaDB [test]> select lengthb(a), char_length(a) from t1;
+------------+----------------+
| lengthb(a) | char_length(a) |
+------------+----------------+
|         16 |             15 |
+------------+----------------+
1 row in set (0.037 sec)

Now import the same data using cpimport:

cpimport -E'"' test t1 /tmp/utf8_test.txt

Verify that the number of bytes imported by cpimport is incorrect:

MariaDB [test]> select * from t1;
+------------------+
| a                |
+------------------+
| König-\n\n-Stra  |
| König-\n\n-Stra  |
+------------------+
2 rows in set (0.094 sec)
 
MariaDB [test]> select lengthb(a), char_length(a) from t1;
+------------+----------------+
| lengthb(a) | char_length(a) |
+------------+----------------+
|         16 |             15 |
|         19 |             17 |
+------------+----------------+
2 rows in set (0.042 sec)

With the fix, rerun cpimport:

cpimport -E'"' test t1 /tmp/utf8_test.txt

Now verify that cpimport correctly truncates the string (with the cpimport log showing truncation message) and loads the correct number of bytes:

MariaDB [test]> select * from t1;
+------------------+
| a                |
+------------------+
| König-\n\n-Stra  |
| König-\n\n-Stra  |
| König-\n\n-Stra  |
+------------------+
3 rows in set (0.095 sec)
 
MariaDB [test]> select lengthb(a), char_length(a) from t1;
+------------+----------------+
| lengthb(a) | char_length(a) |
+------------+----------------+
|         16 |             15 |
|         19 |             17 | <- row imported using cpimport before the fix
|         16 |             15 | <- row imported using cpimport after the fix
+------------+----------------+
3 rows in set (0.038 sec)

Generated at Thu Feb 08 02:54:05 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.