[MCOL-4931] Make cpimport charset aware Created: 2021-11-22 Updated: 2023-09-22 Resolved: 2023-09-07 |
|
| Status: | Closed |
| Project: | MariaDB ColumnStore |
| Component/s: | cpimport |
| Affects Version/s: | 5.5.1, 6.1.1 |
| Fix Version/s: | 23.10.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Yakov Kushnirsky | Assignee: | Gagan Goel (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | rm_invalid_data | ||
| Environment: |
CentOS; Amazon EC2 |
||
| Attachments: |
|
||||||||||||||||||||
| Issue Links: |
|
||||||||||||||||||||
| Sprint: | 2021-16, 2021-17, 2022-22, 2022-23, 2023-4, 2023-5, 2023-6, 2023-7, 2023-8, 2023-10 |
| Assigned for Review: | |
||||||||||||||||||||
| Assigned for Testing: | |
||||||||||||||||||||
| Description |
|
Rewording: cpimport and LOAD DATA INFILE (referred to below as LDIF) of the same file do not produce the same result. cpimport appears not to truncate strings.
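To make the byte/character distinction behind this report concrete, here is a minimal Python sketch. It is not cpimport's code; the helper names `truncate_to_chars` and `truncate_to_bytes_naive` and the sample string are hypothetical and chosen only for illustration. It shows why a loader has to know the column charset to truncate a multi-byte string at a character boundary, and why byte length (what `lengthb` reports) and character length (what `char_length` reports) differ for such data.

```python
def truncate_to_chars(raw: bytes, max_chars: int, encoding: str = "utf-8") -> bytes:
    """Charset-aware truncation: keep at most max_chars characters."""
    text = raw.decode(encoding)
    return text[:max_chars].encode(encoding)


def truncate_to_bytes_naive(raw: bytes, max_bytes: int) -> bytes:
    """Charset-unaware truncation: cut at a byte offset, which may split a character."""
    return raw[:max_bytes]


# Hypothetical sample: 11 characters, two of which take 2 bytes each in UTF-8.
sample = "héllo wörld".encode("utf-8")

byte_length = len(sample)                  # analogous to lengthb(notes)
char_length = len(sample.decode("utf-8"))  # analogous to char_length(notes)

# Truncating to 5 characters keeps "héllo", which occupies 6 bytes.
by_chars = truncate_to_chars(sample, 5)

# Cutting at 2 bytes leaves half of "é", so the result is not valid UTF-8.
by_bytes = truncate_to_bytes_naive(sample, 2)

if __name__ == "__main__":
    print(byte_length, char_length)
    print(by_chars.decode("utf-8"), len(by_chars))
    print(by_bytes)
```

A loader that counts only bytes, or that does not truncate at all, can therefore store a different number of bytes than one that truncates by characters, which is the kind of mismatch this issue describes.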
Expected:
Actual:
Reproduction:
-----------------------------
cpimport test flights_repro flights_repro.txt -m1 -e1 -s '\t' -n1
resulted in the following output (charset=utf8mb3), which does not look right:
------
...
If the same data is loaded via LDIF as
LOAD DATA INFILE '/tmp/flights2.txt' INTO TABLE flights2_cs FIELDS TERMINATED BY '\t';
then the result looks correct:
------
...
An attempted simplified repro follows. repro.tsv was produced (in the same way and with the same options as in the original case) with:
mysql -Ns -B -D test --execute="select id,notes from flights_biu where id =3" > repro.txt
use test;
select id, lengthb(notes),char_length(notes) from repro;
---
---
\q
mysql -Ns -B -D test --execute="select id,notes from repro" > repro_ldif.tsv
truncate table repro;
cpimport test repro repro.tsv -m1 -e1 -s '\t' -n1
select id, lengthb(notes),char_length(notes) from repro;
---
---
mysql -Ns -B -D test --execute="select id,notes from flights_biu where id =3" > repro_cpimp.tsv
A SELECT and a comparison of the dumps produced by cpimport and LDIF show that cpimport loads 2 extra '
I am not sure whether the options are wrong or whether there is a problem with cpimport. |
| Comments |
| Comment by alexey vorovich (Inactive) [ 2022-03-24 ] |
|
toddstoffel, please review with YK |
| Comment by Gagan Goel (Inactive) [ 2023-08-21 ] |
|
For QA: Here is a simplified test case to reproduce the issue. In the steps below, /tmp/utf8_test.txt contains the following text:
Verify that LDI correctly loads and truncates the multi-byte string:
Now import the same data using cpimport:
Verify that the number of bytes imported by cpimport is incorrect:
With the fix, rerun cpimport:
Now verify that cpimport correctly truncates the string (with the cpimport log showing truncation message) and loads the correct number of bytes:
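As an independent cross-check of the verification steps above, QA could also inspect a tab-separated dump of the table outside the database. The following Python sketch is only an illustration under stated assumptions: the function name `oversized_fields`, the sample rows, and the 5-character column width are hypothetical and do not come from this ticket. It reports every row whose field holds more characters than the declared column width, together with its byte length.

```python
def oversized_fields(tsv_text: str, column: int, max_chars: int):
    """Return (row_number, char_length, byte_length) for fields longer than max_chars.

    Assumes one record per line with tab-separated fields, as produced by
    `mysql -Ns -B`, and that byte lengths are measured in UTF-8.
    """
    hits = []
    for row_number, line in enumerate(tsv_text.split("\n"), start=1):
        if not line:
            continue  # skip the empty piece after the final newline
        fields = line.split("\t")
        if column >= len(fields):
            continue  # row has no such column
        value = fields[column]
        if len(value) > max_chars:
            hits.append((row_number, len(value), len(value.encode("utf-8"))))
    return hits


# Hypothetical dump: column 1 is assumed to be declared 5 characters wide.
dump = "1\tñandú\n2\tñandúes\n"
report = oversized_fields(dump, column=1, max_chars=5)

if __name__ == "__main__":
    for row_number, chars, nbytes in report:
        print(f"row {row_number}: {chars} chars, {nbytes} bytes")
```

To apply this to a real dump, read the file with `open(path, encoding="utf-8").read()` and pass the text in. After a correct, charset-aware import the returned list should be empty for a dump taken from the table.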
|