[MCOL-4931] Make cpimport charset aware - Jira

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 5.5.1, 6.1.1
Fix Version/s: 23.10.0
Component/s: cpimport
Labels:
- rm_invalid_data
Environment:
CentOS; Amazon EC2

Sprint:
2021-16, 2021-17, 2022-22, 2022-23, 2023-4, 2023-5, 2023-6, 2023-7, 2023-8, 2023-10

Description

Rewording

cpimport and LOAD DATA INFILE (LDIF) do not produce the same result for the same file: cpimport appears not to truncate strings.

cpimport test flights /tmp/flights.txt -m1 -s '\t'

versus

mariadb test -e "LOAD DATA INFILE '/tmp/flights.txt' IGNORE INTO TABLE flights2  FIELDS TERMINATED BY '\t';"

Expected:
When using cpimport, strings longer than 255 characters are truncated to fit varchar(255), just as LDIF does.

Actual:
cpimport does not truncate strings even when the column is defined as varchar(255), unlike LDIF

Reproduction:
Copy flights.txt to the /tmp/ directory (e.g. with scp), then follow the commands/steps in reproduction.bash.

-----------------------------
It seems that cpimport can multiply some characters (by up to the number of bytes per character in the charset) when loading data into varchar column(s).
For example, in the original case, data was loaded from a .tsv file into a varchar(255) column with

cpimport test flights_repro flights_repro.txt -m1 -e1 -s '\t' -n1

resulted in the following output (charset=utf8mb3), which does not look right:
select id, lengthb(notes),char_length(notes) from flights_repro;
+-------+----------------+--------------------+
| id    | lengthb(notes) | char_length(notes) |
+-------+----------------+--------------------+
|     3 |            765 |                765 |
|  5199 |            765 |                765 |
|  7275 |            765 |                765 |

...

If the same data is loaded via LDIF with

LOAD DATA INFILE '/tmp/flights2.txt' INTO TABLE flights2_cs FIELDS TERMINATED BY '\t';

then the result looks correct:
select id, lengthb(notes),char_length(notes) from flights_repro;
+-------+----------------+--------------------+
| id    | lengthb(notes) | char_length(notes) |
+-------+----------------+--------------------+
|     3 |            255 |                255 |
|  5199 |            255 |                255 |
|  7275 |            255 |                255 |

...
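
The 765 reported for cpimport equals 255 × 3, and 3 is the maximum number of bytes per character in utf8mb3. One possible reading, which is an assumption and not something the ticket confirms, is that the length limit was applied as a byte budget rather than as a character count. The sketch below contrasts the two approaches; the function names and the sample value are hypothetical and are not taken from cpimport.

```python
# Hypothetical illustration, not cpimport code.
# Known fact: utf8mb3 uses at most 3 bytes per character.
MAX_BYTES_PER_CHAR_UTF8MB3 = 3


def truncate_by_chars(value: str, max_chars: int) -> str:
    """Charset-aware truncation: keep at most max_chars characters."""
    return value[:max_chars]


def truncate_by_bytes(value: str, max_chars: int,
                      bytes_per_char: int = MAX_BYTES_PER_CHAR_UTF8MB3) -> str:
    """Byte-budget truncation: keep whatever fits in max_chars * bytes_per_char bytes."""
    budget = max_chars * bytes_per_char
    encoded = value.encode("utf-8")[:budget]
    # Drop a trailing partial character if the cut landed inside one.
    return encoded.decode("utf-8", errors="ignore")


# Hypothetical stand-in for a long, ASCII-only notes value.
ascii_notes = "x" * 1000

by_bytes = truncate_by_bytes(ascii_notes, 255)
by_chars = truncate_by_chars(ascii_notes, 255)

print(len(by_bytes.encode("utf-8")), len(by_bytes))  # 765 765
print(len(by_chars.encode("utf-8")), len(by_chars))  # 255 255
```

With ASCII-only data every character takes one byte, so a byte budget sized for 3-byte characters lets three times as many characters through, which matches the 765/765 versus 255/255 pattern in the two result sets.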

A simplified repro attempt follows.

repro.tsv was produced as follows (same method and options as in the original case):

mysql -Ns -B -D test --execute="select id,notes from flights_biu where id =3" > repro.txt

use test;
CREATE TABLE `repro` (
`id` int(11) NOT NULL,
`notes` varchar(255) DEFAULT NULL
) ENGINE=Columnstore DEFAULT CHARSET=utf8mb3
;
LOAD DATA INFILE '/tmp/repro.tsv' INTO TABLE repro FIELDS TERMINATED BY '\t';
...
Query OK, 1 row affected (1.186 sec)
Records: 1 Deleted: 0 Skipped: 0 Warnings: 0
...

select id, lengthb(notes),char_length(notes) from repro;
+----+----------------+--------------------+
| id | lengthb(notes) | char_length(notes) |
+----+----------------+--------------------+
|  3 |            255 |                255 |
+----+----------------+--------------------+

mysql -Ns -B -D test --execute="select id,notes from repro" > repro_ldif.tsv

truncate table repro;

cpimport test repro repro.tsv -m1 -e1 -s '\t' -n1
...
2021-11-22 16:19:25 (4607) INFO : Running distributed import (mode 1) on all PMs...
2021-11-22 16:19:25 (4607) INFO : For table test.repro: 1 rows processed and 1 rows inserted.
2021-11-22 16:19:25 (4607) INFO : Bulk load completed, total run time : 0.192545 seconds
...

select id, lengthb(notes),char_length(notes) from repro;
+----+----------------+--------------------+
| id | lengthb(notes) | char_length(notes) |
+----+----------------+--------------------+
|  3 |            259 |                259 |
+----+----------------+--------------------+
1 row in set (0.037 sec)
\q

mysql -Ns -B -D test --execute="select id,notes from flights_biu where id =3" > repro_cpimp.tsv

A SELECT and a comparison of the dumps produced by cpimport and LDIF show that cpimport loads 2 extra '\\' at the beginning of the line, while LDIF loads the data correctly, without prepending anything.

I am not sure whether the options are wrong or there is a problem with cpimport.
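
The dump comparison described above can be sketched as a byte-level check that reports where two dumps first diverge. This is a minimal sketch only: the helper name `first_difference` and the sample byte strings are hypothetical, and the attached files are not reproduced here.

```python
from itertools import zip_longest
from typing import Optional


def first_difference(a: bytes, b: bytes) -> Optional[int]:
    """Return the offset of the first differing byte, or None if a == b."""
    for offset, (x, y) in enumerate(zip_longest(a, b)):
        # zip_longest pads the shorter input with None, so a length
        # mismatch is reported at the end of the shorter value.
        if x != y:
            return offset
    return None


# Hypothetical sample rows: an id, a tab, then the notes value.
ldif_row = b"3\tabc"
cpimport_row = b"3\t\\\\abc"  # same row with two extra backslashes prepended

print(first_difference(ldif_row, cpimport_row))  # 2
print(first_difference(ldif_row, ldif_row))      # None
```

For the real files, the two arguments could be read with `pathlib.Path("repro_ldif.tsv").read_bytes()` and `pathlib.Path("repro_cpimp.tsv").read_bytes()`.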

Attachments

flights.txt	20 kB	2021-12-09 19:44
repro_cpimp.tsv	0.3 kB	2021-11-22 22:36
repro_ldif.tsv	0.3 kB	2021-11-22 22:36
repro.tsv	0.3 kB	2021-11-22 22:36
reproduction.bash	7 kB	2021-12-09 19:48

Issue Links

blocks

MCOL-4484 "Alter table modify column oldname newname datatype" fails with Error 1815: "Changing the datatype of a column is not supported"

Closed

is blocked by

MCOL-5005 Add charset number to system catalog

Closed

relates to

MCOL-5563 Investigate different MTR test output between DEB and RPM distros, cpimport. \ in data

Closed

Activity

Yakov Kushnirsky created issue - 2021-11-22 22:32

Yakov Kushnirsky made changes - 2021-11-22 22:36

Field	Original Value	New Value
Attachment		repro.tsv [ 60865 ]
Attachment		repro_cpimp.tsv [ 60866 ]
Attachment		repro_ldif.tsv [ 60867 ]

David Hall (Inactive) made changes - 2021-11-23 19:07

Assignee

David Hall [ david.hall ]

Allen Herrera made changes - 2021-12-09 19:36

Summary

cpimport load may multiply some characters

cpimport does not truncate strings - may multiply some characters

Allen Herrera made changes - 2021-12-09 19:44

Description


Allen Herrera made changes - 2021-12-09 19:44

Attachment

flights.txt [ 61134 ]

Allen Herrera made changes - 2021-12-09 19:48

Attachment

reproduction.bash [ 61135 ]

Allen Herrera made changes - 2021-12-10 18:48

Description


Gregory Dorman (Inactive) made changes - 2021-12-16 19:12

Sprint

2021-16 [ 598 ]

Gregory Dorman (Inactive) made changes - 2021-12-16 19:12

Rank

Ranked higher

Todd Stoffel (Inactive) made changes - 2022-01-05 01:36

Fix Version/s

Icebox [ 22302 ]

Chris Calender (Inactive) made changes - 2022-01-06 19:30

Fix Version/s		6.3.1 [ 25801 ]
Fix Version/s	Icebox [ 22302 ]

Gregory Dorman (Inactive) made changes - 2022-02-01 18:18

Sprint

2021-16 [ 598 ]

2021-16, 2021-17 [ 598, 614 ]

Todd Stoffel (Inactive) made changes - 2022-02-18 00:27

Rank

Ranked higher

David Hall (Inactive) made changes - 2022-03-04 20:29

Fix Version/s		7.1.1 [ 26904 ]
Fix Version/s	6.3.1 [ 25801 ]

alexey vorovich (Inactive) made changes - 2022-03-24 18:19

Assignee

David Hall [ david.hall ]

Todd Stoffel [ toddstoffel ]

alexey vorovich (Inactive) added a comment - 2022-03-24 18:20

toddstoffel pls review with YK

Todd Stoffel (Inactive) made changes - 2022-04-05 06:14

Assignee

Todd Stoffel [ toddstoffel ]

alexey vorovich (Inactive) made changes - 2022-05-12 18:11

Assignee

David Hall [ david.hall ]

David Hall (Inactive) made changes - 2022-05-13 15:48

Assignee

David Hall [ david.hall ]

Ben Thompson [ ben.thompson ]

Todd Stoffel (Inactive) made changes - 2022-06-02 18:08

Fix Version/s		22.08.1 [ 28206 ]
Fix Version/s	22.08 [ 26904 ]

Ben Thompson (Inactive) made changes - 2022-06-14 16:12

Status

Open [ 1 ]

In Progress [ 3 ]

alexey vorovich (Inactive) made changes - 2022-06-21 15:42

Link

This issue blocks ~~MCOL-4484~~ [ ~~MCOL-4484~~ ]

alexey vorovich (Inactive) made changes - 2022-06-21 15:50

Summary

cpimport does not truncate strings - may multiply some characters

cpimport does not truncate strings - need to keep charset in sys catalog

alexey vorovich (Inactive) made changes - 2022-06-21 16:00

Summary

cpimport does not truncate strings - need to keep charset in sys catalog

cpimport does not truncate strings - need to keep charset/collation? in sys catalog

Chris Calender (Inactive) made changes - 2022-07-13 18:45

Fix Version/s		22.08 [ 26904 ]
Fix Version/s	22.08.1 [ 28206 ]

Todd Stoffel (Inactive) made changes - 2022-07-14 03:00

Rank

Ranked lower

Chris Calender (Inactive) made changes - 2022-07-20 18:39

Fix Version/s		22.08.2 [ 28208 ]
Fix Version/s	22.08 [ 26904 ]

Todd Stoffel (Inactive) made changes - 2022-10-03 17:25

Fix Version/s		22.08.3 [ 28456 ]
Fix Version/s	22.08.2 [ 28208 ]

David Hall (Inactive) made changes - 2022-10-04 22:17

Fix Version/s		22.11.01 [ 28458 ]
Fix Version/s	22.08.3 [ 28456 ]

Todd Stoffel (Inactive) made changes - 2022-10-26 04:45

Sprint

2021-16, 2021-17 [ 598, 614 ]

2021-16, 2021-17, 2021-18 [ 598, 614, 672 ]

Allen Herrera made changes - 2022-10-31 17:57

Labels

triage

Allen Herrera made changes - 2022-10-31 18:02

Priority

Major [ 3 ]

Minor [ 4 ]

alexey vorovich (Inactive) made changes - 2022-11-28 21:20

Fix Version/s		23.02 [ 28209 ]
Fix Version/s	23.03.1 [ 28458 ]

Todd Stoffel (Inactive) made changes - 2022-12-27 07:02

Sprint

2021-16, 2021-17, 2022-22 [ 598, 614, 672 ]

2021-16, 2021-17, 2022-22, 2022-23 [ 598, 614, 672, 686 ]

Todd Stoffel (Inactive) made changes - 2023-02-01 08:01

Assignee

Ben Thompson [ ben.thompson ]

alexey vorovich [ JIRAUSER48263 ]

Todd Stoffel (Inactive) made changes - 2023-02-17 20:57

Sprint

2021-16, 2021-17, 2022-22, 2022-23 [ 598, 614, 672, 686 ]

2021-16, 2021-17, 2022-22, 2022-23, 2022-24 [ 598, 614, 672, 686, 698 ]

Todd Stoffel (Inactive) made changes - 2023-03-16 18:22

Sprint

2021-16, 2021-17, 2022-22, 2022-23, 2023-4 [ 598, 614, 672, 686, 698 ]

2021-16, 2021-17, 2022-22, 2022-23, 2023-4, 2023-5 [ 598, 614, 672, 686, 698, 702 ]

alexey vorovich (Inactive) made changes - 2023-03-22 18:41

Status

In Progress [ 3 ]

Stalled [ 10000 ]

Todd Stoffel (Inactive) made changes - 2023-04-06 05:19

Sprint

2021-16, 2021-17, 2022-22, 2022-23, 2023-4, 2023-5 [ 598, 614, 672, 686, 698, 702 ]

2021-16, 2021-17, 2022-22, 2022-23, 2023-4, 2023-5, 2023-6 [ 598, 614, 672, 686, 698, 702, 706 ]

alexey vorovich (Inactive) made changes - 2023-05-15 18:23

Assignee

alexey vorovich [ JIRAUSER48263 ]

Gagan Goel [ tntnatbry ]

alexey vorovich (Inactive) made changes - 2023-06-20 14:23

Status

Stalled [ 10000 ]

In Progress [ 3 ]

Chris Calender (Inactive) made changes - 2023-06-21 18:42

Fix Version/s		23.08 [ 28540 ]
Fix Version/s	23.02 [ 28209 ]

Todd Stoffel (Inactive) made changes - 2023-07-01 20:18

Sprint

2021-16, 2021-17, 2022-22, 2022-23, 2023-4, 2023-5, 2023-6 [ 598, 614, 672, 686, 698, 702, 706 ]

2021-16, 2021-17, 2022-22, 2022-23, 2023-4, 2023-5, 2023-6, 2023-7 [ 598, 614, 672, 686, 698, 702, 706, 726 ]

Todd Stoffel (Inactive) made changes - 2023-07-17 07:06

Sprint

2021-16, 2021-17, 2022-22, 2022-23, 2023-4, 2023-5, 2023-6, 2023-7 [ 598, 614, 672, 686, 698, 702, 706, 726 ]

2021-16, 2021-17, 2022-22, 2022-23, 2023-4, 2023-5, 2023-6, 2023-7, 2023-8 [ 598, 614, 672, 686, 698, 702, 706, 726, 728 ]

Gagan Goel (Inactive) made changes - 2023-07-21 18:36

Link

This issue is blocked by ~~MCOL-5005~~ [ ~~MCOL-5005~~ ]

alexey vorovich (Inactive) made changes - 2023-07-26 15:16

Labels

triage

alexey vorovich (Inactive) made changes - 2023-07-26 15:37

Labels

rm_invalid_data

Allen Herrera made changes - 2023-08-02 18:54

Priority

Minor [ 4 ]

Major [ 3 ]

Gagan Goel (Inactive) made changes - 2023-08-04 17:48

Summary

cpimport does not truncate strings - need to keep charset/collation? in sys catalog

Make cpimport charset aware

Gagan Goel (Inactive) made changes - 2023-08-17 20:14

Fix Version/s		23.08.1 [ 29105 ]
Fix Version/s	23.08 [ 28540 ]

Gagan Goel (Inactive) made changes - 2023-08-21 18:14

Assigned for Review		Roman [ drrtuy ]
Assigned for Testing		Daniel Lee [ dleeyh ]

Gagan Goel (Inactive) made changes - 2023-08-21 18:15

Status

In Progress [ 3 ]

In Review [ 10002 ]

Gagan Goel (Inactive) added a comment - 2023-08-21 18:27

For QA:

Here is a simplified test case to reproduce the issue. In it, /tmp/utf8_test.txt contains the following text:

"König-\\n\\n-Straße"

MariaDB [test]> drop table if exists t1;

Query OK, 0 rows affected (0.315 sec)

MariaDB [test]> create table t1 (a varchar(15))engine=columnstore default charset=utf8mb3;

Query OK, 0 rows affected (0.272 sec)

MariaDB [test]> LOAD DATA INFILE '/tmp/utf8_test.txt' IGNORE INTO TABLE t1 charset utf8mb3 fields enclosed by '"';

Query OK, 1 row affected, 1 warning (1.365 sec)

Records: 1  Deleted: 0  Skipped: 0  Warnings: 1

Verify that LDI correctly loads and truncates the multi-byte string:

MariaDB [test]> select * from t1;
+------------------+
| a                |
+------------------+
| König-\n\n-Stra  |
+------------------+
1 row in set (0.160 sec)

MariaDB [test]> select lengthb(a), char_length(a) from t1;
+------------+----------------+
| lengthb(a) | char_length(a) |
+------------+----------------+
|         16 |             15 |
+------------+----------------+
1 row in set (0.037 sec)

Now import the same data using cpimport:

cpimport -E'"' test t1 /tmp/utf8_test.txt

Verify that the number of bytes imported by cpimport is incorrect:

MariaDB [test]> select * from t1;
+------------------+
| a                |
+------------------+
| König-\n\n-Stra  |
| König-\n\n-Stra  |
+------------------+
2 rows in set (0.094 sec)

MariaDB [test]> select lengthb(a), char_length(a) from t1;
+------------+----------------+
| lengthb(a) | char_length(a) |
+------------+----------------+
|         16 |             15 |
|         19 |             17 |
+------------+----------------+
2 rows in set (0.042 sec)

With the fix, rerun cpimport:

cpimport -E'"' test t1 /tmp/utf8_test.txt

Now verify that cpimport correctly truncates the string (with the cpimport log showing truncation message) and loads the correct number of bytes:

MariaDB [test]> select * from t1;
+------------------+
| a                |
+------------------+
| König-\n\n-Stra  |
| König-\n\n-Stra  |
| König-\n\n-Stra  |
+------------------+
3 rows in set (0.095 sec)

MariaDB [test]> select lengthb(a), char_length(a) from t1;
+------------+----------------+
| lengthb(a) | char_length(a) |
+------------+----------------+
|         16 |             15 |
|         19 |             17 | <- row imported using cpimport before the fix
|         16 |             15 | <- row imported using cpimport after the fix
+------------+----------------+
3 rows in set (0.038 sec)
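
The byte and character counts in this test case can be checked independently of the database. The sketch below assumes that the file holds a literal backslash followed by `n` (two characters each time) rather than real newlines, which is consistent with the 17-character, 19-byte row reported before the fix. The helper `lengths` is hypothetical.

```python
# Assumption: "\n" in the test file is literal text (backslash + n),
# so each occurrence counts as two one-byte characters.
text = "König-\\n\\n-Straße"


def lengths(value: str) -> tuple:
    """Return (byte length in UTF-8, character length), like lengthb() and char_length()."""
    return len(value.encode("utf-8")), len(value)


# Full string: 17 characters, with ö and ß taking 2 bytes each.
print(lengths(text))       # (19, 17)

# Truncated to the varchar(15) limit by character count.
print(lengths(text[:15]))  # (16, 15)
print(text[:15])           # König-\n\n-Stra
```

The truncated value still contains the 2-byte ö but no longer the ß, which is why 15 characters occupy 16 bytes.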

Todd Stoffel (Inactive) made changes - 2023-08-25 22:55

Sprint

2021-16, 2021-17, 2022-22, 2022-23, 2023-4, 2023-5, 2023-6, 2023-7, 2023-8 [ 598, 614, 672, 686, 698, 702, 706, 726, 728 ]

2021-16, 2021-17, 2022-22, 2022-23, 2023-4, 2023-5, 2023-6, 2023-7, 2023-8, 2023-9 [ 598, 614, 672, 686, 698, 702, 706, 726, 728, 733 ]

Todd Stoffel (Inactive) made changes - 2023-08-25 22:55

Sprint

2021-16, 2021-17, 2022-22, 2022-23, 2023-4, 2023-5, 2023-6, 2023-7, 2023-8, 2023-9 [ 598, 614, 672, 686, 698, 702, 706, 726, 728, 733 ]

2021-16, 2021-17, 2022-22, 2022-23, 2023-4, 2023-5, 2023-6, 2023-7, 2023-8, 2023-10 [ 598, 614, 672, 686, 698, 702, 706, 726, 728, 734 ]

Gagan Goel (Inactive) made changes - 2023-08-29 16:29

Link

This issue relates to ~~MCOL-5563~~ [ ~~MCOL-5563~~ ]

alexey vorovich (Inactive) made changes - 2023-09-01 14:17

Status

In Review [ 10002 ]

In Testing [ 10301 ]

Daniel Lee (Inactive) made changes - 2023-09-07 13:10

Resolution		Fixed [ 1 ]
Status	In Testing [ 10301 ]	Closed [ 6 ]

Todd Stoffel (Inactive) made changes - 2023-09-22 15:26

Fix Version/s

23.10.0 [ 29422 ]

Todd Stoffel (Inactive) made changes - 2023-09-22 15:29

Fix Version/s

23.10.1 [ 29105 ]

Jira Automation (IT) made changes - 2024-07-04 12:53

Zendesk Related Tickets

116910

People

Assignee: Gagan Goel (Inactive)

Reporter: Yakov Kushnirsky

Assigned for Review: Roman

Assigned for Testing: Daniel Lee (Inactive)

Votes: 0

Watchers: 6

Dates

Created: 2021-11-22 22:32

Updated: 2024-07-08 01:35

Resolved: 2023-09-07 13:10
