Details

    • 2021-16, 2021-17, 2022-22, 2022-23, 2023-4, 2023-5, 2023-6, 2023-7, 2023-8, 2023-10

    Description

      Rewording

      Cpimport and LDIF of the same file doesn't have the same result. Cpimport appears to not truncate strings

      cpimport test flights /tmp/flights.txt -m1 -s '\t'
      versus
      mariadb test -e "LOAD DATA INFILE '/tmp/flights.txt' IGNORE INTO TABLE flights2  FIELDS TERMINATED BY '\t';"
      

      Expected:
      When using cpimport - Strings longer than 255 are truncated to fit varchar(255) just like LDIF does

      Actual:
      cpimport does not truncate strings even when the column is defined as varchar(255), unlike LDIF

      Reproduction:
      Follow the commands/steps in reproduction.bash after scp of flights.txt to /tmp/ directory

      -----------------------------
      it seems that cpimport could multiply some characters (up to number of charset bytes) when loading data into varchar column(s).
      For example, in the original case, data loaded from .tsv file into varchar(255) as

      cpimport test flights_repro flights_repro.txt -m1 -e1 -s '\t' -n1

      resulted in the following output (charset=utf8mb3), which does not look right:
      select id, lengthb(notes),char_length(notes) from flights_repro;
      ---------------------------------------

      id lengthb(notes) char_length(notes)

      ---------------------------------------

      3 765 765
      5199 765 765
      7275 765 765

      ...

      If the same data were loaded via LDIF as

      LOAD DATA INFILE '/tmp/flights2.txt' INTO TABLE flights2_cs FIELDS TERMINATED BY '\t';

      then result looks correct:
      select id, lengthb(notes),char_length(notes) from flights_repro;
      ---------------------------------------

      id lengthb(notes) char_length(notes)

      ---------------------------------------

      3 255 255
      5199 255 255
      7275 255 255

      ...

      An attempted simplified repro is the following:

      repro.tsv produced as (in same way/options as in the original case)

      mysql -Ns -B -D test --execute="select id,notes from flights_biu where id =3" > repro.txt

      use test;
      CREATE TABLE `repro` (
      `id` int(11) NOT NULL,
      `notes` varchar(255) DEFAULT NULL
      ) ENGINE=Columnstore DEFAULT CHARSET=utf8mb3
      ;
      LOAD DATA INFILE '/tmp/repro.tsv' INTO TABLE repro FIELDS TERMINATED BY '\t';
      ...
      Query OK, 1 row affected (1.186 sec)
      Records: 1 Deleted: 0 Skipped: 0 Warnings: 0
      ...

      select id, lengthb(notes),char_length(notes) from repro;
      ------------------------------------

      id lengthb(notes) char_length(notes)

      ------------------------------------

      3 255 255

      ------------------------------------

      \q

      mysql -Ns -B -D test --execute="select id,notes from repro" > repro_ldif.tsv

      truncate table repro;

      cpimport test repro repro.tsv -m1 -e1 -s '\t' -n1
      ...
      2021-11-22 16:19:25 (4607) INFO : Running distributed import (mode 1) on all PMs...
      2021-11-22 16:19:25 (4607) INFO : For table test.repro: 1 rows processed and 1 rows inserted.
      2021-11-22 16:19:25 (4607) INFO : Bulk load completed, total run time : 0.192545 seconds
      ...

      select id, lengthb(notes),char_length(notes) from repro;
      ------------------------------------

      id lengthb(notes) char_length(notes)

      ------------------------------------

      3 259 259

      ------------------------------------
      1 row in set (0.037 sec)
      \q

      mysql -Ns -B -D test --execute="select id,notes from flights_biu where id =3" > repro_cpimp.tsv

      SELECT and comparison of dumps produced by cpimp and ldif shows that cpimport loads 2 extra '
      ' at the beginning of line. While LDIF loads data correctly, without prepending.

      I not sure whether options are wrong or is there a problem with cpimport ?

      Attachments

        1. flights.txt
          20 kB
        2. repro_cpimp.tsv
          0.3 kB
        3. repro_ldif.tsv
          0.3 kB
        4. repro.tsv
          0.3 kB
        5. reproduction.bash
          7 kB

        Issue Links

          Activity

            YK Yakov Kushnirsky created issue -
            YK Yakov Kushnirsky made changes -
            Field Original Value New Value
            Attachment repro.tsv [ 60865 ]
            Attachment repro_cpimp.tsv [ 60866 ]
            Attachment repro_ldif.tsv [ 60867 ]
            David.Hall David Hall (Inactive) made changes -
            Assignee David Hall [ david.hall ]
            allen.herrera Allen Herrera made changes -
            Summary cpimport load may multiply some characters cpimport does not truncate strings - may multiply some characters
            allen.herrera Allen Herrera made changes -
            Description it seems that cpimport could multiply some characters (up to number of charset bytes) when loading data into varchar column(s).
            For example, in the original case, data loaded from .tsv file into varchar(255) as

            cpimport test flights_repro flights_repro.txt -m1 -e1 -s '\t' -n1

            resulted in the following output (charset=utf8mb3), which does not look right:
            select id, lengthb(notes),char_length(notes) from flights_repro;
            +-------+----------------+--------------------+
            | id | lengthb(notes) | char_length(notes) |
            +-------+----------------+--------------------+
            | 3 | 765 | 765 |
            | 5199 | 765 | 765 |
            | 7275 | 765 | 765 |
            ...

            If the same data were loaded via LDIF as

            LOAD DATA INFILE '/tmp/flights2.txt' INTO TABLE flights2_cs FIELDS TERMINATED BY '\t';

            then result looks correct:
            select id, lengthb(notes),char_length(notes) from flights_repro;
            +-------+----------------+--------------------+
            | id | lengthb(notes) | char_length(notes) |
            +-------+----------------+--------------------+
            | 3 | 255 | 255 |
            | 5199 | 255 | 255 |
            | 7275 | 255 | 255 |
            ...

            An attempted simplified repro is the following:

            repro.tsv produced as (in same way/options as in the original case)

            mysql -Ns -B -D test --execute="select id,notes from flights_biu where id =3" > repro.txt

            use test;
            CREATE TABLE `repro` (
              `id` int(11) NOT NULL,
              `notes` varchar(255) DEFAULT NULL
            ) ENGINE=Columnstore DEFAULT CHARSET=utf8mb3
            ;
            LOAD DATA INFILE '/tmp/repro.tsv' INTO TABLE repro FIELDS TERMINATED BY '\t';
            ...
            Query OK, 1 row affected (1.186 sec)
            Records: 1 Deleted: 0 Skipped: 0 Warnings: 0
            ...

            select id, lengthb(notes),char_length(notes) from repro;
            +----+----------------+--------------------+
            | id | lengthb(notes) | char_length(notes) |
            +----+----------------+--------------------+
            | 3 | 255 | 255 |
            +----+----------------+--------------------+

            \q

            mysql -Ns -B -D test --execute="select id,notes from repro" > repro_ldif.tsv

            truncate table repro;

            cpimport test repro repro.tsv -m1 -e1 -s '\t' -n1
            ...
            2021-11-22 16:19:25 (4607) INFO : Running distributed import (mode 1) on all PMs...
            2021-11-22 16:19:25 (4607) INFO : For table test.repro: 1 rows processed and 1 rows inserted.
            2021-11-22 16:19:25 (4607) INFO : Bulk load completed, total run time : 0.192545 seconds
            ...

            select id, lengthb(notes),char_length(notes) from repro;
            +----+----------------+--------------------+
            | id | lengthb(notes) | char_length(notes) |
            +----+----------------+--------------------+
            | 3 | 259 | 259 |
            +----+----------------+--------------------+
            1 row in set (0.037 sec)
            \q

            mysql -Ns -B -D test --execute="select id,notes from flights_biu where id =3" > repro_cpimp.tsv

            SELECT and comparison of dumps produced by cpimp and ldif shows that cpimport loads 2 extra '\\' at the beginning of line. While LDIF loads data correctly, without prepending.

            I not sure whether options are wrong or is there a problem with cpimport ?
            Rewording

            Cpimport and LDIF of the same file doesn't have the same result. Cpimport appears to not truncate strings
            {code:java}
            cpimport test flights /tmp/flights.txt -m1 -s '\t'
            versus
            mariadb test -e "LOAD DATA INFILE '/tmp/flights.txt' IGNORE INTO TABLE flights2 FIELDS TERMINATED BY '\t';"
            {code}

            Reproduction:
            Follow the commands/steps in reproduction.bash after scp of flights.txt to /tmp/ directory



            -----------------------------
            it seems that cpimport could multiply some characters (up to number of charset bytes) when loading data into varchar column(s).
            For example, in the original case, data loaded from .tsv file into varchar(255) as

            cpimport test flights_repro flights_repro.txt -m1 -e1 -s '\t' -n1

            resulted in the following output (charset=utf8mb3), which does not look right:
            select id, lengthb(notes),char_length(notes) from flights_repro;
            +-------+----------------+--------------------+
            | id | lengthb(notes) | char_length(notes) |
            +-------+----------------+--------------------+
            | 3 | 765 | 765 |
            | 5199 | 765 | 765 |
            | 7275 | 765 | 765 |
            ...

            If the same data were loaded via LDIF as

            LOAD DATA INFILE '/tmp/flights2.txt' INTO TABLE flights2_cs FIELDS TERMINATED BY '\t';

            then result looks correct:
            select id, lengthb(notes),char_length(notes) from flights_repro;
            +-------+----------------+--------------------+
            | id | lengthb(notes) | char_length(notes) |
            +-------+----------------+--------------------+
            | 3 | 255 | 255 |
            | 5199 | 255 | 255 |
            | 7275 | 255 | 255 |
            ...

            An attempted simplified repro is the following:

            repro.tsv produced as (in same way/options as in the original case)

            mysql -Ns -B -D test --execute="select id,notes from flights_biu where id =3" > repro.txt

            use test;
            CREATE TABLE `repro` (
              `id` int(11) NOT NULL,
              `notes` varchar(255) DEFAULT NULL
            ) ENGINE=Columnstore DEFAULT CHARSET=utf8mb3
            ;
            LOAD DATA INFILE '/tmp/repro.tsv' INTO TABLE repro FIELDS TERMINATED BY '\t';
            ...
            Query OK, 1 row affected (1.186 sec)
            Records: 1 Deleted: 0 Skipped: 0 Warnings: 0
            ...

            select id, lengthb(notes),char_length(notes) from repro;
            +----+----------------+--------------------+
            | id | lengthb(notes) | char_length(notes) |
            +----+----------------+--------------------+
            | 3 | 255 | 255 |
            +----+----------------+--------------------+

            \q

            mysql -Ns -B -D test --execute="select id,notes from repro" > repro_ldif.tsv

            truncate table repro;

            cpimport test repro repro.tsv -m1 -e1 -s '\t' -n1
            ...
            2021-11-22 16:19:25 (4607) INFO : Running distributed import (mode 1) on all PMs...
            2021-11-22 16:19:25 (4607) INFO : For table test.repro: 1 rows processed and 1 rows inserted.
            2021-11-22 16:19:25 (4607) INFO : Bulk load completed, total run time : 0.192545 seconds
            ...

            select id, lengthb(notes),char_length(notes) from repro;
            +----+----------------+--------------------+
            | id | lengthb(notes) | char_length(notes) |
            +----+----------------+--------------------+
            | 3 | 259 | 259 |
            +----+----------------+--------------------+
            1 row in set (0.037 sec)
            \q

            mysql -Ns -B -D test --execute="select id,notes from flights_biu where id =3" > repro_cpimp.tsv

            SELECT and comparison of dumps produced by cpimp and ldif shows that cpimport loads 2 extra '\\' at the beginning of line. While LDIF loads data correctly, without prepending.

            I not sure whether options are wrong or is there a problem with cpimport ?
            allen.herrera Allen Herrera made changes -
            Attachment flights.txt [ 61134 ]
            allen.herrera Allen Herrera made changes -
            Attachment reproduction.bash [ 61135 ]
            allen.herrera Allen Herrera made changes -
            Description Rewording

            Cpimport and LDIF of the same file doesn't have the same result. Cpimport appears to not truncate strings
            {code:java}
            cpimport test flights /tmp/flights.txt -m1 -s '\t'
            versus
            mariadb test -e "LOAD DATA INFILE '/tmp/flights.txt' IGNORE INTO TABLE flights2 FIELDS TERMINATED BY '\t';"
            {code}

            Reproduction:
            Follow the commands/steps in reproduction.bash after scp of flights.txt to /tmp/ directory



            -----------------------------
            it seems that cpimport could multiply some characters (up to number of charset bytes) when loading data into varchar column(s).
            For example, in the original case, data loaded from .tsv file into varchar(255) as

            cpimport test flights_repro flights_repro.txt -m1 -e1 -s '\t' -n1

            resulted in the following output (charset=utf8mb3), which does not look right:
            select id, lengthb(notes),char_length(notes) from flights_repro;
            +-------+----------------+--------------------+
            | id | lengthb(notes) | char_length(notes) |
            +-------+----------------+--------------------+
            | 3 | 765 | 765 |
            | 5199 | 765 | 765 |
            | 7275 | 765 | 765 |
            ...

            If the same data were loaded via LDIF as

            LOAD DATA INFILE '/tmp/flights2.txt' INTO TABLE flights2_cs FIELDS TERMINATED BY '\t';

            then result looks correct:
            select id, lengthb(notes),char_length(notes) from flights_repro;
            +-------+----------------+--------------------+
            | id | lengthb(notes) | char_length(notes) |
            +-------+----------------+--------------------+
            | 3 | 255 | 255 |
            | 5199 | 255 | 255 |
            | 7275 | 255 | 255 |
            ...

            An attempted simplified repro is the following:

            repro.tsv produced as (in same way/options as in the original case)

            mysql -Ns -B -D test --execute="select id,notes from flights_biu where id =3" > repro.txt

            use test;
            CREATE TABLE `repro` (
              `id` int(11) NOT NULL,
              `notes` varchar(255) DEFAULT NULL
            ) ENGINE=Columnstore DEFAULT CHARSET=utf8mb3
            ;
            LOAD DATA INFILE '/tmp/repro.tsv' INTO TABLE repro FIELDS TERMINATED BY '\t';
            ...
            Query OK, 1 row affected (1.186 sec)
            Records: 1 Deleted: 0 Skipped: 0 Warnings: 0
            ...

            select id, lengthb(notes),char_length(notes) from repro;
            +----+----------------+--------------------+
            | id | lengthb(notes) | char_length(notes) |
            +----+----------------+--------------------+
            | 3 | 255 | 255 |
            +----+----------------+--------------------+

            \q

            mysql -Ns -B -D test --execute="select id,notes from repro" > repro_ldif.tsv

            truncate table repro;

            cpimport test repro repro.tsv -m1 -e1 -s '\t' -n1
            ...
            2021-11-22 16:19:25 (4607) INFO : Running distributed import (mode 1) on all PMs...
            2021-11-22 16:19:25 (4607) INFO : For table test.repro: 1 rows processed and 1 rows inserted.
            2021-11-22 16:19:25 (4607) INFO : Bulk load completed, total run time : 0.192545 seconds
            ...

            select id, lengthb(notes),char_length(notes) from repro;
            +----+----------------+--------------------+
            | id | lengthb(notes) | char_length(notes) |
            +----+----------------+--------------------+
            | 3 | 259 | 259 |
            +----+----------------+--------------------+
            1 row in set (0.037 sec)
            \q

            mysql -Ns -B -D test --execute="select id,notes from flights_biu where id =3" > repro_cpimp.tsv

            SELECT and comparison of dumps produced by cpimp and ldif shows that cpimport loads 2 extra '\\' at the beginning of line. While LDIF loads data correctly, without prepending.

            I not sure whether options are wrong or is there a problem with cpimport ?
            Rewording

            Cpimport and LDIF of the same file doesn't have the same result. Cpimport appears to not truncate strings
            {code:java}
            cpimport test flights /tmp/flights.txt -m1 -s '\t'
            versus
            mariadb test -e "LOAD DATA INFILE '/tmp/flights.txt' IGNORE INTO TABLE flights2 FIELDS TERMINATED BY '\t';"
            {code}

            Expected:
            When using cpimport - Strings longer than 255 are truncated to fit varchar(255) just like LDIF does

            Actual:
            cpimport does not truncate strings even when the column is defined as varchar(255), unlike LDIF

            Reproduction:
            Follow the commands/steps in reproduction.bash after scp of flights.txt to /tmp/ directory



            -----------------------------
            it seems that cpimport could multiply some characters (up to number of charset bytes) when loading data into varchar column(s).
            For example, in the original case, data loaded from .tsv file into varchar(255) as

            cpimport test flights_repro flights_repro.txt -m1 -e1 -s '\t' -n1

            resulted in the following output (charset=utf8mb3), which does not look right:
            select id, lengthb(notes),char_length(notes) from flights_repro;
            +-------+----------------+--------------------+
            | id | lengthb(notes) | char_length(notes) |
            +-------+----------------+--------------------+
            | 3 | 765 | 765 |
            | 5199 | 765 | 765 |
            | 7275 | 765 | 765 |
            ...

            If the same data were loaded via LDIF as

            LOAD DATA INFILE '/tmp/flights2.txt' INTO TABLE flights2_cs FIELDS TERMINATED BY '\t';

            then result looks correct:
            select id, lengthb(notes),char_length(notes) from flights_repro;
            +-------+----------------+--------------------+
            | id | lengthb(notes) | char_length(notes) |
            +-------+----------------+--------------------+
            | 3 | 255 | 255 |
            | 5199 | 255 | 255 |
            | 7275 | 255 | 255 |
            ...

            An attempted simplified repro is the following:

            repro.tsv produced as (in same way/options as in the original case)

            mysql -Ns -B -D test --execute="select id,notes from flights_biu where id =3" > repro.txt

            use test;
            CREATE TABLE `repro` (
              `id` int(11) NOT NULL,
              `notes` varchar(255) DEFAULT NULL
            ) ENGINE=Columnstore DEFAULT CHARSET=utf8mb3
            ;
            LOAD DATA INFILE '/tmp/repro.tsv' INTO TABLE repro FIELDS TERMINATED BY '\t';
            ...
            Query OK, 1 row affected (1.186 sec)
            Records: 1 Deleted: 0 Skipped: 0 Warnings: 0
            ...

            select id, lengthb(notes),char_length(notes) from repro;
            +----+----------------+--------------------+
            | id | lengthb(notes) | char_length(notes) |
            +----+----------------+--------------------+
            | 3 | 255 | 255 |
            +----+----------------+--------------------+

            \q

            mysql -Ns -B -D test --execute="select id,notes from repro" > repro_ldif.tsv

            truncate table repro;

            cpimport test repro repro.tsv -m1 -e1 -s '\t' -n1
            ...
            2021-11-22 16:19:25 (4607) INFO : Running distributed import (mode 1) on all PMs...
            2021-11-22 16:19:25 (4607) INFO : For table test.repro: 1 rows processed and 1 rows inserted.
            2021-11-22 16:19:25 (4607) INFO : Bulk load completed, total run time : 0.192545 seconds
            ...

            select id, lengthb(notes),char_length(notes) from repro;
            +----+----------------+--------------------+
            | id | lengthb(notes) | char_length(notes) |
            +----+----------------+--------------------+
            | 3 | 259 | 259 |
            +----+----------------+--------------------+
            1 row in set (0.037 sec)
            \q

            mysql -Ns -B -D test --execute="select id,notes from flights_biu where id =3" > repro_cpimp.tsv

            SELECT and comparison of dumps produced by cpimp and ldif shows that cpimport loads 2 extra '\\' at the beginning of line. While LDIF loads data correctly, without prepending.

            I not sure whether options are wrong or is there a problem with cpimport ?
            gdorman Gregory Dorman (Inactive) made changes -
            Sprint 2021-16 [ 598 ]
            gdorman Gregory Dorman (Inactive) made changes -
            Rank Ranked higher
            toddstoffel Todd Stoffel (Inactive) made changes -
            Fix Version/s Icebox [ 22302 ]
            ccalender Chris Calender (Inactive) made changes -
            Fix Version/s 6.3.1 [ 25801 ]
            Fix Version/s Icebox [ 22302 ]
            gdorman Gregory Dorman (Inactive) made changes -
            Sprint 2021-16 [ 598 ] 2021-16, 2021-17 [ 598, 614 ]
            toddstoffel Todd Stoffel (Inactive) made changes -
            Rank Ranked higher
            David.Hall David Hall (Inactive) made changes -
            Fix Version/s 7.1.1 [ 26904 ]
            Fix Version/s 6.3.1 [ 25801 ]
            alexey.vorovich alexey vorovich (Inactive) made changes -
            Assignee David Hall [ david.hall ] Todd Stoffel [ toddstoffel ]

            toddstoffel pls review with YK

            alexey.vorovich alexey vorovich (Inactive) added a comment - toddstoffel pls review with YK
            toddstoffel Todd Stoffel (Inactive) made changes -
            Assignee Todd Stoffel [ toddstoffel ]
            alexey.vorovich alexey vorovich (Inactive) made changes -
            Assignee David Hall [ david.hall ]
            David.Hall David Hall (Inactive) made changes -
            Assignee David Hall [ david.hall ] Ben Thompson [ ben.thompson ]
            toddstoffel Todd Stoffel (Inactive) made changes -
            Fix Version/s 22.08.1 [ 28206 ]
            Fix Version/s 22.08 [ 26904 ]
            ben.thompson Ben Thompson (Inactive) made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            alexey.vorovich alexey vorovich (Inactive) made changes -
            alexey.vorovich alexey vorovich (Inactive) made changes -
            Summary cpimport does not truncate strings - may multiply some characters cpimport does not truncate strings - need to keep charset in sys catalog
            alexey.vorovich alexey vorovich (Inactive) made changes -
            Summary cpimport does not truncate strings - need to keep charset in sys catalog cpimport does not truncate strings - need to keep charset/collation? in sys catalog
            ccalender Chris Calender (Inactive) made changes -
            Fix Version/s 22.08 [ 26904 ]
            Fix Version/s 22.08.1 [ 28206 ]
            toddstoffel Todd Stoffel (Inactive) made changes -
            Rank Ranked lower
            ccalender Chris Calender (Inactive) made changes -
            Fix Version/s 22.08.2 [ 28208 ]
            Fix Version/s 22.08 [ 26904 ]
            toddstoffel Todd Stoffel (Inactive) made changes -
            Fix Version/s 22.08.3 [ 28456 ]
            Fix Version/s 22.08.2 [ 28208 ]
            David.Hall David Hall (Inactive) made changes -
            Fix Version/s 22.11.01 [ 28458 ]
            Fix Version/s 22.08.3 [ 28456 ]
            toddstoffel Todd Stoffel (Inactive) made changes -
            Sprint 2021-16, 2021-17 [ 598, 614 ] 2021-16, 2021-17, 2021-18 [ 598, 614, 672 ]
            allen.herrera Allen Herrera made changes -
            Labels triage
            allen.herrera Allen Herrera made changes -
            Priority Major [ 3 ] Minor [ 4 ]
            alexey.vorovich alexey vorovich (Inactive) made changes -
            Fix Version/s 23.02 [ 28209 ]
            Fix Version/s 23.03.1 [ 28458 ]
            toddstoffel Todd Stoffel (Inactive) made changes -
            Sprint 2021-16, 2021-17, 2022-22 [ 598, 614, 672 ] 2021-16, 2021-17, 2022-22, 2022-23 [ 598, 614, 672, 686 ]
            toddstoffel Todd Stoffel (Inactive) made changes -
            Assignee Ben Thompson [ ben.thompson ] alexey vorovich [ JIRAUSER48263 ]
            toddstoffel Todd Stoffel (Inactive) made changes -
            Sprint 2021-16, 2021-17, 2022-22, 2022-23 [ 598, 614, 672, 686 ] 2021-16, 2021-17, 2022-22, 2022-23, 2022-24 [ 598, 614, 672, 686, 698 ]
            toddstoffel Todd Stoffel (Inactive) made changes -
            Sprint 2021-16, 2021-17, 2022-22, 2022-23, 2023-4 [ 598, 614, 672, 686, 698 ] 2021-16, 2021-17, 2022-22, 2022-23, 2023-4, 2023-5 [ 598, 614, 672, 686, 698, 702 ]
            alexey.vorovich alexey vorovich (Inactive) made changes -
            Status In Progress [ 3 ] Stalled [ 10000 ]
            toddstoffel Todd Stoffel (Inactive) made changes -
            Sprint 2021-16, 2021-17, 2022-22, 2022-23, 2023-4, 2023-5 [ 598, 614, 672, 686, 698, 702 ] 2021-16, 2021-17, 2022-22, 2022-23, 2023-4, 2023-5, 2023-6 [ 598, 614, 672, 686, 698, 702, 706 ]
            alexey.vorovich alexey vorovich (Inactive) made changes -
            Assignee alexey vorovich [ JIRAUSER48263 ] Gagan Goel [ tntnatbry ]
            alexey.vorovich alexey vorovich (Inactive) made changes -
            Status Stalled [ 10000 ] In Progress [ 3 ]
            ccalender Chris Calender (Inactive) made changes -
            Fix Version/s 23.08 [ 28540 ]
            Fix Version/s 23.02 [ 28209 ]
            toddstoffel Todd Stoffel (Inactive) made changes -
            Sprint 2021-16, 2021-17, 2022-22, 2022-23, 2023-4, 2023-5, 2023-6 [ 598, 614, 672, 686, 698, 702, 706 ] 2021-16, 2021-17, 2022-22, 2022-23, 2023-4, 2023-5, 2023-6, 2023-7 [ 598, 614, 672, 686, 698, 702, 706, 726 ]
            toddstoffel Todd Stoffel (Inactive) made changes -
            Sprint 2021-16, 2021-17, 2022-22, 2022-23, 2023-4, 2023-5, 2023-6, 2023-7 [ 598, 614, 672, 686, 698, 702, 706, 726 ] 2021-16, 2021-17, 2022-22, 2022-23, 2023-4, 2023-5, 2023-6, 2023-7, 2023-8 [ 598, 614, 672, 686, 698, 702, 706, 726, 728 ]
            tntnatbry Gagan Goel (Inactive) made changes -
            alexey.vorovich alexey vorovich (Inactive) made changes -
            Labels triage
            alexey.vorovich alexey vorovich (Inactive) made changes -
            Labels rm_invalid_data
            allen.herrera Allen Herrera made changes -
            Priority Minor [ 4 ] Major [ 3 ]
            tntnatbry Gagan Goel (Inactive) made changes -
            Summary cpimport does not truncate strings - need to keep charset/collation? in sys catalog Make cpimport charset aware
            tntnatbry Gagan Goel (Inactive) made changes -
            Fix Version/s 23.08.1 [ 29105 ]
            Fix Version/s 23.08 [ 28540 ]
            tntnatbry Gagan Goel (Inactive) made changes -
            Assigned for Review Roman [ drrtuy ]
            Assigned for Testing Daniel Lee [ dleeyh ]
            tntnatbry Gagan Goel (Inactive) made changes -
            Status In Progress [ 3 ] In Review [ 10002 ]

            For QA:

            Here is a simplified test case to reproduce the issue. In the below, /tmp/utf8_test.txt contains the following text:

            "König-\\n\\n-Straße"
            

            MariaDB [test]> drop table if exists t1;
            Query OK, 0 rows affected (0.315 sec)
             
            MariaDB [test]> create table t1 (a varchar(15))engine=columnstore default charset=utf8mb3;
            Query OK, 0 rows affected (0.272 sec)
             
            MariaDB [test]> LOAD DATA INFILE '/tmp/utf8_test.txt' IGNORE INTO TABLE t1 charset utf8mb3 fields enclosed by '"';
            Query OK, 1 row affected, 1 warning (1.365 sec)      
            Records: 1  Deleted: 0  Skipped: 0  Warnings: 1
            

            Verify that LDI correctly loads and truncates the multi-byte string:

            MariaDB [test]> select * from t1;
            +------------------+
            | a                |
            +------------------+
            | König-\n\n-Stra  |
            +------------------+
            1 row in set (0.160 sec)
             
            MariaDB [test]> select lengthb(a), char_length(a) from t1;
            +------------+----------------+
            | lengthb(a) | char_length(a) |
            +------------+----------------+
            |         16 |             15 |
            +------------+----------------+
            1 row in set (0.037 sec)
            

            Now import the same data using cpimport:

            cpimport -E'"' test t1 /tmp/utf8_test.txt
            

            Verify that the number of bytes imported by cpimport is incorrect:

            MariaDB [test]> select * from t1;
            +------------------+
            | a                |
            +------------------+
            | König-\n\n-Stra  |
            | König-\n\n-Stra  |
            +------------------+
            2 rows in set (0.094 sec)
             
            MariaDB [test]> select lengthb(a), char_length(a) from t1;
            +------------+----------------+
            | lengthb(a) | char_length(a) |
            +------------+----------------+
            |         16 |             15 |
            |         19 |             17 |
            +------------+----------------+
            2 rows in set (0.042 sec)
            

            With the fix, rerun cpimport:

            cpimport -E'"' test t1 /tmp/utf8_test.txt
            

            Now verify that cpimport correctly truncates the string (with the cpimport log showing truncation message) and loads the correct number of bytes:

            MariaDB [test]> select * from t1;
            +------------------+
            | a                |
            +------------------+
            | König-\n\n-Stra  |
            | König-\n\n-Stra  |
            | König-\n\n-Stra  |
            +------------------+
            3 rows in set (0.095 sec)
             
            MariaDB [test]> select lengthb(a), char_length(a) from t1;
            +------------+----------------+
            | lengthb(a) | char_length(a) |
            +------------+----------------+
            |         16 |             15 |
            |         19 |             17 | <- row imported using cpimport before the fix
            |         16 |             15 | <- row imported using cpimport after the fix
            +------------+----------------+
            3 rows in set (0.038 sec)
            

            tntnatbry Gagan Goel (Inactive) added a comment - For QA: Here is a simplified test case to reproduce the issue. In the below, /tmp/utf8_test.txt contains the following text: "König-\\n\\n-Straße" MariaDB [test]> drop table if exists t1; Query OK, 0 rows affected (0.315 sec)   MariaDB [test]> create table t1 (a varchar (15))engine=columnstore default charset=utf8mb3; Query OK, 0 rows affected (0.272 sec)   MariaDB [test]> LOAD DATA INFILE '/tmp/utf8_test.txt' IGNORE INTO TABLE t1 charset utf8mb3 fields enclosed by '"' ; Query OK, 1 row affected, 1 warning (1.365 sec) Records: 1 Deleted: 0 Skipped: 0 Warnings: 1 Verify that LDI correctly loads and truncates the multi-byte string: MariaDB [test]> select * from t1; + ------------------+ | a | + ------------------+ | König-\n\n-Stra | + ------------------+ 1 row in set (0.160 sec)   MariaDB [test]> select lengthb(a), char_length(a) from t1; + ------------+----------------+ | lengthb(a) | char_length(a) | + ------------+----------------+ | 16 | 15 | + ------------+----------------+ 1 row in set (0.037 sec) Now import the same data using cpimport: cpimport -E '"' test t1 /tmp/utf8_test .txt Verify that the number of bytes imported by cpimport is incorrect: MariaDB [test]> select * from t1; + ------------------+ | a | + ------------------+ | König-\n\n-Stra | | König-\n\n-Stra | + ------------------+ 2 rows in set (0.094 sec)   MariaDB [test]> select lengthb(a), char_length(a) from t1; + ------------+----------------+ | lengthb(a) | char_length(a) | + ------------+----------------+ | 16 | 15 | | 19 | 17 | + ------------+----------------+ 2 rows in set (0.042 sec) With the fix, rerun cpimport: cpimport -E '"' test t1 /tmp/utf8_test .txt Now verify that cpimport correctly truncates the string (with the cpimport log showing truncation message) and loads the correct number of bytes: MariaDB [test]> select * from t1; + ------------------+ | a | + ------------------+ | König-\n\n-Stra | | König-\n\n-Stra | | König-\n\n-Stra | + ------------------+ 3 rows in set (0.095 sec)   MariaDB [test]> select lengthb(a), char_length(a) from t1; + ------------+----------------+ | lengthb(a) | char_length(a) | + ------------+----------------+ | 16 | 15 | | 19 | 17 | <- row imported using cpimport before the fix | 16 | 15 | <- row imported using cpimport after the fix + ------------+----------------+ 3 rows in set (0.038 sec)
            toddstoffel Todd Stoffel (Inactive) made changes -
            Sprint 2021-16, 2021-17, 2022-22, 2022-23, 2023-4, 2023-5, 2023-6, 2023-7, 2023-8 [ 598, 614, 672, 686, 698, 702, 706, 726, 728 ] 2021-16, 2021-17, 2022-22, 2022-23, 2023-4, 2023-5, 2023-6, 2023-7, 2023-8, 2023-9 [ 598, 614, 672, 686, 698, 702, 706, 726, 728, 733 ]
            toddstoffel Todd Stoffel (Inactive) made changes -
            Sprint 2021-16, 2021-17, 2022-22, 2022-23, 2023-4, 2023-5, 2023-6, 2023-7, 2023-8, 2023-9 [ 598, 614, 672, 686, 698, 702, 706, 726, 728, 733 ] 2021-16, 2021-17, 2022-22, 2022-23, 2023-4, 2023-5, 2023-6, 2023-7, 2023-8, 2023-10 [ 598, 614, 672, 686, 698, 702, 706, 726, 728, 734 ]
            tntnatbry Gagan Goel (Inactive) made changes -
            alexey.vorovich alexey vorovich (Inactive) made changes -
            Status In Review [ 10002 ] In Testing [ 10301 ]
            dleeyh Daniel Lee (Inactive) made changes -
            Resolution Fixed [ 1 ]
            Status In Testing [ 10301 ] Closed [ 6 ]
            toddstoffel Todd Stoffel (Inactive) made changes -
            Fix Version/s 23.10.0 [ 29422 ]
            toddstoffel Todd Stoffel (Inactive) made changes -
            Fix Version/s 23.10.1 [ 29105 ]
            mariadb-jira-automation Jira Automation (IT) made changes -
            Zendesk Related Tickets 116910

            People

              tntnatbry Gagan Goel (Inactive)
              YK Yakov Kushnirsky
              Roman Roman
              Daniel Lee Daniel Lee (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.