Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-7421

corrupted text with CSV connect table

Details

    Description

      I want to make a CONNECT table directed to a local CSV file saved in character set utf8.
      The table data is correctly displayed but the header is corrupting.
      Can I configure to fix this or is it a bug?

      /var/lib/mysql/table_utf8.csv

      [root@kc0022 ~]# cat /var/lib/mysql/table_utf8.csv
      "COUPON_NUM","COUPON_NAME","START_DATE","END_DATE"
      "A8772","CAMPAIGN1","2005-03-11","2005-04-01"
      "A6703","ANNIVERSARY","2005-08-15","2005-09-05"
      "A4313","NEW_ITEM","2005-09-13","2005-10-04"

      CONNECT table on MariaDB

      [root@kc0022 ~]# mysql -u root
      Welcome to the MariaDB monitor.  Commands end with ; or \g.
      Your MariaDB connection id is 351
      Server version: 10.0.13-MariaDB MariaDB Server
       
      Copyright (c) 2000, 2014, Oracle, SkySQL Ab and others.
       
      Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
       
      MariaDB [(none)]> use test;
      Reading table information for completion of table and column names
      You can turn off this feature to get a quicker startup with -A
       
      Database changed
      MariaDB [test]> CREATE TABLE table_utf8
          -> engine=CONNECT table_type=CSV file_name='../table_utf8.csv'
          -> header=1 sep_char=',' quoted=1;
      Query OK, 0 rows affected (0.05 sec)
       
      MariaDB [test]> EXPLAIN table_utf8;
      +----------------------------------------------------------------------+----------+------+-----+---------+-------+
      | Field                                                                | Type     | Null | Key | Default | Extra |
      +----------------------------------------------------------------------+----------+------+-----+---------+-------+
      | i≫?"i?£i? ̄i?μi?°i? ̄i?Ri??i?Ri?μi?-"                                  | char(5)  | NO   |     | NULL    |       |
      | i?£i? ̄i?μi?°i? ̄i?Ri??i?Ri?!i?-i?\                                    | char(33) | NO   |     | NULL    |       |
      | i?3i?´i?!i?2i?´i??i??i?!i?´i?\                                       | char(10) | NO   |     | NULL    |       |
      | i?\i?Ri??i??i??i?!i?´i?\                                             | char(10) | NO   |     | NULL    |       |
      +----------------------------------------------------------------------+----------+------+-----+---------+-------+
      4 rows in set (0.05 sec)
       
      MariaDB [test]> SELECT * FROM table_utf8;
      +----------------------------------------------------------------------+--------------------------------------------------------------------+--------------------------------------------------------------+--------------------------------------------------+
      | i≫?"i?£i? ̄i?μi?°i? ̄i?Ri??i?Ri?μi?-"                                  | i?£i? ̄i?μi?°i? ̄i?Ri??i?Ri?!i?-i?\                                  | i?3i? ´i?!i?2i?´i??i??i?!i?´i?\                               | i?\i?Ri??i??i??i?!i?´i?\                         |
      +----------------------------------------------------------------------+--------------------------------------------------------------------+--------------------------------------------------------------+--------------------------------------------------+
      | A8772                                                                | CAM PAIGN1                                                 | 2005-03-11                                                   | 2005-04-01                                       |
      | A6703                                                                | ANN IVERSARY                                             | 2005-08-15                                                   | 2005-09-05                                       |
      | A4313                                                                | NEW _ITEM                                                   | 2005-09-13                                                   | 2005-10-04                                       |
      +----------------------------------------------------------------------+--------------------------------------------------------------------+--------------------------------------------------------------+--------------------------------------------------+
      3 rows in set (0.00 sec)
       
      MariaDB [test]> show variables like "%char%";
      +--------------------------+----------------------------+
      | Variable_name            | Value                      |
      +--------------------------+----------------------------+
      | character_set_client     | utf8                       |
      | character_set_connection | utf8                       |
      | character_set_database   | utf8                       |
      | character_set_filesystem | binary                     |
      | character_set_results    | utf8                       |
      | character_set_server     | utf8                       |
      | character_set_system     | utf8                       |
      | character_sets_dir       | /usr/share/mysql/charsets/ |
      +--------------------------+----------------------------+
      8 rows in set (0.00 sec)

      Attachments

        Activity

          bertrandop Olivier Bertrand added a comment - - edited

          Indeed, using UltraEdit it was correctly decoded and I could verify if is in UTF-8 (coded in three bytes).

          I did not check yet what happen with column names. They are retrieved by CONNECT from the file header during the discovery process and perhaps not decoded properly. This can be temporarily avoided by manually defining the columns in the create table statement.

          However, I traced what happen with the data part. Because it is in UTF-8 (even the ASCII columns can be regarded as UTF-8) I created the table as:

          CREATE TABLE table_utf8 engine=CONNECT table_type=CSV default charset=UTF8
          file_name='E:/Data/table_utf8.csv' header=1 sep_char=',' quoted=1 data_charset=utf8;

          The data_charset option indicates the coding of the table data.

          What CONNECT does when reading the table is retrieving the column data and I could verify that it get it correctly and call the matching field store function with the charset parameter set by the data_charset option as UTF-8:

            rc= fp->store(p, strlen(p), charset, CHECK_FIELD_WARN);

          From there, all is done by MariaDB. If the default charset of the table is Latin1, MariaDB tries to convert the three bytes characters to Latin1 and fails, then they are displayed as '?' question marks.

          If the table default charset is UTF-8, the data is not converted (as the from charset is equal to the to charset)
          The way it is displayed depends on the client program and settings.

          Therefore this seems to be a MariaDB issue, at least for the data part of the table.

          bertrandop Olivier Bertrand added a comment - - edited Indeed, using UltraEdit it was correctly decoded and I could verify if is in UTF-8 (coded in three bytes). I did not check yet what happen with column names. They are retrieved by CONNECT from the file header during the discovery process and perhaps not decoded properly. This can be temporarily avoided by manually defining the columns in the create table statement. However, I traced what happen with the data part. Because it is in UTF-8 (even the ASCII columns can be regarded as UTF-8) I created the table as: CREATE TABLE table_utf8 engine=CONNECT table_type=CSV default charset=UTF8 file_name='E:/Data/table_utf8.csv' header=1 sep_char=',' quoted=1 data_charset=utf8; The data_charset option indicates the coding of the table data. What CONNECT does when reading the table is retrieving the column data and I could verify that it get it correctly and call the matching field store function with the charset parameter set by the data_charset option as UTF-8: rc= fp->store(p, strlen(p), charset, CHECK_FIELD_WARN); From there, all is done by MariaDB. If the default charset of the table is Latin1, MariaDB tries to convert the three bytes characters to Latin1 and fails, then they are displayed as '?' question marks. If the table default charset is UTF-8, the data is not converted (as the from charset is equal to the to charset) The way it is displayed depends on the client program and settings. Therefore this seems to be a MariaDB issue, at least for the data part of the table.

          Redefining the table as:

          CREATE TABLE `table_utf8` (
            `COUPON_NUM` char(5) NOT NULL,
            `COUPON_NAME` char(33) NOT NULL,
            `START_DATE` char(10) NOT NULL,
            `END_DATE` char(10) NOT NULL
          ) ENGINE=CONNECT DEFAULT CHARSET=UTF8 `TABLE_TYPE`='CSV' `FILE_NAME`='E:/Data/table_utf8.csv' `SEP_CHAR`=',' `DATA_CHARSET`='utf8' `HEADER`=1 `QUOTED`=1;

          I get the following result when character_set_results='latin1':

          +------------+-------------+------------+------------+
          | COUPON_NUM | COUPON_NAME | START_DATE | END_DATE   |
          +------------+-------------+------------+------------+
          | A8772      | ?????????   | 2005-03-11 | 2005-04-01 |
          | A6703      | ??????????? | 2005-08-15 | 2005-09-05 |
          | A4313      | ????????    | 2005-09-13 | 2005-10-04 |
          +------------+-------------+------------+------------+

          This seems normal, those three bytes characters cannot be translated to latin1.
          Now if character_set_results='utf8' the display is:

          +------------+-----------------------------------+------------+------------+
          | COUPON_NUM | COUPON_NAME                       | START_DATE | END_DATE   |
          +------------+-----------------------------------+------------+------------+
          | A8772      | ´╝ú´╝í´╝¡´╝░´╝í´╝®´╝º´╝«´╝æ       | 2005-03-11 | 2005-04-01 |
          | A6703      | ´╝í´╝«´╝«´╝®´╝´╝Ñ´╝▓´╝│´╝í´╝▓´╝╣ | 2005-08-15 | 2005-09-05 |
          | A4313      | ´╝«´╝Ñ´╝À´╝┐´╝®´╝┤´╝Ñ´╝¡          | 2005-09-13 | 2005-10-04 |
          +------------+-----------------------------------+------------+------------+

          To me it looks as if the client program were not decoding the result even the character_set_client is set to UTF-8. So it could also be a problem coming from the client program.

          bertrandop Olivier Bertrand added a comment - Redefining the table as: CREATE TABLE `table_utf8` ( `COUPON_NUM` char(5) NOT NULL, `COUPON_NAME` char(33) NOT NULL, `START_DATE` char(10) NOT NULL, `END_DATE` char(10) NOT NULL ) ENGINE=CONNECT DEFAULT CHARSET=UTF8 `TABLE_TYPE`='CSV' `FILE_NAME`='E:/Data/table_utf8.csv' `SEP_CHAR`=',' `DATA_CHARSET`='utf8' `HEADER`=1 `QUOTED`=1; I get the following result when character_set_results='latin1': +------------+-------------+------------+------------+ | COUPON_NUM | COUPON_NAME | START_DATE | END_DATE | +------------+-------------+------------+------------+ | A8772 | ????????? | 2005-03-11 | 2005-04-01 | | A6703 | ??????????? | 2005-08-15 | 2005-09-05 | | A4313 | ???????? | 2005-09-13 | 2005-10-04 | +------------+-------------+------------+------------+ This seems normal, those three bytes characters cannot be translated to latin1. Now if character_set_results='utf8' the display is: +------------+-----------------------------------+------------+------------+ | COUPON_NUM | COUPON_NAME | START_DATE | END_DATE | +------------+-----------------------------------+------------+------------+ | A8772 | ´╝ú´╝í´╝¡´╝░´╝í´╝®´╝º´╝«´╝æ | 2005-03-11 | 2005-04-01 | | A6703 | ´╝í´╝«´╝«´╝®´╝´╝Ñ´╝▓´╝│´╝í´╝▓´╝╣ | 2005-08-15 | 2005-09-05 | | A4313 | ´╝«´╝Ñ´╝À´╝┐´╝®´╝┤´╝Ñ´╝¡ | 2005-09-13 | 2005-10-04 | +------------+-----------------------------------+------------+------------+ To me it looks as if the client program were not decoding the result even the character_set_client is set to UTF-8. So it could also be a problem coming from the client program.

          Concerning the column names, there was a problem when specified in UTF8 because, MariaDB internally handling column names in UTF8, CONNECT was translating them before creating the table. Now for CSV tables specifying DATA_CHARSET=UTF8 this translation is no more done, considering they already are coded in UTF8.

          For instance, starting from the file:

          Opération,Monnaie,Débit,Crédit
          Chèque,£,100,0
          Virement,€,0,5600

          I can create the table:

          create table tab_csv8 engine=connect table_type=csv file_name='E:/Data/csv8.csv' header=1 lrecl=64 data_charset=utf8;

          Describe tab_csv8 replies:

          Field Type Null Key Default Extra
          Opération char(8) NO <null> <null> <null>
          Monnaie char(3) NO <null> <null> <null>
          Débit int(3) NO <null> <null> <null>
          Crédit int(4) NO <null> <null> <null>

          and

          select * from tab_csv8;

          Returns:

          Opération Monnaie Débit Crédit
          Chèque £ 100 0
          Virement 0 5600

          However, if instead of using my graphic client program I use the mysql client, the table is displayed as:

          MariaDB [test]> select * from tab_csv8;
          +-----------+---------+-------+--------+
          | Opération | Monnaie | Débit | Crédit |
          +-----------+---------+-------+--------+
          | Chèque    | £       |   100 |      0 |
          | Virement  | ?       |     0 |   5600 |
          +-----------+---------+-------+--------+

          See; the Euro character is not recognized and printed as a question mark. This shows that the problem is a problem of character recognition by the mysql client program.

          Note that this is done with:

          MariaDB [test]> show variables like '%char%';
          +--------------------------+-----------------------------------------------------------------+
          | Variable_name            | Value                                                           |
          +--------------------------+-----------------------------------------------------------------+
          | character_set_client     | cp850                                                           |
          | character_set_connection | cp850                                                           |
          | character_set_database   | latin1                                                          |
          | character_set_filesystem | binary                                                          |
          | character_set_results    | cp850                                                           |
          | character_set_server     | latin1                                                          |
          | character_set_system     | utf8                                                            |
          | character_sets_dir       | d:\CommonSource\mariadb-10.0.5\10.0-connect\sql\share\charsets\ |
          +--------------------------+-----------------------------------------------------------------+

          If those character sets are set to UTF8, things are even worst and the table is displayed as:

          MariaDB [test]> select * from tab_csv8;
          +------------+---------+--------+---------+
          | Op├®ration | Monnaie | D├®bit | Cr├®dit |
          +------------+---------+--------+---------+
          | Chèque    | £      |    100 |       0 |
          | Virement   | Ôé¼     |      0 |    5600 |
          +------------+---------+--------+---------+

          Therefore, I think this issue is fixed for CONNECT (column names are now retrieved correctly) but of course there are pending problem for displaying some characters that are not CONNECT wise.

          bertrandop Olivier Bertrand added a comment - Concerning the column names, there was a problem when specified in UTF8 because, MariaDB internally handling column names in UTF8, CONNECT was translating them before creating the table. Now for CSV tables specifying DATA_CHARSET=UTF8 this translation is no more done, considering they already are coded in UTF8. For instance, starting from the file: Opération,Monnaie,Débit,Crédit Chèque,£,100,0 Virement,€,0,5600 I can create the table: create table tab_csv8 engine=connect table_type=csv file_name='E:/Data/csv8.csv' header=1 lrecl=64 data_charset=utf8; Describe tab_csv8 replies: Field Type Null Key Default Extra Opération char(8) NO <null> <null> <null> Monnaie char(3) NO <null> <null> <null> Débit int(3) NO <null> <null> <null> Crédit int(4) NO <null> <null> <null> and select * from tab_csv8; Returns: Opération Monnaie Débit Crédit Chèque £ 100 0 Virement € 0 5600 However, if instead of using my graphic client program I use the mysql client, the table is displayed as: MariaDB [test]> select * from tab_csv8; +-----------+---------+-------+--------+ | Opération | Monnaie | Débit | Crédit | +-----------+---------+-------+--------+ | Chèque | £ | 100 | 0 | | Virement | ? | 0 | 5600 | +-----------+---------+-------+--------+ See; the Euro character is not recognized and printed as a question mark. This shows that the problem is a problem of character recognition by the mysql client program. Note that this is done with: MariaDB [test]> show variables like '%char%'; +--------------------------+-----------------------------------------------------------------+ | Variable_name | Value | +--------------------------+-----------------------------------------------------------------+ | character_set_client | cp850 | | character_set_connection | cp850 | | character_set_database | latin1 | | character_set_filesystem | binary | | character_set_results | cp850 | | character_set_server | latin1 | | character_set_system | utf8 | | character_sets_dir | d:\CommonSource\mariadb-10.0.5\10.0-connect\sql\share\charsets\ | +--------------------------+-----------------------------------------------------------------+ If those character sets are set to UTF8, things are even worst and the table is displayed as: MariaDB [test]> select * from tab_csv8; +------------+---------+--------+---------+ | Op├®ration | Monnaie | D├®bit | Cr├®dit | +------------+---------+--------+---------+ | Ch├¿que | ┬ú | 100 | 0 | | Virement | Ôé¼ | 0 | 5600 | +------------+---------+--------+---------+ Therefore, I think this issue is fixed for CONNECT (column names are now retrieved correctly) but of course there are pending problem for displaying some characters that are not CONNECT wise.

          Indeed, setting default charset=UTF8 and configuring the client solves the problem for the data part.
          This is the same in my test I submitted in 2015-01-14.

          But for the header part, I don't understand what is fixed at the moment.
          Do you mean data is handled correctly inside, but displaying it errors?
          Is it possible to fix the client so that all characters can be shown properly?

          takuya Takuya Aoki (Inactive) added a comment - Indeed, setting default charset=UTF8 and configuring the client solves the problem for the data part. This is the same in my test I submitted in 2015-01-14. But for the header part, I don't understand what is fixed at the moment. Do you mean data is handled correctly inside, but displaying it errors? Is it possible to fix the client so that all characters can be shown properly?

          As I said, the header part was a bug that will be fixed in next version.

          bertrandop Olivier Bertrand added a comment - As I said, the header part was a bug that will be fixed in next version.

          People

            bertrandop Olivier Bertrand
            takuya Takuya Aoki (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.