[MDEV-17685] LOAD DATA LOCAL INFILE does not respect column character set Created: 2018-11-12  Updated: 2018-11-14  Resolved: 2018-11-14

Status: Closed
Project: MariaDB Server
Component/s: Character Sets
Affects Version/s: 10.3.10
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Remy Fox Assignee: Unassigned
Resolution: Not a Bug Votes: 0
Labels: None
Environment:

Windows



 Description   

Execute the following statements. Make sure to have a file ready at <file path> containing the string '"Réunion"'.

  • CREATE TABLE test.test_table (test_column VARCHAR(190) COLLATE utf8mb4_unicode_ci NOT NULL)
  • INSERT INTO test.test_table (test_column) VALUES ('Réunion')
  • LOAD DATA LOCAL INFILE '<file path>'
    INTO TABLE test.test_table
    FIELDS ESCAPED BY '
    '
    TERMINATED BY ','
    ENCLOSED BY '"'
    LINES TERMINATED BY '\n'
    (test_column)

You will find that only the first of these two insertions will correctly insert the value. The LOAD DATA LOCAL INFILE statement will not respect the column's character set. It produces something like Réunion. In fact, even if you execute 'SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci', it will not respect the character set. Adding the line 'CHARACTER SET utf8mb4' to the LOAD DATA LOCAL INFILE won't help either.

This causes two problems:

  • Obviously, the inserted data will be corrupted
  • Secondly, uniqueness constraint checks might fail because whatever collation is used by the LOAD DATA INFILE process might not recognize that two values are different according to the column's collation.

Update: my SELECT INTO OUTFILE did not specify a CHARACTER SET so this is what caused the failure. So maybe this is not a bug.



 Comments   
Comment by Sergei Golubchik [ 2018-11-14 ]

According to https://mariadb.com/kb/en/library/load-data-infile/

The character set indicated by the character_set_database system variable is used to interpret the information in the file. SET NAMES and the setting of character_set_client do not affect interpretation of input. If the contents of the input file use a character set that differs from the default, it is usually preferable to specify the character set of the file by using the CHARACTER SET clause, which is available. A character set of binary specifies "no conversion."

Generated at Thu Feb 08 08:38:20 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.