[MDEV-16979] Inconsistent data stored via INSERT INTO and .csv file when using Russian letters Created: 2018-08-14  Updated: 2018-08-16  Resolved: 2018-08-16

Status: Closed
Project: MariaDB Server
Component/s: Character Sets
Affects Version/s: 10.2.14
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Martin Landhage Assignee: Unassigned
Resolution: Not a Bug Votes: 0
Labels: tests
Environment:

NAME=Fedora
VERSION="28 (Cloud Edition)"


Attachments: Zip Archive csv-inconsist.zip    

 Description   

I have two tables that I merge through a Python program:
Teltec.Prislista
Teltec.Lager

The selection is done using Teltec.Prislista.Typ the data is merged to Teltec.Ror.

This has been working fine for both Swedish and Ryssian letters when data is inserted with INSERT INTO command for both tables.

Then I loaded data using a csv file. Then my python 2.7 sorting program did not select types with Russian and Swedish letters due to mismatch between the tables.

The data stored are simply diffrent ! I see this as a MariaDB inconsistency.

I have enclosed:
prislista.txt Table Prislista using -e "select * from Prislista"
lager.txt Table Lager using -e "select * from Lager"
prislista.csv csv file loaded
prislista.sql sql commands for loading table Prislista.
sqldump.sql Dumped database where sensitive data is removed.
Ror.py Data sorting python program

Compare prislista.txt and lager.txt, for rows containing Russian and Swedish letters does not look ok. This is why the python program failes to get a match.

Using MySQL 5.7 I got the data sorting to work for Russian and Swedish letters by adding the following lines in prislista.sql:
SET NAMES 'utf8';
SET CHARACTER SET utf8;
SET SESSION collation_connection = 'utf8_unicode_ci';
This does NOT make a change for MariaDB.



 Comments   
Comment by Alice Sherepa [ 2018-08-16 ]

I can repeat the difference, got also the same results with Mysql 5.7.

MariaDB [test1]> CREATE TABLE `t1` (
    ->   `P_id` int(11) NOT NULL AUTO_INCREMENT,
    ->   `Typ` varchar(12) COLLATE utf8_swedish_ci NOT NULL,
    ->   `Kategori` varchar(20) COLLATE utf8_swedish_ci NOT NULL,
    ->   `Pris` int(4) NOT NULL,
    ->   PRIMARY KEY (`P_id`)
    -> ) ENGINE=InnoDB AUTO_INCREMENT=369 DEFAULT CHARSET=utf8 COLLATE=utf8_swedish_ci;
Query OK, 0 rows affected (0.028 sec)
 
MariaDB [test1]> CREATE TABLE `t2` (
    ->   `P_id` int(11) NOT NULL AUTO_INCREMENT,
    ->   `Typ` varchar(12) COLLATE utf8_swedish_ci NOT NULL,
    ->   `Kategori` varchar(20) COLLATE utf8_swedish_ci NOT NULL,
    ->   `Pris` int(4) NOT NULL,
    ->   PRIMARY KEY (`P_id`)
    -> ) ENGINE=InnoDB AUTO_INCREMENT=369 DEFAULT CHARSET=utf8 COLLATE=utf8_swedish_ci;
Query OK, 0 rows affected (0.026 sec)
 
MariaDB [test1]> load data local infile '1.csv' into table t1 
    -> fields terminated by ',' lines terminated by '\n';
Query OK, 6 rows affected (0.006 sec)                
Records: 6  Deleted: 0  Skipped: 0  Warnings: 0
 
MariaDB [test1]> INSERT INTO t2 VALUES  
    -> (70,"6П1П-EB","nytslutsteg",160),
    -> (71,"6П3C-E","nytslutsteg",325),
    -> (72,"6П14П","nytslutsteg",300),
    -> (73,"6П14П","fynd",140),
    -> (140,"B7G-KKÅ","rorh",100),
    -> (141,"B8B","rorh",65);
Query OK, 6 rows affected (0.009 sec)
Records: 6  Duplicates: 0  Warnings: 0
 
MariaDB [test1]> select * from t1;
+------+---------------+-------------+------+
| P_id | Typ           | Kategori    | Pris |
+------+---------------+-------------+------+
|   70 | 6П1П-EB     | nytslutsteg |  160 |
|   71 | 6П3C-E       | nytslutsteg |  325 |
|   72 | 6П14П       | nytslutsteg |  300 |
|   73 | 6П14П       | fynd        |  140 |
|  140 | B7G-KKÃ…      | rorh        |  100 |
|  141 | B8B           | rorh        |   65 |
+------+---------------+-------------+------+
6 rows in set (0.000 sec)
 
MariaDB [test1]> select * from t2;
+------+-----------+-------------+------+
| P_id | Typ       | Kategori    | Pris |
+------+-----------+-------------+------+
|   70 | 6П1П-EB   | nytslutsteg |  160 |
|   71 | 6П3C-E    | nytslutsteg |  325 |
|   72 | 6П14П     | nytslutsteg |  300 |
|   73 | 6П14П     | fynd        |  140 |
|  140 | B7G-KKÅ   | rorh        |  100 |
|  141 | B8B       | rorh        |   65 |
+------+-----------+-------------+------+
6 rows in set (0.001 sec)

1.csv :

70,6П1П-EB,nytslutsteg,160
71,6П3C-E,nytslutsteg,325
72,6П14П,nytslutsteg,300
73,6П14П,fynd,140
140,B7G-KKÅ,rorh,100
141,B8B,rorh,65

Comment by Alice Sherepa [ 2018-08-16 ]

"The character set indicated by the character_set_database system variable is used to interpret the information in the file. SET NAMES and the setting of character_set_client do not affect interpretation of input. If the contents of the input file use a character set that differs from the default, it is usually preferable to specify the character set of the file by using the CHARACTER SET clause, which is available." (https://mariadb.com/kb/en/library/load-data-infile/)

So the difference was caused by character_set_database (=latin1), while after specifing CHARACTER SET utf8 the results are correct.

ariaDB [test1]> load data local infile '1.csv' into table t1 CHARACTER SET utf8
    -> fields terminated by ',' lines terminated by '\n' ;
Query OK, 6 rows affected (0.01 sec)                 
Records: 6  Deleted: 0  Skipped: 0  Warnings: 0
 
MariaDB [test1]> select * from t1;
+------+-----------+-------------+------+
| P_id | Typ       | Kategori    | Pris |
+------+-----------+-------------+------+
|   70 | 6П1П-EB   | nytslutsteg |  160 |
|   71 | 6П3C-E    | nytslutsteg |  325 |
|   72 | 6П14П     | nytslutsteg |  300 |
|   73 | 6П14П     | fynd        |  140 |
|  140 | B7G-KKÅ   | rorh        |  100 |
|  141 | B8B       | rorh        |   65 |
+------+-----------+-------------+------+

Generated at Thu Feb 08 08:33:00 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.