[MCOL-1858] An `invalid` records indication when loading the table unsing mcsimport Created: 2018-11-05  Updated: 2023-10-26  Resolved: 2018-11-13

Status: Closed
Project: MariaDB ColumnStore
Component/s: None
Affects Version/s: 1.2.0
Fix Version/s: 1.2.1

Type: Bug Priority: Major
Reporter: Martin Adamec Assignee: Zdravelina Sokolovska (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Attachments: File MariaDB ColumnStore mcsimport-1.2.1-1-x64.msi    
Issue Links:
PartOf
includes MCOL-1853 ColumnStore breaks on S3 NFS mounts Closed
Sprint: 2018-20

 Description   

While loading data from CSV files into a ColumnStore tables using mcsimport utility most of the files ended successfully loading as many records as there were rows in the CSV file but indicated a various amount of invalid records.
I was not able to track down what invalid records they are and what was causing the invalid state of each of them as I did not find any log (was looking for something similar to cpimport error logs).
Here is the list of all files loading mcsimport responses. But I am only attaching two smaller files that indicated an issue together with the create table command for each one of them.

/usr/local/mariadb/columnstore/tools/mcsimport/mcsimport tradealert calendar calendar-datadock_2018_Q1.csv && /usr/local/mariadb/columnstore/tools/mcsimport/mcsimport tradealert groups groups-datadock_2018_Q1.csv && /usr/local/mariadb/columnstore/tools/mcsimport/mcsimport tradealert onelots onelots-datadock_2018_Q1.csv && /usr/local/mariadb/columnstore/tools/mcsimport/mcsimport tradealert open_interest open_interest-datadock_2018_Q1.csv && /usr/local/mariadb/columnstore/tools/mcsimport/mcsimport tradealert optimal_early_exercise optimal_ex-datadock_2018_Q1.csv && /usr/local/mariadb/columnstore/tools/mcsimport/mcsimport tradealert rates rates-datadock_2018_Q1.csv && /usr/local/mariadb/columnstore/tools/mcsimport/mcsimport tradealert securities securities-datadock_2018_Q1.csv && /usr/local/mariadb/columnstore/tools/mcsimport/mcsimport tradealert underlyings underlyings-datadock_2018_Q1.csv && /usr/local/mariadb/columnstore/tools/mcsimport/mcsimport tradealert usec_hist usec_hist-datadock_2018_Q1.csv

Execution time: 0.817828s
Rows inserted: 10064
Truncation count: 0
Saturated count: 0
Invalid count: 0
cat calendar-datadock_2018_Q1.csv | wc -l
10064

Execution time: 0.161977s
Rows inserted: 2077
Truncation count: 0
Saturated count: 0
Invalid count: 1
cat groups-datadock_2018_Q1.csv | wc -l
2077

Execution time: 4281.18s
Rows inserted: 87483118
Truncation count: 0
Saturated count: 0
Invalid count: 32809561
cat onelots-datadock_2018_Q1.csv | wc -l
87483118

Execution time: 1043.61s
Rows inserted: 53818130
Truncation count: 0
Saturated count: 0
Invalid count: 38551020
cat open_interest-datadock_2018_Q1.csv | wc -l
53818130

Execution time: 0.481345s
Rows inserted: 17387
Truncation count: 0
Saturated count: 0
Invalid count: 159
cat optimal_ex-datadock_2018_Q1.csv | wc -l
17387

Execution time: 2718.69s
Rows inserted: 101144576
Truncation count: 0
Saturated count: 0
Invalid count: 146538555
cat rates-datadock_2018_Q1.csv | wc -l
101144576

Execution time: 91.0303s
Rows inserted: 4278121
Truncation count: 0
Saturated count: 0
Invalid count: 34791
cat securities-datadock_2018_Q1.csv | wc -l
4278121

Execution time: 38.2951s
Rows inserted: 264387
Truncation count: 0
Saturated count: 0
Invalid count: 17
cat underlyings-datadock_2018_Q1.csv | wc -l
264387

Execution time: 26.1987s
Rows inserted: 494710
Truncation count: 61
Saturated count: 0
Invalid count: 1940652
cat usec_hist-datadock_2018_Q1.csv | wc -l
494710

CREATE TABLE `optimal_early_exercise` (
`secid` int(11) DEFAULT NULL,
`date` date DEFAULT NULL,
`net_div` float DEFAULT NULL,
`exdiv` date DEFAULT NULL,
`open_int` int(11) DEFAULT NULL,
`hedged_gains` float DEFAULT NULL,
`put_mid` float DEFAULT NULL,
`call_mid` float DEFAULT NULL,
`funding` float DEFAULT NULL,
`stock_px` float DEFAULT NULL,
`unexercised` int(11) DEFAULT NULL,
`value_lost` float DEFAULT NULL,
`theo_gains` float DEFAULT NULL,
`put_ask` float DEFAULT NULL,
`stamp` char(12) DEFAULT NULL,
`latest` tinyint(1) DEFAULT NULL
) ENGINE=Columnstore DEFAULT CHARSET=latin1;

CREATE TABLE `usec_hist` (
`date` date NOT NULL DEFAULT '0000-00-00',
`usymbol` varchar(8) DEFAULT '',
`sectype` varchar(4) DEFAULT NULL,
`description` varchar(128) DEFAULT NULL,
`sector` varchar(8) DEFAULT NULL,
`subsector` varchar(8) DEFAULT NULL,
`cap` varchar(10) DEFAULT NULL,
`shares_outstanding` float DEFAULT NULL,
`next_earnings` date DEFAULT NULL,
`next_exdiv` date DEFAULT NULL,
`price` float DEFAULT NULL,
`close` float DEFAULT NULL,
`chg` float DEFAULT NULL,
`high` float DEFAULT NULL,
`low` float DEFAULT NULL,
`high_date` date DEFAULT NULL,
`low_date` date DEFAULT NULL,
`htb` int(11) DEFAULT NULL,
`riskarb` tinyint(4) DEFAULT NULL,
`primary_exch` char(3) DEFAULT NULL,
`has_options` int(11) DEFAULT NULL,
`new_issue` int(11) DEFAULT NULL,
`volatility20day` float DEFAULT NULL,
`volatility60day` float DEFAULT NULL,
`volatility120day` float DEFAULT NULL,
`avg_size` int(11) DEFAULT NULL,
`bats_price` float DEFAULT NULL,
`open` float DEFAULT NULL
) ENGINE=Columnstore DEFAULT CHARSET=latin1;



 Comments   
Comment by Martin Adamec [ 2018-11-05 ]

Unfortunately I cannot upload any of these files as they are (even zipped) exceeding the limit for upload.
Please let me know the best way to get those files to you so you can check.

Comment by Andrew Hutchings (Inactive) [ 2018-11-05 ]

Can you please upload them to our write-only FTP server? https://mariadb.com/kb/en/meta/mariadb-ftp-server/

Comment by David Hall (Inactive) [ 2018-11-05 ]

A possible place to look is:
/usr/local/mariadb/columnstore/data/bulk/log
You may find .err logs here showing which rows are in error. It depends on the name of the job files if any got overwritten by subsequent jobs. You may have to do a little hunting to match things up.

Comment by Martin Adamec [ 2018-11-05 ]

Andrew,
Thank you for a hint. I uploaded files to FTP server as anonymous to uploads directory. I added MCOL-1858 identifier to the name of the each file so you can easily identify them.

Comment by Martin Adamec [ 2018-11-05 ]

David,
nothing is being written into that directory. At first we did not even have it on our `worker` host where we are loading data from. We grabbed an installation of mcsimport tool and simply placed it on the non-database host that we use as a processing back-end heavy lifting host running semi-manual scripts and all sorts of scheduled jobs. One of those jobs is pulling (collecting) data from external resources and loading them into our database.

We did not find much documentation about mcsimport. I previously used cpimport and I found it to be really good. The only limitation was that it has to run on UM or PM depending on the mode. Once I found about mcsimport and the possibility of running it from non-database (external) host as a remote-loader I was eager to use it even we did not know much about how to set it up.

Please give us more color on current state of the tool and if we can use it. Especially about the nature of those invalid counts that we saw before.

Thank you

Martin

Comment by David Hall (Inactive) [ 2018-11-05 ]

Sorry. Andrew pointed out to me that you're using the remote mcsimport tool, not cpimport. I was confused. The logs I mentioned won't be generated, as it's a completely different tool than I thought you were using.

Comment by Jens Röwekamp (Inactive) [ 2018-11-07 ]

Hello Martin and thanks for providing us with the sample csv files. They were really helpful.

Based on those, I added three new command line parameter to mcsimport.

  • -err_log can be used to log saturated, truncated and invalid injection values in an extra file
  • -n 1 can be used to treat csv strings of NULL as NULL values. By default they are treated as data (which caused most of the experienced invalid counts)
  • -ignore_malformed_csv can be used to ignore malformed csv rows and continue the ingestion. By default mcsimport rolls back the entire bulk transaction if an malformed csv row is found.

If -err_log and -ignore_malformed_csv are used together all the ignored entries of the csv file will be added to the error log as well.

mcsimports 1.2.0 documentation can be found in our knowledge base.
It is a new tool and currently in Alpha release but intended to be GA with the GA version of ColumnStore 1.2 (most certainly in December 2018)
This changes will most certainly make it in mcsimport 1.2.1 which is scheduled to be released in the upcoming week.

The development usage documentation can be found in the regarding GitHub repository.


For QA:

  • I added three test cases for the null value option and the ignored malformed csv option to our regression test suite
  • Check if more tests are needed and add them to the regression suite
  • Compile mcsimport on Windows, CentOS 7, and one Ubuntu/Debian system
  • Run the regression test suite on all platforms
Comment by Zdravelina Sokolovska (Inactive) [ 2018-11-13 ]

the NULL insertion is done correctely from a CSV file
— case 1: field w/o value in the CSV file
— case 2: 'NULL' as text in the CSV file

[root@cps mcsimport]# cat types_null.csv
NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL
|||||||||||||||||||||||
 
 
[root@cps mcsimport]#  ./mcsimport a  ff10 types_null.csv  -d '|' -n 1
Execution time: 0.45288s
Rows inserted: 2
Truncation count: 0
Saturated count: 0
Invalid count: 0
 
MariaDB [(none)]>  select * from a.ff10 ;
+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
| a    | b    | c    | d    | e    | f    | g    | h    | i    | j    | k    | l    | m    | n    | p    | q    | r    | s    | t    | u    | v    | w    | z    | z1   |
+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
| NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
| NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
2 rows in set, 1 warning (0.033 sec)
 
 
 
MariaDB [(none)]> show create table a.ff10 ;
+-------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
+-------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ff10  | CREATE TABLE `ff10` (
  `a` tinyint(1) DEFAULT NULL,
  `b` tinyint(4) DEFAULT NULL,
  `c` smallint(6) DEFAULT NULL,
  `d` int(11) DEFAULT NULL,
  `e` int(11) DEFAULT NULL,
  `f` bigint(20) DEFAULT NULL,
  `g` decimal(10,0) DEFAULT NULL,
  `h` decimal(10,0) DEFAULT NULL,
  `i` float DEFAULT NULL,
  `j` double DEFAULT NULL,
  `k` double DEFAULT NULL,
  `l` char(1) DEFAULT NULL,
  `m` varchar(700) DEFAULT NULL,
  `n` tinytext DEFAULT NULL,
  `p` tinyblob DEFAULT NULL,
  `q` text DEFAULT NULL,
  `r` blob DEFAULT NULL,
  `s` mediumtext DEFAULT NULL,
  `t` mediumblob DEFAULT NULL,
  `u` longtext DEFAULT NULL,
  `v` longblob DEFAULT NULL,
  `w` date DEFAULT NULL,
  `z` datetime DEFAULT NULL,
  `z1` time DEFAULT NULL
) ENGINE=Columnstore DEFAULT CHARSET=latin1 |
+-------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.000 sec)
 
 
 
 
 
[root@cps mcsimport]# cat nulls.csv
||NULL
NULL||NULL
NULL|NULL|NULL
|NULL|
[root@cps mcsimport]# ./mcsimport a ff nulls.csv  -d '|'
Execution time: 0.297833s
Rows inserted: 4
Truncation count: 0
Saturated count: 0
Invalid count: 0
[root@cps mcsimport]# mcsmysql -u root -p1
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 48602
Server version: 10.3.10-MariaDB-log Columnstore 1.2.1-1
 
Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.
 
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
 
MariaDB [(none)]> select * from a.ff ;
+------+------+------+
| a    | b    | c    |
+------+------+------+
| NULL | NULL | NULL |
| NULL | NULL | NULL |
| NULL | NULL | NULL |
| NULL | NULL | NULL |
| NULL | NULL | NULL |
+------+------+------+
5 rows in set, 1 warning (0.280 sec)

malformed files could be loaded with mcsimport -ignore_malformed_csv and -err_log options
below tested with mcsimport and the default delimiter ',' and file including mismatched filed separators

cat nulls.csv
,,
||NULL
NULL||NULL
NULL|NULL|NULL
|NULL|
,,
 
 
[root@cps mcsimport]# ./mcsimport a ff nulls.csv  -ignore_malformed_csv -err_log
Execution time: 0.836845s
Rows inserted: 2
Truncation count: 0
Saturated count: 0
Invalid count: 0
Ignored malformed csv count: 4
[root@cps mcsimport]# mcsmysql -u root -p1
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 48696
Server version: 10.3.10-MariaDB-log Columnstore 1.2.1-1
 
Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.
 
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
 
MariaDB [(none)]> select * from a.ff ;
+------+------+------+
| a    | b    | c    |
+------+------+------+
| NULL | NULL | NULL |
| NULL | NULL | NULL |
+------+------+------+
2 rows in set, 1 warning (0.298 sec)
 
 
[root@cps mcsimport]# cat nulls.csv.1542122024432.err
error_type, column_nr, parsed_raw_row_values
MALFORMED_CSV_LINE, -1, ||NULL
MALFORMED_CSV_LINE, -1, NULL||NULL
MALFORMED_CSV_LINE, -1, NULL|NULL|NULL
MALFORMED_CSV_LINE, -1, |NULL|
 

Generated at Thu Feb 08 02:31:54 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.