[MCOL-501] cpimport string with \ and \\ in the end Created: 2017-01-09  Updated: 2017-06-03  Resolved: 2017-06-03

Status: Closed
Project: MariaDB ColumnStore
Component/s: cpimport
Affects Version/s: None
Fix Version/s: Icebox

Type: Bug Priority: Major
Reporter: Igor Kostyukovski Assignee: Daniel Lee (Inactive)
Resolution: Not a Bug Votes: 0
Labels: None

Attachments: HTML File test    

 Description   

/usr/local/mariadb/columnstore/bin/cpimport r8 test /var/spin_sync/test1 -E '|' -s '\t' -C '\' -e 4000

we can't load this file without errors.
in case of changing escaped character it is not possible to define escape character that file doesn't contain at the end of the URL field.



 Comments   
Comment by Igor Kostyukovski [ 2017-01-09 ]

i'm sorry
this issue about columnstore engine

Comment by Daniel Lee (Inactive) [ 2017-04-21 ]

In line 4 of the data, there is "php|". If the "|" is part of the data, then line 4 does not have enough data values for the row. If that "|" is meant to be a column separator, then the extra \ needs to be removed. If the "\" meant to be part of the data, then need to use "
", as it is done in line 5.

Comment by David Thompson (Inactive) [ 2017-05-08 ]

Hi Igor, can you confirm if the advice from Daniel addresses the issue or not?

Comment by Igor Kostyukovski [ 2017-05-09 ]

but we load a millions(~200m) urls each minute to database, file just an example, that maria cpimport has a bug.
for example infobright loader load this file without errors.

i misunderstood suggestion from Daniel, i see only one two ways for now
1) sed file by regexp and replace - take near 40s for one file
2) ignore all lines with errors

both of this ways are wrong, as i say problem exists only on columstore engine

Comment by Andrew Hutchings (Inactive) [ 2017-05-09 ]

Hi Igor,

What Daniel is basically saying is the options you supplied do not fit the file format use supplied as the example. You have specified a backslash as the escape character which means you are escaping the enclosed character in line 4 of your example. This will cause an error. If the intention was to have a backslash in the data then this needs to be escaped by having a double-backslash, as is standard in CSV style file formats.

I'm not an Infobright user but if you are specifying the same parameters to it I can only imagine it is working around this by pattern matching the enclosed characters first before processing escape characters. This has its own set of problems, for example an unescaped enclosed character or two in the data can cause the pattern matching to pick the wrong data for a column.

Unfortunately the problem is either in the data or the options supplied, not in ColumnStore in this instance. ColumnStore is behaving as intended / documented. To code a workaround to this specific use case would incur a performance hit for all imports and would still have other potential related issues.

What you could do is pipe the data through sed and then through cpimport. Some of the performance hit of sed you observed would be gone because the extra file write step would not be needed.

Comment by Igor Kostyukovski [ 2017-05-09 ]

Thanks for your answers!

May be you can give advise how we can format csv to load this data to Columnstore without errors? For now we skip ~4% of collected urls with this problem. This ticket was created a long time ago, and we spent much time trying to load data and to be honest i don't remember details.
For our DPI solution 4% is a lot. This is only one reason why we cant use maria columnstore on production systems.
And it seems like i created this file by myself, but this is just a part of collected data

Comment by Andrew Hutchings (Inactive) [ 2017-05-09 ]

Hi Igor,

No problem. If your intention is to enclose with pipes and escape with backslashes then with the paths in column 8 if you could filter them so backslashes in the paths are themselves escaped with backslashes then this will load fine. Your paths look like URL paths, if they are you could consider URL encoding them under RFC 3986, then backslashes in the paths become "%5C". Alternatively if you don't ever have any non-printable ASCII characters in the paths you could use these as escape and enclose characters.

Comment by David Thompson (Inactive) [ 2017-06-03 ]

System is working as designed.

Generated at Thu Feb 08 02:21:34 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.