[MCOL-4576] cpimport from S3 Slower When Using Flags/Parameters Created: 2021-03-03 Updated: 2022-06-27 Resolved: 2022-04-04 |
|
| Status: | Closed |
| Project: | MariaDB ColumnStore |
| Component/s: | None |
| Affects Version/s: | 5.5.2 |
| Fix Version/s: | 6.3.1 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Todd Stoffel (Inactive) | Assignee: | Daniel Lee (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Sprint: | 2021-4, 2021-5, 2021-6, 2021-7, 2021-8, 2021-9, 2021-10, 2021-11, 2021-12, 2021-13, 2021-14, 2021-15, 2021-16, 2021-17 |
| Description |
|
This first example uses the command line flags, which can be found by running cpimport -h:

Method 1

[root@ip-172-31-43-62 cpimport]# cpimport -d3 -e 1 -H s3.us-west-2.amazonaws.com -y FAKEAUTHKEY -K FAKEAUTHSECRET -t sample-columnstore-data -g us-west-2 bts flights all.csv -s ',' -E '"'

which takes 153.233 seconds. The next example uses the method described at https://mariadb.com/kb/en/columnstore-bulk-data-loading/#bulk-loading-from-aws-s3:

Method 2

[root@ip-172-31-43-62 cpimport]# aws s3 cp --quiet s3://sample-columnstore-data/all.csv - | cpimport -d 3 -e 1 bts flights -s ',' -E '"'

which takes 41.367 seconds. We can see that Method 1 is almost 4x slower. |
| Comments |
| Comment by Ben Thompson (Inactive) [ 2022-03-11 ] | |||||||||||||||||||||||||
|
Because of how the libmarias3 library is written, the Method 2 approach of using aws-cli to pipe the file into cpimport will always perform better on larger object files, so the command-line options for loading S3 objects as input files have been removed from cpimport. It would be possible to bring this back in the future with some major rewrites, but that seems like unnecessary work given the alternative. The second reason for the removal is that the current implementation crashes cpimport with an out-of-memory error when an S3 object file larger than system memory is used as input data. The simple solution for now is to use alternative tooling to download S3 objects as input for cpimport; removing the option eliminates configurations that could potentially crash cpimport. | |||||||||||||||||||||||||
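A minimal sketch of the "alternative tooling" approach described above, reusing the bucket, schema, and table names from the ticket's description; any HTTP client that streams to stdout works, for example curl with a pre-signed URL generated by aws-cli:

```shell
# Generate a time-limited pre-signed URL for the object, then stream it
# straight into cpimport so the file never has to fit in memory (avoiding
# the OOM crash) or be staged on local disk.
URL=$(aws s3 presign s3://sample-columnstore-data/all.csv --expires-in 600)
curl -s "$URL" | cpimport -d 3 -e 1 bts flights -s ',' -E '"'
```

This is equivalent in effect to the aws-cli pipe in Method 2; the pre-signed URL variant is useful on hosts where aws-cli is unavailable but credentials can be used once to mint the URL elsewhere.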
| Comment by David Hall (Inactive) [ 2022-03-21 ] | |||||||||||||||||||||||||
|
QA: After this patch, method 1 above will no longer be accepted. | |||||||||||||||||||||||||
| Comment by Daniel Lee (Inactive) [ 2022-03-24 ] | |||||||||||||||||||||||||
|
Build tested: 6.3.1-1 (#4139). Verified that cpimport's S3 option has been removed and the help text has been updated. There is an issue with LOAD DATA INFILE after the change: when the columnstore_use_import_for_batchinsert variable is set to ON (the default), LOAD DATA INFILE uses cpimport to perform the batch insert for better performance, and this method of data loading now returns an error. Setting columnstore_use_import_for_batchinsert to OFF avoids the issue. The error also occurs when S3 is not used.
| |||||||||||||||||||||||||
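A sketch of the workaround described in the comment above. The input path and table are hypothetical placeholders; columnstore_use_import_for_batchinsert is the variable named in the ticket:

```shell
# Disable the cpimport-backed batch-insert path for this session only,
# then run the load; with the variable OFF, LOAD DATA INFILE no longer
# routes through cpimport and avoids the reported error.
mariadb -e "
  SET SESSION columnstore_use_import_for_batchinsert = OFF;
  LOAD DATA INFILE '/tmp/all.csv' INTO TABLE bts.flights
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"';
"
```

Using SET SESSION rather than SET GLOBAL limits the slower non-cpimport insert path to this one connection.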
| Comment by alexey vorovich (Inactive) [ 2022-03-31 ] | |||||||||||||||||||||||||
|
ben.thompson, David.Hall: please update the ticket.
| |||||||||||||||||||||||||
| Comment by Daniel Lee (Inactive) [ 2022-04-04 ] | |||||||||||||||||||||||||
|
Build verified: 6.3.1-1 (#4234) Verified that | |||||||||||||||||||||||||
| Comment by Leonid Fedorov [ 2022-04-05 ] | |||||||||||||||||||||||||
|
Hi guys. For data loading I need cpimport's S3 support, so could we just delete the documentation for the S3 flags but keep their support in cpimport? |||||||||||||||||||||||||
| Comment by Leonid Fedorov [ 2022-04-05 ] | |||||||||||||||||||||||||
| Comment by alexey vorovich (Inactive) [ 2022-04-05 ] | |||||||||||||||||||||||||
|
I think leonid.fedorov agreed to uncomment the feature for his own private testing, so we should be finished with this ticket and should NOT need any new ones. | |||||||||||||||||||||||||
| Comment by Leonid Fedorov [ 2022-04-05 ] | |||||||||||||||||||||||||
|
That's correct. I'll do my feature testing in my own cpimport branch; no need for any extra issues or anyone else's work. |