[MCOL-4576] cpimport from S3 Slower When Using Flags/Parameters Created: 2021-03-03  Updated: 2022-06-27  Resolved: 2022-04-04

Status: Closed
Project: MariaDB ColumnStore
Component/s: None
Affects Version/s: 5.5.2
Fix Version/s: 6.3.1

Type: Bug Priority: Minor
Reporter: Todd Stoffel (Inactive) Assignee: Daniel Lee (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Sprint: 2021-4, 2021-5, 2021-6, 2021-7, 2021-8, 2021-9, 2021-10, 2021-11, 2021-12, 2021-13, 2021-14, 2021-15, 2021-16, 2021-17

 Description   

This first example uses the command line flags which can be found by running cpimport -h:

Method 1

[root@ip-172-31-43-62 cpimport]# cpimport -d3 -e 1 -H s3.us-west-2.amazonaws.com -y FAKEAUTHKEY -K FAKEAUTHSECRET -t sample-columnstore-data -g us-west-2 bts flights all.csv -s ',' -E '"'

Debug level set to 3
Column delimiter : ,
ModuleType pm
InstallType 2
PM Count 1
CPI CmdLineArgs : /usr/bin/cpimport.bin -L /var/log/mariadb/columnstore/cpimport/ -e 1 -d 3 -s , -E " -P pm1-17231 -T SYSTEM -y FAKEAUTHKEY -K FAKEAUTHSECRET -t sample-columnstore-data -g us-west-2 -H s3.us-west-2.amazonaws.com -u38ea2fef-adc0-4d82-b567-88dfadb788e4 bts flights all.csv
Locale = en_US.UTF-8
Debug level is set to 3
Column delimiter : ,
Enclosed by Character : "
Using table OID 3016 as the default JOB ID
Input file will be read from S3 Bucket : sample-columnstore-data, file/object : /var/log/mariadb/columnstore/data/bulk/tmpjob/3016_D20210303_T175823_S757723_Job_3016.xml
Job description file : /var/log/mariadb/columnstore/data/bulk/tmpjob/3016_D20210303_T175823_S757723_Job_3016.xml
Log file for this job: /var/log/mariadb/columnstore/cpimport/Job_3016.log
2021-03-03 17:58:23 (17231:140074827606144) INFO : successfully loaded job file /var/log/mariadb/columnstore/data/bulk/tmpjob/3016_D20210303_T175823_S757723_Job_3016.xml
2021-03-03 17:58:23 (17231:140074827606144) INFO : Job file loaded, run time for this step : 0.0347769 seconds
2021-03-03 17:58:23 (17231:140074827606144) INFO : PreProcessing check starts
2021-03-03 17:58:23 (17231:140074827606144) INFO : PreProcessing check completed
2021-03-03 17:58:23 (17231:140074827606144) INFO : preProcess completed, run time for this step : 0.180334 seconds
2021-03-03 17:58:23 (17231:140074827606144) INFO : No of Read Threads Spawned = 1
2021-03-03 17:58:23 (17231:140074827606144) INFO : No of Parse Threads Spawned = 3
2021-03-03 18:00:56 (17231:140073043490560) INFO : Dictionary cache flush: 483328, 483328, 491520, 491520, 499712, 499894, 475136, 481816, 507904, 508716, 516096, 516939, 524288, 525293, 532480, 533172, 540672, 541684,
2021-03-03 18:00:56 (17231:140073043490560) INFO : For table bts.flights: 38083735 rows processed and 38083735 rows inserted.
2021-03-03 18:00:56 (17231:140074827606144) INFO : Bulk load completed, total run time : 153.233 seconds

which takes 153.233 seconds.

The next example uses the method described here: https://mariadb.com/kb/en/columnstore-bulk-data-loading/#bulk-loading-from-aws-s3

Method 2

[root@ip-172-31-43-62 cpimport]# aws s3 cp --quiet s3://sample-columnstore-data/all.csv - | cpimport -d 3 -e 1 bts flights -s ',' -E '"'

Debug level set to 3
Column delimiter : ,
ModuleType pm
InstallType 2
PM Count 1
CPI CmdLineArgs : /usr/bin/cpimport.bin -L /var/log/mariadb/columnstore/cpimport/ -e 1 -d 3 -s , -E " -P pm1-17281 -T SYSTEM -u9069b047-adf4-432f-87fe-62149c7302c7 bts flights
Locale = en_US.UTF-8
Debug level is set to 3
Column delimiter : ,
Enclosed by Character : "
Using table OID 3016 as the default JOB ID
Input file(s) will be read from : STDIN
Job description file : /var/log/mariadb/columnstore/data/bulk/tmpjob/3016_D20210303_T180157_S709597_Job_3016.xml
Log file for this job: /var/log/mariadb/columnstore/cpimport/Job_3016.log
2021-03-03 18:01:57 (17281:140276622887040) INFO : successfully loaded job file /var/log/mariadb/columnstore/data/bulk/tmpjob/3016_D20210303_T180157_S709597_Job_3016.xml
2021-03-03 18:01:57 (17281:140276622887040) INFO : Job file loaded, run time for this step : 0.0354359 seconds
2021-03-03 18:01:57 (17281:140276622887040) INFO : PreProcessing check starts
2021-03-03 18:01:58 (17281:140276622887040) INFO : PreProcessing check completed
2021-03-03 18:01:58 (17281:140276622887040) INFO : preProcess completed, run time for this step : 0.325933 seconds
2021-03-03 18:01:58 (17281:140276622887040) INFO : No of Read Threads Spawned = 1
2021-03-03 18:01:58 (17281:140275337512704) INFO : Reading input from STDIN to import into table bts.flights...
2021-03-03 18:01:58 (17281:140276622887040) INFO : No of Parse Threads Spawned = 3
2021-03-03 18:02:38 (17281:140275329120000) INFO : Dictionary cache flush: 483328, 699392, 874496, 1098752, 491520, 674816, 882688, 1123328, 500020, 683177, 899282, 1107259, 1275591, 714151, 897700, 1121671, 508919, 692684, 908058, 1091086, 517140, 717157, 916277, 1132055, 525585, 725868, 932899, 1140478, 533465, 732979, 924348, 1075005, 541975, 742210, 941112, 1083145,
2021-03-03 18:02:38 (17281:140275329120000) INFO : For table bts.flights: 38083735 rows processed and 38083735 rows inserted.
2021-03-03 18:02:39 (17281:140276622887040) INFO : Bulk load completed, total run time : 41.367 seconds

which takes 41.367 seconds.

We can see that Method 1 is roughly 3.7x slower (153.233 s vs. 41.367 s).
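The slowdown ratio can be recomputed directly from the two "Bulk load completed" log lines above; a quick one-liner (assuming awk is available):

```shell
# Slowdown of Method 1 (S3 flags) relative to Method 2 (aws-cli pipe),
# using the total run times reported by cpimport.
awk 'BEGIN { printf "%.2fx\n", 153.233 / 41.367 }'
# prints "3.70x"
```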



 Comments   
Comment by Ben Thompson (Inactive) [ 2022-03-11 ]

Because the Method 2 approach of using aws-cli to pipe the file into cpimport will always perform better on larger object files, due to how the libmarias3 library is written, the options to load S3 objects as input files via command-line arguments have been removed from cpimport. It would be possible to bring this back in the future with some major rewrites, but that seems like unnecessary work given the alternative options. MCOL-5013 has also been proposed for future development.

The second reason to remove this is that the current implementation will crash cpimport with an out-of-memory (OOM) error if an S3 object larger than system memory is used as input data. The simple solution for now is to use alternative tooling to download S3 objects as input for cpimport. Removing the option eliminates configurations that could potentially crash cpimport.
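The recommended replacement is the streaming pipeline shown in Method 2: aws-cli writes the object to stdout and cpimport reads from STDIN, so the file is never buffered whole in memory. A minimal sketch, where `build_import_cmd` is a hypothetical helper (not part of cpimport or aws-cli) and the bucket, object, schema, and table names are the ones from this ticket:

```shell
# Hypothetical helper that assembles the streaming import pipeline.
# The aws-cli "-" destination means "write the object to stdout"; with no
# input file argument, cpimport reads from STDIN.
build_import_cmd() {
  local bucket="$1" object="$2" schema="$3" table="$4"
  printf 'aws s3 cp --quiet s3://%s/%s - | cpimport -e 1 %s %s\n' \
    "$bucket" "$object" "$schema" "$table"
}

build_import_cmd sample-columnstore-data all.csv bts flights
# prints "aws s3 cp --quiet s3://sample-columnstore-data/all.csv - | cpimport -e 1 bts flights"
```

Running the printed pipeline requires aws-cli and cpimport to be installed, of course; the helper only shows how the pieces fit together.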

Comment by David Hall (Inactive) [ 2022-03-21 ]

QA: After this patch, method 1 above will no longer be accepted.

Comment by Daniel Lee (Inactive) [ 2022-03-24 ]

Build tested: 6.3.1-1 (#4139)

Verified that cpimport's S3 option has been removed and help text has been updated.

There is an issue with "load data infile" after the change.

When the columnstore_use_import_for_batchinsert variable is set to ON, "load data infile" uses cpimport to perform batch inserts for better performance. This method of data loading now returns an error. Setting columnstore_use_import_for_batchinsert to OFF avoids the issue.

By default, columnstore_use_import_for_batchinsert is set to ON.

This error also occurs when S3 is not used.

MariaDB [mytest]> show variables like "%columnstore_use_import%";
+----------------------------------------+-------+
| Variable_name                          | Value |
+----------------------------------------+-------+
| columnstore_use_import_for_batchinsert | ON    |
+----------------------------------------+-------+
1 row in set (0.001 sec)
 
MariaDB [mytest]> load data infile "/tmp/lineitem.tbl" into table lineitem fields terminated by "|";
ERROR 1030 (HY000): Got error -1 "Internal error < 0 (Not system error)" from storage engine ColumnStore
 
MariaDB [mytest]> set columnstore_use_import_for_batchinsert = OFF;
Query OK, 0 rows affected (0.000 sec)
 
MariaDB [mytest]> show variables like "%columnstore_use_import%";
+----------------------------------------+-------+
| Variable_name                          | Value |
+----------------------------------------+-------+
| columnstore_use_import_for_batchinsert | OFF   |
+----------------------------------------+-------+
1 row in set (0.001 sec)
 
MariaDB [mytest]> load data infile "/tmp/lineitem.tbl" into table lineitem fields terminated by "|";
Query OK, 6001215 rows affected (2 min 8.020 sec)    
Records: 6001215  Deleted: 0  Skipped: 0  Warnings: 0

Comment by alexey vorovich (Inactive) [ 2022-03-31 ]

ben.thompson, David.Hall, please update the ticket:

  1. Was the initial change by Ben removed?
  2. In the current code, does load data infile use cpimport by default, which uses the libmarias3 library?
Comment by Daniel Lee (Inactive) [ 2022-04-04 ]

Build verified: 6.3.1-1 (#4234)

Verified that:
- the -H option is no longer available
- cpimport help text has been updated
- LDI ("load data infile") works correctly

Comment by Leonid Fedorov [ 2022-04-05 ]

Hi guys. For data loading I need cpimport's S3 support, so can we just delete the documentation for the S3 flags but keep their support in cpimport?

Comment by Leonid Fedorov [ 2022-04-05 ]

alexey.vorovich

Comment by alexey vorovich (Inactive) [ 2022-04-05 ]

I think leonid.fedorov agreed to uncomment the feature for his own private testing, so we should be finished with this ticket and NOT require any new ones.

Comment by Leonid Fedorov [ 2022-04-05 ]

That's correct. I'll do my feature testing in my own branch of cpimport; no need for any extra issues or anyone else's work.

Generated at Thu Feb 08 02:51:23 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.