[MCOL-3514] Make cpimport read from data in S3 buckets Created: 2019-09-24  Updated: 2020-02-11  Resolved: 2020-02-11

Status: Closed
Project: MariaDB ColumnStore
Component/s: None
Affects Version/s: None
Fix Version/s: 1.4.0

Type: New Feature Priority: Major
Reporter: Andrew Hutchings (Inactive) Assignee: Daniel Lee (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Sub-Tasks:
Key
Summary
Type
Status
Assignee
MCOL-3520 Fix cpimport S3 multi-PM usage Sub-Task Closed Daniel Lee  
MCOL-3861 CLONE - Fix cpimport S3 multi-PM usage Sub-Task Closed Daniel Lee  
Sprint: 2019-06, 2020-1, 2020-2

 Description   

cpimport needs new options to allow it to read a source file from an Amazon S3 bucket.



 Comments   
Comment by Andrew Hutchings (Inactive) [ 2019-09-24 ]

Implementation details...

New options for cpimport:

        -y      S3 Authentication Key (for S3 imports)
        -K      S3 Authentication Secret (for S3 imports)
        -t      S3 Bucket (for S3 imports)
        -H      S3 Hostname (for S3 imports, Amazon's S3 default)
        -g      S3 Region (for S3 imports)

The hostname only needs to be supplied if the S3 server is not Amazon's.

cpimport will then use the supplied path/filename to retrieve the object from the S3 bucket into memory and load it. You will need enough spare RAM to hold the entire CSV file.
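The option set above can be sketched as a small helper that assembles the cpimport invocation. This is an illustrative sketch, not ColumnStore source code: the function name and all database/bucket/credential values are placeholders, and the flag semantics are taken from the option list above.

```python
# Hedged sketch: build the argv list for a cpimport S3 import using the
# new options described in this ticket. All values below are placeholders.

def build_cpimport_s3_cmd(db, table, source, key, secret, bucket,
                          region, host=None):
    """Assemble a cpimport command line for an S3 import."""
    cmd = ["cpimport", db, table, source,
           "-y", key,      # S3 Authentication Key
           "-K", secret,   # S3 Authentication Secret
           "-t", bucket,   # S3 Bucket
           "-g", region]   # S3 Region
    if host is not None:
        # -H is only needed when the S3 server is not Amazon's
        cmd += ["-H", host]
    return cmd

# Example mirroring the test command from the comments (placeholder creds):
cmd = build_cpimport_s3_cmd("mytest", "lineitem", "lineitem.tbl",
                            "MYKEY", "MYSECRET", "dleeqatest", "us-west-2")
print(" ".join(cmd))
```

The command would then be executed with something like `subprocess.run(cmd)`; keep in mind the whole CSV is pulled into memory, so size the host's RAM accordingly.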

Comment by Daniel Lee (Inactive) [ 2019-09-27 ]

Build tested: 1.4.0-1

[dlee@master centos7]$ cat gitversionInfo.txt
engine commit:
1f47534

Running the test on a multi-node (1 UM + 2 PM) setup returned an error:

/usr/local/mariadb/columnstore/bin/cpimport mytest lineitem lineitem.tbl -y [mykey] -K [mysecret] -t dleeqatest -g us-west-2

2019-09-27 18:25:08 (9124) ERR : Could not open Input file lineitem.tbl

It worked on a single-node stack:

/usr/local/mariadb/columnstore/bin/cpimport mytest lineitem lineitem.tbl -y [mykey] -K [mysecret] -t dleeqatest -g us-west-2
Locale is : C

Using table OID 3017 as the default JOB ID
Input file will be read from S3 Bucket : dleeqatest, file/object : /usr/local/mariadb/columnstore/data/bulk/tmpjob/3017_D20190927_T185039_S235758_Job_3017.xml
Job description file : /usr/local/mariadb/columnstore/data/bulk/tmpjob/3017_D20190927_T185039_S235758_Job_3017.xml
Log file for this job: /usr/local/mariadb/columnstore/data/bulk/log/Job_3017.log
2019-09-27 18:50:39 (16701) INFO : successfully loaded job file /usr/local/mariadb/columnstore/data/bulk/tmpjob/3017_D20190927_T185039_S235758_Job_3017.xml
2019-09-27 18:50:39 (16701) INFO : Job file loaded, run time for this step : 0.21343 seconds
2019-09-27 18:50:39 (16701) INFO : PreProcessing check starts
2019-09-27 18:50:55 (16701) INFO : PreProcessing check completed
2019-09-27 18:50:55 (16701) INFO : preProcess completed, run time for this step : 15.8133 seconds
2019-09-27 18:50:55 (16701) INFO : No of Read Threads Spawned = 1
2019-09-27 18:50:55 (16701) INFO : No of Parse Threads Spawned = 3
2019-09-27 18:50:56 (16701) INFO : For table mytest.lineitem: 6005 rows processed and 6005 rows inserted.
2019-09-27 18:50:57 (16701) INFO : Bulk load completed, total run time : 18.0366 seconds

[root@localhost ~]# mcsmysql mytest
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 13
Server version: 10.4.8-3-MariaDB-log Source distribution

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [mytest]> select count(*) from lineitem;
+----------+
| count(*) |
+----------+
|     6005 |
+----------+
1 row in set (0.119 sec)

Comment by Daniel Lee (Inactive) [ 2020-02-11 ]

Verified sub-tasks. Closing this ticket now.

Generated at Thu Feb 08 02:43:17 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.