[MCOL-5013] Support Load data from AWS S3 : UDF : columnstore_info.load_from_s3 Created: 2022-03-09  Updated: 2023-07-21  Resolved: 2022-11-21

Status: Closed
Project: MariaDB ColumnStore
Component/s: None
Affects Version/s: 6.2.3
Fix Version/s: 23.02.1

Type: New Feature Priority: Major
Reporter: Todd Stoffel (Inactive) Assignee: Leonid Fedorov
Resolution: Fixed Votes: 0
Labels: None

Attachments: File create_table.sql     PNG File hangs-though-debug.log-shows-failed.png     File test_data.csv    
Issue Links:
Blocks
is blocked by MDEV-28395 LOAD DATA transfer plugins Open
Duplicate
duplicates MCOL-2038 mcsimport load time is significantly... Closed
duplicates MCOL-2040 mcsimport load is executed with wors... Closed
duplicates MCOL-2080 mcsimport hangs towards mcs system wi... Closed
duplicates MCOL-2226 Improve performance of mcsimport - ma... Closed
is duplicated by MCOL-5032 cpimport via UDAF(user defined functi... Closed
PartOf
includes MCOL-5271 Google cloud data load Closed
Relates
relates to MCOL-3928 Add Option To Read From a Lakehouse (... Open
relates to MCOL-5419 load from S3 | when columns dont mat... Open
relates to MCOL-5420 load from s3 | inserted count incorrect Open
relates to MCOL-5506 columnstore_info.load_from_s3 returns... Closed
relates to MXS-4618 Load data from S3 Closed
relates to MCOL-4139 Replace calpontsys schema name with n... Closed
relates to MCOL-5509 columnstore_info.load_from_s3 | Misle... Closed
relates to MCOL-5510 columnstore_info.load_from_s3 | Conne... Closed
relates to MXS-4653 Integrate LDI filter with ColumnStore... Closed
Sub-Tasks:
Key
Summary
Type
Status
Assignee
MCOL-5155 CMAPI : support load data from s3 Sub-Task Closed Alan Mologorsky  
Sprint: 2021-17, 2022-22
Assigned for Review: Alan Mologorsky
Assigned for Testing: Daniel Lee (Inactive)

 Description   

ColumnStore should support plain SQL syntax for loading data directly from S3 buckets.

The current implementation for 22.08 is:

usage:

AWS

set columnstore_s3_key='<s3_key>';
set columnstore_s3_secret='<s3_secret>';
set columnstore_s3_region='<region>';
 
CALL columnstore_info.load_from_s3("<bucket>", "<file_name>", "<db_name>", "<table_name>", "<terminated_by>",  "<enclosed_by>", "<escaped_by>");
 
EXAMPLE
 
CALL columnstore_info.load_from_s3("s3://columnstore-test", "data1.csv", "d1", "t1", ",", "", "" )

or for Google Storage

set columnstore_s3_key='GOOGXXXxxxx';
set columnstore_s3_secret='XXXXXXXXXX';
CALL columnstore_info.load_from_s3("gs://columnstore-test", "test.tbl", "test", "gg", "|", "", "");

The last three parameters are the same as cpimport's -s, -E, and -C options.

terminated_by:

The delimiter between column values. Mandatory.

enclosed_by:

The character that encloses field values, if any. Optional; can be the empty string "".

escaped_by:

The escape character, used in conjunction with enclosed_by or as part of the NULL escape sequence ('\N'). Optional; can be the empty string "". The default is '\'.

enclosed_by and escaped_by can be set to "" to use the defaults.
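As an illustrative sketch of how these three parameters could be mapped onto a cpimport command line (the -s/-E/-C correspondence is from this ticket; the helper name and the treat-""-as-default behavior are assumptions, not ColumnStore code):

```python
# Sketch: map load_from_s3's delimiter parameters onto cpimport flags.
# The -s/-E/-C correspondence is stated in this ticket; treating "" as
# "omit the flag and fall back to cpimport's default" is an assumption.
def cpimport_args(db, table, terminated_by, enclosed_by="", escaped_by=""):
    if not terminated_by:
        raise ValueError("terminated_by is mandatory")
    args = ["cpimport", db, table, "-s", terminated_by]
    if enclosed_by:    # "" -> use cpimport's default
        args += ["-E", enclosed_by]
    if escaped_by:     # "" -> use the default escape character '\'
        args += ["-C", escaped_by]
    return args

print(cpimport_args("d1", "t1", ","))
```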

EXAMPLE

CALL columnstore_info.load_from_s3("s3://avorovich2", "data1.csv", "d1", "t1", ",", "", "" )

--------
Future options (not implemented in 22.08) might include:

mariadb> LOAD DATA S3 FROM 's3://blah.blah/mydata.dat' into table xyz FIELDS TERMINATED BY '|';

This command should invoke the backend ColumnStore API and an S3 client to stream data directly from a remote source into cpimport.

This gives us a rapid data ingestion technique that is compatible with SkySQL and provides UAC (unlike our previous utility, mcsimport).



 Comments   
Comment by Todd Stoffel (Inactive) [ 2022-03-09 ]

Assigning to David.Hall for triage.

Comment by David Hall (Inactive) [ 2022-03-09 ]

Currently, LOAD DATA INFILE is implemented by the server. It opens the file, converts the rows to a binary format, and streams that binary to the engine, which then stores it according to its own rules.
To implement something like this would require coordination with the server team. It could be done two ways. I highly favor the first because it's universal and will work with all engines.

1) MDB would need to learn about s3 in order to open the file and do all the s3 streaming.
2) MDB would have to be taught the new syntax (in the parser) and know to pass the thing to us in some fashion.

In the first case, all the effort is on the Server team. For engines, it would just work.
In the second case, Columnstore team would do the lion's share but some Server work is needed to parse and pass.

Comment by alexey vorovich (Inactive) [ 2022-04-01 ]

David.Hall, thanks.

1. Does the server read the large CSV file into memory and then call us with a full buffer, or does it pass a stream descriptor that we use?
2. Who in the server group do we need to discuss the change (any change) in this area with?

Comment by alexey vorovich (Inactive) [ 2022-04-01 ]

And 3:

Let's say we adopt your method 2:

will we have to implement all the parsing options of LDI?

Comment by Leonid Fedorov [ 2022-07-08 ]

I think it's a Drone build setting issue, but don't worry, the feature is merged into the current branch.
We can use either this build or any latest current build.

Comment by Daniel Lee (Inactive) [ 2022-07-22 ]

Build tested: Engine 22.08-1 (#5040), cmapi (#694)

Verified the following:

1. Loaded a 1 GB lineitem dataset from AWS S3.
2. Compared all ColumnStore data to lineitem1, which was imported locally.
3. Loaded a 10 GB lineitem dataset from AWS S3.

Currently having issues uploading large datasets to AWS S3.
I will continue testing when the issue is resolved.

Negative tests:

Error messages should be more informative

*** No credential setup
 
MariaDB [mytest]> call calpontsys.columnstore_load_from_s3("lineitem", "1g/lineitem.tbl", "dleeqadata", "mytest");
+-----------------------------------------------------------+
| columnstore_dataload(tablename, filename, bucket, dbname) |
+-----------------------------------------------------------+
| {"error": "key not provided"}                             |
+-----------------------------------------------------------+
1 row in set (0.025 sec)
 
Query OK, 0 rows affected (0.026 sec)
 
 
*** Invalid access and secret keys
 
MariaDB [mytest]> call calpontsys.columnstore_load_from_s3("lineitem", "lineitem.tbl", "dleeqadata", "mytest");
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| columnstore_dataload(tablename, filename, bucket, dbname)                                                                                                                                        |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| {"error": {"success": false, "error": "2022-07-22 19:58:09 (1201) ERR  : Error retrieving file lineitem.tbl from S3: The AWS Access Key Id you provided does not exist in our records. [1052]"}} |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (2.204 sec)
 
 
*** Correct credential, but wrong S3 data bucket
 
MariaDB [mytest]>  call calpontsys.columnstore_load_from_s3("lineitem", "lineitem.tbl", "helloworld", "mytest");
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| columnstore_dataload(tablename, filename, bucket, dbname)                                                                                                                                                                         |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| {"error": {"success": false, "error": "2022-07-22 20:00:16 (1272) ERR  : Error retrieving file lineitem.tbl from S3: The authorization header is malformed; the region 'us-west-2' is wrong; expecting 'ap-northeast-1' [1052]"}} |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (1.192 sec)
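Since both success and failure come back as JSON in a single result column, as the outputs above show, a client or test harness has to parse the payload to classify the outcome. A minimal sketch using only the payload shapes observed above; load_succeeded is a hypothetical helper, not part of ColumnStore:

```python
import json

# Sketch: classify a columnstore_load_from_s3 result payload.
# Payload shapes are taken from the test outputs above.
def load_succeeded(result_json):
    doc = json.loads(result_json)
    if "error" in doc:
        err = doc["error"]
        # the error value is either a plain string or a nested object
        msg = err if isinstance(err, str) else err.get("error", str(err))
        return False, msg
    return doc.get("success") is True, ""

print(load_succeeded('{"error": "key not provided"}'))
```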

Comment by Daniel Lee (Inactive) [ 2022-07-27 ]

Close per last test results.

Comment by Daniel Lee (Inactive) [ 2022-08-02 ]

Successfully loaded 75GB of lineitem dataset from an AWS S3 bucket. It took almost 4.5 hours.

MariaDB [mytest]> call calpontsys.columnstore_load_from_s3("lineitem", "1g/lineitem.tbl", "dleeqadata", "mytest");
+------------------------------------------------------------------+
| columnstore_dataload(tablename, filename, bucket, dbname)        |
+------------------------------------------------------------------+
| {"success": true, "inserted": "6001215", "processed": "6001215"} |
+------------------------------------------------------------------+
1 row in set (40.242 sec)
 
Query OK, 0 rows affected (40.242 sec)
 
MariaDB [mytest]> call calpontsys.columnstore_load_from_s3("lineitem", "dbt3/100g/lineitem.tbl", "mcsmtrdata", "mytest");
 
+----------------------------------------------------------------------+
| columnstore_dataload(tablename, filename, bucket, dbname)            |
+----------------------------------------------------------------------+
| {"success": true, "inserted": "600037902", "processed": "600037902"} |
+----------------------------------------------------------------------+
1 row in set (4 hours 26 min 25.025 sec)
 
Query OK, 0 rows affected (4 hours 26 min 25.027 sec)
 
MariaDB [mytest]> 
MariaDB [mytest]> select count(*) from lineitem;
+-----------+
| count(*)  |
+-----------+
| 606039117 |
+-----------+
1 row in set (29.650 sec)
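As a back-of-envelope check on the timings reported in this comment, the two timed loads work out to broadly similar per-row ingest rates (using only the figures above):

```python
# Back-of-envelope ingest rates from the two timed loads reported above.
runs = [
    ("1g lineitem", 6001215, 40.242),
    ("dbt3 100g lineitem", 600037902, 4 * 3600 + 26 * 60 + 25),  # 4h 26m 25s
]
for label, rows, seconds in runs:
    print(f"{label}: {rows / seconds:,.0f} rows/s")
```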

Comment by alexey vorovich (Inactive) [ 2022-08-16 ]

leonid.fedorov, I suspect this (1 and 2) is a small change. Could you do this tomorrow?
Testing can be brief as well.

Comment by Leonid Fedorov [ 2022-08-18 ]

I've changed the order of params, the function name, and the schema name.
Session variables are hard to change, because the columnstore_ prefix is added by the
server to variables declared by the plugin.

So the syntax is

CALL columnstore_info.load_from_s3("<bucket>", "<file_name>", "<db_name>", "<table_name>");

but the settings are

set columnstore_s3_key='<s3_key>';
set columnstore_s3_secret='<s3_secret>';
set columnstore_s3_region='<region>';

Comment by Leonid Fedorov [ 2022-08-18 ]

Changes can be tested within build 5319

Comment by Daniel Lee (Inactive) [ 2022-08-18 ]

Build 22.08-1 (#5319, #5321)

The change is not in these builds. I noticed that the change was merged about 1 hour ago. I will wait for the next PR build or the cron build tomorrow.

Comment by alexey vorovich (Inactive) [ 2022-08-19 ]

there was a discussion last night about performance switches for cpimport https://mariadb.slack.com/archives/C03UMC283ND/p1660852966907919

leonid.fedorov please post the progress

Comment by Todd Stoffel (Inactive) [ 2022-10-04 ]

The goal for our load data from S3 project is to provide the same options as here:
https://mariadb.com/kb/en/load-data-infile/
terminated by, enclosed by, escaped by.
In cpimport those would be -s, -E, -C.

Comment by Leonid Fedorov [ 2022-10-19 ]

All 7 parameters are implemented, as is the use of the AWS CLI to download S3 data.
To test the PR, use:
cmapi https://cspkg.s3.amazonaws.com/index.html?prefix=cmapi/develop/pull_request/757/amd64/
columnstore https://cspkg.s3.amazonaws.com/index.html?prefix=develop/pull_request/5742/10.6-enterprise/amd64/

example of usage

create table tt (a int, b text) engine columnstore;
 
set columnstore_s3_region='us-east-1';
set columnstore_s3_key='correct key';
set columnstore_s3_secret='correct secret';
CALL columnstore_info.load_from_s3("s3://mdb-s3-test", "test.cvs", "test", "tt", "|", "", "");

Comment by Leonid Fedorov [ 2022-10-19 ]

function is defined as

CREATE OR REPLACE PROCEDURE load_from_s3 (in bucket varchar(256) CHARACTER SET utf8,
                                          in filename varchar(256) CHARACTER SET utf8,
                                          in dbname varchar(256) CHARACTER SET utf8,
                                          in table_name varchar(256) CHARACTER SET utf8,
                                          in terminated_by varchar(256) CHARACTER SET utf8,
                                          in enclosed_by varchar(1) CHARACTER SET utf8,
                                          in escaped_by varchar(1) CHARACTER SET utf8
                                          )

Comment by Leonid Fedorov [ 2022-10-20 ]

Yes, defaults can be set with an empty string. Functions don't support default parameters, so this is a workaround.
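The empty-string-means-default convention described here can be sketched as a small normalization step (only the '\' default for escaped_by is stated in this ticket; normalize_delims is a hypothetical helper, not actual ColumnStore code):

```python
# Sketch of the empty-string-means-default convention described above.
# Only the '\' default for escaped_by is stated in this ticket.
def normalize_delims(terminated_by, enclosed_by="", escaped_by=""):
    if not terminated_by:
        raise ValueError("terminated_by is mandatory")
    return {
        "terminated_by": terminated_by,
        "enclosed_by": enclosed_by or None,  # None: fields are not enclosed
        "escaped_by": escaped_by or "\\",    # '' falls back to the default '\'
    }

print(normalize_delims("|"))
```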

Comment by Leonid Fedorov [ 2022-10-20 ]

The column separator has a non-default value; just set it as the 5th parameter.

Comment by Daniel Lee (Inactive) [ 2022-10-21 ]

Build tested: See drone link provided above

Verified the terminated_by, enclosed_by, and escaped_by characters. The terminated_by character is required; the last two are optional.

I tried loading a 1 GB lineitem table (data size is about 750 MB) and the job never finished overnight. I tried files with "|" or "," as the terminated_by character.

Comment by alexey vorovich (Inactive) [ 2022-10-21 ]

I tried a smaller file (10k rows, one column). Seems OK; will try bigger files later.

2022-10-21 17:59:46.715517 connecting 0 localhost 3307 pgmabv99 Lena8484!
2022-10-21 17:59:46.718683 Connected ok ihost= 0
2022-10-21 17:59:46.718809 i_host 0 : SHOW STATUS LIKE 'columnstore%';
2022-10-21 17:59:46.719682 {'Variable_name': 'Columnstore_commit_hash', 'Value': '57df5e9-dirty'}
2022-10-21 17:59:46.719794 {'Variable_name': 'Columnstore_version', 'Value': '22.08.2'}
2022-10-21 17:59:46.719872 i_host 0 : drop  database  IF EXISTS d1
2022-10-21 17:59:47.010703 i_host 0 : create database d1
2022-10-21 17:59:47.011583 i_host 0 : use d1
2022-10-21 17:59:47.012231 i_host 0 : drop table IF EXISTS t1
2022-10-21 17:59:47.052555 i_host 0 : create table   t1 (f1 integer) engine=columnstore
2022-10-21 17:59:47.383463 i_host 0 : set columnstore_s3_key="xxxxxxH3XSA7W";
2022-10-21 17:59:47.384267 i_host 0 : set columnstore_s3_secret="xxxxxxxxxxxxxxxX+mJQLbg9rup";
2022-10-21 17:59:47.384953 i_host 0 : set columnstore_s3_region='us-east-2';
2022-10-21 17:59:47.385480 i_host 0 : CALL columnstore_info.load_from_s3("s3://avorovich2", "data1.csv", "d1", "t1", ",", "", "" )
2022-10-21 17:59:49.678133 {'columnstore_dataload(bucket, filename, dbname, table_name, terminated_by, enclosed_by, escaped_by)': b'{"success": true, "inserted": "10000", "processed": "10000"}'}
2022-10-21 17:59:49.680068 i_host 0 : SELECT COUNT(*)  FROM t1
2022-10-21 17:59:49.730002 {'COUNT(*)': 10000}
2022-10-21 17:59:49.730170 i_host 0 : select * from information_schema.columnstore_tables
2022-10-21 17:59:49.746288 {'TABLE_SCHEMA': 'd1', 'TABLE_NAME': 't1', 'OBJECT_ID': 3015, 'CREATION_DATE': datetime.datetime(2022, 10, 21, 0, 0), 'COLUMN_COUNT': 1, 'AUTOINCREMENT': None}
2022-10-21 17:59:49.746433 i_host 0 : SELECT dbroot,min_value,max_value , high_water_mark, status  FROM information_schema.columnstore_extents order by status
2022-10-21 17:59:49.753978 {'dbroot': 2, 'min_value': Decimal('1'), 'max_value': Decimal('1'), 'high_water_mark': 2, 'status': 'Available'}
2022-10-21 17:59:49.754245 {'dbroot': 2, 'min_value': Decimal('0'), 'max_value': Decimal('9999'), 'high_water_mark': 8, 'status': 'Available'}
2022-10-21 17:59:49.754357 saving logs ++++++++++++++++++++++++++++++++
2022-10-21 17:59:49.761419 
2022-10-21 17:59:49.763677 
2022-10-21 17:59:50.275734 tar -cvf /root/tmp/docker-logs-dbms.tar /root/tmp/docker_logs_dbms

Comment by Daniel Lee (Inactive) [ 2022-10-25 ]

Retested the build. At first, I still had issues loading 1 MB and 1 GB lineitem datasets, with 6005 and 6001215 rows respectively. The commands hung for over 10 minutes without any terminal output, and I could not find any log files. After a few tries, things just started to work. I was even able to load a 10 GB lineitem dataset. I am not sure what the hanging issue was, and I have not been able to reproduce it.

Successful tests:

MariaDB [mytest]> CALL columnstore_info.load_from_s3("s3://dleeqadata", "1m/lineitem.tbl", "mytest", "lineitem", "|", "", "" );
+----------------------------------------------------------------------------------------------------+
| columnstore_dataload(bucket, filename, dbname, table_name, terminated_by, enclosed_by, escaped_by) |
+----------------------------------------------------------------------------------------------------+
| {"success": true, "inserted": "6005", "processed": "6005"}                                         |
+----------------------------------------------------------------------------------------------------+
1 row in set (2.214 sec)
 
Query OK, 0 rows affected (2.214 sec)
 
MariaDB [mytest]> CALL columnstore_info.load_from_s3("s3://dleeqadata", "1g/lineitem.tbl", "mytest", "lineitem", "|", "", "" );
+----------------------------------------------------------------------------------------------------+
| columnstore_dataload(bucket, filename, dbname, table_name, terminated_by, enclosed_by, escaped_by) |
+----------------------------------------------------------------------------------------------------+
| {"success": true, "inserted": "6001215", "processed": "6001215"}                                   |
+----------------------------------------------------------------------------------------------------+
1 row in set (16.243 sec)
 
 
MariaDB [mytest]> CALL columnstore_info.load_from_s3("s3://dleeqadata", "10g/lineitem.tbl", "mytest", "lineitem", "|", "", "" );
+----------------------------------------------------------------------------------------------------+
| columnstore_dataload(bucket, filename, dbname, table_name, terminated_by, enclosed_by, escaped_by) |
+----------------------------------------------------------------------------------------------------+
| {"success": true, "inserted": "59986052", "processed": "59986052"}                                 |
+----------------------------------------------------------------------------------------------------+
1 row in set (2 min 25.325 sec)

Comment by Daniel Lee (Inactive) [ 2022-11-21 ]

Build verified:

engine: 84bb4e56b81c2f1e99151f198bf939612b09004e
server: e3ed2f0ab1e287218ae72ecf77d95cc745795353
buildNo: 6088

Comment by Todd Stoffel (Inactive) [ 2023-01-24 ]

allen.herrera This feature is only for SkySQL. We'll document it when it goes GA in March.

Generated at Thu Feb 08 02:54:42 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.