|
Assigning to David.Hall for triage.
|
|
Currently, LOAD DATA INFILE is implemented by the server. It opens the file, converts the rows to a binary format, and streams that binary to the engine, which then stores it according to its own rules.
Implementing something like this would require coordination with the server team. It could be done in two ways. I strongly favor the first because it's universal and will work with all engines.
1) MDB would need to learn about S3 in order to open the file and do all the S3 streaming.
2) MDB would have to be taught the new syntax (in the parser) and know to pass the file to us in some fashion.
In the first case, all the effort is on the server team; for engines, it would just work.
In the second case, the Columnstore team would do the lion's share, but some server work is still needed to parse and pass it along.
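As a rough sketch of the current flow described above (the binary format and all function names here are invented for illustration, not the actual server/engine interface):

```python
# Illustrative-only sketch of the existing LOAD DATA INFILE flow: the server
# opens the file, converts each text row to a binary record, and streams the
# records to the engine, which stores them by its own rules. The "binary
# format" and the names below are invented for this sketch.
import struct

def server_stream_rows(lines, engine_write, delimiter="|"):
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        # toy binary encoding: a length-prefixed UTF-8 blob per field
        record = b"".join(
            struct.pack("<I", len(f.encode("utf-8"))) + f.encode("utf-8")
            for f in fields
        )
        engine_write(record)  # engine-specific storage happens behind this call

stored = []
server_stream_rows(["1|hello", "2|world"], stored.append)
print(len(stored))  # 2 records handed to the "engine"
```

Under option 1, only the "open the file" step changes (the server would read from an S3 stream instead); engines would be untouched.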
|
|
David.Hall thanks
1. Does the server read the large CSV file into memory and then call us with a full buffer, or does it pass a stream descriptor that we use?
2. Who in the server group do we need to discuss any change in this area with?
|
|
And 3:
let's say we adopt your method 2,
will we have to implement all the parsing options of LDI?
|
|
I think it's about a drone build setting, but don't worry, the feature is merged into the current branch.
We can use either this build or any latest current build.
|
|
Build tested: Engine 22.08-1 (#5040), cmapi (#694)
Verified the following:
1. Loaded a 1 GB lineitem dataset from AWS S3.
2. Compared all columnstore data to lineitem1, which was imported locally.
3. Loaded a 10 GB lineitem dataset from AWS S3.
Currently having issues uploading large datasets to AWS S3; I will continue testing when the issue is resolved.
Negative tests:
Error messages should be more informative.
*** No credentials set up
|
|
MariaDB [mytest]> call calpontsys.columnstore_load_from_s3("lineitem", "1g/lineitem.tbl", "dleeqadata", "mytest");
+-----------------------------------------------------------+
| columnstore_dataload(tablename, filename, bucket, dbname) |
+-----------------------------------------------------------+
| {"error": "key not provided"}                             |
+-----------------------------------------------------------+
1 row in set (0.025 sec)

Query OK, 0 rows affected (0.026 sec)
|
|
|
*** Invalid access and secret keys
|
|
MariaDB [mytest]> call calpontsys.columnstore_load_from_s3("lineitem", "lineitem.tbl", "dleeqadata", "mytest");
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| columnstore_dataload(tablename, filename, bucket, dbname) |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| {"error": {"success": false, "error": "2022-07-22 19:58:09 (1201) ERR : Error retrieving file lineitem.tbl from S3: The AWS Access Key Id you provided does not exist in our records. [1052]"}} |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (2.204 sec)
|
|
|
*** Correct credential, but wrong S3 data bucket
|
|
MariaDB [mytest]> call calpontsys.columnstore_load_from_s3("lineitem", "lineitem.tbl", "helloworld", "mytest");
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| columnstore_dataload(tablename, filename, bucket, dbname) |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| {"error": {"success": false, "error": "2022-07-22 20:00:16 (1272) ERR : Error retrieving file lineitem.tbl from S3: The authorization header is malformed; the region 'us-west-2' is wrong; expecting 'ap-northeast-1' [1052]"}} |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (1.192 sec)
|
|
|
|
Close per last test results.
|
|
Successfully loaded a 75 GB lineitem dataset from an AWS S3 bucket. It took almost 4.5 hours.
MariaDB [mytest]> call calpontsys.columnstore_load_from_s3("lineitem", "1g/lineitem.tbl", "dleeqadata", "mytest");
+------------------------------------------------------------------+
| columnstore_dataload(tablename, filename, bucket, dbname)        |
+------------------------------------------------------------------+
| {"success": true, "inserted": "6001215", "processed": "6001215"} |
+------------------------------------------------------------------+
1 row in set (40.242 sec)

Query OK, 0 rows affected (40.242 sec)
|
|
MariaDB [mytest]> call calpontsys.columnstore_load_from_s3("lineitem", "dbt3/100g/lineitem.tbl", "mcsmtrdata", "mytest");
+----------------------------------------------------------------------+
| columnstore_dataload(tablename, filename, bucket, dbname)            |
+----------------------------------------------------------------------+
| {"success": true, "inserted": "600037902", "processed": "600037902"} |
+----------------------------------------------------------------------+
1 row in set (4 hours 26 min 25.025 sec)

Query OK, 0 rows affected (4 hours 26 min 25.027 sec)
|
|
MariaDB [mytest]>
|
MariaDB [mytest]> select count(*) from lineitem;
+-----------+
| count(*)  |
+-----------+
| 606039117 |
+-----------+
1 row in set (29.650 sec)
|
|
|
|
leonid.fedorov, I suspect this (1 and 2) is a small change. Could you do this tomorrow?
Testing can be brief as well.
|
|
I've changed the order of params, the function name, and the schema name.
Session variables are hard to change, because the columnstore_ prefix is added by the
server to variables declared by the plugin.
So the syntax is:
CALL columnstore_info.load_from_s3("<bucket>", "<file_name>", "<db_name>", "<table_name>");
but the settings are:
set columnstore_s3_key='<s3_key>';
set columnstore_s3_secret='<s3_secret>';
set columnstore_s3_region='<s3_region>';
|
|
|
Changes can be tested within build 5319
|
|
Build 22.08-1 (#5319, #5321)
The change is not in these builds. I noticed that the change was merged about 1 hour ago; I will wait for the next PR build or the cron build tomorrow.
|
|
There was a discussion last night about performance switches for cpimport: https://mariadb.slack.com/archives/C03UMC283ND/p1660852966907919
leonid.fedorov, please post the progress.
|
|
The goal for our load-data-from-S3 project is to provide the same options as here:
https://mariadb.com/kb/en/load-data-infile/
terminated by, enclosed by, escaped by
In cpimport those would be -s, -E, -C.
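As an illustration of the semantics these three options carry, here is a stand-in using Python's csv module (not the cpimport code path):

```python
# Stand-in illustration of the three LOAD DATA INFILE options, using Python's
# csv module rather than cpimport:
#   TERMINATED BY -> delimiter   (cpimport -s)
#   ENCLOSED BY   -> quotechar   (cpimport -E)
#   ESCAPED BY    -> escapechar  (cpimport -C)
import csv
import io

raw = io.StringIO('1|"a|b"|c\n')
rows = list(csv.reader(raw, delimiter="|", quotechar='"', escapechar="\\"))
print(rows)  # [['1', 'a|b', 'c']] -- the quoted field keeps its embedded "|"
```

The enclosing character matters exactly when the field data can contain the terminator, as in the quoted middle field above.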
|
|
All 7 parameters are implemented, as well as use of the AWS CLI to download the S3 data.
To test the PR use:
cmapi https://cspkg.s3.amazonaws.com/index.html?prefix=cmapi/develop/pull_request/757/amd64/
columnstore https://cspkg.s3.amazonaws.com/index.html?prefix=develop/pull_request/5742/10.6-enterprise/amd64/
Example of usage:
create table tt (a int, b text) engine=columnstore;

set columnstore_s3_region='us-east-1';
set columnstore_s3_key='correct key';
set columnstore_s3_secret='correct secret';

CALL columnstore_info.load_from_s3("s3://mdb-s3-test", "test.cvs", "test", "tt", "|", "", "");
|
|
|
The procedure is defined as:
CREATE OR REPLACE PROCEDURE load_from_s3 (in bucket varchar(256) CHARACTER SET utf8,
    in filename varchar(256) CHARACTER SET utf8,
    in dbname varchar(256) CHARACTER SET utf8,
    in table_name varchar(256) CHARACTER SET utf8,
    in terminated_by varchar(256) CHARACTER SET utf8,
    in enclosed_by varchar(1) CHARACTER SET utf8,
    in escaped_by varchar(1) CHARACTER SET utf8
)
|
|
|
Yes, defaults can be selected by passing an empty string. Stored procedures don't support default parameter values, so this is a workaround.
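A minimal sketch of that workaround (the fallback values below are assumptions for illustration, not the documented defaults):

```python
def effective_options(terminated_by, enclosed_by, escaped_by):
    # Stored procedures cannot declare default parameter values, so the
    # caller passes "" to mean "use the default". The fallback values here
    # are illustrative assumptions, not the documented cpimport defaults.
    return (
        terminated_by or "|",
        enclosed_by or '"',
        escaped_by or "\\",
    )

print(effective_options(",", "", ""))  # (',', '"', '\\')
```

So `load_from_s3(..., "|", "", "")` means: use "|" as the terminator and the defaults for the other two.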
|
|
The column separator has a non-default value; just set it as the 5th parameter.
|
|
Build tested: see the drone link provided above.
Verified the terminated_by, enclosed_by, and escaped_by characters. The terminated_by character is required; the last two are optional.
I tried loading a 1 GB lineitem table (data size is about 750 MB) and the job never finished overnight. I tried files with "|" or "," as the terminated_by character.
|
|
I tried a smaller file (10k rows, one column). Seems OK; will try bigger files later.
2022-10-21 17:59:46.715517 connecting 0 localhost 3307 pgmabv99 Lena8484!
2022-10-21 17:59:46.718683 Connected ok ihost= 0
2022-10-21 17:59:46.718809 i_host 0 : SHOW STATUS LIKE 'columnstore%';
2022-10-21 17:59:46.719682 {'Variable_name': 'Columnstore_commit_hash', 'Value': '57df5e9-dirty'}
2022-10-21 17:59:46.719794 {'Variable_name': 'Columnstore_version', 'Value': '22.08.2'}
2022-10-21 17:59:46.719872 i_host 0 : drop database IF EXISTS d1
2022-10-21 17:59:47.010703 i_host 0 : create database d1
2022-10-21 17:59:47.011583 i_host 0 : use d1
2022-10-21 17:59:47.012231 i_host 0 : drop table IF EXISTS t1
2022-10-21 17:59:47.052555 i_host 0 : create table t1 (f1 integer) engine=columnstore
2022-10-21 17:59:47.383463 i_host 0 : set columnstore_s3_key="xxxxxxH3XSA7W";
2022-10-21 17:59:47.384267 i_host 0 : set columnstore_s3_secret="xxxxxxxxxxxxxxxX+mJQLbg9rup";
2022-10-21 17:59:47.384953 i_host 0 : set columnstore_s3_region='us-east-2';
2022-10-21 17:59:47.385480 i_host 0 : CALL columnstore_info.load_from_s3("s3://avorovich2", "data1.csv", "d1", "t1", ",", "", "" )
2022-10-21 17:59:49.678133 {'columnstore_dataload(bucket, filename, dbname, table_name, terminated_by, enclosed_by, escaped_by)': b'{"success": true, "inserted": "10000", "processed": "10000"}'}
2022-10-21 17:59:49.680068 i_host 0 : SELECT COUNT(*) FROM t1
2022-10-21 17:59:49.730002 {'COUNT(*)': 10000}
2022-10-21 17:59:49.730170 i_host 0 : select * from information_schema.columnstore_tables
2022-10-21 17:59:49.746288 {'TABLE_SCHEMA': 'd1', 'TABLE_NAME': 't1', 'OBJECT_ID': 3015, 'CREATION_DATE': datetime.datetime(2022, 10, 21, 0, 0), 'COLUMN_COUNT': 1, 'AUTOINCREMENT': None}
2022-10-21 17:59:49.746433 i_host 0 : SELECT dbroot,min_value,max_value , high_water_mark, status FROM information_schema.columnstore_extents order by status
2022-10-21 17:59:49.753978 {'dbroot': 2, 'min_value': Decimal('1'), 'max_value': Decimal('1'), 'high_water_mark': 2, 'status': 'Available'}
2022-10-21 17:59:49.754245 {'dbroot': 2, 'min_value': Decimal('0'), 'max_value': Decimal('9999'), 'high_water_mark': 8, 'status': 'Available'}
2022-10-21 17:59:49.754357 saving logs ++++++++++++++++++++++++++++++++
2022-10-21 17:59:49.761419
2022-10-21 17:59:49.763677
2022-10-21 17:59:50.275734 tar -cvf /root/tmp/docker-logs-dbms.tar /root/tmp/docker_logs_dbms
|
|
|
Retested the build. At first I still had issues loading the 1 MB and 1 GB lineitem datasets (6005 and 6001215 rows respectively): the commands hung for over 10 minutes without any terminal output, and I could not find any log files. After a few tries things just started to work, and I was even able to load a 10 GB lineitem dataset. I am not sure what the hanging issue was, and I have not been able to reproduce it.
Successful tests:
MariaDB [mytest]> CALL columnstore_info.load_from_s3("s3://dleeqadata", "1m/lineitem.tbl", "mytest", "lineitem", "|", "", "" );
+----------------------------------------------------------------------------------------------------+
| columnstore_dataload(bucket, filename, dbname, table_name, terminated_by, enclosed_by, escaped_by) |
+----------------------------------------------------------------------------------------------------+
| {"success": true, "inserted": "6005", "processed": "6005"}                                         |
+----------------------------------------------------------------------------------------------------+
1 row in set (2.214 sec)

Query OK, 0 rows affected (2.214 sec)
|
|
MariaDB [mytest]> CALL columnstore_info.load_from_s3("s3://dleeqadata", "1g/lineitem.tbl", "mytest", "lineitem", "|", "", "" );
+----------------------------------------------------------------------------------------------------+
| columnstore_dataload(bucket, filename, dbname, table_name, terminated_by, enclosed_by, escaped_by) |
+----------------------------------------------------------------------------------------------------+
| {"success": true, "inserted": "6001215", "processed": "6001215"}                                   |
+----------------------------------------------------------------------------------------------------+
1 row in set (16.243 sec)
|
|
|
MariaDB [mytest]> CALL columnstore_info.load_from_s3("s3://dleeqadata", "10g/lineitem.tbl", "mytest", "lineitem", "|", "", "" );
+----------------------------------------------------------------------------------------------------+
| columnstore_dataload(bucket, filename, dbname, table_name, terminated_by, enclosed_by, escaped_by) |
+----------------------------------------------------------------------------------------------------+
| {"success": true, "inserted": "59986052", "processed": "59986052"}                                 |
+----------------------------------------------------------------------------------------------------+
1 row in set (2 min 25.325 sec)
|
|
|
Build verified:
engine: 84bb4e56b81c2f1e99151f198bf939612b09004e
server: e3ed2f0ab1e287218ae72ecf77d95cc745795353
buildNo: 6088
|
|
allen.herrera This feature is only for SkySQL. We'll document it when it goes GA in March.
|