MariaDB ColumnStore
MCOL-5013

Support Load data from AWS S3 : UDF : columnstore_info.load_from_s3

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 6.2.3
    • Fix Version/s: 23.02.1
    • Component/s: None
    • Labels: None
    • Sprint: 2021-17, 2022-22

    Description

      ColumnStore should support a plain SQL syntax for loading data directly from S3 buckets.

      The current implementation for 22.08 is:

      Usage:

      AWS

      set columnstore_s3_key='<s3_key>';
      set columnstore_s3_secret='<s3_secret>';
      set columnstore_s3_region='<region>';
       
      CALL columnstore_info.load_from_s3("<bucket>", "<file_name>", "<db_name>", "<table_name>", "<terminated_by>", "<enclosed_by>", "<escaped_by>");
       
      EXAMPLE
       
      CALL columnstore_info.load_from_s3("s3://columnstore-test", "data1.csv", "d1", "t1", ",", "", "" )
      

      Or, for Google Cloud Storage:

      set columnstore_s3_key='GOOGXXXxxxx';
      set columnstore_s3_secret='XXXXXXXXXX';
      CALL columnstore_info.load_from_s3("gs://columnstore-test", "test.tbl", "test", "gg", "|", "", "");
      

      The last three parameters correspond to the cpimport options -s, -E, and -C.

      terminated_by:

      The delimiter between column values. Mandatory.

      enclosed_by:

      The enclosing character, if field values are enclosed. Optional; can be the empty string "".

      escaped_by:

      The escape character, used in conjunction with enclosed_by, or as part of the NULL escape sequence ('\N'). Optional; can be the empty string "". The default is '\'.

      enclosed_by and escaped_by can be left blank to use the defaults.

      EXAMPLE

      CALL columnstore_info.load_from_s3("s3://avorovich2", "data1.csv", "d1", "t1", ",", "", "" )
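      If field values are enclosed in quotes and use backslash escapes, the optional parameters would be passed explicitly. A minimal sketch, with a placeholder bucket and file name (not from this ticket):

      CALL columnstore_info.load_from_s3("s3://example-bucket", "quoted.csv", "d1", "t1", ",", "\"", "\\");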

      --------
      A possible future option (not implemented in 22.08):

      mariadb> LOAD DATA S3 FROM 's3://blah.blah/mydata.dat' into table xyz FIELDS TERMINATED BY '|';
      

      This command should invoke the backend ColumnStore API and an S3 client to stream data directly from a remote source into cpimport.

      This gives us a rapid data-ingestion technique that is compatible with SkySQL and provides UAC, unlike our previous utility, mcsimport.
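
      On success, the procedure returns a one-row result containing a JSON summary of the load; the exact shape can be seen in the test logs below. For example, the 10,000-row test recorded in the comments returned:

      CALL columnstore_info.load_from_s3("s3://avorovich2", "data1.csv", "d1", "t1", ",", "", "");
      -- returns: {"success": true, "inserted": "10000", "processed": "10000"}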


          Activity

            dleeyh Daniel Lee (Inactive) added a comment -

            Build tested: see drone link provided above.

            Verified the terminated_by, enclosed_by, and escaped_by characters. The terminated_by character is required; the other two are optional.

            I tried loading a 1 GB lineitem table (data size is about 750 MB) and the job never finished overnight. I tried files with "|" or "," as the terminated_by character.
            alexey.vorovich alexey vorovich (Inactive) added a comment - edited

            I tried a smaller file (10k rows, one column). Seems OK; will try bigger files later.

            2022-10-21 17:59:46.715517 connecting 0 localhost 3307 pgmabv99 Lena8484!
            2022-10-21 17:59:46.718683 Connected ok ihost= 0
            2022-10-21 17:59:46.718809 i_host 0 : SHOW STATUS LIKE 'columnstore%';
            2022-10-21 17:59:46.719682 {'Variable_name': 'Columnstore_commit_hash', 'Value': '57df5e9-dirty'}
            2022-10-21 17:59:46.719794 {'Variable_name': 'Columnstore_version', 'Value': '22.08.2'}
            2022-10-21 17:59:46.719872 i_host 0 : drop  database  IF EXISTS d1
            2022-10-21 17:59:47.010703 i_host 0 : create database d1
            2022-10-21 17:59:47.011583 i_host 0 : use d1
            2022-10-21 17:59:47.012231 i_host 0 : drop table IF EXISTS t1
            2022-10-21 17:59:47.052555 i_host 0 : create table   t1 (f1 integer) engine=columnstore
            2022-10-21 17:59:47.383463 i_host 0 : set columnstore_s3_key="xxxxxxH3XSA7W";
            2022-10-21 17:59:47.384267 i_host 0 : set columnstore_s3_secret="xxxxxxxxxxxxxxxX+mJQLbg9rup";
            2022-10-21 17:59:47.384953 i_host 0 : set columnstore_s3_region='us-east-2';
            2022-10-21 17:59:47.385480 i_host 0 : CALL columnstore_info.load_from_s3("s3://avorovich2", "data1.csv", "d1", "t1", ",", "", "" )
            2022-10-21 17:59:49.678133 {'columnstore_dataload(bucket, filename, dbname, table_name, terminated_by, enclosed_by, escaped_by)': b'{"success": true, "inserted": "10000", "processed": "10000"}'}
            2022-10-21 17:59:49.680068 i_host 0 : SELECT COUNT(*)  FROM t1
            2022-10-21 17:59:49.730002 {'COUNT(*)': 10000}
            2022-10-21 17:59:49.730170 i_host 0 : select * from information_schema.columnstore_tables
            2022-10-21 17:59:49.746288 {'TABLE_SCHEMA': 'd1', 'TABLE_NAME': 't1', 'OBJECT_ID': 3015, 'CREATION_DATE': datetime.datetime(2022, 10, 21, 0, 0), 'COLUMN_COUNT': 1, 'AUTOINCREMENT': None}
            2022-10-21 17:59:49.746433 i_host 0 : SELECT dbroot,min_value,max_value , high_water_mark, status  FROM information_schema.columnstore_extents order by status
            2022-10-21 17:59:49.753978 {'dbroot': 2, 'min_value': Decimal('1'), 'max_value': Decimal('1'), 'high_water_mark': 2, 'status': 'Available'}
            2022-10-21 17:59:49.754245 {'dbroot': 2, 'min_value': Decimal('0'), 'max_value': Decimal('9999'), 'high_water_mark': 8, 'status': 'Available'}
            2022-10-21 17:59:49.754357 saving logs ++++++++++++++++++++++++++++++++
            2022-10-21 17:59:49.761419 
            2022-10-21 17:59:49.763677 
            2022-10-21 17:59:50.275734 tar -cvf /root/tmp/docker-logs-dbms.tar /root/tmp/docker_logs_dbms
            

            dleeyh Daniel Lee (Inactive) added a comment - edited

            Retested the build. At first I still had issues loading the 1 MB and 1 GB lineitem datasets (6005 and 6001215 rows respectively): the commands hung for over 10 minutes without any terminal output, and I could not find any log files. After a few tries, things just started to work, and I was even able to load a 10 GB lineitem dataset. I am not sure what the hanging issue was and I have not been able to reproduce it.

            Successful tests:

            MariaDB [mytest]> CALL columnstore_info.load_from_s3("s3://dleeqadata", "1m/lineitem.tbl", "mytest", "lineitem", "|", "", "" );
            +----------------------------------------------------------------------------------------------------+
            | columnstore_dataload(bucket, filename, dbname, table_name, terminated_by, enclosed_by, escaped_by) |
            +----------------------------------------------------------------------------------------------------+
            | {"success": true, "inserted": "6005", "processed": "6005"}                                         |
            +----------------------------------------------------------------------------------------------------+
            1 row in set (2.214 sec)
             
            Query OK, 0 rows affected (2.214 sec)
             
            MariaDB [mytest]> CALL columnstore_info.load_from_s3("s3://dleeqadata", "1g/lineitem.tbl", "mytest", "lineitem", "|", "", "" );
            +----------------------------------------------------------------------------------------------------+
            | columnstore_dataload(bucket, filename, dbname, table_name, terminated_by, enclosed_by, escaped_by) |
            +----------------------------------------------------------------------------------------------------+
            | {"success": true, "inserted": "6001215", "processed": "6001215"}                                   |
            +----------------------------------------------------------------------------------------------------+
            1 row in set (16.243 sec)
             
             
            MariaDB [mytest]> CALL columnstore_info.load_from_s3("s3://dleeqadata", "10g/lineitem.tbl", "mytest", "lineitem", "|", "", "" );
            +----------------------------------------------------------------------------------------------------+
            | columnstore_dataload(bucket, filename, dbname, table_name, terminated_by, enclosed_by, escaped_by) |
            +----------------------------------------------------------------------------------------------------+
            | {"success": true, "inserted": "59986052", "processed": "59986052"}                                 |
            +----------------------------------------------------------------------------------------------------+
            1 row in set (2 min 25.325 sec)
            


            dleeyh Daniel Lee (Inactive) added a comment -

            Build verified:

            engine: 84bb4e56b81c2f1e99151f198bf939612b09004e
            server: e3ed2f0ab1e287218ae72ecf77d95cc745795353
            buildNo: 6088

            toddstoffel Todd Stoffel (Inactive) added a comment -

            allen.herrera This feature is only for SkySQL. We'll document it when it goes GA in March.

            People

              leonid.fedorov Leonid Fedorov
              toddstoffel Todd Stoffel (Inactive)
              Alan Mologorsky Alan Mologorsky
              Daniel Lee Daniel Lee (Inactive)
              Votes: 0
              Watchers: 10

              Dates

                Created:
                Updated:
                Resolved:
