Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Fix Version/s: None
    • Component/s: Server

    Description

      LOAD DATA currently supports reading from files and from the client (LOAD DATA LOCAL).

      There is some interest in the ability to load data from more sources, in particular from AWS S3, but also over http[s]. This could also let us handle compressed files that use any of the compression formats the server supports.

      We'll solve it by abstracting the file-reading code behind a plugin API. Initially there will be two plugins, "file" and "local".

      The syntax

      LOAD { DATA | XML } [ LOCAL ] INFILE ...
      

      will be generalized to

      LOAD { DATA | XML } [ plugin ] INFILE ...
      

      We might need some kind of plugin-specific syntax extension for LOAD, so that the AWS plugin can specify credentials. Or maybe not, if everything can be part of the "filename", as in http://user:password@host.name/path/to/file

      Preferably it should work for SELECT ... INTO OUTFILE too.
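
      For illustration only, here is how the generalized syntax could look with hypothetical "S3" and "URL" source plugins (the plugin names, URL schemes and the OUTFILE placement are assumptions, nothing here is decided):

      LOAD DATA S3 INFILE 's3://mybucket/path/to/file.csv'
        INTO TABLE t1
        FIELDS TERMINATED BY ',' ENCLOSED BY '"'
        IGNORE 1 LINES;

      LOAD DATA URL INFILE 'https://user:password@host.name/path/to/file'
        INTO TABLE t1;

      SELECT * FROM t1 INTO OUTFILE S3 's3://mybucket/path/to/export.csv';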

          Activity

            serg Sergei Golubchik added a comment (edited)

            Strictly speaking, there are three kinds of actions here.

            • fetching the raw data from somewhere (file, client, AWS, http, whatever)
              • input: source specification, e.g. url
              • output: data stream
            • filtering the data (e.g. uncompress)
              • input: raw data
              • output: raw data
            • parsing the data, e.g. XML, CSV, etc.
              • input: raw data
              • output: column values

            Which gives the syntax

            LOAD plugin1 [LOW_PRIORITY | CONCURRENT] [plugin2] INFILE 'location'
              [ REPLACE | IGNORE ] INTO table
              [CHARACTER SET charset_name]
              plugin1params
              [IGNORE number {LINES|ROWS}]
              [(col_name_or_user_var,...)]
              [SET col_name = expr,...]
            

            plugin1 is something like DATA, XML, JSON, etc. plugin2 could be LOCAL, S3, etc. Filtering would need additional syntax, which is not included above. The syntax above will likely be impossible as written due to bison limitations and will need to be adjusted during implementation.
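
            For illustration, assuming DATA as plugin1 and a hypothetical S3 plugin as plugin2, with the familiar FIELDS/LINES clauses standing in for plugin1params (and no filter step, since that syntax is not defined above), such a statement might read:

            LOAD DATA CONCURRENT S3 INFILE 's3://mybucket/data/events.csv'
              REPLACE INTO events
              CHARACTER SET utf8mb4
              FIELDS TERMINATED BY ',' ENCLOSED BY '"'
              IGNORE 1 LINES
              (id, ts, @raw)
              SET payload = @raw;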

            wlad Vladislav Vaintroub added a comment (edited)

            Can't this all be done using client-side tools, e.g. gunzip, aws-cli, awk, sed, and piping CSV to a client that accepts input on stdin?
            Presumably, it would also be crucial to enhance the protocol so that the server does not ask the client to open a file; the client already has all the input on its stdin.


            serg Sergei Golubchik added a comment

            Depends. If the client can easily access the data, yes. If the server is in RDS and the data is in an S3 bucket, one may not necessarily want to pull the data from AWS just to feed it back.

            Anyway, this task is mostly about restructuring the existing code: moving the XML parser and the configurable DATA parser (ESCAPED, TERMINATED, ENCLOSED) into plugins (mandatory, statically linked into the server). Adding actual plugins is not necessarily part of this task, except that we might want a couple of examples to prove the viability of the API.
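
            For reference, a minimal sketch of the existing statements whose parsing would move into the mandatory DATA and XML parser plugins (behaviour unchanged, file paths are made up):

            LOAD DATA INFILE '/tmp/data.csv'
              INTO TABLE t1
              FIELDS TERMINATED BY ',' ENCLOSED BY '"' ESCAPED BY '\\'
              LINES TERMINATED BY '\n';

            LOAD XML INFILE '/tmp/data.xml'
              INTO TABLE t1
              ROWS IDENTIFIED BY '<row>';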


            People

              Assignee: Unassigned
              Reporter: Sergei Golubchik (serg)
              Votes: 0
              Watchers: 7
