Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Fix Version/s: None
    • Component/s: Server

    Description

      LOAD DATA currently supports reading from files and from the client (LOAD DATA LOCAL).

      There is some interest in the ability to load data from more sources, in particular from AWS S3, but also over http[s]. This could also let us handle compressed files that use any of the compression formats the server supports.

      We'll solve it by abstracting the file-reading code behind a plugin API. Initially there will be two plugins, "file" and "local".

      The syntax

      LOAD { DATA | XML } [ LOCAL ] INFILE ...
      

      will be generalized to

      LOAD { DATA | XML } [ plugin ] INFILE ...
      

      We might need some kind of plugin-specific syntax extension for LOAD, so that the AWS plugin can specify credentials. Or maybe not, if everything can be part of the "filename", as in http://user:password@host.name/path/to/file

      Preferably it should work for SELECT ... INTO OUTFILE too.
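
      For illustration only, here is how the generalized syntax could look with hypothetical "S3" and "URL" source plugins (the plugin names, URL schemes and the OUTFILE placement are assumptions, nothing here is decided):

      LOAD DATA S3 INFILE 's3://mybucket/path/to/file.csv'
        INTO TABLE t1
        FIELDS TERMINATED BY ',' ENCLOSED BY '"'
        IGNORE 1 LINES;

      LOAD DATA URL INFILE 'https://user:password@host.name/path/to/file'
        INTO TABLE t1;

      SELECT * FROM t1 INTO OUTFILE S3 's3://mybucket/path/to/export.csv';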

          Activity

            serg Sergei Golubchik added a comment (edited)

            Strictly speaking, there are three kinds of actions here.

            • fetching the raw data from somewhere (file, client, AWS, http, whatever)
              • input: source specification, e.g. url
              • output: data stream
            • filtering the data (e.g. uncompress)
              • input: raw data
              • output: raw data
            • parsing the data, e.g. XML, CSV, etc.
              • input: raw data
              • output: column values

            Which gives the syntax

            LOAD plugin1 [LOW_PRIORITY | CONCURRENT] [plugin2] INFILE 'location'
              [ REPLACE | IGNORE ] INTO table
              [CHARACTER SET charset_name]
              plugin1params
              [IGNORE number {LINES|ROWS}]
              [(col_name_or_user_var,...)]
              [SET col_name = expr,...]
            

            plugin1 is something like DATA, XML, JSON, etc. plugin2 could be LOCAL, S3, etc. Filtering would need additional syntax, which is not included above. The syntax above will likely be impossible as written due to bison limitations and will need to be adjusted during implementation.
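
            For illustration, assuming DATA as plugin1 and a hypothetical S3 plugin as plugin2, with the familiar FIELDS/LINES clauses standing in for plugin1params (and no filter step, since that syntax is not defined above), such a statement might read:

            LOAD DATA CONCURRENT S3 INFILE 's3://mybucket/data/events.csv'
              REPLACE INTO events
              CHARACTER SET utf8mb4
              FIELDS TERMINATED BY ',' ENCLOSED BY '"'
              IGNORE 1 LINES
              (id, ts, @raw)
              SET payload = @raw;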

            wlad Vladislav Vaintroub added a comment (edited)

            Can't this all be done using client-side tools, e.g. gunzip, aws-cli, awk, sed, and piping CSV to a client that accepts input on stdin?
            Presumably, it would also be crucial to enhance the protocol so that the server does not ask the client to open a file; the client already has all the input on its stdin.


            serg Sergei Golubchik added a comment

            Depends. If the client can easily access the data, yes. If the server is in RDS and the data is in an S3 bucket, one may not necessarily want to pull the data from AWS just to feed it back.

            Anyway, this task is mostly about restructuring the existing code: moving the XML parser and the configurable DATA parser (ESCAPED, TERMINATED, ENCLOSED) into plugins (mandatory, statically linked into the server). Adding actual plugins is not necessarily part of this task, except that we might want a couple of examples to prove the viability of the API.
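
            For reference, a minimal sketch of the existing statements whose parsing would move into the mandatory DATA and XML parser plugins (behaviour unchanged, file paths are made up):

            LOAD DATA INFILE '/tmp/data.csv'
              INTO TABLE t1
              FIELDS TERMINATED BY ',' ENCLOSED BY '"' ESCAPED BY '\\'
              LINES TERMINATED BY '\n';

            LOAD XML INFILE '/tmp/data.xml'
              INTO TABLE t1
              ROWS IDENTIFIED BY '<row>';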


            People

              Assignee: Unassigned
              Reporter: Sergei Golubchik (serg)
              Votes: 0
              Watchers: 7
