[MDEV-22571] Add extra data to ZIP table to allow self referencing Created: 2020-05-15  Updated: 2020-07-16

Status: Open
Project: MariaDB Server
Component/s: Storage Engine - Connect
Fix Version/s: None

Type: Epic Priority: Major
Reporter: Jean-Baptiste BUGEAUD Assignee: Olivier Bertrand
Resolution: Unresolved Votes: 0
Labels: None

Epic Name: Self Referencing Zip Table

 Description   

When interfacing between systems, you need access to MDM data generated by third-party systems. Those systems might have generated archives in ZIP format to group data files together (by version or functional domain).

The ZIP table type lets you list the contents of such a ZIP file in a convenient way.

In the current version, as per the documentation (https://mariadb.com/kb/en/connect-zipped-file-tables/), a ZIP table is created like this:

{{create table xzipinfo2 (
fn varchar(256) not null,
cmpsize bigint not null flag=1,
uncsize bigint not null flag=2,
method int not null flag=3,
date datetime not null flag=4)
engine=connect table_type=ZIP file_name='E:/Data/Json/cities.zip';
}}

At this time there is no way to get the archive file name from the table itself without parsing the CREATE_OPTIONS of that specific table. Moreover, when multiple=1 is used, there is no way to know which archive actually holds a specific entry.

Having an extra column to hold the archive file name would be a nice addition. Something like:

{{create table xzipinfo2 (
fn varchar(256) not null,
cmpsize bigint not null flag=1,
uncsize bigint not null flag=2,
method int not null flag=3,
date datetime not null flag=4,
afn varchar(256) not null)
engine=connect table_type=ZIP file_name='E:/Data/Json/cities.zip';
}}

The afn column would hold the complete archive file name, including its path, to allow direct reference to it.

With this, you could use the afn column to create a CONNECT table from some of the file entries, pointing directly to the right ZIP file, however many archives match.

This epic would reinforce the use of the CONNECT engine as a viable ETL alternative, which would benefit the MariaDB ecosystem.



 Comments   
Comment by Olivier Bertrand [ 2020-05-18 ]

This would normally be done with a special column specifying SPECIAL="FILEID", but unfortunately ZIP tables did not support it. I have added it and it will be available in the next releases.
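
For illustration, the table from the description could then be declared roughly as follows once the special column is supported for ZIP tables. This is only a sketch: the table name xzipinfo3, the wildcard file pattern used with multiple=1, and the entry name in the query are assumptions, and whether the returned value includes the full path should be checked against the release that contains the fix.

```sql
-- Hypothetical table listing the entries of every archive matching the pattern.
create table xzipinfo3 (
  fn varchar(256) not null,
  cmpsize bigint not null flag=1,
  uncsize bigint not null flag=2,
  method int not null flag=3,
  date datetime not null flag=4,
  -- Special column: filled by CONNECT with the name of the file being read,
  -- which for a ZIP table is the archive itself.
  afn varchar(256) not null special=FILEID)
engine=connect table_type=ZIP multiple=1
file_name='E:/Data/Json/*.zip';

-- Find which archive holds a given entry (the entry name is an example).
select afn, fn from xzipinfo3 where fn = 'cities.json';
```

The special column replaces the plain afn column proposed in the description, so its value does not have to be parsed out of CREATE_OPTIONS.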

Meanwhile, if you are able to compile MariaDB from source, just add the following line in tabzip.h after line 56:

virtual PCSZ GetFile(PGLOBAL) {return zfn;}

About allowing internal file name searches across multiple zip files: I closed MDEV-22572 because I thought it was pointless, but now I see your need. What could be done is to list all existing internal files; with the addition of this special column, it would then be possible to find which zip file contains a specific file. This is not as trivial as adding the special column, so tell me if you really need it.

I also noticed that your sample zip files contain DBF files. As said in the documentation, this file type is not supported when zipped. Perhaps this could be added, but again this is not a trivial addition, so I cannot guarantee I can do it quickly.

Comment by Olivier Bertrand [ 2020-05-18 ]

In the end I was able to fix MDEV-22572. If you need these fixes urgently and can compile MariaDB from source, I can send you the modified files (tabzip.h and tabzip.cpp).
