Details
Type: Task
Status: Closed
Priority: Major
Resolution: Fixed
Description
The task is to create a read-only storage engine that will allow one
to archive MariaDB tables in Amazon S3, or in any third-party public or
private cloud that implements the S3 API (of which there are many), but
still have them accessible for reading in MariaDB.
The typical use case would be that there exist tables that after some
time become fairly inactive, but not so inactive that they could
be removed. In that case, an option is to move such a table to an
archiving service, which is accessible through an S3 API.
Notice that S3 means the Cloud Object Storage API defined by Amazon
AWS. Often the whole of Amazon’s Cloud Object Storage is referred to
as S3. In the context of the S3 archive storage engine, it refers to
the API itself that defines how to store objects in a cloud service
be it Amazon's or someone else's. OpenStack, for example, provides an
S3 API for storing objects.
The main benefit of storing things in S3-compatible storage is that the
cost of storage is much cheaper than many other alternatives. Many S3
implementations also provide very reliable long-term storage.
It should be trivial to move an existing table to be stored in S3 (where it
becomes read-only) and to move it back to local storage.
The current idea is to use MyISAM or Aria as a base for this storage
engine, as these are ideally suited for read-only use. Having MyISAM or
Aria as a base allows the S3 storage engine to support all MariaDB
types, indexes etc.
The suggested way to move a table to S3 would be something like:
CREATE SERVER 'S3_SERVER' FOREIGN DATA WRAPPER 'AWS' OPTIONS
(USER 'aws id string', PASSWORD 'aws secret string');
ALTER TABLE test.old_table STORAGE_ENGINE=MyISAM ROW_FORMAT=S3
CONNECTION="S3_SERVER" BUCKET="test.old_table" BUCKET_SIZE=4*1024*1024;
The ALTER TABLE will first create a local table in the normal MyISAM/Aria
format and then move both index and data to S3 in buckets of BUCKET_SIZE
(a rough sketch of the copy code is shown below). The .frm file will stay
on the main server. We will also write an external C program that can be
used to copy any MyISAM/Aria table to S3. The same code will also be used
in the server for doing the copy.
One advantage of this approach is that we can create all indexes one by one
with sorting and thus get an optimal layout on storage for key accesses.
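The following C++ fragment is such a sketch (the function name
copy_file_to_s3, the key layout and the error handling are assumptions, not
the actual implementation): it uploads one MyISAM/Aria file to S3 in blocks
of BUCKET_SIZE bytes, one object per block, using the AWS SDK for C++. The
caller is assumed to have called Aws::InitAPI() and constructed the S3Client.

#include <aws/core/Aws.h>
#include <aws/s3/S3Client.h>
#include <aws/s3/model/PutObjectRequest.h>
#include <fstream>
#include <string>
#include <vector>

/* Hypothetical sketch: upload one table file in blocks of block_size bytes,
   one S3 object per block, keyed by block number under key_prefix. */
static bool copy_file_to_s3(Aws::S3::S3Client &client,
                            const std::string &bucket,      /* e.g. "mariadb-archive" */
                            const std::string &key_prefix,  /* e.g. "mariadb/test/old_table" */
                            const std::string &path,        /* local index or data file */
                            size_t block_size)              /* e.g. 4*1024*1024 */
{
  std::ifstream file(path, std::ios::binary);
  std::vector<char> buffer(block_size);
  size_t block_number= 0;

  while (file)
  {
    file.read(buffer.data(), (std::streamsize) buffer.size());
    std::streamsize bytes= file.gcount();
    if (bytes <= 0)
      break;

    Aws::S3::Model::PutObjectRequest request;
    request.SetBucket(bucket.c_str());
    request.SetKey((key_prefix + "/" + std::to_string(block_number)).c_str());

    /* Wrap the block in a stream, as the SDK expects an Aws::IOStream body */
    auto body= Aws::MakeShared<Aws::StringStream>("s3copy");
    body->write(buffer.data(), bytes);
    request.SetBody(body);

    auto outcome= client.PutObject(request);
    if (!outcome.IsSuccess())
      return false;       /* a real implementation would log outcome.GetError() */
    block_number++;
  }
  return true;
}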
If one wants to change the table back to a "normal" writable table, one can
do that with another ALTER TABLE:
ALTER TABLE test.old_table ROW_FORMAT=page;
or
ALTER TABLE test.old_table ENGINE=InnoDB;
One of the properties of many S3 implementations is that they favor big
reads; 4M reads are said to give the best performance.
Write performance for S3 is not great, but in this project that is not a
problem.
What makes MyISAM and Aria good candidates for having S3 support is that both
have their own extendable 'storage modules' (a bit like storage engines
in MariaDB). It's relatively easy to add a new format to either of them.
One of the first tasks is to decide which one to choose as a base for the new
storage module.
There are no planned changes to the storage engine API for this.
MyISAM
======
Advantages:
- Supports different concurrent block sizes for the key cache
- Has support for MERGE tables, which can be very useful with monthly
  archived data
- Has a segmented key cache, which gives better concurrent multi-user support
Disadvantages:
- No cache for rows (must be developed if we want to add S3 support)
- Aria gets more development resources (Aria is basically a better version
  of MyISAM with more storage modules, and it is optionally crash-safe).
Aria
====
Advantages:
- Has a row cache for ROW_FORMAT=block that we could re-use for S3
Disadvantages:
- No support for different concurrent block sizes in the Aria page cache
  (we need to support both 8K and 4M in the same cache)
  - May be solved by having a different cache for S3, but in that case
    all S3 tables in one installation must have the same block size
  - Another option would be to add an upper-layer function that splits
    a read block into multiple smaller blocks (see the sketch after this list)
- No support for MERGE tables
  - This is something we should have added a long time ago, and it is not that
    hard to do based on the original MERGE table code
- No segmented page cache
  - Would be nice to have, but not critical in a read-only scenario, and can
    be added later
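To illustrate the block-splitting option above, here is a minimal sketch
(hypothetical names; read_s3_block() is an assumed helper that fetches one
big S3 object into a buffer) of an upper-layer function that serves an 8K
page request out of the 4M S3 block that contains it:

#include <cstddef>
#include <cstring>

static const size_t S3_BLOCK_SIZE= 4 * 1024 * 1024;  /* size of one S3 object */
static const size_t PAGE_SIZE= 8 * 1024;             /* Aria page size */

/* Assumed helper, defined elsewhere: reads S3 object number 'block_number'
   of the table into 'block_buffer' (S3_BLOCK_SIZE bytes). */
bool read_s3_block(char *block_buffer, size_t block_number);

/* Sketch: map a request for one 8K page onto the 4M S3 block holding it */
bool read_page_via_s3(void *page_buffer, size_t page_number)
{
  size_t file_offset=     page_number * PAGE_SIZE;
  size_t s3_block_no=     file_offset / S3_BLOCK_SIZE;  /* which S3 object */
  size_t offset_in_block= file_offset % S3_BLOCK_SIZE;  /* where inside it */

  static char big_block[S3_BLOCK_SIZE];                 /* stand-in for a real cache */
  if (!read_s3_block(big_block, s3_block_no))
    return false;

  memcpy(page_buffer, big_block + offset_in_block, PAGE_SIZE);
  return true;
}

A real implementation would of course keep the fetched 4M blocks in a cache
instead of re-reading them for every page.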
Currently I am slightly in favour of using Aria as a base, as the Aria
code base is newer. There will be a little more work to add the
missing pieces, but nothing major.
S3 Interface
============
We are looking at using the AWS interface to S3 from
https://github.com/aws/aws-sdk-cpp.git
Example code for using it can be found at:
https://github.com/awsdocs/aws-doc-sdk-examples/tree/master/cpp/example_code/s3
Buckets for a table will be stored as
base/database/table/block_number
Special characters in the database and table names will be converted to hex.
The 'base' part we will get from the SERVER object.
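As an illustration of how such a key could be built (the escape marker and
the set of characters left unescaped are assumptions, not the actual
implementation), a C++ sketch:

#include <string>

/* Hex-escape any character that we assume is not safe in an S3 key */
static std::string hex_escape(const std::string &name)
{
  static const char hex_digits[]= "0123456789abcdef";
  std::string result;
  for (unsigned char c : name)
  {
    if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
        (c >= '0' && c <= '9') || c == '_' || c == '-')
      result+= c;
    else
    {
      result+= '%';                     /* escape marker (assumed) */
      result+= hex_digits[c >> 4];
      result+= hex_digits[c & 0x0f];
    }
  }
  return result;
}

/* Build base/database/table/block_number */
static std::string make_object_key(const std::string &base,
                                   const std::string &database,
                                   const std::string &table,
                                   unsigned long block_number)
{
  return base + "/" + hex_escape(database) + "/" + hex_escape(table) +
         "/" + std::to_string(block_number);
}

For example, make_object_key("mariadb", "test", "old_table", 0) would give
"mariadb/test/old_table/0".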
The main API calls to use are:
s3_client.CreateBucket(request); # For insert
s3_client.GetObject(object_request); # For read
s3_client.DeleteBucket(bucket_request); # For drop
Another option is to use HTTP requests directly, but then we would need
to find or create a library to simplify the interface.
The code for the above C++ objects can be found at
aws-cpp-sdk-s3/source/S3Client.cpp in the above git repository.
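A minimal, hedged example of reading one object back with this SDK (the
bucket and key names below are made up, following the layout described
earlier):

#include <aws/core/Aws.h>
#include <aws/s3/S3Client.h>
#include <aws/s3/model/GetObjectRequest.h>
#include <iostream>
#include <iterator>
#include <string>

int main()
{
  Aws::SDKOptions options;
  Aws::InitAPI(options);                          /* must wrap all SDK use */
  {
    Aws::S3::S3Client s3_client;                  /* credentials from environment/config */

    Aws::S3::Model::GetObjectRequest object_request;
    object_request.SetBucket("mariadb-archive");        /* example bucket */
    object_request.SetKey("mariadb/test/old_table/0");  /* base/database/table/block_number */

    auto outcome= s3_client.GetObject(object_request);
    if (outcome.IsSuccess())
    {
      /* GetBody() returns an std::iostream with the object's bytes */
      auto &body= outcome.GetResult().GetBody();
      std::string block((std::istreambuf_iterator<char>(body)),
                        std::istreambuf_iterator<char>());
      std::cout << "read " << block.size() << " bytes\n";
    }
    else
      std::cerr << "GetObject failed: "
                << outcome.GetError().GetMessage() << "\n";
  }
  Aws::ShutdownAPI(options);
  return 0;
}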
Other things
============
- Internally we would store the B-tree and data in 8K pages, as this minimizes
  scanning for index entries. This also allows us to get better cache
  allocation, as we can flush out 8K pieces that are not accessed.
- Optionally we could compress the table with myisampack/aria_pack before we
  store it in S3. The benefit of this is that we will take 30% less space and
  get more performance, as more rows are read with each S3 read request.
- In theory we could also support creating an 'S3 table' directly and then
  doing one bulk load into it before it is moved to S3 and made read-only.
Attachments
Issue Links
- causes
  - MDEV-23362 Undefined symbol "uncompress" in s3_get_object() (Closed)
- includes
  - MDEV-19867 fix mysqldump to by default not copy S3 tables (Closed)
  - MDEV-19963 Add MERGE support for S3 (and Aria) (Closed)
  - MDEV-19964 S3 replication support (Closed)
  - MDEV-19965 Add cache for S3 connections (Open)
- relates to
  - MDEV-20279 Increase Aria index length limit (Closed)
  - MDEV-21808 Adding batch updates to S3 engine (Open)
  - MDEV-22606 Include the S3 storage engine in MariaDB Community Server binary packages (Closed)
  - MDEV-22980 S3 storage engine fails to load due to plugin_maturity (Closed)
  - MDEV-20302 Assertion `page_st != 2' failed in read_big_block or server hangs upon concurrent SELECT from partitioned S3 table (Closed)
  - MDEV-20306 Assertion `!(end_of_data > info->scan.dir_end)' failed in _ma_scan_block_record upon converting table from S3 to Aria with encryption enabled (Stalled)
  - MDEV-22088 S3 partitioning support (Closed)
  - MDEV-22925 ALTER TABLE s3_table ENGINE=Aria can cause failure on slave (Open)
  - MDEV-23377 libmarias3 doesn't compile in OpenBSD (Closed)