[MDEV-17841] S3 Storage engine Created: 2018-11-26  Updated: 2023-11-22  Resolved: 2020-06-08

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - Aria, Storage Engine - MyISAM, Storage Engine - S3
Fix Version/s: 10.5.3

Type: Task Priority: Major
Reporter: Michael Widenius Assignee: Michael Widenius
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
PartOf
includes MDEV-19867 fix mysqldump to by default not copy ... Closed
includes MDEV-19963 Add MERGE support for S3 (and Aria) Closed
includes MDEV-19964 S3 replication support Closed
includes MDEV-19965 Add cache for S3 connections Open
Problem/Incident
causes MDEV-23362 Undefined symbol "uncompress" in s3_g... Closed
Relates
relates to MDEV-20279 Increase Aria index length limit Closed
relates to MDEV-21808 Adding batch updates to S3 engine Open
relates to MDEV-22606 Include the S3 storage engine in Mari... Closed
relates to MDEV-22980 S3 storage engine fails to load due t... Closed
relates to MDEV-20302 Assertion `page_st != 2' failed in re... Closed
relates to MDEV-20306 Assertion `!(end_of_data > info->scan... Stalled
relates to MDEV-22088 S3 partitioning support Closed
relates to MDEV-22925 ALTER TABLE s3_table ENGINE=Aria can ... Open
relates to MDEV-23377 libmarias3 doesn't compile in OpenBSD Closed
Sub-Tasks:
Key
Summary
Type
Status
Assignee
MDEV-19461 Assertion failure upon altering tempo... Technical task Closed Michael Widenius  
MDEV-19463 Altering sequence to S3 leaves unremo... Technical task Closed Michael Widenius  
MDEV-19464 Altering partitioned table into S3 ca... Technical task Closed Vladislav Lesin  
MDEV-19465 Server crashes in s3_block_read upon ... Technical task Closed Michael Widenius  
MDEV-19466 Syntax error in the error message: Ex... Technical task Closed Michael Widenius  
MDEV-19575 Assertion `page_st == 1' failed upon ... Technical task Closed Michael Widenius  
MDEV-19585 Assertion `!is_set() || (m_status == ... Technical task Closed Michael Widenius  
MDEV-19591 Assertion `!table->pos_in_locked_tabl... Technical task Closed Michael Widenius  

 Description   

The task is to create a read-only storage engine that will allow one
to archive MariaDB tables in Amazon S3, or in any third-party public or
private cloud that implements the S3 API (of which there are many), while
still keeping them accessible for reading in MariaDB.

The typical use case is a table that after some time becomes fairly
inactive, but not so inactive that it can be removed. In that case an
option is to move such a table to an archiving service that is
accessible through an S3 API.

Notice that S3 means the Cloud Object Storage API defined by Amazon
AWS. Often the whole of Amazon's Cloud Object Storage is referred to
as S3. In the context of the S3 archive storage engine, it refers to
the API itself, which defines how to store objects in a cloud service,
be it Amazon's or someone else's. OpenStack, for example, provides an
S3 API for storing objects.

The main benefit of storing data in S3-compatible storage is that the
cost of storage is much cheaper than many other alternatives. Many S3
implementations also provide very reliable long-term storage.

It should be trivial to move an existing table to be stored in S3 (and
read only) and move it back to local storage.

The current idea is to use MyISAM or Aria as a base for this storage
engine, as both are well suited for read-only use. Having MyISAM or
Aria as a base allows the S3 storage engine to support all MariaDB
types, indexes, etc.

The suggested way to move a table to S3 would be something like:

CREATE SERVER 'S3_SERVER' FOREIGN DATA WRAPPER 'AWS' OPTIONS
(USER 'aws id string', PASSWORD 'aws secret string');

ALTER TABLE test.old_table STORAGE_ENGINE=MyISAM ROW_FORMAT=S3
CONNECTION="S3_SERVER" BUCKET="test.old_table" BUCKET_SIZE=4*1024*1024;

The ALTER TABLE will first create a local table in the normal MyISAM/Aria
format and then move both index and data to S3 in buckets of BUCKET_SIZE.
The .frm file will stay on the main server. We will also write an
external C program that can be used to copy any MyISAM/Aria table to
S3. The same code will also be used inside the server for doing the copy.
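The chunked copy described above can be sketched roughly as follows. This is an illustrative Python sketch only (the real copy code is planned as C inside the server); the function name and the dict standing in for an S3 client are hypothetical.

```python
# Sketch: split a table file into BUCKET_SIZE blocks and store each
# block under table_path/block_number. A plain dict stands in for a
# real S3 client (hypothetical helper, not the actual server code).

BUCKET_SIZE = 4 * 1024 * 1024  # 4M, as in the ALTER TABLE example above

def copy_file_to_s3(data: bytes, table_path: str, store: dict) -> int:
    """Write 'data' in BUCKET_SIZE chunks; return number of blocks."""
    block_number = 0
    for offset in range(0, len(data), BUCKET_SIZE):
        key = f"{table_path}/{block_number}"
        store[key] = data[offset:offset + BUCKET_SIZE]
        block_number += 1
    return block_number

store = {}
# 9M of data -> three blocks (4M + 4M + 1M)
n = copy_file_to_s3(b"x" * (9 * 1024 * 1024), "base/test/old_table", store)
```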

One advantage of this approach is that we can create all indexes one by one
with sorting and thus get an optimal layout on storage for key accesses.

If one wants to change the table back to a "normal writable table", one
can do that with another ALTER TABLE:

ALTER TABLE test.old_table ROW_FORMAT=page;
or
ALTER TABLE test.old_table ENGINE=InnoDB;

One of the properties of many S3 implementations is that they favor big
reads; reads of 4M are said to give the best performance.
Write performance for S3 is not great, but for this project that is not a
problem.

What makes MyISAM and Aria good candidates for S3 support is that both
have their own extensible 'storage modules' (a bit like storage engines
in MariaDB). It is relatively easy to add a new format to either of them.

One of the first tasks is to decide which one to use as the base for the
new storage module.
There are no planned changes to the storage engine API for this.

MyISAM
======

Advantages:

  • Supports different concurrent block sizes for the key cache
  • Has support for MERGE tables, which can be very useful with
    monthly archived data.
  • Has a segmented key cache, which gives better concurrent multi-user support

Disadvantages:

  • No cache for rows (must be developed if we want to add S3 support)
  • Aria gets more development resources (Aria is basically a better version
    of MyISAM with more storage modules and optionally crash-safe).

Aria
====

Advantages:

  • Has a row cache for ROW_FORMAT=PAGE, which we could re-use for S3

Disadvantages:

  • No support for different concurrent block sizes in the Aria page cache
    (we need to support both 8K and 4M in the same cache).
    – This may be solved by having a separate cache for S3, but in that
      case all S3 tables in one installation must have the same block size.
    – Another option would be an upper-layer function that splits a large
      read block into multiple smaller blocks.
  • No support for MERGE tables.
    – This is something we should have added a long time ago; it is not
      that hard to do based on the original MERGE table code.
  • No segmented page cache.
    – Would be nice to have, but not critical in a read-only scenario, and
      can be added later.
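The "upper layer split" option mentioned above could look roughly like this. A minimal Python sketch for illustration only; the function name, page numbering scheme, and constants are hypothetical, not the actual Aria page cache code.

```python
# Sketch: split one large block read from S3 into 8K pages that could
# be inserted individually into the existing Aria page cache.
# All names here are hypothetical.

PAGE_SIZE = 8 * 1024             # Aria page size
S3_BLOCK_SIZE = 4 * 1024 * 1024  # preferred S3 read size

def split_s3_block(block: bytes, first_page_no: int):
    """Return a list of (page_number, 8K page data) pairs."""
    pages = []
    for i in range(0, len(block), PAGE_SIZE):
        pages.append((first_page_no + i // PAGE_SIZE,
                      block[i:i + PAGE_SIZE]))
    return pages

# One 4M block yields 512 pages of 8K each
pages = split_s3_block(b"\0" * S3_BLOCK_SIZE, first_page_no=0)
```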

Currently I am slightly in favour of using Aria as the base, as the Aria
code base is newer. There will be a little more work to add the missing
pieces, but nothing major.

S3 Interface
============
We are looking at using the AWS interface to S3 from
https://github.com/aws/aws-sdk-cpp.git

Example code to use from:
https://github.com/awsdocs/aws-doc-sdk-examples/tree/master/cpp/example_code/s3
Buckets for a table will be stored as:

base/database/table/block_number

Special characters in the database and table names will be converted to hex.
The 'base' prefix will come from the SERVER object.
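The naming scheme could be sketched as follows. This is an illustrative Python sketch; the exact escaping rules are not specified in this task, so the hex-encoding helper below is a hypothetical example of one possible rule.

```python
# Sketch of the base/database/table/block_number naming scheme.
# The escaping rule (hex-encode anything outside [A-Za-z0-9_]) is a
# hypothetical illustration, not the final specification.

def encode_name(name: str) -> str:
    """Hex-encode special characters so any database or table name
    maps to a safe S3 key component."""
    out = []
    for ch in name:
        if ch.isalnum() or ch == "_":
            out.append(ch)
        else:
            out.append("%02x" % ord(ch))
    return "".join(out)

def object_key(base: str, database: str, table: str,
               block_number: int) -> str:
    return f"{base}/{encode_name(database)}/{encode_name(table)}/{block_number}"

# object_key("base", "test", "old-table", 0) -> "base/test/old2dtable/0"
```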

The main API calls to use are:

s3_client.CreateBucket(request);        // for insert
s3_client.GetObject(object_request);    // for read
s3_client.DeleteBucket(bucket_request); // for drop

Another option is to issue HTTP requests directly, but then we would need
to find or create a library to simplify the interface.

The code for the above C++ objects can be found at
aws-cpp-sdk-s3/source/S3Client.cpp in the above git repository.

Other things
============

  • Internally we would store the b-tree and data in 8K pages, as this
    minimizes scanning for index entries. It also allows better cache
    allocation, as we can flush out 8K pieces that are not being accessed.
  • Optionally we could compress the table with myisampack/aria_pack before
    we store it in S3. The benefit is that the table takes about 30% less
    space, and we get better performance, as each S3 read request returns
    more rows.
  • In theory we could also allow one to create an 'S3 table' and then do
    one bulk load into it before it is moved to S3 and becomes read only.


 Comments   
Comment by Michael Widenius [ 2020-06-08 ]

S3 is already stable and has all the required features so I am now closing this overview task.
