[MCOL-3875] Columnstore write cache Created: 2020-03-10  Updated: 2023-12-15  Due: 2020-04-06

Status: Stalled
Project: MariaDB ColumnStore
Component/s: MariaDB Server
Affects Version/s: 1.5.3
Fix Version/s: 23.10

Type: New Feature Priority: Major
Reporter: Michael Widenius Assignee: Roman
Resolution: Unresolved Votes: 0
Labels: Performance

Issue Links:
Blocks
is blocked by MCOL-4769 Complete ColumnStore Insert Cache Part 1 Closed
Duplicate
is duplicated by MCOL-4221 HTAP Deployment - poor performance on... Closed
Sub-Tasks:
Key
Summary
Type
Status
Assignee
MCOL-3964 Replace CSV engine with MCS in the in... Sub-Task Closed Gagan Goel  
MCOL-3965 Research data insertion paths: cpimpo... Sub-Task Closed Gagan Goel  
MCOL-4285 Implement Stage 1 of the ColumnStore ... Sub-Task Closed Todd Stoffel  
MCOL-4914 Complete ColumnStore Cache Part 2 Sub-Task Open Leonid Fedorov  
Epic Link: ColumnStore Performance Improvements
Sprint: 2020-7

 Description   

Background:
ColumnStore is very slow at inserting single rows but fast at inserting multiple rows. This is especially a problem in a master-slave setup where ColumnStore is used on the slave for some of the tables from the master.

What we are trying to solve:

  • Speed up inserts to ColumnStore to help a ColumnStore slave keep up
    with replication.
  • The solution will also speed up single- or few-row inserts.

Requirements:

  • Cache should be totally invisible to the normal user
  • Cache usage should be crash safe. In no case will we lose any committed
    data.
  • Cache should be able to hold data from thousands of independent insert
    transactions as the master may have done each insert in a separate
    transaction. The transactions can be intermixed with other transactions
    for other tables.

Assumptions for ColumnStore usage:

  • ColumnStore is most often used on a slave
  • Lots of inserts, few updates and deletes.
  • Occasional selects (much fewer than inserts)

Note that the above requirements require us to store the inserted data on a
persistent medium at commit time.

Considerations:

  • In this draft we suggest using MyISAM because it is trivial to keep
    watermarks in the MyISAM dynamic row format that tell us what the last
    committed transaction was, even after a crash. This is not possible
    with CSV. Nor can we use MEMORY tables, as they would not allow us to
    meet the requirements.
  • We may choose to use Aria instead of MyISAM. However, even if Aria
    were used, we would not use crash-safe tables, as the overhead of
    storing undo records for inserts is notable and not needed here.
  • The MyISAM table will not use any indexes, which makes inserts very
    fast. We need two syncs per commit (one for committing the data and
    one for committing the watermark). All row data will be cached by the
    file system; there is no internal cache.
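The two-sync watermark scheme above can be sketched as follows. This is a minimal Python simulation, not ColumnStore code; the file layout and names are assumptions for illustration only.

```python
import os
import struct

class WatermarkCache:
    """Simulates an append-only cache file plus a watermark file that
    records the offset of the last committed row (two syncs per commit).
    Purely illustrative; the real cache would be a MyISAM table."""

    def __init__(self, path):
        self.data_path = path + ".dat"    # hypothetical cache data file
        self.mark_path = path + ".mark"   # hypothetical watermark file
        open(self.data_path, "ab").close()

    def append_rows(self, rows):
        # First sync: make the row data itself durable.
        with open(self.data_path, "ab") as f:
            for row in rows:
                f.write(struct.pack("<I", len(row)) + row)
            f.flush()
            os.fsync(f.fileno())
            return f.tell()

    def commit(self, offset):
        # Second sync: make the watermark durable. Everything past the
        # watermark is, by definition, uncommitted.
        with open(self.mark_path, "wb") as f:
            f.write(struct.pack("<Q", offset))
            f.flush()
            os.fsync(f.fileno())

    def recover(self):
        # After a crash, truncate the cache back to the last committed
        # watermark, trivially discarding any uncommitted rows.
        try:
            with open(self.mark_path, "rb") as f:
                mark = struct.unpack("<Q", f.read(8))[0]
        except FileNotFoundError:
            mark = 0
        os.truncate(self.data_path, mark)
        return mark
```

Committed data survives a crash because the watermark is only advanced after the row data has been synced; uncommitted tail data is simply cut off during recovery.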

Solution:

  • A new table handler, COLUMNSTORE_CACHE, that inherits from
    ColumnStore and overrides the following methods:
      • Create table:
          • Creates the table in MyISAM and then an identical table in
            ColumnStore
      • ha_write_row:
          • Inserts the row into MyISAM
      • ha_delete_row and ha_update:
          • Flush committed rows in the cache as a separate transaction.
          • Copy all not yet committed rows to ColumnStore.
          • Run the rest of the transaction as a normal transaction
            (slow but safe).
      • external_lock:
          • In case of a read lock, flush the insert cache to ColumnStore
            up to the last commit
          • Call columnstore::external_lock
      • close
      • drop
      • rename

Stages:

1)

  • Inserts will be limited to one transaction at a time. This can be made
    crash safe in MyISAM: in case of a crash, all uncommitted rows can
    trivially be discarded.
  • It will not be 100% crash safe in the case where we crash exactly after
    copying the data but before we have time to truncate the cache
    (a very small window).
  • In case of a flush, wait until the currently running transaction using
    ColumnStore has ended and then flush everything at once and truncate
    the cache.
  • The assumption is that there are no updates or deletes in a transaction
    that starts with an insert (the most common case).

2)

  • Add handling of delete and update in a transaction that also uses
    inserts, as described under 'Solution'.

3) Add support for batched inserts into ColumnStore without cpimport.
This will speed up flushing of the insert cache, LOAD DATA and INSERT ... SELECT.
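Stage 3 amounts to replacing many single-row inserts with one batched insert. A minimal sketch of the batching pattern (the names are illustrative, not the real interface):

```python
class BatchedInserter:
    """Accumulates rows and forwards them in batches, so the target
    engine pays its per-statement overhead once per batch instead of
    once per row. Illustrative sketch only."""
    def __init__(self, insert_batch, batch_size=1000):
        self.insert_batch = insert_batch  # callable taking a list of rows
        self.batch_size = batch_size
        self.pending = []
        self.batches_sent = 0

    def write_row(self, row):
        self.pending.append(row)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        # Send everything accumulated so far as one batch.
        if self.pending:
            self.insert_batch(self.pending)
            self.batches_sent += 1
            self.pending = []
```

Flushing the insert cache, LOAD DATA and INSERT ... SELECT all benefit in the same way: the per-row path is replaced by the per-batch path.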

4)

  • Add 'XA'-style safety around 'copy to ColumnStore, commit and truncate
    MyISAM'. This will make the cache totally crash safe.
  • Add an interface for the ColumnStore engine to ask for the current
    transaction id and, after a crash, to ask whether that id was committed.
  • Add crash recovery to ensure that an interrupted copy is fully done
    and, if not done, redo it.
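The Stage 4 recovery protocol can be sketched as follows. The journal and the committed-id query are illustrative stand-ins for the real interfaces proposed above:

```python
class CachedEngine:
    """Sketch of XA-style recovery: journal the transaction id of an
    in-flight copy so a crash between 'copy committed in ColumnStore'
    and 'cache truncated' can be resolved on restart."""
    def __init__(self, was_committed):
        self.was_committed = was_committed  # asks ColumnStore about a txn id
        self.in_flight = None               # journaled id of current copy
        self.cache = []                     # stand-in for the MyISAM cache
        self.cs_rows = []
        self.next_txn = 1

    def copy_to_columnstore(self):
        txn = self.next_txn
        self.next_txn += 1
        self.in_flight = txn              # 1. journal the copy
        self.cs_rows.extend(self.cache)   # 2. copy and commit in ColumnStore
        self.cache.clear()                # 3. truncate the cache
        self.in_flight = None             # 4. clear the journal
        return txn

    def recover(self):
        txn = self.in_flight
        if txn is None:
            return "clean"
        if self.was_committed(txn):
            # The copy reached ColumnStore; just finish the truncate.
            self.cache.clear()
            self.in_flight = None
            return "truncated"
        # The copy never committed; redo it from the intact cache.
        self.copy_to_columnstore()
        return "redone"
```

Either way the cache and ColumnStore end up consistent: a committed copy is never redone (no duplicates) and an uncommitted copy is never lost.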

5)

  • Allow multiple inserts at the same time.
  • In case of a flush, block new inserts, wait until old transactions have
    committed and move all rows to ColumnStore at once.
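The Stage 5 blocking scheme is a classic gate. A minimal sketch using a condition variable (a simulation of the described coordination, not engine code):

```python
import threading

class FlushGate:
    """Sketch of Stage 5: many inserters may run concurrently, but a
    flush first blocks new inserts, then waits until every in-flight
    insert transaction has committed before moving rows out at once."""
    def __init__(self):
        self.cond = threading.Condition()
        self.active = 0           # in-flight insert transactions
        self.flushing = False

    def begin_insert(self):
        with self.cond:
            while self.flushing:  # new inserts are blocked during a flush
                self.cond.wait()
            self.active += 1

    def end_insert(self):
        with self.cond:
            self.active -= 1
            self.cond.notify_all()

    def flush(self, do_flush):
        with self.cond:
            self.flushing = True
            while self.active > 0:    # wait for old transactions
                self.cond.wait()
            do_flush()                # move all rows at once
            self.flushing = False
            self.cond.notify_all()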

6)

  • Don't truncate the table as part of the copy. Instead, only advance a
    watermark in the MyISAM table marking what has been copied. This allows
    us to copy data while other transactions are doing inserts.
  • Truncation can be done when there have been no inserts since the last
    copy or when the cache file has reached some size limit.
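The watermark-instead-of-truncate idea in Stage 6 can be sketched like this (a simulation with illustrative names; the real watermark would live in the MyISAM table):

```python
class WatermarkedCopy:
    """Sketch of Stage 6: instead of truncating the cache after each
    copy, track a watermark of how many rows have been copied, so
    copying can run while new inserts keep appending."""
    def __init__(self, size_limit=1000):
        self.cache = []        # stand-in for the MyISAM cache table
        self.copied = 0        # watermark: rows [0, copied) are in CS
        self.cs_rows = []
        self.size_limit = size_limit

    def insert(self, row):
        self.cache.append(row)  # may run concurrently with copy_batch

    def copy_batch(self):
        # Copy everything between the watermark and the current end;
        # rows inserted after 'end' is read are picked up next time.
        end = len(self.cache)
        self.cs_rows.extend(self.cache[self.copied:end])
        self.copied = end

    def maybe_truncate(self):
        # Truncate only when nothing was inserted since the last copy
        # or the cache has grown past the size limit.
        if self.copied == len(self.cache) or len(self.cache) > self.size_limit:
            self.copy_batch()            # make sure every row is copied
            del self.cache[:self.copied]
            self.copied = 0
```

Because the copy only reads up to a snapshot of the end offset, concurrent inserters never conflict with it; only the rare truncate needs exclusivity.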

7)

  • Add a timer that flushes committed transactions from the cache once in
    a while (once per hour?)
  • Add logic to detect when inserts are idle and flush during idle time.
  • FLUSH TABLE table_name on a cached table should wait until current
    transactions have ended and then flush all data to ColumnStore.
  • Add an option to flush data to ColumnStore on server shutdown
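The idle-detection part of Stage 7 can be sketched as follows. A periodic timer (e.g. once per hour) would call maybe_flush; the names and the injectable clock are illustrative:

```python
import time

class IdleFlusher:
    """Sketch of Stage 7 idle logic: remember when the last insert
    happened and flush once no insert has arrived for idle_secs."""
    def __init__(self, flush_fn, idle_secs=5.0, now=time.monotonic):
        self.flush_fn = flush_fn
        self.idle_secs = idle_secs
        self.now = now              # injectable clock, for testing
        self.last_insert = None
        self.dirty = False          # is there cached data to flush?

    def on_insert(self):
        self.last_insert = self.now()
        self.dirty = True

    def maybe_flush(self):
        # Called from a timer; flushes only if there is cached data and
        # inserts have been idle long enough.
        if self.dirty and self.now() - self.last_insert >= self.idle_secs:
            self.flush_fn()
            self.dirty = False
            return True
        return False
```

The dirty flag keeps the timer cheap when nothing has been inserted, and the idle threshold avoids flushing in the middle of a replication burst.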

Generated at Thu Feb 08 02:46:06 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.