[MCOL-4529] Design the proper way of doing extent elimination on varchars with utf8 or utf8mb4 Created: 2021-02-02  Updated: 2021-04-15  Resolved: 2021-03-11

Status: Closed
Project: MariaDB ColumnStore
Component/s: Documentation
Affects Version/s: None
Fix Version/s: N/A

Type: Task Priority: Critical
Reporter: Gregory Dorman (Inactive) Assignee: Sergey Zefirov
Resolution: Done Votes: 0
Labels: None

Issue Links:
Blocks
blocks MCOL-4522 calGetTrace shows double LIO from Com... Closed
Duplicate
duplicates MCOL-1090 Extent Elimination for char/varchar c... Closed
PartOf
is part of MCOL-4580 Extent's approximate range keeping fo... In Testing
Epic Link: ColumnStore Performance Improvements
Sprint: 2021-2, 2021-3, 2021-4, 2021-5

 Description   

Overall - the top ticket (4522) is a massive regression.

Starting in 5.5, the true length of varchar and char conforms to SQL standard, and incorporates byte counts for character sets such as utf8, and utf8mb4. Up to and including 5.4, it was wrong, and only reserved the bytes equal to the number of characters defined in the column DDL. The side effect of it is exemplified in the customer complaint: a varchar(5) under olf utf8 used to fit, and filters on it were subject to extent elimination. With 5.5, it no longer fits, which causes columnstore to essentially ignore an otherwise tight filter, and to proceed with the full scan.



 Comments   
Comment by Sergey Zefirov [ 2021-02-04 ]

Roman proposed to split extent map by column width and I would like to piggy back that idea.

I would like to propose to keep ranges in maps indexed by the the length.

Right now the structure as follows:

struct EMCasualPartition_struct
{
    int32_t sequenceNum;
    char isValid; //CP_INVALID - No min/max and no DML in progress. CP_UPDATING - Update in progress. CP_VALID- min/max is valid
    union
    {
        int128_t bigLoVal; // These need to be reinterpreted as unsigned for uint64_t/uint128_t column types.
        int64_t loVal;
    };
    union
    {
        int128_t bigHiVal;
        int64_t hiVal;
    };
    // some constructors.
};

What I propose is the following:

dstruct EMCasualPartition_struct
{
    int32_t sequenceNum;
    char isValid; //CP_INVALID - No min/max and no DML in progress. CP_UPDATING - Update in progress. CP_VALID- min/max is valid
    uint8_t keyLen; // up to 256 bytes
    uint8_t minMax[]; // memcmp-friendly (big-endian for integers) encoding of min and max prefixes.
                                        // The array is twice the keyLen field long, first goes min, then max.
 
    // here we need methods to compute record size, get the min/max keys, etc.
};

And extentmap will now contains several "extentarrays", one for each key length.

It is easy to define maximum and minimum values for integers. Floating point numbers also can be handled, possible with one exception of negative zero.

For strings of arbitrary length, the minimum and maximum can be defined as follows:

  1. if string is not longer than key length, we extend the string with zeroes up to key length and this extended string forms minimum and maximum value to be added to a range. For string "ab" and key length 4 we will have minimum and maximum both equal to "ab\0\0".
  2. if string is longer than key length we cut it up to to key length. For string "abcde" and key length 4 the minimum and maximum are "abcd".

Please note that we consider minimum prefix to be extended to the right with zeroes infinitely and maximum to be extended to the right with the 0xff infinitely. That makes comparison of key length prefixes of strings with maximums and minimums valid: minimum is equal or less than any string in the range and maximum is equal or greater to any string in the range.

The prefix of the string preserves utf8 encoding, except for some last symbol. That last symbol can be thrown off and several first symbols can be used with collation for better extent use or elimination.

Comment by Sergey Zefirov [ 2021-02-04 ]

Extentmap internally is defined as preallocated array due to the way changes are handled: these changes are recorded as n-byte values starting at some memory address(es). When something is about to be changed, its value (n bytes of it) first recorded in the rollback record and then it is changed. The ExtentMap inherits from Undoable class and uses default implementation.

Thus, we should keep "array of records" design if we do not want to change in-memory rollback mechanism.

We can do that by having several arrays instead of one. And, possibly, even save some memory, I'll expand that soon.

Atomic update of several ranges with different key lengths is also possible: we can acquire locks for each individual key length sequentially, by using some order on key length (ascending or any other). The same ordering of lock acquisition between several mutator processes prevents dead lock.

Comment by Sergey Zefirov [ 2021-02-04 ]

We can use "array of arrays" storage to keep allocation smaller. The idea is described here in Compact Dynamic Arrays section. Instead of having dynamic arrays, we can allocate needed parts of static array, having sqrt(N) array for pointers to several sqrt(N) subarrays.

N above is maximum number of extent records kept.

This way we will have same address for our implementation of Undoable and not more than O(sqrt(N)) overhead over maximum number of extents actually used.

Comment by Roman [ 2021-02-05 ]

sergey.zefirov There will be no uint128 based dtypes so we don't need this integral type.

Comment by Sergey Zefirov [ 2021-02-08 ]

The locking in ExtentMap is done through a call to grabEMEntryTable(OPS op), where operation op can be READ or WRITE. It does so for the whole table at once.

For more efficient split access we need to perform locking on multiple tables and for that we need to establish locking order for tables. Without lock ordering we can have deadlocks.

The easiest ordering for extent map operation is by length of a key in an extent. We have fixed number of key lengths (1, 2, 4, 8 and 16 bytes) and set of key lengths for locking can be provided with the single integer values with some bits set.

Thus, the grabEMEntryTable will change to grabEMEntryTables(OPS op, smallintset keyLenSet). The operation is still READ or WRITE, the key length set is a set of small integers i (i=0..5) for key lengths 2^i^, represented as a structure with single field.

Corresponding release operation should receive same parameters and unlock tables in reverse locking order. The introduction of smallintset structure helps with threading values with types: we can provide singleton, union and enumeration operations, as well as operations for collections of LBID-associated ranges.

Comment by Sergey Zefirov [ 2021-02-10 ]

I definitely should record what Roman suggested: we may check extents with textual information with simple comparison of prefixes:

extent_range_min <= SUBSTR(string, 0, LENGTH(extent_range_min)) && SUBSTR(string, 0, LENGTH(extent_range_max)) <= extent_range_max

The "less or equal" comparison is collation-aware one.

We keep ranges for TOKEN column extents in 16-byte (128-bit) keys. These TOKEN columns provide information about strings layout in dictionary file. Please note that TOKEN columns are 8-bytes wide and this can be a source of confusion.

The string prefixes in ranges are binary - no collation is applied at the time of writing the strings into dictionary files. Collation, if any, is applied at the extent elimination phase.

Comment by Sergey Zefirov [ 2021-02-10 ]

Elena Starikova from server team shared this interesting bug from MySQL: https://bugs.mysql.com/bug.php?id=20247

Binary collation for strings "B", "D" and "Č" (accented C) will result in range "B".."Č" - both excessively large and, when used with proper collation, will not include "D" at all. The use of proper collation would result in "B".."D" but will preclude the use of any other collations.

Comment by Sergey Zefirov [ 2021-02-10 ]

Alexander Barkov explained that MariaDB does not optimize "some_column COLLATE collation" if collation specified differ from the collation specified for column.

But still, we need to store binary (utf-8) representation of strings for different collations to be applied.

Comment by Sergey Zefirov [ 2021-02-11 ]

Started a document on this, shared it with Greg and Roman.

Comment by Gregory Dorman (Inactive) [ 2021-03-11 ]

This task was to produce the design. It was accomplished. It is now MCOL-4580, implementation time.

Generated at Thu Feb 08 02:51:02 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.