[MCOL-3758] Parallel sorting 2nd phase and on disk spill capability. - Jira

Details

Type: New Feature
Status: Stalled
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: 23.10
Component/s: ExeMgr
Labels:
- gsoc20

Description

As of 1.4.1 CS uses two-phase sorting. Here are the phases:

1. Presort partial runs of data.
2. Merge the presorted partial runs produced during the 1st phase.

Here is a more detailed explanation of how sorting works as of 1.4.1:

1. CS takes a portion of record data from the previous steps of query execution (an RGData instance from the ring buffer of RGData-s) and produces a sorted run from it using the existing sorting class LimitedOrderBy. If the query contains LIMIT, CS applies it at this phase, which can significantly reduce the data set cardinality. If the query contains LIMIT + OFFSET, CS builds a sorted run of at most LIMIT+OFFSET records. CS performs this step in parallel, dividing the whole data set into k runs, where k is governed by the session variable columnstore_orderby_threads. At this phase CS tries to preallocate memory in QUEUE_RESERVE_SIZE batches.
2. CS merges the k presorted partial runs produced by the previous phase into one sorted result in a single thread. If the query contains the DISTINCT keyword, CS rebuilds a hash map to preserve uniqueness.
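
The two phases described above can be illustrated with a minimal Python sketch. It is not the ColumnStore implementation: `presort_runs` and `merge_runs` are hypothetical names, rows are plain comparable keys, the k runs are built sequentially rather than by threads, and deduplicating each run before truncation for DISTINCT is an assumption made here so that the LIMIT+OFFSET cap stays correct.

```python
import heapq
from itertools import islice


def presort_runs(rows, k, limit=None, offset=0, distinct=False):
    """Phase 1 (sketch): split rows into k chunks and sort each one.

    With LIMIT, a run never needs more than limit + offset rows.
    """
    cap = None if limit is None else limit + offset
    runs = []
    for i in range(k):
        chunk = rows[i::k]            # hypothetical way of dividing the data set
        if distinct:
            chunk = set(chunk)        # assumption: dedupe before truncating
        runs.append(sorted(chunk)[:cap])
    return runs


def unique(sorted_iter):
    """Drop repeated keys using a hash set, keeping first occurrences."""
    seen = set()
    for key in sorted_iter:
        if key not in seen:
            seen.add(key)
            yield key


def merge_runs(runs, limit=None, offset=0, distinct=False):
    """Phase 2 (sketch): single-threaded k-way merge of the presorted runs."""
    merged = heapq.merge(*runs)
    if distinct:
        merged = unique(merged)
    stop = None if limit is None else offset + limit
    return list(islice(merged, offset, stop))


rows = [5, 3, 9, 1, 7, 3, 8, 2]
runs = presort_runs(rows, k=3, limit=3, offset=1)
result = merge_runs(runs, limit=3, offset=1)
```

The single call to `heapq.merge` corresponds to the serial 2nd phase that this issue proposes to parallelize.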

We want to make the 2nd phase parallel as well, using range partitioning of the presorted runs produced by the 1st phase. After the 1st phase finishes we know the distribution of the sorting key values and can therefore divide the range of key values into regions (buckets). Each 2nd phase thread takes the values that fall into its region from every 1st phase sorted run, so a bucket holds the same region of key values across all runs. All 2nd phase threads then sort their buckets in parallel. In the end we put the sorted regions into the output stream in the requested order (ascending/descending) of the key values.
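
A rough Python sketch of this range-partitioned 2nd phase follows, under stated assumptions: the function names, the sampling `stride`, and the choice of splitters from a sorted sample are all hypothetical and not taken from the ColumnStore code base. Python threads are used only to show the structure; they are not meant to demonstrate a speedup.

```python
import heapq
from bisect import bisect_right
from concurrent.futures import ThreadPoolExecutor


def choose_splitters(runs, parts, stride=2):
    """Pick up to parts - 1 boundary keys from a sample of the sorted runs."""
    sample = sorted(key for run in runs for key in run[::stride])
    if not sample:
        return []
    return [sample[len(sample) * i // parts] for i in range(1, parts)]


def slice_run(run, splitters):
    """Cut one sorted run into len(splitters) + 1 contiguous key regions."""
    cuts = [0] + [bisect_right(run, s) for s in splitters] + [len(run)]
    return [run[cuts[i]:cuts[i + 1]] for i in range(len(cuts) - 1)]


def parallel_merge(runs, parts, descending=False):
    """Merge presorted runs region by region, one worker per region."""
    splitters = choose_splitters(runs, parts)
    sliced = [slice_run(run, splitters) for run in runs]
    # Bucket j collects the j-th key region from every 1st phase run.
    buckets = [[s[j] for s in sliced] for j in range(len(splitters) + 1)]
    with ThreadPoolExecutor(max_workers=len(buckets)) as pool:
        merged = list(pool.map(lambda b: list(heapq.merge(*b)), buckets))
    # Regions are disjoint and ordered, so concatenation is globally sorted.
    out = [key for bucket in merged for key in bucket]
    return out[::-1] if descending else out


runs = [[1, 5, 8], [2, 3, 7], [3, 9]]
ascending = parallel_merge(runs, parts=3)
```

Because equal keys always land in the same region, the result does not depend on which splitters are chosen; the splitters only affect how evenly the work is spread across threads.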

The sorting must also have on-disk spill capability.
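
One common way to provide spill capability is an external merge sort: sorted runs that exceed a memory budget are written to temporary files and later merged while streaming from disk. The sketch below only illustrates that general technique. The issue does not prescribe a design, and `max_in_memory`, the one-integer-per-line file format, and all function names are assumptions.

```python
import heapq
import os
import tempfile


def spill_run(sorted_keys):
    """Write one sorted run to a temp file, one integer key per line."""
    fd, path = tempfile.mkstemp(suffix=".run")
    with os.fdopen(fd, "w") as f:
        f.writelines(f"{key}\n" for key in sorted_keys)
    return path


def read_run(path):
    """Stream the keys of a spilled run back from disk."""
    with open(path) as f:
        for line in f:
            yield int(line)


def external_sort(keys, max_in_memory):
    """Yield keys in ascending order, holding at most max_in_memory unsorted keys."""
    paths, buf = [], []
    try:
        for key in keys:
            buf.append(key)
            if len(buf) >= max_in_memory:
                paths.append(spill_run(sorted(buf)))  # spill a full run
                buf = []
        # Merge the in-memory remainder with all spilled runs.
        yield from heapq.merge(sorted(buf), *(read_run(p) for p in paths))
    finally:
        for p in paths:
            os.remove(p)


spilled_result = list(external_sort([5, 3, 9, 1, 7, 3, 8, 2], max_in_memory=3))
```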

Issue Links

causes: MCOL-5804 Disk-based ORDER BY (Open)

is part of: MCOL-4343 umbrella for tech debt issues (Open)

People

Assignee: Roman

Reporter: Roman

Votes: 0

Watchers: 5

Dates

Created: 2020-02-04 15:44

Updated: 2024-10-03 15:53
