[MCOL-2105] Improve disk join behavior for unfortunate data distributions

Details

Type: Task
Status: Closed
Priority: Major
Resolution: Won't Do
Affects Version/s: 1.0.15, 1.1.6, 1.2.2
Fix Version/s: Icebox
Component/s: N/A
Labels: None

Description

We've recently run into a couple of cases where the disk join was triggered, but so many rows shared a single value in the join column that one partition overflowed during the loading stage. We currently reject the query when that happens, but it should be easy to handle as a special case.

For reference, the disk join algorithm is called 'GRACE'; you can google 'grace join algorithm' to get a better understanding of what it's doing.
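
As a rough illustration of the general GRACE idea (not the code in joinpartition.cpp), here is a minimal in-memory sketch. The names `Row`, `partitionByKey`, and `graceJoin` are hypothetical, and vectors stand in for the on-disk partition files a real implementation would use.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical row type: a join key plus an opaque payload.
struct Row {
    std::int64_t key;
    int payload;
};

using Partition = std::vector<Row>;

// Phase 1: scatter rows into partitions by a hash of the join key.
// In a real GRACE join each partition would be a file on disk.
std::vector<Partition> partitionByKey(const std::vector<Row>& rows,
                                      std::size_t numPartitions) {
    assert(numPartitions > 0);
    std::vector<Partition> parts(numPartitions);
    std::hash<std::int64_t> hasher;
    for (const Row& r : rows) {
        parts[hasher(r.key) % numPartitions].push_back(r);
    }
    return parts;
}

// Phase 2: join partition p of one input against partition p of the other,
// building a hash table on the small side one partition at a time.
// Returns (small payload, large payload) pairs for every key match.
std::vector<std::pair<int, int>> graceJoin(const std::vector<Row>& small,
                                           const std::vector<Row>& large,
                                           std::size_t numPartitions) {
    std::vector<Partition> smallParts = partitionByKey(small, numPartitions);
    std::vector<Partition> largeParts = partitionByKey(large, numPartitions);
    std::vector<std::pair<int, int>> out;
    for (std::size_t p = 0; p < numPartitions; ++p) {
        std::unordered_multimap<std::int64_t, int> table;
        for (const Row& r : smallParts[p]) {
            table.emplace(r.key, r.payload);
        }
        for (const Row& r : largeParts[p]) {
            auto range = table.equal_range(r.key);
            for (auto it = range.first; it != range.second; ++it) {
                out.emplace_back(it->second, r.payload);
            }
        }
    }
    return out;
}
```

Because the partition is chosen from the join key alone, every row with the same key lands in the same partition. Adding more partitions therefore cannot shrink a partition that is dominated by a single value, which is the situation this issue describes.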

To find where the code detects the data distribution problem, grep for 'ERR_DBJ_DATA_DISTRIBUTION' in joinpartition.cpp.

My initial thoughts:
1) We can let partitions with a very small range of join values grow unbounded, up to the total disk usage limit.
2) Set a flag on such partitions to indicate they shouldn't be loaded into a hash table when the join phase starts.
3) Instead, it should be possible to stream those rows from disk when a row in the 'other' table matches, effectively doing a nested loop join instead of a hash join for those partitions.
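
The three steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: `SkewAwarePartition`, `add`, `joinPartition`, and the `memLimitRows` / `maxDistinctKeys` thresholds are hypothetical names, a vector stands in for the rows spilled to disk, and the check against the total disk usage limit is left out. It does not reflect the actual partitioning structures in joinpartition.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical row type: a join key plus an opaque payload.
struct Row {
    std::int64_t key;
    int payload;
};

// Hypothetical stand-in for one on-disk partition of the small-side table.
struct SkewAwarePartition {
    std::vector<Row> rows;        // stands in for rows written to a file
    std::set<std::int64_t> keys;  // distinct join values (tracked up to a cap)
    bool streamOnly = false;      // step 2: never load into a hash table

    // Step 1: once the partition exceeds its in-memory budget, let it keep
    // growing if it holds only a few distinct join values. Returns false when
    // the key range is wide, so the caller can keep the existing behavior.
    bool add(const Row& r, std::size_t memLimitRows, std::size_t maxDistinctKeys) {
        rows.push_back(r);
        if (keys.size() <= maxDistinctKeys) {
            keys.insert(r.key);  // capped so the set itself stays small
        }
        if (rows.size() > memLimitRows) {
            if (keys.size() > maxDistinctKeys) {
                streamOnly = false;
                return false;
            }
            streamOnly = true;
        }
        return true;
    }
};

// Join one partition against the matching rows of the 'other' table.
// Returns (partition payload, other payload) pairs.
std::vector<std::pair<int, int>> joinPartition(const SkewAwarePartition& part,
                                               const std::vector<Row>& otherRows) {
    std::vector<std::pair<int, int>> out;
    if (part.streamOnly) {
        // Step 3: nested loop. Scan the partition only for rows of the other
        // table whose key is actually present in it.
        for (const Row& o : otherRows) {
            if (part.keys.count(o.key) == 0) {
                continue;
            }
            for (const Row& r : part.rows) {  // would be a sequential disk read
                if (r.key == o.key) {
                    out.emplace_back(r.payload, o.payload);
                }
            }
        }
        return out;
    }
    // Regular path: build a hash table on the partition and probe it.
    std::unordered_multimap<std::int64_t, int> table;
    for (const Row& r : part.rows) {
        table.emplace(r.key, r.payload);
    }
    for (const Row& o : otherRows) {
        auto range = table.equal_range(o.key);
        for (auto it = range.first; it != range.second; ++it) {
            out.emplace_back(it->second, o.payload);
        }
    }
    return out;
}
```

In this sketch the small key set doubles as a cheap filter, so the flagged partition is rescanned only for rows of the other table that can actually match. How the real code would decide that a partition has "a very small range of join values" is an open design question that the sketch answers only with a placeholder threshold.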

It'll take some understanding of the partitioning structures & behavior; whoever gets this one, feel free to ask me.

Issue Links

is part of: MCOL-4343 umbrella for tech debt issues (Open)

People

Assignee: Unassigned

Reporter: Patrick LeBlanc (Inactive)

Votes: 0

Watchers: 3

Dates

Created: 2019-01-24 14:22

Updated: 2024-07-08 02:31

Resolved: 2023-07-02 02:49
