MariaDB ColumnStore / MCOL-5758

Reduce the computations in JOINS by simpler Bloom-filter-based pre-joins

Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved

    Description

      Joins are heavy algorithms, in both computation and memory use. They must hold a substantial amount of data in memory and perform hashing and other operations on that data. Joins can overflow memory limits, and balancing memory use against performance is tricky. We therefore have to filter the data going into joins as much as possible. ColumnStore already does good work in that regard, pushing WHERE filters below joins. This task continues that work: it adds Bloom filter operations that approximate the JOIN result and then performs a secondary read so that the join is fed only data that is highly likely to be used in it.

      I did an experiment on query 96 from TPC-DS.

      The query:

      select top 100 count(*)
      from store_sales
          ,household_demographics
          ,time_dim, store
      where ss_sold_time_sk = time_dim.t_time_sk
          and ss_hdemo_sk = household_demographics.hd_demo_sk
          and ss_store_sk = s_store_sk
          and time_dim.t_hour = 8
          and time_dim.t_minute >= 30
          and household_demographics.hd_dep_count = 5
          and store.s_store_name = 'ese'
      order by count(*)
      ;
      

      The idea is to perform a double scan.

      First, read all the columns needed, filter the data by the constant filters, and compute compatible Bloom filters (equal parameters: expected number of elements, number of hashes, and filter size) for the columns that will be used in the JOIN. As one example, we have a join on ss_sold_time_sk = time_dim.t_time_sk, so for column ss_sold_time_sk we compute a Bloom filter that is compatible with the one for time_dim.t_time_sk. The same is done for the two other join equalities, but the parameters of those filters can differ.
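      As a toy sketch of what "compatible" means here (Python, with hypothetical key values and helper names; this is not ColumnStore's implementation), two filters built with the same size and hash count over the two sides of a join equality:

```python
import hashlib

def bit_positions(value, size_bits, num_hashes):
    # Double hashing: derive num_hashes bit positions from one digest.
    d = hashlib.sha256(str(value).encode()).digest()
    h1 = int.from_bytes(d[:8], "little")
    h2 = int.from_bytes(d[8:16], "little") | 1  # force an odd step
    return [(h1 + i * h2) % size_bits for i in range(num_hashes)]

def build_bloom(values, size_bits, num_hashes):
    # The filter is a plain bit set; two filters are compatible iff they
    # were built with the same size_bits and num_hashes.
    bits = 0
    for v in values:
        for p in bit_positions(v, size_bits, num_hashes):
            bits |= 1 << p
    return bits

SIZE, K = 1 << 12, 4  # shared parameters make the filters intersectable

# Hypothetical key values: ss_sold_time_sk as seen in store_sales, and
# t_time_sk as seen in time_dim after the constant t_hour/t_minute filters.
fact_bf = build_bloom([30600, 30660, 50000, 70000], SIZE, K)
dim_bf  = build_bloom([30600, 30660, 30720], SIZE, K)
```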

      Then, after these filters are computed, we compute their JOIN approximation by performing an intersection operation (bitwise AND) on the Bloom filters for ss_sold_time_sk and time_dim.t_time_sk. This JOIN approximation filter will not pass through values that will most probably not produce a match (the false-positive rate is controllable).
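      A self-contained toy illustration of the intersection step (the filter representation and key values are hypothetical):

```python
import hashlib

def bit_positions(value, size_bits, num_hashes):
    # Double hashing: derive num_hashes bit positions from one digest.
    d = hashlib.sha256(str(value).encode()).digest()
    h1 = int.from_bytes(d[:8], "little")
    h2 = int.from_bytes(d[8:16], "little") | 1
    return [(h1 + i * h2) % size_bits for i in range(num_hashes)]

def build_bloom(values, size_bits, num_hashes):
    bits = 0
    for v in values:
        for p in bit_positions(v, size_bits, num_hashes):
            bits |= 1 << p
    return bits

def may_match(value, bf, size_bits, num_hashes):
    # Bloom probe: never misses a real match; false positives are possible,
    # with a rate controlled by size_bits and num_hashes.
    return all(bf >> p & 1 for p in bit_positions(value, size_bits, num_hashes))

SIZE, K = 1 << 12, 4  # identical parameters on both sides
fact_bf = build_bloom([30600, 30660, 50000], SIZE, K)  # ss_sold_time_sk side
dim_bf  = build_bloom([30600, 30720], SIZE, K)         # t_time_sk side

# The JOIN approximation is a plain bitwise AND of the two filters.
join_approx = fact_bf & dim_bf
```

      Only key 30600 appears on both sides, so it passes the probe against join_approx, while keys unique to one side are rejected with high probability.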

      Then we perform a second read and filtering of the same columns, now adding an "is in JOIN approximation" predicate to the list of predicates. The result of this second scan goes into the actual JOIN algorithm, but the amount of data is expected to be significantly smaller.
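      The second scan can be sketched as an ordinary predicate chain with one extra membership probe against the approximation filter (again a toy, self-contained Python model with illustrative names and values):

```python
import hashlib

def bit_positions(value, size_bits, num_hashes):
    d = hashlib.sha256(str(value).encode()).digest()
    h1 = int.from_bytes(d[:8], "little")
    h2 = int.from_bytes(d[8:16], "little") | 1
    return [(h1 + i * h2) % size_bits for i in range(num_hashes)]

def build_bloom(values, size_bits, num_hashes):
    bits = 0
    for v in values:
        for p in bit_positions(v, size_bits, num_hashes):
            bits |= 1 << p
    return bits

def may_match(value, bf, size_bits, num_hashes):
    return all(bf >> p & 1 for p in bit_positions(value, size_bits, num_hashes))

SIZE, K = 1 << 12, 4
# join_approx stands in for the AND of the fact- and dimension-side filters.
join_approx = (build_bloom([30600, 30660, 50000], SIZE, K)
               & build_bloom([30600], SIZE, K))

# Second scan: the membership probe acts as one more predicate, so only
# rows that can possibly match reach the real join algorithm.
fact_rows = [{"ss_sold_time_sk": k} for k in (30600, 30660, 50000, 70000)]
survivors = [r for r in fact_rows
             if may_match(r["ss_sold_time_sk"], join_approx, SIZE, K)]
```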

      This trick, performed on the 1G TPC-DS query 96 data, shows that of the 2.88 million rows in the store_sales table only 949 rows pass all filters with join approximations (there are three, combined by AND), or 0.033%. As the store_sales table has no WHERE clause filters other than the joins, all of its rows have to be read in our current implementation.
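      A quick arithmetic check of the quoted selectivity (row counts taken from the experiment above; 2.88 million is approximate):

```python
# Row counts quoted from the 1G TPC-DS query 96 experiment.
total_rows = 2_880_000   # ~2.88 million rows in store_sales
passing = 949            # rows surviving all three join approximations
selectivity = passing / total_rows * 100  # as a percentage
```

      which gives roughly 0.033%, matching the figure above.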

      The comments contain the plan discussion, and the linked tasks will constitute the plan. The approach is to write as little code as possible, reusing what we already have.


People

    Assignee: Sergey Zefirov (sergey.zefirov)
    Reporter: Sergey Zefirov (sergey.zefirov)
