MariaDB ColumnStore / MCOL-5758

Reduce the computations in JOINS by simpler Bloom-filter-based pre-joins

Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved

    Description

      Joins are heavy algorithms, in both computation and memory use. They must hold a substantial amount of data in memory and perform hashing and other operations on that data. Joins can overflow memory limits, and balancing memory use against performance is tricky. We therefore have to filter the data going into joins as much as possible. ColumnStore already does good work in that regard, pushing WHERE filters below joins. This task continues that work: it adds Bloom filter operations that approximate the JOIN result and then performs a secondary read so that the join is fed only data that is highly likely to be used in it.

      I did an experiment on query 96 from TPC-DS.

      The query:

      select top 100 count(*)
      from store_sales
          ,household_demographics
          ,time_dim, store
      where ss_sold_time_sk = time_dim.t_time_sk
          and ss_hdemo_sk = household_demographics.hd_demo_sk
          and ss_store_sk = s_store_sk
          and time_dim.t_hour = 8
          and time_dim.t_minute >= 30
          and household_demographics.hd_dep_count = 5
          and store.s_store_name = 'ese'
      order by count(*)
      ;
      

      The idea is to perform a double scan.

      First, read all the columns needed, filter the data by the constant filters, and compute compatible Bloom filters (equal parameters: expected number of elements, number of hashes, and filter size) for the columns that will be used in the JOIN. As one example, we have a join on ss_sold_time_sk = time_dim.t_time_sk, so for column ss_sold_time_sk we compute a Bloom filter that is compatible with the one for time_dim.t_time_sk. The same is done for the two other join equalities, but the parameters of those filters can differ.
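      As a toy sketch of what "compatible" means here (Python, with hypothetical key values and helper names; this is not ColumnStore's implementation), two filters built with the same size and hash count over the two sides of a join equality:

```python
import hashlib

def bit_positions(value, size_bits, num_hashes):
    # Double hashing: derive num_hashes bit positions from one digest.
    d = hashlib.sha256(str(value).encode()).digest()
    h1 = int.from_bytes(d[:8], "little")
    h2 = int.from_bytes(d[8:16], "little") | 1  # force an odd step
    return [(h1 + i * h2) % size_bits for i in range(num_hashes)]

def build_bloom(values, size_bits, num_hashes):
    # The filter is a plain bit set; two filters are compatible iff they
    # were built with the same size_bits and num_hashes.
    bits = 0
    for v in values:
        for p in bit_positions(v, size_bits, num_hashes):
            bits |= 1 << p
    return bits

SIZE, K = 1 << 12, 4  # shared parameters make the filters intersectable

# Hypothetical key values: ss_sold_time_sk as seen in store_sales, and
# t_time_sk as seen in time_dim after the constant t_hour/t_minute filters.
fact_bf = build_bloom([30600, 30660, 50000, 70000], SIZE, K)
dim_bf  = build_bloom([30600, 30660, 30720], SIZE, K)
```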

      Then, after these filters are computed, we compute their JOIN approximation by performing an intersection operation (bitwise AND) on the Bloom filters for ss_sold_time_sk and time_dim.t_time_sk. This JOIN approximation filter will not pass through values that will most probably not produce a match (the false-positive rate is controllable).
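      A self-contained toy illustration of the intersection step (the filter representation and key values are hypothetical):

```python
import hashlib

def bit_positions(value, size_bits, num_hashes):
    # Double hashing: derive num_hashes bit positions from one digest.
    d = hashlib.sha256(str(value).encode()).digest()
    h1 = int.from_bytes(d[:8], "little")
    h2 = int.from_bytes(d[8:16], "little") | 1
    return [(h1 + i * h2) % size_bits for i in range(num_hashes)]

def build_bloom(values, size_bits, num_hashes):
    bits = 0
    for v in values:
        for p in bit_positions(v, size_bits, num_hashes):
            bits |= 1 << p
    return bits

def may_match(value, bf, size_bits, num_hashes):
    # Bloom probe: never misses a real match; false positives are possible,
    # with a rate controlled by size_bits and num_hashes.
    return all(bf >> p & 1 for p in bit_positions(value, size_bits, num_hashes))

SIZE, K = 1 << 12, 4  # identical parameters on both sides
fact_bf = build_bloom([30600, 30660, 50000], SIZE, K)  # ss_sold_time_sk side
dim_bf  = build_bloom([30600, 30720], SIZE, K)         # t_time_sk side

# The JOIN approximation is a plain bitwise AND of the two filters.
join_approx = fact_bf & dim_bf
```

      Only key 30600 appears on both sides, so it passes the probe against join_approx, while keys unique to one side are rejected with high probability.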

      Then we perform a second read and filtering of the same columns, now adding an "is in JOIN approximation" predicate to the list of predicates. The result of this second scan goes into the actual JOIN algorithm, but the amount of data is expected to be significantly smaller.
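      The second scan can be sketched as an ordinary predicate chain with one extra membership probe against the approximation filter (again a toy, self-contained Python model with illustrative names and values):

```python
import hashlib

def bit_positions(value, size_bits, num_hashes):
    d = hashlib.sha256(str(value).encode()).digest()
    h1 = int.from_bytes(d[:8], "little")
    h2 = int.from_bytes(d[8:16], "little") | 1
    return [(h1 + i * h2) % size_bits for i in range(num_hashes)]

def build_bloom(values, size_bits, num_hashes):
    bits = 0
    for v in values:
        for p in bit_positions(v, size_bits, num_hashes):
            bits |= 1 << p
    return bits

def may_match(value, bf, size_bits, num_hashes):
    return all(bf >> p & 1 for p in bit_positions(value, size_bits, num_hashes))

SIZE, K = 1 << 12, 4
# join_approx stands in for the AND of the fact- and dimension-side filters.
join_approx = (build_bloom([30600, 30660, 50000], SIZE, K)
               & build_bloom([30600], SIZE, K))

# Second scan: the membership probe acts as one more predicate, so only
# rows that can possibly match reach the real join algorithm.
fact_rows = [{"ss_sold_time_sk": k} for k in (30600, 30660, 50000, 70000)]
survivors = [r for r in fact_rows
             if may_match(r["ss_sold_time_sk"], join_approx, SIZE, K)]
```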

      This trick, performed on the 1G TPC-DS query 96 data, shows that of the 2.88 million rows in the store_sales table only 949 rows pass all filters with join approximations (there are three, combined by AND), or 0.033%. As the store_sales table has no WHERE clause filters other than the joins, all of its rows have to be read in our current implementation.
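      A quick arithmetic check of the quoted selectivity (row counts taken from the experiment above; 2.88 million is approximate):

```python
# Row counts quoted from the 1G TPC-DS query 96 experiment.
total_rows = 2_880_000   # ~2.88 million rows in store_sales
passing = 949            # rows surviving all three join approximations
selectivity = passing / total_rows * 100  # as a percentage
```

      which gives roughly 0.033%, matching the figure above.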

      The comments contain the plan discussion, and the linked tasks will constitute the plan. The approach is to write as little code as possible, reusing what we already have.


People

    Assignee: Sergey Zefirov (sergey.zefirov)
    Reporter: Sergey Zefirov (sergey.zefirov)
