Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-9526

Compute Aggregate functions as window functions

Details

    • 10.2.0-7, 10.2.12

    Description

      One can use regular aggregate functions as window function. Something like

       SUM(col2) OVER (PARTITION BY col1 ORDER BY col2 
                       ROWS BETWEEN 10 PRECEDING AND 10 FOLLOWING)
      

      Task #1: Window frames

      When the window frame moves, we have

      • A position where rows enter the fame
      • A position where rows leave the frame
      • A position where the current row is.

      That is, we need to maintain three positions.
      Note that we're scanning the file by scanning filesort's result (a sequence of rowids). That is, these three positions are positions in filesort's result.

      Task# 2: Handle aggregate functions that support removal

      The difference from computing an aggregate function in GROUP BY context is the use of window frame. Rows enter the window frame (this is not new) and leave the window frame (this is new).

      Aggregate functions have code to update the function value after a row has been added. There is no code to update the function value after a row has been removed. Some functions can update their value after a row has been removed. These are

      • SUM
      • COUNT (need to check for NULL and then remove)
      • AVG (this is SUM and COUNT)
      • BIT_OR (just make every bit a counter)
      • BIT_AND (can be reduced to BIT_OR)
      • BIT_XOR (removing the value is the same as adding it another time)

      Aggregates that maybe support removal #1 - numeric

      I am not sure if it's possible to use removal

      • VAR_POP() (Implemented without removal so far)
      • VARIANCE() (Implemented without removal so far)
      • VAR_SAMP() (Implemented without removal so far)
        These use a recursive formula to update the current value. It is possible reverse the computation and do the removal, but I am not sure whether this computation is numerically stable.

      These functions are just SQRT() of their VARIANCE() counterpart:

      • STD() (Implemented without removal so far)
      • STDDEV() (Implemented without removal so far)
      • STDDEV_POP() (Implemented without removal so far)
      • STDDEV_SAMP() (Implemented without removal so far)

      Aggregates that maybe support removal #2

      • GROUP_CONCAT()

      The result is accumulated in String Item_func_group_concat::result. Generally, this allows for removal.
      There is a problem with @@group_concat_max_len. When the length of the string exceeds this value,
      then a warning is produced, and the new value IS NOT ADDED to the string. Thus, the information about the new value is lost, and we will not be able to produce a correct result after remove() frees space.

      Idea from Igor: if we run out of space at the end of the buffer, start re-using free space at the front. So that the buffer is kind-of circular.

      Task #3: Aggregate functions that don't support value removal

      • MIN() (Implemented without removal)
      • MAX() (Implemented without removal)

      These functions do not support removal. (If we remove the value that is equal to min, what should be the next value?). Possible options
      1. Use a custom solution that works for just MIN and MAX (something like a priority queue that spills to a Unique object?)
      2. Use a generic O(N^2) algorithm.

      Attachments

        Issue Links

          Activity

            Some info about window frames:

            There are two kinds, ROWS-based and RANGE-based.

            ROWS-based frame counts physical rows. CURRENT ROW means the current row, $n PRECEDING or $n FOLLOWING means $n preceding or following rows in the partition.

            RANGE-based frame counts values. CURRENT ROW means current row, as well as all rows that sort together with the current row.

            psergei Sergei Petrunia added a comment - Some info about window frames: There are two kinds, ROWS-based and RANGE-based. ROWS-based frame counts physical rows. CURRENT ROW means the current row, $n PRECEDING or $n FOLLOWING means $n preceding or following rows in the partition. RANGE-based frame counts values. CURRENT ROW means current row, as well as all rows that sort together with the current row.

            Working on getting COUNT() to work on ROWS-based frames.

            COUNT() is a simple aggregate function that supports value removal.

            Using this : https://gist.github.com/spetrunia/518c6ff06bd413ab202d for cursor cloning.

            psergei Sergei Petrunia added a comment - Working on getting COUNT() to work on ROWS-based frames. COUNT() is a simple aggregate function that supports value removal. Using this : https://gist.github.com/spetrunia/518c6ff06bd413ab202d for cursor cloning.
            psergei Sergei Petrunia added a comment - - edited

            On the call yesterday, it was discussed that we only need to check for partition bound when advancing the first cursor (this is always the frame start cursor).
            The question is, how to synchronize that with other cursors? If the cursor is "CURRENT ROW", it's trival. "ROWS $n FOLLOWING" is always $n rows ahead,
            "ROWS UNBOUNDED FOLLOWING" immediately moves to the frame end and stays there.

            Maybe, we could count number of rows in the partition. The first cursor that reaches partition end sets the number. Other cursors can only check if #rows_seen_in_partition == total_rows_in_partition.

            psergei Sergei Petrunia added a comment - - edited On the call yesterday, it was discussed that we only need to check for partition bound when advancing the first cursor (this is always the frame start cursor). The question is, how to synchronize that with other cursors? If the cursor is "CURRENT ROW", it's trival. "ROWS $n FOLLOWING" is always $n rows ahead, "ROWS UNBOUNDED FOLLOWING" immediately moves to the frame end and stays there. Maybe, we could count number of rows in the partition. The first cursor that reaches partition end sets the number. Other cursors can only check if #rows_seen_in_partition == total_rows_in_partition.

            People

              cvicentiu Vicențiu Ciorbaru
              psergei Sergei Petrunia
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.