[MDEV-9526] Compute Aggregate functions as window functions Created: 2016-02-06  Updated: 2019-04-11  Resolved: 2019-04-11

Status: Closed
Project: MariaDB Server
Component/s: Optimizer - Window functions
Fix Version/s: 10.2.12

Type: Task Priority: Major
Reporter: Sergei Petrunia Assignee: Vicențiu Ciorbaru
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
PartOf
includes MDEV-9634 Window function produces incorrect value Closed
is part of MDEV-6115 window functions as in the SQL standard Closed
Relates
relates to MDEV-9676 RANGE-type frames for window functions Closed
Sprint: 10.2.0-7, 10.2.12

 Description   

One can use regular aggregate functions as window function. Something like

 SUM(col2) OVER (PARTITION BY col1 ORDER BY col2 
                 ROWS BETWEEN 10 PRECEDING AND 10 FOLLOWING)

Task #1: Window frames

When the window frame moves, we have

  • A position where rows enter the fame
  • A position where rows leave the frame
  • A position where the current row is.

That is, we need to maintain three positions.
Note that we're scanning the file by scanning filesort's result (a sequence of rowids). That is, these three positions are positions in filesort's result.

Task# 2: Handle aggregate functions that support removal

The difference from computing an aggregate function in GROUP BY context is the use of window frame. Rows enter the window frame (this is not new) and leave the window frame (this is new).

Aggregate functions have code to update the function value after a row has been added. There is no code to update the function value after a row has been removed. Some functions can update their value after a row has been removed. These are

  • SUM
  • COUNT (need to check for NULL and then remove)
  • AVG (this is SUM and COUNT)
  • BIT_OR (just make every bit a counter)
  • BIT_AND (can be reduced to BIT_OR)
  • BIT_XOR (removing the value is the same as adding it another time)

Aggregates that maybe support removal #1 - numeric

I am not sure if it's possible to use removal

  • VAR_POP() (Implemented without removal so far)
  • VARIANCE() (Implemented without removal so far)
  • VAR_SAMP() (Implemented without removal so far)
    These use a recursive formula to update the current value. It is possible reverse the computation and do the removal, but I am not sure whether this computation is numerically stable.

These functions are just SQRT() of their VARIANCE() counterpart:

  • STD() (Implemented without removal so far)
  • STDDEV() (Implemented without removal so far)
  • STDDEV_POP() (Implemented without removal so far)
  • STDDEV_SAMP() (Implemented without removal so far)

Aggregates that maybe support removal #2

  • GROUP_CONCAT()

The result is accumulated in String Item_func_group_concat::result. Generally, this allows for removal.
There is a problem with @@group_concat_max_len. When the length of the string exceeds this value,
then a warning is produced, and the new value IS NOT ADDED to the string. Thus, the information about the new value is lost, and we will not be able to produce a correct result after remove() frees space.

Idea from Igor: if we run out of space at the end of the buffer, start re-using free space at the front. So that the buffer is kind-of circular.

Task #3: Aggregate functions that don't support value removal

  • MIN() (Implemented without removal)
  • MAX() (Implemented without removal)

These functions do not support removal. (If we remove the value that is equal to min, what should be the next value?). Possible options
1. Use a custom solution that works for just MIN and MAX (something like a priority queue that spills to a Unique object?)
2. Use a generic O(N^2) algorithm.



 Comments   
Comment by Sergei Petrunia [ 2016-02-09 ]

Some info about window frames:

There are two kinds, ROWS-based and RANGE-based.

ROWS-based frame counts physical rows. CURRENT ROW means the current row, $n PRECEDING or $n FOLLOWING means $n preceding or following rows in the partition.

RANGE-based frame counts values. CURRENT ROW means current row, as well as all rows that sort together with the current row.

Comment by Sergei Petrunia [ 2016-02-17 ]

Working on getting COUNT() to work on ROWS-based frames.

COUNT() is a simple aggregate function that supports value removal.

Using this : https://gist.github.com/spetrunia/518c6ff06bd413ab202d for cursor cloning.

Comment by Sergei Petrunia [ 2016-02-17 ]

On the call yesterday, it was discussed that we only need to check for partition bound when advancing the first cursor (this is always the frame start cursor).
The question is, how to synchronize that with other cursors? If the cursor is "CURRENT ROW", it's trival. "ROWS $n FOLLOWING" is always $n rows ahead,
"ROWS UNBOUNDED FOLLOWING" immediately moves to the frame end and stays there.

Maybe, we could count number of rows in the partition. The first cursor that reaches partition end sets the number. Other cursors can only check if #rows_seen_in_partition == total_rows_in_partition.

Generated at Thu Feb 08 07:35:20 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.