Details

Type: Task
Status: Closed (View Workflow)
Priority: Major
Resolution: Fixed
Fix Version/s: 10.2.12
Component/s: Optimizer - Window functions
Labels:
None

Sprint:
10.2.0-7, 10.2.12

Description

One can use regular aggregate functions as window function. Something like

 SUM(col2) OVER (PARTITION BY col1 ORDER BY col2

                 ROWS BETWEEN 10 PRECEDING AND 10 FOLLOWING)

Task #1: Window frames

When the window frame moves, we have

A position where rows enter the fame
A position where rows leave the frame
A position where the current row is.

That is, we need to maintain three positions.
Note that we're scanning the file by scanning filesort's result (a sequence of rowids). That is, these three positions are positions in filesort's result.

Task# 2: Handle aggregate functions that support removal

The difference from computing an aggregate function in GROUP BY context is the use of window frame. Rows enter the window frame (this is not new) and leave the window frame (this is new).

Aggregate functions have code to update the function value after a row has been added. There is no code to update the function value after a row has been removed. Some functions can update their value after a row has been removed. These are

~~SUM~~
~~COUNT (need to check for NULL and then remove)~~
~~AVG (this is SUM and COUNT)~~
~~BIT_OR (just make every bit a counter)~~
~~BIT_AND (can be reduced to BIT_OR)~~
~~BIT_XOR (removing the value is the same as adding it another time)~~

Aggregates that maybe support removal #1 - numeric

I am not sure if it's possible to use removal

~~VAR_POP()~~ (Implemented without removal so far)
~~VARIANCE()~~ (Implemented without removal so far)
~~VAR_SAMP()~~ (Implemented without removal so far)
These use a recursive formula to update the current value. It is possible reverse the computation and do the removal, but I am not sure whether this computation is numerically stable.

These functions are just SQRT() of their VARIANCE() counterpart:

~~STD()~~ (Implemented without removal so far)
~~STDDEV()~~ (Implemented without removal so far)
~~STDDEV_POP()~~ (Implemented without removal so far)
~~STDDEV_SAMP()~~ (Implemented without removal so far)

Aggregates that maybe support removal #2

GROUP_CONCAT()

The result is accumulated in String Item_func_group_concat::result. Generally, this allows for removal.
There is a problem with @@group_concat_max_len. When the length of the string exceeds this value,
then a warning is produced, and the new value IS NOT ADDED to the string. Thus, the information about the new value is lost, and we will not be able to produce a correct result after remove() frees space.

Idea from Igor: if we run out of space at the end of the buffer, start re-using free space at the front. So that the buffer is kind-of circular.

Task #3: Aggregate functions that don't support value removal

~~MIN()~~ (Implemented without removal)
~~MAX()~~ (Implemented without removal)

These functions do not support removal. (If we remove the value that is equal to min, what should be the next value?). Possible options
1. Use a custom solution that works for just MIN and MAX (something like a priority queue that spills to a Unique object?)
2. Use a generic O(N^2) algorithm.

Attachments

Issue Links

includes

MDEV-9634 Window function produces incorrect value

Closed

is part of

MDEV-6115 window functions as in the SQL standard

Closed

relates to

MDEV-9676 RANGE-type frames for window functions

Closed

Activity

Ascending order - Click to sort in descending order

Sergei Petrunia added a comment - 2016-02-09 21:57

Some info about window frames:

There are two kinds, ROWS-based and RANGE-based.

ROWS-based frame counts physical rows. CURRENT ROW means the current row, $n PRECEDING or $n FOLLOWING means $n preceding or following rows in the partition.

RANGE-based frame counts values. CURRENT ROW means current row, as well as all rows that sort together with the current row.

Sergei Petrunia added a comment - 2016-02-09 21:57 Some info about window frames: There are two kinds, ROWS-based and RANGE-based. ROWS-based frame counts physical rows. CURRENT ROW means the current row, $n PRECEDING or $n FOLLOWING means $n preceding or following rows in the partition. RANGE-based frame counts values. CURRENT ROW means current row, as well as all rows that sort together with the current row.

Sergei Petrunia added a comment - 2016-02-17 18:06

Working on getting COUNT() to work on ROWS-based frames.

COUNT() is a simple aggregate function that supports value removal.

Using this : https://gist.github.com/spetrunia/518c6ff06bd413ab202d for cursor cloning.

Sergei Petrunia added a comment - 2016-02-17 18:06 Working on getting COUNT() to work on ROWS-based frames. COUNT() is a simple aggregate function that supports value removal. Using this : https://gist.github.com/spetrunia/518c6ff06bd413ab202d for cursor cloning.

Sergei Petrunia added a comment - 2016-02-17 18:11 - edited

On the call yesterday, it was discussed that we only need to check for partition bound when advancing the first cursor (this is always the frame start cursor).
The question is, how to synchronize that with other cursors? If the cursor is "CURRENT ROW", it's trival. "ROWS $n FOLLOWING" is always $n rows ahead,
"ROWS UNBOUNDED FOLLOWING" immediately moves to the frame end and stays there.

Maybe, we could count number of rows in the partition. The first cursor that reaches partition end sets the number. Other cursors can only check if #rows_seen_in_partition == total_rows_in_partition.

Sergei Petrunia added a comment - 2016-02-17 18:11 - edited On the call yesterday, it was discussed that we only need to check for partition bound when advancing the first cursor (this is always the frame start cursor). The question is, how to synchronize that with other cursors? If the cursor is "CURRENT ROW", it's trival. "ROWS $n FOLLOWING" is always $n rows ahead, "ROWS UNBOUNDED FOLLOWING" immediately moves to the frame end and stays there. Maybe, we could count number of rows in the partition. The first cursor that reaches partition end sets the number. Other cursors can only check if #rows_seen_in_partition == total_rows_in_partition.

People

Assignee:: Vicențiu Ciorbaru

Reporter:: Sergei Petrunia

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 2016-02-06 23:54

Updated:: 2019-04-11 12:29

Resolved:: 2019-04-11 12:29

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server

Compute Aggregate functions as window functions